In this article, we'll take a closer look at designing access patterns using the single-table strategy. This has been a hot topic in the industry over the last few years because it positions your application to use NoSQL tables in a way that promotes cost savings and fast read/write access. By identifying your access patterns up front, you equip your application to perform at great speed and produce meaningful results with minimal overhead. Unlike traditional relational database (RDBMS) systems, NoSQL tables are key-value databases built for extremely high volumes of transactions, and they do not support relational queries. This means you won't find SQL JOINs when accessing a NoSQL table; joins are typically unsupported because they are expensive to perform at scale. In other words, there is a trade-off between using NoSQL and SQL tables. Popular NoSQL databases include MongoDB and DynamoDB.
Amazon DynamoDB is a fully managed proprietary NoSQL database service that supports key-value and document data structures, offered by Amazon.com as part of the Amazon Web Services portfolio. DynamoDB exposes a similar data model to, and derives its name from, Dynamo, but has a different underlying implementation. It is used in financial services, marketing and advertising, retail, higher education, hospitals and health, and telecom, among other industries.
SQL Normalization:
One of the biggest mental transitions from RDBMS to NoSQL is normalization. With relational databases, we typically want a high degree of normalization. This means we avoid duplicating data across any record, table, or row. We also ensure relational integrity using foreign keys and constraints, which keeps our data state accurate and avoids stale and orphaned records. Orphaned records are child rows whose parent no longer exists, leaving them effectively lost.
NoSQL Denormalization:
As mentioned before, using NoSQL comes with a trade-off: we give up normalized data in exchange for speed, avoiding slow I/O. The price we pay for quick reads and writes is denormalization. While normalization is a worthy goal, it does not scale very well. As a result, when designing NoSQL tables, you should flip this concept on its head entirely: the more denormalized the table, typically, the better. We want to combine multiple tables into a single table, even at the cost of duplicating records or repeating rows in some cases. Obviously, we should not do this blindly, and we should still avoid duplication where possible. This is where the single-table access pattern can help us understand the most valuable information to retrieve.
With the theory out of the way, let's take a look at a working example of taking a normalized set of relational tables and applying the single-table approach. We'll use a simple scenario: a contact list. As shown below, each contact can have multiple phone numbers and multiple addresses.
Currently, the Contact table has a foreign key (FK) that refers to a list of addresses. How would we model this relationship using NoSQL?
The answer is...you guessed it...denormalization. We take each record in the child table and combine it with the parent table, producing three records in this case.
With the denormalization out of the way, we now have a working table like this:
Since we no longer have a foreign relationship, we can drop the foreign keys. What we end up with at this point is a record repeated three times. The data is now at a point where we can start thinking about the different ways we would want to retrieve it.
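To make the merge concrete, here's a minimal Python sketch of the denormalization step. The column names and sample address values are hypothetical stand-ins for the tables above:

```python
# Hypothetical normalized source data: one contact with three addresses.
contact = {"contact_id": 1, "name": "Gustave Eiffel", "email": "gustave@example.com"}
addresses = [
    {"contact_id": 1, "street": "5th Avenue", "city": "Paris", "country": "France"},
    {"contact_id": 1, "street": "Rue Rabelais", "city": "Paris", "country": "France"},
    {"contact_id": 1, "street": "5th Avenue", "city": "New York", "country": "USA"},
]

def denormalize(contact, addresses):
    """Merge the parent contact into each child address row,
    producing one self-contained item per address."""
    items = []
    for addr in addresses:
        item = {**contact, **addr}
        item.pop("contact_id")  # no foreign keys needed in the single table
        items.append(item)
    return items

items = denormalize(contact, addresses)
# Three items, each repeating the contact's name and email.
```

Note that the contact's name and email are intentionally duplicated across all three items; that duplication is the cost of eliminating the join.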
The most important part of this process is to understand and identify the correct way to retrieve your data. Getting this wrong is costly, because NoSQL tables typically hold extremely large amounts of data, and changing your access pattern means re-indexing all of it. Alternatively, you could create a different way of querying your data, but with DynamoDB in particular, every index added to a table becomes a replica of that data: if you have two indexes, you are essentially storing two copies of your data. Adding more GSIs is not always a fix either, because DynamoDB caps the number of GSIs per table (the default quota is currently 20). Because of these limitations, you should think carefully about how your data is going to be retrieved. Let's review our example in more detail.
While in this scenario the right choices may seem obvious, be sure to take the time to go through this exercise; the choices may not be so obvious when it comes to query selection in a real application. Using the questioning method, we ask ourselves what data is most likely to be needed. This could come from requirements and acceptance criteria, or from our own judgment.
Questions to ask:
For the sake of this example, we'll walk through the process of creating the primary key (PK) and sort key (SK) from the information listed above.
PK: Gustave-Eiffel
SK: street

To combine several pieces of information, we can use a delimiter to denote the different components of the key, so that we can query based on whatever information is available. We'll use the '#' symbol, which has become common practice.
SK: street#
SK: street#city#country

That said, how do we deal with whitespace or special characters? A common convention is to replace them with hyphens.
PK: Gustave-Eiffel
SK: 5th-Avenue#Paris#France

What does this accomplish, though? Why would this be useful?
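As a sketch, key construction with the '#' delimiter and hyphenated values can be expressed as a pair of small helper functions (the names `slug` and `make_sk` are my own, not from any library):

```python
def slug(value):
    """Replace whitespace with hyphens so a value is safe inside a key."""
    return "-".join(value.split())

def make_sk(*parts):
    """Join whatever key components are available with the '#' delimiter."""
    return "#".join(slug(p) for p in parts)

pk = slug("Gustave Eiffel")                    # 'Gustave-Eiffel'
sk = make_sk("5th Avenue", "Paris", "France")  # '5th-Avenue#Paris#France'
```

Because `make_sk` accepts any number of components, the same helper builds the shorter prefixes (`street`, `street#city`) used below.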
PK: Gustave-Eiffel
SK: 5th-Avenue

This would return all contacts named Gustave Eiffel who live on a street named 5th Avenue. Note that this would also match individuals who live on other 5th Avenues around the world, such as the one in New York. Notice that we are using the same PK and SK columns to retrieve different results based on the data we supply to the query; the SK works together with the PK to identify the records we're interested in returning.
PK: Gustave-Eiffel
SK: 5th-Avenue#Paris
PK: Gustave-Eiffel
SK: 5th-Avenue#Paris#France

This query is the most specific because, unlike the previous permutations, it specifies the country, which reduces the possibility of more than one match.
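In DynamoDB, this narrowing is done with a Query operation using an exact match on the partition key and a begins_with condition on the sort key. A small in-memory Python sketch (the sample addresses are hypothetical) shows how each longer prefix narrows the result set:

```python
items = [
    {"PK": "Gustave-Eiffel", "SK": "5th-Avenue#Paris#France"},
    {"PK": "Gustave-Eiffel", "SK": "5th-Avenue#New-York#USA"},
    {"PK": "Gustave-Eiffel", "SK": "Rue-Rabelais#Paris#France"},
]

def query(items, pk, sk_prefix=""):
    """Mimic a DynamoDB Query: exact match on PK, begins_with on SK."""
    return [i for i in items if i["PK"] == pk and i["SK"].startswith(sk_prefix)]

len(query(items, "Gustave-Eiffel", "5th-Avenue"))               # 2: Paris and New York
len(query(items, "Gustave-Eiffel", "5th-Avenue#Paris"))         # 1: Paris only
len(query(items, "Gustave-Eiffel", "5th-Avenue#Paris#France"))  # 1: the exact match
```

Each extra component appended to the sort-key prefix filters out more items, which is exactly the behavior the hierarchy of the SK is designed to exploit.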
Another popular search method is to use the phone number associated with the contact. Ultimately, the application's requirements define the best access patterns. If we're building a call center application, you could argue we don't need the previous access pattern at all. If, however, a user is unable to locate a previous address in the system, we still need an alternative way to retrieve that user's information.
We have a decision to make: we can either add a GSI or store the information differently. Let's use the flexibility of the PK and SK to accomplish this without introducing a new GSI.
To search by phone number, we can use the number itself as the primary key in this context.
PK: 123456789

This would automatically narrow our search results to a very specific set of individuals.
SK: name#street#city#country

Here is the current state of our design and the information associated with it.
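One way to support both lookups without a GSI is to write the same contact twice, once per key shape. Here's a sketch with hypothetical data:

```python
# The same contact stored under two key shapes: one item for lookup by
# name, one for lookup by phone number. Both live in the same table
# because PK and SK are just generic strings.
items = [
    {"PK": "Gustave-Eiffel", "SK": "5th-Avenue#Paris#France", "phone": "123456789"},
    {"PK": "123456789", "SK": "Gustave-Eiffel#5th-Avenue#Paris#France"},
]

def by_phone(items, phone):
    """Lookup by phone number: the number is the partition key."""
    return [i for i in items if i["PK"] == phone]

by_phone(items, "123456789")  # returns the single phone-keyed item
```

The cost of this design is that both items must be written (and kept consistent) whenever the contact changes; the benefit is a second access pattern with no index replica.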
If you've been keeping an eye on the denormalization process, you'll notice something else: our data is now unstructured rather than a structured record set. This is another fundamental difference between the two types of databases. In an RDBMS, attributes are guaranteed to be defined on a record with constraints, whereas in NoSQL, if an attribute is not part of the PK or SK, nothing is guaranteed.
This comes with some advantages. With the naming conventions we've applied to the table above, we can persist different types of entities, thanks to the very nature of unstructured data, so long as each item satisfies the PK and SK requirement.
Another important note: because we've defined our PK and SK as generic types (notice we did not name the PK column "Name"), we can store different types of entities within the same table. We can even use different retrieval access patterns while still leveraging the same table. Again, we review the original questions we posed:
Because we've defined our Primary and Sorting Keys as generic types, we can reuse these columns to support storing more than one particular entity. We can repeat the same process for storing books as an example.
Doing this, we can define our Primary Key with the title of the book.
PK: Moby-Dick

Next, using the questioning method, we identify the most valuable information about a book. Typically, the author or publishing company ranks highest, followed by the edition, copy format, and language.
SK: author#edition#format#language

Let's take a look at our design with these different entities co-existing in the same table.
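As a sketch (the attribute values shown are hypothetical), contacts and books can coexist in the same table, and the same query shape serves both entities:

```python
table = [
    # Contact entity: name as PK, address components as SK.
    {"PK": "Gustave-Eiffel", "SK": "5th-Avenue#Paris#France",
     "email": "gustave@example.com"},
    # Book entity: title as PK, author#edition#format#language as SK.
    {"PK": "Moby-Dick", "SK": "Herman-Melville#1st#Hardcover#English",
     "publisher": "Harper"},
]

def query(table, pk, sk_prefix=""):
    """Exact match on PK, begins_with on SK, regardless of entity type."""
    return [i for i in table if i["PK"] == pk and i["SK"].startswith(sk_prefix)]

query(table, "Moby-Dick", "Herman-Melville")  # book lookup by author prefix
query(table, "Gustave-Eiffel")                # all addresses for a contact
```

Nothing in the table schema distinguishes the two entity types; only the shape of the keys and the attributes on each item do.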
A few things to note: each entity makes use of only the metadata that applies to it. You can see that book records do not use an email attribute, and our contact-by-phone items never use the age group, although they have the flexibility to do so.
The key point here is that we did not change the primary key or sort key columns. Rather, we simply construct a key that fits the needs of our access pattern for a given entity.
Following the process, we ended up with these entity PK and SK mappings. We can now support over 10 queries without changing our table.
It is possible to query:
Hopefully, you can see how powerful this approach is and how quickly you can scale out a single table using this method. When volumes of data become a problem and you need fast access, this is a great way to have an excellent and performant table that can handle changes in a robust manner.
Now that we've walked through several variations of this approach, we're ready for implementation.
Ultimately, the needs of your application should dictate the "right" way to access your data.