Sunday, September 8, 2024
No menu items!
HomeDatabase ManagementAmazon DynamoDB schema from the prism of SQL

Amazon DynamoDB schema from the prism of SQL

A database is one of the main components of software system design. RDBMS has been a cornerstone of application development for decades, making SQL the language of choice for most developers. As application requirements for scalability and availability are changing rapidly, Amazon DynamoDBā€”a serverless, NoSQL key-value database that delivers single-digit millisecond performance at any scale provided the schema design is done rightā€”is among the databases that are taking center stage to meet those requirements because of features like:

Serverless and fully managed
No concept of maintenance, upgrades and downtime
Multi-master global replication etc.

In this post, I describe DynamoDB schema design from the perspective of SQL developers. I provide an example and explain the process backward from SQL to DynamoDB schema design.

Overview

Like any other database, schema design is the first step you should take when building an application with DynamoDB. One of the benefits of DynamoDB (and similar NoSQL databases) is support for flexible schema; nevertheless, schema design is still the most important aspect for driving efficient performance and addressing all access patterns. Schema design affects the performance of your application, which affects the cost. Lack of good schema design can lead to inefficient performance and greater total cost of ownership of the workload.

Schema design in DynamoDB requires a philosophical change in the data modeling mindset, especially if youā€™re coming from a relational database background. Itā€™s an iterative approach that might need multiple rounds of design and review. The following are typical steps of DynamoDB schema design:

Understand and define the use case.
Identify the entities and create an entity-relationship (ER) diagram.
Identify the access or query patterns:
Read and write queries.
Define query dimensions.
Understand aggregators.

Design the schema based on the access patterns identified in step 3.
Define the partition key and sort key for tables and global secondary indexes (GSI).
Review and repeat the process.

Steps 3, 4, and 5 are where the design exercise gets interesting and can sometimes be daunting. I explain these steps from the prism of SQL. I show you how to define the access patterns, convert them to SQL queries, and design a DynamoDB table and GSI schema with partition and sort keys with the help of those SQL queries. Note that the techniques mentioned here arenā€™t the only way to develop a schema and these steps wonā€™t map for all workloads.

I used NoSQL Workbench for DynamoDB to create the schema design in this post. NoSQL Workbench is a visual tool that provides data modeling, visualization, and query development capabilities to design and manage DynamoDB tables and GSIs.

When the schema design is complete, you can use PartiQLā€”a SQL-compatible query language for DynamoDBā€”to corroborate your SQL mappings. Iā€™ve included PartiQL queries for all the access patterns described in this post.

Defining the use case and access patterns

Letā€™s consider the example of an online retail store. You need to design customer, order, and invoice components for the application. A single customer can place multiple orders, and each order has an invoice associated with it. A single invoice has only one order, whereas one customer can have multiple invoices over time. The customer is identified by a customer ID; orders and invoices can be uniquely identified by order ID and invoice ID, respectively. This is depicted in the entity relationship diagram in Figure 1 that follows.

Figure 1: Entity relationship diagram

The following are common access patterns for a retail store:

Get all information about a customer.
Get all orders placed by a customer.
Get details of a given order.
Get specific order details of a customer.
Get all invoices for a customer.
Get orders for an invoice.

In the following sections, I take individual access patterns and write SQL queries for them. From the SQL queries, I design the DynamoDB schema.

Get all information about a customer

The get all information about a customer access pattern translates as: Given a customer ID, find the details about the individual customer. Information about a customer is stored in the customer entity. See the following SQL statement:

SELECT * FROM CUSTOMER WHERE CUSTOMER_ID = 12345;

To convert this into a DynamoDB schema, you need to define the primary key first, which consists of a partition key and, optionally, a sort key. Choosing the right partition key is paramount to DynamoDB schema design.

Look at the condition in the WHERE clause in the preceding SQL statement. It gives you a hint about how to choose the partition key. Because customer_id is in the WHERE clause with an equality condition, choose it as the partition key. Figure 2 that follows shows the initial schema with customer_id as the partition key. I kept the sort key blank, which I revisit shortly. I cannot leave it blank because a partition key and sort key are mandatory for a composite primary key.

Note: A sort key can be skipped for a simple primary key, which contains only the partition key.

Figure 2: Schema for access pattern get all information about a customer

Note: As a best practice, the names of the partition key and sort key are PK and SK rather than the attribute names themselves. The reason is that this table might contain multiple entities and the values of the primary key and sort key will be different for different entities, so itā€™s helpful to use generic names like PK and SK.

The corresponding DynamoDB API call (pseudo code) looks like:

GetItem API: Partition Key = 12345

The corresponding PartiQL statement looks like (with the table name retail_table):

SELECT * FROM ā€œretail_tableā€ WHERE ā€œPKā€ = ā€˜12345ā€™;

Get all orders placed by a customer

The requirement for the access pattern get all orders placed by a customer is to get all orders for a given customer ID. In a RDBMS, this requires joining the two entities: customer and order. Records from the order table are fetched by joining it with the customer table using a common attribute (customer_ID) thatā€™s present in both tables. See the following SQL statement:

SELECT A.* FROM ORDERS A, CUSTOMER B
WHERE A.CUSTOMER_ID = B.CUSTOMER_ID
AND B.CUSTOMER_ID = 12345;

DynamoDB is a non-relational NoSQL database that doesnā€™t support table joins (because table joins arenā€™t a scalable operation). To handle this, you must pre-join the data using denormalization. This lets you put customer and order details in the table so that there can be multiple orders with the same customer ID.

Youā€™ve already stored customer information in the table while designing the schema for the previous access pattern. Now, orders for the customer have to be added and linked to the customer. If you examine the WHERE clause, there are two equality conditions, and both use customer_id. Because customer_id is in the equality condition, choose it as the partition key and use the same table as before. The trick is to put order information along with the customer_id partition key, as shown in Figure 3 that follows.

Figure 3: First iteration of schema containing customer and order details

In Figure 3, there are two schemas. The first schema contains order attributes for the customer ID, while the second schema contains customer attributes. There are multiple problems with this design:

The primary key is the same for both the items, which returns duplicate errors while writing items to the table. Two items canā€™t have the same primary key.
Querying the table with customer_id as the partition key using the DynamoDB Query API gets information about both customer and order, which is not the requirement.
GetItem also returns an error because it fetches multiple items.

You need to find a way to separate the two schemas. This is where you would use the sort key. The order entity is part of the join and order_id is unique across orders, so put order_id in the sort key. You can get all orders for a customer using the Query API, where the partition key is customer_id.

Figure 4: Second iteration of schema with customer and order

Orders are sorted by the order IDs of a customer because that has been put in the sort key, as shown in the preceding Figure 4.

Modify the design for the customer entity as shown in the final iteration of the schema in Figure 5 that follows, because you have to fetch single customer details as required by the first access pattern. To fetch a single row for a customer, pass customer_id in the partition key and ROOT in the sort key. The customer details can now be easily demarcated from the other items with the same partition key. Note that ROOT isnā€™t an official value and you can use any term that suits your needs, such as META or ORIG.

Figure 5: Final iteration of schema with customer and order details

The corresponding DynamoDB API call looks like (pseudo-code):

Query API: Partition Key = 12345 and Sort Key begins_with(‘ORDR#’)

The corresponding PartiQL statement looks like:

SELECT * FROM “retail_table” WHERE “PK” = ‘12345’ AND BEGINS_WITH(“SK”, ‘ORDR#’);

Get details of a given order

Get details of a given order is as straightforward as the first access pattern: get all the details of an individual order based on the order ID. See the following SQL statement:

SELECT * FROM ORDER WHERE ORDER_ID = ā€œa1b2c3d4ā€;

Note that the equality condition is on order_id this time, so it can be placed in the partition key. But if you refer to the existing schema, order_id is already the sort key. If you use order_id to find the item, DynamoDB tries a table scan because it canā€™t find the partition, which should be avoided for performance and cost. Instead, you need to have another schema with order_id as the partition key. There are two ways you can do this:

Use order_id as the partition key of the base table with all the other details of the order.
Put the schema on a GSI with order_id as the partition key.

Putting the schema on a GSI is the better choice because you already have all the order details in place. Let me expand on the schema in the base table further, shown in Figure 6 that follows.

Figure 6: Schema with customer and order details

The sort key of the base table has to be there as a separate attribute, as was done with order_id in the preceding design (shown in Figure 6). Items that donā€™t have an order_id attribute wonā€™t be part of the GSI.

Project other attributes into the GSI to get the order details with order_id as the partition key and ROOT as sort key of the GSI. The schema on the GSI is as shown in Figure 7 that follows.

Figure 7: Schema for get details of a given order using GSI

The corresponding DynamoDB API call looks like (pseudo code):

Query API (on GSI): Partition Key = ā€œa1b2c3d4ā€ and Sort Key = ā€œROOTā€

Corresponding PartiQL statement looks like (with the GSI GSI):

SELECT * FROM ā€œretail_tableā€.ā€œGSIā€ WHERE ā€œGSIPKā€ = ā€˜12345ā€™ AND ā€œGSISKā€ = ā€˜ROOTā€™;

Get specific order details of a customer

In the get specific order details of a customer access pattern, you get the details of an individual order placed by an individual customer when the customer ID and order ID are given. See the following SQL statement:

SELECT A.* FROM ORDERS A, CUSTOMER B
WHERE A.CUSTOMER_ID = B.CUSTOMER_ID
AND B.CUSTOMER_ID = 12345
AND A.ORDER_ID = ā€œa1b2c3d4ā€;

The SQL query has an equality condition on two columns; the question is which you should use as the partition key. You can use either as the partition key and the other as a sort key, or concatenate both and place the result in the partition key (as long as the resulting primary key is unique). For example, <customer_id>#<order_id> or <order_id>#<customer_id>. But, looking at the schema, havenā€™t you already designed for this access pattern without any concatenation?

Figure 8: Schema with customer and order

The corresponding DynamoDB API call looks like (pseudo-code):

GetItem API: Partition Key = 12345 and Sort Key = ā€˜ORDR#a1b2c3d4ā€™

The corresponding PartiQL statement looks like:

SELECT * FROM ā€œretail_tableā€ WHERE ā€œPKā€ = ā€˜12345ā€™ AND “SK” = ā€˜ORDR#a1b2c3d4ā€™;

Had you not already solved the access pattern, you could have used the design mentioned in the preceding paragraph. The reason I used this query pattern is to show you that no matter the number of equality conditions in a SQL query, you can accommodate the pattern in the schema by placing concatenated values of those keys (for example <customer_id>#<order_id> or <order_id>#<customer_id>).

Get all invoices for a customer

The Get all invoices for a customer access pattern is similar to get all orders placed by a customer, but you have to use a different entity, which is Invoice. Given a customer ID, find all invoices belonging to that customer. See the following SQL statement:

SELECT A.* FROM INVOICE A, CUSTOMER B
WHERE A.CUSTOMER_ID = B.CUSTOMER_ID
AND B.CUSTOMER_ID = 12345;

The statement is similar to the access pattern to get the order history of a customer, where the equality condition is on customer_id and thereā€™s a join with entity Invoice. We use invoice_id to identify the invoices uniquely in the invoice entity. So, the schema design is as shown in Figure 9 that follows.

Figure 9: Schema with customer, order, and invoice

The corresponding DynamoDB API call looks like (pseudo-code):

Query API: Partition Key = 12345 and Sort Key begins_with(INVC#)

The corresponding PartiQL statement looks like (with the table name retail_table):

SELECT * FROM ā€œretail_tableā€ WHERE ā€œPKā€ = ā€˜12345ā€™ AND BEGINS_WITH(ā€œSKā€, ā€˜INVC#ā€™);

Get orders for an invoice

The get orders for an invoice access pattern provides one-to-one mapping between invoice and order. Given an invoice ID, fetch the details of its orders. See the following SQL statement:

SELECT A.* FROM ORDERS A, INVOICE B
WHERE A.INVOICE_ID = B.INVOICE_ID
AND B.INVOICE_ID = ā€œinv12ab34cdā€;

Get orders for an invoice is also similar to the access pattern to get all invoices for a customer, but the join is across the orders and invoices tables and the equality condition is on invoice_id. But you havenā€™t defined a partition key on invoice_id. To get orders for an invoice, add invoice entity in the main table and the necessary order details along with it. The schema design, after adding invoice_id as partition key, is as shown in figure 10 that follows.

Figure 10: Schema for access pattern get orders for an invoice

If further details of the order are required, use the GSI to fetch them using a separate API call.

The corresponding DynamoDB API call looks like (pseudo-code):

GetItem API: Partition Key = ā€œinv12ab34cdā€ and Sort Key = ā€œROOTā€

The corresponding PartiQL statement looks like:

SELECT * FROM ā€œretail_tableā€ WHERE ā€œPKā€ = ā€˜inv12ab34cdā€™ AND ā€œSKā€ = ā€˜ROOTā€™;

After you complete this schema design exercise, the items in the main table look something like Figure 11 that follows. For brevity, it shows an example of one customer with two orders and two invoices.

Figure 11: Final schema with all access patterns

The items in the GSI that contains order_id as the partition key looks like Figure 12 that follows.

Figure 12: Final GSI

Additional use of the sort key

As the name suggests, you use the sort key to sort records within a partition key. Letā€™s say thereā€™s an access pattern that replaces the previous access pattern that was used to get all orders placed by a customer. This new pattern gets a customerā€™s ten most recent orders. See the following SQL statement:

SELECT A.* FROM ORDERS A, CUSTOMER B
WHERE A.CUSTOMER_ID = B.CUSTOMER_ID
AND B.CUSTOMER_ID = 12345
ORDER BY ORDER_DATE LIMIT 10;

The design for get all orders for a customer (shown in the Figure 13 that follows) doesnā€™t work because the orders are sorted based on order ID and not on order date.

Figure 13: Schema for orders

Because order_date is included in the ORDER BY clause of the SELECT statement, you have to place order_date in the sort key so that the items are sorted based on that. The date should be stored as a string in ISO8601 format to get proper sorting. By doing so, each customerā€™s order items are sorted based on order_date. Order_id is still in the attributes but not used as a key. The new schema design is shown in Figure 14 that follows.

Figure 14: Alternate schema with order_date as the sort key

This schema helps solve both access patterns at once: get all orders placed by a customer and get a customerā€™s ten most recent orders.

The corresponding DynamoDB API call looks like (pseudo-code):

Query API: Partition Key = 12345 and Sort Key begins_with(ORDR#) Limit:10 ScanIndexForward:False

The corresponding PartiQL statement looks like:

SELECT * FROM ā€œretail_tableā€ WHERE ā€œPKā€ = ā€˜12345ā€™ AND BEGINS_WITH(ā€œSKā€, ā€˜ORDR#ā€™) ORDER BY SK ASC;

Note: LIMIT is not yet supported by PartiQL for DynamoDB.

Conclusion

In this post, you learned a way to use your existing knowledge of SQL development and translate that into DynamoDB schema design. You also learned some ways to map a SQL query to a DynamoDB table and GSI partition and sort keys. I showed you schema design using NoSQL workbench and corresponding SQL-like queries using PartiQL This approach can help you design your DynamoDB schema from the perspective of SQL.

About the Author

Soumyadeep Dey is a Database Specialist Solutions Architect with Amazon Web Services (AWS). He works with AWS customers to design scalable, performant, and robust database architectures on the cloud using both SQL and NoSQL database offerings.

Read MoreAWS Database Blog

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Most Popular

Recent Comments