In the first two posts of this four-part series, you learned how the choice of zonal or Regional services affects availability (Part 1) and about some important characteristics of Amazon DynamoDB when used in a multi-Region context with global tables (Part 2). Part 1 also covered the motivation for using multiple AWS Regions. In this post, you’ll review a design pattern for building a resilient multi-Region application using global tables.
How do you build a resilient multi-Region application using DynamoDB global tables? The post Disaster Recovery (DR) Architecture on AWS, Part I: Strategies for Recovery in the Cloud lays out four patterns: backup and restore, pilot light, warm standby, and active-active. In this post, I focus on the active-active pattern, because it gives the best possible recovery time objective (RTO). However, keep in mind that this pattern is the most difficult to implement from architecture and operation perspectives. I discuss some of the challenges in this post and in Part 4.
In the active-active case, you deploy the application into multiple Regions, each of which serves some portion of the traffic. Clients can write into the Region of their choice and read from any Region. As Part 1 discussed, deploying across multiple Regions increases the theoretical availability of your application and gives better performance for a geographically distributed user base. There are additional costs to consider, which I discuss below.
In Part 1, I discussed a sample application (shown in Figure 1 that follows) that uses DynamoDB for persistence, AWS Lambda functions for business logic, and Amazon API Gateway to present the interface to external users.
The application depicted in Figure 2 that follows has three changes.
It uses DynamoDB global tables to automatically replicate data between two or more Regions.
Amazon Route 53 provides DNS-based routing.
The sample application uses latency-based routing policies for Route 53.
Latency-based routing is a popular choice when using an active-active pattern. Route 53 Application Recovery Controller (Route 53 ARC) provides highly reliable rebalancing of traffic.
This pattern is generally feasible under four conditions:
Your application above the database tier is stateless.
Your database allows writes in all Regions and handles data replication and write conflict resolution.
Note: If your application uses relational data stores or uses Kubernetes for compute, the active-active pattern is more complex.
You have the operational expertise to handle multi-Region deployments, monitoring, and governance. Part 4 of this series talks about operational aspects of a multi-Region application.
You have a way to direct traffic to a specific Region based on factors such as location and latency.
Operationally, after you have active-active up and running, it’s the simplest pattern of the four, because you’re always using all Regions in normal conditions. You reduce the complexity of rebalance events and improve overall resource utilization. You do need to plan for how much extra capacity to have in each Region. For example, say you plan for an even 50/50 split between two Regions. You might slightly overprovision to account for traffic spikes, with each Region provisioned to handle 60 percent of traffic. From a cost perspective, you’re paying for 20 percent extra capacity. The concept of static stability would urge you to provision each Region to handle 100 percent of global traffic so that you avoid scaling actions during a rebalance, but that’s a cost/resilience trade-off decision you must make based on your organization’s needs and risk tolerance.
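The capacity arithmetic above can be sketched in a few lines of Python. The function name and its parameters are illustrative only and aren't part of the sample application.

```python
def extra_capacity_percent(num_regions: int, per_region_percent: float) -> float:
    """Return provisioned capacity beyond 100 percent of global traffic.

    per_region_percent is the share of global traffic that each Region is
    provisioned to handle, expressed as a percentage (60 means 60 percent).
    """
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    total_provisioned = num_regions * per_region_percent
    return total_provisioned - 100.0


# Two Regions, each provisioned for 60 percent of global traffic.
slightly_overprovisioned = extra_capacity_percent(2, 60)

# Static stability: each of two Regions can absorb all global traffic.
statically_stable = extra_capacity_percent(2, 100)

print(slightly_overprovisioned, statically_stable)
```

With two Regions, the slightly overprovisioned case works out to 20 percent extra capacity, and the statically stable case to 100 percent extra. That difference is the cost side of the trade-off.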
Now let’s look at some important design questions.
Does your application work well with DynamoDB global tables?
Recall from Part 2 that DynamoDB global tables don’t support transactions across Regions, and conditional operations might be looking at stale data if an update from another Region hasn’t yet fully replicated. The sample application uses both conditional operations and transactions. Is it appropriate to use global tables?
First, let’s revisit the specific logic that the sample application uses during reads and writes. During reads, it uses a DynamoDB transaction to read two items: an order and the accompanying product. If you’re concerned that you might read an invalid product, you can instead keep a historical snapshot of the state of a product at the time the order was placed. That’s probably a good change to make anyway, as the details of a product might change over time, and it’s useful to know what the product looked like when the order was placed.
During writes, the application uses a condition check to make sure that the order ID doesn’t already exist. If you use a universally unique ID (UUID) to create order IDs, that check is likely unnecessary. The application also uses a condition check to make sure that the customer ID exists. You might be able to make simplifying assumptions here that customer IDs are never deleted and that your application has already confirmed the existence of a valid customer ID earlier in the customer interaction. In a rare case where a customer ID is created in one Region and the customer immediately tries to place an order in another Region before their customer record is replicated, it might be OK to make the customer log in again before processing the order.
The last condition check ensures that the product has a valid status. While you want to make sure that items are in stock before trying to sell them, you might be able to handle a possible error condition (the item being out of stock when you try to fulfill the order) using a separate error flow. Perhaps a customer service agent can handle cases where an order is placed for an item that is actually out of stock.
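As a minimal sketch of the order write discussed above, the following Python builds the arguments for a DynamoDB PutItem call with a UUID-based order ID. The table name (orders) and the attribute names are assumptions for illustration; the sample application may use different ones.

```python
import uuid


def build_put_order_request(customer_id: str, product_id: str) -> dict:
    """Build keyword arguments for a DynamoDB PutItem call.

    The table name and attribute names are hypothetical.
    """
    order_id = str(uuid.uuid4())  # random UUID; collisions are very unlikely
    return {
        "TableName": "orders",
        "Item": {
            "order_id": {"S": order_id},
            "customer_id": {"S": customer_id},
            "product_id": {"S": product_id},
        },
        # Optional safeguard. The condition is evaluated only against the
        # replica in the Region that receives the write.
        "ConditionExpression": "attribute_not_exists(order_id)",
    }


request = build_put_order_request("customer-123", "product-456")
print(request["Item"]["order_id"]["S"])
```

You could pass the result to the boto3 DynamoDB client as put_item(**request). Keeping or dropping the ConditionExpression is a judgment call: with random UUIDs it rarely fires, and it can't see writes that haven't yet replicated from another Region.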
Next, let’s consider a simplifying assumption that the product catalog, customers, and orders are geographically sharded. Perhaps we run the application in a Region in Europe for European customers, and in a Region in North America for customers in the United States, Mexico, and Canada. Using global tables makes the data available in both Regions for read-only reporting purposes and for resilience. If there’s an availability event in the Region in Europe, we can shift that traffic to the North America Region temporarily. But in normal operating conditions, the transactions and conditional operations in the North America Region aren’t working on the same items as the transactions and conditional operations in the Europe Region. In this case, using global tables is a good solution. This geographic homing of items in tables might occur naturally, or you can build business logic in the application to explicitly enforce it.
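If you choose to enforce geographic homing in business logic, the routing rule could look something like the following sketch. The country-to-Region mapping and the specific Region names (eu-west-1 and us-east-1) are assumptions chosen for illustration.

```python
from typing import Dict, Set

# Hypothetical mapping of customer country codes to home Regions.
HOME_REGION: Dict[str, str] = {
    "DE": "eu-west-1",
    "FR": "eu-west-1",
    "US": "us-east-1",
    "MX": "us-east-1",
    "CA": "us-east-1",
}


def pick_region(country: str, healthy_regions: Set[str]) -> str:
    """Return the home Region for a country, or a healthy fallback Region."""
    home = HOME_REGION.get(country)
    if home is None:
        raise KeyError(f"no home Region configured for {country}")
    if home in healthy_regions:
        return home
    # Home Region is impaired: temporarily shift to another healthy Region.
    fallbacks = sorted(healthy_regions)
    if not fallbacks:
        raise RuntimeError("no healthy Region available")
    return fallbacks[0]


print(pick_region("DE", {"eu-west-1", "us-east-1"}))
print(pick_region("DE", {"us-east-1"}))
```

In normal conditions, each customer's writes land only in their home Region, so conditional operations and transactions in different Regions don't touch the same items. The fallback path is used only during an availability event.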
In another scenario, the application might have other business logic safeguards to make sure that operations on the same items don’t occur within the approximately 1 second replication latency window. Again, you must evaluate these assumptions and scenarios for your application and make sure you’re comfortable with the characteristics of global tables in a multi-Region scenario. The example application gives you a starting point to experiment with different scenarios.
How many Regions to use?
You can extend the design pattern discussed in the previous section to more than two Regions. If you have users in North America, Europe, and Asia, for example, you might want to run the application as active-active in three Regions. The choice of how many Regions to include is a balance between providing lower-latency access to users and the variable costs, such as DynamoDB storage costs, tied to using another Region. Each additional Region adds costs for replicated write units and storage. For more information, refer to DynamoDB pricing.
Picking a cell boundary
Cells are independent and identical partitions. The active-active design pattern uses a Region as a cell. If the application has Availability Zone-scoped resources, you could consider using a zone as a cell. But as the application stands, if you want a cell to have a scope smaller than a Region, you have to implement a cell router that assigns incoming requests to a specific cell, which might live beneath a separate API Gateway or just a different API stage. Besides improving resilience, using more cells can help ensure that each cell is running under a mostly constant load. As the article Reliability, constant work, and a good cup of coffee points out, constant work is a useful tool for improving resilience.
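To illustrate what a cell router does, here is a minimal sketch that assigns each request to a cell by hashing a key such as a customer ID. The cell names and the choice of modulo hashing are assumptions; the sample application doesn't include a cell router.

```python
import hashlib
from typing import List


def assign_cell(partition_key: str, cells: List[str]) -> str:
    """Map a key (for example, a customer ID) to one cell deterministically."""
    if not cells:
        raise ValueError("at least one cell is required")
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then pick a cell.
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


cells = ["cell-a", "cell-b", "cell-c"]  # hypothetical cell names
print(assign_cell("customer-123", cells))
```

Modulo hashing keeps the example short, but it remaps most keys whenever the number of cells changes. A production router would more likely use consistent hashing or an explicit mapping table.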
Another important decision is how users will find the right Regional application endpoint to use. AWS offers two services that provide this sort of global load balancing. The first is Route 53, which provides DNS-based routing according to policies based on latency, health checks, location, and other factors. The second is AWS Global Accelerator, which provides static IP addresses for client traffic. Global Accelerator onboards traffic that uses these IP addresses to the AWS global network and directs it to the nearest healthy Region. Compared to Route 53, Global Accelerator often provides faster redirection of customer traffic, as it doesn’t rely on DNS record updates. It might also provide lower latency because it uses the AWS global network as close to the end-user as possible. However, it incurs a higher cost than Route 53, with a per-GB traffic fee.
If you’re looking for the most reliable way to redirect traffic during an event that degrades the workload in one Region, consider using Route 53 ARC. Route 53 ARC gives you a traffic control mechanism hosted in the data plane of five Regions, so it continues to function even if two Regions suffer outages. You can manually or programmatically ask Route 53 ARC to isolate a failed Region from traffic. Route 53 ARC also follows a fail open policy, meaning that if all Regions are marked as unhealthy, traffic can go to any Region.
If you don’t use Route 53 ARC, make sure that you’re not relying on manual intervention to change Route 53 records or Global Accelerator routing policies.
Another option is using client-side endpoint switching. In client-side endpoint switching, each application client that invokes the application’s API endpoint is aware of all the Regional endpoints and contains logic about when to switch from one endpoint to another. For example, consider an active-active deployment of a three-tier application. The presentation tier is a stateless web application, which uses a service discovery mechanism such as AWS Cloud Map to discover all possible Regional endpoints for the application’s business logic tier. If the preferred Region isn’t available, the web application retries a few times before failing over to a different Region. Compared to managing rebalancing through Route 53 or Global Accelerator, client-side endpoint switching gives you more control over when to switch between Regions. However, it’s only feasible if each Region is equally capable of handling all traffic, and it requires both a service discovery mechanism and potentially complex endpoint switching logic baked into the application itself. Client-side endpoint switching also doesn’t natively include the load balancing and advanced routing policies offered by Route 53 and Global Accelerator.
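The retry-then-fail-over behavior described above can be sketched as follows. The endpoint URLs and the fake_send stand-in are hypothetical; a real client would make an HTTP call and would normally add backoff and jitter between attempts.

```python
from typing import Callable, Optional, Sequence


def call_with_failover(
    endpoints: Sequence[str],
    send: Callable[[str], str],
    attempts_per_endpoint: int = 3,
) -> str:
    """Try each endpoint in preference order, retrying before moving on."""
    last_error: Optional[Exception] = None
    for endpoint in endpoints:
        for _ in range(attempts_per_endpoint):
            try:
                return send(endpoint)
            except ConnectionError as error:
                last_error = error  # remember the failure and keep trying
    raise RuntimeError("all Regional endpoints failed") from last_error


def fake_send(endpoint: str) -> str:
    # Stand-in for an HTTP call; simulates an outage in the preferred Region.
    if "us-west-2" in endpoint:
        raise ConnectionError("simulated outage")
    return f"ok from {endpoint}"


endpoints = [
    "https://api.us-west-2.example.com",  # preferred Region
    "https://api.us-east-1.example.com",  # fallback Region
]
print(call_with_failover(endpoints, fake_send))
```

Even this small version shows where the complexity comes from: the client has to own the endpoint list, the retry budget, and the decision about which errors justify switching Regions.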
Figure 3 that follows summarizes your options for rebalancing traffic.
All three options push the routing decision high up in the stack. The underlying assumption is that you treat each Region as a fairly self-contained unit or cell. Although you could move the routing decision lower, that introduces more complexity to the design. For example, in the three-tier example, you could allow each layer of the application to decide how to connect to other layers. Reasoning about the effects of this design becomes complex. Is it okay if each tier of the application lives in a different Region? That’s likely not what you want, because it introduces additional latency and data transfer overhead, but it’s a possible effect of pushing routing decisions further down the stack.
Testing multi-Region applications
Let’s consider how to capture three important metrics from your application:
Recovery point objective (RPO) – With DynamoDB global tables version 2019.11.21, you don’t have a database metric that provides the pending replication item count. A useful approximation instead is the ReplicationLatency metric, which indicates how long it takes for items to replicate from a table in one Region to another. Consider an item written into one Region at time t1. If the average ReplicationLatency between that Region and a second Region is 1 second, you can expect on average that the item will be fully replicated to the second Region at time (t1 + 1 second). Suppose the first Region suffers an availability impact at some point t2, where t1 < t2 < (t1 + 1 second). Is the item replicated or not? You don’t know for sure. By time t2, the item might have arrived in the second Region and still be processing in the inbound replication queue, in which case it’ll still be successfully replicated. Or the item might have still been in the outbound replication queue in the source Region when the failure happened. Because the metric is an average and some items take longer to replicate, you can take the ReplicationLatency metric as a lower bound on RPO until DynamoDB provides more visibility into the state of pending replication items. If you can’t afford to lose new items, you can use a replay queue as long as your work is idempotent. In other words, if you’re uncertain that an item was replicated successfully, you can try to replay it, as long as trying to update an existing item doesn’t change the actual state of the item. For more details on constructing idempotent APIs, refer to Making retries safe with idempotent APIs. If you’re thinking about putting in queues to replay work, refer to Avoiding insurmountable queue backlogs.
Recovery time objective (RTO) – RTO depends on which Region a user is connected to. Let’s say that us-west-2 experiences an event that degrades your workload. A user connected to us-east-1 notices no impact (RTO = 0). A user connected to us-west-2 will see failures until the traffic rebalancing system switches them to us-east-1. If you simulate a failure by throwing the rebalancing switch, the user in us-west-2 will see their traffic move to us-east-1 after the routing change takes effect. Their request latency might change, but they won’t see failures. Therefore, you can measure an approximation to RTO by observing when operations move from us-west-2 to us-east-1. Actual RTO might be complicated by factors like clients maintaining persistent connections and will vary based on how you handle connection routing.
Read and write latency – DynamoDB reports operation latency as measured within the service, and you can also use a client-side custom metric to report latency from the database client’s perspective.
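To make the replay idea from the RPO discussion concrete, here is a minimal in-memory sketch of an idempotent replay. The dictionary stands in for the replica table in the surviving Region, and the order items are made up for illustration.

```python
from typing import Dict, List, Tuple

Item = Dict[str, str]


def apply_write(table: Dict[str, Item], key: str, item: Item) -> None:
    """Idempotent put: writing the same item again leaves the table unchanged."""
    table[key] = dict(item)


def replay(table: Dict[str, Item], pending: List[Tuple[str, Item]]) -> None:
    """Reapply every write whose replication status is uncertain."""
    for key, item in pending:
        apply_write(table, key, item)


# In-memory stand-in for the replica table in the surviving Region.
replica: Dict[str, Item] = {"order-1": {"status": "PLACED"}}

# order-1 had already replicated before the failure; order-2 had not.
pending = [
    ("order-1", {"status": "PLACED"}),
    ("order-2", {"status": "PLACED"}),
]

replay(replica, pending)
replay(replica, pending)  # replaying again does not change the result
print(replica)
```

A blind overwrite like this is only safe if the surviving Region hasn't since applied a newer update to the same item. If it might have, one option is to compare a version number or timestamp before overwriting.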
Figure 4 that follows summarizes how we gather these metrics during a steady-state test. DynamoDB metrics and the request metrics recorded by the application’s Lambda function help us measure RPO and RTO.
There is a test suite in the sample GitHub repository that uses an AWS Step Functions workflow that runs at the same time (but independently) in all Regions. The workflow generates a batch of simulated products, then invokes the application API to create and read back an order for each product. We set the concurrency in the workflow to 1, so we have a steady but slow incoming stream of traffic.
In one test run, the workflows started at approximately 20:20:00 UTC. We used Route 53 ARC to turn off traffic to us-west-2 at 20:25:20.
Recovery time objective
Figure 5 that follows shows request counts in us-east-1 during the test. This metric is the number of recorded observations from the Lambda function in the application stack, recorded at 1-second granularity and shown in 10-second aggregation windows.
Figure 6 that follows shows request counts for us-west-2.
The requests dropped off in us-west-2 and doubled in us-east-1 at approximately 20:26:00, for an RTO of 40 seconds. The Amazon CloudWatch logs show exactly when requests stopped routing to us-west-2.
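The RTO figure above is the time between the routing change and the observed traffic shift. As a small sketch, with a helper name of my own choosing:

```python
from datetime import datetime


def rto_seconds(switch_time: str, traffic_moved_time: str) -> float:
    """Return seconds between the routing change and the observed traffic shift.

    Both arguments are HH:MM:SS strings on the same UTC day.
    """
    fmt = "%H:%M:%S"
    start = datetime.strptime(switch_time, fmt)
    end = datetime.strptime(traffic_moved_time, fmt)
    return (end - start).total_seconds()


# Times taken from the test run described in this post.
print(rto_seconds("20:25:20", "20:26:00"))
```

Because the charts aggregate requests into 10-second windows, treat the result as an approximation and not an exact measurement.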
Recovery point objective
Figure 7 that follows shows the replication latency (from us-east-1 to us-west-2) reported by DynamoDB for three tables during the test. The average latency was around 630 milliseconds, with some peaks during the early part of the test.
Read and write latency
Figure 8 that follows shows the read and write latency reported by DynamoDB and the application’s Lambda function in us-east-1. (We don’t have much data from us-west-2, because we directed traffic away from that Region.) The steady-state latency ranged from 5–20 milliseconds, as reported by DynamoDB, or 15–25 milliseconds as reported by the Lambda function. The latency experienced by end-users will depend on where they connect from, which we cover in more detail in Part 2.
In this post, you learned about a design pattern for building a resilient application using DynamoDB global tables. We reviewed several important design choices and presented some sample code for testing the design. In Part 4, we’ll cover operational concerns such as observability.
Special thanks to Todd Moore, Parker Bradshaw and Kurt Tometich who contributed to this post.
About the author
Randy DeFauw is a Senior Principal Solutions Architect at AWS. He holds an MSEE from the University of Michigan, where he worked on computer vision for autonomous vehicles. He also holds an MBA from Colorado State University. Randy has held a variety of positions in the technology space, ranging from software engineering to product management. He entered the Big Data space in 2013 and continues to explore that area. He is actively working on projects in the ML space and has presented at numerous conferences including Strata and GlueCon.