How a large financial AWS customer implemented high availability and fast disaster recovery for Amazon Aurora PostgreSQL using Global Database and Amazon RDS Proxy

In this post, we show how a large financial AWS customer achieved sub-minute failover between Availability Zones and single-digit-minute failover between AWS Regions. The customer partnered with AWS to engineer a solution that provides high availability (HA) and disaster recovery (DR) for their wealth management customer portal. The goals of the design were to minimize the time it takes to perform HA and DR failover between Availability Zones and Regions, and to reduce the room for human error during failover. This required automating failure detection and failover, and using AWS-managed data replication. The customer also wanted to make sure that the failover process itself is resilient against AWS control plane outages.

The team decided to use Amazon Aurora PostgreSQL-Compatible Edition with Amazon Aurora Global Database, which offers a fast, scalable, and AWS-managed cross-Region replication solution. The design includes canary outage detection provided by AWS Lambda, DNS redirection via Amazon Route 53, and control plane resilience via Amazon Route 53 Application Recovery Controller.

This design removed the need for human intervention, which can take minutes or even hours. It reduced the cross-Region recovery time to single-digit minutes (2 minutes in our testing), and in-Region HA failovers from minutes to seconds (10 seconds in our testing). It also provided an extra layer of protection from control plane outages.

We have published the AWS CloudFormation templates that build the infrastructure and prepare the testing environment to our GitHub repository.

The design process

The platform is a classic three-tier design, consisting of a web frontend, a middle tier of application logic, and a backend database that stores the application data, user session state, user preferences, and configuration for the application itself. Because every second of downtime means lost business, the platform requires the lowest possible downtime and high availability. To maximize availability, the application needs to fail over in seconds if anything goes wrong with the primary Availability Zone, and in minutes in the case of a large-scale event that makes the entire Region unusable. The platform runs in US East (N. Virginia), with a backup instance in US East (Ohio).

In 2021, the customer’s architecture team took on the ambitious goal of reducing the wealth management platform’s Recovery Time Objective (RTO) from tens of minutes to seconds, whether dealing with a software failure or a large-scale event affecting the entire Region. They also set a Recovery Point Objective (RPO) of under a minute for user data, meaning that if the primary site went down, up to a minute of recent changes to the wealth management configuration could be lost, but anything older should be available at the backup site. The resilience requirements were as follows:

Less than a minute for in-Region failover
5-minute RTO and 15-minute RPO for cross-Region recovery
A reusable design that can be applied to other Tier 1 workloads

The customer partnered with the AWS Solutions Architecture team to design, build, and test this solution. The basic application (in a single Availability Zone and Region) would look like a standard two-tier or three-tier design, as shown in the following diagram.

Normally, AWS customers enact a Regional failover after alerting operations teams to manually evaluate the situation and decide how to proceed. Because this takes minutes (not seconds), the customer’s RTO was far too stringent to allow that approach. This meant that the only path to success was full automation of data replication, failure detection, and failover.

Cross-Region data replication

Initially, the team looked at Amazon DynamoDB global tables to make the application’s data available in two Regions. DynamoDB global tables enable users to use a NoSQL API to write records to any Region and transparently synchronize them to other Regions within seconds, enabling multi-Region active-active operation. Unfortunately, this approach wasn’t appropriate for the wealth management application, because it requires a highly normalized relational schema.

Instead, the team chose an Aurora PostgreSQL global database. Aurora Global Database creates a read replica of the primary Region’s database in a secondary Region, and continuously replicates storage-level data changes to that Region. If your Aurora cluster experiences a failure or the Region experiences a large-scale event, you can promote any reader Region to a writer Region. Aurora Global Database supports two different cross-Region failover actions: Switchover and Failover. A Switchover is intended for controlled scenarios, such as operational maintenance and other planned operational procedures; it waits for any pending writes to be copied to the secondary Region before switching roles. If the outage also affects network connectivity between the two Regions, that synchronization can take an unpredictable amount of time, further delaying recovery. A Failover, on the other hand, begins the recovery process immediately, without waiting for replication, so it is better suited to recovering from an unplanned outage. To minimize downtime, the team chose the Failover approach.
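As a minimal sketch (the global cluster name and cluster ARN below are placeholders, not the customer’s actual identifiers), the two operations map onto two RDS API calls that can be invoked with boto3:

```python
import boto3

# Placeholder identifiers; substitute your own global cluster and secondary cluster ARN.
GLOBAL_CLUSTER = "wealth-portal-global"
SECONDARY_CLUSTER_ARN = "arn:aws:rds:us-east-2:123456789012:cluster:wealth-portal-secondary"

rds = boto3.client("rds", region_name="us-east-2")  # call from the Region you are promoting

def planned_switchover():
    # Waits for pending writes to replicate to the secondary Region first (no data loss).
    rds.switchover_global_cluster(
        GlobalClusterIdentifier=GLOBAL_CLUSTER,
        TargetDbClusterIdentifier=SECONDARY_CLUSTER_ARN,
    )

def unplanned_failover():
    # Promotes the secondary Region immediately, accepting possible loss of the most
    # recent, not-yet-replicated writes.
    rds.failover_global_cluster(
        GlobalClusterIdentifier=GLOBAL_CLUSTER,
        TargetDbClusterIdentifier=SECONDARY_CLUSTER_ARN,
        AllowDataLoss=True,
    )
```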

In-Region high availability

Aurora provides storage auto-repair and automated recovery from some hardware failures, but the application still needs to handle recovery if an entire Aurora instance or Availability Zone becomes unavailable.

To further improve uptime, the team created an Aurora PostgreSQL replica in a second Availability Zone in each Region. Thanks to the design of the Aurora storage system, these replicas stay continuously and durably in sync with the writer instance and can take over as the writer in seconds, allowing the workload to keep running in the primary Region.

To further improve high availability during in-Region failovers, the team used Amazon RDS Proxy, a Regional service that sits between the application and the Aurora PostgreSQL cluster. RDS Proxy accepts and queues incoming requests during a writer failover, so the SQL endpoint remains available even when the writer is not (or is intentionally taken down for maintenance). New application connections see only a few seconds of added latency, and pending queries continue running against the new writer in the second Availability Zone, with no service interruption for the user.
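As a minimal sketch (the proxy endpoint, database name, and credentials are placeholders), the application simply points its connection string at the RDS Proxy endpoint instead of the cluster writer endpoint, with a short retry loop to absorb the few seconds of latency during a writer failover:

```python
import time
import psycopg2  # PostgreSQL driver used here for illustration

# Assumed RDS Proxy endpoint; the application never connects to the writer directly.
PROXY_ENDPOINT = "wealth-portal-proxy.proxy-abcdefghij.us-east-1.rds.amazonaws.com"

def get_connection(retries=3, delay=2):
    for attempt in range(retries):
        try:
            return psycopg2.connect(
                host=PROXY_ENDPOINT,
                port=5432,
                dbname="wealthportal",
                user="app_user",
                password="...",      # retrieved from AWS Secrets Manager in practice
                connect_timeout=5,
            )
        except psycopg2.OperationalError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)  # brief pause while the proxy finishes the writer failover
```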

The following diagram illustrates the architecture for this in-Region failover solution.

Application layer failover

In this architecture, each Availability Zone within each Region hosts an independent copy of the application layer, each pointing to the RDS Proxy (which in turn points to the Region’s Aurora PostgreSQL cluster’s writer instance). The application tier is stateless, with each client call getting its data from the database, so it doesn’t matter which application instance the client calls. This means that user requests can be routed to any of the application instances and produce the same results.

To route the requests, the team created a Route 53 CNAME (DNS entry) that serves as the global DNS name for the application. A CNAME can be configured to send traffic to Application Load Balancers (ALBs) in one or both Regions by specifying what percentage of requests to send to each. These weights can be updated in real time using the Route 53 API, which causes AWS DNS routing to start sending requests to the backup Region within seconds. In private hosted zones, Route 53 makes these updates near-instant, and for public hosted zones, the Time to Live (TTL) can be set to a low single-digit number of seconds to make sure any stale DNS entries expire quickly. In this case, we used a private hosted zone and a 1-second TTL.
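A minimal boto3 sketch of that weight flip follows; the hosted zone ID, record name, and ALB DNS names are assumptions for illustration. Setting the primary Region’s weight to 0 and the secondary Region’s to 100 shifts all new requests to the backup Region.

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0123456789ABCDEFGHIJ"        # assumed private hosted zone
APP_RECORD_NAME = "app.wealthportal.internal."  # assumed application CNAME

def set_region_weight(set_identifier, alb_dns_name, weight):
    # UPSERT one weighted CNAME record; each Region's ALB has its own SetIdentifier.
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": APP_RECORD_NAME,
                    "Type": "CNAME",
                    "SetIdentifier": set_identifier,
                    "Weight": weight,
                    "TTL": 1,  # 1-second TTL so stale answers expire quickly
                    "ResourceRecords": [{"Value": alb_dns_name}],
                },
            }]
        },
    )

# Fail over: stop sending traffic to the primary Region, send everything to the secondary.
set_region_weight("us-east-1", "primary-alb-123.us-east-1.elb.amazonaws.com", 0)
set_region_weight("us-east-2", "secondary-alb-456.us-east-2.elb.amazonaws.com", 100)
```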

The following diagram illustrates the architecture for a warm standby for a normal operation.

Now that we had a failover mechanism, it was time to figure out how to invoke it. Route 53 provides its own health checks that can do this, but these checks run every minute, which was too long for the customer’s RTO requirement. To solve this problem, the team implemented a canary Lambda (Python) function, which runs continuously and sends test requests to the application as if it were a real user. These requests were designed to require every component of the application stack in order to succeed, so success means that the application, database, proxy, and all connections between them are working. The Lambda function runs these requests every 10 seconds so that any failure is detected quickly. However, because network connectivity can experience momentary interruptions, we didn’t want to treat a single failure as an indication that the system is down. Instead, the Lambda function counts consecutive failures, and only considers a series of two or more (10 seconds apart) to be a true outage.

To avoid the canary function itself being taken out by the same event that made the application unavailable, the team configured the function to run in the secondary Region. Some applications may choose to put the canary in a third Region, but we assumed that the secondary Region was healthy enough to run both the database and the canary, and we also assumed that we were not planning for multi-Region failure. This had the added benefit of testing network connectivity to the application from outside the primary Region. After a failover, we had a choice of either making the secondary Region the new primary and reestablishing the canary there, or waiting for the original primary Region to become available again and failing back to it.
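The customer’s actual canary code isn’t reproduced in this post, but a minimal sketch of such a handler might look like the following; the health URL, DynamoDB table, and threshold are assumptions, and the consecutive-failure count is stored in DynamoDB because Lambda invocations are stateless.

```python
import os
import urllib.request
import boto3

APP_URL = os.environ.get("APP_HEALTH_URL", "https://app.wealthportal.internal/health")  # assumed
FAILURE_THRESHOLD = 2  # two consecutive failures, 10 seconds apart, count as an outage

dynamodb = boto3.resource("dynamodb", region_name="us-east-2")
state_table = dynamodb.Table("canary-state")   # assumed table keyed on "check_name"

def probe_application():
    # Exercise the full stack: ALB -> application -> RDS Proxy -> Aurora.
    try:
        with urllib.request.urlopen(APP_URL, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

def initiate_regional_failover():
    ...  # detach/promote the secondary cluster and flip DNS weights (see other sketches)

def handler(event, context):
    healthy = probe_application()
    item = state_table.get_item(Key={"check_name": "wealth-portal"}).get("Item", {})
    failures = 0 if healthy else int(item.get("consecutive_failures", 0)) + 1

    state_table.put_item(Item={"check_name": "wealth-portal",
                               "consecutive_failures": failures})

    if failures >= FAILURE_THRESHOLD:
        initiate_regional_failover()
```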

The following diagram illustrates the architecture for a warm standby with a full Region failure.

We trigger the failover by restricting traffic to the primary database cluster using the Network Access Control List (NACL) of the subnet where the database is deployed, making the cluster unreachable. When the canary Lambda function detects two consecutive failures, it initiates failover to the secondary Region. The function invokes an unplanned Aurora failover, which immediately cuts the secondary Region away from the primary Region and begins promoting the backup reader node to a writer.
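At the time this was built (before the managed unplanned failover feature discussed later), an unplanned failover meant detaching the secondary cluster from the global database so it promotes to a standalone writer. A minimal boto3 sketch, with assumed identifiers, follows.

```python
import time
import boto3

rds = boto3.client("rds", region_name="us-east-2")  # run in the secondary (surviving) Region

GLOBAL_CLUSTER = "wealth-portal-global"  # assumed identifiers for illustration
SECONDARY_CLUSTER_ARN = "arn:aws:rds:us-east-2:123456789012:cluster:wealth-portal-secondary"

def detach_and_promote():
    # Detaching the secondary cluster from the global database stops replication
    # and promotes it to a standalone, writable cluster in this Region.
    rds.remove_from_global_cluster(
        GlobalClusterIdentifier=GLOBAL_CLUSTER,
        DbClusterIdentifier=SECONDARY_CLUSTER_ARN,
    )
    # Poll until the promoted cluster reports "available" before flipping DNS weights.
    while True:
        cluster = rds.describe_db_clusters(
            DBClusterIdentifier="wealth-portal-secondary"
        )["DBClusters"][0]
        if cluster["Status"] == "available":
            break
        time.sleep(10)
```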

The following diagram illustrates the architecture of a warm standby with a secondary Region where the Aurora writer starts.

After the Aurora database comes up as a writer in the secondary Region, the Lambda function updates the Route 53 CNAME weighting to redirect all new user traffic to the secondary Region. The process of removing and promoting the database cluster takes single-digit minutes to finish. At the same time, another Lambda function starts the process to register the database as the target of RDS Proxy, which takes several more minutes that we didn’t want to put into the critical path of bringing up the backup Region’s application. Until the proxy is up, the CNAME between the application and database sends traffic directly to Aurora.
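A minimal sketch of that registration step is shown below; the proxy and cluster names are assumptions. Because registration takes several minutes, it runs outside the critical path, and the database CNAME is only re-pointed to the proxy once the target reports healthy.

```python
import time
import boto3

rds = boto3.client("rds", region_name="us-east-2")

def register_promoted_cluster_with_proxy():
    # Point the secondary Region's RDS Proxy at the newly promoted cluster
    # (proxy and cluster names are placeholders for illustration).
    rds.register_db_proxy_targets(
        DBProxyName="wealth-portal-proxy-secondary",
        DBClusterIdentifiers=["wealth-portal-secondary"],
    )
    # Poll until the proxy target is AVAILABLE, then update the database CNAME to the proxy.
    while True:
        targets = rds.describe_db_proxy_targets(DBProxyName="wealth-portal-proxy-secondary")
        states = [t.get("TargetHealth", {}).get("State") for t in targets["Targets"]]
        if states and all(s == "AVAILABLE" for s in states):
            break
        time.sleep(15)
```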

The following diagram illustrates the architecture of a warm standby when the secondary Region proxy takes over.

When the RDS Proxy endpoint is available, the Lambda function updates the database CNAME to the proxy, so all new database connection attempts go to the proxy from then on.

To further enhance the resilience of this application, the team added Amazon Route 53 Application Recovery Controller. We added another CNAME, controlled by Application Recovery Controller, in front of the CNAME controlled by the Lambda function. This was necessary to mitigate the risk of the Route 53 control plane (which hosts the API used to manipulate CNAME weights) being unavailable. The Application Recovery Controller cluster runs across five separate Regions, so it can tolerate the loss of any Region and still allow routing updates.
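As a hedged sketch (the routing control ARN and cluster endpoints are placeholders; your actual endpoints come from describing your Application Recovery Controller cluster), flipping a routing control state uses the highly available cluster data plane rather than the Route 53 control plane:

```python
import boto3

# Assumed routing control ARN and cluster endpoints; an ARC cluster exposes five
# Regional endpoints, so try them in turn in case one Region is impaired.
ROUTING_CONTROL_ARN = (
    "arn:aws:route53-recovery-control::123456789012:controlpanel/abc/routingcontrol/def"
)
CLUSTER_ENDPOINTS = [
    ("https://aaaa.route53-recovery-cluster.us-east-1.amazonaws.com", "us-east-1"),
    ("https://bbbb.route53-recovery-cluster.us-west-2.amazonaws.com", "us-west-2"),
    # ... remaining endpoints returned for the cluster
]

def fail_over_routing_control():
    for endpoint_url, region in CLUSTER_ENDPOINTS:
        try:
            client = boto3.client("route53-recovery-cluster",
                                  endpoint_url=endpoint_url, region_name=region)
            # Turning the secondary Region's control "On" (and the primary's "Off") changes
            # the health check tied to the ARC-managed DNS record, redirecting traffic
            # without calling the Route 53 control plane.
            client.update_routing_control_state(
                RoutingControlArn=ROUTING_CONTROL_ARN,
                RoutingControlState="On",
            )
            return
        except Exception:
            continue  # try the next cluster endpoint
    raise RuntimeError("Could not reach any Application Recovery Controller endpoint")
```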

Other options considered

The team briefly considered inserting another CNAME between the application and RDS Proxy to enable only the database layer of the application to fail over to the other Region. Although this worked well, it introduced significant added complexity, including two additional CNAMEs to manage, additional health check Lambda functions, and failover weighting logic to maintain. In the end, the team decided this complexity was not warranted, because failing the whole stack over to the healthy Region was very fast, and it was very unlikely that they would want to continue running applications in a Region where the databases in both Availability Zones had suddenly become unavailable.

The team also considered making this solution active-active (sending traffic to both Regions simultaneously) and using AWS Database Migration Service (AWS DMS) to perform the asynchronous replication. This approach has the advantage of having half of the traffic already going to the healthy Region when a Regional failure occurs. The disadvantage is that the data written to both Regions now has to be reconciled, and some mechanism needs to be put in place to make sure no duplicates are created. In this case, the team was able to use odd and even primary keys to make sure the data could not conflict, but this is not the case for all applications. Additionally, because any user could theoretically be sent to either Region at any time by the Route 53 CNAME, there is also a chance of the user “racing” the AWS DMS replication of their writes to the other Region, arriving there first and not seeing data they just wrote (because it had not yet been replicated). Any application that needs to guarantee that such race conditions don’t occur is encouraged to use 100/0 weighting (sending all users to the same Region unless that Region is down), whereas 50/50 weighting is appropriate for applications where data in each Region is totally independent and race conditions can be tolerated.
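As an illustration of the odd/even key idea (the sequence name and offsets below are assumptions, not the customer’s actual schema), each Region can draw its primary keys from a sequence that starts at a different offset and increments by two, so keys generated in the two Regions can never collide:

```python
import psycopg2  # PostgreSQL driver, used here only to issue the DDL

# Each Region gets its own ID space: us-east-1 issues odd keys, us-east-2 even keys.
SEQUENCE_START = {"us-east-1": 1, "us-east-2": 2}

def create_region_local_sequence(conn, region):
    with conn.cursor() as cur:
        cur.execute(
            f"CREATE SEQUENCE IF NOT EXISTS account_id_seq "
            f"START WITH {SEQUENCE_START[region]} INCREMENT BY 2"
        )
    conn.commit()
```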

Testing results

We performed an in-Region failover test, in which we called the failover_db_cluster API method to invoke an in-Region failover on the Aurora cluster, making the writer unavailable in the primary Availability Zone. In a real-life DB instance failure scenario, the event would be detected by Aurora, and failover would be invoked automatically. The in-Region failover process takes less than 30 seconds. We ran two versions of this test. For the first, we used RDS Proxy in front of the database; for the second, the workloads connected to the database directly, without RDS Proxy. During the first test, users saw errors for only a few seconds, because RDS Proxy buffered the incoming requests until the failover was complete and then replayed them against the new, healthy database instance. During the second test, users saw errors for less than 1 minute, because the workloads connected to the database directly, so the application threw errors throughout the failover process.
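For reference, a minimal sketch of that test trigger is shown below; the cluster and instance identifiers are placeholders for illustration.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Force an in-Region failover: promote the reader in the second Availability Zone
# to writer (identifiers are assumptions, not the customer's actual names).
rds.failover_db_cluster(
    DBClusterIdentifier="wealth-portal-primary",
    TargetDBInstanceIdentifier="wealth-portal-primary-reader-az2",
)
```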

We also performed cross-Region failover testing as an unplanned failover, in which the canary Lambda function in the backup Region detached the secondary database cluster from the Aurora global database and promoted it to be the primary cluster. The entire database failover and application DNS redirection took single-digit minutes to finish, making the application available to users again. We also saw that the RDS Proxy took several additional minutes (high single digits) to start up before traffic could be re-pointed to it, demonstrating the value of the two-step failover, which keeps this additional delay out of the critical path.

Enhancing the solution with Global Database failover

In December 2022, Amazon RDS Proxy also introduced support for secondary Regions. This allowed us to remove the CNAME redirection between the application and the database, as well as the Lambda function that orchestrates switching that CNAME from the database to the RDS Proxy, because the proxy is now available even before failover. In addition to simplifying the failover, this also improved the user experience, because the RDS Proxy in the failover Region can accept database connections and queries even before the database failover is complete, preventing the SQL errors that would otherwise have occurred during this time.

In August 2023, Amazon Aurora introduced the managed unplanned failover feature for Amazon Aurora PostgreSQL. This feature eliminated the need to manually extend the Aurora global database cluster back into the original primary Region after a failover. Instead, Aurora keeps track of the instances in the failed Region and automatically resynchronizes them with the failover Region’s instances after the original primary Region becomes available again. This reduces fail-back to the original primary Region to a single API call.
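As a brief sketch (identifiers are placeholders), once the original primary Region has resynchronized, that fail-back can be performed as a planned switchover back to the original Region:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# After the original primary Region has caught up again, switch the writer role
# back to it with a single call (no data loss; identifiers are illustrative).
rds.switchover_global_cluster(
    GlobalClusterIdentifier="wealth-portal-global",
    TargetDbClusterIdentifier="arn:aws:rds:us-east-1:123456789012:cluster:wealth-portal-primary",
)
```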

The following diagram illustrates the updated architecture for a warm standby during normal operation.

The following diagram illustrates the updated architecture for a warm standby during Regional failover.

Conclusion

In this post, we showed how a large financial customer partnered with AWS to engineer a solution that provides high availability and disaster recovery for their wealth management customer portal. In our tests, the high availability and disaster recovery architecture was able to achieve fully automated failover in less than 10 seconds between Availability Zones in a single Region, and in less than 2 minutes in cross-Region failover scenarios. With continued AWS service enhancements, we were able to further simplify the solution and improve the user experience by moving to managed versions of its components.

About the Authors

Max Winter is a Senior Software Architect for AWS Financial Services clients, with over 4 years at AWS and over 20 years of Wall Street technology experience. He works with global financial customers to design solutions that allow them to leverage the power of AWS services and cutting-edge generative AI to automate and optimize their business. In his free time, he loves hiking and biking with his family, music and theater, digital photography, 3D modeling, and imparting a love of science and reading to his two teenagers.

Adrian Tarjoianu is a seasoned technology professional with over 15 years of experience in the information technology industry. Currently, he works at AWS as a Senior Solutions Architect, supporting global financial services organizations. Outside of work, he enjoys hiking and traveling.

Carter Meyers is a Sr. Solutions Architect helping AWS customers across many industries migrate to and modernize on AWS. Carter has held numerous architecture, engineering, and IT leadership roles over the years, and is passionate about designing and building simple solutions to persistent problems.
