An Online Transaction Processing (OLTP) database is a key building block of a highly available application. Business-critical workloads can leverage Amazon Aurora database clusters in Multi-AZ configuration to help improve their overall uptime, and reduce the impact of availability-related events such as Multi-AZ failovers.
In this post, we discuss aspects of Amazon Aurora related to client connectivity and DNS endpoints in the context of high availability. We use failover scenarios to demonstrate connectivity issues that might impact client applications, and we provide recommendations for detecting and avoiding these issues. The examples contained in this post use Amazon Aurora MySQL-Compatible Edition, but the same general concepts apply to Amazon Aurora PostgreSQL-Compatible.
Client connectivity in Amazon Aurora
Customers connect to Amazon Aurora resources (database instances) using DNS endpoints. An Aurora DB cluster provides multiple types of endpoints, including a writer endpoint, a reader endpoint, instance endpoints, and optional customer-defined custom endpoints. These endpoints direct client connections to the appropriate instances within the cluster, such as the read/write-capable primary instance, and the read-only replicas. Consult the Aurora connection management documentation for more information.
Aurora automatically updates the DNS endpoints whenever needed, such as during a failover, or when DB instances are added, removed, or renamed. When an endpoint changes, it’s important for client applications to recognize the change as quickly as possible to avoid connectivity issues. Aurora endpoints use a 5-second Time to Live (TTL) setting, but the client environment configuration can introduce additional delays in the propagation of DNS changes. Failovers are a classic example of when such a delay is highly undesirable, because it prolongs application recovery and increases downtime.
The Amazon Aurora failover mechanism
The following diagram illustrates Aurora’s storage and compute architecture in a typical setup with a single primary instance and one or more replicas. In such a setup, the applications typically access Aurora database instances through two endpoints: a writer endpoint that points to the current primary instance (writer), and a read-only endpoint that includes all available replicas.
When a problem affects the primary instance, one of the reader instances takes over through a process called a failover. You can also initiate a manual failover through the RDS Management Console, or by using the FailoverDBCluster API.
When a failover happens, Aurora goes through the following steps:
Select one of the reader instances to take over as the new primary. The selection process is based on several factors such as the instance type, Availability Zone, and the customer-configured failover priorities. Consult the High availability for Amazon Aurora documentation section for details.
Restart the selected instance in read/write mode.
Restart the old primary node in read-only mode.
Update the writer endpoint, so that it points to the newly promoted instance.
Update the reader endpoint, so that it includes the restarted old primary instance(new reader instance).
Start-to-finish, a failover typically completes within 30 seconds. You can learn more about the sequence and timing of failover steps by visiting the Events tab in the RDS Management Console, or by using the DescribeEvents API. As an example, the following screenshot shows RDS events for an Aurora cluster aurora-test after a failover from aurora-test-instance-1 to aurora-test-instance-2.
In this example, the failover operation was initiated manually, which gives us the opportunity to measure client impact in a controlled manner. The test client’s operating system and networking setup don’t use DNS caching, and we put the following settings in the ~/.my.cnf file in order to make the test output more readable:
We run the shell command below to ping the writer endpoint every second, and report the instance role (writer or reader).
With the command running, we initiate a manual failover and observe the output:
According to the output, there was an approximately 10-second interruption in availability. During that time, the client kept trying to connect to the old writer, including a couple of seconds after that writer has already been demoted to a reader. This demonstrates our first challenge: the client has to wait for a DNS update before it recognizes a new writer.
Let’s now run the same test, but this time prevent the client from picking up the DNS change by hard-coding the primary’s IP in the client system’s /etc/hosts file. By doing so, we simulate a situation when the client caches the DNS entry indefinitely:
Now, run the same command again and observe the output:
Read MoreAWS Database Blog