
Increase visibility of performance and events on Amazon RDS and Amazon Aurora

Customers often ask how they can improve visibility into workload performance, as well as into scheduled and unscheduled events, on Amazon Relational Database Service (Amazon RDS) and Amazon Aurora databases. This post shows how to proactively enable and configure instrumentation so that all details are captured and available for analysis.

In my 8 years of helping AWS customers achieve their goals with databases on Amazon RDS and Aurora, I’ve worked with sysadmins who were interested in deeper operating system metrics like the ones top and vmstat provide on a self-managed database, and DBAs who wanted to know why an Amazon RDS Multi-AZ instance failed over to the standby in another Availability Zone. In this post, I outline monitoring tools that provide that information and more, give some extra tips for each tool, and point out resources you can use to learn more about each tool, should you want a deeper explanation of its workings.

If you prefer watching a video, this AWS re:Invent 2022 session on performance monitoring with Amazon RDS and Aurora is a great option.

Amazon RDS and Aurora tools

We begin with tools that are specific to Amazon RDS and Aurora. Later in the post, we discuss other AWS services that provide insights into your usage of RDS and Aurora.

Performance Insights

Amazon RDS Performance Insights uses lightweight methods to capture database session and query performance metadata, and combines it with the instance’s CloudWatch metrics to provide a consolidated view in pre-configured and customizable dashboards. I particularly like that you can drill down and see which queries are running the longest and most frequently, down to the second. This metadata is stored outside of the database instance to minimize Performance Insights’ impact on your workload. I recommend enabling it and keeping it enabled all the time, in all your production instances, so you have a history of database performance and can compare how a certain query performs now versus before. You can find more information in Monitoring DB load with Performance Insights on Amazon RDS, or check out this quick example of using Performance Insights to troubleshoot query performance.

Cool fact: Enabling Performance Insights or modifying its retention doesn’t cause downtime or interruption to your database. Additionally, the default data retention of 7 days has no additional cost.

Pro move: If you’re troubleshooting a performance incident, I recommend temporarily extending Performance Insights’ data retention so the performance data (before and during the incident) is preserved until the cause and solution are found.
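To make that pro move concrete, here’s a minimal boto3 sketch that enables Performance Insights and extends its retention; the instance identifier is a placeholder, and the retention value must be one of the periods RDS accepts (7 days is the free default).

```python
import boto3

rds = boto3.client("rds")

# Enable Performance Insights (no downtime) and extend retention while
# troubleshooting. "my-db-instance" is a placeholder identifier; 31 is
# one of the paid retention periods RDS accepts.
rds.modify_db_instance(
    DBInstanceIdentifier="my-db-instance",
    EnablePerformanceInsights=True,
    PerformanceInsightsRetentionPeriod=31,
    ApplyImmediately=True,
)
```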

Bonus for Amazon Aurora MySQL-Compatible Edition, Amazon Aurora PostgreSQL-Compatible Edition, and Amazon RDS for PostgreSQL instances: When running these database engines, Performance Insights can be extended with Amazon DevOps Guru for RDS, a machine learning (ML)-powered capability that detects database-related performance issues (such as resource over-utilization or misbehavior of certain SQL queries) and immediately notifies you, provides diagnostic information, and offers recommendations to help you resolve the issue. This video introduces and shows a demo of Amazon DevOps Guru for RDS.

Enhanced Monitoring

I still remember how excited I was in December 2015 at the announcement of the Amazon RDS Enhanced Monitoring feature. Before that, customers frequently asked for deeper information on the performance of RDS instances, like what you get when running top, vmstat, and iostat at a Linux prompt. Enhanced Monitoring provides that, and more—some engines collect more than 50 different metrics. Two of Enhanced Monitoring’s features I particularly like are:

It writes the extracted metadata into Amazon CloudWatch Logs, where you can access the raw metric values interactively and programmatically, allowing for creative ways to analyze the metadata—for example, using third-party monitoring tools to pull the data and correlate it with application performance. This also means that the metrics are not written to the RDS or Aurora instance itself, so the overhead of enabling Enhanced Monitoring on the database is minimal. Additionally, Enhanced Monitoring itself has no separate fee, but CloudWatch Logs charges its regular rates for Enhanced Monitoring’s data transfer and storage. Last but not least, the default retention of Enhanced Monitoring metadata is 30 days—but you can modify the log group retention in CloudWatch Logs itself.
You can set Enhanced Monitoring’s data-gathering interval to 1, 5, 10, 15, 30, or 60 seconds (60 is the default; 1 is the most detailed). I particularly like the 1-second interval for two reasons: it matches the interval Performance Insights uses to gather database metadata, and it’s the best at revealing very short-lived overloads (normally of CPU, I/O, or networking) that lead to an increase in query latencies.

Cool fact: Enabling Enhanced Monitoring or modifying its gathering interval doesn’t cause downtime or interruption to your database.

Pro move: If you’re troubleshooting a performance incident, I recommend temporarily setting Enhanced Monitoring’s data-gathering interval to 1 second until the performance profile of the workload accessing the database is well understood. This is particularly important for databases serving internet-based applications, where each user’s actions can directly trigger queries or data modifications in the database. Many times, I’ve seen by-the-minute metrics showing CPU utilization far from 100%, only to set Enhanced Monitoring’s interval to 1 second and find that waves of users would come in and cause overloads lasting only a few seconds.
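Here’s a minimal boto3 sketch of that pro move; the instance identifier and monitoring role ARN are placeholders (the role must allow RDS to publish to CloudWatch Logs).

```python
import boto3

rds = boto3.client("rds")

# Set Enhanced Monitoring to 1-second granularity (no downtime). The
# MonitoringRoleArn is required when the interval is greater than 0.
rds.modify_db_instance(
    DBInstanceIdentifier="my-db-instance",
    MonitoringInterval=1,
    MonitoringRoleArn="arn:aws:iam::123456789012:role/rds-monitoring-role",
    ApplyImmediately=True,
)
```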

Want to know more about Enhanced Monitoring? The blog post Monitoring real-time Amazon RDS OS metrics with flexible granularity using Enhanced Monitoring has the details.

Engine main logs

My DBA colleagues will tell you that the database’s main log of engine activity is one of the primary sources of information for two kinds of events:

Whenever processes internal to the database are suspected of causing an issue—for example, checkpointing, redo logging, or data block flushing to disk. (You can search for keywords in the log, like “checkpoint”, “redo”, or “flush”.)
Whenever a crash, failover, shutdown, or startup happens. (The usual keywords are “start”, “shut”, “crash”, or “ready”.)

Amazon RDS and Aurora make these log files available to customers by default, so you can view or download log files from the instance. Each engine has its own naming convention—MySQL, MariaDB, and SQL Server call them error logs, Oracle calls them alert logs, and PostgreSQL calls them postgresql logs.
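If you want to scan these logs programmatically, the following boto3 sketch (the instance name is a placeholder) downloads the most recently written log file and searches it for the keywords above.

```python
import boto3

rds = boto3.client("rds")

# List the instance's log files, then fetch the most recent one.
logs = rds.describe_db_log_files(DBInstanceIdentifier="my-db-instance")
latest = max(logs["DescribeDBLogFiles"], key=lambda f: f["LastWritten"])

# Marker="0" starts at the beginning of the file; large files may need
# several calls (check AdditionalDataPending in the response).
portion = rds.download_db_log_file_portion(
    DBInstanceIdentifier="my-db-instance",
    LogFileName=latest["LogFileName"],
    Marker="0",
)

# Print only the lines matching the event keywords mentioned above.
keywords = ("checkpoint", "redo", "flush", "start", "shut", "crash", "ready")
for line in portion["LogFileData"].splitlines():
    if any(kw in line.lower() for kw in keywords):
        print(line)
```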

Tip: When it comes to engine logs, the saying “no news is good news” certainly applies. If there are no logs for a certain time interval, it means the engine didn’t deem it necessary to warn of any issues—no shutdowns, startups, crashes, or failovers happened during the interval.

Cool fact: Amazon RDS and Aurora automatically manage log rotation and log retention for you.

Pro move: Each engine has parameters you can set to increase the amount of information logged. I recommend consulting the engine’s documentation to find out more about those.

Additional database activity auditing

In some database instances, depending on the criticality of your workload, you might want to keep a record of everything that happens on the database. Who logged in, from which IP address, at what time? What was the exact timestamp when that table was dropped last Saturday, so you can initiate a precise point-in-time recovery? Who updated the table with the application’s settings, and when? Database auditing answers these questions—but due to the overhead, and because not all customers need it, auditing is something you must enable for your instances. Additionally, auditing facilities are provided by the database engines, so each one has different enablement procedures. For more information on how to enable auditing for each database engine on Amazon RDS and Aurora, refer to the following resources:

Configuring an audit log to capture database activities on RDS for MySQL, RDS for MariaDB, and Amazon Aurora with MySQL compatibility
How do I use the pgaudit extension to audit my Amazon RDS DB instance that is running PostgreSQL?
Audit Aurora PostgreSQL databases using Database Activity Streams and pgAudit
Security auditing in Amazon RDS for Oracle
Auditing Amazon RDS for SQL Server DB instances
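Because enablement is engine-specific, here is just one hedged illustration: a boto3 sketch for RDS for PostgreSQL with pgaudit. It assumes a custom parameter group named my-postgres-params is already attached to the instance, and the parameter values are examples to adapt; see the pgaudit links above for the full procedure.

```python
import boto3

rds = boto3.client("rds")

# Load pgaudit via shared_preload_libraries (a static parameter, so it
# takes effect at the next reboot) and choose which statement classes
# to audit (here, DDL and role changes).
rds.modify_db_parameter_group(
    DBParameterGroupName="my-postgres-params",
    Parameters=[
        {
            "ParameterName": "shared_preload_libraries",
            "ParameterValue": "pgaudit",
            "ApplyMethod": "pending-reboot",
        },
        {
            "ParameterName": "pgaudit.log",
            "ParameterValue": "ddl,role",
            "ApplyMethod": "pending-reboot",
        },
    ],
)
```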

Pro move: A very active instance will generate a large volume of logs—if your workload has this characteristic, check the next topic.

Log querying and retention with CloudWatch Logs

Besides each engine’s main log files and audit log files, some engines have other forms of logging that might be useful—for example, MySQL and MySQL-compatible engines can log queries that take longer than a defined number of seconds through the slow query log. Across all these forms of logging, the number and size of files can grow. Database instances, however, are not the optimal place to keep or access those log files. Because of this, Amazon RDS and Aurora can publish database logs to CloudWatch Logs. Here are some good reasons to export database logs to CloudWatch Logs:

To preserve disk space, Amazon RDS and Aurora regularly rotate and delete logs from the instance. On the other hand, logs in CloudWatch Logs can have their retention defined by you, and you can even set those to never expire.
Downloading large files or a large quantity of files directly from the database instance can substantially use the instance’s I/O resources. However, reading them from CloudWatch Logs has little to no impact on the performance of the database instance.
CloudWatch Logs has powerful search and filtering capabilities, which make it fast and straightforward to pinpoint and access the logs you need.
CloudWatch Logs allows for setting up notifications when specific keywords are found in the logs.
Aurora instances store logs in the temporary disks attached to the instance—and in case of instance failure and replacement, the logs can be lost. But by publishing the logs, all but the last few will be preserved.
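Here’s a boto3 sketch of both steps, using RDS for PostgreSQL as the example engine; the instance name, log types, and retention are placeholders to adapt (MySQL-family engines use log types like error, slowquery, and audit).

```python
import boto3

rds = boto3.client("rds")
logs = boto3.client("logs")

# Publish the PostgreSQL log to CloudWatch Logs.
rds.modify_db_instance(
    DBInstanceIdentifier="my-db-instance",
    CloudwatchLogsExportConfiguration={"EnableLogTypes": ["postgresql"]},
    ApplyImmediately=True,
)

# Then control retention on the resulting log group, which follows the
# /aws/rds/instance/<id>/<log-type> naming convention.
logs.put_retention_policy(
    logGroupName="/aws/rds/instance/my-db-instance/postgresql",
    retentionInDays=365,
)
```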

A necessary reminder: Most logs need to be enabled by you, and therefore, just setting the instance to export logs to CloudWatch Logs is not enough if they are not being generated in the first place.

Pro move: You can build proactive and automated RDS log analysis and alerting capabilities using CloudWatch Logs together with AWS Lambda, Amazon SNS, and other AWS solutions.
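As one possible approach, simpler than a full Lambda pipeline, this hedged sketch uses a CloudWatch Logs metric filter plus an alarm to notify an SNS topic whenever lines containing “ERROR” appear; all names and ARNs are placeholders.

```python
import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# Turn occurrences of "ERROR" in the exported log into a custom metric.
logs.put_metric_filter(
    logGroupName="/aws/rds/instance/my-db-instance/postgresql",
    filterName="rds-error-lines",
    filterPattern='"ERROR"',
    metricTransformations=[
        {
            "metricName": "RdsErrorLines",
            "metricNamespace": "Custom/RDS",
            "metricValue": "1",
        }
    ],
)

# Notify an SNS topic when any error lines show up in a 5-minute window.
cloudwatch.put_metric_alarm(
    AlarmName="rds-error-lines-alarm",
    Namespace="Custom/RDS",
    MetricName="RdsErrorLines",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:my-alerts"],
)
```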

RDS events

This is one of the least-known and most under-utilized features of Amazon RDS, and at the same time, one of the most useful! Just as it’s important to monitor the error log file to know what is happening inside the database, it’s equally important to monitor RDS events to know what is happening in the RDS infrastructure that supports the database.

Cool fact 1: The event “Recovery of the DB instance has started. Recovery time will vary with the amount of data to be recovered.” means that the instance’s hardware is being recovered. Recovery in Amazon RDS is done by replacing the hardware, moving the instance to a new Amazon Elastic Compute Cloud (Amazon EC2) host. Recovery time varies for two reasons: first, provisioning a new host takes some time; second, after the new host is in place and the database process is brought up, the process begins crash recovery, which can be quick or lengthy depending on how much activity was in flight before the recovery. I recommend checking the database engine’s manual on crash recovery, because each engine is different.

Cool fact 2: The event “The RDS Multi-AZ primary instance is busy and unresponsive” is one of the few possible reasons for RDS Multi-AZ failover that you can get notified of right away. If you see this event, you need to double down on the other monitoring tools mentioned in this post because those will help you find out why the instance became so busy that it went unresponsive.

Amazon RDS generates events for instances and clusters, among other resource types. You can access the last 24 hours of events on the Amazon RDS console, and the last 14 days programmatically, such as with the AWS Command Line Interface (AWS CLI) or an SDK. But what I really recommend is enabling an event notification subscription, which allows you to define where the events get sent and how long they are retained.
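For example, this boto3 sketch pulls the full 14 days of events for a hypothetical instance:

```python
import boto3

rds = boto3.client("rds")

# Duration is in minutes; 20160 minutes = 14 days, the maximum window.
events = rds.describe_events(
    SourceIdentifier="my-db-instance",
    SourceType="db-instance",
    Duration=20160,
)
for e in events["Events"]:
    print(e["Date"], e["Message"])
```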

Pro move: Set event notification subscriptions for all events of all clusters and all instances (no need to name individual ones, so it acts on the ones you create later) and set those to deliver to a dedicated email or other longer-term retention storage that you can search and read, even after the 14-day retention period in Amazon RDS.
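A minimal sketch of such a subscription follows; omitting SourceIds makes it cover all current and future instances, and the SNS topic ARN is a placeholder. Create a similar subscription with SourceType="db-cluster" for Aurora cluster events.

```python
import boto3

rds = boto3.client("rds")

# Subscribe to all db-instance events account-wide: no SourceIds means
# the subscription also applies to instances you create later.
rds.create_event_subscription(
    SubscriptionName="all-db-instance-events",
    SnsTopicArn="arn:aws:sns:us-east-1:123456789012:rds-events",
    SourceType="db-instance",
    Enabled=True,
)
```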

Amazon RDS CloudWatch metrics

Every RDS and Aurora instance collects and publishes dozens of CloudWatch metrics out of the box, at 1-minute periods and at no additional cost. The Amazon RDS console provides a nifty interface to quickly glance over the current status of an instance, but to conduct analysis and correlation of metrics, I prefer the more powerful, purpose-built metric-management interface of CloudWatch.

Cool fact: RDS metrics are stored for 15 months, but the smaller periods are periodically aggregated by CloudWatch. For example, after 15 days, we can’t see the 1-minute data anymore, but we can still see the same metrics at their 5-minute period. And, because the five data points for each period were aggregated, we can see their minimum, average, and maximum.

Tip: Amazon RDS stores the metrics in CloudWatch under the instance’s name. This has two effects: First, if an instance is renamed, it might seem its metrics were erased, but in fact the metrics are tied to the old name and can still be reached through the CloudWatch console. Second, if an instance is deleted and another one is created (or restored) with the same name, the new one appears to have metrics that predate its creation, because it is also showing the previous instance’s metrics.
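The following sketch illustrates both facts: it queries a month of CPU utilization at 5-minute periods with the minimum/average/maximum statistics, keyed by the instance name (a placeholder here).

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# The DBInstanceIdentifier dimension is the instance *name*, which is
# why metrics follow the name rather than the instance itself.
now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-db-instance"}],
    StartTime=now - timedelta(days=30),
    EndTime=now,
    Period=300,  # 5-minute periods survive aggregation longer than 1-minute data
    Statistics=["Minimum", "Average", "Maximum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Minimum"], point["Average"], point["Maximum"])
```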

Bonus for Aurora: Besides publishing instance metrics, Aurora also publishes the same metric values to the cluster name, and depending on whether the instance is a writer or a reader, to the WRITER role or the READER role. Note the following:

Cluster metrics are very useful for setting cluster-wide alarms that will work regardless of deleting or adding more instances, like setting an alarm if any instance’s CPU utilization is above 90%. They’re also useful to know, at a point in time, how many healthy instances a cluster had, and we can use the count statistic to know how many instances reported a metric each minute.
Metrics on the cluster’s WRITER role are a great way to consistently view past writer performance of a cluster, even when failovers have made two or more instances the cluster’s writer at different points in time. Only one instance reports to this role at a time.
Metrics on the cluster’s READER role are the ones used by Application Auto Scaling to add or remove readers to the cluster as needed. However, these metrics are only populated when at least one instance on the cluster is a reader. That’s why an Auto Scaling-enabled cluster must always have a writer and a reader.

Additionally, you can create CloudWatch alarms on these metrics, although you’ll probably want to do so only for the important ones, which vary by engine. For example, on PostgreSQL, you’d set an alarm on the MaximumUsedTransactionIDs metric to help prevent transaction ID wraparound.

Pro move: The lack of data for a certain metric (or all metrics for an instance) is just as telling as a high value would be. Because the RDS agent within the database instance is what pushes the metric values to CloudWatch, the lack of all metrics for an instance signals a possible overload or impairment of the instance. Similarly, if some metrics populate but not others, it can signify something else. For example, if I could see values for CPU utilization (and other OS-based metrics) but not for replica lag (or other database-based metrics), I would be suspicious and seek other means of confirming whether the database process itself is impaired on an otherwise healthy host.
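Putting the last few points together, here’s a hedged boto3 sketch of a cluster-wide CPU alarm: the Maximum statistic makes it fire if any instance exceeds 90%, and treating missing data as breaching makes it also fire if metrics stop arriving, in line with the pro move above. The cluster name and SNS ARN are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Cluster-wide alarm on the Aurora cluster dimension: it keeps working
# regardless of instances being added or removed.
cloudwatch.put_metric_alarm(
    AlarmName="aurora-cluster-high-cpu",
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBClusterIdentifier", "Value": "my-aurora-cluster"}],
    Statistic="Maximum",          # fires if *any* instance is above threshold
    Period=60,
    EvaluationPeriods=3,
    Threshold=90.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",  # missing metrics are treated as a problem
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:my-alerts"],
)
```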

AWS services

In this section, we discuss how other AWS services provide insights that complement those of Amazon RDS and Aurora.

AWS CloudTrail

Similar to how database auditing lets you log and know about the commands submitted to your database through database connections, AWS CloudTrail lets you log and know about the API calls submitted to your AWS account through various means (including the AWS Management Console, AWS CLI, AWS CDK, third-party tooling, and direct-to-endpoint API calls) for AWS services, including Amazon RDS and Aurora.

Cool fact: By default, CloudTrail keeps an event history of the past 90 days of API calls. There is no need to enable or configure anything to have this event history.

Pro move: You can enable an optional trail, which will deliver and store the events in an Amazon Simple Storage Service (Amazon S3) bucket, where you can set the retention as per your needs.
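For example, this sketch searches the built-in 90-day event history for Amazon RDS API calls, such as who modified or rebooted an instance:

```python
import boto3

cloudtrail = boto3.client("cloudtrail")

# Look up recent API calls made to the RDS service endpoint.
events = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventSource", "AttributeValue": "rds.amazonaws.com"}
    ],
    MaxResults=50,
)
for e in events["Events"]:
    print(e["EventTime"], e["EventName"], e.get("Username", "-"))
```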

AWS Health

AWS Health is how AWS lets customers know about the performance and availability of AWS services and how these affect your accounts. There are two main parts: the Service health page, which contains public information and doesn’t require authentication to access, and the Your account health page, which provides information that is specific to your account’s resources.

Pro move 1: In addition to information about when things are not happening as they should, in the Your account health page, for RDS and Aurora instances and clusters, you’ll also find notifications of upcoming maintenance and necessary upgrades. Don’t forget to look at the Other notifications tab of Your account health page, which is where you’ll find these.
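If you prefer to consume this data programmatically, the AWS Health API exposes the same events. As a caveat, it requires a Business, Enterprise On-Ramp, or Enterprise Support plan, and this is only a sketch with an example filter.

```python
import boto3

# The Health API uses a global endpoint served from us-east-1.
health = boto3.client("health", region_name="us-east-1")

# List upcoming scheduled changes (such as maintenance) affecting RDS.
events = health.describe_events(
    filter={"services": ["RDS"], "eventTypeCategories": ["scheduledChange"]}
)
for e in events["events"]:
    print(e["eventTypeCode"], e.get("startTime"))
```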

Pro move 2: The email address associated with the AWS account also receives emails about the health events. I recommend that these be stored for a longer term in a folder where those can be searched.

Amazon VPC

Amazon Virtual Private Cloud (Amazon VPC) has a couple of features that help troubleshoot connectivity issues.

VPC Flow Logs

VPC Flow Logs provides information about the IP traffic (and TCP connections) going to and from network interfaces in your VPC.

Pro move: Enable flow logs on demand to learn where connections are coming from, and at what volume or rate. This is very helpful when connection storms or spikes are suspected.
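Here’s a hedged sketch of enabling flow logs on the ENIs behind an RDS instance. Discovering them by the “RDSNetworkInterface” description is an assumption you should verify (for example, by matching the endpoint’s private IP), and the log group and IAM role are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# Assumption: RDS-managed ENIs carry the description "RDSNetworkInterface".
# Narrow the result down to your instance's ENI before enabling flow logs.
enis = ec2.describe_network_interfaces(
    Filters=[{"Name": "description", "Values": ["RDSNetworkInterface"]}]
)
eni_ids = [e["NetworkInterfaceId"] for e in enis["NetworkInterfaces"]]

# Enable flow logs on those ENIs, delivering to CloudWatch Logs.
ec2.create_flow_logs(
    ResourceIds=eni_ids,
    ResourceType="NetworkInterface",
    TrafficType="ALL",
    LogDestinationType="cloud-watch-logs",
    LogGroupName="rds-flow-logs",
    DeliverLogsPermissionArn="arn:aws:iam::123456789012:role/flow-logs-role",
)
```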

VPC Traffic Mirroring

Another frequent question I hear from customers is, “Can I do a packet capture on an RDS or Aurora instance?” The answer is, “Yes, and it’s simpler than you think,” because you can do it yourself, without even touching the RDS or Aurora instance. VPC Traffic Mirroring can mirror traffic from RDS or Aurora elastic network interfaces (ENIs), just like any other ENIs in the VPC, to an EC2 instance where a packet-capturing utility (like tcpdump or Wireshark) is running. This is the preferred way to troubleshoot suspected network and connectivity issues.

Cool fact: This is significantly more straightforward than setting up captures on each client, because it captures traffic from multiple clients at once. It’s also faster, because you can set it up yourself, whether or not you have already engaged AWS Support.

Pro move: Remember to set up the target EC2 instance to be at least the same size as the RDS or Aurora instance to make sure it has capacity to receive all of the packets being mirrored.
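Here’s a hedged boto3 sketch of the mirroring setup; all ENI IDs are placeholders, and the filter simply accepts all traffic in both directions to keep the example short.

```python
import boto3

ec2 = boto3.client("ec2")

# Target: the ENI of the EC2 capture instance running tcpdump/Wireshark.
target = ec2.create_traffic_mirror_target(
    NetworkInterfaceId="eni-0capture0000000000",
    Description="Packet-capture EC2 instance",
)

# Filter: accept all ingress and egress traffic.
mfilter = ec2.create_traffic_mirror_filter(Description="Mirror everything")
filter_id = mfilter["TrafficMirrorFilter"]["TrafficMirrorFilterId"]
for direction in ("ingress", "egress"):
    ec2.create_traffic_mirror_filter_rule(
        TrafficMirrorFilterId=filter_id,
        TrafficDirection=direction,
        RuleNumber=100,
        RuleAction="accept",
        SourceCidrBlock="0.0.0.0/0",
        DestinationCidrBlock="0.0.0.0/0",
    )

# Session: mirror the database ENI's traffic to the capture target.
ec2.create_traffic_mirror_session(
    NetworkInterfaceId="eni-0database0000000000",  # the RDS/Aurora ENI
    TrafficMirrorTargetId=target["TrafficMirrorTarget"]["TrafficMirrorTargetId"],
    TrafficMirrorFilterId=filter_id,
    SessionNumber=1,
)
```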

On your application or database client machine

In addition to all the aforementioned information that you can retrieve from AWS, from Amazon RDS, and from the database instance itself, there are important bits of information that are available only on the client machines—the ones connecting to the database. Errors and warnings that are session-specific, including connectivity issues, are reported only through the connection itself, so the client software opening the connection needs to log that information—otherwise, it’s lost.

Similarly, customers often suspect performance issues on the database, only to find out that the client machine is the actual source of the performance issue.

Pro move 1: Make sure that applications log and retain connection-specific errors on the client machines.

Pro move 2: Check if the application has a switch or option to enable a debug mode, where database calls and other operations are logged with begin and end timestamps.
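Because this depends entirely on your application stack, here is only a generic, hypothetical illustration of pro move 2: a wrapper around a DB-API-style cursor that logs begin and end timestamps for every database call, so client-side latency can be separated from server-side latency.

```python
import logging
import time

logging.basicConfig(level=logging.DEBUG, format="%(asctime)s %(message)s")
log = logging.getLogger("db-client")

# Illustrative only, not any specific driver's API: time and log every
# database call made through a Python DB-API cursor.
def timed_query(cursor, sql, params=None):
    log.debug("query begin: %s", sql)
    start = time.monotonic()
    try:
        cursor.execute(sql, params or ())
        return cursor.fetchall()
    finally:
        log.debug("query end: %s (%.3f s)", sql, time.monotonic() - start)
```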

Summary

As an Amazon RDS or Aurora customer, there are several different tools you can use to monitor your database workloads.

In this post, I showed you which tools to use in which situation, discussed what each one provides, and provided you with links to explore more detailed information about each tool.

If you have feedback or questions, leave a comment in the comments section.

About the Author

Valter Rehn is a Principal Engineer in AWS Support. He focuses on Amazon RDS and Amazon Aurora, and has been guiding customers on how to best use Aurora since version 1.0 was released in July of 2015.
