This guest post is co-authored by Ryota Shima, Application Architect, and Kazuki Matsumura, Lead Architect at NRI Digital.
NRI Digital has a wide variety of systems in production, both on-premises and cloud-based. Among them, many systems are built on AWS, and Amazon Aurora and Amazon Relational Database Service (Amazon RDS) are often used as the database tier.
For systems running in production, the top root cause of downtime is database-related performance issues. Even if the system has been thoroughly tested in a test environment, production workloads are difficult to predict in advance. Therefore, the key to operating a high quality production system is to be able to quickly detect issues, identify the cause, and respond to the problem when it occurs.
However, database performance issues are difficult to analyze and require a considerable skill set to solve. When database-related performance issues occurred in our production environments, developers would spend hours gathering and analyzing information in order to find the root cause. If developers didn’t have enough information to find the root cause, then the problem would reoccur. Furthermore, the recent advancements in cloud computing and cloud-native technologies have made our system environment more decentralized, further increasing our challenge of effectively identifying the issue.
NRI Digital believes that there are limits to approaches that rely on the skills of individuals and teams, and believes that machine learning (ML)-powered services and services that analyze production workloads in real time, identify bottlenecks, and recommend areas for improvement will become necessary in the future.
Amazon DevOps Guru for RDS was announced at AWS re:Invent 2021. The service is a new ML-powered capability of Amazon DevOps Guru that is designed to allow developers and DevOps engineers to quickly detect, diagnose, and remediate a wide variety of database-related issues in Amazon RDS.
In this post, we describe the DevOps Guru for RDS proof of concept (POC) process for NRI Digital.
How Amazon DevOps Guru for RDS can solve our problem
NRI Digital considers the following items as necessary for performance monitoring.
Detection – Detect anomalies quickly
Root cause analysis – Immediately identify what’s happening on the database that is causing performance degradation, and determine the root cause
Response (workaround or permanent) – Identify the specific remediation process for problematic areas and apply them to fix the problem and prevent recurrence. The problem can be corrected and recurrence can be prevented by applying the improvement method
The POC comprised of running multiple SQLs for several minutes, mixing normal SQLs and inefficient SQLs with high load, in order to easily experience the full functionality and usability of DevOps Guru for RDS. For this POC, we used Amazon Aurora MySQL-Compatible Edition.
The process started and an alert email was sent to engineers shortly after, as shown in the following screenshot.
This notification indicated an anomaly in the DB load metric. For detailed analysis, we went to the DevOps Guru for RDS dashboard.
On the DevOps Guru Reactive insights page, we noticed that an “RDS DB load anomaly” was an on-going event. By choosing RDS DB Load Anomalous, we reviewed the aggregated metrics.
On the anomaly page, we saw the aggregated metrics that DevOps Guru detected as anomalous. By choosing View analysis under DB Load, we could see the analysis for this metric.
The first part of the analysis page helped visualize the anomalous metrics, and the bottom section provided the analysis results and recommendations.
The analysis section on the left provided the following information:
ActiveSession exceeds 9
The cause of the stagnation of ActiveSession is the DB load related to I/O wait
The DB load related to the I/O wait accounts for 82% of the total DB load
ActiveSession needs to be lowered to 2 in order for this database to be in a normal state
It was easy to determine what was happening on the database that was slowing the database performance.
The next step was to find out what was causing this event. The recommendation on the right side of the Analysis and recommendations section listed SQL digest IDs to investigate.
By choosing View Top SQL in Performance Insights, we could check the likely causal SQL information using Amazon RDS Performance Insights.
We were able to easily identify what was causing the performance delays. In this case, we identified the SQL that was causing the latency and statistics (average latency).
Then we checked the recommendations for specific remedies for the SQL that seemed to be causing the problem.
A link was provided to the Aurora MySQL troubleshooting guide for the wait event io/table/sql/handler. It described the possible causes of increased wait events and the actions to take to investigate.
As of this writing, DevOps Guru for RDS doesn’t provide recommendations for how to optimize SQL queries, but we look forward to when this additional functionality becomes available. In the meantime, we proceeded to tune the query in question ourselves.
Based on the results of this verification and the specification survey, a comparison of performance monitoring with and without DevOps Guru for RDS is shown in the following table.
Root Cause Analysis
Without DevOps Guru for RDS
It was difficult to design monitoring that combines appropriate parameters and thresholds, and to tune it according to fluctuating workloads.
Even if we refer to the status in Performance Insights or other sources, it was difficult to determine what the problem was.
In addition, the previous state must be manually retained in order to compare it with the previous performance data.
Because it depends on the skill set of the users, it took a lot of time and cost to improve events in projects where DB experts weren’t available.
With DevOps Guru for RDS
The system automatically finds database performance issues by identifying anomalous behavior of the DB load metric that is different from the historical baseline.
Based on the analysis results, the system pinpoints and recommends possible problem areas, making it easy to identify the root cause of the problem.
It also automatically analyzes the difference from the previous state.
The time to identify the root cause was reduced significantly with DevOps Guru for RDS because the recommended action items point to you to where you should go, but it’s difficult to resolve the problem completely without a certain level of expertise.
Through our POC, we were able to determine that DevOps Guru for RDS has the potential to fundamentally solve the conventional performance monitoring issues we mentioned in this post.
Long-term performance problems can lead to loss of end-user trust and an increased potential for lost opportunities. DevOps Guru for RDS has the potential to allow our teams with any skill set to approach problem resolution quickly, rather than be dependent on individual skills and expertise. Refer to Amazon DevOps Guru for RDS to learn more.
NRI Digital, established in August 2016, is a digital business specialist of the Nomura Research Institute (NRI) Group. Experts in consulting and solutions in the digital field work with client companies to support them from conceptualization of digitalization strategies, selection and construction of advanced IT solutions, support for business execution, and verification and improvement of the entire project.
About the Authors
Read MoreAWS Database Blog