This is a guest post co-written by Sandesh Achar, Director of Cloud Engineering, and Nathaniel Louis Tisuela, Software Engineer, at Workday, Inc.
Workday, a leader in cloud-based financial management and human capital management software solutions, uses Amazon Aurora PostgreSQL-Compatible Edition as the backend for some of its products. To ease support operations of a multi-cloud environment, Workday built a centralized observability infrastructure, providing a single pane of glass view across multiple product deployments. This infrastructure standardizes the way metrics and logs are collected across the different cloud services in use. Although Amazon Aurora offers native monitoring features, bringing this information into the central environment is required to provide a unified view of the operational health of product deployments to SRE teams.
In this post, we show you the solution Workday uses to integrate Aurora logging and monitoring with a centralized observability environment.
As Workday continues to onboard new customers and grow its product footprint, the operational capabilities behind this infrastructure also need to scale and adapt. To achieve this goal, the following criteria were considered in designing the solution:
Scalability – As Workday’s use of Aurora expands with the addition of new customers, it’s important that the solution is scalable.
Security – Security is of utmost concern because Workday products deal with sensitive human resources and financial data.
Automation – The solution needs to provide a seamless experience supported by Aurora native monitoring features, without manual intervention or delays.
Reliability – The solution has to be highly available and reliable because it supports the operations for mission-critical customer deployments.
Maintainability – The solution needs to be simple to manage with minimal operational burden on SRE teams.
The solution to integrate Aurora monitoring with the central observability environment is implemented using a containerized approach in Kubernetes. The containerized service is deployed on Amazon Elastic Kubernetes Service (Amazon EKS). The service collects metrics and logs from Amazon CloudWatch and metrics from Aurora databases. The solution is compatible with both Aurora provisioned and Aurora Serverless v2 deployments. The collected metrics and logs are then sent to the central observability environment for ingestion. In the observability environment, the logs are stored in Elasticsearch and the metrics are stored in InfluxDB. The following diagram shows the high-level architecture of the solution.
The solution has two main components: the logging and monitoring pod, and the monitoring and logging operator. The following diagram depicts the solution design and its components.
Let’s look at these components in more detail.
Logging and monitoring pod
A dedicated Kubernetes pod collects CloudWatch logs, CloudWatch metrics, and database metrics for each of the Aurora clusters. Each pod contains the software libraries (Filebeat and Telegraf) required to send logs and metrics from the Aurora cluster to the observability environment. Filebeat captures and sends logs from CloudWatch, while Telegraf collects both CloudWatch and database metrics. Aurora metrics such as database load metrics are retrieved from CloudWatch. Database metrics such as idle database connections are retrieved using PostgreSQL queries to the Aurora database. Each pod carries a custom Filebeat and Telegraf configuration matching the monitoring requirements of its Aurora cluster.
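The idle-connections metric mentioned above can be sketched as a SQL query against PostgreSQL's pg_stat_activity view. The exact query and metric schema Workday uses are not published; the query, measurement, and field names below are illustrative.

```python
# Sketch of a database-level metric collected via SQL, as described above.
# Counting idle connections from pg_stat_activity is a common way to get
# this metric on PostgreSQL; the names here are illustrative only.

IDLE_CONNECTIONS_SQL = """
SELECT count(*) AS idle_connections
FROM pg_stat_activity
WHERE state = 'idle';
"""

def to_metric(rows, cluster_name):
    """Convert the single-row query result into a measurement for InfluxDB."""
    (idle,) = rows[0]
    return {
        "measurement": "aurora_db",          # hypothetical measurement name
        "tags": {"cluster": cluster_name},
        "fields": {"idle_connections": idle},
    }
```

In practice Telegraf runs such queries directly against the Aurora endpoint and ships the resulting measurements to InfluxDB.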
Monitoring and logging operator
The monitoring and logging operator manages the Kubernetes resources in the solution. This includes secrets, config maps, and deployments for the logging and monitoring of Aurora clusters. This is implemented using the Kubernetes API. The following diagram outlines the operator process flow.
The operator uses the AWS SDK for Python (Boto3) to access data from AWS services. This enables the operator to build an accurate picture of each Aurora cluster. This information is pertinent to configuring the appropriate logging and monitoring required for each cluster.
While collecting the inventory of Aurora clusters using the describe-db-clusters API, the database cluster attributes are also validated to verify accuracy. The validation is achieved using a custom specification built with Pydantic, a Python data validation package. This validation helps identify and report misconfigurations in Aurora clusters, such as disabled PostgreSQL log exports, which prevent log and metric data collection from the cluster.
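The shape of this validation can be sketched as follows. Workday implements it with a custom Pydantic specification; to keep the sketch dependency-free, the same check is expressed here in plain Python, and the exception name is illustrative.

```python
# Dependency-free sketch of the cluster-attribute validation described above.
# Workday's actual implementation uses a custom Pydantic specification;
# the exception and function names here are hypothetical.

class MisconfiguredCluster(ValueError):
    """Raised when a cluster attribute blocks log or metric collection."""

def validate_cluster(cluster: dict) -> dict:
    """Check describe-db-clusters attributes needed for data collection."""
    exports = cluster.get("EnabledCloudwatchLogsExports", [])
    if "postgresql" not in exports:
        # Without the postgresql log export, there is no CloudWatch log
        # group to read from, so report the misconfiguration.
        raise MisconfiguredCluster(
            f"{cluster.get('DBClusterIdentifier')}: postgresql log export disabled"
        )
    return cluster
```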
Monitoring requirements can differ depending on the environment and lifecycle state of the Aurora cluster. Aurora cluster metadata, such as tags, state, and name, is used to specify the monitoring requirements for a cluster. The monitoring requirements for each Aurora cluster are inferred from this metadata, and an inventory of clusters that require monitoring is built using Aurora API calls. To simplify this, multiple API calls are wrapped together so that cluster details are retrieved along with the corresponding CloudWatch log group.
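The inventory step above can be sketched with Boto3. Aurora PostgreSQL exports logs to CloudWatch log groups named `/aws/rds/cluster/<cluster-name>/postgresql`, which lets the selection logic pair each cluster with its log group; the `monitoring` tag key used for selection here is hypothetical, as Workday's actual tagging scheme is not published.

```python
# Sketch: select Aurora clusters that need monitoring and pair each with
# its CloudWatch log group. The 'monitoring' tag key is hypothetical.

LOG_GROUP_FMT = "/aws/rds/cluster/{name}/postgresql"  # Aurora PostgreSQL export convention

def clusters_to_monitor(pages):
    """Filter describe-db-clusters response pages down to monitorable clusters."""
    selected = []
    for page in pages:
        for cluster in page["DBClusters"]:
            tags = {t["Key"]: t["Value"] for t in cluster.get("TagList", [])}
            if cluster["Status"] == "available" and tags.get("monitoring") != "disabled":
                selected.append({
                    "name": cluster["DBClusterIdentifier"],
                    "log_group": LOG_GROUP_FMT.format(name=cluster["DBClusterIdentifier"]),
                })
    return selected

if __name__ == "__main__":
    # Boto3 is imported lazily so the selection logic is testable offline.
    import boto3
    rds = boto3.client("rds")
    pages = rds.get_paginator("describe_db_clusters").paginate()
    print(clusters_to_monitor(pages))
```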
Managing logging and monitoring deployments
To manage logging and monitoring throughout the lifecycle of an Aurora cluster, the operator uses the SDK for Python to keep track of each cluster. It monitors the lifecycle state of each Aurora cluster and instruments logging and monitoring appropriately. It provisions a pod when an Aurora cluster is ready for use, updates the configuration if changes are required, and removes the pod when the cluster is deliberately decommissioned. In essence, decisions on provisioning Kubernetes resources for monitoring are highly dependent on the state of the Aurora clusters. The response from the describe-db-clusters API informs the operator of when deployments should be created, updated, and deleted. The resulting decision is then reported to the observability environment. The following diagram illustrates how the operator manages logging and monitoring deployments.
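The create/update/delete decision described above can be reduced to a small reconciliation function. This is a simplified sketch, not Workday's actual operator code; the status strings come from the describe-db-clusters API, and the function name is illustrative.

```python
# Simplified sketch of the operator's reconciliation decision: given the
# Aurora cluster state and the current Kubernetes deployment state, decide
# which action to take. Status values come from describe-db-clusters.

def reconcile(cluster_status, deployment_exists, config_changed=False):
    """Return the Kubernetes action for one Aurora cluster."""
    if cluster_status == "available" and not deployment_exists:
        return "create"   # cluster ready for use: provision the pod
    if cluster_status == "available" and config_changed:
        return "update"   # monitoring requirements changed
    if cluster_status in ("deleting", "deleted") and deployment_exists:
        return "delete"   # cluster decommissioned: free resources
    return "noop"         # nothing to do for this cluster
```

The resulting action (and its outcome) is what the operator reports to the observability environment.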
With this scalable and flexible Kubernetes cluster design deployed on Amazon EKS, and with the automation in the operator, Workday is able to achieve the following benefits:
As Workday gains more customers, the number of Aurora clusters increases with that growth, and the logging and monitoring solution needs to scale accordingly while handling large influxes of metrics and logs. Amazon EKS supports this through Auto Scaling groups and dynamic scaling policies: the Auto Scaling groups automatically scale the cluster's nodes up or down according to the scaling policies set for the EKS cluster.
The solution uses Amazon EKS security features for access management and data protection. Security groups restrict the network traffic from EKS clusters to Aurora clusters. AWS Identity and Access Management (IAM) roles for service accounts are used to grant the pods the permissions required to access AWS services and resources as needed. The solution helps keep database credentials secure and fulfills the principle of least privilege. The collection and distribution of logs and metrics are done over HTTPS. In addition, intrusion detection, threat detection, data protection, security policies, IAM policies, ingress and egress traffic security, and continuous monitoring are implemented to verify the end-to-end security of the solution.
Using the SDK for Python, the solution automatically gathers metrics and logs from Aurora clusters in near real time. This allows the operator to fully automate the deployment and management of Kubernetes pods in alignment with the lifecycle of the Aurora clusters.
Because enterprises like Workday can have hundreds of Aurora clusters, logging and monitoring needs to be distributed to provide fault tolerance. With a containerized approach and Amazon EKS, the solution is able to withstand failures and improve reliability.
Amazon EKS takes care of the Kubernetes control plane management. This minimizes the operational burden on SRE teams to manage Kubernetes nodes. Additionally, the use of consistent tooling and libraries with built-in integration simplifies the maintenance of the solution.
The operator automatically removes Kubernetes pods whenever an Aurora cluster is decommissioned, freeing up resources and scaling down the Kubernetes cluster nodes. Furthermore, because resource requests are specified at the container level, the solution is able to determine the type of nodes best suited for the Kubernetes cluster. With Amazon EKS and auto scaling, the Kubernetes cluster scales up and down depending on resource consumption. This keeps resource usage aligned with actual demand, optimizing cost.
Maintaining a single pane of glass view of operational health while supporting observability of a complex system such as Aurora is a nuanced task. Kubernetes made this solution feasible, and Amazon EKS brought it to life. At Workday, we've found that this approach has made our logging and monitoring solution reliable and scalable while also reducing the operational overhead of maintaining it.
To learn more about monitoring Amazon Aurora, refer to Logging and monitoring in Amazon Aurora. We look forward to hearing from you about your unique observability requirements and how you are meeting them with AWS!
About the Authors
Poornima Chand is a Senior Solutions Architect in the Strategic Accounts Solutions Architecture team at AWS. She works with customers to help solve their unique challenges using AWS technology solutions. She enjoys architecting and building scalable solutions.
Sandesh Achar is a Cloud Distributed Computing Architect and Technology Leader with experience in building, scaling, and managing a world-class Cloud Engineering team. He currently leads the Cloud Infrastructure Engineering, Site Reliability Engineering, and Database Engineering teams, spread globally, for the fastest growing product line at Workday.
Nathaniel Louis Tisuela is a Software Development Engineer in the Planning Observability Engineering team at Workday. He works on solving complex observability technology stack issues and spends most of his time diving deep into monitoring projects.