Where is your Cloud Bigtable cluster spending its CPU?

By mullaned2002

January 12, 2022

CPU utilization is a key performance indicator for Cloud Bigtable, and understanding where CPU is spent is essential for optimizing both performance and cost. We have significantly improved Bigtable’s observability: you can now break a cluster’s CPU utilization down by app profile, method, and table. This finer-grained reporting can help you make more informed application design choices and diagnose performance-related incidents.

In this post, we show how this visibility can be used in practice through two persona-based user journeys.

User Journey: Investigate an incident with high latency
Target Persona: Site Reliability Engineer (SRE)

ABC Corp runs Cloud Bigtable in a multi-tenant environment. Multiple teams at ABC Corp use the same Bigtable instance.

Alice is an SRE at ABC Corp. She gets paged because the tail latency of a cluster exceeded the acceptable performance threshold. She looks at the cluster-level CPU utilization chart and sees that CPU usage spiked during the incident window.

P99 latency for app profile personalization-reader spikes

CPU utilization for the cluster spikes

Alice wants to drill down further to get more details about this spike. The primary question she wants to answer is “Which team should I be reaching out to?”

Fortunately, teams at ABC Corp follow the best practice of tagging the usage of each team with an app profile in the following format: <teamname>-<workload-type>

The Bigtable instance has the following app profiles:

revenue-updater

info-updater

personalization-reader

personalization-batch-updater
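The naming convention above makes it straightforward to map an app profile back to the team that owns it. The following is a minimal sketch in Python, assuming the team name never contains a hyphen, so the first hyphen separates the team from the workload type. The function and type names are hypothetical and used only for illustration.

```python
from typing import NamedTuple


class AppProfileOwner(NamedTuple):
    team: str
    workload_type: str


def parse_app_profile(app_profile_id: str) -> AppProfileOwner:
    """Split '<teamname>-<workload-type>' at the first hyphen.

    Assumption: team names contain no hyphen; workload types may.
    """
    team, sep, workload_type = app_profile_id.partition("-")
    if not sep or not team or not workload_type:
        raise ValueError(f"unexpected app profile format: {app_profile_id!r}")
    return AppProfileOwner(team, workload_type)


# The four app profiles listed above.
for profile_id in [
    "revenue-updater",
    "info-updater",
    "personalization-reader",
    "personalization-batch-updater",
]:
    owner = parse_app_profile(profile_id)
    print(f"{profile_id}: team={owner.team}, workload={owner.workload_type}")
```

With this convention, both personalization-reader and personalization-batch-updater resolve to the same team, which is the team Alice would contact.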

The instance’s data is stored in the following tables:

revenue

client-info

personalization

She uses the CPU per app profile chart and finds that the personalization-batch-updater app profile consumed the most CPU during the incident. Its CPU usage also spiked at the same time as the latency of the serving path traffic under the personalization-reader app profile.

At this point, Alice knows that the personalization-batch-updater traffic is adversely impacting the personalization-reader traffic. She digs further into the dashboards in Metrics Explorer to identify the problematic method and table.

CPU usage breakdown by app profile, table and method

Alice has now identified the personalization-batch-updater app profile, the personalization table and the MutateRows method as the reason for the increase in CPU utilization that is causing high tail latency of the serving path traffic.
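Alice's drill-down amounts to ranking CPU usage by the combination of app profile, table, and method. The sketch below illustrates that ranking in Python. The sample values are made up for illustration and are not taken from the charts in this post; in practice the data points would come from Cloud Monitoring.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str, str]            # (app profile, table, method)
Sample = Tuple[str, str, str, float]  # key fields plus a CPU fraction


def rank_cpu_consumers(samples: Iterable[Sample]) -> List[Tuple[Key, float]]:
    """Return the average CPU per (app profile, table, method), highest first."""
    totals: Dict[Key, float] = defaultdict(float)
    counts: Dict[Key, int] = defaultdict(int)
    for app_profile, table, method, cpu in samples:
        key = (app_profile, table, method)
        totals[key] += cpu
        counts[key] += 1
    averages = {key: totals[key] / counts[key] for key in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical samples for the incident window (illustrative values only).
incident_samples: List[Sample] = [
    ("personalization-batch-updater", "personalization", "MutateRows", 0.50),
    ("personalization-batch-updater", "personalization", "MutateRows", 0.60),
    ("personalization-reader", "personalization", "ReadRows", 0.15),
    ("revenue-updater", "revenue", "MutateRow", 0.05),
    ("info-updater", "client-info", "MutateRow", 0.03),
]

top_key, top_cpu = rank_cpu_consumers(incident_samples)[0]
print("Top consumer:", top_key)
```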

With this information, she reaches out to the personalization team to provision the cluster correctly before the batch job starts so that the performance of other tenants is not affected.

The following options can be considered in this scenario:

1. Run the batch job on a replicated instance with multiple clusters. Provision a dedicated cluster for the batch job and use single-cluster routing to completely isolate the serving path traffic from the batch updates.

2. Provision more nodes for the cluster before the batch job starts and keep them for the duration of the batch job. This option is less preferred than option 1, since serving path traffic may still be impacted. However, it is more cost effective.
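For the option of adding nodes, a rough first estimate of the node count can be derived from the observed peak. The sketch below assumes that CPU utilization scales inversely with the number of nodes, which is a simplification: real workloads with hot rows or uneven key distribution will not scale this cleanly. The function name and the target percentage are hypothetical; choose the target according to the CPU guidance in the Bigtable documentation.

```python
def nodes_for_target_cpu(current_nodes: int, peak_cpu_percent: int, target_cpu_percent: int) -> int:
    """Estimate the nodes needed to bring peak CPU down to the target.

    Assumption: CPU utilization is inversely proportional to node count.
    Percentages are integers so the ceiling division is exact.
    """
    if current_nodes <= 0:
        raise ValueError("current_nodes must be positive")
    if peak_cpu_percent < 0 or not 0 < target_cpu_percent <= 100:
        raise ValueError("CPU percentages out of range")
    work = current_nodes * peak_cpu_percent
    needed = (work + target_cpu_percent - 1) // target_cpu_percent  # ceiling division
    # Never suggest fewer nodes than the cluster already has.
    return max(current_nodes, needed)


# Hypothetical numbers: 6 nodes peaking at 90% CPU, aiming for 60%.
print(nodes_for_target_cpu(current_nodes=6, peak_cpu_percent=90, target_cpu_percent=60))
```

Treat the result as a starting point for a load test rather than a guarantee.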

User Journey: Schema and cost optimization
Target Persona: Developer

Bob is a developer who is onboarding a new workload on Bigtable. He completes the development of his feature and moves on to the performance benchmarking phase before releasing to production. He notices that the throughput of his queries is lower, and their latency higher, than he expected, and begins debugging the issue.

His first step is to look at the CPU utilization of the cluster, which is higher than expected and is hovering around the recommended maximum.

CPU utilization by cluster

To debug further, he looks at the CPU utilization by app profile and the CPU utilization by table charts. He determines that the majority of the CPU is consumed by the product-reader app profile and the product_info table.

CPU utilization by app profile

CPU utilization by table

He inspects the application code and notices that the query includes a value range filter. He realizes that value filters are expensive, so he moves the filtering to the application. This leads to a substantial decrease in Bigtable cluster CPU utilization. As a result, he not only improves performance but can also lower the cost of the Bigtable cluster.

CPU utilization by cluster after removing value range filter

CPU utilization by app profile after removing value range filter

CPU utilization by table after removing value range filter
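Bob's change, moving the value range check out of the Bigtable query and into the application, can be sketched as follows. This is a minimal illustration in plain Python and not Bob's actual code: the row keys, the values, and the function name are hypothetical, and the cells are shown as an in-memory list where a real application would iterate over the results of an unfiltered read. Values are compared byte-wise, so the example uses fixed-width, zero-padded values to keep byte order consistent with numeric order.

```python
from typing import Iterable, Iterator, Tuple

Cell = Tuple[bytes, bytes]  # (row key, cell value)


def filter_by_value_range(cells: Iterable[Cell], start: bytes, end: bytes) -> Iterator[Cell]:
    """Yield the cells whose value lies in [start, end], compared byte-wise."""
    for row_key, value in cells:
        if start <= value <= end:
            yield row_key, value


# Hypothetical cells read from the product_info table without a value filter.
product_cells = [
    (b"sku#001", b"0042"),
    (b"sku#002", b"0150"),
    (b"sku#003", b"0499"),
    (b"sku#004", b"1200"),
]

matching = list(filter_by_value_range(product_cells, b"0100", b"0500"))
print(matching)
```

The trade-off is that cells that fail the check are still sent to the client, so this approach shifts work from the cluster to the network and the application.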

We hope this post helps you understand why and when you might want to use the new observability metrics: CPU per app profile, method, and table.

Accessing the metrics

These metrics can be accessed in the Bigtable Monitoring UI under the Tables and Application Profiles tabs. To see the method breakdown, view the metric in Metrics Explorer, which you can also navigate to from the Cloud Monitoring UI.
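If you prefer to query the breakdown programmatically, a Cloud Monitoring filter string is the usual starting point. The sketch below only builds such a string. The metric type and the label names (app_profile, table, method) are assumptions that are not stated in this post, so verify them against the Bigtable metrics list before relying on them. The cluster ID is hypothetical.

```python
from typing import Optional

# Assumed metric type; confirm it in the Cloud Monitoring metrics list.
CPU_BREAKDOWN_METRIC = (
    "bigtable.googleapis.com/cluster/cpu_load_by_app_profile_by_method_by_table"
)


def build_cpu_filter(
    cluster_id: str,
    app_profile: Optional[str] = None,
    table: Optional[str] = None,
    method: Optional[str] = None,
) -> str:
    """Build a Monitoring filter for the CPU breakdown metric.

    Label names are assumptions; only the labels that are provided are added.
    """
    clauses = [
        f'metric.type = "{CPU_BREAKDOWN_METRIC}"',
        f'resource.labels.cluster = "{cluster_id}"',
    ]
    for label, value in (("app_profile", app_profile), ("table", table), ("method", method)):
        if value is not None:
            clauses.append(f'metric.labels.{label} = "{value}"')
    return " AND ".join(clauses)


# Hypothetical cluster ID, narrowed to the batch updater's writes.
print(build_cpu_filter("abc-cluster-1", app_profile="personalization-batch-updater", method="MutateRows"))
```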
