Sharing the inaugural State of Kubernetes Cost Optimization report

By mullaned2002

June 28, 2023

The past couple of years have been tough for IT organizations. Between headwinds from the COVID-19 pandemic and other macroeconomic factors, teams have been tasked with optimizing their cloud infrastructure footprint while keeping the services that are core and crucial to the business up and running. Today, we’re excited to publish the inaugural State of Kubernetes Cost Optimization report to provide insights and best practices to the Kubernetes community about running cost-efficient clusters in the public cloud without compromising the performance or reliability of their workloads.

Why we authored this report

The report addresses the intersection of two trends: IT organizations looking to reduce costs, and the continued rise of Kubernetes adoption across industries. We performed a large-scale analysis of Kubernetes clusters to understand what sets high performers apart in cost optimization. And now, we’re excited to share our key findings.

How we conducted this research

The report centers around four “golden signals” for Kubernetes cost optimization (not to be confused with the four golden signals of monitoring). These signals, derived from years of collaboration with Kubernetes users, can be used to measure how well you balance workload reliability against the cost optimization of your clusters.

The “golden signals” of Kubernetes cost optimization

Using these golden signals as a baseline measurement, we looked at large-scale, anonymized data from Google Kubernetes Engine (GKE) clusters, sorting clusters into At Risk, Low, Medium, High, and Elite segments using a classification tree weighted with quasi-equal intervals. With these five segments in place, we compared and analyzed how high-performing clusters perform against these golden signals.

The most important takeaway: set your requests!

Do you know how many workloads are not setting requests in your production clusters? One of our key observations is that many developers are not setting requests for their workloads. And that’s worrisome. Because Kubernetes reclaims resources when node-pressure occurs, it is critical to set requests for workloads that require even a minimum level of reliability.

Not setting requests implicitly assigns the BestEffort Quality of Service (QoS) class to your Pods. In times of resource scarcity on a given Node — and without any warning or graceful termination — BestEffort Pods are often the first to be killed. This can lead to intermittent performance or reliability issues for your workloads, and can occur depending on Pod resource utilization and where the scheduler places Pods. When these issues arise, they can be difficult to identify and debug.
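To make the QoS assignment concrete, here is a minimal sketch in Python that approximates how Kubernetes derives a Pod's QoS class from its container specs. The function name `qos_class` and the dictionary layout are assumptions for illustration (they mirror the `containers` field of a Pod manifest), and the sketch simplifies the real rules: it compares quantities as plain strings and ignores init containers. Note that when a container sets a limit but no request, Kubernetes uses the limit as the request, so such a Pod is not BestEffort.

```python
from typing import Any, Dict, List

RESOURCES = ("cpu", "memory")


def qos_class(containers: List[Dict[str, Any]]) -> str:
    """Approximate the QoS class of a Pod from its container specs.

    Simplified: quantities are compared as strings, so "500m" and "0.5"
    are treated as different values even though Kubernetes sees them as equal.
    """
    any_set = False         # does any container set a CPU/memory request or limit?
    all_guaranteed = True   # does every container have limits with matching requests?

    for container in containers:
        resources = container.get("resources") or {}
        requests = resources.get("requests") or {}
        limits = resources.get("limits") or {}
        for name in RESOURCES:
            req, lim = requests.get(name), limits.get(name)
            if req is not None or lim is not None:
                any_set = True
            # A missing request defaults to the limit.
            effective_req = req if req is not None else lim
            if lim is None or effective_req != lim:
                all_guaranteed = False

    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


if __name__ == "__main__":
    print(qos_class([{"name": "app"}]))
    print(qos_class([{"name": "app", "resources": {"requests": {"cpu": "100m"}}}]))
```

A Pod whose containers set neither requests nor limits lands in BestEffort, which is the case the report warns about.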

To identify workloads that do not set requests, you can use one of the following tools:

If you are running Kubernetes clusters in GKE, use the GKE Workloads at Risk dashboard. This identifies workloads that have not set requests across your fleet of GKE clusters, along with other workloads at a performance or reliability risk based on how they have requests set.

If you want a very simple script to list containers that are not setting requests in any Kubernetes cluster, check out kube-requests-checker.
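As an illustration of what such a check involves, the sketch below scans a Pod list in the JSON shape produced by `kubectl get pods --all-namespaces -o json` and reports containers that are missing a CPU or memory request. This is not the kube-requests-checker code; the function name `containers_without_requests` and the sample data are assumptions made for this example.

```python
import json
from typing import Any, Dict, Iterator, List, Tuple

Finding = Tuple[str, str, str, List[str]]  # (namespace, pod, container, missing resources)


def containers_without_requests(pod_list: Dict[str, Any]) -> Iterator[Finding]:
    """Yield every container that lacks a CPU or memory request."""
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        for container in pod.get("spec", {}).get("containers", []):
            requests = (container.get("resources") or {}).get("requests") or {}
            missing = [r for r in ("cpu", "memory") if r not in requests]
            if missing:
                yield (
                    meta.get("namespace", "default"),
                    meta.get("name", ""),
                    container.get("name", ""),
                    missing,
                )


# Hypothetical sample data standing in for real kubectl output.
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"namespace": "shop", "name": "cart-1"},
   "spec": {"containers": [
     {"name": "cart", "resources": {"requests": {"cpu": "250m", "memory": "256Mi"}}},
     {"name": "sidecar", "resources": {}}
   ]}},
  {"metadata": {"namespace": "shop", "name": "search-1"},
   "spec": {"containers": [
     {"name": "search", "resources": {"requests": {"cpu": "500m"}}}
   ]}}
]}
""")

if __name__ == "__main__":
    for namespace, pod, container, missing in containers_without_requests(SAMPLE):
        print(f"{namespace}/{pod} container={container} missing={','.join(missing)}")
```

To run it against a real cluster, you could replace `SAMPLE` with `json.load(sys.stdin)` and pipe the kubectl output into the script.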

Once you’ve set requests for your workloads, you can then proceed with workload rightsizing. This golden signal is at the heart of the cost optimization journey; if requests more closely reflect reality, then the decisions Kubernetes makes using requests will be more effective.
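One way to picture rightsizing is as deriving a request from observed usage rather than from a guess. The sketch below takes CPU usage samples in millicores and suggests a request at a chosen percentile plus some headroom. The function name, the 95th percentile, and the 15% headroom are illustrative assumptions, not values taken from the report or from GKE's recommendations.

```python
import math
from typing import Sequence


def suggest_cpu_request(
    usage_millicores: Sequence[int],
    percentile: int = 95,    # assumed default, in percent
    headroom_pct: int = 15,  # assumed safety margin, in percent
) -> int:
    """Suggest a CPU request (millicores) from observed usage samples.

    Uses the nearest-rank percentile of the samples, then adds headroom.
    """
    if not usage_millicores:
        raise ValueError("need at least one usage sample")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")

    ordered = sorted(usage_millicores)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    base = ordered[rank - 1]
    return math.ceil(base * (100 + headroom_pct) / 100)


if __name__ == "__main__":
    samples = [100, 120, 110, 400, 130, 125, 115, 105, 118, 122]
    print(suggest_cpu_request(samples))
    print(suggest_cpu_request(samples, percentile=50, headroom_pct=0))
```

A high percentile protects against spikes at the cost of more reserved capacity, while a lower one packs nodes more tightly; which trade-off is right depends on how much reliability the workload needs.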

We are seeing this focus on proper setting of resource requests across the community. Ajay Tripathy, the author of the OpenCost project, noted that “setting appropriate resource requests and prioritizing workload rightsizing…is the biggest area of opportunity for OpenCost users.”

In conclusion

No one team alone is responsible for Kubernetes cost optimization — rather, it’s a joint effort that spans developers, platform admins, and even billing and budget owners. The report contains insights and recommendations for each of these personas in its key findings.

We also know that lessons from these findings are not one-time fixes. Rather, they are continuous practices that you should build into your team culture over time.

To learn more about how we take lessons from the State of Kubernetes Cost Optimization report and build them into GKE, check out the resources below:

A solution guide on best practices for running cost-optimized Kubernetes applications on GKE

A solution guide on rightsizing your workloads at scale in GKE

A demo video on rightsizing your workloads at scale in GKE

A demo video on using the cloud console for GKE Optimization

An interactive tutorial to try out GKE with sample workloads

And finally, be sure to download the report here.

We will discuss this and other key findings in our @googlecloudtech Twitter Space on August 9th, 2023. Be sure to follow and join us! If you are not able to join, please stay tuned for a series of blog posts where we will review each key finding one by one.
