Sunday, May 19, 2024

Running AI on fully managed GKE, now with new compute options, pricing and resource reservations

Kubernetes is a popular way to run AI workloads like training and large language model (LLM) serving, including our new open model Gemma. Google Kubernetes Engine (GKE) in Autopilot mode provides a fully managed Kubernetes platform that offers the power and flexibility of Kubernetes without the need to worry about compute nodes, so you can focus on delivering your own business value through AI. Today we’re excited to announce the new Accelerator compute class in Autopilot, which improves GPU support with resource reservation capabilities and a lower price for most GPU workloads (you can opt in to this pricing today, and eventually all workloads will be migrated). In addition, a new Performance compute class enables high-performance workloads to run in Autopilot mode at scale. Both compute classes also have more ephemeral storage available right on the boot disk, giving you more room to download AI models before needing to configure additional storage via generic ephemeral volumes. With these enhancements, using our fully managed Kubernetes platform for inference and other compute-intensive workloads is even better.

With GKE running in Autopilot mode you avoid the need to specify and provision nodes upfront, and can focus on building the workload and creating your own business value. As a fully managed platform, once your workload is built you can run it with less operational overhead. Today’s news sweetens the deal even further.

Lower-priced GPUs, better discounts

We’re lowering the price for the majority of GPU workloads running on GKE in Autopilot mode, and moving to a new billing model that improves compatibility with other products and experiences in Google Cloud. Now you can move workloads between the Standard and Autopilot modes of GKE, as well as to and from Compute Engine VMs, and keep your existing reservations and committed use discounts.

When you enable the new pricing model (by specifying the Accelerator compute class, as illustrated in the code sample below), resources are billed based on Compute Engine VM resources, plus a premium for the fully managed experience. Today the new pricing model is opt-in; after April 30, versions of GKE will be released that automatically migrate GPU workloads to the new model. The resulting price for most workloads is lower (workloads on NVIDIA T4 GPUs with less than 2 vCPU per GPU see a slight price increase).

Here’s a comparison of the hourly prices for several workload sizes in the us-central1 region for GPU, CPU and Memory resources (storage additional):

| GPU | Pod resource requests | VM resources | Old price (GPU Pod) | New price (Accelerator compute class Pod) |
| --- | --- | --- | --- | --- |
| NVIDIA A100 80GB | 1 GPU, 11 vCPU, 148 GB memory | 1 GPU, 12 vCPU, 170 GB memory | $6.09 | $5.59 |
| NVIDIA A100 40GB | 1 GPU, 11 vCPU, 74 GB memory | 1 GPU, 12 vCPU, 85 GB memory | $4.46 | $4.09 |
| NVIDIA L4 | 1 GPU, 11 vCPU, 40 GB memory | 1 GPU, 12 vCPU, 48 GB memory | $1.61 | $1.12 |
| NVIDIA T4 | 1 GPU, 1 vCPU, 1 GB memory | 1 GPU, 2 vCPU, 2 GB memory | $0.46 | $0.47 |
| NVIDIA T4 | 1 GPU, 20 vCPU, 40 GB memory | 1 GPU, 22 vCPU, 48 GB memory | $1.96 | $1.37 |

When using the Accelerator compute class, the workload is billed for (and can utilize) the complete node VM capacity, including bursting into resources allocated for system Pods.

To opt in to these changes today, upgrade to version 1.28.6-gke.1095000 or later, and add the compute-class selector to your existing GPU workloads, like so:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  nodeSelector:
    cloud.google.com/compute-class: "Accelerator"
    cloud.google.com/gke-accelerator: nvidia-l4
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

High-performance CPU resources

If you need dedicated CPU resources for your workloads, Autopilot now takes a similar approach as it does with GPUs. You can now run GKE Autopilot workloads on Compute Engine’s main machine families including the new C3, C3D and H3 machines, as well as C2, C2D, and more! These resources can be requested as part of the Performance compute class. Here’s an example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: performance-pod
spec:
  nodeSelector:
    cloud.google.com/compute-class: Performance
    cloud.google.com/machine-family: c3
  containers:
  - name: my-container
    image: "k8s.gcr.io/pause"
    resources:
      requests:
        cpu: 20
        memory: "100Gi"
```

Reservations

Reservations can help ensure that your project has resources for future increases in demand, but previously you weren’t able to consume reservations in Autopilot mode. Good news, now you can! Using reservations is a breeze, and they can be used with both GPUs (when you opt in to the new model), and high-performance CPUs.
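As a sketch of how this fits together, a specific reservation can be consumed by adding reservation node selectors alongside the compute class. The reservation name `my-l4-reservation` is a hypothetical placeholder for a reservation you have already created in your project:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: reserved-gpu-pod
spec:
  nodeSelector:
    cloud.google.com/compute-class: "Accelerator"
    cloud.google.com/gke-accelerator: nvidia-l4
    # Consume a pre-created specific reservation by name (placeholder name)
    cloud.google.com/reservation-name: my-l4-reservation
    cloud.google.com/reservation-affinity: "specific"
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1
```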

Larger boot disks

While GKE allows you to mount multiple persistent volumes to a container, each of which can be up to 64TB on any path in your container, offering larger boot disks for Pods lets you use ephemeral storage without mounting a separate volume. When using either the Performance or Accelerator compute-class labels above, your workload can now consume up to 122GiB of ephemeral storage. Need more? Persistent disks can be mounted to expand further.
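To make use of the larger boot disk, a Pod can request ephemeral storage with the standard Kubernetes `ephemeral-storage` resource alongside one of the compute-class selectors; a minimal sketch (names illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: model-download-pod
spec:
  nodeSelector:
    cloud.google.com/compute-class: Performance
    cloud.google.com/machine-family: c3
  containers:
  - name: my-container
    image: "k8s.gcr.io/pause"
    resources:
      requests:
        cpu: 20
        memory: "100Gi"
        # Scratch space on the boot disk, e.g. for downloading models
        ephemeral-storage: "100Gi"
```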

Hardware when you need it, simplicity when you don’t

You may be wondering, where do regular Autopilot Pods fit in with this new model? Think about it like this: if you have a workload that requires dedicated, high-performance CPU hardware such as that offered by C3 machines, you can annotate just that workload with those requirements using the node selector described above.

But what about supporting workloads that run alongside the primary ones but don’t need the same computing power? This is where Autopilot mode really excels: by default, all those other workloads will continue to run on the standard Pod model, offering great price/performance for workloads that don’t have high-performance CPU needs. In Autopilot mode, just annotate those workloads that need specialized hardware, like a specific GPU or machine family, and we’ll do the rest. Leave the other workloads blank, and rest assured that they won’t accidentally run on the specialized hardware. This way, you get the best value out of each of your execution environments: broadly applicable defaults in Autopilot, and specialized hardware when you need it.
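For contrast, a supporting workload stays on the default Autopilot Pod model simply by omitting the compute-class and accelerator selectors (the names below are illustrative):

```yaml
# No compute-class or accelerator node selectors: Autopilot schedules
# this Pod on general-purpose capacity by default.
apiVersion: v1
kind: Pod
metadata:
  name: supporting-service
spec:
  containers:
  - name: web
    image: "k8s.gcr.io/pause"
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
```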

Here’s what our customers are saying

“At Contextual AI, we are building the next generation of Retrieval Augmented Generation (RAG). Contextual Language Models (CLMs) are end-to-end optimized to address pain points of RAG 1.0 and help enterprise customers build production-grade workflows. To achieve this, we rely on GKE Autopilot, a fully managed Kubernetes service that handles the complexity of running our application. With GKE Autopilot, we can easily scale our pods, optimize our resource utilization, and ensure the security and availability of our nodes. We also take advantage of the new billing models that offer more cost-effective GPUs for our inference tasks, while using regular Autopilot pods for our non-GPU services. We are excited to use GKE Autopilot to power CLMs while saving us money and improving our performance.” – Soumitr Pandey, Member of Technical Staff, Contextual AI

“We opted for GKE Autopilot for our ML infrastructure as it empowers our team to concentrate on research and development instead of cluster management. This approach not only automates resource provisioning throughout the entire regional cluster but also streamlines our operations. The latest enhancements in Autopilot are particularly exciting. They not only provide a unified resource pool but also introduce reservation capabilities, giving us greater control in meeting project deadlines.” – Jon Mason, CEO, Hotspring

To learn more about all the new features that we launched for Autopilot this week, check out the following resources:

Deploy GPU workloads in Autopilot 

Consume capacity reservations in Autopilot clusters

Run CPU-intensive workloads with optimal performance

AI/ML orchestration on GKE
