TensorFlow on GKE Autopilot with GPU acceleration

By mullaned2002

July 26, 2023


With all the recent interest in Machine Learning and Artificial Intelligence, you might be wondering: what’s the best place to run my AI/ML workloads?

This is why we built the Autopilot mode of operation for Google Kubernetes Engine (GKE) with GPU support. Autopilot takes care of all the infrastructure, so you can focus on running AI/ML workloads, whether for inference, training, or any other GPU task. You simply provide the Pod or Job definition with your container, schedule it on Autopilot and we will provision the right GPU and execute the workload. You’re only billed while the Job is running too, so once it completes (or you terminate it), the charges stop immediately, and we’ll take care of the cleanup.

Sound too good to be true?

In this post, I’ll demo the creation, execution and teardown of an AI/ML workload. The workload is a TensorFlow-enabled Jupyter notebook running on an NVIDIA T4, which we can use to run a bunch of different AI/ML training examples. Jupyter notebooks are great for learning and experimenting with AI/ML, and we’ll mount a persistent disk so that you can even preserve your work between runs.

You can also watch my video demonstration here:

Setup

Start by creating a GKE Autopilot cluster. Since GPUs are not available in every region, choose a region with the GPU you want (the config here uses an NVIDIA T4). Regions with GPUs are shown in the Autopilot pricing table.

Create the cluster:

    CLUSTER_NAME=test-cluster
    REGION=us-west1
    gcloud container clusters create-auto $CLUSTER_NAME \
        --region $REGION

Installation

Now we can deploy a TensorFlow-enabled Jupyter Notebook with GPU acceleration.

The following StatefulSet definition creates an instance of the tensorflow/tensorflow:latest-gpu-jupyter container that gives us a Jupyter notebook in a TensorFlow environment. It provisions an NVIDIA T4 GPU, and mounts a PersistentVolume to the /tf/saved path so you can save your work and it will persist between restarts. It also runs on Spot, so you save 60-91% (and remember, our work is saved if it’s preempted).

This is a legit Jupyter Notebook that you can use long term!

    # Tensorflow/Jupyter StatefulSet
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: tensorflow
    spec:
      selector:
        matchLabels:
          pod: tensorflow-pod
      serviceName: tensorflow
      replicas: 1
      template:
        metadata:
          labels:
            pod: tensorflow-pod
        spec:
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-tesla-t4
            cloud.google.com/gke-spot: "true"
          terminationGracePeriodSeconds: 30
          containers:
          - name: tensorflow-container
            image: tensorflow/tensorflow:latest-gpu-jupyter
            volumeMounts:
            - name: tensorflow-pvc
              mountPath: /tf/saved
            resources:
              requests:
                nvidia.com/gpu: "1"
                ephemeral-storage: 10Gi
    ## Optional: override and set your own token
    #        env:
    #        - name: JUPYTER_TOKEN
    #          value: "jupyter"
      volumeClaimTemplates:
      - metadata:
          name: tensorflow-pvc
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 100Gi
    ---
    # Headless service for the above StatefulSet
    apiVersion: v1
    kind: Service
    metadata:
      name: tensorflow
    spec:
      ports:
      - port: 8888
      clusterIP: None
      selector:
        pod: tensorflow-pod

We also need a load balancer, so we can connect to this notebook from our desktop:

    # External service
    apiVersion: "v1"
    kind: "Service"
    metadata:
      name: tensorflow-jupyter
    spec:
      ports:
      - protocol: "TCP"
        port: 80
        targetPort: 8888
      selector:
        pod: tensorflow-pod
      type: LoadBalancer

Deploy them both like so:

    kubectl create -f https://raw.githubusercontent.com/WilliamDenniss/autopilot-examples/main/tensorflow/tensorflow.yaml
    kubectl create -f https://raw.githubusercontent.com/WilliamDenniss/autopilot-examples/main/tensorflow/tensorflow-jupyter.yaml

While we’re waiting, we can watch the events in the cluster to make sure it’s going to work, like so (output truncated to show relevant events):

    $ kubectl get events -w
    LAST SEEN   TYPE      REASON             OBJECT             MESSAGE
    5m25s       Warning   FailedScheduling   pod/tensorflow-0   0/3 nodes are available: 2 Insufficient cpu, 2 Insufficient memory, 2 Insufficient nvidia.com/gpu, 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
    4m24s       Normal    TriggeredScaleUp   pod/tensorflow-0   pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/gke-autopilot-test/zones/us-west1-b/instanceGroups/gk3-test-cluster-nap-1ax02924-9c722205-grp 0->1 (max: 1000)}]
    2m13s       Normal    Scheduled          pod/tensorflow-0   Successfully assigned default/tensorflow-0 to gk3-test-cluster-nap-1ax02924-9c722205-lzgj

Here’s how Kubernetes and Autopilot work together: you’ll initially see FailedScheduling, because at the moment you deploy, no node exists that can handle your Pod. Then you’ll see TriggeredScaleUp, which is Autopilot adding that resource for you, and finally Scheduled once the Pod has the resources. GPU nodes take a little longer than regular CPU nodes to provision, and this container takes a little while to boot. In my case it took about 5 minutes in total from scheduling the Pod to it running.
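If you would rather check this sequence in a script than by eye, something like the following could work. It is only a sketch: it assumes you have already collected the REASON column values from the event output, and the helper name `provisioning_stage` and the stage descriptions are made up for illustration, not part of any Kubernetes or GKE API.

```python
# Hypothetical helper: map the event reasons seen so far to a provisioning stage.
# The reason strings are the ones shown in the kubectl output in this post.
STAGES = [
    ("Scheduled", "Pod assigned to a node; container is starting"),
    ("TriggeredScaleUp", "Autopilot is adding a GPU node"),
    ("FailedScheduling", "no node fits yet; waiting for scale-up"),
]


def provisioning_stage(reasons):
    """Describe the furthest stage reached, given a list of event reasons."""
    seen = set(reasons)
    # STAGES is ordered from latest to earliest, so the first match wins.
    for reason, description in STAGES:
        if reason in seen:
            return description
    return "no scheduling events yet"


print(provisioning_stage(["FailedScheduling", "TriggeredScaleUp"]))
```

Seeing only FailedScheduling for a minute or so is expected; it becomes a concern only if TriggeredScaleUp never follows.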

Using the Notebook

Now it’s time to connect. First, get the external IP of the load balancer:

    $ kubectl get svc
    NAME                 TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)        AGE
    kubernetes           ClusterIP      10.102.0.1     <none>         443/TCP        20d
    tensorflow           ClusterIP      None           <none>         80/TCP         9m4s
    tensorflow-jupyter   LoadBalancer   10.102.2.107   34.127.75.81   80:31790/TCP   8m35s

Then browse to it. The Jupyter login page asks for a token. We can run the command it suggests in Kubernetes with exec:

    $ kubectl exec -it sts/tensorflow -- jupyter notebook list
    Currently running servers:
    http://0.0.0.0:8888/?token=e54a0e8129ca3918db604f5c79e8a9712aa08570e62d2715 :: /tf

Log in by copying the token (in my case, e54a0e8129ca3918db604f5c79e8a9712aa08570e62d2715) into the input box and hitting “Log In”.

Note: if you want to skip this step, you can set your own token in the configuration. Just uncomment the env lines and define your own token.
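As an alternative to pasting the token into the login box, you can put it in the URL, the same way the `jupyter notebook list` output does. Here is a minimal sketch that builds such a URL from that output and the load balancer’s external IP. The function name `login_url` is hypothetical, and the port defaults to 80 only because that is what the LoadBalancer Service in this post exposes.

```python
import re


def login_url(notebook_list_output, external_ip, port=80):
    """Build a browser URL from `jupyter notebook list` output and an external IP."""
    # The token is whatever follows "token=" up to whitespace or the next parameter.
    match = re.search(r"[?&]token=([^\s&]+)", notebook_list_output)
    if match is None:
        raise ValueError("no token found in the output")
    return f"http://{external_ip}:{port}/?token={match.group(1)}"


# Sample line copied from the command output shown in this post.
sample = "http://0.0.0.0:8888/?token=e54a0e8129ca3918db604f5c79e8a9712aa08570e62d2715 :: /tf"
print(login_url(sample, "34.127.75.81"))
```

Treat the resulting URL like a password, since anyone who has it can open the notebook.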

There are two folders: “tensorflow-tutorials”, which holds some included samples, and “saved”, which is the one we mounted from a persistent disk. I recommend operating out of the “saved” folder to preserve your state between sessions, and moving the included “tensorflow-tutorials” directory into the “saved” directory before getting started. You can use the Jupyter UI to move the folder and upload your own notebooks.

Let’s try running a few of the included samples.

The classification.ipynb example

The overfit_and_underfit.ipynb example

We can upload our own projects, like the examples in the TensorFlow docs. Just download the notebook from the docs, upload it to the saved/ folder in Jupyter, and run it.

TensorFlow basics.ipynb tutorial, utilizing GPU acceleration
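To confirm that a notebook really sees the T4, you can run a quick check in a cell. The sketch below uses TensorFlow’s `tf.config.list_physical_devices`; the wrapper function `visible_gpus` is just an illustrative name, and the import is guarded so the cell also runs (and says so) in an environment without TensorFlow.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment")
elif gpus:
    print("GPUs visible to TensorFlow:", gpus)
else:
    print("No GPU visible; training would fall back to the CPU")
```

In the container used in this post, the list should contain one entry for the single GPU requested by the StatefulSet.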

So there it is. We have a reusable TensorFlow Jupyter notebook running on an NVIDIA T4! This isn’t just a toy either, we hooked up a PersistentVolume so your work is saved (even if the StatefulSet is deleted, or the Pod disrupted). We’re using Spot compute to save some cash. And the entire thing was provisioned from 2 YAML files, no need to think about the underlying compute hardware. Neat!

Monitoring & Troubleshooting

If you get a message like “The kernel appears to have died. It will restart automatically.”, then the first step is to tail your logs.

    kubectl logs tensorflow-0 -f

A common issue I saw was exhausting the GPU’s memory when running two notebooks at once (CUDA_ERROR_OUT_OF_MEMORY in the logs). The easy fix is to shut down all but the notebook you are actively using.
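If the logs are long, a small script can look for that error for you. This is only a sketch: it assumes you have saved the output of the logs command to a string or file, the function name `count_gpu_oom` is hypothetical, and the sample lines are placeholders rather than real TensorFlow log output.

```python
def count_gpu_oom(log_text, marker="CUDA_ERROR_OUT_OF_MEMORY"):
    """Count the log lines that mention the GPU out-of-memory marker."""
    return sum(1 for line in log_text.splitlines() if marker in line)


# Placeholder text standing in for saved Pod logs (not real log output).
sample_logs = "\n".join([
    "placeholder line: kernel started",
    "placeholder line: CUDA_ERROR_OUT_OF_MEMORY",
    "placeholder line: kernel restarted",
])

hits = count_gpu_oom(sample_logs)
if hits:
    print(f"{hits} out-of-memory line(s) found; shut down idle notebooks")
else:
    print("no GPU out-of-memory errors found")
```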

You can keep an eye on the GPU utilization like so:

    $ kubectl exec -it sts/tensorflow -- bash
    # watch -d nvidia-smi

If you need to restart the setup for whatever reason, just delete the pod and Kubernetes will recreate it. This is very fast on Autopilot, as the GPU-enabled node resource will hang around for a short time in the cluster.

    kubectl delete pod tensorflow-0

What’s Next

To shell into the environment and run arbitrary code (i.e. without using the notebook UI), you can use the following. Just be sure to save any data you want to persist in /tf/saved/.

    kubectl exec -it sts/tensorflow -- bash

If you want some more tutorials, check out the TensorFlow tutorials and Keras.

I cloned the Keras repo onto my persistent volume to have all those tutorials in my notebook as well.

    $ kubectl exec -it sts/tensorflow -- bash
    # cd /tf/saved
    # git clone https://github.com/keras-team/keras-io.git
    # pip install pandas

If you need any additional Python modules for your notebooks like Pandas, you can set that up the same way. To create a more durable setup though, you’ll want your own Dockerfile extending the one we used above (let me know if you want me to share such a recipe in a follow-up post).
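Before running an uploaded notebook, it can help to see which of its modules are missing from the container. Here is a minimal sketch using the standard library’s `importlib.util.find_spec`; the function name `missing_modules` is made up for this example, and it expects top-level module names only.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "json" ships with Python; "pandas" may or may not be installed in the container.
print(missing_modules(["json", "pandas"]))
```

Anything in the returned list is a candidate for pip install, or for your own Dockerfile if you want it to survive a Pod restart.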

I ran a few different examples, here’s some of the output:

The output of the Keras timeseries/ipynb/timeseries_weather_forecasting.ipynb example

A random epoch iteration in the Keras generative/ipynb/text_generation_with_miniature_gpt.ipynb example

Cleanup

GPUs are not the cheapest resources, so make sure you delete the resources once you are done! Clean up by removing the StatefulSet and services:

    kubectl delete sts tensorflow
    kubectl delete svc tensorflow tensorflow-jupyter

Again, the nice thing about Autopilot is that deleting the Kubernetes resources (in this case a StatefulSet and LoadBalancer) will end the associated charges.

That just leaves the persistent disk. You can either keep it around (so that if you re-create the above StatefulSet, it will be reattached and your work will be saved), or if you no longer need it, then go ahead and delete the disk as well.

    kubectl delete persistentvolumeclaim/tensorflow-pvc-tensorflow-0
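The claim name in that command follows the Kubernetes naming convention for StatefulSet volumes: the volumeClaimTemplates name, then the StatefulSet name, then the Pod ordinal. Here is a small sketch of that rule, with illustrative function names, in case you rename things or scale to more replicas.

```python
def pvc_name(claim_template, statefulset, ordinal):
    """Name of the PersistentVolumeClaim created for one StatefulSet Pod."""
    return f"{claim_template}-{statefulset}-{ordinal}"


def pvc_names(claim_template, statefulset, replicas):
    """Names of the claims for every replica, ordinals 0 to replicas - 1."""
    return [pvc_name(claim_template, statefulset, i) for i in range(replicas)]


# Values taken from the StatefulSet definition in this post.
print(pvc_names("tensorflow-pvc", "tensorflow", 1))
```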

You can delete the cluster if you don’t need it anymore as well.

    gcloud container clusters delete $CLUSTER_NAME --region $REGION

Next Steps

So that’s how easy it is to run GPU workloads on Autopilot!

Just define your Kubernetes workloads including any GPU resources they need, and we’ll take care of the rest. When you’re done, delete the object and the charges stop right away—no need to worry about node clean up.

Head over to https://console.cloud.google.com/kubernetes to get started with your own GKE cluster, and if you’re new to Google Cloud, remember to take advantage of the $300 free trial!
