With all the recent interest in Machine Learning and Artificial Intelligence, you might be wondering: what’s the best place to run my AI/ML workloads?
This is why we built the Autopilot mode of operation for Google Kubernetes Engine (GKE) with GPU support. Autopilot takes care of all the infrastructure, so you can focus on running AI/ML workloads, whether for inference, training, or any other GPU task. You simply provide the Pod or Job definition with your container and schedule it on Autopilot, and we provision the right GPU and execute the workload. You’re also billed only while the Job is running, so once it completes (or you terminate it), the charges stop immediately, and we take care of the cleanup.
In this post, I’ll demo the creation, execution and teardown of an AI/ML workload. The workload is a TensorFlow-enabled Jupyter notebook running on an NVIDIA T4, which we can use to run a bunch of different AI/ML training examples. Jupyter notebooks are great for learning and experimenting with AI/ML, and we’ll mount a persistent disk so that you can even preserve your work between runs.
Setup
Start by creating a GKE Autopilot cluster. Since GPUs are not available in every region, choose a region with the GPU you want (the config here uses an NVIDIA T4). Regions with GPUs are shown in the Autopilot pricing table.
Create the cluster:
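A minimal sketch of the cluster creation step follows. The cluster name `gpu-demo` and the region `us-west1` are placeholders I have assumed, so swap in a region from the pricing table that offers the T4. The command is wrapped in a function so nothing is created until you call it.

```shell
# Placeholder values (assumptions): choose your own name and a GPU-capable region.
CLUSTER_NAME=gpu-demo
REGION=us-west1

# Create a GKE Autopilot cluster in the chosen region.
create_cluster() {
  gcloud container clusters create-auto "$CLUSTER_NAME" --region "$REGION"
}
```

Run `create_cluster` once you are happy with the values.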
Installation
Now we can deploy a TensorFlow-enabled Jupyter notebook with GPU acceleration.
The following StatefulSet definition creates an instance of the tensorflow/tensorflow:latest-gpu-jupyter container, which gives us a Jupyter notebook in a TensorFlow environment. It provisions an NVIDIA T4 GPU and mounts a PersistentVolume to the /tf/saved path, so you can save your work and it will persist between restarts. It also runs on Spot, so you save 60-91% (and remember, our work is saved if the Pod is preempted).
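As a sketch of what such a definition can look like, the snippet below writes one to `tensorflow.yaml`. The resource names, labels, the 100Gi disk size and the commented-out token value are illustrative assumptions of mine; the image, the /tf/saved mount path, the T4 and Spot come from the description above.

```shell
# Write a sketch of the StatefulSet (plus its headless Service) to tensorflow.yaml.
# Names, labels and the disk size are illustrative assumptions.
cat > tensorflow.yaml <<'EOF'
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: tensorflow
spec:
  selector:
    matchLabels:
      pod: tensorflow-pod
  serviceName: tensorflow
  replicas: 1
  template:
    metadata:
      labels:
        pod: tensorflow-pod
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4  # request an NVIDIA T4
        cloud.google.com/gke-spot: "true"                  # run on Spot
      terminationGracePeriodSeconds: 30
      containers:
      - name: tensorflow-container
        image: tensorflow/tensorflow:latest-gpu-jupyter
        # Optional: set your own login token (hypothetical value)
        # env:
        # - name: JUPYTER_TOKEN
        #   value: "change-me"
        ports:
        - containerPort: 8888
        volumeMounts:
        - name: tensorflow-pvc
          mountPath: /tf/saved
        resources:
          limits:
            nvidia.com/gpu: 1
  volumeClaimTemplates:
  - metadata:
      name: tensorflow-pvc
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 100Gi
---
# Headless Service referenced by the StatefulSet's serviceName
apiVersion: v1
kind: Service
metadata:
  name: tensorflow
spec:
  clusterIP: None
  selector:
    pod: tensorflow-pod
  ports:
  - port: 8888
EOF
```

With a StatefulSet named `tensorflow` and one replica, the Pod is called `tensorflow-0`.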
This is a legit Jupyter Notebook that you can use long term!
We also need a load balancer, so we can connect to this notebook from our desktop:
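A load balancer Service for this could be sketched as below. The Service name `tensorflow-jupyter` and the `pod: tensorflow-pod` selector are assumptions and must match the labels on your notebook Pod; port 8888 is the port Jupyter listens on in this image.

```shell
# Write a sketch of the external Service to tensorflow-jupyter.yaml.
# The name and selector label are illustrative assumptions.
cat > tensorflow-jupyter.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-jupyter
spec:
  type: LoadBalancer
  selector:
    pod: tensorflow-pod
  ports:
  - protocol: TCP
    port: 80          # port exposed on the external IP
    targetPort: 8888  # Jupyter's port inside the container
EOF
```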
Deploy them both like so:
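Assuming the two definitions are saved as `tensorflow.yaml` and `tensorflow-jupyter.yaml` (file names I have picked for illustration), deploying them might look like this:

```shell
# Apply both manifests to the current cluster (file names are assumptions).
deploy_notebook() {
  kubectl apply -f tensorflow.yaml -f tensorflow-jupyter.yaml
}
```

Call `deploy_notebook` with your kubectl context pointing at the Autopilot cluster.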
While we’re waiting, we can watch the events in the cluster to make sure it’s going to work, like so (output truncated to show relevant events):
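One way to watch those events is sketched here; the function name is mine, and the command streams events until you press Ctrl+C.

```shell
# Stream cluster events as they happen.
watch_events() {
  kubectl get events --watch
}
```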
With Kubernetes and Autopilot, you’ll initially see FailedScheduling. That’s because at the moment you deploy, there is no resource that can handle your Pod. Then you’ll see TriggeredScaleUp, which is Autopilot adding that resource for you, and finally Scheduled once the Pod has the resources. GPU nodes take a little longer than regular CPU nodes to provision, and this container takes a little while to boot. In my case, it took about 5 minutes in total from scheduling the Pod to it running.
Using the Notebook
Now it’s time to connect. First, get the external IP of the load balancer:
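A sketch of the lookup, assuming the load balancer Service is named `tensorflow-jupyter` (a hypothetical name; use whatever you called yours). It prints nothing until the external IP has been assigned.

```shell
# Print only the external IP of the load balancer Service.
get_notebook_ip() {
  kubectl get service tensorflow-jupyter \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}
```

A plain `kubectl get service` shows the same address in the EXTERNAL-IP column.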
Then browse to it.
The login page suggests a command for finding the token. We can run that command in Kubernetes with exec:
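As a sketch, assuming the suggested command is `jupyter notebook list` and the Pod is named `tensorflow-0` (which follows from a StatefulSet named `tensorflow`, itself an assumption):

```shell
# List running Jupyter servers, including their login tokens, from inside the Pod.
show_token() {
  kubectl exec tensorflow-0 -- jupyter notebook list
}
```

The token is the value after `?token=` in the printed URL.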
Log in by copying the token (in my case, e54a0e8129ca3918db604f5c79e8a9712aa08570e62d2715) into the input box and clicking “Log In”.
Note: if you want to skip this step, you can set your own token in the configuration. Just uncomment the env lines and define your own token.
There are 2 folders: one with some included samples, and “saved”, which is the one we mounted from a persistent disk. I recommend operating out of the “saved” folder to preserve your state between sessions, and moving the included “tensorflow-tutorials” directory into the “saved” directory before getting started. You can use the notebook UI to move the folder and to upload your own notebooks.
Let’s try running a few of the included samples.
The classification.ipynb example
The overfit_and_underfit.ipynb example
We can upload our own projects, like the examples in the TensorFlow docs. Just download the notebook from the docs, upload it in Jupyter to the saved/ folder, and run it.
TensorFlow basics.ipynb tutorial, utilizing GPU acceleration
So there it is. We have a reusable TensorFlow Jupyter notebook running on an NVIDIA T4! This isn’t just a toy either: we hooked up a PersistentVolume so your work is saved (even if the StatefulSet is deleted, or the Pod disrupted). We’re using Spot compute to save some cash. And the entire thing was provisioned from 2 YAML files, with no need to think about the underlying compute hardware. Neat!
Monitoring & Troubleshooting
If you get a message like “The kernel appears to have died. It will restart automatically.”, then the first step is to tail your logs.
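Tailing the logs could look like the sketch below, again assuming the Pod is named `tensorflow-0`.

```shell
# Follow the notebook container's logs until interrupted with Ctrl+C.
tail_logs() {
  kubectl logs -f tensorflow-0
}
```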
A common issue I saw: when running two notebooks at once, I would exhaust the GPU’s memory (CUDA_ERROR_OUT_OF_MEMORY in the logs). The easy fix is to shut down all but the notebook you are actively using.
You can keep an eye on the GPU utilization like so:
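One option is to run `nvidia-smi` inside the container. This sketch assumes the Pod is named `tensorflow-0` and that `nvidia-smi` is on the container’s PATH.

```shell
# Print a one-off snapshot of GPU utilization and memory use.
gpu_status() {
  kubectl exec tensorflow-0 -- nvidia-smi
}
```

Re-run `gpu_status` while a notebook cell is training to see the utilization change.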
If you need to restart the setup for whatever reason, just delete the pod and Kubernetes will recreate it. This is very fast on Autopilot, as the GPU-enabled node resource will hang around for a short time in the cluster.
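A sketch of that restart, assuming the Pod name `tensorflow-0`. The StatefulSet controller recreates the Pod under the same name and reattaches the same disk.

```shell
# Delete the Pod; the StatefulSet will recreate it.
restart_notebook() {
  kubectl delete pod tensorflow-0
}
```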
What’s Next
To shell into the environment and run arbitrary code (i.e. without using the notebook UI), you can use the following. Just be sure to save any data you want to persist in /tf/saved/.
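For example (a sketch that assumes the Pod is named `tensorflow-0` and that the image ships `bash`):

```shell
# Open an interactive shell inside the notebook container.
notebook_shell() {
  kubectl exec -it tensorflow-0 -- bash
}
```

Inside the shell, `cd /tf/saved` first so that anything you create lands on the persistent disk.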
If you want some more tutorials, check out the TensorFlow tutorials and Keras.
I cloned the Keras repo onto my persistent volume to have all those tutorials in my notebook as well.
If you need any additional Python modules for your notebooks, like Pandas, you can set that up the same way. To create a more durable setup, though, you’ll want your own Dockerfile extending the one we used above (let me know if you want me to share such a recipe in a follow-up post).
I ran a few different examples; here’s some of the output:
The output of the Keras timeseries/ipynb/timeseries_weather_forecasting.ipynb example
Output from a random epoch in the Keras generative/ipynb/text_generation_with_miniature_gpt.ipynb example
Cleanup
GPUs are not the cheapest resources, so make sure you delete the resources once you are done! Clean up by removing the StatefulSet and services:
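A sketch of the cleanup, assuming the StatefulSet is named `tensorflow` and the two Services are named `tensorflow` and `tensorflow-jupyter` (hypothetical names; substitute your own):

```shell
# Remove the notebook StatefulSet and both Services.
cleanup_workload() {
  kubectl delete statefulset tensorflow
  kubectl delete service tensorflow tensorflow-jupyter
}
```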
Again, the nice thing about Autopilot is that deleting the Kubernetes resources (in this case a StatefulSet and LoadBalancer) will end the associated charges.
That just leaves the persistent disk. You can either keep it around (so that if you re-create the above StatefulSet, it will be reattached and your work will be saved), or if you no longer need it, then go ahead and delete the disk as well.
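If you do want to remove the disk, one approach is to delete its PersistentVolumeClaim. The claim name below is an assumption: it follows the StatefulSet pattern of claim template name plus Pod name, so list the claims first to confirm yours. Whether the underlying disk is deleted along with the claim depends on the storage class’s reclaim policy.

```shell
# List PersistentVolumeClaims so you can confirm the name.
list_notebook_disks() {
  kubectl get pvc
}

# Delete the claim (assumed name: <claim template>-<pod name>).
delete_notebook_disk() {
  kubectl delete pvc tensorflow-pvc-tensorflow-0
}
```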
You can also delete the cluster if you no longer need it.
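A sketch of the cluster teardown; `gpu-demo` and `us-west1` are placeholder values and should match whatever you used when creating the cluster.

```shell
# Placeholder values (assumptions): use your cluster's actual name and region.
CLUSTER_NAME=gpu-demo
REGION=us-west1

# Delete the Autopilot cluster; gcloud asks for confirmation first.
delete_cluster() {
  gcloud container clusters delete "$CLUSTER_NAME" --region "$REGION"
}
```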
Next Steps
So that’s how easy it is to run GPU workloads on Autopilot!
Just define your Kubernetes workloads, including any GPU resources they need, and we’ll take care of the rest. When you’re done, delete the object and the charges stop right away, with no need to worry about node cleanup.
Head over to https://console.cloud.google.com/kubernetes to get started with your own GKE cluster, and if you’re new to Google Cloud, remember to take advantage of the $300 free trial!