GKE plus Filestore – Improve training times for AI/ML workloads by up to 37%

By mullaned2002

February 2, 2024

Problem statement and use case

Training AI/ML workloads requires a lot of data, which is often stored as large numbers of small files. Think of training driverless cars on large image datasets, or performing protein analysis, where the training set often consists of numerous small files of 100 KB to 2 MB each. When selecting tools for these use cases, users often turn to Google’s Cloud Storage, which provides low latency and high throughput at a reasonable price, and optionally use FUSE as a file interface for portability. However, when the dataset is composed of small files, latency becomes an issue: a training workload can read tens of thousands of small-file batches per epoch, with multiple worker nodes accessing Cloud Storage, so the per-file access cost adds up.

To accelerate load times, users need storage that provides low latency and high throughput. Using Filestore as an “accelerator” can help. Filestore provides fast-access file storage with all the benefits of multiple read/write access and a native POSIX interface. You can still leverage Cloud Storage as your primary storage source, and use Filestore to provide cost-effective, low-latency data access for your worker nodes.

In this blog post, we focus on the important role that Filestore can play in training AI/ML workloads, helping you make informed choices to accelerate your workload performance. Read on to learn how to use this solution, organized by persona and responsibility.

Usage details

The following screenshots highlight how to use GKE and Filestore for your AI/ML applications. You can find the full source code in this repository.

Persona 1: Kubernetes Platform admin staging Filestore for use by data scientists

The Kubernetes platform admin is responsible for creating infrastructure for data science teams to consume. In this case, the platform admin sets up Filestore using a Kubernetes persistent volume and makes it accessible to data scientists via a Jupyter Notebook setup, or if working with multiple users, via JupyterHub. The data scientist can then simply access the notebook and write code.

For this specific example, we used the off-the-shelf premium-rwx GKE StorageClass, which dynamically provisions a Filestore Basic SSD instance under the hood. The Jupyter Pod specification references a PersistentVolumeClaim (PVC); the GKE Filestore CSI driver provisions the backing Filestore share and mounts it into the Pod. The mounted volume path (which serves as a cache directory for data and models) is exposed as an environment variable to the data scientist (notebook user).
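The setup described above can be sketched as two Kubernetes objects. This is a minimal illustration, not the manifests from the original screenshots: only the `premium-rwx` StorageClass comes from the post. The object names, mount path, environment variable name (`FILESTORE_CACHE_DIR`), container image, and requested size are assumptions made for the example, and the size in particular should be checked against the minimum capacity of the Filestore tier you use.

```python
import json

# Assumed names for illustration; none of these come from the original post.
PVC_NAME = "filestore-cache-pvc"
MOUNT_PATH = "/mnt/filestore"
CACHE_ENV_VAR = "FILESTORE_CACHE_DIR"


def filestore_pvc(name: str = PVC_NAME, size: str = "2560Gi") -> dict:
    """PVC that asks the premium-rwx StorageClass for a shared volume."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteMany lets several worker Pods mount the same share.
            "accessModes": ["ReadWriteMany"],
            "storageClassName": "premium-rwx",
            # Placeholder size (assumption); check your tier's minimum.
            "resources": {"requests": {"storage": size}},
        },
    }


def notebook_pod(pvc_name: str = PVC_NAME) -> dict:
    """Pod that mounts the PVC and exposes the mount path as an env var."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "jupyter-notebook"},
        "spec": {
            "containers": [
                {
                    "name": "notebook",
                    # Assumed image; substitute the one your team uses.
                    "image": "jupyter/tensorflow-notebook",
                    "env": [{"name": CACHE_ENV_VAR, "value": MOUNT_PATH}],
                    "volumeMounts": [
                        {"name": "cache", "mountPath": MOUNT_PATH}
                    ],
                }
            ],
            "volumes": [
                {
                    "name": "cache",
                    "persistentVolumeClaim": {"claimName": pvc_name},
                }
            ],
        },
    }


def render_manifests() -> str:
    """Bundle both objects into one JSON List document."""
    bundle = {
        "apiVersion": "v1",
        "kind": "List",
        "items": [filestore_pvc(), notebook_pod()],
    }
    return json.dumps(bundle, indent=2)


if __name__ == "__main__":
    print(render_manifests())
```

Saving the printed JSON to a file and passing it to `kubectl apply -f` would be one way to create both objects. A real deployment would also need GPU resource requests and a Service or JupyterHub in front of the notebook, which this sketch leaves out.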

Screenshot 1: TensorFlow deployment with a Filestore volume

Screenshot 2: Filestore PersistentVolumeClaim

Persona 2: Data scientist accessing data from the Jupyter Notebook

A data scientist simply wants to focus on running experiments. In this example, we train Google’s Vision Transformer (ViT) model on the food101 dataset from Hugging Face, which consists of about 100k images totalling 5 GiB. We use Hugging Face’s cache feature, which automatically caches the data on the filesystem after the first read. The Filestore path is shared with the data science team as an environment variable; by passing that path as the cache location, the data is cached on Filestore. We ran two epochs of training on an a2-highgpu-1g machine, once with the data cached on Filestore and once reading directly from Cloud Storage (the baseline), and observed a 37% improvement in training time with Filestore.
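In the notebook, pointing the Hugging Face cache at the Filestore mount might look like the sketch below. It is not the code from the original screenshots. The environment variable name `FILESTORE_CACHE_DIR` and the `huggingface` subdirectory are assumptions for illustration, and the dataset identifier `food101` is taken from the post as written. `load_dataset` and its `cache_dir` parameter come from the Hugging Face `datasets` library, which is imported lazily so that the path-handling part runs without it.

```python
import os
from pathlib import Path

# Hypothetical variable name; use whatever your platform admin exposes.
CACHE_ENV_VAR = "FILESTORE_CACHE_DIR"


def resolve_cache_dir(env_var: str = CACHE_ENV_VAR) -> Path:
    """Return a cache directory on the Filestore mount, creating it if needed."""
    mount_path = os.environ.get(env_var)
    if not mount_path:
        raise RuntimeError(
            f"{env_var} is not set; ask the platform admin for the mount path"
        )
    cache_dir = Path(mount_path) / "huggingface"
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir


def load_food101(split: str = "train"):
    """Load food101 so that downloaded and processed files land on Filestore."""
    # Imported here so the helper above works without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "food101", split=split, cache_dir=str(resolve_cache_dir())
    )
```

The first call to `load_food101()` downloads and prepares the data under the cache directory. Later calls, including calls from other Pods that mount the same share, can reuse those cached files instead of fetching the data again.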

Screenshot 3: Loading the dataset from Hugging Face and enabling caching

Screenshot 4: Jupyter Notebook file explorer with the mounted Filestore directory

Screenshots 5, 6, and 7: Download the model, start training, and measure training time

Table 1. Training results

| Storage option | Training time (seconds) | Improvement |
| --- | --- | --- |
| Cloud Storage (baseline) | 4837 | (baseline) |
| Filestore Basic SSD (Filestore Zonal High Band preferred) | 3006 | 37% |
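The improvement figure follows from the two training times in Table 1. The short sketch below reproduces the arithmetic. The function name is ours, and the only inputs are the two measured times from the table.

```python
def relative_reduction(baseline_s: float, accelerated_s: float) -> float:
    """Fraction of the baseline training time that was saved."""
    return (baseline_s - accelerated_s) / baseline_s


cloud_storage_s = 4837  # baseline: reading directly from Cloud Storage
filestore_s = 3006      # data cached on Filestore

saved_s = cloud_storage_s - filestore_s
reduction = relative_reduction(cloud_storage_s, filestore_s)
speedup = cloud_storage_s / filestore_s

print(f"time saved: {saved_s} s")      # time saved: 1831 s
print(f"reduction:  {reduction:.1%}")  # reduction:  37.9%
print(f"speedup:    {speedup:.2f}x")   # speedup:    1.61x
```

The computed reduction is about 37.9%, which matches the reported 37% when rounded down. Put the other way around, the run with Filestore finished about 1.6 times faster than the baseline.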

Usage considerations

In this blog post, we’ve highlighted the benefits of using Filestore as an accelerator in front of Cloud Storage, especially when your dataset is composed of numerous small files. You will be billed for Filestore instances, but the storage cost may be well worth it when it shortens your training times and, with them, your GPU resource consumption. When files are larger, reading data directly from Cloud Storage may be a better fit. Choose the best architecture based on your use case.
