Cloud Storage as a File System in AI Training

By mullaned2002

November 17, 2021

711

Cloud Storage is a common choice for Vertex AI and AI Platform users to store their training data, models, checkpoints and logs. Now, with Cloud Storage FUSE, training jobs on both platforms can access their data on Cloud Storage as files in the local file system.

This post introduces the Cloud Storage FUSE for Vertex AI Custom Training. On AI Platform Training, the feature is very similar.

Cloud Storage FUSE provides 3 benefits over the traditional ways of accessing Cloud Storage:

Training jobs can start quickly without downloading any training data.

Training jobs can perform I/O easily at scale, without the friction of calling the Cloud Storage APIs, handling the responses, or integrating with client-side libraries.

Training jobs can leverage the optimized performance of Cloud Storage FUSE.

The problems

Traditionally, training jobs have two ways to use data from Cloud Storage.

They can use gsutil to download the entire dataset prior to training. This may take hours depending on the dataset size, which significantly slows down the start-up of the jobs.

They can call Cloud Storage APIs directly or from a client library integrated. This way greatly adds complexity to the training code and thus the cost for development and maintenance.

Cloud Storage FUSE

Cloud Storage FUSE is a File System in User Space (FUSE) mounted on Vertex AI systems.

When you start a custom training job, the job sees a directory /gcs which contains all the Cloud Storage buckets as subdirectories. The job can visit the subdirectories (ie. buckets) when certain permissions are granted.

For instance, training jobs can read from file /gcs/example-bucket/data.csv to get the training data stored in object gs://example-bucket/data.csv

Training jobs can also write to the bucket:

Permissions

Users can assign service accounts to the training jobs to configure their permissions for the Cloud Storage buckets.

If the training job is assigned without a service account, it is allowed to access all the buckets owned by the same project.

If the training job is assigned with a service account that has Cloud Storage Roles, it has the permissions given by the roles.

For instance, you may create a service account as

storage.objectAdmin to bucket A, and

storage.objectViewer to bucket B.

If you assign it to your training job, your training job will be able to

read and write in bucket A, and

read only in bucket B.

The training job will fail with error “permission denied” if it tries to write to bucket B.

Performance

The I/O is often a bottleneck for training jobs with large datasets. Here are some tips to improve the read throughput of the Cloud Storage FUSE:

Store data in large files to reduce the number of files used in the training. Fewer files mean less lookup overhead in locating and opening objects in Cloud Storage.

Use multiple threads. Higher concurrency utilizes the bandwidth better.

Keep the files warm. Files to be accessed frequently (ie warm) are generally better cached and have better performance being read.

Restrictions

Cloud Storage FUSE is not a POSIX compliant file system. Therefore, some usage in a POSIX file system would have unwanted results, which should be avoided.

Directories:

The root directory `/gcs` is not readable. If you run ls /gcs, you will get an “Input/output error”. However, it is okay to read the bucket root such as ls /gcs/example-bucket.

Renaming a directory is not atomic. A renaming operation interrupted would leave a partial result with some files in the new directory, while others in the old directory. A directory with too many direct and indirect files cannot be renamed.

Files:

Hard links are not supported.

File metadata such as ownership, permissions, mtime, extended attributes, are not supported. Do not rely on file metadata for training logic.

Flushing files pushes the entire file to Cloud Storage, which is expensive. Closing a file leads to a flush. Therefore, one should avoid frequent file closes and flushes.

Concurrent write to a file would lead to data corruption.

Logs

You can find the logs from Cloud Storage FUSE to help you diagnose the errors in training.

First, you follow the link to the Cloud Log Explorer on the training job’s page in Pantheon. In the explorer, you can run queries to inspect the logs generated from your training job.

Second, you can view the logs with “gcsfuse” in the resource.labels.taskName property. For instance, the task name “workerpool0-0.gcsfuse” indicates the log is from the Cloud Storage FUSE mounted for the first worker “0” in the first worker pool “workerpool0”.

What’s next

You can find more information on Cloud Storage Fuse in documentation:

http://cloud/vertex-ai/docs/training/code-requirements#fuse

https://cloud.google.com/storage/docs/gcs-fuse

You can also find code samples using Cloud Storage FUSE for Vertex AI Custom Training:

https://github.com/GoogleCloudPlatform/vertex-ai-samples/tree/master/community-content

Cloud BlogRead More

Previous articleTaking customer conversations to the next level with AI-powered Business Messages

Next articlePost Title

Cloud Storage as a File System in AI Training

The problems

Cloud Storage FUSE

Permissions

Performance

Restrictions

Logs

What’s next

Leverage enterprise data with Denodo and Vertex AI for generative AI applications

TypeScript takes aim at truthy and nullish bugs

Hex-LLM: High-efficiency large language model serving on TPUs in Vertex AI Model Garden

LEAVE A REPLY Cancel reply

Most Popular

Schneider Electric automates Salesforce account hierarchy management with generative artificial intelligence (AI) using Amazon Aurora and Amazon Bedrock

Leverage enterprise data with Denodo and Vertex AI for generative AI applications

TypeScript takes aim at truthy and nullish bugs

Make relevant movie recommendations using Amazon Neptune, Amazon Neptune Machine Learning, and Amazon OpenSearch Service

Recent Comments

EDITOR PICKS

Exploring the Click Element Variable in Google Tag Manager

How to track events with Google Tag Manager and Google Analytics

Data Layer Variable in GTM: What, Why, and Where?

POPULAR POSTS

Supporting power grids with demand response at Google data centers

Unify analytics with Spark procedures in BigQuery, now generally available

Frugality meets Accuracy: Cost-efficient training of GPT NeoX and Pythia models with AWS Trainium

POPULAR CATEGORY