Reading and storing data for custom model training on Vertex AI

By mullaned2002

January 11, 2023


Before you can train ML models in the cloud, you need to get your data to the cloud.

But when it comes to storing data on Google Cloud there are a lot of different options. Not to mention the different ways you can read in data when designing input pipelines for custom models. Should you use the Cloud Storage API? Copy data directly to the machine where your training job is running? Use the data I/O library of your preferred ML framework?

To make things a little easier for you, we’ve outlined some recommendations for reading data in your custom training jobs on Vertex AI. Whether your use case requires structured or unstructured data, these tips will help you to build more efficient input pipelines with Vertex AI.

Unstructured Data

Cloud Storage FUSE

If you have unstructured data, such as images, the best place to start is by uploading your data to a Cloud Storage bucket. Instead of using gsutil to copy all of the data over to the machine where your custom training job will run, or calling the Cloud Storage APIs directly or from a client library, you can leverage Cloud Storage FUSE.

Using the Cloud Storage FUSE tool, training jobs on Vertex AI can access data on Cloud Storage as files in the local file system. When you start a custom training job, the job sees a directory /gcs, which contains all your Cloud Storage buckets as subdirectories. This happens automatically without any extra work on your part.

Not only does this make it easy to access your data, but it also provides high throughput for large file sequential reads.

For example, if your data is a collection of JPEG files in a Cloud Storage bucket called training-images you can access this data in your training code with the path /gcs/training-images.
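Training code often receives data locations as gs:// URIs rather than local paths. The sketch below shows one way to translate such a URI into the path exposed under /gcs. The names GCS_FUSE_ROOT and gcs_uri_to_fuse_path are hypothetical helpers introduced for illustration, not part of any Vertex AI library.

```python
from pathlib import PurePosixPath

# Directory under which a Vertex AI custom training job sees its buckets.
GCS_FUSE_ROOT = "/gcs"


def gcs_uri_to_fuse_path(uri: str) -> str:
    """Translate gs://bucket/object into the matching /gcs/bucket/object path."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"expected a gs:// URI, got {uri!r}")
    remainder = uri[len(prefix):].strip("/")
    if not remainder:
        raise ValueError("the URI must name a bucket")
    return str(PurePosixPath(GCS_FUSE_ROOT, remainder))


print(gcs_uri_to_fuse_path("gs://training-images/train/cats"))
# /gcs/training-images/train/cats
```

The returned string can then be passed to any API that expects a local directory, such as the framework data loaders shown in this section.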

If you were to build a TensorFlow model, your code might look something like this:

```python
import tensorflow as tf

DATA_DIR = '/gcs/training-images'
# Builds a tf.data.Dataset of (images, labels) batches;
# labels are inferred from the subdirectory names
dataset = tf.keras.utils.image_dataset_from_directory(directory=DATA_DIR)
```

And if you’re a PyTorch user, your code might look something like this:

```python
import torch
from torchvision import datasets, transforms

DATA_DIR = '/gcs/training-images'
# Convert images to equally sized tensors so the DataLoader can batch them
# (the 224x224 target size is an illustrative choice)
transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
dataset = datasets.ImageFolder(DATA_DIR, transform=transform)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
```

Mount an NFS Share

While Cloud Storage FUSE is easy to use and will work for most cases, if you need particularly high throughput you can consider mounting a Network File System (NFS) share for custom training. This allows your jobs to access remote files as if they are local with high throughput and low latency.

Before you begin, there are two steps you’ll need to take:

First, create an NFS share in a Virtual Private Cloud (VPC). Your share must be accessible without authentication.

Then, follow the instructions in Set up VPC Network Peering to peer Vertex AI with the VPC that hosts your NFS share.

Once you have the NFS share and VPC peering set up, you are ready to use NFS with your custom training jobs on Vertex AI.

When you create your custom training job, you’ll need to specify the nfsMounts and network fields. You can do this in a config.yaml file:

```yaml
network: projects/PROJECT_NUMBER/global/networks/default
workerPoolSpecs:
  - machineSpec:
      machineType: n1-standard-8
    replicaCount: 1
    containerSpec:
      imageUri: 'gcr.io/PROJECT_ID/nfs-demo:latest'
    nfsMounts:
      - server: 10.76.0.10
        path: /fileshare
        mountPoint: my_mount
```

And then pass in the config when submitting the job:

```shell
gcloud ai custom-jobs create \
  --region={LOCATION} \
  --display-name={JOB_NAME} \
  --config=config.yaml
```
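Inside the training container, the share then appears as an ordinary directory. The sketch below lists the files under a mount. It assumes that Vertex AI exposes the share under /mnt/nfs/ followed by the mountPoint value (my_mount in the config above); check that prefix against the current documentation. NFS_ROOT and list_nfs_files are hypothetical names used only for illustration.

```python
import os

# Assumption: NFS shares are mounted under /mnt/nfs/<mountPoint> in the job.
NFS_ROOT = "/mnt/nfs"


def list_nfs_files(mount_point, subdir="", root=NFS_ROOT):
    """Return the sorted paths of all files under <root>/<mount_point>/<subdir>."""
    base = os.path.normpath(os.path.join(root, mount_point, subdir))
    if not os.path.isdir(base):
        raise FileNotFoundError(f"NFS mount not found at {base}")
    found = []
    for dirpath, _dirnames, filenames in os.walk(base):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return sorted(found)


# Only runs when the mount actually exists, for example inside the training job.
if os.path.isdir(os.path.join(NFS_ROOT, "my_mount")):
    for path in list_nfs_files("my_mount"):
        print(path)
```

Because the files are reached through regular paths, the same framework data loaders used with Cloud Storage FUSE should work by pointing them at the mounted directory.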

Structured Data

Multiple options exist when you want to train a machine learning model on structured data. Most of the time, you’ll use BigQuery for storing the training data. When you can’t use BigQuery, for example, if you want to use the TFRecord format, you can follow the instructions described in the unstructured section above.

In this second part of the post, we’ll discuss options for reading training data from BigQuery. Other approaches exist, but we’ll focus on the ones that work well and are easiest to get started with.

Structured data with BigQuery

TensorFlow and BigQuery

If your data already sits in BigQuery, that’s a great start. If you’re a TensorFlow user, you can use the BigQuery connector to read training data. The connector relies on the BigQuery Storage API, which provides fast access to BigQuery’s managed storage using an RPC-based protocol.

The BigQuery connector mostly follows the BigQuery Storage API flow, but hides the complexity associated with decoding serialized data rows into Tensors. You need to follow these steps:

1. Create a BigQueryClient.

2. Use the BigQueryClient to create a BigQueryReadSession object corresponding to a read session. A read session divides the contents of a BigQuery table into one or more streams for reading the data.

3. Call parallel_read_rows on the BigQueryReadSession object to read from multiple BigQuery streams in parallel.

If you’re using TensorFlow, your code might look something like this:

```python
from tensorflow_io.bigquery import BigQueryClient
from tensorflow_io.bigquery import BigQueryReadSession

# create BigQueryClient
client = BigQueryClient()

# create BigQueryReadSession
read_session = client.read_session(
    "projects/" + PROJECT_ID,  # parent: the project in which the read session is created
    PROJECT_ID,
    TABLE_ID,
    DATASET_ID,
    selected_fields=[],  # names of the columns to read
    output_types=[],     # one TensorFlow dtype per selected field
    default_values=[],
    # set the DataFormat
    data_format=BigQueryClient.DataFormat.AVRO)

# call parallel_read_rows
dataset = read_session.parallel_read_rows()
```
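The dataset returned by parallel_read_rows yields one dictionary per row, keyed by column name, while most training loops expect (features, label) pairs. A minimal sketch of that conversion follows. The helper split_features_and_label and the column names age, income, and label are hypothetical and only serve as an illustration.

```python
def split_features_and_label(row, label_key):
    """Turn one row (a dict of column name -> value) into (features, label)."""
    features = {name: value for name, value in row.items() if name != label_key}
    return features, row[label_key]


# With a BigQuery read session this would be applied per row, for example:
#   dataset = read_session.parallel_read_rows()
#   dataset = dataset.map(lambda row: split_features_and_label(row, "label")).batch(32)

example_row = {"age": 42.0, "income": 51000.0, "label": 1}
features, label = split_features_and_label(example_row, "label")
print(features, label)
# {'age': 42.0, 'income': 51000.0} 1
```

The function works on plain Python values as well as on tensors, so it can be checked locally before being mapped over the real dataset.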

BigQuery alternatives

If you’re not using TensorFlow, there are some alternatives you can look at. Here are two, depending on whether you are a PyTorch or an XGBoost user.

PyTorch and BigQuery

If you’re a PyTorch user, there are multiple options for reading data from BigQuery. We recommend you create an iterable-style DataPipe by subclassing the torchdata.datapipes.iter.IterDataPipe class. When creating a DataPipe, you can leverage the BigQuery Storage Read API for reading your training data.
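As a rough sketch of how such a pipeline could be fed, the code below opens a read session with the google-cloud-bigquery-storage client and splits its streams across workers. The function names rows_for_worker and open_read_session are hypothetical, the round-robin split is one possible design rather than a prescribed one, and the BigQuery call is untested here. It assumes the google-cloud-bigquery-storage package, plus fastavro for decoding AVRO rows, is installed wherever open_read_session is called.

```python
def rows_for_worker(stream_names, read_stream, worker_id=0, num_workers=1):
    """Yield rows from the streams assigned to one worker (round-robin split)."""
    for name in stream_names[worker_id::num_workers]:
        yield from read_stream(name)


def open_read_session(project_id, dataset_id, table_id, max_streams=4):
    """Create a BigQuery read session and return (client, list of stream names)."""
    # Imported lazily so rows_for_worker stays usable without the client library.
    from google.cloud.bigquery_storage import BigQueryReadClient, types

    client = BigQueryReadClient()
    requested = types.ReadSession(
        table=f"projects/{project_id}/datasets/{dataset_id}/tables/{table_id}",
        data_format=types.DataFormat.AVRO,
    )
    session = client.create_read_session(
        parent=f"projects/{project_id}",
        read_session=requested,
        max_stream_count=max_streams,
    )
    return client, [stream.name for stream in session.streams]


# Local demonstration with fake streams standing in for BigQuery:
fake_streams = {"s0": [{"x": 1}], "s1": [{"x": 2}], "s2": [{"x": 3}]}
rows = list(rows_for_worker(sorted(fake_streams), fake_streams.__getitem__,
                            worker_id=0, num_workers=2))
print(rows)
# [{'x': 1}, {'x': 3}]
```

Inside the `__iter__` method of an IterDataPipe subclass, read_stream could be something like `lambda name: client.read_rows(name).rows()`, with worker_id and num_workers taken from `torch.utils.data.get_worker_info()` so that each DataLoader worker reads a disjoint set of streams.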

XGBoost and BigQuery

When using XGBoost with Vertex AI, you can scale Python workloads on BigQuery data using Dask and NVIDIA RAPIDS. Dask integrates with XGBoost, and it can be extended with RAPIDS, a suite of open-source libraries and APIs for running GPU-accelerated pipelines directly on data in BigQuery storage. The code for reading a BigQuery table into Dask would look something like this:

```python
import dask_bigquery

# read data from BigQuery
dask_df = dask_bigquery.read_gbq(
    project_id="your_project_id",
    dataset_id="your_dataset",
    table_id="your_table",
)

# inspect dataframe
dask_df.head()
```

Alternatively, BigQuery has support for boosted tree models through BigQuery ML. This way you don’t have to take your data out of BigQuery.
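To give an idea of what this looks like, the sketch below builds a BigQuery ML CREATE MODEL statement for a boosted tree model and shows how it could be submitted with the google-cloud-bigquery client. The helper names and the model, table, and label column names are placeholders chosen for illustration. The statement uses only the model_type and input_label_cols options, so a real model would likely need further tuning.

```python
def boosted_tree_model_sql(model, source_table, label_col, classifier=True):
    """Build a BigQuery ML statement that trains a boosted tree model."""
    model_type = "BOOSTED_TREE_CLASSIFIER" if classifier else "BOOSTED_TREE_REGRESSOR"
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = '{model_type}', input_label_cols = ['{label_col}'])\n"
        f"AS SELECT * FROM `{source_table}`"
    )


def train_in_bigquery(sql):
    """Run the statement in BigQuery and wait for training to finish."""
    # Imported lazily; requires the google-cloud-bigquery package and credentials.
    from google.cloud import bigquery

    return bigquery.Client().query(sql).result()


# Placeholder names; replace them with your own dataset, table, and label column.
sql = boosted_tree_model_sql("your_dataset.your_model", "your_dataset.your_table", "label")
print(sql)
```

Calling train_in_bigquery(sql) would start the training job inside BigQuery, so the training data never has to be exported.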

All in one overview

What is next

Efficient data pipelines are a key piece of effective ML experimentation and iteration. In this blog we looked at several recommendations for reading structured and unstructured data in your custom training jobs. If you’re looking to get started training some ML models of your own on Vertex AI, check out this introductory video series or run through this codelab. Now it’s time to train some ML models of your own!
