
# A Gentle Introduction to tensorflow.data API

Last Updated on July 12, 2022

When we build and train a Keras deep learning model, the training data can be provided in several different ways. Presenting the data as a NumPy array or a TensorFlow tensor is a common one. Another is to write a Python generator function and let the training loop read data from it. Yet another is to use a tf.data dataset.

In this tutorial, we will see how to use a tf.data dataset with a Keras model. After finishing this tutorial, you will know:

How to create and use a tf.data dataset
The benefit of doing so compared to a generator function

Let's get started.

A Gentle Introduction to tensorflow.data API
Photo by Monika MG. Some rights reserved.

## Overview

Training a Keras Model with NumPy Array and Generator Function
Creating a Dataset using tf.data
Creating a Dataset from Generator Function
Dataset with Prefetch

## Training a Keras Model with NumPy Array and Generator Function

Before we see how the tf.data API works, let's review how we usually train a Keras model.

First, we need a dataset. An example is the fashion MNIST dataset that comes with the Keras API. It has 60,000 training samples and 10,000 test samples, each a 28×28-pixel grayscale image, and the corresponding classification labels are encoded as integers 0 to 9.

The dataset is provided as NumPy arrays. We can then build a Keras model for classification and pass the NumPy arrays as data to the model's fit() function.

The complete code is as follows:

import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets.fashion_mnist import load_data
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Sequential

# Load fashion MNIST as NumPy arrays
(train_image, train_label), (test_image, test_label) = load_data()
print(train_image.shape)
print(train_label.shape)
print(test_image.shape)
print(test_label.shape)

# Fully connected classifier for 28x28 grayscale images
model = Sequential([
    Flatten(input_shape=(28,28)),
    Dense(100, activation="relu"),
    Dense(100, activation="relu"),
    Dense(10, activation="sigmoid")
])
model.compile(optimizer="adam",  # optimizer choice is an assumption
              loss="sparse_categorical_crossentropy",
              metrics=["sparse_categorical_accuracy"])
history = model.fit(train_image, train_label,
                    batch_size=32, epochs=50,
                    validation_data=(test_image, test_label), verbose=0)

print(model.evaluate(test_image, test_label))

plt.plot(history.history['val_sparse_categorical_accuracy'])
plt.show()

Running this code will print out the following:

(60000, 28, 28)
(60000,)
(10000, 28, 28)
(10000,)
313/313 [==============================] - 0s 392us/step - loss: 0.5114 - sparse_categorical_accuracy: 0.8446
[0.5113903284072876, 0.8446000218391418]

It also creates a plot of the validation accuracy over the 50 epochs of training.

The other way of training the same network is to provide the data from a Python generator function instead of a NumPy array. A generator function is one that uses a yield statement to emit data piece by piece: it pauses after each yield and resumes when the data consumer asks for the next item. A generator for the fashion MNIST dataset can be created as follows:

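def batch_generator(image, label, batchsize):
    N = len(image)
    i = 0
    while True:
        # Emit one batch of images with the matching labels
        yield image[i:i+batchsize], label[i:i+batchsize]
        i = i + batchsize
        # Restart from the beginning when a full batch no longer fits
        if i + batchsize > N:
            i = 0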

This function is supposed to be called with the syntax batch_generator(train_image, train_label, 32). It scans the input arrays in batches indefinitely. Once it reaches the end of the arrays, it restarts from the beginning.
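To see the wrap-around behavior concretely, the following minimal sketch runs the same generator on a toy input. Plain Python lists of ten items stand in for the image and label arrays (an assumption made only to keep the example small), and `itertools.islice` takes the first four batches.

```python
from itertools import islice

def batch_generator(image, label, batchsize):
    N = len(image)
    i = 0
    while True:
        yield image[i:i+batchsize], label[i:i+batchsize]
        i = i + batchsize
        if i + batchsize > N:
            i = 0

# Toy stand-ins for the image and label arrays (hypothetical data)
toy_image = list(range(10))
toy_label = [x * 10 for x in toy_image]

# The generator never ends, so take only the first four batches
batches = list(islice(batch_generator(toy_image, toy_label, 4), 4))
for image_batch, label_batch in batches:
    print(image_batch, label_batch)
# [0, 1, 2, 3] [0, 10, 20, 30]
# [4, 5, 6, 7] [40, 50, 60, 70]
# [0, 1, 2, 3] [0, 10, 20, 30]
# [4, 5, 6, 7] [40, 50, 60, 70]
```

Because 10 is not a multiple of 4, items 8 and 9 are never emitted: the generator restarts whenever a full batch no longer fits. With 60,000 training samples and a batch size of 32, nothing is skipped, since 60,000 is divisible by 32.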

Training a Keras model with a generator is similar, using the fit() function:

history = model.fit(batch_generator(train_image, train_label, 32),
                    steps_per_epoch=len(train_image)//32,
                    epochs=50, validation_data=(test_image, test_label), verbose=0)

Instead of providing the data and labels separately, we just need to provide the generator, as it emits both. When data are presented as NumPy arrays, Keras can tell how many samples there are from the length of the array, so it knows an epoch is complete when the entire dataset has been used once. However, our generator function emits batches indefinitely, so we need to tell Keras when an epoch ends, using the steps_per_epoch argument to the fit() function.

In the above code, we provided the validation data as NumPy arrays, but we can also use a generator instead and specify the validation_steps argument.
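As a rough illustration of the validation_steps argument, here is a minimal sketch in which both the training and the validation data come from generators. The random data, the tiny model, and the names `x_train`, `y_train`, `x_val`, and `y_val` are all hypothetical stand-ins chosen so the example runs in a second or two; they are not the fashion MNIST setup used elsewhere in this post.

```python
import numpy as np
import tensorflow as tf

def batch_generator(image, label, batchsize):
    N = len(image)
    i = 0
    while True:
        yield image[i:i+batchsize], label[i:i+batchsize]
        i = i + batchsize
        if i + batchsize > N:
            i = 0

# Random stand-in data (hypothetical): 64 training and 32 validation samples
rng = np.random.default_rng(0)
x_train = rng.random((64, 4)).astype("float32")
y_train = rng.integers(0, 3, size=64)
x_val = rng.random((32, 4)).astype("float32")
y_val = rng.integers(0, 3, size=32)

# Tiny toy classifier, only here to have something to fit
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["sparse_categorical_accuracy"])

batch_size = 16
history = model.fit(
    batch_generator(x_train, y_train, batch_size),
    steps_per_epoch=len(x_train) // batch_size,        # 4 batches per epoch
    validation_data=batch_generator(x_val, y_val, batch_size),
    validation_steps=len(x_val) // batch_size,         # 2 batches per validation run
    epochs=2, verbose=0)
print(sorted(history.history.keys()))
```

Just as steps_per_epoch tells Keras where a training epoch ends, validation_steps tells it how many batches to draw from the never-ending validation generator at the end of each epoch.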

The following is the complete code using the generator function. Its output is the same as in the previous example, apart from run-to-run randomness in training:

import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets.fashion_mnist import load_data
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Sequential

(train_image, train_label), (test_image, test_label) = load_data()
print(train_image.shape)
print(train_label.shape)
print(test_image.shape)
print(test_label.shape)

model = Sequential([
    Flatten(input_shape=(28,28)),
    Dense(100, activation="relu"),
    Dense(100, activation="relu"),
    Dense(10, activation="sigmoid")
])

def batch_generator(image, label, batchsize):
    N = len(image)
    i = 0
    while True:
        yield image[i:i+batchsize], label[i:i+batchsize]
        i = i + batchsize
        if i + batchsize > N:
            i = 0

model.compile(optimizer="adam",  # optimizer choice is an assumption
              loss="sparse_categorical_crossentropy",
              metrics=["sparse_categorical_accuracy"])
# The generator is endless, so steps_per_epoch marks the end of each epoch
history = model.fit(batch_generator(train_image, train_label, 32),
                    steps_per_epoch=len(train_image)//32,
                    epochs=50, validation_data=(test_image, test_label), verbose=0)
print(model.evaluate(test_image, test_label))

plt.plot(history.history['val_sparse_categorical_accuracy'])
plt.show()

## Creating a Dataset using tf.data

Given we have the fashion MNIST data loaded, we can convert it into a tf.data dataset, like the following:

dataset = tf.data.Dataset.from_tensor_slices((train_image, train_label))
print(dataset.element_spec)

This prints the dataset's spec, as follows:

(TensorSpec(shape=(28, 28), dtype=tf.uint8, name=None),
TensorSpec(shape=(), dtype=tf.uint8, name=None))

We can see the data is a tuple (as we passed a tuple as the argument to the from_tensor_slices() function), in which the first element has shape (28,28) and the second element is a scalar. Both elements are stored as 8-bit unsigned integers.

If we do not present the data as a tuple of two NumPy arrays when we create the dataset, we can combine them later. The following creates the same dataset, but first builds separate datasets for the image data and the labels before combining them:

train_image_data = tf.data.Dataset.from_tensor_slices(train_image)
train_label_data = tf.data.Dataset.from_tensor_slices(train_label)
dataset = tf.data.Dataset.zip((train_image_data, train_label_data))
print(dataset.element_spec)

This will print the same spec:

(TensorSpec(shape=(28, 28), dtype=tf.uint8, name=None),
TensorSpec(shape=(), dtype=tf.uint8, name=None))

The zip() function in dataset is like the zip() function in Python in the sense that it matches data one-by-one from multiple datasets into a tuple.
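The equivalence of the two constructions can be checked on a small example. The sketch below uses a made-up three-sample array pair (`features` and `labels` are hypothetical names, not part of the fashion MNIST code) and compares the elements produced by each dataset.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: three samples with two features each
features = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.uint8)
labels = np.array([0, 1, 0], dtype=np.uint8)

# Construction 1: pass a tuple of arrays directly
from_tuple = tf.data.Dataset.from_tensor_slices((features, labels))

# Construction 2: build two datasets, then zip them
zipped = tf.data.Dataset.zip((
    tf.data.Dataset.from_tensor_slices(features),
    tf.data.Dataset.from_tensor_slices(labels)))

# Convert each element to plain Python values for easy comparison
pairs_a = [(x.tolist(), int(y)) for x, y in from_tuple.as_numpy_iterator()]
pairs_b = [(x.tolist(), int(y)) for x, y in zipped.as_numpy_iterator()]

print(pairs_a)             # [([1, 2], 0), ([3, 4], 1), ([5, 6], 0)]
print(pairs_a == pairs_b)  # True
```

Both datasets yield the same (sample, label) pairs in the same order, so either form can be handed to fit().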

One benefit of using a tf.data dataset is the flexibility in handling the data. Below is the complete code showing how we can train a Keras model using a dataset, in which the batch size is set on the dataset rather than in fit():

import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets.fashion_mnist import load_data
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Sequential

(train_image, train_label), (test_image, test_label) = load_data()
dataset = tf.data.Dataset.from_tensor_slices((train_image, train_label))

model = Sequential([
    Flatten(input_shape=(28,28)),
    Dense(100, activation="relu"),
    Dense(100, activation="relu"),
    Dense(10, activation="sigmoid")
])
model.compile(optimizer="adam",  # optimizer choice is an assumption
              loss="sparse_categorical_crossentropy",
              metrics=["sparse_categorical_accuracy"])

# The batch size is set on the dataset, not in fit()
history = model.fit(dataset.batch(32),
                    epochs=50,
                    validation_data=(test_image, test_label),
                    verbose=0)

print(model.evaluate(test_image, test_label))

plt.plot(history.history['val_sparse_categorical_accuracy'])
plt.show()

This is the simplest use case for a dataset. If we dive deeper, we can see that a dataset is just an iterable. Therefore, we can print out each sample in a dataset using the following:

for image, label in dataset:
    print(image)  # array of shape (28,28) in tf.Tensor
    print(label)  # integer label in tf.Tensor

The dataset has many functions built in. The batch() we used before is one of them. If we create batches from the dataset and print them, we have the following:

for image, label in dataset.batch(32):
    print(image)  # array of shape (32,28,28) in tf.Tensor
    print(label)  # array of shape (32,) in tf.Tensor

Here, each item we get from the batched dataset is not a sample but a batch of samples. We also have functions such as map(), filter(), and reduce() for sequence transformation, or concatenate() and interleave() for combining with another dataset. There are also repeat(), take(), take_while(), and skip(), like their familiar counterparts in Python's itertools module. A full list of the functions can be found in the API documentation.
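A few of these functions can be tried on a small dataset of integers. The following is a minimal sketch using `tf.data.Dataset.range(10)` as toy data; the helper `to_list()` is a hypothetical convenience defined here only to turn a dataset into a plain Python list for printing.

```python
import tensorflow as tf

def to_list(ds):
    """Hypothetical helper: collect a dataset of scalars into a Python list."""
    return [int(v) for v in ds.as_numpy_iterator()]

numbers = tf.data.Dataset.range(10)   # 0, 1, ..., 9

# filter() keeps the even numbers, map() squares them
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(to_list(evens_squared))         # [0, 4, 16, 36, 64]

# reduce() folds the whole dataset into one value, starting from 0
total = numbers.reduce(tf.constant(0, dtype=tf.int64), lambda acc, x: acc + x)
print(int(total.numpy()))             # 45

# skip() and take() slice the sequence; repeat() cycles it
print(to_list(numbers.skip(3).take(4)))     # [3, 4, 5, 6]
print(to_list(numbers.take(3).repeat(2)))   # [0, 1, 2, 0, 1, 2]

# concatenate() appends another dataset at the end
print(to_list(numbers.take(2).concatenate(numbers.skip(8))))  # [0, 1, 8, 9]
```

Each call returns a new dataset and leaves `numbers` unchanged, which is why the calls can be chained freely.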

## Creating a Dataset from Generator Function

So far, we saw how a dataset can be used in place of a NumPy array in training a Keras model. Indeed, a dataset can also be created from a generator function. But instead of a generator function that generates a batch, as we saw in one of the examples above, here we make a generator function that generates one sample at a time. The following is the function:

import numpy as np
import tensorflow as tf

def shuffle_generator(image, label, seed):
    # Shuffle the indices, then emit one (image, label) pair at a time
    idx = np.arange(len(image))
    np.random.default_rng(seed).shuffle(idx)
    for i in idx:
        yield image[i], label[i]

dataset = tf.data.Dataset.from_generator(
    shuffle_generator,
    args=[train_image, train_label, 42],
    output_signature=(
        tf.TensorSpec(shape=(28,28), dtype=tf.uint8),
        tf.TensorSpec(shape=(), dtype=tf.uint8)))
print(dataset.element_spec)

This function randomizes the input array by shuffling the index vector. Then it generates one sample at a time. Unlike the previous example, this generator will end when the samples from the array are exhausted.

We create a dataset from the function using from_generator(). We need to provide the name of the generator function (instead of an instantiated generator) and also the output signature of the dataset. This is required because the tf.data.Dataset API cannot infer the dataset spec before the generator is consumed.
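One consequence of passing the function rather than an instantiated generator is that the dataset can call it again each time the dataset is iterated, for example once per epoch. The sketch below illustrates this with a five-sample toy array (`toy_image` and `toy_label` are hypothetical stand-ins for the fashion MNIST arrays) by iterating the same dataset twice.

```python
import numpy as np
import tensorflow as tf

def shuffle_generator(image, label, seed):
    idx = np.arange(len(image))
    np.random.default_rng(seed).shuffle(idx)
    for i in idx:
        yield image[i], label[i]

# Hypothetical toy data: five one-pixel "images" labeled 0 to 4
toy_image = np.arange(5, dtype=np.uint8).reshape(5, 1)
toy_label = np.arange(5, dtype=np.uint8)

dataset = tf.data.Dataset.from_generator(
    shuffle_generator,
    args=[toy_image, toy_label, 42],
    output_signature=(
        tf.TensorSpec(shape=(1,), dtype=tf.uint8),
        tf.TensorSpec(shape=(), dtype=tf.uint8)))

# Each pass over the dataset starts a fresh generator
first_pass = [int(label) for _, label in dataset.as_numpy_iterator()]
second_pass = [int(label) for _, label in dataset.as_numpy_iterator()]

print(first_pass)
print(first_pass == second_pass)   # True
```

The second pass is not empty, which it would be if a single exhausted generator object were reused. Note also that with a fixed seed the shuffled order is the same in every pass, so every epoch sees the samples in the same order.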

Running the above code will print the same spec as before:

(TensorSpec(shape=(28, 28), dtype=tf.uint8, name=None),
TensorSpec(shape=(), dtype=tf.uint8, name=None))

Such a dataset is functionally equivalent to the dataset that we created previously. Hence we can use it for training as before. The following is the complete code:

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets.fashion_mnist import load_data
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Sequential

(train_image, train_label), (test_image, test_label) = load_data()

def shuffle_generator(image, label, seed):
    idx = np.arange(len(image))
    np.random.default_rng(seed).shuffle(idx)
    for i in idx:
        yield image[i], label[i]

dataset = tf.data.Dataset.from_generator(
    shuffle_generator,
    args=[train_image, train_label, 42],
    output_signature=(
        tf.TensorSpec(shape=(28,28), dtype=tf.uint8),
        tf.TensorSpec(shape=(), dtype=tf.uint8)))

model = Sequential([
    Flatten(input_shape=(28,28)),
    Dense(100, activation="relu"),
    Dense(100, activation="relu"),
    Dense(10, activation="sigmoid")
])
model.compile(optimizer="adam",  # optimizer choice is an assumption
              loss="sparse_categorical_crossentropy",
              metrics=["sparse_categorical_accuracy"])

history = model.fit(dataset.batch(32),
                    epochs=50,
                    validation_data=(test_image, test_label),
                    verbose=0)

print(model.evaluate(test_image, test_label))

plt.plot(history.history['val_sparse_categorical_accuracy'])
plt.show()

## Dataset with Prefetch

The real benefit of using a dataset is the ability to use prefetch().

Using a NumPy array for training is probably the best in performance. However, this means we need to load all data into memory. Using a generator function for training allows us to prepare one batch at a time, so that the data can be loaded from disk on demand, for example. However, using a generator function to train a Keras model means that at any moment either the training loop or the generator function is running, but not both. It is not easy to make the generator function and Keras' training loop run in parallel.

A dataset is the API that allows the generator and the training loop to run in parallel. If you have a generator that is computationally expensive (e.g., doing image augmentation in real time), you can create a dataset from such a generator function and then use it with prefetch(), as follows:

history = model.fit(dataset.batch(32).prefetch(3),
                    epochs=50,
                    validation_data=(test_image, test_label),
                    verbose=0)

The number argument to prefetch() is the size of the buffer. Here we ask the dataset to keep 3 batches in memory, ready for the training loop to consume. Whenever a batch is consumed, the dataset API resumes the generator function to refill the buffer, asynchronously in the background. Therefore, the training loop and the data preparation algorithm inside the generator function can run in parallel.
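The overlap can be observed with a rough timing experiment. The sketch below is only an illustration under stated assumptions: `slow_generator()` is a hypothetical generator that sleeps to imitate expensive data preparation, and the loop in `consume()` sleeps to imitate a training step. Actual timings depend on the machine, so no particular numbers are promised.

```python
import time
import tensorflow as tf

def slow_generator():
    for i in range(10):
        time.sleep(0.05)    # pretend each sample is expensive to prepare
        yield i

def make_dataset():
    return tf.data.Dataset.from_generator(
        slow_generator,
        output_signature=tf.TensorSpec(shape=(), dtype=tf.int32))

def consume(dataset):
    """Iterate over the dataset, returning the items seen and the elapsed time."""
    start = time.perf_counter()
    seen = []
    for item in dataset:
        time.sleep(0.05)    # pretend each training step takes a while
        seen.append(int(item.numpy()))
    return seen, time.perf_counter() - start

plain_items, plain_time = consume(make_dataset())
prefetch_items, prefetch_time = consume(make_dataset().prefetch(3))

print(f"without prefetch: {plain_time:.2f} s")
print(f"with prefetch(3): {prefetch_time:.2f} s")
```

Both runs see the same ten items in the same order. Without prefetch, the producer and the consumer take turns, so their delays add up; with prefetch, the generator can prepare upcoming items while the consumer is busy, so the second run is typically noticeably shorter. Instead of a fixed number, `tf.data.AUTOTUNE` can be passed to prefetch() to let TensorFlow pick the buffer size.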

It is worth mentioning that, in the previous section, we created a shuffling generator for the dataset API. Indeed, the dataset API also has a shuffle() function to do the same, but we may not want to use it unless the dataset is small enough to fit in memory.

The shuffle() function, same as prefetch(), takes a buffer size argument. The shuffle algorithm will fill the buffer with the dataset and draw one element randomly from it. The consumed element will be replaced with the next element from the dataset. Hence we need the buffer as large as the dataset itself to make a truly random shuffle. We can demonstrate this limitation with the following snippet:

import tensorflow as tf
import numpy as np

n_dataset = tf.data.Dataset.from_tensor_slices(np.arange(10000))
for n in n_dataset.shuffle(10).take(20):
    print(n.numpy())

The output from the above looks like the following:

9
6
2
7
5
1
4
14
11
17
19
18
3
16
15
22
10
23
21
13

We can see that the numbers are shuffled only within their neighborhood, and we never see large numbers in the output.
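The effect of the buffer size can be made more explicit by comparing a small buffer with one as large as the dataset. This is a minimal sketch on the same toy dataset of 10,000 integers; the variable names `small` and `full` are introduced here for illustration.

```python
import numpy as np
import tensorflow as tf

n_dataset = tf.data.Dataset.from_tensor_slices(np.arange(10000))

# Buffer of 10: each output is drawn from the 10 elements currently buffered
small = [int(n) for n in n_dataset.shuffle(10).take(20).as_numpy_iterator()]

# Buffer as large as the dataset: every element is a candidate from the start
full = [int(n) for n in n_dataset.shuffle(10000).take(20).as_numpy_iterator()]

print(max(small))   # always below 30
print(max(full))    # almost certainly in the thousands
```

With a buffer of 10, the k-th output can only come from the first k+10 elements of the dataset, so the first 20 outputs are always below 30. With a buffer that holds the whole dataset, the first 20 outputs are a random draw from all 10,000 numbers, at the cost of keeping everything in memory.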

More about the tf.data dataset can be found from its API documentation:

tf.data.Dataset API

## Summary

In this post, you have seen how we can use the tf.data dataset and how it can be used in training a Keras model.

Specifically, you learned:

How to train a model using data from NumPy array, a generator, and a dataset
How to create a dataset using a NumPy array or a generator function
How to use prefetch with dataset to make the generator and training loop run in parallel

The post A Gentle Introduction to tensorflow.data API appeared first on Machine Learning Mastery.
