Using Dataset Classes in PyTorch

By mullaned2002

November 24, 2022

Last Updated on November 23, 2022

In machine learning and deep learning problems, a lot of effort goes into preparing the data. Data is usually messy and needs to be preprocessed before it can be used for training a model. If the data is not prepared correctly, the model won’t be able to generalize well.
Some of the common steps required for data preprocessing include:

Data normalization: This means rescaling the values in a dataset to a fixed range, such as 0 to 1.
Data augmentation: This means generating new samples from existing ones, for example by adding noise or shifting features, to make the dataset more diverse.
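
Both steps can be sketched on a toy tensor. This is only a minimal illustration, assuming min-max scaling to the range [0, 1] for normalization and additive Gaussian noise for augmentation; the tensor values, the noise scale of 0.05, and the names features, normalized, and augmented are made up for this example.

```python
import torch

torch.manual_seed(0)

# Toy feature matrix: 4 samples, 2 features (values chosen for illustration)
features = torch.tensor([[1.0, 200.0],
                         [2.0, 400.0],
                         [3.0, 600.0],
                         [4.0, 800.0]])

# Normalization: rescale each feature column to the range [0, 1]
col_min = features.min(dim=0).values
col_max = features.max(dim=0).values
normalized = (features - col_min) / (col_max - col_min)

# Augmentation: create new samples by adding small Gaussian noise
noise = 0.05 * torch.randn_like(normalized)
augmented = normalized + noise

print(normalized)
print(augmented.shape)
```

After scaling, both columns share the same range even though the raw features differ by two orders of magnitude.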

Data preparation is a crucial step in any machine learning pipeline. PyTorch comes with companion libraries such as torchvision, which provides datasets and dataset classes to make data preparation easier.

In this tutorial we’ll demonstrate how to work with datasets and transforms in PyTorch so that you may create your own custom dataset classes and manipulate the datasets the way you want. In particular, you’ll learn:

How to create a simple dataset class and apply transforms to it.
How to build callable transforms and apply them to the dataset object.
How to compose various transforms on a dataset object.

Note that here you’ll play with simple datasets for general understanding of the concepts while in the next part of this tutorial you’ll get a chance to work with dataset objects for images.

Let’s get started.

Overview

This tutorial is in three parts; they are:

Creating a Simple Dataset Class
Creating Callable Transforms
Composing Multiple Transforms for Datasets

Creating a Simple Dataset Class

Before creating the dataset class, we’ll import a few packages.

import torch
from torch.utils.data import Dataset
torch.manual_seed(42)

We import the abstract class Dataset from torch.utils.data and subclass it. In our dataset class, we override the following methods:

__len__ so that len(dataset) can tell us the size of the dataset.
__getitem__ to access the data samples in the dataset by supporting the indexing operation. For example, dataset[i] can be used to retrieve the i-th data sample.

In addition, torch.manual_seed() seeds PyTorch’s random number generator, so that random operations produce the same values every time the script is run.
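
The effect of seeding can be checked with a short experiment. This is a minimal sketch; the variable names first and second are only for illustration.

```python
import torch

# Seeding the generator makes the random draws repeatable
torch.manual_seed(42)
first = torch.rand(3)

# Re-seeding with the same value restarts the same sequence
torch.manual_seed(42)
second = torch.rand(3)

# The two draws are identical because both started from the same seed
print(torch.equal(first, second))
```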

Now, let’s define the dataset class.

class SimpleDataset(Dataset):
    # defining values in the constructor
    def __init__(self, data_length=20, transform=None):
        self.x = 3 * torch.eye(data_length, 2)
        self.y = torch.eye(data_length, 4)
        self.transform = transform
        self.len = data_length

    # Getting the data samples
    def __getitem__(self, idx):
        sample = self.x[idx], self.y[idx]
        if self.transform:
            sample = self.transform(sample)
        return sample

    # Getting data size/length
    def __len__(self):
        return self.len

In the constructor, we create the features and targets, x and y, and store them in the tensors self.x and self.y. Each tensor holds 20 data samples, and the attribute self.len stores the number of samples. We’ll discuss the transform argument later in the tutorial.
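
To make the contents of these tensors concrete, the sketch below builds them outside the class and inspects their shapes. torch.eye(n, m) returns an n-by-m matrix with ones on the main diagonal and zeros elsewhere, so only the first few rows contain a nonzero entry.

```python
import torch

data_length = 20
x = 3 * torch.eye(data_length, 2)   # shape (20, 2), diagonal entries equal to 3
y = torch.eye(data_length, 4)       # shape (20, 4), diagonal entries equal to 1

print(x.shape, y.shape)
print(x[:3])          # only the first two rows of x contain a nonzero entry
print(int(y.sum()))   # number of ones in y (one per diagonal position)
```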

A SimpleDataset object behaves much like a Python sequence such as a list or a tuple: it supports len() and indexing. Now, let’s create a SimpleDataset object and look at its total length and the value at index 1.

dataset = SimpleDataset()
print("length of the SimpleDataset object: ", len(dataset))
print("accessing value at index 1 of the simple_dataset object: ", dataset[1])

This prints

length of the SimpleDataset object: 20
accessing value at index 1 of the simple_dataset object: (tensor([0., 3.]), tensor([0., 1., 0., 0.]))

As our dataset supports indexing, let’s print out the first four elements using a loop:

for i in range(4):
    x, y = dataset[i]
    print(x, y)

This prints

tensor([3., 0.]) tensor([1., 0., 0., 0.])
tensor([0., 3.]) tensor([0., 1., 0., 0.])
tensor([0., 0.]) tensor([0., 0., 1., 0.])
tensor([0., 0.]) tensor([0., 0., 0., 1.])
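
The same indexing interface also covers the whole dataset. The sketch below repeats the class definition so that it runs on its own, then visits every sample with len() as the upper bound. It also shows that, because __getitem__ forwards idx straight to the underlying tensors, a slice returns a batch of samples; this is a property of this particular class rather than of Dataset in general.

```python
import torch
from torch.utils.data import Dataset

class SimpleDataset(Dataset):
    def __init__(self, data_length=20, transform=None):
        self.x = 3 * torch.eye(data_length, 2)
        self.y = torch.eye(data_length, 4)
        self.transform = transform
        self.len = data_length

    def __getitem__(self, idx):
        sample = self.x[idx], self.y[idx]
        if self.transform:
            sample = self.transform(sample)
        return sample

    def __len__(self):
        return self.len

dataset = SimpleDataset()

# Visit every sample by index, using len() for the upper bound
samples = [dataset[i] for i in range(len(dataset))]
print(len(samples))

# A slice is passed to the tensors unchanged, so it returns a batch
x_batch, y_batch = dataset[0:4]
print(x_batch.shape, y_batch.shape)
```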

Creating Callable Transforms

In many cases, you’ll need to create callable transforms in order to normalize or standardize the data. These transforms can then be applied to the tensors. Let’s create a callable transform and apply it to the SimpleDataset object we created earlier in this tutorial.

# Creating a callable transform class MultDivide
class MultDivide:
    # Constructor
    def __init__(self, mult_x=2, divide_y=3):
        self.mult_x = mult_x
        self.divide_y = divide_y

    # caller
    def __call__(self, sample):
        x = sample[0]
        y = sample[1]
        x = x * self.mult_x
        y = y / self.divide_y
        sample = x, y
        return sample

We have created a simple custom transform MultDivide that, by default, multiplies x by 2 and divides y by 3. This is not meant for any practical use; it only demonstrates how a callable class can work as a transform for our dataset class. Remember that we declared a parameter transform = None in SimpleDataset. Now, we can replace that None with the custom transform object that we’ve just created.
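
Because a MultDivide instance is an ordinary callable, it can also be applied directly to a single (x, y) tuple, which is a convenient way to check a transform before attaching it to a dataset. The sketch below repeats the class in a slightly condensed form so that it runs on its own; the sample values and the non-default arguments 10 and 2 are chosen only for illustration.

```python
import torch

class MultDivide:
    def __init__(self, mult_x=2, divide_y=3):
        self.mult_x = mult_x
        self.divide_y = divide_y

    def __call__(self, sample):
        x, y = sample
        return x * self.mult_x, y / self.divide_y

# A sample in the same (x, y) format that SimpleDataset returns
sample = (torch.tensor([3.0, 0.0]), torch.tensor([1.0, 0.0, 0.0, 0.0]))

# Default arguments: multiply x by 2, divide y by 3
x_default, y_default = MultDivide()(sample)

# Illustrative non-default arguments: multiply x by 10, divide y by 2
x_custom, y_custom = MultDivide(mult_x=10, divide_y=2)(sample)

print(x_default, y_default)
print(x_custom, y_custom)
```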

So, let’s demonstrate how it’s done and call this transform object on our dataset to see how it transforms the first four elements of our dataset.

# calling the transform object
mul_div = MultDivide()
custom_dataset = SimpleDataset(transform=mul_div)

for i in range(4):
    x, y = dataset[i]
    print('Idx: ', i, 'Original_x: ', x, 'Original_y: ', y)
    x_, y_ = custom_dataset[i]
    print('Idx: ', i, 'Transformed_x:', x_, 'Transformed_y:', y_)

This prints

Idx: 0 Original_x: tensor([3., 0.]) Original_y: tensor([1., 0., 0., 0.])
Idx: 0 Transformed_x: tensor([6., 0.]) Transformed_y: tensor([0.3333, 0.0000, 0.0000, 0.0000])
Idx: 1 Original_x: tensor([0., 3.]) Original_y: tensor([0., 1., 0., 0.])
Idx: 1 Transformed_x: tensor([0., 6.]) Transformed_y: tensor([0.0000, 0.3333, 0.0000, 0.0000])
Idx: 2 Original_x: tensor([0., 0.]) Original_y: tensor([0., 0., 1., 0.])
Idx: 2 Transformed_x: tensor([0., 0.]) Transformed_y: tensor([0.0000, 0.0000, 0.3333, 0.0000])
Idx: 3 Original_x: tensor([0., 0.]) Original_y: tensor([0., 0., 0., 1.])
Idx: 3 Transformed_x: tensor([0., 0.]) Transformed_y: tensor([0.0000, 0.0000, 0.0000, 0.3333])

As you can see, the transform has been successfully applied to the first four elements of the dataset: every value in x has been doubled and every value in y has been divided by 3.

Composing Multiple Transforms for Datasets

We often want to perform multiple transforms in series on a dataset. This can be done by importing the Compose class from the transforms module in torchvision. For instance, let’s say we build another transform SubtractOne and apply it to our dataset in addition to the MultDivide transform that we created earlier.

Once applied, the newly created transform will subtract 1 from each element of the dataset.

from torchvision import transforms

# Creating the SubtractOne transform
class SubtractOne:
    # Constructor
    def __init__(self, number=1):
        self.number = number

    # caller
    def __call__(self, sample):
        x = sample[0]
        y = sample[1]
        x = x - self.number
        y = y - self.number
        sample = x, y
        return sample

As mentioned earlier, we’ll now combine both transforms with the Compose class.

# Composing multiple transforms
mult_transforms = transforms.Compose([MultDivide(), SubtractOne()])

Note that the MultDivide transform is applied to each sample first, and the SubtractOne transform is then applied to the result.
We’ll pass the Compose object, which holds both transforms (MultDivide() and SubtractOne()), to our SimpleDataset object.
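
To see why the order matters, the sketch below chains the two transforms in both orders on a single sample. It deliberately avoids torchvision and uses a small hypothetical helper, apply_in_order, that feeds the output of each transform into the next; this only mimics the chaining behavior described above and is not torchvision’s implementation. The transform classes are repeated in condensed form so that the example runs on its own.

```python
import torch

class MultDivide:
    def __init__(self, mult_x=2, divide_y=3):
        self.mult_x = mult_x
        self.divide_y = divide_y

    def __call__(self, sample):
        x, y = sample
        return x * self.mult_x, y / self.divide_y

class SubtractOne:
    def __init__(self, number=1):
        self.number = number

    def __call__(self, sample):
        x, y = sample
        return x - self.number, y - self.number

def apply_in_order(transform_list, sample):
    # Hypothetical helper: pass the output of each transform to the next one
    for transform in transform_list:
        sample = transform(sample)
    return sample

sample = (torch.tensor([3.0, 0.0]), torch.tensor([1.0, 0.0, 0.0, 0.0]))

# Multiply/divide first, then subtract: x becomes x * 2 - 1
x_a, y_a = apply_in_order([MultDivide(), SubtractOne()], sample)

# Subtract first, then multiply/divide: x becomes (x - 1) * 2
x_b, y_b = apply_in_order([SubtractOne(), MultDivide()], sample)

print(x_a, y_a)
print(x_b, y_b)
```

The two orders give different results for the same sample, so the position of each transform in the list is part of the preprocessing definition.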

# Creating a new SimpleDataset object with multiple transforms
new_dataset = SimpleDataset(transform=mult_transforms)

Now that the combination of multiple transforms has been applied to the dataset, let’s print out the first four elements of our transformed dataset.

for i in range(4):
    x, y = dataset[i]
    print('Idx: ', i, 'Original_x: ', x, 'Original_y: ', y)
    x_, y_ = new_dataset[i]
    print('Idx: ', i, 'Transformed x_:', x_, 'Transformed y_:', y_)
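
The transformed values can also be derived by hand: with the default arguments, each x is multiplied by 2 and then reduced by 1, and each y is divided by 3 and then reduced by 1. The sketch below computes these expected values directly from the formulas, without the dataset class or torchvision, as a cross-check for the loop above; the names x_expected and y_expected are only for illustration.

```python
import torch

# First four rows of the untransformed features and targets
x = 3 * torch.eye(20, 2)[:4]
y = torch.eye(20, 4)[:4]

# MultDivide (defaults 2 and 3) followed by SubtractOne (default 1)
x_expected = x * 2 - 1
y_expected = y / 3 - 1

for i in range(4):
    print('Idx: ', i, 'Expected x_:', x_expected[i], 'Expected y_:', y_expected[i])
```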

Putting everything together, the complete code is as follows:

import torch
from torch.utils.data import Dataset
from torchvision import transforms

torch.manual_seed(2)