Artificial Intelligence and Machine Learning

Adding A Custom Attention Layer To Recurrent Neural Network In Keras

By mullaned2002

October 19, 2021

2845

Last Updated on October 12, 2021

Deep learning networks have gained immense popularity in the past few years. The ‘attention mechanism’ is integrated with the deep learning networks to improve their performance. Adding attention component to the network has shown significant improvement in tasks such as machine translation, image recognition, text summarization and similar applications.

This tutorial shows how to add a custom attention layer to a network built using a recurrent neural network. We’ll illustrate an end to end application of time series forecasting using a very simple dataset. The tutorial is designed for anyone looking for a basic understanding of how to add user defined layers to a deep learning network and use this simple example to build more complex applications.

After completing this tutorial, you will know:

Which methods are required to create a custom attention layer in Keras
How to incorporate the new layer in a network built with SimpleRNN

Let’s get started.

Adding A Custom Attention Layer To Recurrent Neural Network In Keras
Photo by Yahya Ehsan, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

Preparing a simple dataset for time series forecasting
How to use a network built via SimpleRNN for time series forecasting
Adding a custom attention layer to the SimpleRNN network

Prerequisites

It is assumed that you are familiar with the following topics. You can click the links below for an overview.

What is Attention?
The attention mechanism from scratch
An introduction to RNN and the math that powers them
Understanding simple recurrent neural networks in Keras

The Dataset

The focus of this article is to gain a basic understanding of how to build a custom attention layer to a deep learning network. For this purpose, we’ll use a very simple example of a Fibonacci sequence, where one number is constructed from previous two numbers. The first 10 numbers of the sequence are shown below:

0, 1, 1, 2, 3, 5, 8, 13, 21, 34, …

When given the previous ‘t’ numbers, can we get a machine to accurately reconstruct the next number? This would mean discarding all the previous inputs except the last two and performing the correct operation on the last two numbers.

For this tutorial, we’ll construct the training examples from t time steps and use the value at t+1 as the target. For example, if t=3, then the training examples and the corresponding target values would look as follows:

The SimpleRNN Network

In this section, we’ll write the basic code to generate the dataset and use a SimpleRNN network for predicting the next number of the Fibonacci sequence.

The Import Section

Let’s first write the import section:

from pandas import read_csv
import numpy as np
from keras import Model
from keras.layers import Layer
import keras.backend as K
from keras.layers import Input, Dense, SimpleRNN
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.metrics import mean_squared_error

Preparing The Dataset

The following function generates a sequence of n Fibonacci numbers (not counting the starting two values). If scale_data is set to True, then it would also use the MinMaxScaler from scikit-learn to scale the values between 0 and 1. Let’s see its output for n=10.

def get_fib_seq(n, scale_data=True):
# Get the Fibonacci sequence
seq = np.zeros(n)
fib_n1 = 0.0
fib_n = 1.0
for i in range(n):
seq[i] = fib_n1 + fib_n
fib_n1 = fib_n
fib_n = seq[i]
scaler = []
if scale_data:
scaler = MinMaxScaler(feature_range=(0, 1))
seq = np.reshape(seq, (n, 1))
seq = scaler.fit_transform(seq).flatten()
return seq, scaler

fib_seq = get_fib_seq(10, False)[0]
print(fib_seq)

[ 1. 2. 3. 5. 8. 13. 21. 34. 55. 89.]

Next, we need a function get_fib_XY() that reformats the sequence into training examples and target values to be used by the Keras input layer. When given time_steps as a parameter, get_fib_XY() constructs each row of the dataset with time_steps number of columns. This function not only constructs the training set and test set from the Fibonacci sequence, but also shuffles the training examples and reshapes them to the required TensorFlow format, i.e., total_samples x time_steps x features. Also, the function returns the scaler object that scales the values if scale_data is set to True.

Let’s generate a small training set to see what it looks like. We have set time_steps=3, total_fib_numbers=12, with approximately 70% examples going towards the test points. Note the training and test examples have been shuffled by the permutation() function.

def get_fib_XY(total_fib_numbers, time_steps, train_percent, scale_data=True):
dat, scaler = get_fib_seq(total_fib_numbers, scale_data)
Y_ind = np.arange(time_steps, len(dat), 1)
Y = dat[Y_ind]
rows_x = len(Y)
X = dat[0:rows_x]
for i in range(time_steps-1):
temp = dat[i+1:rows_x+i+1]
X = np.column_stack((X, temp))
# random permutation with fixed seed
rand = np.random.RandomState(seed=13)
idx = rand.permutation(rows_x)
split = int(train_percent*rows_x)
train_ind = idx[0:split]
test_ind = idx[split:]
trainX = X[train_ind]
trainY = Y[train_ind]
testX = X[test_ind]
testY = Y[test_ind]
trainX = np.reshape(trainX, (len(trainX), time_steps, 1))
testX = np.reshape(testX, (len(testX), time_steps, 1))
return trainX, trainY, testX, testY, scaler

trainX, trainY, testX, testY, scaler = get_fib_XY(12, 3, 0.7, False)
print(‘trainX = ‘, trainX)
print(‘trainY = ‘, trainY)

trainX = [[[ 8.]
[13.]
[21.]]

[[ 5.]
[ 8.]
[13.]]

[[ 2.]
[ 3.]
[ 5.]]

[[13.]
[21.]
[34.]]

[[21.]
[34.]
[55.]]

[[34.]
[55.]
[89.]]]
trainY = [ 34. 21. 8. 55. 89. 144.]

Setting Up The Network

Now let’s setup a small network with two layers. The first one being the SimpleRNN layer and the second one being the Dense layer. Below is a summary of the model.

# Set up parameters
time_steps = 20
hidden_units = 2
epochs = 30

# Create a traditional RNN network
def create_RNN(hidden_units, dense_units, input_shape, activation):
model = Sequential()
model.add(SimpleRNN(hidden_units, input_shape=input_shape, activation=activation[0]))
model.add(Dense(units=dense_units, activation=activation[1]))
model.compile(loss=’mse’, optimizer=’adam’)
return model

model_RNN = create_RNN(hidden_units=hidden_units, dense_units=1, input_shape=(time_steps,1),
activation=[‘tanh’, ‘tanh’])
model_RNN.summary()

Model: “sequential_1”
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
simple_rnn_3 (SimpleRNN) (None, 2) 8
_________________________________________________________________
dense_3 (Dense) (None, 1) 3
=================================================================
Total params: 11
Trainable params: 11
Non-trainable params: 0

Train The Network And Evaluate

The next step is to add code that generates a dataset, trains the network, and evaluates it. This time around, we’ll scale the data between 0 and 1. We don’t need to pass scale_data parameter as its default value is True.

# Generate the dataset
trainX, trainY, testX, testY, scaler = get_fib_XY(1200, time_steps, 0.7)

model_RNN.fit(trainX, trainY, epochs=epochs, batch_size=1, verbose=2)

# Evalute model
train_mse = model_RNN.evaluate(trainX, trainY)
test_mse = model_RNN.evaluate(testX, testY)

# Print error
print(“Train set MSE = “, train_mse)
print(“Test set MSE = “, test_mse)

As output you’ll see the progress of training and the following values of mean square error:

Train set MSE = 5.631405292660929e-05
Test set MSE = 2.623497312015388e-05

Adding A Custom Attention Layer To The Network

In Keras, it is easy to create a custom layer that implements attention by subclassing the Layer class. The Keras guide lists down clear steps for creating a new layer via subclassing. We’ll use those guidelines here. All the weights and biases corresponding to a single layer are encapsulated by this class. We need to write the __init__ method as well as override the following methods:

build(): Keras guide recommends adding weights in this method once the size of the inputs is known. This method ‘lazily’ creates weights. The builtin function add_weight() can be used to add weights and biases of the attention layer.
call(): The call() method implements the mapping of inputs to outputs. It should implement the forward pass during training.

The Call Method For Attention Layer

The call method of the attention layer has to compute the alignment scores, weights, and context. You can go through the details of these parameters in Stefania’s excellent article on The Attention Mechanism from Scratch. We’ll implement the Bahdanau attention in our call() method.

The good thing about inheriting a layer from the Keras Layer class and adding the weights via add_weights() method is that weights are automatically tuned. Keras does an equivalent of ‘reverse engineering’ of the operations/computations of the call() method and calculates the gradients during training. It is important to specify trainable=True when adding the weights. You can also add a train_step() method to your custom layer and specify your own method for weight training if needed.

The code below implements our custom attention layer.

# Add attention layer to the deep learning network
class attention(Layer):
def __init__(self,**kwargs):
super(attention,self).__init__(**kwargs)

def build(self,input_shape):
self.W=self.add_weight(name=’attention_weight’, shape=(input_shape[-1],1),
initializer=’random_normal’, trainable=True)
self.b=self.add_weight(name=’attention_bias’, shape=(input_shape[1],1),
initializer=’zeros’, trainable=True)
super(attention, self).build(input_shape)

def call(self,x):
# Alignment scores. Pass them through tanh function
e = K.tanh(K.dot(x,self.W)+self.b)
# Remove dimension of size 1
e = K.squeeze(e, axis=-1)
# Compute the weights
alpha = K.softmax(e)
# Reshape to tensorFlow format
alpha = K.expand_dims(alpha, axis=-1)
# Compute the context vector
context = x * alpha
context = K.sum(context, axis=1)
return context

RNN Network With Attention Layer

Let’s now add an attention layer to the RNN network we created earlier. The function create_RNN_with_attention() now specifies an RNN layer, attention layer and Dense layer in the network. Make sure to set return_sequences=True when specifying the SimpleRNN. This will return the output of the hidden units for all the previous time steps.

Let’s look at a summary of our model with attention.

def create_RNN_with_attention(hidden_units, dense_units, input_shape, activation):
x=Input(shape=input_shape)
RNN_layer = SimpleRNN(hidden_units, return_sequences=True, activation=activation)(x)
attention_layer = attention()(RNN_layer)
outputs=Dense(dense_units, trainable=True, activation=activation)(attention_layer)
model=Model(x,outputs)
model.compile(loss=’mse’, optimizer=’adam’)
return model

model_attention = create_RNN_with_attention(hidden_units=hidden_units, dense_units=1,
input_shape=(time_steps,1), activation=’tanh’)
model_attention.summary()

Model: “model_1”
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, 20, 1)] 0
_________________________________________________________________
simple_rnn_2 (SimpleRNN) (None, 20, 2) 8
_________________________________________________________________
attention_1 (attention) (None, 2) 22
_________________________________________________________________
dense_2 (Dense) (None, 1) 3
=================================================================
Total params: 33
Trainable params: 33
Non-trainable params: 0
_________________________________________________________________

Train And Evaluate The Deep Learning Network With Attention

It’s time to train and test our model and see how it performs on predicting the next Fibonacci number of a sequence.

model_attention.fit(trainX, trainY, epochs=epochs, batch_size=1, verbose=2)

# Evalute model
train_mse_attn = model_attention.evaluate(trainX, trainY)
test_mse_attn = model_attention.evaluate(testX, testY)

# Print error
print(“Train set MSE with attention = “, train_mse_attn)
print(“Test set MSE with attention = “, test_mse_attn)

You’ll see the training progress as output and the following:

Train set MSE with attention = 5.3511179430643097e-05
Test set MSE with attention = 9.053358553501312e-06

We can see that even for this simple example, the mean square error on the test set is lower with the attention layer. You can achieve better results with hyper-parameter tuning and model selection. Do try this out on more complex problems and adding more layers to the network. You can also use the scaler object to scale the numbers back to their original values.

You can take this example one step further by using LSTM instead of SimpleRNN or you can build a network via convolution and pooling layers. You can also change this to an encoder decoder network if you like.

Consolidated Code

The entire code for this tutorial is pasted below if you would like to try it. Note that your outputs would be different from the ones given in this tutorial because of the stochastic nature of this algorithm.

# Prepare data
def get_fib_seq(n, scale_data=True):
# Get the Fibonacci sequence
seq = np.zeros(n)
fib_n1 = 0.0
fib_n = 1.0
for i in range(n):
seq[i] = fib_n1 + fib_n
fib_n1 = fib_n
fib_n = seq[i]
scaler = []
if scale_data:
scaler = MinMaxScaler(feature_range=(0, 1))
seq = np.reshape(seq, (n, 1))
seq = scaler.fit_transform(seq).flatten()
return seq, scaler

def get_fib_XY(total_fib_numbers, time_steps, train_percent, scale_data=True):
dat, scaler = get_fib_seq(total_fib_numbers, scale_data)
Y_ind = np.arange(time_steps, len(dat), 1)
Y = dat[Y_ind]
rows_x = len(Y)
X = dat[0:rows_x]
for i in range(time_steps-1):
temp = dat[i+1:rows_x+i+1]
X = np.column_stack((X, temp))
# random permutation with fixed seed
rand = np.random.RandomState(seed=13)
idx = rand.permutation(rows_x)
split = int(train_percent*rows_x)
train_ind = idx[0:split]
test_ind = idx[split:]
trainX = X[train_ind]
trainY = Y[train_ind]
testX = X[test_ind]
testY = Y[test_ind]
trainX = np.reshape(trainX, (len(trainX), time_steps, 1))
testX = np.reshape(testX, (len(testX), time_steps, 1))
return trainX, trainY, testX, testY, scaler

# Set up parameters
time_steps = 20
hidden_units = 2
epochs = 30

# Create a traditional RNN network
def create_RNN(hidden_units, dense_units, input_shape, activation):
model = Sequential()
model.add(SimpleRNN(hidden_units, input_shape=input_shape, activation=activation[0]))
model.add(Dense(units=dense_units, activation=activation[1]))
model.compile(loss=’mse’, optimizer=’adam’)
return model

model_RNN = create_RNN(hidden_units=hidden_units, dense_units=1, input_shape=(time_steps,1),
activation=[‘tanh’, ‘tanh’])

# Generate the dataset for the network
trainX, trainY, testX, testY, scaler = get_fib_XY(1200, time_steps, 0.7)
# Train the network
model_RNN.fit(trainX, trainY, epochs=epochs, batch_size=1, verbose=2)

# Evalute model
train_mse = model_RNN.evaluate(trainX, trainY)
test_mse = model_RNN.evaluate(testX, testY)

# Print error
print(“Train set MSE = “, train_mse)
print(“Test set MSE = “, test_mse)

# Add attention layer to the deep learning network
class attention(Layer):
def __init__(self,**kwargs):
super(attention,self).__init__(**kwargs)

def build(self,input_shape):
self.W=self.add_weight(name=’attention_weight’, shape=(input_shape[-1],1),
initializer=’random_normal’, trainable=True)
self.b=self.add_weight(name=’attention_bias’, shape=(input_shape[1],1),
initializer=’zeros’, trainable=True)
super(attention, self).build(input_shape)

def call(self,x):
# Alignment scores. Pass them through tanh function
e = K.tanh(K.dot(x,self.W)+self.b)
# Remove dimension of size 1
e = K.squeeze(e, axis=-1)
# Compute the weights
alpha = K.softmax(e)
# Reshape to tensorFlow format
alpha = K.expand_dims(alpha, axis=-1)
# Compute the context vector
context = x * alpha
context = K.sum(context, axis=1)
return context

def create_RNN_with_attention(hidden_units, dense_units, input_shape, activation):
x=Input(shape=input_shape)
RNN_layer = SimpleRNN(hidden_units, return_sequences=True, activation=activation)(x)
attention_layer = attention()(RNN_layer)
outputs=Dense(dense_units, trainable=True, activation=activation)(attention_layer)
model=Model(x,outputs)
model.compile(loss=’mse’, optimizer=’adam’)
return model

# Create the model with attention, train and evaluate
model_attention = create_RNN_with_attention(hidden_units=hidden_units, dense_units=1,
input_shape=(time_steps,1), activation=’tanh’)
model_attention.summary()

model_attention.fit(trainX, trainY, epochs=epochs, batch_size=1, verbose=2)

# Evalute model
train_mse_attn = model_attention.evaluate(trainX, trainY)
test_mse_attn = model_attention.evaluate(testX, testY)

# Print error
print(“Train set MSE with attention = “, train_mse_attn)
print(“Test set MSE with attention = “, test_mse_attn)

Summary

In this tutorial, you discovered how to add a custom attention layer to a deep learning network using Keras.

Specifically, you learned:

How to override the Keras Layer class.
The method build() is required to add weights to the attention layer.
The call() method is required for specifying the mapping of inputs to outputs of the attention layer.
How to add a custom attention layer to the deep learning network built using SimpleRNN.

Do you have any questions about RNNs discussed in this post? Ask your questions in the comments below and I will do my best to answer.

The post Adding A Custom Attention Layer To Recurrent Neural Network In Keras appeared first on Machine Learning Mastery.

Adding A Custom Attention Layer To Recurrent Neural Network In Keras

Tutorial Overview

Prerequisites

The Dataset

The SimpleRNN Network

The Import Section

Preparing The Dataset

Setting Up The Network

Train The Network And Evaluate

Adding A Custom Attention Layer To The Network

The Call Method For Attention Layer

RNN Network With Attention Layer

Train And Evaluate The Deep Learning Network With Attention

Consolidated Code

Further Reading

Books

Papers

Articles

Summary

Amazon SageMaker inference launches faster auto scaling for generative AI models

Node problem detection and recovery for AWS Neuron nodes within Amazon EKS clusters

Evaluate conversational AI agents with Amazon Bedrock

LEAVE A REPLY Cancel reply

Most Popular

Schneider Electric automates Salesforce account hierarchy management with generative artificial intelligence (AI) using Amazon Aurora and Amazon Bedrock

Leverage enterprise data with Denodo and Vertex AI for generative AI applications

TypeScript takes aim at truthy and nullish bugs

Make relevant movie recommendations using Amazon Neptune, Amazon Neptune Machine Learning, and Amazon OpenSearch Service

Recent Comments

EDITOR PICKS

Exploring the Click Element Variable in Google Tag Manager

How to track events with Google Tag Manager and Google Analytics

Data Layer Variable in GTM: What, Why, and Where?

POPULAR POSTS

Generative AI use cases to inspire your Startup or Small Business

3 DataOps principles to increase collaboration between teams

Build a customer-facing app like a SaaS company

POPULAR CATEGORY