Last Updated on October 4, 2022
Having familiarised ourselves with the theory behind the Transformer model and its attention mechanism, we’ll be starting our journey of implementing a complete Transformer model by first seeing how to implement the scaled dot-product attention. The scaled dot-product attention is an integral part of the multi-head attention, which, in turn, is an important component of both the Transformer encoder and decoder. Our end goal will be the application of the complete Transformer model to Natural Language Processing (NLP).
In this tutorial, you will discover how to implement scaled dot-product attention from scratch in TensorFlow and Keras.
After completing this tutorial, you will know:
The operations that form part of the scaled dot-product attention mechanism.
How to implement the scaled dot-product attention mechanism from scratch.
Let’s get started.
Tutorial Overview
This tutorial is divided into three parts; they are:
Recap of the Transformer Architecture
The Transformer Scaled Dot-Product Attention
Implementing the Scaled Dot-Product Attention From Scratch
Testing Out the Code
Prerequisites
For this tutorial, we assume that you are already familiar with:
The concept of attention
The attention mechanism
The Transformer attention mechanism
The Transformer model
Recap of the Transformer Architecture
Recall having seen that the Transformer architecture follows an encoder-decoder structure: the encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step, to generate an output sequence.
In generating an output sequence, the Transformer does not rely on recurrence and convolutions.
We have seen that the decoder part of the Transformer shares many similarities in its architecture with the encoder. One of the core components that the encoder and decoder share within their multi-head attention blocks is the scaled dot-product attention.
The Transformer Scaled Dot-Product Attention
First, recall the queries, keys, and values as the important components we shall be working with.
In the encoder stage, they each carry the same input sequence after this has been embedded and augmented by positional information. Similarly, on the decoder side, the queries, keys, and values fed into the first attention block represent the same target sequence, after it too has been embedded and augmented by positional information. The second attention block of the decoder receives the encoder output in the form of keys and values, and the normalized output of the first attention block as the queries. The dimensionality of the queries and keys is denoted by $d_k$, whereas the dimensionality of the values is denoted by $d_v$.
The scaled dot-product attention receives these queries, keys, and values as inputs, and first computes the dot-product of the queries with the keys. The result is subsequently scaled by the square root of $d_k$, producing the attention scores, and then fed into a softmax function, obtaining a set of attention weights in doing so. Finally, the attention weights are used to scale the values through a weighted multiplication operation. This entire process can be explained mathematically as follows, where $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ denote the queries, keys, and values, respectively:
$$\text{attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax} \left( \frac{\mathbf{Q} \mathbf{K}^\mathsf{T}}{\sqrt{d_k}} \right) \mathbf{V}$$
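To make the formula concrete, here is a minimal NumPy sketch of the computation, unbatched and without masking for simplicity. The function name and the tiny test inputs are our own, for illustration only; the Keras implementation follows later in this tutorial.

```python
import numpy as np

def attention(Q, K, V):
    # Q and K have shape (seq_len, d_k); V has shape (seq_len, d_v)
    d_k = Q.shape[-1]
    # Dot-product of the queries with the keys, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns the scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of the value vectors
    return weights @ V

Q = np.ones((3, 4))
K = np.ones((3, 4))
V = np.arange(12.0).reshape(3, 4)
out = attention(Q, K, V)
print(out.shape)  # (3, 4): one d_v-dimensional output per query
```

Since the queries and keys here are identical, every attention weight is 1/3, and each output row is simply the mean of the rows of V.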
Each multi-head attention block in the Transformer model implements a scaled dot-product attention operation.
You may note that the scaled dot-product attention can also apply a mask to the attention scores before feeding them into the softmax function.
Since the word embeddings are zero-padded to a specific sequence length, a padding mask needs to be introduced in order to prevent the zero tokens from being processed along with the input in both the encoder and decoder stages. Furthermore, a look-ahead mask is also required to prevent the decoder from attending to succeeding words, such that the prediction for a particular word can only depend on known outputs for the words that come before it.
These look-ahead and padding masks are applied inside the scaled dot-product attention, in order to set to $-\infty$ all values in the input to the softmax function that should not be considered. For each of these large negative inputs, the softmax function will, in turn, produce an output value that is close to zero, effectively masking them out. The use of these masks will become clearer when we progress to the implementation of the encoder and decoder blocks in separate tutorials.
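As a quick sketch of the look-ahead mask just described (a hypothetical helper of our own; the full version belongs to the decoder tutorial), an upper-triangular matrix of ones marks the succeeding positions that each word must not attend to:

```python
import numpy as np

# 1 marks a position to be hidden (pushed to a large negative value before
# the softmax); 0 marks a position the decoder may attend to.
def lookahead_mask(seq_length):
    return np.triu(np.ones((seq_length, seq_length)), k=1)

print(lookahead_mask(4))
# [[0. 1. 1. 1.]
#  [0. 0. 1. 1.]
#  [0. 0. 0. 1.]
#  [0. 0. 0. 0.]]
```

Row $i$ of this matrix allows attention only to positions $0$ through $i$, so the prediction for a word can depend only on the words before it.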
For the time being, let’s see how to implement the scaled dot-product attention from scratch in TensorFlow and Keras.
Implementing the Scaled Dot-Product Attention From Scratch
For this purpose, we will be creating a class called DotProductAttention that inherits from the Layer base class in Keras.
In it, we will be creating the method, call(), that takes as input arguments the queries, keys, and values, as well as the dimensionality, $d_k$, and a mask (that defaults to None):
class DotProductAttention(Layer):
    def __init__(self, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)

    def call(self, queries, keys, values, d_k, mask=None):
        ...
The first step is to perform a dot-product operation between the queries and the keys, transposing the latter. The result will be scaled through a division by the square root of $d_k$. We will add the following line of code to the call() method:
…
scores = matmul(queries, keys, transpose_b=True) / math.sqrt(cast(d_k, float32))
…
Next, we will be checking whether the mask argument has been set to a value that is not the default None.
The mask will contain either 0 values, to indicate that the corresponding token in the input sequence should be considered in the computations, or a 1 to indicate otherwise. The mask will be multiplied by -1e9 to set the 1 values to large negative numbers (remember having mentioned this in the previous section), and subsequently applied to the attention scores:
…
if mask is not None:
    scores += -1e9 * mask
…
The attention scores will then be passed through a softmax function to generate the attention weights:
…
weights = softmax(scores)
…
The final step weights the values with the computed attention weights, through another dot-product operation:
…
return matmul(weights, values)
The complete code listing is as follows:
from tensorflow import matmul, math, cast, float32
from tensorflow.keras.layers import Layer
from tensorflow.keras.backend import softmax

# Implementing the Scaled Dot-Product Attention
class DotProductAttention(Layer):
    def __init__(self, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)

    def call(self, queries, keys, values, d_k, mask=None):
        # Score the queries against the keys after transposing the latter, and scale
        scores = matmul(queries, keys, transpose_b=True) / math.sqrt(cast(d_k, float32))

        # Apply mask to the attention scores
        if mask is not None:
            scores += -1e9 * mask

        # Compute the weights by a softmax operation
        weights = softmax(scores)

        # Compute the attention by a weighted sum of the value vectors
        return matmul(weights, values)
Testing Out the Code
We will be working with the parameter values specified in the paper, Attention Is All You Need, by Vaswani et al. (2017):
d_k = 64 # Dimensionality of the linearly projected queries and keys
d_v = 64 # Dimensionality of the linearly projected values
batch_size = 64 # Batch size from the training process
…
As for the sequence length, and the queries, keys, and values, we will be working with dummy data for the time being until we arrive at the stage of training the complete Transformer model in a separate tutorial, at which point we will be using actual sentences. Similarly for the mask, we will leave it set to its default value for the time being:
…
input_seq_length = 5 # Maximum length of the input sequence
queries = random.random((batch_size, input_seq_length, d_k))
keys = random.random((batch_size, input_seq_length, d_k))
values = random.random((batch_size, input_seq_length, d_v))
…
In the complete Transformer model, values for the sequence length, and the queries, keys, and values will be obtained through a process of word tokenization and embedding. We will be covering this in a separate tutorial.
Returning to our testing procedure, the next step is to create a new instance of the DotProductAttention class, assigning its output to the attention variable:
…
attention = DotProductAttention()
…
Since the DotProductAttention class inherits from the Layer base class, the call() method of the former will be automatically invoked by the magic __call__() method of the latter. The final step is to feed in the input arguments and print the result:
…
print(attention(queries, keys, values, d_k))
Tying everything together (with the DotProductAttention class defined earlier) produces the following code listing:
from numpy import random
input_seq_length = 5 # Maximum length of the input sequence
d_k = 64 # Dimensionality of the linearly projected queries and keys
d_v = 64 # Dimensionality of the linearly projected values
batch_size = 64 # Batch size from the training process
queries = random.random((batch_size, input_seq_length, d_k))
keys = random.random((batch_size, input_seq_length, d_k))
values = random.random((batch_size, input_seq_length, d_v))
attention = DotProductAttention()
print(attention(queries, keys, values, d_k))
Running this code produces an output of shape (batch size, sequence length, values dimensionality). Note that you will likely see a different output due to the random initialization of the queries, keys, and values.
tf.Tensor(
[[[0.60413814 0.52436507 0.46551135 … 0.5260341 0.33879933 0.43999898]
[0.60433316 0.52383804 0.465411 … 0.5262608 0.33915892 0.43782598]
[0.62321603 0.5349194 0.46824688 … 0.531323 0.34432083 0.43554053]
[0.60013235 0.54162943 0.47391182 … 0.53600514 0.33722004 0.4192218 ]
[0.6295709 0.53511244 0.46552944 … 0.5317217 0.3462567 0.43129003]]
…
[[0.20291057 0.18463902 0.641182 … 0.4706118 0.4194418 0.39908117]
[0.19932748 0.18717204 0.64831126 … 0.48373622 0.3995132 0.37968236]
[0.20611541 0.18079443 0.6374859 … 0.48258874 0.41704425 0.4016996 ]
[0.19703123 0.18210654 0.6400498 … 0.47037745 0.4257752 0.3962079 ]
[0.19237372 0.18474475 0.64944196 … 0.49497223 0.38804317 0.36352912]]],
shape=(64, 5, 64), dtype=float32)
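Beyond inspecting the shape, one sanity check we can run (our own addition, not part of the original listing) is to recompute the layer’s intermediate steps and confirm that the softmax turns every row of the attention scores into a valid probability distribution, so the output is a convex combination of the value vectors. Note that we cast the dummy inputs to float32 explicitly here, since outside the Layer base class there is no automatic dtype casting:

```python
import numpy as np
from tensorflow import matmul, math, cast, float32
from tensorflow.keras.backend import softmax

d_k, d_v, batch_size, input_seq_length = 64, 64, 64, 5
queries = np.random.random((batch_size, input_seq_length, d_k)).astype(np.float32)
keys = np.random.random((batch_size, input_seq_length, d_k)).astype(np.float32)
values = np.random.random((batch_size, input_seq_length, d_v)).astype(np.float32)

# Recompute the layer's intermediate steps so we can inspect the weights
scores = matmul(queries, keys, transpose_b=True) / math.sqrt(cast(d_k, float32))
weights = softmax(scores)
output = matmul(weights, values)

print(weights.shape)  # (64, 5, 5): one weight per query-key pair
print(bool(np.allclose(weights.numpy().sum(axis=-1), 1.0, atol=1e-5)))  # True
print(output.shape)   # (64, 5, 64)
```

Each row of weights summing to one is precisely what guarantees that every output element lies between the minimum and maximum of the corresponding value entries.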
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Books
Advanced Deep Learning with Python, 2019.
Transformers for Natural Language Processing, 2021.
Papers
Attention Is All You Need, 2017.
Summary
In this tutorial, you discovered how to implement scaled dot-product attention from scratch in TensorFlow and Keras.
Specifically, you learned:
The operations that form part of the scaled dot-product attention mechanism.
How to implement the scaled dot-product attention mechanism from scratch.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.