When an input is very long, RNNs struggle because they have no way of knowing how much weight to give each word in the sentence; in a long sentence, the words at the beginning may relate to the words at the end, but an RNN cannot capture this. That is why we use attention mechanisms, which let the model learn the weight of each token relative to the rest of the input.
In self-attention, our goal is to calculate context vectors z(i) for each element x(i) in the input sequence. A context vector can be interpreted as an enriched embedding vector.
Calculate the scores
The first step is to calculate the intermediate attention scores between the query token and each input token. We determine these scores by computing the dot product of the query, x(2), with every other input token.
To calculate the scores, we take the dot product of the query vector with each of the other vectors.
For example, suppose we want to get the attention scores (attn_scores_2) for vector 2 in the following tensor:
We use a for loop to compute the dot product of [0.55, 0.87, 0.66] with each of the other vectors.
Here we create an empty tensor with one entry per input row (6 rows in total) and then calculate each dot product with respect to the query, which is [0.55, 0.87, 0.66].
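A minimal sketch of this step, assuming an inputs tensor of six 3-dimensional token embeddings (only the second row, the query, is taken from the text above; the other values are illustrative):

```python
import torch

# Six 3-dimensional token embeddings; only the second row (the query x(2))
# comes from the text above, the other values are illustrative.
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],
     [0.55, 0.87, 0.66],   # x(2), the query
     [0.57, 0.85, 0.64],
     [0.22, 0.58, 0.33],
     [0.77, 0.25, 0.10],
     [0.05, 0.80, 0.55]]
)

query = inputs[1]                              # [0.55, 0.87, 0.66]
attn_scores_2 = torch.empty(inputs.shape[0])   # one score per input token (6 in total)
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)   # dot product of each token with the query
```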
Calculate the Attention Weights
To calculate the attention weights, we normalize the elements of attn_scores_2 so that they sum to 1.
Instead of normalizing manually, we can use the torch.softmax() function.
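For example, continuing with the attn_scores_2 tensor from above:

```python
# Naive normalization: divide by the sum so the weights add up to 1
attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()

# Preferred: softmax, which always yields positive weights and is more numerically stable
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
print(attn_weights_2.sum())   # tensor(1.)
```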
Calculating the context vector
We multiply each embedded input token by its corresponding attention weight and then sum the resulting vectors.
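A short sketch of this weighted sum, reusing inputs and attn_weights_2 from above:

```python
# Weight each input token by its attention weight and add everything up
context_vec_2 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i] * x_i
```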
Computing attention weights for all input tokens
We can get the same result for all tokens at once by doing:
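A sketch of the vectorized version, where each row of the result is one context vector:

```python
attn_scores = inputs @ inputs.T                     # all pairwise dot products
attn_weights = torch.softmax(attn_scores, dim=-1)   # normalize each row to sum to 1
all_context_vecs = attn_weights @ inputs            # one context vector per input token
```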
Rules for multiplying matrices
To multiply two matrices, say 𝐴 and 𝐵, they must meet the following condition:
The number of columns of A must be equal to the number of rows of B.
If 𝐴 has the shape (𝑚×𝑛) and 𝐵 has the shape (𝑛×𝑝), then the product 𝐶=𝐴⋅𝐵 will have the shape (𝑚×𝑝).
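A quick shape check of this rule in PyTorch:

```python
A = torch.rand(2, 3)   # m x n
B = torch.rand(3, 4)   # n x p: the 3 columns of A match the 3 rows of B
C = A @ B
print(C.shape)         # torch.Size([2, 4]), i.e. m x p
```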
Implementing self-attention with trainable weights
The most popular GPT models use a self-attention mechanism called scaled dot-product attention.
This mechanism introduces weight matrices that are updated during model training. These trainable weight matrices are crucial so that the model (specifically, the attention module inside the model) can learn to produce “good” context vectors.
Computing the attention weights
These matrices are used to project the input tokens into query, key, and value vectors.
Input parameters:
d_in: The dimensionality of the input embeddings.
d_out: The dimensionality of the output embeddings.
Weight matrices: W_query, W_key, and W_value are learnable parameters initialized randomly. These matrices transform the input into query, key, and value vectors.
The forward method defines how the input tensor x flows through the model.
The input x (shape [num_tokens, d_in], one row per token embedding) is projected into:
Keys: x @ W_key (shape [num_tokens, d_out])
Queries: x @ W_query (shape [num_tokens, d_out])
Values: x @ W_value (shape [num_tokens, d_out])
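Putting the description above together, a minimal sketch of such a class (the class name and the scaling by the square root of the key dimension are assumptions based on standard scaled dot-product attention):

```python
import torch
import torch.nn as nn

class SelfAttention_v1(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        # Randomly initialized trainable weight matrices
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key   = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        queries = x @ self.W_query
        keys    = x @ self.W_key
        values  = x @ self.W_value
        attn_scores = queries @ keys.T                   # dot products between all queries and keys
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1] ** 0.5, dim=-1  # scale by sqrt(d_out) before the softmax
        )
        return attn_weights @ values                     # context vectors, shape [num_tokens, d_out]
```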
A self-attention class using PyTorch’s Linear layers
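A sketch of the same idea using nn.Linear layers (with bias disabled) instead of raw nn.Parameter matrices, which mainly gives a better default weight initialization:

```python
class SelfAttention_v2(nn.Module):
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        queries = self.W_query(x)
        keys    = self.W_key(x)
        values  = self.W_value(x)
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
        return attn_weights @ values
```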
Causal attention AKA masked attention
It restricts the model to consider only the previous and current inputs in the sequence when computing attention scores for any given token.
For each token processed, we mask out the future tokens, i.e., those that come after the current token in the input text.
Implementing a simple mask
Multiply this mask with the attention weights to zero out the values above the diagonal:
Then renormalize the attention weights so that each row sums to 1.
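A sketch of this simple (but, as shown next, suboptimal) approach, reusing the inputs tensor and the SelfAttention_v2 sketch from earlier with d_in=3 and d_out=2:

```python
torch.manual_seed(123)
sa_v2 = SelfAttention_v2(d_in=3, d_out=2)
queries, keys = sa_v2.W_query(inputs), sa_v2.W_key(inputs)
attn_scores = queries @ keys.T
attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)

context_length = attn_scores.shape[0]
mask_simple = torch.tril(torch.ones(context_length, context_length))  # 1s on and below the diagonal
masked_simple = attn_weights * mask_simple                            # zero out future positions
row_sums = masked_simple.sum(dim=-1, keepdim=True)
masked_simple_norm = masked_simple / row_sums                         # each row sums to 1 again
```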
The softmax function takes a vector of values and converts them into a probability distribution. The formula is:
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$
1. Very negative values (−∞):
$e^{-\infty} = 0$, which completely removes the contribution of that position to the attention calculation. This means the masked position will have exactly 0 probability.
2. Zero values (0):
$e^{0} = 1$, meaning the masked position still contributes to the denominator. This can lead to an incorrect probability distribution because positions that should be ignored are still being considered.
The softmax function converts its inputs into a probability distribution. When negative infinity values ($-\infty$) are present in a row, the softmax function treats them as zero probability. (Mathematically, this is because $e^{-\infty}$ approaches 0.) We can implement this more efficient masking “trick” by creating a mask with 1s above the diagonal and then replacing these 1s with negative infinity (-inf) values:
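A sketch of the trick, reusing attn_scores, keys, and context_length from the snippet above:

```python
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)  # 1s above the diagonal
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)                  # replace those 1s with -inf
attn_weights = torch.softmax(masked / keys.shape[-1] ** 0.5, dim=-1)       # masked positions get 0 weight
```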
Masking additional attention weights with dropout
Dropout in deep learning is a technique where randomly selected hidden layer units are ignored during training, effectively “dropping” them out. This method helps prevent overfitting by ensuring that a model does not become overly reliant on any specific set of hidden layer units. It’s important to emphasize that dropout is only used during training and is disabled afterward.
Apply dropout to the attention weight matrix itself:
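For example (the 50% dropout rate is chosen only for illustration; GPT-style models typically use much lower rates):

```python
torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5)   # drops half of the attention weights at random during training
print(dropout(attn_weights))      # surviving weights are rescaled by 1 / 0.5 = 2
```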
Implementing a compact causal attention class
We will now incorporate the causal attention and dropout modifications into the SelfAttention Python class.
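To test the class on batched input, a toy batch can be built, for example, by stacking the inputs tensor twice along a new batch dimension:

```python
batch = torch.stack((inputs, inputs), dim=0)
print(batch.shape)   # torch.Size([2, 6, 3])
```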
This results in a three-dimensional tensor consisting of two input texts with six tokens each, where each token is a three-dimensional embedding vector:
Causal attention class:
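A sketch of such a class, combining the -inf mask (stored as a buffer) and dropout, plus a small usage example on the batch from above (the dropout rate of 0.0 and d_out=2 are illustrative choices):

```python
class CausalAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        # Causal mask with 1s above the diagonal, registered as a (non-trainable) buffer
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        queries = self.W_query(x)
        keys    = self.W_key(x)
        values  = self.W_value(x)
        attn_scores = queries @ keys.transpose(1, 2)                 # (b, num_tokens, num_tokens)
        attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf   # hide future tokens
        )
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)
        return attn_weights @ values                                 # (b, num_tokens, d_out)


torch.manual_seed(123)
ca = CausalAttention(d_in=3, d_out=2, context_length=batch.shape[1], dropout=0.0)
context_vecs = ca(batch)
print(context_vecs.shape)   # torch.Size([2, 6, 2])
```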
The resulting context vector is a three-dimensional tensor where each token is now represented by a two-dimensional embedding:
Multi-head attention
The main idea behind multi-head attention is to run the attention mechanism multiple times (in parallel) with different, learned linear projections—the results of multiplying the input data (like the query, key, and value vectors in attention mechanisms) by a weight matrix.
“Multi-head” refers to dividing the attention mechanism into multiple “heads,” each operating independently.
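A sketch of a simple wrapper that runs several of the causal-attention heads above in parallel and concatenates their outputs (the class name and usage values are illustrative):

```python
class MultiHeadAttentionWrapper(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        self.heads = nn.ModuleList(
            [CausalAttention(d_in, d_out, context_length, dropout, qkv_bias)
             for _ in range(num_heads)]
        )

    def forward(self, x):
        # Each head produces (b, num_tokens, d_out); concatenate along the feature dimension
        return torch.cat([head(x) for head in self.heads], dim=-1)


torch.manual_seed(123)
mha = MultiHeadAttentionWrapper(d_in=3, d_out=2, context_length=batch.shape[1],
                                dropout=0.0, num_heads=2)
context_vecs = mha(batch)
print(context_vecs.shape)   # torch.Size([2, 6, 4]): two heads of d_out=2 concatenated
```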
This results in the following tensor representing the context vectors:
Implementing multi-head attention with weight splits
On a big-picture level, in the previous MultiHeadAttentionWrapper, we stacked multiple single-head attention layers that we combined into a multi-head attention layer. The MultiHeadAttention class takes an integrated approach: it starts with a multi-head layer and then internally splits this layer into individual attention heads.
The splitting of the query, key, and value tensors is achieved through tensor reshaping and transposing operations using PyTorch's .view and .transpose methods. The input is first transformed (via linear layers for queries, keys, and values) and then reshaped to represent multiple heads.
The tensors are then transposed to bring the num_heads dimension before the num_tokens dimension, resulting in a shape of (b, num_heads, num_tokens, head_dim). This transposition is crucial for correctly aligning the queries, keys, and values across the different heads and performing batched matrix multiplications efficiently. To illustrate this batched matrix multiplication, suppose we have the following tensor:
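For example, with illustrative values:

```python
# A tensor with shape (b, num_heads, num_tokens, head_dim) = (1, 2, 3, 4)
a = torch.rand(1, 2, 3, 4)

# Batched matrix multiplication: only the last two dimensions are multiplied,
# independently for each of the b * num_heads (batch, head) combinations
print((a @ a.transpose(2, 3)).shape)   # torch.Size([1, 2, 3, 3])
```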