Transformers for NLP
- May 19, 2020
- Posted by: vsinghal
- Category: Natural Language Processing
#CellStratAILab #disrupt4.0 #WeCreateAISuperstars #WhereLearningNeverStops
Last Saturday, our AI Lab Researcher Indrajit Singh presented a comprehensive webinar on Transformers, which are widely used in NLP.
To combine the advantages of both CNNs and RNNs, [Vaswani et al., 2017] designed a novel architecture using the attention mechanism.
This architecture, called the Transformer, achieves parallelization by replacing recurrence over the sequence with attention, while still encoding each item's position in the sequence. As a result, the Transformer yields a comparable model with a significantly shorter training time.
Similar to the seq2seq model, the Transformer is also based on the encoder-decoder architecture. However, it differs from the former by replacing the recurrent layers in seq2seq with multi-head attention layers, incorporating position-wise information through positional encoding, and applying layer normalization. Below, we compare the Transformer and seq2seq side by side.
A recurrent layer in seq2seq is replaced by a Transformer block. For the encoder, this block contains a multi-head attention layer and a position-wise feed-forward network with two dense layers.
For the decoder, another multi-head attention layer is used to take the encoder state.
Add and norm:
The inputs and outputs of both the multi-head attention layer and the position-wise feed-forward network are each processed by an "add and norm" layer, which contains a residual connection followed by layer normalization.
Since the self-attention layer does not distinguish the item order in a sequence, a positional encoding layer is used to add sequential information into each sequence item.
Transformer and Seq2seq
Overall, these two models are similar to each other: the source sequence embeddings are fed into n repeated blocks.
The outputs of the last block are then used as attention memory for the decoder.
The target sequence embeddings are similarly fed into n repeated blocks in the decoder, and the final outputs are obtained by applying a dense layer with vocabulary size to the last block’s outputs.
The self-attention model is an ordinary attention model in which the query, the key, and the value are all copies of the same sequential input.
As we illustrate in the image below, self-attention outputs a sequential output of the same length as its input. Compared with a recurrent layer, the output items of a self-attention layer can be computed in parallel, so it is easy to obtain a highly efficient implementation.
Multi-head attention layer :-
The multi-head attention layer consists of h parallel self-attention layers, each of which is called a head. For each head, before feeding into the attention layer, we project the queries, keys, and values with three dense layers with hidden sizes pq, pk, and pv respectively. The outputs of these h attention heads are concatenated and then passed through a final dense layer.
Assume that the dimensions of a query, a key, and a value are dq, dk, and dv respectively. Then, for each head i = 1, …, h, we can train learnable parameters Wq(i) ∈ R^(pq×dq), Wk(i) ∈ R^(pk×dk), and Wv(i) ∈ R^(pv×dv).
Therefore, the output for each head is o(i) = attention(Wq(i)q, Wk(i)k, Wv(i)v),
where attention can be any attention layer, such as DotProductAttention or MLPAttention.
After that, the outputs of length pv from the h attention heads are concatenated into an output of length h·pv, which is then passed through a final dense layer with do hidden units. The weights of this dense layer can be denoted by Wo ∈ R^(do×h·pv).
As a result, the multi-head attention output will be o = Wo[o(1); … ; o(h)], i.e., the concatenation of the head outputs projected by Wo.
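Since the heads build on an attention primitive such as DotProductAttention, here is a minimal MXNet Gluon sketch of masked softmax and dot-product attention, along the lines of the gluon.ai implementation this walkthrough follows (the MXNet Colab notebook in the references has the full version). Note that self-attention is simply this layer called with the same tensor as query, key, and value, and that the single batched matrix multiplication is what lets all positions be computed in parallel. The imports here are shared by all the snippets that follow.

```python
import math
from mxnet import autograd, np, npx
from mxnet.gluon import nn
npx.set_np()

def masked_softmax(X, valid_len):
    # X: scores of shape (batch_size, num_queries, num_keys);
    # valid_len: (batch_size,) or (batch_size, num_queries), or None
    if valid_len is None:
        return npx.softmax(X)
    shape = X.shape
    if valid_len.ndim == 1:
        valid_len = valid_len.repeat(shape[1], axis=0)
    else:
        valid_len = valid_len.reshape(-1)
    # Out-of-range scores get a large negative value, so their
    # softmax weights become effectively zero
    X = npx.sequence_mask(X.reshape(-1, shape[-1]), valid_len,
                          True, value=-1e6, axis=1)
    return npx.softmax(X).reshape(shape)

class DotProductAttention(nn.Block):
    def __init__(self, dropout, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)
        self.dropout = nn.Dropout(dropout)

    # query: (batch_size, num_queries, d); key, value: (batch_size, num_kv, d)
    def forward(self, query, key, value, valid_len=None):
        d = query.shape[-1]
        # One batched matmul scores every query against every key,
        # which is why all positions are processed in parallel
        scores = npx.batch_dot(query, key, transpose_b=True) / math.sqrt(d)
        attention_weights = self.dropout(masked_softmax(scores, valid_len))
        return npx.batch_dot(attention_weights, value)
```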
Multi-head Attention class :-
Assume that the multi-head attention contains num_heads = h heads, and that the hidden size num_hiddens = pq = pk = pv is the same for the query, key, and value dense layers. In addition, since the multi-head attention keeps the same dimensionality between its input and its output, the output feature size is do = num_hiddens as well.
Below is a sketch of the class definition, its forward function, and the transpose helper functions :-
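This sketch reuses the imports and DotProductAttention from the previous snippet. Instead of looping over h separate heads, it folds the heads into the batch dimension with two transpose helpers, so one attention call serves all heads at once.

```python
def transpose_qkv(X, num_heads):
    # (batch_size, seq_len, num_hiddens) ->
    # (batch_size * num_heads, seq_len, num_hiddens / num_heads),
    # ordered so that all heads of example 0 come first
    X = X.reshape(X.shape[0], X.shape[1], num_heads, -1)
    X = X.transpose(0, 2, 1, 3)
    return X.reshape(-1, X.shape[2], X.shape[3])

def transpose_output(X, num_heads):
    # Reverse of transpose_qkv: concatenate the heads back together
    X = X.reshape(-1, num_heads, X.shape[1], X.shape[2])
    X = X.transpose(0, 2, 1, 3)
    return X.reshape(X.shape[0], X.shape[1], -1)

class MultiHeadAttention(nn.Block):
    def __init__(self, num_hiddens, num_heads, dropout, use_bias=False,
                 **kwargs):
        super(MultiHeadAttention, self).__init__(**kwargs)
        self.num_heads = num_heads
        self.attention = DotProductAttention(dropout)
        # Each dense layer outputs num_hiddens units, which are then
        # split evenly across the heads by transpose_qkv
        self.W_q = nn.Dense(num_hiddens, use_bias=use_bias, flatten=False)
        self.W_k = nn.Dense(num_hiddens, use_bias=use_bias, flatten=False)
        self.W_v = nn.Dense(num_hiddens, use_bias=use_bias, flatten=False)
        self.W_o = nn.Dense(num_hiddens, use_bias=use_bias, flatten=False)

    def forward(self, query, key, value, valid_len):
        # Project, then fold the heads into the batch dimension
        query = transpose_qkv(self.W_q(query), self.num_heads)
        key = transpose_qkv(self.W_k(key), self.num_heads)
        value = transpose_qkv(self.W_v(value), self.num_heads)
        if valid_len is not None:
            # Each example's valid length applies to all of its heads
            valid_len = valid_len.repeat(self.num_heads, axis=0)
        output = self.attention(query, key, value, valid_len)
        # Concatenate the heads and apply the final dense layer W_o
        return self.W_o(transpose_output(output, self.num_heads))
```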
MultiHeadAttention model: toy example
Let us test the MultiHeadAttention model on a toy example.
We create a multi-head attention layer with the hidden size do = 100; the output will share the same batch size and sequence length as the input, but its last dimension will be equal to num_hiddens = 100.
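A minimal run of this toy example (num_heads = 10 is an assumption here; it just has to divide num_hiddens = 100):

```python
cell = MultiHeadAttention(num_hiddens=100, num_heads=10, dropout=0.5)
cell.initialize()
X = np.ones((2, 4, 5))        # (batch_size, seq_len, feature_size)
valid_len = np.array([2, 3])  # number of valid items in each example
print(cell(X, X, X, valid_len).shape)  # -> (2, 4, 100)
```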
Position-wise Feed-Forward Networks
Another key component in the Transformer block is the position-wise feed-forward network (FFN).
It accepts a 3-dimensional input with shape (batch size, sequence length, feature size).
The position-wise FFN consists of two dense layers that apply to the last dimension. Since the same two dense layers are used at every position in the sequence, we refer to it as position-wise. Indeed, it is equivalent to applying two 1×1 convolution layers.
Let's see how to implement a position-wise FFN with two dense layers of hidden sizes ffn_num_hiddens and pw_num_outputs, respectively.
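A minimal sketch, in which flatten=False is what restricts each dense layer to the last axis:

```python
class PositionWiseFFN(nn.Block):
    def __init__(self, ffn_num_hiddens, pw_num_outputs, **kwargs):
        super(PositionWiseFFN, self).__init__(**kwargs)
        # flatten=False applies the same dense transformation
        # independently at every position in the sequence
        self.dense1 = nn.Dense(ffn_num_hiddens, flatten=False,
                               activation='relu')
        self.dense2 = nn.Dense(pw_num_outputs, flatten=False)

    def forward(self, X):
        return self.dense2(self.dense1(X))
```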
Similar to the multi-head attention, the position-wise feed-forward network will only change the last dimension size of the input—the feature dimension.
In addition, if two items in the input sequence are identical, the corresponding outputs will be identical as well.
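A quick check of both properties with the sketch above: feeding a batch whose positions all hold the same vector yields identical rows in the output.

```python
ffn = PositionWiseFFN(ffn_num_hiddens=4, pw_num_outputs=8)
ffn.initialize()
X = np.ones((2, 3, 4))   # every position holds the same feature vector
print(ffn(X)[0])         # -> three identical rows, each of width 8
```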
Add and Norm :-
Another component that plays a key role in the Transformer block is "add and norm", which is used to connect the inputs and outputs of the other layers smoothly.
Concretely, we add a layer that contains a residual connection and layer normalization after both the multi-head attention layer and the position-wise FFN. Layer normalization is similar to batch normalization.
One difference is that the mean and variance for layer normalization are calculated along the last dimension, e.g., X.mean(axis=-1), instead of along the first (batch) dimension, e.g., X.mean(axis=0).
Layer normalization prevents the range of values in the layers from changing too much, which allows faster training and better generalization ability.
Example Code: MXNet has both LayerNorm and BatchNorm implemented within the nn block. Let us call both of them and see the difference in the example below.
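A minimal version of that comparison, reusing the imports from the earlier snippets:

```python
layer = nn.LayerNorm()
layer.initialize()
batch = nn.BatchNorm()
batch.initialize()
X = np.array([[1, 2], [2, 3]])
# BatchNorm uses batch statistics in training mode, so record gradients
with autograd.record():
    print('layer norm:', layer(X))  # normalizes each row (feature axis)
    print('batch norm:', batch(X))  # normalizes each column (batch axis)
```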
Let’s implement the connection block AddNorm together.
AddNorm accepts two inputs, X and Y. We can deem X as the original input in the residual network, and Y as the outputs from either the multi-head attention layer or the position-wise FFN network. In addition, we apply dropout on Y for regularization.
Due to the residual connection, X and Y should have the same shape.
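A sketch of AddNorm under those assumptions:

```python
class AddNorm(nn.Block):
    def __init__(self, dropout, **kwargs):
        super(AddNorm, self).__init__(**kwargs)
        self.dropout = nn.Dropout(dropout)
        self.ln = nn.LayerNorm()

    def forward(self, X, Y):
        # Residual connection followed by layer normalization
        return self.ln(self.dropout(Y) + X)
```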
Recap of Skip connections and Residual blocks
Usually, a deep learning model learns a mapping M from an input x to an output y, i.e., y = M(x).
Instead of learning the direct mapping, the residual function learns the difference between the desired mapping applied to x and the original input x, i.e., F(x) = M(x) − x.
The skip-layer connection then recovers the output as y = F(x) + x.
The Residual block looks like this :-
Unlike the recurrent layer, both the multi-head attention layer and the position-wise feed-forward network compute the output of each item in the sequence independently.
- This feature enables us to parallelize the computation, but it fails to model the sequential information for a given sequence. To better capture the sequential information, the Transformer model uses the positional encoding to maintain the positional information of the input sequence.
- For better understanding, let's assume X ∈ R^(l×d) is the embedding of an example, where l is the sequence length and d is the embedding size.
- The positional encoding layer encodes X's position in P ∈ R^(l×d) and outputs P + X.
The position P is a 2-D matrix, where i refers to the order in the sentence and j refers to the position along the embedding dimension. In this way, each value in the original sequence is maintained using the equations below :-
P(i, 2j) = sin(i / 10000^(2j/d)) and P(i, 2j+1) = cos(i / 10000^(2j/d)).
The figure below illustrates the positional encoding :
Positional Encoding class :-
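A sketch of the class, following the sinusoidal equations above; P is precomputed once up to an assumed max_len of 1000 and never trained:

```python
class PositionalEncoding(nn.Block):
    def __init__(self, num_hiddens, dropout, max_len=1000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(dropout)
        # Fixed (non-trainable) table of encodings for max_len positions
        self.P = np.zeros((1, max_len, num_hiddens))
        X = np.arange(0, max_len).reshape(-1, 1) / np.power(
            10000, np.arange(0, num_hiddens, 2) / num_hiddens)
        self.P[:, :, 0::2] = np.sin(X)  # even dimensions
        self.P[:, :, 1::2] = np.cos(X)  # odd dimensions

    def forward(self, X):
        # Add the encoding of each position to its embedding
        X = X + self.P[:, :X.shape[1], :].as_in_ctx(X.ctx)
        return self.dropout(X)
```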
Let's test the PositionalEncoding class with a toy example and look at 4 of its dimensions. As we can see, the 4th dimension has the same frequency as the 5th but a different offset (one uses sine, the other cosine), while the 6th and 7th dimensions have a lower frequency.
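A minimal version of that test; the frequencies and offsets are easiest to see by printing or plotting columns 4-7:

```python
pe = PositionalEncoding(num_hiddens=20, dropout=0)
pe.initialize()
Y = pe(np.zeros((1, 100, 20)))  # encodings of 100 positions in 20 dims
print(Y[0, :8, 4:8])            # columns 4-7 across the first 8 positions
```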
In a recurrent seq2seq model, the encoder simply takes the input data and passes the last state of its recurrent layer as an initial state to the first recurrent layer of the decoder. The Transformer encoder stack, in contrast, works as follows:
The word embeddings of the input sequence are passed to the first encoder.
These are then transformed and propagated to the next encoder. (Multi Layer Encoder).
The output from the last encoder in the encoder-stack is passed to all the decoders in the decoder-stack as shown in the figure below:
Let us now build the Transformer encoder block.
This encoder contains a multi-head attention layer, a position-wise feed-forward network, and two “add and norm” connection blocks.
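A sketch of EncoderBlock, composed from the MultiHeadAttention, PositionWiseFFN, and AddNorm sketches above:

```python
class EncoderBlock(nn.Block):
    def __init__(self, num_hiddens, ffn_num_hiddens, num_heads, dropout,
                 **kwargs):
        super(EncoderBlock, self).__init__(**kwargs)
        self.attention = MultiHeadAttention(num_hiddens, num_heads, dropout)
        self.addnorm1 = AddNorm(dropout)
        self.ffn = PositionWiseFFN(ffn_num_hiddens, num_hiddens)
        self.addnorm2 = AddNorm(dropout)

    def forward(self, X, valid_len):
        # Self-attention sub-layer, then position-wise FFN sub-layer,
        # each wrapped in "add and norm"
        Y = self.addnorm1(X, self.attention(X, X, X, valid_len))
        return self.addnorm2(Y, self.ffn(Y))
```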
As shown in the code, for both the attention model and the position-wise FFN model in the EncoderBlock, the output dimension is equal to num_hiddens. This is due to the nature of the residual block, since we need to add these outputs back to the original input during "add and norm".
Due to the residual connections, the encoder block will not change the input shape.
It simply means that the num_hiddens argument should be equal to the input size of the last dimension. In our toy example below, num_hiddens = 24, ffn_num_hiddens = 48, num_heads = 8, and dropout = 0.5.
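The toy example, run with the sketches above:

```python
X = np.ones((2, 100, 24))
valid_len = np.array([3, 2])
encoder_blk = EncoderBlock(num_hiddens=24, ffn_num_hiddens=48,
                           num_heads=8, dropout=0.5)
encoder_blk.initialize()
print(encoder_blk(X, valid_len).shape)  # -> (2, 100, 24), same as input
```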
Entire Transformer Encoder :-
With the Transformer encoder, n blocks of EncoderBlock stack up one after another.
Note: Because of the residual connection, the embedding layer size d is the same as the Transformer block output size.
Also note that we multiply the embedding output by √d to prevent its values from being too small.
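A sketch of the full encoder under the same assumptions:

```python
class TransformerEncoder(nn.Block):
    def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens,
                 num_heads, num_layers, dropout, **kwargs):
        super(TransformerEncoder, self).__init__(**kwargs)
        self.num_hiddens = num_hiddens
        self.embed = nn.Embedding(vocab_size, num_hiddens)
        self.pos_encoding = PositionalEncoding(num_hiddens, dropout)
        self.blks = nn.Sequential()
        for _ in range(num_layers):
            self.blks.add(EncoderBlock(num_hiddens, ffn_num_hiddens,
                                       num_heads, dropout))

    def forward(self, X, valid_len, *args):
        # Scale the embeddings by sqrt(d) so they are not dwarfed
        # by the positional encodings
        X = self.pos_encoding(self.embed(X) * math.sqrt(self.num_hiddens))
        for blk in self.blks:
            X = blk(X, valid_len)
        return X
```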
Let us create an encoder with two stacked Transformer encoder blocks, whose hyperparameters are the same as before.
Similar to the previous toy example's parameters, we add two more parameters: vocab_size = 200 and num_layers = 2.
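A minimal run, with valid_len as in the previous toy example:

```python
encoder = TransformerEncoder(vocab_size=200, num_hiddens=24,
                             ffn_num_hiddens=48, num_heads=8,
                             num_layers=2, dropout=0.5)
encoder.initialize()
print(encoder(np.ones((2, 100)), valid_len).shape)  # -> (2, 100, 24)
```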
The Transformer decoder block is almost identical to the encoder block.
Besides the two sub-layers (the multi-head self-attention layer and the position-wise feed-forward network), the decoder Transformer block contains a third sub-layer, which applies multi-head attention to the output of the encoder stack.
Similar to the Transformer encoder block, the Transformer decoder block employs “add and norm”, i.e., the residual connections and the layer normalization to connect each of the sub-layers.
To be specific, at timestep t, assume that Xt is the current input, i.e., the query. As illustrated in the figure, the keys and values of the self-attention layer consist of the current query together with all the past queries X1, …, Xt−1.
During training, the output for the t-th query could observe all the key-value pairs, which would result in a behavior different from prediction.
Thus, during training we eliminate the unnecessary information by specifying the valid length to be t for the t-th query, so that it cannot attend to positions after it; during prediction, the decoder only sees the tokens generated so far anyway.
Decoder block class and its forward function :-
Similar to the Transformer encoder block, num_hiddens should be equal to the last dimension size of X.
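A sketch of DecoderBlock along the same lines. Here state carries the encoder outputs, the encoder-side valid length, and one cache slot per block for the keys and values seen so far:

```python
class DecoderBlock(nn.Block):
    # i is the index of this block in the decoder stack
    def __init__(self, num_hiddens, ffn_num_hiddens, num_heads, dropout,
                 i, **kwargs):
        super(DecoderBlock, self).__init__(**kwargs)
        self.i = i
        self.attention1 = MultiHeadAttention(num_hiddens, num_heads, dropout)
        self.addnorm1 = AddNorm(dropout)
        self.attention2 = MultiHeadAttention(num_hiddens, num_heads, dropout)
        self.addnorm2 = AddNorm(dropout)
        self.ffn = PositionWiseFFN(ffn_num_hiddens, num_hiddens)
        self.addnorm3 = AddNorm(dropout)

    def forward(self, X, state):
        enc_outputs, env_valid_len = state[0], state[1]
        # state[2][self.i] caches the queries seen so far by this block;
        # they become the keys and values of the masked self-attention
        if state[2][self.i] is None:
            key_values = X
        else:
            key_values = np.concatenate((state[2][self.i], X), axis=1)
        state[2][self.i] = key_values
        if autograd.is_training():
            batch_size, seq_len, _ = X.shape
            # Position t may only attend to positions 1..t
            valid_len = np.tile(np.arange(1, seq_len + 1, ctx=X.ctx),
                                (batch_size, 1))
        else:
            valid_len = None
        X2 = self.attention1(X, key_values, key_values, valid_len)
        Y = self.addnorm1(X, X2)
        # Encoder-decoder attention: queries from the decoder,
        # keys and values from the encoder outputs
        Y2 = self.attention2(Y, enc_outputs, enc_outputs, env_valid_len)
        Z = self.addnorm2(Y, Y2)
        return self.addnorm3(Z, self.ffn(Z)), state
```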
The construction of the entire Transformer decoder is identical to the Transformer encoder, except for the additional dense layer to obtain the output confidence scores.
Transformer Decoder class :-
Implementing the class for the Transformer decoder, TransformerDecoder.
Besides the regular hyperparameters such as the vocab_size and num_hiddens, the Transformer decoder also needs the Transformer encoder’s outputs enc_outputs and env_valid_len.
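A sketch of TransformerDecoder, reusing the DecoderBlock above (env_valid_len is the encoder-side valid length):

```python
class TransformerDecoder(nn.Block):
    def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens,
                 num_heads, num_layers, dropout, **kwargs):
        super(TransformerDecoder, self).__init__(**kwargs)
        self.num_hiddens = num_hiddens
        self.num_layers = num_layers
        self.embed = nn.Embedding(vocab_size, num_hiddens)
        self.pos_encoding = PositionalEncoding(num_hiddens, dropout)
        self.blks = nn.Sequential()
        for i in range(num_layers):
            self.blks.add(DecoderBlock(num_hiddens, ffn_num_hiddens,
                                       num_heads, dropout, i))
        # The extra dense layer that maps features to vocabulary scores
        self.dense = nn.Dense(vocab_size, flatten=False)

    def init_state(self, enc_outputs, env_valid_len, *args):
        # One None slot per block for the incrementally cached key-values
        return [enc_outputs, env_valid_len, [None] * self.num_layers]

    def forward(self, X, state):
        X = self.pos_encoding(self.embed(X) * math.sqrt(self.num_hiddens))
        for blk in self.blks:
            X, state = blk(X, state)
        return self.dense(X), state
```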
The Final Linear and Softmax Layer
The decoder stack outputs a vector of floats. How do we turn that into a word? That's the job of the final Linear layer, which is followed by a Softmax layer.
The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders into a much, much larger vector called a logits vector.
Let’s assume that our model knows 10,000 unique English words (our model’s “output vocabulary”) that it’s learned from its training dataset. This would make the logits vector 10,000 cells wide – each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the Linear layer.
The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
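A toy sketch of these last two steps (the sizes here are hypothetical, reusing the MXNet imports from earlier):

```python
vocab_size = 10000                     # hypothetical output vocabulary
final_linear = nn.Dense(vocab_size, flatten=False)
final_linear.initialize()

decoder_output = np.random.uniform(size=(1, 1, 24))  # one decoding step
logits = final_linear(decoder_output)  # (1, 1, 10000) raw scores
probs = npx.softmax(logits)            # all positive, sums to 1.0
next_word_id = probs.argmax(axis=-1)   # greedy choice of the output word
```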
During training, an untrained model would go through the exact same forward pass. But since we are training it on a labeled training dataset, we can compare its output with the actual correct output.
Limitations and Conclusion
The Transformer is undoubtedly a huge improvement over RNN-based seq2seq models. But it comes with its own share of limitations:
- Attention can only deal with fixed-length text strings. The text has to be split into a certain number of segments or chunks before being fed into the system as input. This chunking of text causes context fragmentation.
- For example, if a sentence is split from the middle, then a significant amount of context is lost. In other words, the text is split without respecting the sentence or any other semantic boundary.
- The Transformer model is based on the encoder-decoder architecture.
- Multi-head attention layer contains h parallel attention layers.
- Position-wise feed-forward network consists of two dense layers that apply to the last dimension.
- Layer normalization differs from batch normalization by normalizing along the last dimension (the feature dimension) instead of the first (batch size) dimension.
- Positional encoding is the only place that adds positional information to the Transformer model.
References :-
1. Attention Is All You Need: https://arxiv.org/pdf/1706.03762.pdf
2. Training Tips for the Transformer Model: https://arxiv.org/pdf/1804.00247.pdf
3. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: https://arxiv.org/pdf/1810.04805.pdf
4. Transformers: http://gluon.ai
5. Colab Notebook (TensorFlow): https://colab.research.google.com/drive/1L2PfWTRvmV0vNyPxrl5HKMWaRjKfe0re?usp=sharing
6. Colab Notebook (MXNet): https://colab.research.google.com/drive/1zwvvNSAaRHCH0fRVQbigou4LXIsN21d_?usp=sharing