# Sequence to Sequence Learning and Attention in Neural Networks

- May 7, 2020
- Posted by: vsinghal
- Category: Natural Language Processing

*#CellStratAILab #disrupt4.0 #WeCreateAISuperstars #AlwaysUpskilling*

Last Saturday, 2 May 2020, our AI Lab Researcher **Indrajit Singh** presented a fabulous session on “**Sequence to Sequence Learning and Attention in Neural Networks**”.

## Sequence to Sequence Model :-

Sequence-to-sequence learning (Seq2Seq) is about training models to convert sequences from one domain (e.g. sentences in English) to sequences in another domain (e.g. the same sentences translated to French).

In the general case, input sequences and output sequences have different lengths (e.g. machine translation) and the entire input sequence is required in order to start predicting the target. This requires a more advanced setup, which is what people commonly refer to when mentioning “sequence to sequence models” with no further context.

- An RNN layer (or stack thereof) acts as the “**encoder**”: it processes the input sequence and returns its own internal state. Note that we discard the outputs of the encoder RNN, only recovering the state. This state will serve as the “context”, or “conditioning”, of the decoder in the next step.
- Another RNN layer (or stack thereof) acts as the “**decoder**”: it is trained to predict the next characters of the target sequence, given previous characters of the target sequence. Specifically, it is trained to turn the target sequences into the same sequences but offset by one timestep in the future, a training process called “teacher forcing” in this context.

Importantly, the decoder uses the state vectors from the encoder as its initial state, which is how the decoder obtains information about what it is supposed to generate. Effectively, the decoder learns to generate targets [t+1…] given targets […t], conditioned on the input sequence.
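To make the “offset by one timestep” idea concrete, here is a minimal sketch in plain Python. The start and end markers (`\t` and `\n`) and the helper name `make_teacher_forcing_pair` are assumptions chosen for illustration; they are not taken from the session.

```python
START, END = "\t", "\n"  # assumed start/end-of-sequence markers

def make_teacher_forcing_pair(target_text):
    """Return (decoder_input, decoder_target) as lists of characters.

    decoder_target is decoder_input shifted one timestep into the future.
    """
    full = [START] + list(target_text) + [END]
    decoder_input = full[:-1]   # what the decoder sees at each step
    decoder_target = full[1:]   # what it is trained to predict at that step
    return decoder_input, decoder_target

decoder_input, decoder_target = make_teacher_forcing_pair("chat")
for seen, expected in zip(decoder_input, decoder_target):
    print(repr(seen), "->", repr(expected))
```

During training the decoder is always fed the true previous character, not its own prediction, which is what distinguishes teacher forcing from the inference loop.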

### The RNN model :-

In the above diagram, a chunk of neural network, A, looks at some input *X*_{t} and outputs a value *h*_{t}. A loop allows information to be passed from one step of the network to the next.

The Recurrent Neural Network (RNN) is a natural generalization of feedforward neural networks to sequences. Given a sequence of inputs *(x1, . . . , xT )*, a standard RNN computes a sequence of outputs *(y1, . . . , yT )* by iterating the following equation:

$$h_t = f(h_{t-1}, x_t)$$

Here, *h*_{t} is the new state, *h*_{t-1} is the previous state and *x*_{t} is the current input.

Let’s say that the activation function is the sigmoid *σ*, the weight at the recurrent neuron is *W*^{hh} and the weight at the input neuron is *W*^{hx}. We can then write the equation for the state at time *t* as:

$$h_t = \sigma\left(W^{hh} h_{t-1} + W^{hx} x_t\right)$$

The recurrent neuron in this case takes only the immediately preceding state into consideration. For longer sequences the equation can involve multiple such states. Once the final state is calculated we can go on to produce the output.

Once the current state is calculated, we can calculate the output as:

$$y_t = W^{hy} h_t$$

where *W*^{hy} denotes the weight at the output layer.

To summarize :-

- A single time step of the input, *x*_{t}, is supplied to the network.
- We then calculate its current state *h*_{t} using a combination of the current input and the previous state.
- The current *h*_{t} becomes *h*_{t-1} for the next time step.
- We can go as many time steps as the problem demands and combine the information from all the previous states.
- Once all the time steps are completed, the final current state is used to calculate the output *y*_{t}.
- The output is then compared to the actual output and the error is generated.
- The error is then backpropagated through the network to update the weights, and the network is trained.
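The forward part of this summary can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy weight values, the dimensions, the name `W_hy` for the output weight, and the squared-error loss are all made up for the example, and backpropagation is left out.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_forward(xs, W_hh, W_hx, W_hy, h0):
    """Run a vanilla RNN over the inputs xs and return (final_state, output)."""
    h = h0
    for x in xs:  # one time step of the input at a time
        pre = [a + b for a, b in zip(matvec(W_hh, h), matvec(W_hx, x))]
        h = [sigmoid(p) for p in pre]  # the new h_t becomes h_{t-1} for the next step
    y = matvec(W_hy, h)  # output computed from the final state
    return h, y

# Toy setup (assumed): 2-dimensional inputs, 3-dimensional state, 1 output.
W_hh = [[0.1, 0.0, -0.2], [0.0, 0.3, 0.1], [0.2, -0.1, 0.0]]
W_hx = [[0.5, -0.5], [0.1, 0.2], [-0.3, 0.4]]
W_hy = [[1.0, -1.0, 0.5]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

h_final, y = rnn_forward(xs, W_hh, W_hx, W_hy, h0=[0.0, 0.0, 0.0])
error = (y[0] - 1.0) ** 2  # squared error against an assumed target of 1.0
print(h_final, y, error)
```

In a real network the error would then be backpropagated through all the time steps to update the three weight matrices.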

### The LSTM model :-

In the Seq2Seq paper (Sutskever et al., 2014), the LSTM estimates the conditional probability of the output sequence by first obtaining a fixed-dimensional representation *v* of the input sequence, given by the last hidden state of the encoder LSTM, and then computing:

$$p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \ldots, y_{t-1})$$

In this equation, each *p(y*_{t}* | v, y*_{1}*, …, y*_{t-1}*)* distribution is represented with a softmax over all the words in the vocabulary.

First, the authors use two different LSTMs: one for the input sequence and another for the output sequence, because doing so increases the number of model parameters at negligible computational cost and makes it natural to train the LSTM on multiple language pairs simultaneously.

Second, reversing the order of the words in the input sentence turns out to be valuable. So for example, instead of mapping the sentence *a, b, c* to the sentence *α, β, γ*, the LSTM is asked to map *c, b, a* to *α, β, γ*, where *α, β, γ* is the translation of *a, b, c*. This way, *a* is in close proximity to *α*, *b* is fairly close to *β*, and so on, a fact that makes it easy for SGD to “establish communication” between the input and the output.
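As a small illustration of the input-reversal trick, the preprocessing step might look like the sketch below. The helper name `reverse_source` and the token-list data layout are assumptions made for the example.

```python
def reverse_source(pairs):
    """Reverse the token order of each source sentence; leave targets untouched."""
    return [(list(reversed(src)), tgt) for src, tgt in pairs]

pairs = [(["a", "b", "c"], ["α", "β", "γ"])]
print(reverse_source(pairs))
```

Only the source side is reversed, so the first source word ends up right next to the first target word when the decoder starts.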

### The S2S model :-

The s2s neural network is based on an RNN encoder and decoder. An input sequence is fed one symbol at a time to the encoder RNN (blue) to produce a sequence vector *Se*. The decoder is auto-regressive: it takes the previous decoder output and *Se* to produce one output symbol at a time.

The RNN encoder in the figure has two layers. One is an **embedding** layer, which translates each input into a fixed code vector; this is called the input embedding. The encoder produces a sequence vector *Se*. The decoder takes *Se* and the embedding of the previous decoder output to produce one output symbol at a time. The decoder in the figure also has two layers: lighter green for the RNN and darker green for the output network.

To summarize :-

- Encode the input sequence into state vectors.
- Start with a target sequence of size 1 (just the start-of-sequence character)
- Feed the state vectors and 1-char target sequence to the decoder to produce predictions for the next character.
- Sample the next character using these predictions (we simply use argmax).
- Append the sampled character to the target sequence
- Repeat until we generate the end-of-sequence character or we hit the character limit.
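The decoding steps above can be sketched as a simple loop. To keep the example runnable without a trained network, `toy_encode` and `toy_decoder_step` are hypothetical stand-ins for the real encoder and decoder (the toy “model” simply emits the upper-cased input); only the loop structure is the point here.

```python
START, END = "\t", "\n"  # assumed start/end-of-sequence characters

def toy_encode(text):
    """Stand-in for the encoder: the 'state' is the text to emit plus a position."""
    return (text.upper(), 0)

def toy_decoder_step(state, prev_char):
    """Stand-in for the decoder: return next-character probabilities and a new state."""
    text, pos = state
    best = text[pos] if pos < len(text) else END
    probs = {ch: 0.01 for ch in set(text) | {END}}
    probs[best] = 0.9  # the toy model is confident about exactly one character
    return probs, (text, pos + 1)

def greedy_decode(input_text, max_len=20):
    state = toy_encode(input_text)        # 1. encode the input into state vectors
    target = START                        # 2. target sequence of size 1
    while len(target) - 1 < max_len:      # 6. stop at the character limit...
        probs, state = toy_decoder_step(state, target[-1])  # 3. predict next char
        next_char = max(probs, key=probs.get)               # 4. argmax "sampling"
        if next_char == END:              # 6. ...or at the end-of-sequence character
            break
        target += next_char               # 5. append to the target sequence
    return target[1:]  # drop the start character

print(greedy_decode("hello"))  # prints HELLO
```

Swapping the argmax for beam search or random sampling only changes step 4; the rest of the loop stays the same.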

Training Steps for a Seq-2-Seq Model :-

## Attention :-

In psychology, **attention** is the cognitive process of selectively concentrating on one or a few things while ignoring others.

Although the basic RNN model struggles with longer sequences, its special variant, the LSTM, works better and has achieved remarkable results. Such models have proved very powerful, and as a result we’ve seen a growing number of attempts to augment RNNs with new properties.

Here are some examples of Recurrent Neural Networks augmented with Attention :

### Attentional Interfaces :-

Let’s look at some examples :

- When you translate a sentence, you pay special attention to the word you are presently translating.
- When you’re transcribing an audio recording, you listen carefully to the segment you’re actively writing down.
- And if you ask me to describe the room I am presenting from, I will glance around at the objects I am describing.

Neural networks can achieve this same behavior using attention, focusing on a subset of the information they’re given. For example, an RNN can attend over the output of another RNN. At every time step, it focuses on different positions in the other RNN.

### Attention Model :-

An attention model is a method that takes ***n*** arguments *y*_{1}, …, *y*_{n} and a context C. It returns a vector Z which is the summary of the *y*_{i}, focusing on the information linked to context C. More formally, it returns a weighted arithmetic mean of the *y*_{i}, and the weights are chosen according to the relevance of each *y*_{i} given the context C.

Implementation of the attention model is shown below :

Notice that *m*_{i}* = tanh(W1 c + W2 y*_{i}*)*, meaning that both *y*_{i} and *c* are linearly combined.

Both of the last two figures above implement **“soft” attention**. **Hard attention**, by contrast, is implemented by randomly picking one of the inputs *y*_{i} with probability *s*_{i}. This is a rougher choice than the averaging of soft attention. Soft attention is preferred because it is differentiable and can therefore be trained with back-propagation.
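Here is a rough sketch of both variants in plain Python. The formula for *m*_{i} follows the text; the scoring vector `w_m` that turns each *m*_{i} into a scalar before the softmax, as well as all the toy numbers, are assumptions made for illustration.

```python
import math
import random

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(ys, c, W1, W2, w_m):
    """s = softmax over i of (w_m . m_i), with m_i = tanh(W1 c + W2 y_i)."""
    W1c = matvec(W1, c)
    scores = []
    for y in ys:
        m_i = [math.tanh(a + b) for a, b in zip(W1c, matvec(W2, y))]
        scores.append(sum(w * m for w, m in zip(w_m, m_i)))
    return softmax(scores)

def soft_attention(ys, s):
    """Z is the weighted arithmetic mean of the y_i."""
    dim = len(ys[0])
    return [sum(s_i * y[k] for s_i, y in zip(s, ys)) for k in range(dim)]

def hard_attention(ys, s, rng):
    """Pick a single y_i at random with probability s_i."""
    return rng.choices(ys, weights=s, k=1)[0]

ys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c = [0.5, -0.5]
W1 = [[0.2, 0.1], [0.0, 0.3]]
W2 = [[0.4, -0.1], [0.2, 0.2]]
w_m = [1.0, -1.0]  # assumed scoring vector, not given in the text

s = attention_weights(ys, c, W1, W2, w_m)
print("weights:", s)
print("soft Z:", soft_attention(ys, s))
print("hard Z:", hard_attention(ys, s, random.Random(0)))
```

The soft version is a smooth function of the weights, while the hard version involves a discrete random choice, which is why it cannot be trained with plain back-propagation.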

### Understanding the Attention Mechanism :-

This is the diagram of the Attention model shown in Bahdanau’s paper. The bidirectional RNN encoder used there generates a sequence of annotations (h1, h2, …, hTx) for each input sentence. Each of the vectors h1, h2, … used in their work is the concatenation of the forward and backward hidden states of the encoder.

The vectors h1, h2, h3, …, hTx are representations of the Tx words in the input sentence. In the simple encoder and decoder model, only the last state of the encoder LSTM was used (hTx in this case) as the context vector.

### Weight Calculation :-

Let’s discuss how the weights are calculated.

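The context vector *c*_{i} for the output word *y*_{i} is generated using the weighted sum of the annotations:

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$

The weights *α*_{ij} are computed by a softmax function given by the following equation:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad e_{ij} = a(s_{i-1}, h_j)$$

where *e*_{ij} is the output score of a feedforward neural network, described by the function *a*, that attempts to capture the alignment between the input at position *j* and the output at position *i*, and *s*_{i-1} is the previous hidden state of the decoder.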

Basically, if the encoder produces Tx “annotations” (the hidden state vectors), each of dimension d, then the input to the feedforward network has dimension (Tx, 2d), assuming the previous state of the decoder also has d dimensions and the two vectors are concatenated.

This input is multiplied by a matrix Wa of dimension (2d, 1), followed by the addition of a bias term, to get the scores *e*_{ij}, which have dimension (Tx, 1).

On top of these *e*_{ij} scores, a hyperbolic tangent (tanh) is applied, followed by a softmax, to get the normalized alignment scores for output *i*:

$$\alpha_{ij} = \frac{\exp(\tanh(e_{ij}))}{\sum_{k=1}^{T_x} \exp(\tanh(e_{ik}))}$$

So, *α* is a (Tx, 1) dimensional vector and its elements are the weights corresponding to each word in the input sentence.

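Suppose *α* is [0.2, 0.3, 0.3, 0.2] and the input sentence is “I am doing it”. Here, the context vector corresponding to it will be:

$$c = 0.2\,h_1 + 0.3\,h_2 + 0.3\,h_3 + 0.2\,h_4$$

where h1, …, h4 are the annotations of “I”, “am”, “doing” and “it”.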
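The weight calculation for a single decoder step can be sketched as follows. The annotation values, the previous decoder state `s_prev`, the weights `W_a` and the bias `b_a` are all toy numbers assumed for the example, with d = 2 so that the concatenated input has 2d = 4 entries.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def alignment_weights(annotations, s_prev, W_a, b_a):
    """alpha_j = softmax over j of tanh(W_a . [s_prev; h_j] + b_a)."""
    scores = []
    for h_j in annotations:
        concat = s_prev + h_j  # concatenation, length 2d
        e_j = sum(w * x for w, x in zip(W_a, concat)) + b_a
        scores.append(math.tanh(e_j))
    return softmax(scores)

def context_vector(alpha, annotations):
    """Weighted sum of the annotations."""
    d = len(annotations[0])
    return [sum(a * h[k] for a, h in zip(alpha, annotations)) for k in range(d)]

words = ["I", "am", "doing", "it"]
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # assumed, d = 2
s_prev = [0.1, -0.2]          # assumed previous decoder state
W_a = [0.3, -0.1, 0.5, 0.2]   # assumed (2d, 1) weight matrix, stored flat
b_a = 0.0

alpha = alignment_weights(annotations, s_prev, W_a, b_a)
print(dict(zip(words, alpha)))
print(context_vector(alpha, annotations))

# The worked example from the text, with the weights fixed by hand:
print(context_vector([0.2, 0.3, 0.3, 0.2], annotations))
```

A new set of weights, and therefore a new context vector, is computed for every output word, because `s_prev` changes at each decoder step.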

### Attention for translation :-

Translation is achieved by stacking multiple layers of attention modules, with an architecture such as this one:

The module takes three inputs: Q (query), K (key) and V (values). Q and K are multiplied together and scaled to compute a “**similarity metric**”. This metric produces a weight that modulates the values V.
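A minimal sketch of this module, following the scaled dot-product attention of the “Attention Is All You Need” paper listed in the references: the similarity is the dot product of a query with each key, scaled by the square root of the key dimension and normalized with a softmax. The toy Q, K and V values are assumptions for the example.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """For each query: softmax(q . k / sqrt(d_k)) gives the weights over the values."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # similarity metric turned into weights
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs

Q = [[1.0, 0.0]]                # one query
K = [[1.0, 0.0], [0.0, 1.0]]    # two keys
V = [[10.0, 0.0], [0.0, 10.0]]  # one value per key

print(scaled_dot_product_attention(Q, K, V))
```

Because the query is more similar to the first key, the output leans towards the first value.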

### Multi-head attention :-

Multiple attention heads are used in parallel, each focusing on a different part of the sequence. Here V, Q and K are projected with neural network layers into another space, so they can be scaled and mixed.
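The idea can be sketched in plain Python as below. This is a simplified self-attention version, assuming Q, K and V are all projections of the same input `X`; the projection matrices and the output matrix `W_o` are hand-picked toy values, not learned weights.

```python
import math

def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for every query row in Q."""
    d_k = len(K[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) projection matrices, one triple per head."""
    head_outputs = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
                    for W_q, W_k, W_v in heads]
    # Concatenate the heads position by position, then mix them with W_o.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_o)

# Toy setup (assumed): 3 positions, model size 4, two heads of size 2.
X = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
heads = [(first_half, first_half, first_half),
         (second_half, second_half, second_half)]
W_o = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

for row in multi_head_attention(X, heads, W_o):
    print(row)
```

Each head here looks at a different half of the input features, which is a crude stand-in for the way learned projections let heads specialize.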

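The question arises: why would we want to use an attention-based neural network instead of the RNN/LSTM we have been using so far? The reason is that an **attention-based network uses a lot less sequential computation**: it processes all positions of a sequence in parallel instead of stepping through them one at a time.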

**References :-**

- Sequence to Sequence Learning with Neural Networks (Seq-2-Seq paper, 2014): https://arxiv.org/pdf/1409.3215.pdf
- Attention and Memory in Deep Learning and NLP: http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/
- Attention and Augmented Recurrent Neural Networks: https://distill.pub/2016/augmented-rnns/
- Analytics Vidhya intro to RNN: https://www.analyticsvidhya.com/blog/2017/12/introduction-to-recurrent-neural-networks/
- Attention Is All You Need: https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
- Neural Machine Translation by Jointly Learning to Align and Translate: https://arxiv.org/pdf/1409.0473.pdf