Natural Language Processing with BERT
- June 2, 2020
- Posted by: vsinghal
- Category: Natural Language Processing
#CellStratAILab #disrupt4.0 #WeCreateAISuperstars #WhereLearningNeverStops
Last Saturday, AI Researcher Indrajit Singh presented a marvellous workshop on “Bidirectional Encoder Representations from Transformers” – “BERT” for short.
We are in the era of Pre-trained models in AI.
The figure below shows tons of pre-trained models which have taken root.

BERT is one of the core models in this scheme.

The intuition behind pre-trained language models is to create a black box which understands the language and can then be asked to do any specific task in that language. The idea is to create the machine equivalent of a ‘well-read’ human being.
Three research papers are at the core of this new trend in NLP:
ULMFiT – Universal Language Model Fine-Tuning: The first effective approach to fine-tuning a language model. The authors demonstrate the importance of several novel techniques, including discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing, for retaining previous knowledge and avoiding catastrophic forgetting during fine-tuning.
ELMo- Embeddings from Language Models: Takes the entire context into consideration. In particular, they are created as a weighted sum of the internal states of a deep bi-directional language model (biLM), pre-trained on a large text corpus. Furthermore, ELMo representations are based on characters so that the network can understand even out-of-vocabulary tokens unseen in training.
BERT: Bidirectional Encoder Representations from Transformers, is a new cutting-edge model that considers the context from both the left and the right sides of each word. The two key success factors are (1) masking part of input tokens to avoid cycles where words indirectly “see themselves”, and (2) pre-training a sentence relationship model. Finally, BERT is also a very big model trained on a huge word corpus.
What is the significance of using a Pre-trained model ?
- Instead of training the model from scratch, you can use another pre-trained model as the basis and only fine-tune it to solve the specific NLP task.
- Using pre-trained models allows you to achieve the same or even better performance much faster and with much less labeled data.
ELMo :-
ELMo was the NLP community’s response to the problem of polysemy – the same word having different meanings depending on its context. From training shallow feed-forward networks (Word2vec), we graduated to training word embeddings using layers of complex bi-directional LSTM architectures. This means that the same word can have multiple ELMo embeddings based on the context it appears in.
“ That’s when we started seeing the advantage of pre-training as a training mechanism for NLP ”

ULMFiT Approach :-
ULMFiT took this a step further. This framework could train language models that could be fine-tuned to provide excellent results, even with very little data (fewer than 100 labeled examples), on a variety of document classification tasks. It is safe to say that ULMFiT cracked the code to transfer learning in NLP.
“This is when we established the golden formula for transfer learning in NLP”
Transfer Learning in NLP = Pre-Training and Fine-Tuning
Most of the NLP breakthroughs that followed ULMFIT tweaked components of the above equation and gained state-of-the-art benchmarks.
OpenAI’s GPT :-
OpenAI’s GPT extended the methods of pre-training and fine-tuning that were introduced by ULMFiT and ELMo. GPT essentially replaced the LSTM-based architecture for Language Modeling with a Transformer-based architecture.
The GPT model could be fine-tuned to multiple NLP tasks beyond document classification, such as common sense reasoning, semantic similarity, and reading comprehension.
GPT also emphasized the importance of the Transformer framework, which has a simpler architecture and can train faster than an LSTM-based model. It is also able to learn complex patterns in the data by using the Attention mechanism.
OpenAI’s GPT validated the robustness and usefulness of the Transformer architecture by achieving multiple State-of-the-Arts.
How to use Pre-trained language model :-
- The fastai library provides the modules necessary to train and use ULMFiT models. Moreover, a model pre-trained on WikiText-103 is also available.
- Allen Institute for Artificial Intelligence provides pre-trained ELMo models in English and Portuguese. You can also retrain models using TensorFlow code.
- You are also free to use the pre-trained BERT models released by the Google Research team and the Hugging Face libraries (a short loading sketch follows this list).
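As a quick illustration of the last point, here is a minimal sketch of loading a pre-trained BERT with the Hugging Face Transformers library and extracting contextual features. The model name bert-base-uncased and the example sentence are illustrative choices, and attribute-style outputs assume a reasonably recent Transformers release.

```python
# Minimal sketch: load a pre-trained BERT and extract contextual token features.
# Assumes the Hugging Face Transformers library and PyTorch are installed.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("BERT produces contextual embeddings.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One hidden vector per WordPiece token: shape (batch, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```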
BERT :-
BERT builds upon recent work in pre-training contextual representations – including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFiT. However, unlike these previous models, BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus (in this case, Wikipedia).
This is a momentous development since it enables anyone building a machine learning model involving language processing to use this powerhouse as a readily-available component – saving the time, energy, knowledge, and resources that would have gone to training a language-processing model from scratch.
BERT is different due to the following reasons :-
- Bidirectional: BERT is naturally bidirectional
- Generalizable: A pre-trained BERT model can easily be fine-tuned for downstream NLP tasks
- High-Performance: Fine-tuned BERT models beat state-of-the-art results for many NLP tasks
- Universal: Trained on Wikipedia + BookCorpus. No special dataset needed
The figure below shows the two steps by which a BERT model is developed :-

BERT Model Architecture :-
There are two models introduced in the BERT paper.
BERT Base – 12 layers (transformer blocks), 12 attention heads, and 110 million parameters.
BERT Large – 24 layers, 16 attention heads, and 340 million parameters.

Let’s recap the Transformer Architecture :-
BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library.
Because the use of Transformers has become common and the implementation is almost identical to the original, we will omit an exhaustive background description of the model architecture here.

How does BERT work :-
BERT weights are learned in advance through two unsupervised tasks: masked language modelling (predicting a missing word given the left and right context) and next sentence prediction (predicting whether one sentence follows another).
BERT makes use of the Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text.
Multi-headed attention is used in BERT: it stacks multiple layers of attention, and each layer incorporates multiple attention “heads” (12 or 16). Since model weights are not shared between layers, a single BERT-base model effectively contains 12 × 12 = 144 distinct attention mechanisms. A sketch of one such multi-headed self-attention layer follows.
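As a rough illustration (not BERT's actual implementation), the sketch below runs one multi-headed self-attention layer with BERT-base dimensions using PyTorch's built-in nn.MultiheadAttention; the random input stands in for token representations from a previous layer.

```python
# Illustrative only: one multi-headed self-attention layer with BERT-base sizes.
import torch
import torch.nn as nn

hidden_size, num_heads, seq_len = 768, 12, 16
attention = nn.MultiheadAttention(embed_dim=hidden_size, num_heads=num_heads)

# PyTorch's default layout is (sequence_length, batch, hidden_size)
x = torch.randn(seq_len, 1, hidden_size)   # stand-in for the previous layer's output
out, weights = attention(x, x, x)          # self-attention: queries = keys = values

print(out.shape)      # (16, 1, 768) -- same size as the input representation
print(weights.shape)  # (1, 16, 16)  -- attention weights, averaged over the 12 heads
```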

1. Position Embeddings: BERT learns and uses positional embeddings to express the position of words in a sentence. These are added to overcome a limitation of the Transformer which, unlike an RNN, cannot otherwise capture “sequence” or “order” information.
2. Segment Embeddings: BERT can also take sentence pairs as inputs for tasks such as Question Answering. That is why it learns a unique embedding for the first and the second sentence to help the model distinguish between them. In the above example, all the tokens marked as EA belong to sentence A (and similarly for EB).
3. Token Embeddings: These are the embeddings learned for each specific token from the WordPiece token vocabulary. (A toy sketch of summing these three embeddings follows this list.)
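The toy sketch below shows how the three embeddings combine: each token's input representation is simply the sum of its token, segment, and position embeddings. The embedding tables are randomly initialised here and the token ids are made up for illustration; only the sizes correspond to BERT-base.

```python
# Toy sketch: input representation = token + segment + position embeddings.
import torch
import torch.nn as nn

vocab_size, hidden_size, max_len = 30522, 768, 512   # BERT-base sizes

token_emb = nn.Embedding(vocab_size, hidden_size)
segment_emb = nn.Embedding(2, hidden_size)            # sentence A / sentence B
position_emb = nn.Embedding(max_len, hidden_size)     # learned position embeddings

input_ids = torch.tensor([[101, 7592, 2088, 102]])    # made-up ids for "[CLS] ... [SEP]"
segment_ids = torch.zeros_like(input_ids)             # every token belongs to sentence A
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)

inputs_embeds = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(position_ids)
print(inputs_embeds.shape)                            # (1, 4, 768)
```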
Flow of information of a word in BERT: a word starts with its embedding representation from the embedding layer. Every layer performs a multi-headed attention computation on the word representations of the previous layer to create a new intermediate representation. All these intermediate representations are of the same size. In the figure above, E1 is the embedding representation, T1 is the final output, and Trm are the intermediate representations of the same token. In a 12-layer BERT model, a token will have 12 intermediate representations.

BERT Tokenization Strategy :-
BERT uses WordPiece tokenization. The vocabulary is initialized with all the individual characters in the language, and the most frequent/likely combinations of symbols already in the vocabulary are then iteratively added.
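The effect is easy to see with the pre-trained bert-base-uncased vocabulary from the Hugging Face Transformers library; a small sketch follows, with the exact sub-word split depending on the vocabulary.

```python
# Sketch: WordPiece splits rare words into known sub-word pieces prefixed with "##".
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("BERT uses WordPiece embeddings"))
# e.g. ['bert', 'uses', 'word', '##piece', 'em', '##bed', '##ding', '##s']
```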

How is the input text represented before Feeding to the BERT?
The input representation used by BERT is able to represent a single text sentence as well as a pair of sentences (e.g., [Question, Answer]) in a single sequence of tokens.
The first token of every input sequence is the special classification token – [CLS]. This token is used in classification tasks as an aggregate of the entire sequence representation. It is ignored in non-classification tasks.
For single text sentence tasks, this [CLS] token is followed by the WordPiece tokens and the separator token – [SEP].

For sentence pair tasks, the WordPiece tokens of the two sentences are separated by another [SEP] token. This input sequence also ends with the [SEP] token.

- A sentence embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar to token/word embeddings with a vocabulary of 2.
- A positional embedding is also added to each token to indicate its position in the sequence. (A short encoding sketch follows this list.)
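The sketch below encodes a hypothetical sentence pair with the Hugging Face tokenizer; the token_type_ids it returns play the role of the sentence A / sentence B embeddings described above.

```python
# Sketch: encoding a sentence pair in BERT's input format.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer.encode_plus("Who wrote Hamlet?", "Shakespeare wrote Hamlet.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# e.g. ['[CLS]', 'who', 'wrote', 'hamlet', '?', '[SEP]', 'shakespeare', 'wrote', 'hamlet', '.', '[SEP]']
print(encoded["token_type_ids"])
# 0 for every sentence-A token (including [CLS] and the first [SEP]), 1 for sentence B
```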
Masked Language Modelling :-
Language Modeling is the task of predicting the next word given a sequence of words. In masked language modeling instead of predicting every next token, a percentage of input tokens is masked at random and only those masked tokens are predicted.
The masked words are not always replaced with the mask token – [MASK] – because the [MASK] token never appears during fine-tuning, which would create a mismatch between pre-training and fine-tuning. Therefore, 15% of the tokens are chosen at random and handled as follows (a simplified sketch of the procedure appears after this list):
- 80% of the time tokens are actually replaced with the token [MASK].
- 10% of the time tokens are replaced with a random token.
- 10% of the time tokens are left unchanged.
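A simplified sketch of this corruption procedure is shown below; it works on plain lists of token ids, ignores special tokens, and uses -100 as a conventional "not predicted" label. The mask id 103 matches bert-base-uncased, but both constants are only illustrative.

```python
# Simplified sketch of the 15% / 80-10-10 masked-LM corruption.
import random

MASK_ID, VOCAB_SIZE = 103, 30522          # [MASK] id and vocabulary size (illustrative)

def mask_tokens(token_ids, mask_prob=0.15):
    corrupted, labels = list(token_ids), [-100] * len(token_ids)   # -100 = not predicted
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                # the model must predict the original token here
            r = random.random()
            if r < 0.8:                    # 80%: replace with [MASK]
                corrupted[i] = MASK_ID
            elif r < 0.9:                  # 10%: replace with a random token
                corrupted[i] = random.randrange(VOCAB_SIZE)
            # remaining 10%: leave the token unchanged
    return corrupted, labels

print(mask_tokens([7592, 2088, 2003, 3376, 102]))   # made-up token ids
```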
Why is masking used in BERT :-
Bi-directional models are more powerful than uni-directional language models. But in a multi-layered model, naive bi-directional conditioning does not work, because the lower layers leak information and allow a token to indirectly “see itself” in later layers.
Traditionally, language models were trained to predict the next word from a left-to-right context only (as in GPT) or from a right-to-left context only. This made our models susceptible to errors due to loss of information.
ELMo tried to deal with this problem by training two LSTM language models on left-to-right and right-to-left contexts and shallowly concatenating them. Even though it greatly improved upon existing techniques, it wasn’t enough.

From the above image it is clear that BERT is bi-directional, GPT is unidirectional (information flows only from left to right), and ELMo is shallowly bidirectional.
Next Sentence Prediction (NSP) :-
The next sentence prediction task is a binary classification task: given a pair of sentences, the model predicts whether the second sentence is the actual next sentence of the first.
This task can easily be generated from any monolingual corpus. It is helpful because many downstream tasks, such as Question Answering and Natural Language Inference, require an understanding of the relationship between two sentences.
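A toy sketch of generating such examples from a monolingual corpus follows: half the time the real next sentence is used (label 1, IsNext), otherwise a randomly chosen sentence (label 0, NotNext). A real implementation would also avoid accidentally sampling the true next sentence as the negative.

```python
# Toy sketch: building next-sentence-prediction training pairs from raw sentences.
import random

def make_nsp_examples(sentences):
    examples = []
    for i in range(len(sentences) - 1):
        if random.random() < 0.5:
            examples.append((sentences[i], sentences[i + 1], 1))          # IsNext
        else:
            examples.append((sentences[i], random.choice(sentences), 0))  # NotNext
    return examples

corpus = [
    "The man went to the store.",
    "He bought a gallon of milk.",
    "Penguins are flightless birds.",
]
for first, second, label in make_nsp_examples(corpus):
    print(label, "|", first, "->", second)
```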

Implementing BERT Model :-
The configuration of the model lives in the transformers.configuration_bert module of the Hugging Face Transformers library, whose BertConfig class holds the hyperparameters (vocabulary size, hidden size, number of layers and attention heads, and so on) used to instantiate a BERT model.
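A hedged sketch of using this module is shown below: BertConfig stores the architecture hyperparameters, and a BertModel can either be built from it with random weights or loaded with the published pre-trained weights.

```python
# Sketch: BertConfig stores the hyperparameters that define a BERT architecture.
from transformers import BertConfig, BertModel

config = BertConfig(                 # these values correspond to BERT-base
    vocab_size=30522,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
)
model = BertModel(config)            # randomly initialised model with this architecture
print(config)

# To start from the published weights instead of a random initialisation:
# model = BertModel.from_pretrained("bert-base-uncased")
```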

Task-specific models :-
The BERT paper shows a number of ways to use BERT for different tasks.

BERT: Pretraining and Fine Tuning
There are two existing strategies for applying pre-trained language representations to downstream tasks:
- feature-based
- fine-tuning
The feature-based approach, such as ELMo (Peters et al., 2018a), uses task-specific architectures that include the pre-trained representations as additional features.
The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pretrained parameters.
Note: the two approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.

Model Training :-
Pre-training phase :-
In the pre-training phase, sentences are retrieved from BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words).
- Masked LM: sequences of 512 tokens (two concatenated spans of sentences) are used, with 256 sequences per batch. The model is trained for approximately 40 epochs.
The configuration is as follows (an optimizer sketch in code appears after this list):
- Adam with a learning rate of 1e-4, β1 = 0.9, β2 = 0.999
- L2 weight decay of 0.01
- Dropout of 0.1 on all layers
- GELU activation
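A sketch of an optimizer roughly matching this configuration, using PyTorch's AdamW (Adam with decoupled weight decay), is shown below; the original BERT code uses its own TensorFlow optimizer with learning-rate warmup, so this is only an approximation. The default BertConfig already uses 0.1 dropout and GELU activations.

```python
# Sketch: optimizer roughly matching BERT's pre-training configuration.
import torch
from transformers import BertConfig, BertModel

model = BertModel(BertConfig())      # randomly initialised BERT-base (0.1 dropout, GELU by default)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,                         # peak learning rate
    betas=(0.9, 0.999),
    weight_decay=0.01,               # L2-style weight decay
)
# The original implementation also warms the learning rate up over the first steps
# and decays it linearly afterwards; that schedule is omitted here.
```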
Fine-tuning phase :-
Only a few model hyperparameters are changed, such as the batch size, learning rate, and number of training epochs; most model hyperparameters are kept the same as in the pre-training phase. During the experiments, the following ranges of values worked well across tasks:
- Batch Size: 16, 32
- Learning Rate: 5e-5, 3e-5, 2e-5
- Number of epochs: 3, 4
- The fine-tuning procedure itself differs depending on the downstream task.
Classification :-
The final hidden state corresponding to the [CLS] token is fed into a classification layer, and label probabilities C are computed with a softmax. The model is then fine-tuned to maximize the log-probability of the correct label.
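A minimal fine-tuning-style sketch with BertForSequenceClassification follows; this class adds exactly such a classification layer on top of the [CLS] representation. The example sentence, the two labels, and the single update are invented for illustration, and attribute-style outputs assume a recent Transformers release.

```python
# Sketch: sentence classification on top of the [CLS] representation.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("This workshop was excellent!", return_tensors="pt")
labels = torch.tensor([1])                 # e.g. 1 = positive sentiment (illustrative)

outputs = model(**inputs, labels=labels)
print(outputs.logits)                      # unnormalised scores for the two labels
print(outputs.loss)                        # negative log-probability of the correct label
# outputs.loss.backward() followed by an optimizer step would be one fine-tuning update.
```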

Named Entity Recognition :-
The final hidden representation of each token is fed into the classification layer. Surrounding words influence the prediction only through BERT’s own attention layers: the output layer classifies each token independently, and no Conditional Random Field (CRF) is used on top.
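A corresponding sketch with BertForTokenClassification is shown below: each token's final hidden state goes through an independent classification layer, with no CRF on top. The nine labels are meant to suggest CoNLL-2003-style BIO tags, and attribute-style outputs again assume a recent Transformers release.

```python
# Sketch: token-level (NER-style) classification without a CRF.
import torch
from transformers import BertTokenizer, BertForTokenClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=9)

inputs = tokenizer("Indrajit Singh works at CellStrat", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, number_of_wordpiece_tokens, 9)

print(logits.argmax(dim=-1))               # one predicted tag id per WordPiece token
```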

Experiments and Results :-
BERT outperformed the state-of-the-art across the following tasks:
- Language understanding
- Natural language inference
- Paraphrase detection
- Sentiment analysis
- Linguistic acceptability analysis
- Semantic similarity analysis
- Textual entailment
BERT not only outperformed traditional word-embedding based approaches but also outperformed new methods such as ELMo.


BERT Model Evaluation Method :-
We mainly use two types of evaluation metrics, Exact Match and F1 score. Exact Match (EM) is a binary measure (i.e. true/false) of whether the system output matches the ground-truth answer exactly. In our evaluation, EM stands for the percentage of outputs that match the ground truth exactly. F1 is the harmonic mean of precision and recall; more specifically:

F1 = 2 × (precision × recall) / (precision + recall)
For questions that do have answers, we take the maximum F1 and EM scores across the three human-provided answers for that question. And for those without answers, both the F1 and EM score are 1 if the model predicts no-answer, and 0 otherwise.
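A simplified sketch of these two metrics is shown below (SQuAD-style scoring usually also strips punctuation and articles, which is omitted here); the prediction and reference answers are made up.

```python
# Simplified sketch: Exact Match and token-level F1, taking the best score
# across the human-provided reference answers.
def exact_match(prediction, truth):
    return int(prediction.strip().lower() == truth.strip().lower())

def f1_score(prediction, truth):
    pred_tokens, true_tokens = prediction.lower().split(), truth.lower().split()
    common = sum(min(pred_tokens.count(t), true_tokens.count(t)) for t in set(true_tokens))
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

prediction = "the transformer encoder"
references = ["a Transformer encoder", "the Transformer encoder", "an encoder"]
print(max(exact_match(prediction, r) for r in references))  # best EM over the references
print(max(f1_score(prediction, r) for r in references))     # best F1 over the references
```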
CellStrat Training Course on “Natural Language Processing with Deep Learning” :-
Learn advanced NLP with CellStrat’s hands-on course on “Natural Language Processing with Deep Learning”.
Details and Enrollment : https://bit.ly/CSNLPC
Questions? Please contact us at +91-9742800566!
References :-
Semi-supervised Sequence Learning. Andrew M. Dai, Quoc V. Le. NIPS 2015. [pdf]
context2vec: Learning Generic Context Embedding with Bidirectional LSTM. Oren Melamud, Jacob Goldberger, Ido Dagan. CoNLL 2016. [pdf] [project] (context2vec)
Unsupervised Pretraining for Sequence to Sequence Learning. Prajit Ramachandran, Peter J. Liu, Quoc V. Le. EMNLP 2017. [pdf] (Pre-trained seq2seq)
Deep contextualized word representations. Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee and Luke Zettlemoyer. NAACL 2018. [pdf] [project] (ELMo)
Universal Language Model Fine-tuning for Text Classification. Jeremy Howard and Sebastian Ruder. ACL 2018. [pdf] [project] (ULMFiT)
Improving Language Understanding by Generative Pre-Training. Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. Preprint. [pdf] [project] (GPT)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. NAACL 2019. [pdf] [code & model]
Language Models are Unsupervised Multitask Learners. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. Preprint. [pdf] [code] (GPT-2)
Unified Language Model Pre-training for Natural Language Understanding and Generation. Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon. Preprint. [pdf] (UniLM)
XLNet: Generalized Autoregressive Pretraining for Language Understanding. Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le. NeurIPS 2019. [pdf] [code & model]
RoBERTa: A Robustly Optimized BERT Pretraining Approach. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. Preprint. [pdf] [code & model]
SpanBERT: Improving Pre-training by Representing and Predicting Spans. Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, Omer Levy. Preprint. [pdf] [code & model]
Knowledge Enhanced Contextual Word Representations. Matthew E. Peters, Mark Neumann, Robert L. Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, Noah A. Smith. EMNLP 2019. [pdf] (KnowBert)
VisualBERT: A Simple and Performant Baseline for Vision and Language. Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang. Preprint. [pdf] [code & model]
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee. NeurIPS 2019. [pdf] [code & model]
VideoBERT: A Joint Model for Video and Language Representation Learning. Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, Cordelia Schmid. ICCV 2019. [pdf]
ERNIE: Enhanced Language Representation with Informative Entities. Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun and Qun Liu. ACL 2019. [pdf] [code & model] (ERNIE (Tsinghua) )
ERNIE: Enhanced Representation through Knowledge Integration. Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian and Hua Wu. Preprint. [pdf] [code] (ERNIE (Baidu) )
Defending Against Neural Fake News. Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, Yejin Choi. NeurIPS 2019. [pdf] [project] (Grover)
Cross-lingual Language Model Pretraining. Guillaume Lample, Alexis Conneau. NeurIPS 2019. [pdf] [code & model] (XLM)
Multi-Task Deep Neural Networks for Natural Language Understanding. Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao. ACL 2019. [pdf] [code & model] (MT-DNN)
MASS: Masked Sequence to Sequence Pre-training for Language Generation. Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu. ICML 2019. [pdf] [code & model]