Popular pretraining methods in the modern NLP world

Minh Nguyen
Jun 18, 2021

I. Introduction

Pre-training is now standard practice in modern NLP: a model is first trained on a massive amount of data, mostly in an unsupervised manner, to acquire general linguistic knowledge. The resulting model, with its parameters already initialized, can then be adapted to downstream tasks. The concept is loosely inspired by how humans learn: we rarely learn new things from scratch, but instead leverage prior knowledge to quickly make sense of the new. In this short article, I will give an overview of some well-known pretraining methods behind recent SOTA NLP deep neural network architectures.

II. Pretraining methods

1. Continuous Bag of Words (CBOW)

The first method, used for embedding words, is CBOW, introduced with the famous Word2Vec model. The model is pre-trained on a text corpus to capture co-occurrence statistics between words. The main idea in CBOW is to combine the distributed representations of the context (the surrounding words) to predict the word in the middle. The model converts a sentence into pairs of the form (context_words, target_word). For instance, with the sample sentence “This is a wonderful world” and a pre-selected window size of 2 (two words on each side), the pairs would look like ([this, a, wonderful], is), ([is, a, world], wonderful), ([a, wonderful], world), and so on. The processed input is fed into a network with a single hidden layer to learn the embeddings.

Image source: https://arxiv.org/pdf/1301.3781.pdf
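
To make the pair construction concrete, here is a minimal Python sketch of the (context_words, target_word) generation step. The sentence and window size are just the toy values from the example above, not part of any real library.

```python
# Build (context_words, target_word) training pairs for CBOW.
# window = number of words taken on each side of the target.
def cbow_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

tokens = "This is a wonderful world".lower().split()
for context, target in cbow_pairs(tokens, window=2):
    print(context, "->", target)
# e.g. ['this', 'a', 'wonderful'] -> is
#      ['a', 'wonderful'] -> world
```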

2. Skip-gram

The other method introduced with the Word2Vec model is skip-gram. In contrast to CBOW, it attempts to predict the context from the target word. To construct the dataset, the contexts and targets are simply inverted, and the model tries to predict each context word from its target word. In the example above, the task becomes predicting the context [a, wonderful] given the target word ‘world’, or [this, a, wonderful] given ‘is’, and so on.

Image source: https://arxiv.org/pdf/1301.3781.pdf
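
In practice both objectives are available off the shelf. The sketch below assumes gensim 4.x (where the embedding size parameter is called vector_size); the toy corpus is only for illustration.

```python
from gensim.models import Word2Vec

# Tiny corpus of pre-tokenized sentences (toy data, for illustration only).
sentences = [
    ["this", "is", "a", "wonderful", "world"],
    ["what", "a", "wonderful", "day"],
]

# sg=0 -> CBOW (predict target from context), sg=1 -> skip-gram (predict context from target).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(skipgram.wv["wonderful"][:5])          # the learned embedding vector
print(skipgram.wv.most_similar("wonderful")) # nearest neighbours in embedding space
```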

Even though both methods take context into account when learning embeddings, the mechanism has a serious limitation: it is static. Each word is associated with a single fixed embedding, so polysemous words are not always represented accurately. Also, summing the context vectors may not be the best way to capture the relationships between tokens in a sequence, since tokens at different positions are treated identically by both methods. The short snippet below makes the static limitation visible.
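
A quick way to see the problem (again a toy gensim 4.x sketch, not anything from the Word2Vec paper itself): the embedding is a plain table lookup, so both senses of a polysemous word come back as the same vector.

```python
from gensim.models import Word2Vec

# Two toy sentences that use "bank" in different senses.
sentences = [
    ["she", "deposited", "cash", "at", "the", "bank"],
    ["they", "had", "a", "picnic", "on", "the", "river", "bank"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# Word2Vec stores one vector per vocabulary entry, so the financial "bank"
# and the river "bank" map to exactly the same embedding; the surrounding
# sentence never enters the lookup.
print(model.wv["bank"])
```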

3. Language modeling (LM)

Later architectures use other training objectives to overcome these issues. One of the most successful is language modeling (LM). LM is versatile and allows learning both sentence and word representations with a variety of objective functions. A conventional LM encodes unidirectional context, reading the text sequence either in the forward or in the backward direction. This directionality is what lets the model factorize the probability of a sequence into a well-formed product of next-token probabilities.

Image source: https://towardsdatascience.com/what-is-xlnet-and-why-it-outperforms-bert-8d8fce710335
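
As a concrete illustration of the forward factorization p(x) = p(x1)·p(x2|x1)·…, the sketch below scores a sentence with a pretrained GPT-2 from the Hugging Face transformers library; any causal LM would do, and the model choice and sentence are only illustrative.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("This is a wonderful world", return_tensors="pt")
with torch.no_grad():
    # With labels == input_ids, the model returns the average next-token
    # cross-entropy over the left-to-right factorization of the sentence.
    out = model(**inputs, labels=inputs["input_ids"])

print("avg negative log-likelihood per token:", out.loss.item())
```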

However, language understanding is inherently bidirectional, and many downstream NLP tasks require context from both directions. That is where conventional LM pretraining falls short.

4. Masked Language Model (MLM)

Masked language modeling is a bidirectional pretraining method. First, the input is corrupted, and the model tries to reconstruct it; the embedding of each token is therefore learned by attending to all other tokens in the sequence. But how much of a sentence should we mask? If we mask too little, training becomes expensive, since only a few positions per pass contribute to the learning signal. If we mask too much, the model has too little context left to predict from. In BERT, the authors proposed masking 15% of the tokens in a sentence and having the model predict those masked words from the remaining ones. Trained with such an objective, the model can learn certain (but not all) statistical properties of word sequences.

Image source: https://www.researchgate.net/publication/337187647_Self-Supervised_Contextual_Data_Augmentation_for_Natural_Language_Processing
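
Here is a minimal sketch of the corruption step in plain Python. It is not BERT's exact recipe (which replaces only 80% of the selected tokens with [MASK], keeps 10% unchanged, and swaps 10% for random tokens), but it shows the idea.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly hide ~15% of tokens; the model must predict the originals."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            corrupted.append(mask_token)
            labels.append(tok)        # target the model must reconstruct
        else:
            corrupted.append(tok)
            labels.append(None)       # no prediction needed at this position
    return corrupted, labels

tokens = "this is a wonderful world".split()
print(mask_tokens(tokens))
# e.g. (['this', 'is', '[MASK]', 'wonderful', 'world'], [None, None, 'a', None, None])
```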

The main disadvantage of MLM over its next-word-prediction predecessor is reduced sample efficiency: BERT only makes predictions for 15% of the tokens, not for every token in the sequence. Additionally, the [MASK] tokens introduce a discrepancy between pre-training and fine-tuning, since downstream tasks do not mask their inputs.

5. Next Sentence Prediction (NSP)

The second pretraining method proposed in BERT is Next Sentence Prediction (NSP). It is a binary classification task: predict whether the second sentence actually follows the first one in the corpus. During BERT training, 50% of the inputs are genuine consecutive sentence pairs, while in the other 50% a random sentence from the corpus is chosen as the second sentence, on the assumption that a random sentence will be unrelated to the first. Besides helping the model understand the relationship between two sentences, this training scheme has an extra benefit: the model can be fine-tuned for tasks involving two sequences, such as question answering or natural language inference.

Image source: https://amitness.com/2020/02/albert-visual-summary/
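
A rough sketch of how such sentence pairs could be assembled; the toy corpus and helper name are illustrative, not BERT's actual preprocessing code (which samples the negative sentence from a different document).

```python
import random

def make_nsp_examples(sentences):
    """For each adjacent pair, keep the true next sentence half the time
    and swap in a random sentence from the corpus the other half."""
    examples = []
    for i in range(len(sentences) - 1):
        if random.random() < 0.5:
            examples.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            random_next = random.choice(sentences)
            examples.append((sentences[i], random_next, "NotNext"))
    return examples

corpus = [
    "the man went to the store",
    "he bought a gallon of milk",
    "penguins are flightless birds",
]
for first, second, label in make_nsp_examples(corpus):
    print(label, "|", first, "->", second)
```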

III. References

1. https://medium.com/@zxiao2015/understanding-language-using-xlnet-with-autoregressive-pre-training-9c86e5bea443

2. https://towardsdatascience.com/pre-trained-language-models-simplified-b8ec80c62217

3. https://medium.com/ai%C2%B3-theory-practice-business/what-is-pre-training-in-nlp-introducing-5-key-technologies-455c54933054

4. https://ftp.cs.wisc.edu/machine-learning/shavlik-group/torrey.handbook09.pdf

5. https://ruder.io/state-of-transfer-learning-in-nlp/

6. https://arxiv.org/abs/1810.04805

7. https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf

8. https://arxiv.org/abs/1301.3781

9. https://nlp.stanford.edu/seminar/details/jdevlin.pdf
