Learn about key techniques used for BERT optimisation
The appearance of the BERT model led to significant progress in NLP. Deriving its architecture from Transformer, BERT achieves state-of-the-art results on various downstream tasks: language modeling, next sentence prediction, question answering, NER tagging, etc.
Despite the excellent performance of BERT, researchers still continued experimenting with its configuration in hopes of achieving even better metrics. Fortunately, they succeeded with it and presented a new model called RoBERTa — Robustly Optimised BERT Approach.
Throughout this article, we will be referring to the official RoBERTa paper which contains in-depth information about the model. In simple words, RoBERTa consists of several independent improvements over the original BERT model — all of the other principles including the architecture stay the same. All of the advancements will be covered and explained in this article.
From the BERT’s architecture we remember that during pretraining BERT performs language modeling by trying to predict a certain percentage of masked tokens. The problem with the original implementation is the fact that chosen tokens for masking for a given text sequence across different batches are sometimes the same.
More precisely, the training dataset is duplicated 10 times, thus each sequence is masked only in 10 different ways. Keeping in mind that BERT runs 40 training epochs, each sequence with the same masking is passed to BERT four times. As researchers found, it is slightly better to use dynamic masking meaning that masking is generated uniquely every time a sequence is passed to BERT. Overall, this results in less duplicated data during the training giving an opportunity for a model to work with more various data and masking patterns.
The authors of the paper conducted research for finding an optimal way to model the next sentence prediction task. As a consequence, they found several valuable insights:
- Removing the next sentence prediction loss results in a slightly better performance.
- Passing single natural sentences into BERT input hurts the performance, compared to passing sequences consisting of several sentences. One of the most likely hypothesises explaining this phenomenon is the difficulty for a model to learn long-range dependencies only relying on single sentences.
- It more beneficial to construct input sequences by sampling contiguous sentences from a single document rather than from multiple documents. Normally, sequences are always constructed from contiguous full sentences of a single document so that the total length is at most 512 tokens. The problem arises when we reach the end of a document. In this aspect, researchers compared whether it was worth stopping sampling sentences for such sequences or additionally sampling the first several sentences of the next document (and adding a corresponding separator token between documents). The results showed that the first option is better.
Ultimately, for the final RoBERTa implementation, the authors chose to keep the first two aspects and omit the third one. Despite the observed improvement behind the third insight, researchers did not not proceed with it because otherwise, it would have made the comparison between previous implementations more problematic. It happens due to the fact that reaching the document boundary and stopping there means that an input sequence will contain less than 512 tokens. For having a similar number of tokens across all batches, the batch size in such cases needs to be augmented. This leads to variable batch size and more complex comparisons which researchers wanted to avoid.
Recent advancements in NLP showed that increase of the batch size with the appropriate decrease of the learning rate and the number of training steps usually tends to improve the model’s performance.
As a reminder, the BERT base model was trained on a batch size of 256 sequences for a million steps. The authors tried training BERT on batch sizes of 2K and 8K and the latter value was chosen for training RoBERTa. The corresponding number of training steps and the learning rate value became respectively 31K and 1e-3.
It is also important to keep in mind that batch size increase results in easier parallelization through a special technique called “gradient accumulation”.
In NLP there exist three main types of text tokenization:
- Character-level tokenization
- Subword-level tokenization
- Word-level tokenization
The original BERT uses a subword-level tokenization with the vocabulary size of 30K which is learned after input preprocessing and using several heuristics. RoBERTa uses bytes instead of unicode characters as the base for subwords and expands the vocabulary size up to 50K without any preprocessing or input tokenization. This results in 15M and 20M additional parameters for BERT base and BERT large models respectively. The introduced encoding version in RoBERTa demonstrates slightly worse results than before.
Nevertheless, in the vocabulary size growth in RoBERTa allows to encode almost any word or subword without using the unknown token, compared to BERT. This gives a considerable advantage to RoBERTa as the model can now more fully understand complex texts containing rare words.
Apart from it, RoBERTa applies all four described aspects above with the same architecture parameters as BERT large. The total number of parameters of RoBERTa is 355M.
RoBERTa is pretrained on a combination of five massive datasets resulting in a total of 160 GB of text data. In comparison, BERT large is pretrained only on 13 GB of data. Finally, the authors increase the number of training steps from 100K to 500K.
As a result, RoBERTa outperforms BERT large on XLNet large on the most popular benchmarks.
Analogously to BERT, the researchers developed two versions of RoBERTa. Most of the hyperparameters in the base and large versions are the same. The figure below demonstrates the principle differences:
The fine-tuning process in RoBERTa is similar to the BERT’s.
In this article, we have examined an improved version of BERT which modifies the original training procedure by introducing the following aspects:
- dynamic masking
- omitting the next sentence prediction objective
- training on longer sentences
- increasing vocabulary size
- training for longer with larger batches over data
The resulting RoBERTa model appears to be superior to its ancestors on top benchmarks. Despite a more complex configuration, RoBERTa adds only 15M additional parameters maintaining comparable inference speed with BERT.
All images unless otherwise noted are by the author