
EÖTVÖS LORÁND UNIVERSITY

FACULTY OF SCIENCE

TRANSFORMERS AND THEIR APPLICATIONS IN NATURAL LANGUAGE PROCESSING

MASTER’S THESIS

Nurzhan Sarzhan
MSc Mathematics

Supervisor: András Lukács
Department of Computer Science

Budapest, 2021


Contents

1 Introduction

2 Prerequisites

3 Literature Review

4 Historical Background
4.1 Neural Networks
4.2 Transformers

5 Theory Behind Transformers
5.1 NLP Pipeline
5.2 Tokenization
5.3 Word Embeddings
5.4 Sequence-To-Sequence Modelling
5.5 The Transformer
5.5.1 Encoder Block
5.5.2 Decoder Block
5.5.3 Summary

6 NLP Tasks Overview

7 Transformers Overview
7.1 BERT
7.1.1 Masked Language Modelling
7.1.2 Next Sentence Prediction
7.2 XLNet
7.2.1 Permutation
7.2.2 Attention mask
7.2.3 Two-Stream Self-Attention
7.3 T5
7.3.1 A Huge Dataset (C4)
7.3.2 A Systematic Study Of Transfer Learning Methodology
7.4 GPT-2
7.4.1 GPT-2 Architecture
7.5 Distilled versions of BERT and GPT-2

8 Evaluation Metrics
8.1 Classification Metrics
8.1.1 Accuracy
8.1.2 Precision, Recall, F-measure
8.2 BLEU score
8.3 Perplexity

9 Experiments
9.1 Sentiment Analysis (Sequence Classification)
9.2 Translation
9.3 Question Answering
9.4 Language Modelling (Text Generation)

10 Conclusion


Abstract

Natural Language Processing (NLP) is nowadays one of the fastest-developing fields in Deep Learning. A new neural network architecture called the Transformer made a revolution in NLP and broadened horizons for future discoveries. Researchers around the world and big tech companies such as Google and Facebook are working on developing new architectures (based on the Transformer's core feature, the attention mechanism) in order to improve its performance, achieve state-of-the-art results and beat various language modelling benchmarks (GLUE, SQuAD, etc.) for the further development of their products (artificial-intelligence-powered virtual assistants, chatbots, analysis of customer behavior, text translation, and so on). Thereby, dozens of outstanding Transformer-based models have been proposed during the last few years. They differ in many ways: some of them are far too large to fit on an ordinary computer and require huge computational resources to train, while others are tiny enough to work on weak devices such as smartphones. Some models are designed for specific NLP tasks (like sequence classification or translation), while others are built to be applicable to any known NLP challenge. Nonetheless, all of them have one mechanism in common, self-attention, which unites them into one neural network family called transformers.

The goal of my thesis was to investigate the most influential transformer architectures, such as BERT, GPT-2, DistilBERT, DistilGPT-2, XLNet and T5, and to apply them to real-world NLP tasks. I studied the theoretical part of each network by reading the related research papers together with explanatory articles, in order to understand how they work under the hood and to reveal their advantages and drawbacks compared to the others. Further, I fine-tuned these architectures on well-known, publicly available datasets for essential NLP tasks (sequence classification, translation, question answering and text generation) with various training parameters and then compared them with each other to figure out their pros and cons.

As a result, I have learned that it is a really difficult task to achieve state-of-the-art results with limited computational resources and limited time for tuning models. However, it is still possible to get satisfactory accuracy with a smart choice of hyperparameters and by fine-tuning pre-trained models with the available resources.


1 Introduction

The Transformer is a Deep Learning model introduced in 2017 that utilizes the mechanism of self-attention. In short, this mechanism weighs the influence of different elements of the input sequence of data. Self-attention is used primarily in the field of NLP nowadays, although recent research shows that the mechanism is applicable to computer vision as well.

Similar to Recurrent Neural Networks (RNNs), transformers are designed to handle sequential input data (natural language, time series, etc.) to solve NLP tasks such as question answering, text generation and summarization. However, unlike RNNs, transformers do not require that the sequential data be processed in order. Rather, the attention mechanism identifies context for any position given the whole input sequence, so a Transformer does not need to process the beginning of the input before the end. Due to this feature, the Transformer's architecture is built to run in parallel, rather than sequentially as RNNs do. This distinction results in reduced training time.

Within a few years transformers became the model of choice for most NLP problems, replacing the previously most popular RNN models such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU). Since the Transformer model is easy to parallelize during the training process, it has enabled researchers to train models on huge datasets. This led to the development of pre-trained models like BERT (Bidirectional Encoder Representations from Transformers) and T5 (Text-To-Text Transfer Transformer), which were pre-trained on colossal general language datasets and are available for further fine-tuning to specific NLP tasks. These features motivated me to choose the most influential transformers, to investigate their technical features, to fine-tune them for a few essential NLP tasks on publicly available datasets and then to analyze the findings.

When I started working, the first part of the research was to choose the list of tasks on which I preferred to work. Although there are many interesting NLP challenges, I was curious about choosing ones that differ from each other as much as possible, both from the theoretical and the practical side. Therefore, I decided to explore transformers on the following NLP tasks:

• Sequence classification (sentiment analysis). I added this task to my list because it is one of the most popular tasks in NLP. Here we need to decide which category a text belongs to (positive or negative sentiment, topics of articles like sports, culture or news, or whether a quote belongs to a certain author or not).

• Language modelling (text generation). In my opinion it is the most interesting task. Here the model tries to predict a token based on the previously known sequence of tokens. Using this technique in an autoregressive way makes the magic happen, because the model can generate long texts (sentences, dialogues and even poems) until some stopping condition is satisfied.

• Translation. I added this NLP task to my research plan because it requires the presence of both encoder and decoder blocks in the architecture. Also, only in translation do we need to work with two languages at the same time (source and target), while other tasks use only one language and most of them use either the encoder or the decoder part of the architecture.

• Question Answering. This is probably the hardest task for transformers from the technical side, because it requires additional pre-processing steps in the pipeline before training. Question answering has some similarities with sequence classification. The difference is that instead of a target class, we need to predict two values: the index of the token where the answer to the question begins, and the length of the answer.


The next step was to choose architectures that could be trained on the chosen NLP challenges. As mentioned previously, there are dozens of amazing transformers which were proposed during the last few years. Although every network is built in a unique way and each of them differs from the others, we can divide all transformers into three major categories: autoregressive, autoencoding and sequence-to-sequence models. As described below, the first Transformer architecture consists of two essential blocks, the encoder and the decoder. The first block encodes a sequence of data into a vector representation, while the decoder iteratively outputs another sequence of data given the input from the encoder together with additional input. Transformers that are built on the full encoder-decoder construction are sequence-to-sequence models. Models which have only the decoder part of the initial Transformer are autoregressive models, and ones which have only the encoder under the hood are autoencoders.

So, my goal was to choose models from each category in order to learn about each family of transformers. Moreover, I wanted to explore the most influential ones, which have pushed the development of the field of NLP. Here is the list of transformer architectures which I have chosen for my research:

• BERT from the autoencoders. A simple and at the same time powerful model developed by the Google team, which expanded the boundaries of transformer research further after the original Transformer did. BERT is considered a good baseline for NLP tasks like sequence classification, summarization and masked language modelling (MLM).

• GPT-2 belongs to the autoregressive models. It became the biggest network when it was first introduced. Due to its autoregressive nature (it relies entirely on the decoder), the model performs amazingly on language modelling tasks. Nonetheless, it can be configured for other tasks as well.

• DistilBERT belongs to the autoencoding models as well. It is basically a smaller version of pre-trained BERT. The reason why I added this architecture to the list of models is that it is trained with an incredible technique called distillation, which makes it possible to substantially reduce the size of the base (teacher) model with an insignificant loss of accuracy in the distilled (student) model.

• DistilGPT-2 is obtained by the same distillation method (like DistilBERT). The model weighs 37% less and is twice as fast as its OpenAI teacher predecessor, while keeping the same generative power. DistilGPT-2 runs smoothly on small devices (like smartphones) and can generate coherent sequences of text.

• T5 from the sequence-to-sequence models. T5 uses the traditional Transformer architecture with a few changes. It is able to operate on all NLP tasks by simply transforming them into text-to-text problems. Nonetheless, the most natural task for T5 is text translation. Since it is the only model I chose for the translation task, I used two versions of it in my experiments, T5-small and T5-base, which differ from each other only in their sizes. Moreover, in order to diversify the experiments on the translation task, I used T5 for two language pairs (English-German and English-Russian).

• XLNet was also added to the list of models used in the research because it is not a traditional autoencoding model. The key feature of this transformer is that it permutes the tokens in the sentence in order to predict the (n+1)-th token given the n previous tokens. Furthermore, XLNet uses a recurrence mechanism (like Transformer-XL) to be able to work with long sequences, which in some sense makes the model similar to RNNs.


Figure 1: Research Plan

Each NLP task requires a dataset to train on. Fortunately, there is a great number of amazing publicly available datasets aimed at real-world scenarios. Nonetheless, the goal of my research was an overview of transformers, not building a model which solves some specific business case. Therefore, I was motivated to use commonly known research datasets instead, which are widely used by the Deep Learning community and highly recognized in NLP. They are listed below, followed by a short sketch of how they can be loaded:

• Large Movie Review Dataset [9]. This is a dataset for binary sentiment classification. Each record in the dataset is a user review of a movie with one of two categories, where 0 means that the review is negative and 1 means it is positive. The dataset contains 25,000 highly polar movie reviews for training and 25,000 for testing. This is a good baseline dataset for my investigation of the sequence classification task.

• WMT16 Dataset [3]. The recurring translation task of the WMT workshops focuses on news text and European language pairs. I used the English-German and English-Russian language pairs to tune models on the neural machine translation task. The structure of WMT16 is quite simple: it consists of two parts, a source sentence in English and a target sentence in German/Russian, which is a syntactically and grammatically correct translation of the source sentence.

• SQuAD Dataset [15]. The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

SQuAD 1.1 is the older version of the SQuAD dataset; it contains more than 100,000 question-answer pairs based on more than 500 articles.

SQuAD 2.0 combines the 100,000 questions in SQuAD 1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD 2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.


• WikiText dataset [18]. This language modelling dataset is a collection of over 100 million tokens extracted from the set of verified good and featured articles on Wikipedia. I used the wikitext-2-raw-v1 version of the dataset for the language modelling task. The dataset has 600 articles for training and 60 for validation; the total size of the dataset is 17.41 MB.
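As an illustration of the data setup (a sketch under the assumption that the Hugging Face datasets library is used; the dataset identifiers follow its public catalogue and may change between library versions), the corpora above can be loaded as follows:

from datasets import load_dataset

# Illustrative loading of the four corpora with the Hugging Face `datasets`
# library; identifiers and configuration names are those of the public hub
# and may differ between library versions.
imdb = load_dataset("imdb")                               # Large Movie Review Dataset
wmt16_de = load_dataset("wmt16", "de-en")                 # WMT16 English-German news pairs
squad = load_dataset("squad")                             # SQuAD 1.1 question answering
wikitext = load_dataset("wikitext", "wikitext-2-raw-v1")  # WikiText-2 for language modelling

print(imdb["train"][0]["label"], imdb["train"][0]["text"][:80])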

Lastly, after finishing these two steps, I had task-model pairs on which I needed to run experiments. However, this is not the end, because there are also a few options for training the network. I decided to consider three main techniques to train transformers:

• Training from scratch. In this scenario, all weights of both the tokenizer and the model are initialized at the beginning of training. The first step is to train a tokenizer on the dataset we have. After the embeddings are fitted well enough, we can move on to training the model. This method is a very complex task, because to obtain satisfactory results we need huge computational power, which can be afforded only by IT giants like Google and OpenAI: the training process of a huge transformer on only a few GPUs can take months or even years of continuous computing. Additionally, the size of the dataset plays an essential role as well. To train a good transformer network from scratch we should train it on a dataset which is as big as possible (dozens of gigabytes). However, due to the limited availability of these requirements, this training scenario is very costly for my research. Nonetheless, I tried to use it in simple cases (BERT and DistilBERT for sequence classification). You can see the results in the Experiments section.

• Transfer learning from a previously pre-trained model. In this case, we take a model which was already trained on some big dataset using huge computational resources for some NLP task, and train it on our dataset to solve our task. This means that we reuse the token embeddings and weights of the pre-trained model and just tune them. This technique allows us to achieve good results with much less effort than training from scratch.

• Fine-tuning. This method actually belongs to transfer learning, but with some changes in the procedure. While in plain transfer learning we train the whole pre-trained neural network on our dataset, in fine-tuning we freeze the layers of the base model and train only the head layers, which are different for each specific task. So, in the case of sequence classification, our task is just to fit the weights of the classifier layer, as sketched in the example after this list. This technique is very efficient, because it does not require lots of GPU resources and converges very fast with small losses in final quality.
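The sketch below illustrates the last option with the Hugging Face transformers library: the pre-trained BERT body is frozen and only the classification head is trained. The model name and the .bert attribute holding the encoder are assumptions of this example (they differ between architectures), and this is not the exact training script used in my experiments.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Sketch: fine-tune only the classification head of a pre-trained BERT;
# the encoder weights stay frozen.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

for param in model.bert.parameters():    # `.bert` holds the pre-trained encoder
    param.requires_grad = False          # freeze the base model

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=2e-5)

batch = tokenizer(["What a wonderful movie!", "A complete waste of time."],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1, 0])

model.train()
outputs = model(**batch, labels=labels)  # cross-entropy loss computed on the head
outputs.loss.backward()
optimizer.step()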


2 Prerequisites

The Transformer is a very complex topic for research. In order to understand every aspect of transformers, one needs solid knowledge of a number of subjects. The crucial ones are:

• Linear Algebra. It is probably the most important prerequisite for any data science project, since all the calculations are mostly made with matrices; thus one needs knowledge of matrices and tensors (higher-dimensional generalizations of matrices) and operations on them (matrix inverse, eigenvectors and eigenvalues, norms and decomposition algorithms).

• Probability and Statistics. Since Deep Learning is mostly probabilistic and every researcher has to deal with non-deterministic behaviour, it is crucial to have a basic knowledge of random variables, distributions (Bernoulli, Gaussian) and their common properties (PDF, expectation, variance, etc.).

• Programming skills (preferably Python). This is mostly related to the practical part. Nonetheless, it is important to get your hands dirty in order to understand deeply, because you do not really know anything until you implement it yourself. Over time, Python has become the number one programming language used for data science pipelines, mostly because of its simple syntax. Python is a high-level programming language which is easy to learn without a previous background in computer science. It is quite convenient for researchers who are far away from programming for some reason but have a solid background in mathematics. Moreover, Python's syntax allows researchers to think more about algorithms and the mathematical aspects of neural networks rather than spending time on coding details (memory allocation, variable declarations and so on).

• Machine Learning basics. Since a transformer is just a machine learning algorithm, it is better to know the essentials of machine learning and data science. In particular, it is necessary to understand data science pipelines: how to pre-process data, split a dataset into training, evaluation and testing sets, tune hyperparameters and evaluate models with appropriate metrics.

• NLP basics. We are going to deal with texts, which is a non-trivial task, because computers cannot understand human language like we do; they can only work with numbers. Therefore, we need to know how to overcome this obstacle, and NLP can help us. Knowledge of methods for processing texts (text cleaning, tokenization, word embeddings) matters a lot.

• Deep Learning basics. Of course, the Transformer is a Deep Learning algorithm, and it is impossible to succeed with it without understanding essential ideas like:

– Hyperparameters (learning rate, weight decay, batch size, number of epochs);

– Hidden layers (Dense, Maxpooling, Dropout, etc.);

– Optimizers (SGD, Adam, AdamW, etc.).

• Knowledge of a Deep Learning framework (TensorFlow or PyTorch). Although it is possible to build networks from scratch with NumPy, it is a very complicated exercise and requires a lot of effort even for a simple network, while transformers are among the most complicated architectures nowadays. Moreover, most of the time they require parallel execution, which makes them much harder to develop from scratch. To overcome this, there are Deep Learning frameworks which provide building blocks for training neural networks (layers, autograd, optimizers, loss functions, schedulers and much more). I have chosen PyTorch, which has a convenient pythonic interface and is easy for researchers to use.


3 Literature Review

Literature played an important role in my research: I have read dozens of articles from Medium, the Hugging Face documentation and GitHub pages of various researchers, as well as the official papers of the network developers. Since Deep Learning, and the Transformer in particular, is a relatively new research direction, new investigations and papers are published very often. Therefore I decided to rely on the historically most important ones, which made a huge impact on the field. They are:

• Attention is all you need [22], written by Ashish Vaswani et al., where the authors introduced the Transformer architecture, was the most important source in my project work, and probably not only in my work but also in the papers of other Deep Learning researchers of the last years. This architecture, based on the already known mechanism called attention, made a huge revolution in NLP, because it does not suffer from the problems which RNNs had, such as vanishing and exploding gradients and the complexity of parallel computing.

• The book Deep Learning [6], written by Ian Goodfellow, is truly one of the best books about neural networks. Thanks to it, I understood the mathematics working under the hood, and it helped me to improve my overall skills in Deep Learning. This book covers all necessary aspects of the essentials of Deep Learning, starting from applied math and machine learning basics in the first part, continuing with modern practices in deep networks (feedforward, convolutional and recurrent networks, regularization and optimization) and finishing with more complex structures like autoencoders and GANs (generative adversarial networks).

• BERT [5]. This network was proposed by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova in the paper ”BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” in 2018 and achieved state-of-the-art performance on various natural language understanding (NLU) tasks. For instance, BERT pushed the GLUE score to 80.5% (a 7.7 percentage point absolute improvement). It is a Transformer-based machine learning technique developed by Google in order to better understand users' behavior online; as a result, by October 2020 almost every English-based query in Google Search was processed by BERT, and it had been adopted for more than 70 languages overall. The main reason why BERT is so powerful is that it is a deeply bidirectional transformer (it captures context from both sides, left-to-right and right-to-left), while earlier models worked in only one direction (from left to right).

• DistilBERT [16]. This architecture was introduced in the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter by Victor Sanh, Lysandre Debut, Julien Chaumond and Thomas Wolf. DistilBERT is a lightweight version of BERT, which achieves 95% of BERT's performance, runs 60% faster and has 40% fewer parameters than the base BERT model.

• XLNet [24]. The model was proposed in XLNet: Generalized Autoregressive Pretraining for Language Understanding by Zhilin Yang et al. It is an extension of the Transformer-XL model, pre-trained with an autoregressive method that learns bidirectional contexts by maximizing the expected likelihood over all permutations of the input sequence factorization order.

• The T5 model [14] was presented in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Colin Raffel, Noam Shazeer, et al. T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, for which each task is converted into a text-to-text format. T5 works well on a variety of tasks out of the box by prepending a different prefix to the input corresponding to each task.

• GPT-2 [13]. The GPT and GPT-2 architectures do not have big differences apart from their sizes and the datasets used for training: GPT-2 is much bigger than GPT and was trained on a bigger dataset (WebText versus BookCorpus). These papers helped me to understand the architectures in detail and to distinguish them from the base Transformer architecture. Moreover, GPT-2 makes it clear that ”in transformers, bigger is better” and that outstanding results cannot be achieved without huge datasets and nearly unlimited computational resources.

• The Illustrated Transformer [1]. This article is one of the best blog posts describing all the aspects of transformers. The author covers topics such as the encoder and decoder blocks, self-attention and multi-head self-attention, together with illustrations of the whole pipeline of how a text sequence goes through the transformer.

• BERT Explained: State of the art language model for NLP [7]. The author of this article describes every part of BERT in detail, starting from the main idea of transformers and the nuances of how it works, and ending with recommendations on how to fine-tune it and the results achieved by this model.


4 Historical Background

Speech is probably the most important feature of the human species, one which distinguishes us from all other animals. The ability to communicate with each other gave us an invaluable advantage: thanks to it, we were able to unite into societies, survive and develop together, and achieve outstanding evolutionary results compared to the other species living on Earth. Speech is the most effective way of communication; with it we can share our thoughts, dreams and ideas, manage projects, warn other people about danger and much more.

We are now at a peak of interest in NLP, a field of computer science that focuses on the linguistic interaction between humans and machines. Thanks to breakthroughs in machine learning (ML) in the last decade, we are seeing dramatic improvements in speech recognition and machine translation. Language generators are already good enough to write coherent news articles, and virtual assistants like Siri and Alexa are becoming part of our daily lives.

In the middle of the 20th century, when the transistor was invented and computers were developing rapidly, people wanted to teach machines to recognize human language and even speak with us. However, as we know, computers understand only the language of bits and bytes (0s and 1s), and teaching them to understand us was not a trivial task, because every language has its own rules and exceptions. Nonetheless, people did not give up and are still trying to succeed at it. There is a sub-field of linguistics and computer science concerned with the interactions between computers and human languages, especially with how to process and analyze natural language data. So, let us have a look at the steps people have taken during the last few decades in order to succeed in NLP.

Most historians trace the origins of this field to the beginning of the computer era, when Alan Turing, in a 1950 paper, described an intelligent machine that could easily interact with a person through text on a screen. Therefore, machine-generated language is usually thought of as a digital phenomenon, as well as a primary goal of developing artificial intelligence (AI).

Until the 1980s, most NLP systems were based on complex, handwritten rules. Some particularly successful early NLP systems were SHRDLU, a natural language system operating in bounded ”block worlds” with bounded vocabularies, and ELIZA, a Rogerian psychotherapist simulation written by Joseph Weizenbaum between 1964 and 1966. Using almost no information about human thought or emotion, ELIZA sometimes produced surprisingly human-like interactions. With a suitably small knowledge base, ELIZA could provide a generic answer, for example responding to the statement ”My head hurts” with ”Why do you say you have a headache?”

But in the late 1980s, a ”statistical revolution” in NLP came about. This was the result of both the steady increase of computational power and the shift to machine learning algorithms. While some of the early machine learning algorithms (decision trees are a good example) produced systems similar to the old-school handwritten rules, research progressively focused on statistical models. These statistical models are capable of making soft, probabilistic decisions. In this period, the popularity of statistical models for NLP rose dramatically, and methods like n-grams became useful for recognizing and tracking clumps of linguistic data numerically.

4.1 Neural Networks

Since the statistical revolution of the 1980s and 1990s, much NLP research has been based on machine learning, and nowadays it relies on ML even more because of the big breakthroughs in the subfield of machine learning called Deep Learning.


Although recurrent architectures such as the LSTM were introduced as early as 1997, they became popular and widely used only at the beginning of the 2010s, when deep neural network methods became widespread in NLP and computational power was sufficient for training such algorithms. In 2001, Yoshua Bengio published the first feed-forward neural network (FFNN) trained for language modelling [2]. A FFNN is a neural network that does not contain any connections forming a cycle: sequences of data move only in one direction, from the input to the output nodes through hidden layers. Since the feed-forward neural network does not have any cycles or loops, it is different from RNNs.

Figure 2: Simple Feed-forward neural network

Nonetheless, feed-forward neural networks could not capture context from sequential data, which is essential for NLP, and thus did not stay on top for long. RNNs and their modified versions, such as Long Short-Term Memory and Gated Recurrent Units, became the main models and achieved state-of-the-art results in various NLP tasks [6]. Since then, many attempts have been made to increase their efficiency by applying various techniques.

Figure 3: Basic RNN architecture

All of these architectures share a common working mechanism under the hood: sequential processing. Sentences are processed sequentially, token after token, with additional hidden units. In this way, the network can capture the context of the sentence and make smarter decisions.

This property was the reason why RNNs and LSTMs could not be trained with parallel computing to decrease training time. In order to calculate the vector for the n-th token, we need to know the vector of the (n − 1)-th token, which in turn needs to be computed from the information of the previous token. Thus, parallel computing is not possible in this case. Consequently, training a model that gives more or less acceptable accuracy could take a lot of time. Removing this fundamental constraint of sequential computation has been the main goal for most NLP researchers working with RNNs.

The second biggest disadvantage of RNNs is vanishing/exploding gradients. This means that as the length of the sentence grows, the gradients flowing through the network can explode to enormously large values or shrink towards zero (vanish), for example simply because of inappropriate initialization. As a consequence, further training of the model becomes impossible and the model cannot capture long dependencies. In other words, the model is biased towards the most recent inputs in the sequence, and older inputs have practically no effect on the output.

4.2 Transformers

In 2015 Bahdanau et al. [8] proposed a revolutionary mechanism, which became one of the most valuable breakthroughs in Deep Learning research of the last decade. It is called attention. Their neural machine translation system was based on an encoder-decoder RNN model together with the attention mechanism. Here is a diagram of the model from Bahdanau's paper:

Figure 4: Attention mechanism connected to LSTM units

However, this model did not accomplish truly impressive results, because RNNs were still in use, together with their drawbacks, although subsequent researchers tried to eliminate them. Fortunately, two years later, in 2017, Vaswani et al. stated in their paper ”Attention is all you need” [22] that attention is a self-sufficient mechanism for working with text data and does not require any RNN in the architecture. Additionally, attention was redefined with a very generic and broad formulation based on Key, Query and Value vectors. Moreover, the researchers proposed another concept called Multi-Head Attention. These proposals became the biggest contribution to modern NLP.

The new architecture proposed by Vaswani et al. is called the Transformer. It is the first neural network model which relies completely on self-attention layers without any recurrent layers. It is an encoder-decoder neural network, where the encoder part maps an input sequence of symbol representations (x1, x2, ..., xn) into a sequence of vectors (z1, z2, ..., zn), while the decoder auto-regressively generates vectors (y1, y2, ..., ym), one at each time step i ∈ {1, 2, ..., m}, which are then decoded into the corresponding tokens.

Both the encoder and decoder parts of the Transformer consist of a stack of N identical layers, each of which is a combination of a Multi-Head Self-Attention mechanism and a simple position-wise fully-connected feed-forward neural network. As a result, this architecture reached a 28.4 BLEU score (more than 2.0 BLEU ahead of the previous models) on the WMT 2014 English-to-German dataset (roughly 4.5 million sentence pairs), which was a new state-of-the-art score.

This event motivated other Deep Learning researchers to develop their own variations of the Transformer and apply them to other NLP tasks. In this way, dozens of transformer architectures have been developed in the last few years. The most essential of them are BERT by the Google Research team, GPT-2 from OpenAI, T5 from Google AI, and so on. Each of these architectures has its own history and can solve different NLP problems.


Figure 5: The Transformer model architecture

Nowadays, we are at the cutting edge of NLP development, when scientists around the world regularly publish articles about transformers and propose various techniques to improve the models' performance, reduce their sizes, beat NLP benchmarks and train efficiently on colossal datasets.

5 Theory Behind Transformers

5.1 NLP Pipeline

Every machine learning pipeline consists of a number of steps which should be done sequentially. They are:

• Data gathering. Without a good dataset it is impossible to build a good model; moreover, a simple algorithm trained on well-gathered data performs much better than a model trained on garbage data.

• Data pre-processing. This step plays an important role, because datasets can hide useful information inside them, and data scientists who can pull it out deserve a well-paid position. In this step there are some differences between the classical machine learning and NLP approaches, but they both share a similar idea.

• Modelling and Evaluation. This part is the cherry on the cake, because here we can choose the various algorithms that can be applied to the task on the given dataset. Here we compare models by their complexity, accuracy (and other metrics) and efficiency in order to choose the best one.

• Hyperparameter tuning. When the top model has been chosen, its hyperparameters (learning rate, batch size, etc.) need to be tuned in order to improve its performance.


• Inference. The final step of the pipeline, where we can see how the model actually works on real-life examples.

The NLP pipeline is no exception among ML pipelines; however, it requires additional steps to be applied to the data, because in contrast to modelling well-structured data, we have to deal with raw texts, which mean absolutely nothing to computers. The first step of pre-processing the datasets is to split texts into small units of text (tokens), which are then converted into a vector space (vectorization).

These two steps are part of any NLP pipeline and they play a crucial role, because a well-thought-out approach to tokenization and vectorization can help language models to understand their context much better and thus achieve better results in the end.

5.2 Tokenization

Tokenization is a method of splitting sentences into a list of small elements (tokens) such as characters, words or word pieces. The idea looks quite simple: in English we can split a sentence every time we find a certain delimiter such as a space or a comma. For example, the sentence ”This is a cat” can be tokenized as the [This, is, a, cat] vector. However, similar words can be split into different tokens. Consider ”Elon Musk is a founder of the Tesla company. Musk's company was founded in 2008”. In this example, a simple word tokenizer will create two completely different tokens for [founder] and [founded], although they are almost the same word. As we see, even in this simple sentence, the task is not as trivial as it looks.

• Word Tokenization is the most commonly used tokenization algorithm. It splits a piece of text into individual words based on a certain character. Depending on the delimiter, different tokens can be formed. Pre-trained word embeddings such as Word2Vec and GloVe are built on top of word tokenization.

However, this tokenization method has a few drawbacks. One of the major issues is the necessity to deal with out-of-vocabulary (OOV) words. OOV words are new words which were not represented in the training dataset, so they do not exist in the vocabulary. Hence, word-level methods fail in handling OOV words.

Another issue with word tokens is connected to the size of the vocabulary. Generally, pre-trained models are trained on a large volume of text. So, just imagine building a vocabulary with all the unique words in such a large corpus: it explodes the vocabulary!

• Character Tokenization. This approach splits a sequence of text into a set of characters. It overcomes the drawbacks which word tokenization has; however, it has its own issues. The lengths of the input and output sequences increase dramatically when we represent a sentence as a list of characters. As a result, it becomes challenging to learn the relationships between the characters needed to form meaningful words.

• Subword Tokenization. It is the golden mean between word and character tokenizers. A subword tokenizer splits text into subwords (n-gram characters). For example, a word like ”tokenizer” can be split into [token, *iz, *er], ”smarter” into [smar, *ter] and so on.

Transformer-based models mostly rely on subword tokenization to prepare the vocabulary. There are a few essential subword tokenization algorithms; the most widely used family is based on Byte-Pair Encoding (BPE). An illustrative example with a pre-trained tokenizer is shown below.
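For illustration (assuming the pre-trained bert-base-uncased WordPiece tokenizer from the transformers library, which is not part of the original experiments), a subword tokenizer splits rare words into known pieces:

from transformers import AutoTokenizer

# A pre-trained WordPiece tokenizer splits rare or unseen words into subword
# units that exist in its vocabulary (the outputs shown as comments are
# typical and may differ slightly between tokenizer versions).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("This is a cat"))   # ['this', 'is', 'a', 'cat']
print(tokenizer.tokenize("tokenization"))    # ['token', '##ization']
print(tokenizer.encode("This is a cat"))     # token ids, with [CLS]/[SEP] added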


Byte-Pair Encoding (BPE) is widely used among transformer-based models. BPE is originally a compression algorithm, introduced in 1994 by Philip Gage, that recursively merges the most frequently occurring characters or character sequences in the dataset. BPE overcomes the drawbacks of the word and character tokenizers:

• BPE tackles OOV by segmenting an OOV word into subwords and representing the word in terms of these subwords;

• The lengths of the input and output sequences after BPE are much shorter compared to character tokenization.

Steps to learn a Byte-Pair Encoding (a minimal code sketch follows the list):

1. Split the corpus into characters, appending the end-of-word symbol </w> to the end of each word;

2. Initialize the vocabulary with unique characters in the corpus;

3. Compute the frequency of a pair of characters or character sequences in the corpus;

4. Merge the most frequent pair in the corpus;

5. Save the best pair to the vocabulary;

6. Repeat steps 3 to 5 for a certain number of iterations or until some stopping condition is satisfied.
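A minimal Python sketch of this learning loop, in the spirit of the classic formulation by Sennrich et al. (toy corpus, illustration only, not the implementation of any particular library):

import re
from collections import Counter

def get_pair_stats(vocab):
    """Count how often each adjacent symbol pair occurs in the corpus."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of `pair` into a single new symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words are pre-split into characters, with an explicit end-of-word marker </w>.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for step in range(10):
    pairs = get_pair_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(f"step {step}: merged {best}")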

WordPiece. WordPiece is the variant of BPE used in BERT and DistilBERT. The algorithm was described in ”Japanese and Korean Voice Search” by Schuster and his team [17]. WordPiece initializes the vocabulary to include every character present in the training dataset and iteratively learns a given number of merge rules. Compared to BPE, WordPiece does not choose the most frequent character pair, but the one that maximizes the likelihood of the training data once it is added to the vocabulary.

Maximizing the likelihood of the training data is equivalent to searching for the character pair whose probability divided by the product of the probabilities of its first and second symbols is the largest among all character pairs. For example, ”a” followed by ”b” will be merged only if the probability of ”ab” divided by the probabilities of ”a” and ”b” is bigger than for any other character pair. Intuitively, WordPiece differs from BPE in that it evaluates what it loses by merging two symbols, to make sure it is worth it. A toy numerical illustration follows.
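As a toy numerical illustration (the frequencies below are invented for the example), the pair with the highest likelihood ratio is merged, even if it is not the most frequent one:

# WordPiece merge criterion on invented toy counts: choose the pair with the
# highest score freq(ab) / (freq(a) * freq(b)), not simply the most frequent pair.
freq = {"a": 20, "b": 15, "ab": 10, "c": 2, "d": 3, "cd": 2}

def score(x, y):
    return freq[x + y] / (freq[x] * freq[y])

print(round(score("a", "b"), 3))  # 10 / (20 * 15) = 0.033 (frequent but low score)
print(round(score("c", "d"), 3))  # 2 / (2 * 3) = 0.333 -> "cd" is merged first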

5.3 Word Embeddings

After tokenizing the input sentences, we have a sequence of token ids and a vocabulary which maps these ids to their subwords. Now we need to somehow represent this sequence as a number (or a vector of numbers). We could simply assign each word in the text some unique number or apply the one-hot encoding method. However, the dictionary of the corpus, or the resulting vector, would be enormously huge.

As another option, we can count how many times each word occurs in the corpus and map each word to its frequency. This method is called Bag of Words (BoW). Or we can apply a more complex algorithm, term frequency-inverse document frequency (TF-IDF), which is originally a measure of the originality of a word, obtained by comparing the number of times the word appears in a document with the number of documents the word appears in.
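A short scikit-learn sketch of these count-based vectorizers (illustrative only; scikit-learn is an assumed tool here, not part of the thesis experiments):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Count-based vectorization of a toy corpus: Bag of Words counts word
# occurrences, TF-IDF additionally down-weights words common to many documents.
corpus = ["this movie is great", "this movie is terrible", "great acting"]

bow = CountVectorizer().fit_transform(corpus)
tfidf = TfidfVectorizer().fit_transform(corpus)

print(bow.toarray())              # raw counts, one row per document
print(tfidf.toarray().round(2))   # TF-IDF weights for the same documents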

Still, these mapping methods have a huge disadvantage: they do not capture the meaning of words. In other words, tokens which are similar to each other in meaning will be mapped to vectors that are not similar to each other at all. Thus, we come to word embeddings: neural networks which can vectorize tokens in such a way that similar words are also close to each other in the vector space. These networks are built on the idea that ”a word is defined by its neighbors” and that similar words should have similar neighboring words.

Word embeddings are in fact a class of techniques where individual words are represented as real-valued vectors in a predefined vector space. Each word is mapped to one vector and the vector values are learned in a way that resembles a neural network, and hence the technique is often lumped into the field of Deep Learning.

Figure 6: Example of how words can be embedded to 2D space

Word embedding methods learn a real-valued vector representation for a predefined, fixed-sized vocabulary from a corpus of text.

Word2Vec. It is a statistical method for efficiently learning a standalone word embedding from a text corpus. It was developed by Tomas Mikolov and his team in 2013 [10] in an effort to make neural-network-based training of embeddings more efficient, and since then it has become the standard for developing pre-trained word embeddings in Deep Learning.

Figure 7: Word2Vec training models

Two different learning methods were introduced that can be used as part of the Word2Vec approach to learn the word embedding:

19

Page 20: EOTV OS LOR AND UNIVERSITY FACULTY OF SCIENCE

• Continuous Bag-of-Words, or CBOW model. The model learns the embedding by predicting the current word based on its context.

• Continuous Skip-Gram Model. This model learns by predicting the surrounding words given the current word.

Both models focus on learning about words given their local usage context, where the context is defined by a window of neighboring words. This window is a configurable parameter of the model; a minimal CBOW sketch is given below.
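A minimal PyTorch sketch of the CBOW idea (toy vocabulary and dimensions; this is an illustration, not the original Word2Vec implementation): the embeddings of the context words are averaged and used to predict the centre word.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word vectors to be learned
        self.out = nn.Linear(embed_dim, vocab_size)       # predicts the centre word

    def forward(self, context_ids):                  # context_ids: (batch, window)
        ctx = self.embed(context_ids).mean(dim=1)    # average the context vectors
        return self.out(ctx)                         # logits over the vocabulary

model = CBOW(vocab_size=5000, embed_dim=100)
context = torch.randint(0, 5000, (8, 4))             # batch of 4-word contexts
target = torch.randint(0, 5000, (8,))                # centre words to predict
loss = F.cross_entropy(model(context), target)
loss.backward()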

In addition to token embeddings, NLP models often use positional embeddings as well, in order to capture not just the meaning of a word but also the influence of its position. Positional embeddings are introduced to recover position information. There are two commonly used versions: learned positional embeddings and sinusoidal positional embeddings. Both produce similar results:

PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)

PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)

These two types of embeddings (token and positional) are then simply summed to give the final embedding.
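The sinusoidal variant defined by the two formulas above can be computed directly; the sketch below (toy sizes, even model dimension assumed) also shows the final summation with the token embeddings.

import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    angle = pos / (10000 ** (i / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)   # even indices: sine
    pe[:, 1::2] = torch.cos(angle)   # odd indices: cosine
    return pe

# Final input representation: token embedding plus positional embedding.
token_emb = torch.randn(10, 512)                 # 10 tokens, d_model = 512
x = token_emb + positional_encoding(10, 512)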

5.4 Sequence-To-Sequence Modelling

Sequence-to-sequence (seq2seq) models are Deep Learning models that have achieved outstanding performance in tasks such as machine translation, text summarization and so on. For example, Google Translate has been using a similar seq2seq model under the hood since the end of 2016. The foundations of seq2seq models were laid in 2014, when two articles were released: ”Sequence to Sequence Learning with Neural Networks” by Sutskever et al. [19] and ”Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation” by Cho et al. [4].

A sequence-to-sequence model is an algorithm which takes a sequence of elements (words, characters, subwords, etc.) as input and returns another sequence of elements. The seq2seq model works as follows:

Figure 8: Sequence-to-sequence modelling

In machine translation, a sequence of elements is a list of tokens that are processed one at a time, and the output is also a list of tokens. The model has an encoder and a decoder block inside. The encoder processes each element of the input sequence and converts it into a vector called the context. After the entire input sequence is processed, the encoder sends the context to the decoder, which then generates the output sequence iteratively, token after token. In the case of machine translation, this context is a single vector.

Before processing words, we need to convert them to vectors. This transformation is done using a word embedding algorithm. One can use either pre-trained embeddings or embeddings trained on one's own dataset; 200-300 is a typical dimension of the embedding vector.


Figure 9: Sequence-to-sequence modelling (encoder + decoder)

5.5 The Transformer

The Transformer is a deep neural network architecture introduced in 2017 by researchers from Google Brain. By analogy with RNNs, transformers are designed to process sequences such as natural language text and to solve problems such as machine translation and automatic summarization. Unlike RNNs, transformers do not require sequential processing: for example, if the input data is text, the Transformer does not need to process the end of the text after processing its beginning. Thanks to this, transformers are parallelized more easily than RNNs and can be trained faster. A Transformer can consist of both encoder and decoder blocks (a seq2seq model) or of only one of them (only an encoder block, an autoencoding model; only a decoder block, an autoregressive model). Let us look at the blocks in more detail.

Figure 10: The Transformer model architecture

5.5.1 Encoder Block

Let us first consider the encoder. It is the part of the network that receives a sequence of data as input and produces embeddings corresponding to the words, which will then be used by the decoder block.

The idea is that each token is passed in parallel through the layers shown in Figure 11. Some of them are standard fully-connected layers, some are shortcut connections as in ResNet (where Add appears in the picture). But the crucial layer here is Multi-Head Attention (Figure 12). It is a special layer that allows each input vector to interact with the other input vectors through the attention mechanism, instead of passing a hidden state as in RNNs or looking only at neighboring words as in convolutional neural networks (CNNs).

Figure 11: Encoder block of Transformer

Figure 12: Multi-Head Attention

The layer is given a Query vector as input, together with several Key and Value pairs. Each of them is transformed by a learned linear transformation; then the dot product of Q is calculated with each of the Keys in turn, the results of these dot products are run through a softmax function, and with the resulting weights all of the V vectors are summed into a single vector.

The only thing added on top of this is that several such attentions are trained in parallel (their number is indicated by h in Figure 12); in other words, multiple linear transformations and parallel dot products followed by weighted sums. The results of all these parallel attentions are then concatenated, run once again through a learned linear transformation, and passed to the output. A minimal sketch of this computation is given below.
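The following PyTorch sketch implements scaled dot-product attention and a simple multi-head wrapper as just described. Dimensions and hyperparameters are illustrative; this is not the reference implementation from the original paper.

import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    # dot products between queries and keys, softmax, weighted sum of values
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # (..., seq, seq)
    weights = scores.softmax(dim=-1)                           # attention weights
    return weights @ v                                         # weighted sum of V

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.wq = nn.Linear(d_model, d_model)   # learned projections for Q, K, V
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.wo = nn.Linear(d_model, d_model)   # final linear after concatenation

    def forward(self, x):
        b, t, d = x.shape
        # project, then split the model dimension into h parallel heads
        split = lambda y: y.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.wq(x)), split(self.wk(x)), split(self.wv(x))
        heads = scaled_dot_product_attention(q, k, v)       # (b, h, t, d_head)
        concat = heads.transpose(1, 2).reshape(b, t, d)     # concatenate the heads
        return self.wo(concat)

x = torch.randn(2, 7, 512)                                  # 2 sentences, 7 tokens
out = MultiHeadAttention(d_model=512, n_heads=8)(x)         # same shape as the input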

In general, each such module receives a Query vector and a set of Key and Value vectors as input, and produces one vector of the same size as each of the inputs. Since such a block outputs a vector of the same size as its input, the block can be inserted into the network several times, adding depth to the network. In practice, a combination of Multi-Head Attention, a residual connection and a fully-connected layer is used 6 times, which already makes a deep enough network.

The last thing to mention is that one of the features of each word is its positional encoding, i.e. its position in the sentence. For example, this makes it easy to ”pay attention” to neighboring words during word processing, if they are important.

As such a feature they use a vector of the same size as the word vector, which contains sines and cosines of the position with different periods, so that, as the authors say, it is easy to attend to relative offsets by choosing a coordinate with the desired period.

They also use Layer Normalization. This is a normalization procedure that normalizes the outputs of all neurons in a layer within each sample (instead of normalizing each neuron separately across the batch, as Batch Normalization does).

Let’s try to summarize the work of the encoder point by point.

Encoder operations:

1. Embeddings are made for all words of the sentence (vectors of the same dimension). For example, let the sentence be ”I am mathematician”. The position of each word in the sentence is also added to its embedding.

2. The vector of the first word and the vector of the second word (I, am) are taken and fed to a single-layer network with one output, which gives the degree of their similarity (a scalar value). This scalar is multiplied by the vector of the second word, producing a copy of it ”weakened” by the similarity value.

3. Instead of the second word, the third word is taken and the same is done as in step 2, with the same network and the same weights (for the vectors I, mathematician).

4. Doing the same for all the remaining words of the sentence, their ”weakened” (weighted) copies are obtained, which express the degree of their similarity to the first word. Then all these weighted vectors are added together, producing one resulting vector with the dimension of a single embedding:

output = am * weight (I, am) + mathematician * weight (I, mathematician)

This is the ”normal” attention mechanism.

5. Since assessing the similarity of words in just one way (according to one criterion) is considered insufficient, the same procedure (steps 2-4) is repeated several times with different weights. One attention head may determine the similarity of words by semantic meaning, another by grammatical role, the rest in some other way, and so on.

6. At the output of step 5, several vectors are obtained, each of which is a weighted sum of all the other words of the sentence according to their similarity to the first word (I). These vectors are concatenated into one.

7. Then one more linear transformation layer is applied, which reduces the dimension of the result of step 6 down to the dimension of a single embedding. The result is a kind of representation of the first word of the sentence, made up of the weighted vectors of all the other words in the sentence.

8. The same process is done for all other words in the sentence.

9. Since the output dimension is the same as the input dimension, the same thing can be done again (steps 2-8), but instead of the original word embeddings, taking what is obtained after passing through this Multi-Head Attention, with the attention networks inside having different weights (weights between layers are not shared). There are many such layers (Google uses 6). Additionally, between the layers, a fully-connected layer and residual connections are added to give the network more expressiveness.


As a result, for each word a final output embedding is obtained, which the decoder will look at.

5.5.2 Decoder Block

The decoder is also run one word at a time: it receives the previous word as input and must produce the next one (at the first iteration, it receives a special <start> token).

Figure 13: Decoder block of Transformer

There are two different uses of Multi-Head Attention in the decoder:

• The first is the ability to refer to the vectors of past decoded words, just as during the encoding process (but attending not to all words, only to those already decoded).

• The second is the ability to access the encoder output. In this case, the Query is the input vector in the decoder, and the Key/Value pairs are the final encoder embeddings, where again the same vector is used both as key and as value (but the linear transformations inside the attention module are different for them).

In the middle there is still just a fully-connected layer, again with the same residual connections and layer normalization.

And all of this is repeated 6 times, where the output of the previous block goes to the input of the next one.

Finally, at the end of the network, softmax is applied as usual, giving the probabilities of the words. This gives us the next word in the sentence. We provide this word as input to the next round of the decoder, and the process repeats until the decoder outputs the end-of-sentence token [EOS]. A minimal decoding-loop sketch is given below.
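The greedy decoding loop just described can be sketched as follows. The model object, its encode()/decode() methods and the bos_id/eos_id token ids are hypothetical placeholders for this illustration, not a real library interface.

import torch

def greedy_decode(model, src_ids, bos_id, eos_id, max_len=50):
    memory = model.encode(src_ids)                 # final encoder embeddings
    out = torch.tensor([[bos_id]])                 # start with the <start> token
    for _ in range(max_len):
        logits = model.decode(out, memory)         # (1, current_length, vocab_size)
        next_id = logits[0, -1].argmax().item()    # most probable next token
        out = torch.cat([out, torch.tensor([[next_id]])], dim=1)
        if next_id == eos_id:                      # stop at the end-of-sentence token
            break
    return out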


5.5.3 Summary

During encoding, each vector interacts with all the others. During decoding, each next word interacts with the previous ones and with the encoder vectors.

In general, the main innovation is the use of the self-attention mechanism to interact with the other words in the sentence instead of the RNN or CNN mechanisms. The authors theorize that it helps because the network can easily access any information regardless of the length of the context: referring to the previous word or to a word 10 steps back is equally easy. This makes the Transformer easier to train and easy to compute in parallel, unlike RNNs, where each step has to be done in turn.

6 NLP Tasks Overview

NLP is a subfield of Artificial Intelligence which helps computers to understand and interpret human languages. NLP allows computers to communicate with people using a human language and provides computers with the ability to read text and interpret it. NLP draws on a number of disciplines, such as computational linguistics and computer science, and attempts to close the gap between human and computer communication. Although NLP is a broad field of study, it can be grouped into several tasks which researchers are dealing with. Search, machine translation, summarization, named-entity recognition, information retrieval, information grouping, answering queries, automated speech recognition and sentiment analysis are just a short list of the challenges NLP faces, and some tasks can be divided into even smaller challenges. For example, the famous benchmark for evaluating and analyzing NLP systems, GLUE (General Language Understanding Evaluation), counts 9 main challenges, which are described in Figure 14.

Figure 14: GLUE NLP tasks

Nonetheless, most of them are quite similar to each other from a technical standpoint, while I was interested in challenges which differ from each other as much as possible. Therefore I worked on sequence classification, language modelling, question answering and translation. All of them share a similar base modelling structure, but differ in the final layers, dataset structure, training and evaluation parts:

• Sequence Classification. Sequence classification is the task of classifying sequences according to a given number of classes. An example of sequence classification is the GLUE dataset, which is entirely based on that task. It identifies the general mood, or subjective opinions, stored in large amounts of text, and is useful for opinion mining.

• Question Answering. Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a question answering dataset is the SQuAD dataset, which is entirely based on that task.

• Neural Machine Translation. This typically involves translating one natural language into another, preserving the meaning and producing fluent text as a result. In my research I was mostly interested in the English-German and English-Russian language pairs, which have many rich datasets and are easy to train compared to other language pairs.

• Language Modelling. Language modelling is the task of fitting a model to a corpus, which can be domain specific. There are two versions of LM: Masked Language Modelling and Causal Language Modelling. As GPT-2 is trained using causal language modelling, let us briefly describe this method:

Causal Language Modelling is the task of predicting the token following a sequence of tokens. In this setting, the model only attends to the left context (tokens to the left of the mask). Such training is particularly interesting for generation tasks.

According to the Wikipedia definition, a language model is a probability distribution over sequences of words, P(w_i | w_1, w_2, ..., w_{i-1}), where i is the length of the sequence and w_i is the next word in the sequence. Generally speaking, a Language Model (LM) can be defined as a machine learning model which, given part of a sentence, can predict the next word in the sequence. Such models can be seen on modern smartphone keyboards, where the most probable next words are suggested based on what the user has already typed. Though language models are rarely used directly, they play a crucial role in other real-world applications. Text generation is one of them, which is basically an auto-regressive run of an LM with some stopping criteria. So, mathematically speaking, language models are a kind of Markov chain, where the i-th word depends on the previous i-1 words, and our goal is to optimize the model's parameters so that the most appropriate words receive higher probabilities.

Language modelling can be useful outside of pre-training as well, for example to shift the model distribution to be domain-specific: using a language model trained over a very large corpus and then fine-tuning it on a news dataset or on scientific papers (a minimal sketch of causal next-word prediction follows this list).
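
The sketch below illustrates causal next-word prediction with the Hugging Face transformers library; the public "gpt2" checkpoint and the prompt are illustrative assumptions, not part of the experiments described later.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer("I like to drink", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits                 # (1, seq_len, vocab_size)

# P(w_i | w_1, ..., w_{i-1}) for the position following the prompt
probs = torch.softmax(logits[0, -1], dim=-1)
top = probs.topk(5)
for p, idx in zip(top.values, top.indices):
    print(repr(tokenizer.decode(int(idx))), float(p))
```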


7 Transformers Overview

7.1 BERT

BERT (Bidirectional Encoder Representations from Transformers) is an NLP algorithm developed by researchers at Google AI Language. It has aroused interest in the NLP community by presenting state-of-the-art results on various NLP tasks, including Sentiment Analysis, Question Answering and others.

BERT's essential innovation is bidirectional training of the Transformer. Previous implementations looked at a text sequence in one direction, from left to right, while BERT uses combined left-to-right and right-to-left training (in fact it does not look at direction at all, so it is better called non-directional rather than bidirectional training). The results show that a model trained bidirectionally can develop a deeper sense of language context than one-directional models. Additionally, the developers explain in detail a new technique named Masked Language Modelling (MLM) that allows bidirectional training in models where it was previously impossible.

BERT uses an attention mechanism that learns contextual relations between tokens in a text. In its basic form, the Transformer includes two separate mechanisms: an encoder that takes the text as input and a decoder that produces a prediction for the task based on the output of the encoder. BERT's main goal is to produce a language model, therefore only the encoder block is used.

When a language model is being trained, there is the issue of defining a prediction goal. Some algorithms are trained to predict the next word in a sequence (like GPT) in a directional approach, which limits the learning of context. To overcome this problem, BERT uses two training strategies: Masked Language Modelling (MLM) and Next Sentence Prediction (NSP). Let's discuss each of them in detail.

7.1.1 Masked Language Modelling

Before feeding a sequence of words into BERT, 15% of the tokens in each sequence are replaced with a [MASK] token. The algorithm then tries to predict the original value of the masked tokens, based on the context provided by the non-masked words in the sequence.

Figure 15: BERT MLM architecture

To implement this idea, we need to meet the following requirements:

• To add a classification (Dense/Linear) layer or fully-connected feed-forward classifier on top of the encoder output.


• To multiply the output vectors by the embedding matrix, transforming them into the vocabulary dimension.

• To calculate the probability of each word in the vocabulary by applying the softmax function.

BERT's loss function takes into account only the prediction of the masked tokens, ignoring the prediction of the non-masked tokens. As a result, the algorithm converges 6 times more slowly than one-directional models, but gains increased context awareness.
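
A minimal sketch of the MLM prediction head implied by the three requirements above; the hidden size, vocabulary size and the random "encoder output" are illustrative assumptions (the real BERT head also ties weights with the input embeddings and adds a bias term).

```python
import torch
import torch.nn as nn

vocab_size, hidden = 30522, 768                 # BERT-base-like dimensions (assumption)
embedding = nn.Embedding(vocab_size, hidden)    # shared embedding matrix

# requirement 1: a small fully-connected transform on top of the encoder output
transform = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(), nn.LayerNorm(hidden))

def mlm_probs(encoder_output):                  # encoder_output: (seq_len, hidden)
    h = transform(encoder_output)
    # requirement 2: multiply by the embedding matrix -> vocabulary dimension
    logits = h @ embedding.weight.T             # (seq_len, vocab_size)
    # requirement 3: softmax gives a probability for every word in the vocabulary
    return torch.softmax(logits, dim=-1)

masked_positions = torch.tensor([3, 7])         # only masked tokens enter the loss
probs = mlm_probs(torch.randn(16, hidden))
print(probs[masked_positions].shape)            # torch.Size([2, 30522])
```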

7.1.2 Next Sentence Prediction

In the BERT training process, the model receives pairs of sentences as input and learns to predict whether the second sentence in the pair follows the first sentence. During training, half of the inputs are pairs in which the second sentence is the subsequent sentence of the first one, while in the other half a random sentence from the corpus is chosen as the pair. The assumption is that the random sentence will be contextually far from the original document.

In order to help BERT distinguish between the two sentences during training, the input is processed in the following manner before going to the model (a tokenizer sketch follows this list):

• A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.

• A sentence embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings, but with a vocabulary of 2.

• A positional embedding is added to each token to indicate its position in the sequence.
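
The sketch below shows how a BERT tokenizer from the Hugging Face library produces exactly this input format ([CLS]/[SEP] tokens plus sentence-A/B ids); positional embeddings are added inside the model. The two example sentences are illustrative.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("The cat sat on the mat.", "It was very comfortable.")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'the', 'cat', ..., '[SEP]', 'it', 'was', ..., '[SEP]']
print(enc["token_type_ids"])   # 0 for sentence A tokens, 1 for sentence B tokens
```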

Figure 16: BERT NSP

To predict whether the second sentence is indeed connected to the first, the following steps are performed:

• The entire input sequence goes through the Transformer model.

• The output of the [CLS] token is transformed into a 2×1 shaped vector using a simple classification layer (learned matrices of weights and biases).

• The probability of IsNextSequence is calculated with softmax.

When the researchers trained BERT, they trained MLM and NSP together, with the aim of minimizing the combined loss function of both strategies.

In my research, I have been using BERT for training on two tasks, Sentiment Analysis and Question Answering. The first challenge is a simple classification task (similar to masked language modelling, where the goal is to predict the value of the [CLS] token, with 1 meaning positive and 0 negative sentiment for a given sentence), and the second is to predict the token where the answer starts and the length of the answer given the context.


7.2 XLNet

XLNet is an auto-regressive language model which outputs the joint probability of a sequence of words, based on the Transformer encoder block's architecture with recurrence. The model's training objective calculates the probability of a token conditioned on all permutations of tokens in a sentence, as opposed to just those to the left or just those to the right of the target token.

Transformers have a significant disadvantage: they operate only on sequences of fixed length. What if we know that the token [New] should occur in the sentence [York, is, a, city], but the model also needs to know something about the Empire State Building that was mentioned in the previous sentence? XLNet handles this drawback by allowing the current sequence to get information from previous sequences.

Transformers like BERT achieved state-of-the-art results by incorporating both left and right contexts into predictions. XLNet has gone further: the model's contribution is to predict each word in a sequence using any combination of other words in that sequence.

The researchers of XLNet trained the model on over 130 gigabytes of textual data using 512 TPU chips running for more than two days, which is a much larger effort than training BERT. XLNet's main invention is not its architecture, but a modified training objective which learns conditional distributions for all permutations of tokens in a sequence.

In practice, XLNet samples from all possible permutations, so it does not get to see every single relation. It also does not use very small contexts, as they were found to hinder training. After applying these practical heuristics, it bears more of a resemblance to BERT.

7.2.1 Permutation

Given a sequence x, an auto-regressive model is one which calculates the probability Pr(x_i) = Pr(x_i | x_1, x_2, ..., x_{i-1}). In language modelling, this is the probability of a token x_i in the sentence, conditioned on the tokens x_1, x_2, ..., x_{i-1} preceding it. These conditioning words are referred to as the context. Such a model is asymmetric and cannot technically learn from all token relations in the corpus. AR models like ELMo also allow learning from relations between a given token and those following it; the auto-regressive model's goal in this case is Pr(x_i) = Pr(x_i | x_{i+1}, ..., x_n), i.e. it is auto-regressive in the reversed sequence. But XLNet's researchers did not stop there: they thought that there could be interesting relations to learn beyond those captured by just these two directions. XLNet proposes to use an objective which is an expectation over all permutations:

$$\Theta = \arg\max_{\theta}\; \mathbb{E}_{z \sim Z}\left[\sum_{t=1}^{T} \log \Pr\left(x_{z[t]} \mid x_{z[<t]}\right)\right]$$

This criterion finds model parameters Θ that maximize the probability of the token x_{z[t]} in a sequence of length T given the preceding elements x_{z[<t]}, where z[t] is the t-th element of the permutation z of the token indices and z[<t] are the previous tokens in the permutation. The sum of log-probabilities means that for any one permutation the model is properly auto-regressive, as it is the product of the probabilities of each element in the sequence. The expectation over all permutations in Z shows that the model is trained to be equally capable of computing probabilities for any token given any context.

7.2.2 Attention mask

But how does the model know about the order of words? For example, let's look at the sentence "This is a research". XLNet can compute both Pr([This] | [is]) and Pr([This] | [a]).


But it needs to have some information about the relative positions of the tokens [This], [is] and [a]. Otherwise it would just think all words in the sentence are equally likely to be next to one another. The researchers solved this problem by adding positional values, with which the model predicts Pr(This | is, 2) and Pr(This | a, 3). In other words, it should know the indices of the context tokens.

XLNet's architecture adds positional information to token embeddings. The training objective terms can be thought of as a sum, Pr(This | is + 2). But if the sentence tokens were really shuffled, this mechanism would break. This problem is handled by using an attention mask. When the model computes the context which is the input to the probability calculation, it always does so using the same token order, and simply masks those tokens not in the context under consideration (i.e. those that come later in the shuffled order).

7.2.3 Two-Stream Self-Attention

Still, there is one oversight to mention: we want the probability to be conditioned not only on the context token indices, but also on the index of the token whose probability is being calculated. In other words, we want Pr(This | 1, is + 2): the probability of [This] given that it is the 1st token and that [is] is the 2nd token. But the Transformer architecture encodes the positional information 1 and 2 within the embeddings for [This] and [is], so this would look like Pr(This | This + 1, is + 2). Unfortunately, the model then trivially knows that [This] is part of the sentence and should be likely.

To solve this problem, a two-stream self-attention mechanism is used. Each token at position i has two associated vectors at each self-attention layer m: h_i^m and g_i^m. The h vectors belong to the content stream, while the g vectors belong to the query stream. The content stream vectors are initialized with token embeddings summed with positional embeddings. The query stream vectors are initialized with a generic embedding vector w added to positional embeddings. Note that w is the same for every token and thus cannot be used to distinguish between tokens.

At each layer, each content vector h_i is updated using those h's that remain unmasked and itself (equivalent to unmasking the diagonal of the matrix shown in the previous section). Thus, h_3 is updated with the mask [0, 0, 1, 0], while h_2 is updated with the mask [0, 1, 1, 0]. The update uses the content vectors as the query, key and value.

By contrast, at each layer each query vector g_i is updated using the unmasked content vectors and itself. The update uses g_i as the query and the h_j's as the keys and values, where j is the index of an unmasked token in the context of i.

7.3 T5

The researchers of the article "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" presented a large-scale empirical study, in which they tried to identify the best transfer learning methods and applied the gained insights in order to create a new model, T5 (Text-To-Text Transfer Transformer). A new open-sourced huge training dataset, the Colossal Clean Crawled Corpus (C4), was also proposed by the team. T5 was pre-trained on the C4 corpus and became the best on various NLP benchmarks, while retaining the flexibility to be fine-tuned for specific NLP tasks.

Along with T5, the authors rethought all NLP challenges and represented them in a text-to-text format, where both the input and the output of the model are text sequences, in contrast to autoencoding models like BERT, which output either a target label or a part of the input sequence. The proposed framework allows using the same model, loss function and hyperparameters for any NLP task, such as machine translation (in green), document summarization (in blue), linguistic acceptability (in red), semantic similarity of sentences (in yellow) and so on. T5 can even be used for a regression task by predicting the string representation of the target number rather than the number itself.

Figure 17: text-to-text task represented by T5

7.3.1 A Huge Dataset (C4)

A crucial part of transfer learning is the need for an unlabeled dataset for pre-training. Accurately measuring the scaling benefits of pre-training requires high-quality, varied data, and large amounts of it. Previously existing datasets did not meet all three criteria at once: for example, Wikipedia texts are usually of high quality, but stylistically rather monotonous and of rather modest total volume. At the same time, Common Crawl texts are simply huge in size and very diverse in composition, but their quality leaves much to be desired.

To meet the requirements described above, the team developed the Colossal Clean Crawled Corpus (C4), a cleaned-up version of Common Crawl that is two orders of magnitude larger than Wikipedia. The clean-up process included removing duplicates, incomplete sentences, and inappropriate or junk texts. Such filtering contributed to better results on downstream problems, while the large volume allowed increasing the size of the model without the risk of overfitting.

7.3.2 A Systematic Study Of Transfer Learning Methodology

Using the proposed T5 framework and the C4 dataset, the authors were able to explore a fairly large number of ideas and methods proposed in the field of transfer learning for NLP over the past few years. All research details can be found in the related article, including experiments with:

• Architecture, in the course of which it was found that models with both an encoder and a decoder are usually better than language models with only a decoder or only an encoder block;

• Pre-training objectives, which confirmed that fill-in-the-gap problems (where the model is trained to recover missing words in the input text) are the best solution and that the most important factor is computational cost;

• Unlabeled datasets, which showed that training on data from a specific subject area can be beneficial, but that pre-training on small datasets can lead to overfitting;

• Learning strategies, during which it turned out that multitask training can be comparable to pre-training followed by fine-tuning, but the former requires careful selection of how frequently the model is trained on each specific task;


• Scale, which compared increases in model size, training time, and the number of models in an ensemble to determine the most efficient use of available computing power.

These are a few tasks which were rethought by the authors of T5 in order to convert them to the text-to-text modelling methodology:

• Question Answering. One of the options for using the text-to-text framework is reading comprehension: the model is given some context and a question, the answer to which it needs to learn to find in the given context. For example, you might submit a Wikipedia article about Hurricane Connie as input to the model, along with the question "When did Hurricane Connie happen?". The model then learns to find the required date in the article: "August 3, 1955". This approach helped to get the best results on the SQuAD dataset.

The authors of T5 also train the model to answer trivia questions in a more complex setting where it has no access to external knowledge. In other words, to answer the question, T5 can only use the knowledge stored in the parameters acquired during unsupervised pre-training. This can be seen as a restricted version of a general question answering system.

Figure 18: T5 pre-training and fine-tuning

• Text Generation. Large auto-regressive language models like GPT-2 do an excellent job of generating fairly realistic-looking texts, as they are trained to predict the next word of the input sequence, which has given rise to numerous unusual applications. The pre-training task of T5 is instead reduced to the task of generating text to fill in gaps, where the model must predict missing words in a specially "corrupted" piece of text. This, in turn, is a generalization of the sequence continuation problem, since gaps may also appear at the end of a sentence.

To take advantage of this pre-training, a "sized fill-in-the-blank" task was developed in which the model must fill in a specified number of words instead of a gap. For example, if we pass the sentence "I like to eat peanut butter and 4 sandwiches" to the model, we train it to insert about 4 words into the gap.

After fine-tuning T5 for this task using the C4 dataset, the final output became quite realistic. It is especially interesting to see how the model fits its predictions to a given number of missing words, for example when submitting an input sentence such as "I love peanut butter and N sandwiches", where N is a variable parameter.

7.4 GPT-2

OpenAI's GPT model was proposed by Alec Radford et al. in the paper "Improving Language Understanding by Generative Pre-Training" [12]; it is a unidirectional transformer pre-trained on a large text corpus (the Toronto Book Corpus) with long-range dependencies.


Figure 19: T5 on text generation task

GPT models belong to the auto-regressive family, i.e. they generate one token at a time. The model's architecture consists only of decoder blocks of the original Transformer architecture. Figure 20 shows the architecture (left) and the training objectives used in the work (right).

Figure 20: GPT-2 architecture

After a while, the OpenAI team introduced a new version of this model, GPT-2, which was trained on a bigger dataset (more than 40 GB of Internet text). They actually released a family of models which differ only in network size, i.e. the number of decoder stacks and the dimensionality of the input vectors. They are small, medium, large and extra large, with 117M, 345M, 762M and 1.5B parameters respectively (see Figure 21 for comparison). The enlargement of datasets and network sizes played its role: the model's scores on various tasks (Winograd Schema, LAMBADA and others) broke records, making it the state-of-the-art model at the time.

Figure 21: GPT-2 Versions

Fortunately, the artificial intelligence research team from San Francisco did not stop at enlarging the model. In May 2020, OpenAI introduced the 3rd generation of their autoregressive language model, GPT-3, with an outstanding capacity of 175 billion machine learning parameters (see Figure 22 for a comparison with other models), 96 decoder layers and a 12288-dimensional token input vector space, making it the biggest pre-trained model in existence at the time. GPT-3 has been described as one of the most interesting and important AI systems ever produced. Also, text generated by GPT-3 is really hard to distinguish from human-written text, and the researchers and engineers at OpenAI were worried that their model is potentially dangerous and could be used for malicious text generation.

Figure 22: Number of parameters (in millions) of existing transformer models

However, for now the usage of GPT-3 is limited and it can be accessed only via its API. Moreover, running such a monster requires enormous computational capabilities. Therefore, I decided to work with the second generation of GPT, specifically the version with 117 million parameters, which fits all of my requirements for the research.

7.4.1 GPT-2 Architecture

In this research I have used the smallest version of GPT-2, though the other versions differ from the smallest one only in size (number of parameters).

GPT-2 is built with only the Transformer's decoder blocks because of its auto-regressive nature: each token is produced one at a time and added to the sequence of inputs for the production of the next tokens. Let's take the first law of robotics as an example for explaining this idea:

A robot may not injure a human being or, through inaction, allow a human being to come to harm.

As a first step, when the first part of the sentence is fed in, GPT-2 predicts the next token; in the second step, the predicted word is added to the input, and these steps are then repeated recursively:

Let's look deeper into GPT-2 (Figure 24). Each decoder block in the model consists of a Masked Self-Attention layer and a Feed-Forward Neural Network connected to it. All 12 blocks are stacked together, and the final decoder produces the prediction for the next word in the sequence.

Each decoder block consists of a Masked Self-Attention layer and a Feed-Forward NN. The attention mechanism is simply the following formula:

$$A(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where Q, K and V are the Query, Key and Value matrices respectively, and d_k is the dimension of the key vectors.
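
A minimal single-head sketch of this masked (causal) self-attention, with random inputs for illustration; the lower-triangular mask is what prevents a position from attending to future tokens.

```python
import torch
import torch.nn.functional as F

def masked_self_attention(q, k, v):
    d_k = k.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5          # (seq_len, seq_len)
    causal = torch.tril(torch.ones_like(scores)).bool()    # lower-triangular mask
    scores = scores.masked_fill(~causal, float("-inf"))    # hide future positions
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(5, 64)
print(masked_self_attention(q, k, v).shape)    # torch.Size([5, 64])
```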


Figure 23: GPT-2 Auto-regression mechanism

Figure 24: GPT-2 Architecture


7.5 Distilled versions of BERT and GPT-2

Usually based on the Transformer architecture of Vaswani et al., pre-trained language models keep getting larger and larger and are trained on ever bigger datasets. However, as these models reached a wider NLP community, important and challenging questions started to emerge. How should we put these monsters into production? How can we use such large models under low latency constraints? Do we need (costly) GPU servers to serve them at scale? To reduce the computational cost of BERT or similar Transformer-based models, a natural choice is to use a smaller network to approximate their performance. There are many techniques available to tackle these questions. The most common tools include quantization (approximating the weights of a network with smaller precision) and weight pruning (removing some connections in the network). However, both of these result in lower prediction metrics.

Figure 25: Comparison of Transformer sizes (in millions of parameters)

But there is also knowledge distillation (sometimes also referred to as teacher-student learning), a compression technique in which a small model is trained to reproduce the behaviour of a larger model (or an ensemble of models). It was introduced by Bucila et al. and generalized by Hinton et al. In teacher-student training, we train a student network to mimic the full output distribution of the teacher network (its knowledge).

The key idea of distillation is the loss function of the student model. Rather than training with a cross-entropy over the hard targets (one-hot encoding of the gold class), in distillation the knowledge is transferred from the teacher to the student with a cross-entropy over the soft targets (the probabilities of the teacher). The training loss thus looks like this:

$$L = -\sum_i t_i \cdot \log(s_i)$$

where $t_i$ is the probability estimated by the teacher and $s_i$ is the corresponding probability estimated by the student.

To further expose the mass of the distribution over the classes, Hinton et al. introduce a softmax temperature:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$


with temperature parameter T .

When T → 0, the distribution becomes a Kronecker delta (equivalent to the one-hot target vector); when T → +∞, it becomes a uniform distribution. The same temperature parameter is applied to both the student and the teacher at training time, further revealing more signal for each training example. At inference, T is set to 1, recovering the standard softmax. This setup is used to compress a large language model via distillation. For distilling, the Kullback-Leibler loss was used, since the optimizations are equivalent:

$$KL(p\,\|\,q) = \mathbb{E}_p\left[\log\frac{p}{q}\right] = \sum_i p_i \log(p_i) - \sum_i p_i \log(q_i)$$

Using the teacher signal, we are able to train smaller language models, called DistilBERT and DistilGPT-2, under the supervision of BERT and GPT-2 respectively.

In the case of DistilBERT, the authors describe the training loss as a linear combination of the distillation loss and the masked language modelling loss. The student model is a small version of BERT in which the token-type embeddings and the pooler (used for the next sentence classification task) are removed, and the rest of the architecture is kept identical while the number of layers is reduced by a factor of two.


8 Evaluation Metrics

In any machine learning task, metrics are used to assess the quality of models and compare various algorithms. Their selection and analysis is an indispensable part of any data scientist's job. Here, I would like to describe some quality criteria I used in NLP problems and investigate what is important in choosing a metric and what can go wrong.

Since NLP tasks differ first of all in their outputs, it makes sense to evaluate these outputs in different ways. Sequence classification seems the easiest to evaluate from this point of view, because it is exactly the same as for any classification model in other machine learning problems. On the other hand, evaluating models which deal with text data is not an easy task; therefore there are metrics developed especially for specific NLP tasks, such as the BLEU score for translation or perplexity for language modelling. The metrics for each of the tasks investigated in this research are explained below.

8.1 Classification Metrics

Before moving on to the metrics themselves, it is necessary to introduce an important concept for describing these metrics in terms of classification errors: the confusion matrix. Suppose we have two target classes and an algorithm that predicts the class of each object; then the classification error matrix will look like this:

Figure 26: Confusion matrix

Here, classification errors are of two types: False Negative (FN) and False Positive (FP).

8.1.1 Accuracy

An intuitive, obvious and most popular classification metric is accuracy, the percentage of correct answers of the algorithm:

$$accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

This metric is not informative on datasets with unbalanced targets. To overcome this, we move from a single metric common to all classes to separate per-class quality indicators.


8.1.2 Precision, Recall, F-measure

To assess the performance of the algorithm on each of the classes separately, we introduce the precision and recall metrics.

$$precision = \frac{TP}{TP + FP}$$

$$recall = \frac{TP}{TP + FN}$$

Precision can be interpreted as the proportion of objects labelled positive by the classifier that are actually positive, while recall shows what proportion of all truly positive objects the algorithm found.

Precision and recall, in contrast to accuracy, do not depend on the class ratio and are therefore applicable to unbalanced samples. But these are two metrics; what if we need something that measures precision and recall together?

There are several ways to combine precision and recall into an aggregated measure of quality. The F-measure (F_β in general) is the harmonic mean of precision and recall:

$$F_\beta = (1 + \beta^2) \cdot \frac{precision \cdot recall}{\beta^2 \cdot precision + recall}$$

The value of β determines the weight of precision in the metric; when β = 1 it is the harmonic mean (with a factor of 2 so that F_1 = 1 when precision = 1 and recall = 1). The F-measure reaches its maximum when recall and precision are both equal to one, and is close to zero if one of the arguments is close to zero.
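
In practice these metrics do not need to be computed by hand; a small sketch with scikit-learn and toy labels (an assumption, not data from my experiments) looks like this:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # toy gold labels: 1 = positive, 0 = negative
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # toy predictions

print(confusion_matrix(y_true, y_pred))          # [[TN, FP], [FN, TP]]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
```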

8.2 BLEU score

BLEU (BiLingual Evaluation Understudy) is a metric for evaluating machine-translated text. The BLEU score is a number between zero and one that measures the similarity of the machine-translated text to a set of high-quality reference translations. A value of 0 means that the machine-translated output has no overlap with the reference translations (low quality), while a value of 1 means there is perfect overlap with the reference translations (high quality).

It has been shown that BLEU scores correlate well with human judgment of translation quality. Note that even human translators do not achieve a perfect score of one. Comparing BLEU scores across different corpora and languages is strongly discouraged; even BLEU scores for the same corpus but with different numbers of reference translations are not directly comparable.

However, as a rough guideline, the following interpretation of BLEU scores (expressed as percentages rather than decimals) might be helpful.

Mathematically, the BLEU score is defined as:

$$BLEU = \min\left(1,\; \exp\left(1 - \frac{\text{reference-length}}{\text{output-length}}\right)\right)\left(\prod_{i=1}^{4} precision_i\right)^{1/4}$$

with

$$precision_i = \frac{\sum_{snt \in \text{Cand-Corpus}} \sum_{i \in snt} \min\left(m^i_{cand},\, m^i_{ref}\right)}{w^i_t = \sum_{snt' \in \text{Cand-Corpus}} \sum_{i' \in snt'} m^{i'}_{cand}}$$

where


BLEU score      Interpretation
less than 10    Almost useless
10 - 19         Hard to get the gist
20 - 29         The gist is clear, but has significant grammatical errors
30 - 40         Understandable to good translations
40 - 50         High quality translations
50 - 60         Very high quality, adequate, and fluent translations
more than 60    Quality often better than human

Table 1: BLEU score interpretation

• $m^i_{cand}$ is the count of i-grams in the candidate translation matching the reference translation;

• $m^i_{ref}$ is the count of i-grams in the reference translation;

• $w^i_t$ is the total number of i-grams in the candidate translation.

The formula consists of two parts: the brevity penalty and the n-gram overlap.

• Brevity Penalty. The brevity penalty penalizes generated translations that are too short compared to the closest reference length, with an exponential decay. The brevity penalty compensates for the fact that the BLEU score has no recall term.

• N-Gram Overlap. The n-gram overlap counts how many unigrams, bigrams, trigrams, and four-grams (i = 1, ..., 4) match their n-gram counterparts in the reference translations. This term acts as a precision metric. Unigrams account for adequacy, while longer n-grams account for fluency of the translation. To avoid overcounting, the n-gram counts are clipped to the maximal n-gram count occurring in the reference (a small computation sketch follows this list).
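
As a small computation sketch, a corpus-level BLEU score can be obtained with the sacrebleu package (one of several possible tools; the toy hypothesis and reference are assumptions):

```python
import sacrebleu

hypotheses = ["the cat is on the mat"]        # system outputs, one string per sentence
references = [["there is a cat on the mat"]]  # one reference stream, aligned with the hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)   # corpus-level BLEU, already expressed as a percentage
```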

8.3 Perplexity

Perplexity is a common metric to use when evaluating language models.

$$PP(W) = \sqrt[N]{\frac{1}{P(w_1, w_2, \ldots, w_N)}}$$

where $w_i$ is the i-th token of the input sequence and N is the sequence length.

First of all, what makes a good language model? The aim of the model is to assign high probabilities to sentences that are syntactically correct, and low probabilities to incorrect or infrequent sentences. Assuming that the dataset is made of sentences that are in fact real and correct, this means that the best model will be the one that assigns the highest probability to the test set. Intuitively, if a model assigns a high probability to the test set, it means that it is not surprised to see it (it is not perplexed by it), which means that it has a good understanding of how the language works.

We can interpret perplexity as the inverse probability of the test set, normalised by the number of words in the test set:

$$PP(W) = \frac{1}{P(w_1, w_2, \ldots, w_N)^{\frac{1}{N}}} = \sqrt[N]{\frac{1}{P(w_1, w_2, \ldots, w_N)}}$$

Since the metric is the inverse probability, a lower perplexity means a better model.


Perplexity can also be defined as the exponential of the cross-entropy H(W), which I have been using in my experiments with language modelling:

$$PP(W) = 2^{H(W)} = 2^{-\frac{1}{N}\log_2 P(w_1, w_2, \ldots, w_N)}$$

We can say that the cross-entropy H(W) is the average number of bits needed to encode each word. The perplexity $2^{H(W)}$ is then the average number of words that can be encoded with H(W) bits.
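
A minimal sketch of how perplexity can be computed from a causal language model's cross-entropy with the Hugging Face library; the "gpt2" checkpoint and the sentence are illustrative. The exponential is used because the returned loss is a natural-log cross-entropy, and the definition is base-independent as long as the same base is used for both the logarithm and the exponentiation.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "Transformers are a family of neural network architectures."
ids = tokenizer(text, return_tensors="pt").input_ids
with torch.no_grad():
    loss = model(ids, labels=ids).loss       # average cross-entropy per token (natural log)

print("perplexity:", torch.exp(loss).item())  # PP(W) = exp(H(W))
```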

9 Experiments

This part is about practice: I have trained models for the various tasks listed previously and run experiments on them.

To train the models, I used the following tools:

• Python [21]. Nowadays, this programming language is essential for data science and Deep Learning; practically every modern neural network architecture is implemented in Python, so it was the obvious choice for me. There are two options for using Python in a machine learning pipeline: Jupyter notebooks and plain scripts. I have been using both methods, but for different parts of the research. Jupyter notebooks are very convenient for making simple experiments on a model, learning how algorithms behave and preparing the main scripts, thanks to their interactive mode and the lack of need to rerun the whole code every time only part of it needs to be executed. Additionally, I used Jupyter notebooks for inference with transformers. On the other hand, once the code is almost ready, it is easier to start training from Python scripts, because these scripts are quite similar to each other and the training process lasts for hours of computing, so it is better to execute it in the background.

• PyTorch [11]. Since transformers belong to Deep Learning, it is much more convenient to use a framework which handles the essential steps automatically (gradient computation, data parallelization, execution of hidden layers, optimizers, loss functions and so on). Although there are currently two main Deep Learning frameworks, Tensorflow (developed by Google) and PyTorch (by Facebook), it was convenient for me to use PyTorch because of its Pythonic structure, low entry barrier, and ease of debugging and building custom network architectures.

• Huggingface's transformers library [4]. Hugging Face is a team which has developed outstanding Deep Learning frameworks (Transformers and Datasets) for convenient interaction with NLP transformers. They provide general-purpose architectures such as BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, etc., with over 32 pre-trained models in more than 100 languages. This tool was actually the most important one for me, because I have been using its Transformers library throughout the research. Their pre-trained models for the chosen NLP tasks were an amazing baseline for my transfer learning experiments. Additionally, Hugging Face's wrapper called Trainer freed me from the need to write a training script for each transformer individually, allowing all models to be trained in a similar way with a convenient way to select training arguments like hyperparameters, loss functions, optimizers and callbacks (a minimal Trainer sketch follows this list).

• Weights and Biases [23]. This is a tool for visualizing machine learning experiments. It allowed me to track all my experiments, logs, metrics and artifacts. It sometimes happens that when you trained a model a few days or even hours ago, it is hard to remember what arguments were selected for the run, while storing them in the scripts is an overwhelming task. Fortunately, WandB solves this issue and provides an amazing interactive tool not only to save the history of runs, but also to compare them and even to make reports in various formats (HTML, PDF, LaTeX). All the graphs presented below were produced with Weights and Biases.
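
As an illustration of how the Trainer wrapper is used, here is a minimal sequence-classification sketch on the IMDB dataset; the checkpoint name, hyperparameters and dataset subset sizes are assumptions chosen for brevity and do not reproduce the exact configuration of my experiments.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

imdb = load_dataset("imdb")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)
imdb = imdb.map(tokenize, batched=True)

args = TrainingArguments(output_dir="out", num_train_epochs=1,
                         per_device_train_batch_size=8, evaluation_strategy="epoch")
trainer = Trainer(model=model, args=args,
                  train_dataset=imdb["train"].shuffle(seed=42).select(range(2000)),
                  eval_dataset=imdb["test"].select(range(500)))
trainer.train()
```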

Transformers are very complex Deep Learning models and require a lot of computational power and time for tuning. Using a local machine for this task is not convenient for a few reasons: the training process has to run for long periods of time, and any problem on the local machine can cause an error and make training fail. Also, a local machine is usually not very powerful, providing at most one GPU.

Therefore, all the work was done on a cluster provisioned by the university. This cluster has 6 GPUs (GeForce RTX 2080 Ti with 11.6 gigabytes each). The OS running on the cluster is Ubuntu, which suits machine learning pipelines well (being, in contrast to Windows, more robust to bugs in the code).

On the cluster there is a directory called transformers, where I have stored all the scripts used for running experiments and inference with transformers. The main script is trainer.py, which is in charge of training and evaluating transformers on the given tasks (question answering, sequence classification and translation respectively). Parts of the code that do not strictly refer to the training step are stored in the utils.py script. The scripts allow providing all the necessary parameters, such as the pre-trained model name, batch size, number of epochs, the option to freeze layers of the base model and so on. These parameters can be listed with the command python trainer.py --help

Summary of experiments:

• Total runs - 41;

• Total computing time - 8 days;

• Maximum training time of a single model - 54 hours.

The whole code can be found at the following link: github.com/aielte-research/transformers

The results (graphs, logs, reports) of the experiments can be found at the following link: wandb.ai/sarzhann/transformers

9.1 Sentiment Analysis (Sequence Classification)

I have put the most effort into the Sentiment Analysis task, because I trained three different architectures (BERT, DistilBERT and XLNet) with three training methods (training from scratch, fine-tuning and transfer learning). However, due to XLNet's big size, complex architecture and the shortage of computational resources (11 gigabytes on a single GPU), I was only able to fine-tune XLNet. As can be seen in the figures below, it took more than 17 hours to fine-tune this transformer for 5 epochs, while BERT and DistilBERT required only about one hour on average to train on the IMDB dataset for 10 epochs. Nonetheless, this cost in time was not worth it, because XLNet achieved only 0.891 accuracy and 0.8907 F1-score, while the highest score was achieved by BERT trained with the transfer learning technique (pre-trained, without frozen layers). Moreover, the second highest score (0.9266 accuracy) was achieved by DistilBERT trained with the same technique.


Figure 27: Sentiment analysis experiments results (on training set)

Thus, I have concluded that in my case BERT is still the preferable option for tuning on sequence classification: it achieves higher results with less effort, while XLNet requires more power and a smarter choice of hyperparameters. Additionally, if the highest accuracy is not required (the difference is only about 2% of accuracy), then DistilBERT is the right alternative to the heavy-weight BERT.

Since I trained the models on the IMDB dataset, which contains reviews of various movies, in inference mode I used some simple reviews as examples to see how each of the classifiers performed:

Movie reviews:
'I love this movie'
'I hate this show...'

BERT classes:
POSITIVE with Probability 0.9888222217559814
POSITIVE with Probability 0.9887667298316956

DistilBERT classes:
POSITIVE with Probability 0.9740594029426575
NEGATIVE with Probability 0.8891263008117676

XLNet classes:
POSITIVE with Probability 0.5262263417243958
NEGATIVE with Probability 0.9174837470054626

As you can see, BERT is the most confident about each review. DistilBERT also showed quite good results, while XLNet is the worst on the positive review (only 52% confidence).
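
Such inference can be run in a couple of lines with the Hugging Face pipeline API; the model path below is a placeholder for a locally fine-tuned checkpoint, and the printed structure is only indicative.

```python
from transformers import pipeline

# "path/to/finetuned-bert" is a placeholder for a fine-tuned checkpoint directory
classifier = pipeline("sentiment-analysis", model="path/to/finetuned-bert")

print(classifier(["I love this movie", "I hate this show..."]))
# e.g. [{'label': 'POSITIVE', 'score': ...}, {'label': 'NEGATIVE', 'score': ...}]
```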

All interactive results of sentiment analysis experiments can be viewed here:


Figure 28: Sentiment analysis experiments results (on validation set)

wandb.ai/sarzhann/transformers/reports/Sentiment-Analysis-Experiments–Vmlldzo3MzM2ODg

9.2 Translation

The translation task is quite unique because it requires a sequence-to-sequence structure like the original Transformer, therefore not every transformer can be applied to translation. Because only T5 has such a structure among the chosen algorithms, I decided to train this architecture in two versions:

• T5-base, which has 220M parameters with 12 layers, a 768 hidden state, a 3072 feed-forward hidden state and 12 heads, and

• T5-small, which has 60M parameters with 6 layers, a 512 hidden state, a 2048 feed-forward hidden state and 8 heads.

Unfortunately, T5 is a rather big model which could not fit into the GPUs provided by the university, therefore only fine-tuning was applied for the translation task. Nonetheless, to increase the number of experiments for translation, I trained it for two language pairs: English-German and English-Russian.

As expected, the base version of T5 performed much better than the small one for both languages, although it required almost twice as much computing time (17 hours vs 24). On the other hand, the BLEU score on the validation set looks strange. For the English-Russian pair, the BLEU score for both the small and base versions was close to zero, meaning that the model was not trained at all. The main reason is that T5 and its tokenizer were initially pre-trained on an English corpus of text, while the Russian language was not represented in C4 at all. The English-German language pair showed better results, with BLEU scores of 13 and 14 for the small and base versions of T5 respectively, but this is still quite a poor translation result. The reason for the small BLEU score is the same: the small amount of German data in the pre-training dataset C4.

Because of this performance, I trained an additional model based on MarianNMT [20], specialized exactly for the translation task: Helsinki-Opus-MT. I used two pre-trained models, Helsinki-NLP/opus-mt-en-de and Helsinki-NLP/opus-mt-en-ru, for the two language pairs.


Figure 29: Neural Translation experiments results (on training set)

As you can see on the chart, both models achieved good results on the evaluation set: 28 BLEU for Russian and 30 BLEU for German.

Figure 30: Neural Translation experiments results (on validation set)

Here you can see how each of the models performed on some English-German translation examples (a short inference sketch follows the examples):

Helsinki
Source (English): I am going to university
Target (German): Ich gehe zur Universitat.
Source (English): I love you so much
Target (German): Ich liebe dich so sehr
Source (English): Elon Musk is a bilionare
Target (German): Elon Musk ist eine Bilionare

T5-base
Source (English): I am going to university
Target (German): Ich werde an die Universitat gehen
Source (English): I love you so much
Target (German): Ich liebe dich so sehr
Source (English): Elon Musk is a bilionare
Target (German): Elon Musk ist ein bilionare

T5-small
Source (English): I am going to university
Target (German): Ich werde an die Universitat gehen
Source (English): I love you so much
Target (German): Ich liebe Sie so sehr
Source (English): Elon Musk is a bilionare
Target (German): Elon Musk ist ein Bilionare
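
Translations like the ones above can be produced with the pre-trained MarianMT checkpoints mentioned earlier; the sketch below is a minimal inference example, not the fine-tuning setup used in the experiments.

```python
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

batch = tokenizer(["I am going to university"], return_tensors="pt", padding=True)
generated = model.generate(**batch)          # encoder-decoder generation
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```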

All results of neural machine translation are published here: wandb.ai/sarzhann/transformers/reports/Neural-Translation-Experiments–Vmlldzo3MzM3MDY

9.3 Question Answering

As mentioned previously, DistilBERT and BERT were used for the span-based Question Answering task. Fine-tuning and transfer learning were used for this task.

The most interesting finding on question answering is that freezing the layers of the base model resulted in almost the same result for both architectures, around 50 F1-score, and only a slight difference when training the whole pre-trained model: 70 F1 for BERT against 65 F1 for DistilBERT. Even the graphs look quite similar for these architectures.

Figure 31: Question answering experiments results (on training set)


Figure 32: Question answering experiments results (on validation set)

The most interesting part of the experiment, of course, is giving the model some data (context and question) and looking at how it answers the question given the context. I took the first paragraphs from the Wikipedia pages about Vladimir Putin and bitcoin, and this is how BERT and DistilBERT answered the questions (an inference sketch follows the examples):

Context: Bitcoin is a decentralized digital currency, without a central bank or single administrator, that can be sent from user to user on the peer-to-peer bitcoin network without the need for intermediaries. Transactions are verified by network nodes through cryptography and recorded in a public distributed ledger called a blockchain. The cryptocurrency was invented in 2008 by an unknown person or group of people using the name Satoshi Nakamoto. The currency began use in 2009 when its implementation was released as open-source software.
Question: What is bitcoin?
BERT response: a decentralized digital currency
DistilBERT response: a decentralized digital currency

Context: Vladimir Vladimirovich Putin (born 7 October 1952) is a Russian politician and former intelligence officer who is serving as the current president of Russia since 2012, previously being in the office from 1999 until 2008.[7][d] He was also prime minister from 1999 to 2000 and again from 2008 to 2012. As of 2021, Putin is the second-longest serving European president, after Alexander Lukashenko of Belarus.
Question: Who is Vladimir Putin?
BERT response: a Russian politician and former intelligence officer
DistilBERT response: politician and former intelligence officer
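
A minimal sketch of span-based question answering inference with the pipeline API; the public SQuAD-distilled checkpoint below is only a stand-in for the models fine-tuned in this thesis, and the shortened context is an assumption.

```python
from transformers import pipeline

# public checkpoint used here only as a stand-in for the fine-tuned models
qa = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

context = ("Bitcoin is a decentralized digital currency, without a central bank "
           "or single administrator, that can be sent from user to user.")
print(qa(question="What is bitcoin?", context=context))
# e.g. {'score': ..., 'start': ..., 'end': ..., 'answer': 'a decentralized digital currency'}
```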

All results of question answering are published here: wandb.ai/sarzhann/transformers/reports/Question-Answering-Experiments–Vmlldzo3MzM3MjQ

9.4 Language Modelling (Text Generation)

GPT-2 and DistilGPT-2 were trained for this task, and as expected GPT-2 showed better quality, with a perplexity of 29 against 37 for DistilGPT-2, although the latter converged to its minimum a bit faster. As you can see on the graph, after training for 6 and 8 epochs respectively, both models (GPT-2 and DistilGPT-2) stopped early because the loss on the evaluation set was no longer falling. Still, this is enough to generate human-understandable texts in inference mode. For example, I provided two prompts to each model: "Elon Musk lives in" and "When I was young". Below you can see the text generated by the models (a generation sketch follows the examples):

• GPT-2:
Elon Musk lives in a little house in the northern part of the town . It used to be a hotbed of miners , with men of all faiths from North Queensland and all backgrounds , including Muslim. The most prominent member of the community ;

When I was young, I couldn 't speak very well , and then we all went to Paris and did a few short interviews because I knew I had to grow up and show the world one of my great talents at the age of 12. But.

• DistilGPT-2:
Elon Musk lives in the community of Elgin Valley with his wife and two children , Aline ( 21 ) and John ( 11 ) . The family moved to the U. S. state of Minnesota to move to the country with Al 'O ;

When I was young my mum would tell her story of a night out with two men for my birthday and the next day when she and a neighbour moved to the house they called out about our house and asked about one of the other men.
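
Text like the samples above can be generated with a few lines of the Hugging Face API; the sketch uses the public "gpt2" checkpoint as a stand-in for the fine-tuned models, and the sampling parameters are illustrative assumptions.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")   # stand-in for the fine-tuned checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer("When I was young", return_tensors="pt").input_ids
out = model.generate(ids, max_length=50, do_sample=True, top_k=50,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```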

Figure 33: Language modelling experiments results (on training set)

All results of language modelling are published here: wandb.ai/sarzhann/transformers/reports/Language-Modelling-Experiments–Vmlldzo3MzM3MTI

10 Conclusion

Overall, transformers are nowadays one of the most influential techniques used in NLP. They outperform all previously known algorithms, achieving state-of-the-art results on various benchmarks like GLUE and SQuAD, and are currently used in almost all modern NLP-based systems (chatbots, artificial assistants, etc.).


Figure 34: Language modelling experiments results (on validation set)

Transformers can be applied to every NLP challenge (starting from simple detection of users' sentiment and ending with maintaining a dialogue with customers). Research on transformers is currently very popular due to their novelty and untapped potential. We already know their drawbacks, such as their colossal sizes, which make them hard to fit on weak devices like mobile phones, and the requirement of enormous computational power together with giant datasets for training, available only to big corporations with nearly unlimited resources (such as Google and Facebook). Nevertheless, we are on the way to overcoming these shortcomings by developing size-reducing methods such as distillation, which help to achieve top scores with less effort. Such techniques are already applied in DistilBERT and DistilGPT-2.

In this research, I have studied the transformer architectures most significant for NLP published during the last few years (BERT, GPT, XLNet, T5, DistilBERT and DistilGPT-2). I have investigated their features and the ideas behind them. After a deep study of these models, I trained and evaluated them on several NLP tasks (sentiment analysis, question answering, language modelling and neural machine translation) with various training techniques (training from scratch, fine-tuning and transfer learning) and with different initial hyperparameters such as learning rate, number of epochs, batch size and early stopping callbacks.

These are the main conclusions I made while experimenting with transformers:

• Distilled versions of models (DistilBERT and DistilGPT-2) take much less space, GPU memory and training time compared to their teacher models (BERT and GPT-2), although they suffer a small drop in final quality. So, if there is a requirement to run the model on a smartphone, or there are limitations in power or time, then it is better to use the distilled version of a popular model, while in other cases BERT and GPT-2 are still preferable.

• Training a model from scratch is the most complicated technique for fitting a transformer. It requires clever setting of hyperparameters and considerably longer training time, and it is still difficult to achieve state-of-the-art results on small datasets, with a high risk of overfitting. On the other hand, fine-tuning a pre-trained model requires much less computational resources and reaches good results dramatically faster, but the highest accuracy is still hard to achieve in this scenario. Finally, the best of these techniques is transfer learning: with this method, all my models achieved their highest scores, and even at the end of training they still had room to increase accuracy (reduce the loss value). Although it is a little harder to train a network with simple transfer learning than with fine-tuning, this method is still a very efficient option in contrast to training from scratch.

• Different NLP tasks require completely different approaches to training a transformer, and not every transformer can be fitted to every NLP task. For example, to fit a model for neural machine translation, a full encoder-decoder structure is necessary, and thus only T5 (out of the chosen ones) can be tuned for it, while the others consist only of encoder or decoder blocks. On the other hand, the training structure for the same task can be completely different for different models. For example, T5 is built on the idea that every NLP task can be converted into a text-to-text problem, and thus in the case of sequence classification T5 predicts the string representation of the target number, while autoencoders (BERT-like models) output a real-valued vector (probabilities that the sequence belongs to each of the classes).

• Each NLP task is evaluated with different metrics. Sentiment analysis can be evaluated with simple accuracy and F-measure, while translation needs the more complicated BLEU metric that takes into account the peculiarities of translation. For question answering, a more complex version of the F-measure is used, while for language modelling we can use perplexity. These differences sometimes make it hard to interpret how well a model is actually trained.

• Lastly, T5 and XLNet are really heavy-weight models and it takes more than 10 hours to train each of them (a maximum of 54 hours of continuous computing for fitting T5-base on English-to-German translation on the WMT16 dataset), while BERT, DistilBERT, GPT-2 and DistilGPT-2 required about 5 hours of training on average. Moreover, these two giants are hard to tune on small datasets, and it is very important to have bigger GPUs and to set up hyperparameters carefully.

Finally, I have developed a Jupyter notebook to interact with the tuned models. You can provide your own text examples for each of the listed NLP tasks and see how the transformers process them. Most of the time the models give interesting, well-structured and coherent results (you can see the examples I used in the Experiments section), while sometimes their output is just a collection of random tokens without any syntactic meaning. For instance, I could not tune T5 for translation from English to Russian because of the pre-trained tokenizer, and thus it always responded with a list of strange tokens completely unrelated to the source text (for both the base and small versions of T5).

Overall, I would like to say that this research broadened my horizons in the fascinating fields of deep learning, NLP and transformers. Hopefully this is not the end of my work in this area: in the future I intend to continue investigating transformers, to try to improve their performance, and to apply them to tasks that are not related to NLP. There are already dozens of research papers studying transformers in other deep learning fields; in particular, I would like to look at how transformers can be applied to computer vision tasks such as object detection and tracking.

