Page 1

Adaptation of Deep Bidirectional Multilingual

Transformers for Russian Language

Kuratov Yuri, Arkhipov Mikhail
Neural Networks and Deep Learning Lab,

Moscow Institute of Physics and Technology

[email protected]

Page 2

Plan

● Transfer learning and pretraining in NLP
  ○ BERT, ELMo, GPT
● RuBERT - transfer from Multilingual BERT model
● Evaluation of RuBERT on:
  ○ Classification (Paraphrase Identification and Sentiment Analysis)
  ○ Question Answering on SDSJ Task B (SQuAD)

● Results and conclusions

Page 3

Transfer learning and pretraining in NLP

● Word Embeddings (w2v, GloVe)
  ○ word vectors are independent of context
● Language Model Pretraining
  ○ task is to predict the next word
  ○ P(W_i | W_1, …, W_{i-1})
  ○ ELMo, OpenAI GPT
● Masked Language Model Pretraining
  ○ P(W_i | W_1, …, W_{i-1}, W_{i+1}, …, W_N) (both objectives are written out below)
● Masked Language Model Pretraining and auxiliary tasks
  ○ Next sentence prediction (BERT)
● Combining Language Model, Masked Language Model and Seq2Seq
  ○ Unified Language Model Pre-training for Natural Language Understanding and Generation: https://arxiv.org/abs/1905.03197

BERT: https://arxiv.org/abs/1810.04805
ELMo: https://arxiv.org/abs/1802.05365
GPT: https://github.com/openai/gpt-2

Language Modeling
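The two pretraining objectives above differ only in what context the model is allowed to see. Written out in the slide's own P(W_i | …) notation (M is the set of masked positions):

L_{LM}  = - \sum_{i=1}^{N} \log P(W_i \mid W_1, \dots, W_{i-1})
L_{MLM} = - \sum_{i \in M} \log P(W_i \mid W_1, \dots, W_{i-1}, W_{i+1}, \dots, W_N)

The unidirectional objective factorises the sequence probability left to right (ELMo additionally trains a right-to-left model), while the masked objective lets every prediction condition on both sides of the masked position.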

Page 4

Transfer learning and pretraining in NLP

BERT paper: https://arxiv.org/abs/1810.04805
Illustrated BERT, ELMo, GPT: http://jalammar.github.io/illustrated-bert/

[Figure: Language Model Pretraining (Unidirectional) vs. Masked Language Model]

Page 5

BERT: Bidirectional Encoder Representations from Transformers

BERT paper: https://arxiv.org/abs/1810.04805

Page 6

BERT: Bidirectional Encoder Representations from Transformers

BERT paper: https://arxiv.org/abs/1810.04805

[Figure: BERT pretraining. Masked input tokens ([mask]) are predicted from the surrounding context, e.g. P(dog | W_0, …, W_10) and P(he | W_0, …, W_10), alongside the next sentence prediction output P(is_next_sentence).]
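The masking itself is simple; a minimal sketch of the standard BERT recipe (15% of tokens selected; of those, 80% become [MASK], 10% a random token, 10% are left unchanged), assuming a token list and a vocabulary list are already available:

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Return masked tokens and the prediction targets, BERT-style."""
    masked = list(tokens)
    targets = {}                       # position -> original token to predict
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]") or random.random() > mask_prob:
            continue
        targets[i] = token
        r = random.random()
        if r < 0.8:
            masked[i] = "[MASK]"                 # 80%: replace with the mask token
        elif r < 0.9:
            masked[i] = random.choice(vocab)     # 10%: replace with a random token
        # remaining 10%: keep the original token, but still predict it
    return masked, targets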

Page 7

Multilingual BERT: Why do we need BERT for Russian?

Three key motivations:

● BERT-based models show state-of-the-art performance on a wide range of NLP tasks
● The Multilingual BERT model was trained on Wikipedia for 104 languages
  ○ vocabulary size: ~120k subtokens
  ○ only ~25k subtokens (~20%) of the vocabulary are related to the Russian language
  ○ the model has 180M parameters, and half of them are used by subtoken embeddings
  ○ 50% + 50% · 20% = 60% of the total model parameters could be used for Russian texts (see the sketch below)
● Single-language BERT models outperform the Multilingual BERT model:
  ○ this was shown for English and Chinese BERT models by Google Research

Multilingual BERT: https://github.com/google-research/bert/blob/master/multilingual.md
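A back-of-the-envelope check of the arithmetic above (the ~180M, ~120k and ~25k figures come from the slide; the exact split is approximate):

# Rough parameter budget for Multilingual BERT (numbers from the slide, so approximate).
total_params = 180e6                  # ~180M parameters in Multilingual BERT
embedding_share = 0.5                 # about half of the parameters sit in subtoken embeddings
russian_vocab_share = 25e3 / 120e3    # ~25k of ~120k subtokens relate to Russian (~20%)

# The Transformer body is language-agnostic, but only the Russian rows of the
# embedding matrix are useful for Russian text.
usable_share = (1 - embedding_share) + embedding_share * russian_vocab_share
print(f"usable for Russian: {usable_share:.0%} of {total_params / 1e6:.0f}M parameters")
# -> roughly 60% of 180M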

Page 8

RuBERT: Building Russian vocabulary

● We applied Subword NMT to Russian Wiki (80%) and news data (20%) (see the sketch below)
● Effect of the new Russian vocabulary:
  ○ 120k subtokens for the Russian language
  ○ ~1.5 times longer sequences fit into the model
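A minimal sketch of how such a vocabulary can be built with the subword-nmt package. The file names and the pre-mixed 80/20 corpus are assumptions; the authors' exact preprocessing and the conversion to a BERT-style WordPiece vocabulary are separate steps not shown on the slide:

from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Hypothetical corpus file: ~80% Russian Wikipedia + ~20% news, one sentence per line.
with open("ru_wiki_plus_news.txt", encoding="utf-8") as corpus, \
     open("ru_bpe.codes", "w", encoding="utf-8") as codes:
    # Learn ~120k merge operations, matching the vocabulary size on the slide.
    learn_bpe(corpus, codes, num_symbols=120_000)

# Apply the learned codes to segment new text into subtokens.
with open("ru_bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)
print(bpe.process_line("Петербургское метро будет работать круглосуточно"))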

Page 9

RuBERT: Transfer from Multilingual BERT model

http://docs.deeppavlov.ai/en/master/components/bert.html
http://files.deeppavlov.ai/deeppavlov_data/bert/rubert_cased_L-12_H-768_A-12_v1.tar.gz

How to initialize the RuBERT model?

● Random
● Initialize with the multilingual model, random init for new subtokens
● Can we do better?

Page 10

RuBERT: Transfer from Multilingual BERT model

http://docs.deeppavlov.ai/en/master/components/bert.html
http://files.deeppavlov.ai/deeppavlov_data/bert/rubert_cased_L-12_H-768_A-12_v1.tar.gz

Can we do better?

● Initialize with the multilingual model and assemble embeddings for new subtokens (sketch below):

bird = bi ##rd
Emb(bird) := Emb(bi) + Emb(##rd)

250k steps ≈ 2 days of computation on 8 × Tesla P100 16GB
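A minimal sketch of this initialization, assuming the multilingual embedding matrix, its subtoken-to-index mapping, and a multilingual subword tokenizer are already loaded (the names mbert_emb, mbert_vocab and tokenize_with_mbert are hypothetical). Entries of the new vocabulary that split into known multilingual pieces get the sum of those pieces; everything else keeps a random init:

import numpy as np

def assemble_embeddings(new_vocab, mbert_vocab, mbert_emb, tokenize_with_mbert):
    """Build an embedding matrix for the new Russian vocabulary.

    new_vocab: list of new subtokens (e.g. the 120k Russian vocabulary)
    mbert_vocab: dict subtoken -> row index in the multilingual embedding matrix
    mbert_emb: numpy array [mbert_vocab_size, hidden_size]
    tokenize_with_mbert: function splitting a string into multilingual subtokens
    """
    hidden = mbert_emb.shape[1]
    new_emb = np.random.normal(0.0, 0.02, size=(len(new_vocab), hidden))  # BERT-style init
    for i, token in enumerate(new_vocab):
        if token in mbert_vocab:                   # subtoken already known: copy it
            new_emb[i] = mbert_emb[mbert_vocab[token]]
            continue
        pieces = tokenize_with_mbert(token)        # e.g. "bird" -> ["bi", "##rd"]
        if pieces and all(p in mbert_vocab for p in pieces):
            # Emb(bird) := Emb(bi) + Emb(##rd), as on the slide
            new_emb[i] = sum(mbert_emb[mbert_vocab[p]] for p in pieces)
    return new_emb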

Page 11

RuBERT: Training details

http://docs.deeppavlov.ai/en/master/components/bert.html
http://files.deeppavlov.ai/deeppavlov_data/bert/rubert_cased_L-12_H-768_A-12_v1.tar.gz

The model was trained in two stages:
● train the full BERT on sequences of 128 subtokens
● train only positional embeddings on sequences of 512 subtokens

We used the following hyperparameters (collected into a config sketch below):
● batch size: 256
● learning rate: 2·10^-5
● optimizer: Adam
● L2 regularization: 10^-2

To support multi-GPU training, we made a fork of the original TensorFlow BERT repo:
https://github.com/deepmipt/bert/tree/feat/multi_gpu
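The settings above, collected into a plain config sketch. Only values stated on the slides are filled in; the step split between the 128- and 512-length stages is not given and is left unspecified:

PRETRAIN_CONFIG = {
    "init": "multilingual BERT + assembled embeddings for new subtokens",
    "stages": [
        {"max_seq_length": 128, "trainable": "full model"},
        {"max_seq_length": 512, "trainable": "positional embeddings only"},
    ],
    "batch_size": 256,
    "learning_rate": 2e-5,
    "optimizer": "Adam",
    "l2_regularization": 1e-2,
    "total_steps": 250_000,   # ~2 days on 8 x Tesla P100 16GB
}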

Page 12

RuBERT for Classification

Page 13

RuBERT for Classification: Paraphrase Identification

● ParaPhraser - dataset for Russian paraphrase detection (~7k training pairs)
  ○ "9 мая метрополитен Петербурга будет работать круглосуточно" ("On May 9, the St. Petersburg metro will run around the clock")
  ○ "Петербургское метро в ночь на 10 мая будет работать круглосуточно" ("The St. Petersburg metro will run around the clock on the night of May 10")
  ○ domain: news

We compare BERT-based models with other models in the non-standard run setting, where all external resources were allowed (a sketch of the BERT sentence-pair input follows the table).

http://paraphraser.ru/download/
[1] Pivovarova, L., et al. (2017). ParaPhraser: Russian paraphrase corpus and shared task.
[2] Kravchenko, D. (2017). Paraphrase detection using machine translation and textual similarity algorithms.

Model                                          | F-1          | Accuracy
Classifier + linguistic features [1]           | 81.10        | 77.39
Machine Translation + Semantic similarity [2]  | 78.51        | 81.41
BERT multilingual                              | 85.48 ± 0.19 | 81.66 ± 0.38
RuBERT                                         | 87.73 ± 0.26 | 84.99 ± 0.35
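For paraphrase identification, the two sentences are packed into a single BERT input and the label is predicted from the final [CLS] representation with one dense layer. A minimal sketch of that packing, assuming tokenizer is a BERT WordPiece tokenizer loaded elsewhere:

def make_pair_input(tokenizer, text_a, text_b, max_seq_length=128):
    """Pack a sentence pair into BERT's [CLS] A [SEP] B [SEP] format."""
    tokens_a = tokenizer.tokenize(text_a)
    tokens_b = tokenizer.tokenize(text_b)
    # Truncate so that the pair plus 3 special tokens fits the sequence length.
    while len(tokens_a) + len(tokens_b) > max_seq_length - 3:
        (tokens_a if len(tokens_a) > len(tokens_b) else tokens_b).pop()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    return input_ids, segment_ids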

Page 14

RuBERT for Classification: Sentiment Analysis

● RuSentiment - dataset for sentiment analysis of posts from VKontakte (classification-head sketch below)
● domain: social networks

http://text-machine.cs.uml.edu/projects/rusentiment/
http://docs.deeppavlov.ai/en/master/intro/features.html#classification-component
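The fine-tuning head for this task is again a single dense layer over the pooled [CLS] vector. A minimal numpy sketch, assuming a 768-dimensional [CLS] vector (BERT-base) and the five RuSentiment classes; the weights here are random placeholders, not trained values:

import numpy as np

CLASSES = ["positive", "negative", "neutral", "speech_act", "skip"]  # RuSentiment label set
HIDDEN = 768                                                          # BERT-base hidden size

# Placeholder weights of the classification head (learned during fine-tuning in practice).
W = np.random.randn(HIDDEN, len(CLASSES)) * 0.02
b = np.zeros(len(CLASSES))

def classify(cls_vector):
    """Map the pooled [CLS] representation to a sentiment label."""
    logits = cls_vector @ W + b
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return CLASSES[int(np.argmax(probs))], probs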

Page 15

Question Answering on SDSJ Task B (SQuAD)

● Context:

In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. The main forms of precipitation include drizzle, rain, sleet, snow, graupel and hail... Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. Short, intense periods of rain in scattered locations are called “showers”.

● Question:

Where do water droplets collide with ice crystals to form precipitation?

● datasets: Stanford Question Answering Dataset (SQuAD), Natural Questions, SDSJ 2017 Task B
● SDSJ Task B: ~50k context-question-answer triplets (see the record sketch below)

SQuAD: https://rajpurkar.github.io/SQuAD-explorer/
Natural Questions: https://ai.google.com/research/NaturalQuestions
SDSJ 2017: https://sdsj.sberbank.ai/2017/ru/contest.html, http://docs.deeppavlov.ai/en/master/components/squad.html#sdsj-task-b
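The task is extractive: the answer is a span of the context. A minimal sketch of one SQuAD-style record built from the example above (field names follow the SQuAD JSON format; the exact SDSJ Task B schema may differ):

# One SQuAD-style training example: the answer is given as text plus its character
# offset inside the context, from which token-level start/end positions are derived.
context = ("In meteorology, precipitation is any product of the condensation of "
           "atmospheric water vapor that falls under gravity. Precipitation forms as "
           "smaller droplets coalesce via collision with other rain drops or ice "
           "crystals within a cloud.")
question = "Where do water droplets collide with ice crystals to form precipitation?"
answer_text = "within a cloud"

example = {
    "context": context,
    "question": question,
    "answers": [{"text": answer_text,
                 "answer_start": context.find(answer_text)}],  # character offset into the context
}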

Page 16

BERT for Question Answering
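This slide shows the standard BERT question-answering head (the figure is not captured in this transcript): two vectors score each context token as a possible answer start or end, and the best valid span is returned. A minimal numpy sketch of that span selection, assuming per-token start and end logits are already computed:

import numpy as np

def best_span(start_logits, end_logits, max_answer_len=30):
    """Pick the (start, end) token span maximising start + end scores, with end >= start."""
    best = (0, 0)
    best_score = -np.inf
    for start in range(len(start_logits)):
        for end in range(start, min(start + max_answer_len, len(end_logits))):
            score = start_logits[start] + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best, best_score

# Toy usage: 6 context tokens, highest-scoring span is tokens 3..4.
start_logits = np.array([0.1, 0.2, 0.0, 2.5, 0.3, 0.1])
end_logits   = np.array([0.0, 0.1, 0.2, 0.4, 2.1, 0.3])
print(best_span(start_logits, end_logits))   # ((3, 4), 4.6)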

Page 17

RuBERT for Question Answering on SDSJ Task B

http://docs.deeppavlov.ai/en/master/components/squad.html#sdsj-task-b

Page 18

Results

● ParaPhraser and SDSJ Task B: +4-6 F-1
● RuSentiment: +1 F-1 improvement over the previous state of the art

● ParaPhraser and SDSJ Task B share the same domains as RuBERT's training data (wiki, news)

● RuSentiment is more challenging due to domain shift, but we could still show good results

Page 19

Beyond this work

We trained a single BERT model for Slavic languages (ru, bg, cs, pl) in the same manner as for the Russian language and evaluated it on the BSNLP 2019 Shared Task on Multilingual Named Entity Recognition:

These results were obtained on the validation set.

Results are from Arkhipov M., Trofimova M., Kuratov Y., Sorokin A., "Tuning Multilingual Transformers for Named Entity Recognition on Slavic Languages".

Page 20

Results and conclusions

● We trained the RuBERT model for the Russian language
● We achieved significant improvements on several Russian datasets with the RuBERT model
● RuBERT, SlavicBERT and all pre-trained models are open-sourced, e.g.:

http://docs.deeppavlov.ai/en/master/components/bert.html
http://files.deeppavlov.ai/deeppavlov_data/bert/rubert_cased_L-12_H-768_A-12_v1.tar.gz

python -m deeppavlov install squad_ru_rubert

python -m deeppavlov download squad_ru_rubert

python -m deeppavlov interact/riseapi squad_ru_rubert

from deeppavlov import build_model, configs
model = build_model(configs.squad.squad_ru_rubert, download=True)
model(['DeepPavlov это библиотека для NLP и диалоговых систем.'], ['Что такое DeepPavlov?'])
>> [['библиотека для NLP и диалоговых систем'], [15], [2758812.25]]

Page 21

github.com/deepmipt/DeepPavlov
docs.deeppavlov.ai