Page 1

Adaptation of Deep Bidirectional Multilingual

Transformers for Russian Language

Kuratov Yuri, Arkhipov Mikhail
Neural Networks and Deep Learning Lab,

Moscow Institute of Physics and Technology

[email protected]

Page 2

Plan

● Transfer learning and pretraining in NLP
  ○ BERT, ELMo, GPT
● RuBERT - transfer from Multilingual BERT model
● Evaluation of RuBERT on:
  ○ Classification (Paraphrase Identification and Sentiment Analysis)
  ○ Question Answering on SDSJ Task B (SQuAD)

● Results and conclusions

Page 3

Transfer learning and pretraining in NLP

● Word Embeddings (w2v, GloVe)
  ○ word vectors are independent of context
● Language Model Pretraining
  ○ task is to predict the next word
  ○ P(W_i | W_1, …, W_{i-1})
  ○ ELMo, OpenAI GPT
● Masked Language Model Pretraining
  ○ P(W_i | W_1, …, W_{i-1}, W_{i+1}, …, W_N) (both objectives are written out below)
● Masked Language Model Pretraining and auxiliary tasks
  ○ Next sentence prediction (BERT)
● Combining Language Model, Masked Language Model and Seq2Seq
  ○ Unified Language Model Pre-training for Natural Language Understanding and Generation: https://arxiv.org/abs/1905.03197

BERT: https://arxiv.org/abs/1810.04805
ELMo: https://arxiv.org/abs/1802.05365
GPT: https://github.com/openai/gpt-2

Language Modeling
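The two pretraining objectives above differ only in what context the model is allowed to see. Written out in the slide's own P(W_i | …) notation (M is the set of masked positions):

L_{LM}  = - \sum_{i=1}^{N} \log P(W_i \mid W_1, \dots, W_{i-1})
L_{MLM} = - \sum_{i \in M} \log P(W_i \mid W_1, \dots, W_{i-1}, W_{i+1}, \dots, W_N)

The unidirectional objective factorises the sequence probability left to right (ELMo additionally trains a right-to-left model), while the masked objective lets every prediction condition on both sides of the masked position.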

Page 4

Transfer learning and pretraining in NLP

BERT paper: https://arxiv.org/abs/1810.04805
Illustrated BERT, ELMo, GPT: http://jalammar.github.io/illustrated-bert/

[Figure: Language Model Pretraining (Unidirectional) vs. Masked Language Model]

Page 5

BERT: Bidirectional Encoder Representations from Transformers

BERT paper: https://arxiv.org/abs/1810.04805

Page 6

BERT: Bidirectional Encoder Representations from Transformers

BERT paper: https://arxiv.org/abs/1810.04805

[Figure: BERT pretraining. Masked input tokens ([mask]) are predicted from the surrounding context, e.g. P(dog | W_0, …, W_10) and P(he | W_0, …, W_10), alongside the next sentence prediction output P(is_next_sentence).]
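The masking itself is simple; a minimal sketch of the standard BERT recipe (15% of tokens selected; of those, 80% become [MASK], 10% a random token, 10% are left unchanged), assuming a token list and a vocabulary list are already available:

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Return masked tokens and the prediction targets, BERT-style."""
    masked = list(tokens)
    targets = {}                       # position -> original token to predict
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]") or random.random() > mask_prob:
            continue
        targets[i] = token
        r = random.random()
        if r < 0.8:
            masked[i] = "[MASK]"                 # 80%: replace with the mask token
        elif r < 0.9:
            masked[i] = random.choice(vocab)     # 10%: replace with a random token
        # remaining 10%: keep the original token, but still predict it
    return masked, targets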

Page 7

Multilingual BERT: Why do we need BERT for Russian?

Three key motivations:

● BERT-based models show state-of-the-art performance on a wide range of NLP tasks
● The Multilingual BERT model was trained on Wikipedia for 104 languages
  ○ vocabulary size: ~120k subtokens
  ○ only ~25k subtokens (~20%) of the vocabulary are related to the Russian language
  ○ the model has 180M parameters, and half of them are used by subtoken embeddings
  ○ 50% + 50% · 20% = 60% of the total model parameters could be used for Russian texts (see the sketch below)
● Single-language BERT models outperform the Multilingual BERT model:
  ○ this was shown for English and Chinese BERT models by Google Research

Multilingual BERT: https://github.com/google-research/bert/blob/master/multilingual.md
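A back-of-the-envelope check of the arithmetic above (the ~180M, ~120k and ~25k figures come from the slide; the exact split is approximate):

# Rough parameter budget for Multilingual BERT (numbers from the slide, so approximate).
total_params = 180e6                  # ~180M parameters in Multilingual BERT
embedding_share = 0.5                 # about half of the parameters sit in subtoken embeddings
russian_vocab_share = 25e3 / 120e3    # ~25k of ~120k subtokens relate to Russian (~20%)

# The Transformer body is language-agnostic, but only the Russian rows of the
# embedding matrix are useful for Russian text.
usable_share = (1 - embedding_share) + embedding_share * russian_vocab_share
print(f"usable for Russian: {usable_share:.0%} of {total_params / 1e6:.0f}M parameters")
# -> roughly 60% of 180M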

Page 8

RuBERT: Building Russian vocabulary

● We applied Subword NMT to Russian Wiki (80%) and news data (20%) (see the sketch below)
● Effect of the new Russian vocabulary:
  ○ 120k subtokens for the Russian language
  ○ ~1.5 times longer sequences fit into the model
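A minimal sketch of how such a vocabulary can be built with the subword-nmt package. The file names and the pre-mixed 80/20 corpus are assumptions; the authors' exact preprocessing and the conversion to a BERT-style WordPiece vocabulary are separate steps not shown on the slide:

from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Hypothetical corpus file: ~80% Russian Wikipedia + ~20% news, one sentence per line.
with open("ru_wiki_plus_news.txt", encoding="utf-8") as corpus, \
     open("ru_bpe.codes", "w", encoding="utf-8") as codes:
    # Learn ~120k merge operations, matching the vocabulary size on the slide.
    learn_bpe(corpus, codes, num_symbols=120_000)

# Apply the learned codes to segment new text into subtokens.
with open("ru_bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)
print(bpe.process_line("Петербургское метро будет работать круглосуточно"))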

Page 9

RuBERT: Transfer from Multilingual BERT model

http://docs.deeppavlov.ai/en/master/components/bert.html
http://files.deeppavlov.ai/deeppavlov_data/bert/rubert_cased_L-12_H-768_A-12_v1.tar.gz

How to initialize the RuBERT model?

● Random
● Initialize with the multilingual model, random init for new subtokens
● Can we do better?

Page 10

RuBERT: Transfer from Multilingual BERT model

http://docs.deeppavlov.ai/en/master/components/bert.html
http://files.deeppavlov.ai/deeppavlov_data/bert/rubert_cased_L-12_H-768_A-12_v1.tar.gz

Can we do better?

● Initialize with the multilingual model and assemble embeddings for new subtokens (sketch below):

bird = bi ##rd
Emb(bird) := Emb(bi) + Emb(##rd)

250k steps ≈ 2 days of computation on 8 × Tesla P100 16GB
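A minimal sketch of this initialization, assuming the multilingual embedding matrix, its subtoken-to-index mapping, and a multilingual subword tokenizer are already loaded (the names mbert_emb, mbert_vocab and tokenize_with_mbert are hypothetical). Entries of the new vocabulary that split into known multilingual pieces get the sum of those pieces; everything else keeps a random init:

import numpy as np

def assemble_embeddings(new_vocab, mbert_vocab, mbert_emb, tokenize_with_mbert):
    """Build an embedding matrix for the new Russian vocabulary.

    new_vocab: list of new subtokens (e.g. the 120k Russian vocabulary)
    mbert_vocab: dict subtoken -> row index in the multilingual embedding matrix
    mbert_emb: numpy array [mbert_vocab_size, hidden_size]
    tokenize_with_mbert: function splitting a string into multilingual subtokens
    """
    hidden = mbert_emb.shape[1]
    new_emb = np.random.normal(0.0, 0.02, size=(len(new_vocab), hidden))  # BERT-style init
    for i, token in enumerate(new_vocab):
        if token in mbert_vocab:                   # subtoken already known: copy it
            new_emb[i] = mbert_emb[mbert_vocab[token]]
            continue
        pieces = tokenize_with_mbert(token)        # e.g. "bird" -> ["bi", "##rd"]
        if pieces and all(p in mbert_vocab for p in pieces):
            # Emb(bird) := Emb(bi) + Emb(##rd), as on the slide
            new_emb[i] = sum(mbert_emb[mbert_vocab[p]] for p in pieces)
    return new_emb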

Page 11

RuBERT: Training details

http://docs.deeppavlov.ai/en/master/components/bert.html
http://files.deeppavlov.ai/deeppavlov_data/bert/rubert_cased_L-12_H-768_A-12_v1.tar.gz

The model was trained in two stages:
● train the full BERT on sequences of 128 subtokens
● train only positional embeddings on sequences of 512 subtokens

We used the following hyperparameters (collected into a config sketch below):
● batch size: 256
● learning rate: 2·10^-5
● optimizer: Adam
● L2 regularization: 10^-2

To support multi-GPU training, we made a fork of the original TensorFlow BERT repo:
https://github.com/deepmipt/bert/tree/feat/multi_gpu
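The settings above, collected into a plain config sketch. Only values stated on the slides are filled in; the step split between the 128- and 512-length stages is not given and is left unspecified:

PRETRAIN_CONFIG = {
    "init": "multilingual BERT + assembled embeddings for new subtokens",
    "stages": [
        {"max_seq_length": 128, "trainable": "full model"},
        {"max_seq_length": 512, "trainable": "positional embeddings only"},
    ],
    "batch_size": 256,
    "learning_rate": 2e-5,
    "optimizer": "Adam",
    "l2_regularization": 1e-2,
    "total_steps": 250_000,   # ~2 days on 8 x Tesla P100 16GB
}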

Page 12

RuBERT for Classification

Page 13

RuBERT for Classification: Paraphrase Identification

● ParaPhraser - dataset for Russian paraphrase detection (~7k training pairs)
  ○ "9 мая метрополитен Петербурга будет работать круглосуточно" ("On May 9, the St. Petersburg metro will run around the clock")
  ○ "Петербургское метро в ночь на 10 мая будет работать круглосуточно" ("The St. Petersburg metro will run around the clock on the night of May 10")
  ○ domain: news

We compare BERT-based models with other models in the non-standard run setting, where all external resources were allowed (a sketch of the BERT sentence-pair input follows the table).

http://paraphraser.ru/download/
[1] Pivovarova, L., et al. (2017). ParaPhraser: Russian paraphrase corpus and shared task.
[2] Kravchenko, D. (2017). Paraphrase detection using machine translation and textual similarity algorithms.

Model                                          | F-1          | Accuracy
Classifier + linguistic features [1]           | 81.10        | 77.39
Machine Translation + Semantic similarity [2]  | 78.51        | 81.41
BERT multilingual                              | 85.48 ± 0.19 | 81.66 ± 0.38
RuBERT                                         | 87.73 ± 0.26 | 84.99 ± 0.35
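For paraphrase identification, the two sentences are packed into a single BERT input and the label is predicted from the final [CLS] representation with one dense layer. A minimal sketch of that packing, assuming tokenizer is a BERT WordPiece tokenizer loaded elsewhere:

def make_pair_input(tokenizer, text_a, text_b, max_seq_length=128):
    """Pack a sentence pair into BERT's [CLS] A [SEP] B [SEP] format."""
    tokens_a = tokenizer.tokenize(text_a)
    tokens_b = tokenizer.tokenize(text_b)
    # Truncate so that the pair plus 3 special tokens fits the sequence length.
    while len(tokens_a) + len(tokens_b) > max_seq_length - 3:
        (tokens_a if len(tokens_a) > len(tokens_b) else tokens_b).pop()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    return input_ids, segment_ids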

Page 14

RuBERT for Classification: Sentiment Analysis

● RuSentiment - dataset for sentiment analysis of posts from VKontakte (classification-head sketch below)
● domain: social networks

http://text-machine.cs.uml.edu/projects/rusentiment/
http://docs.deeppavlov.ai/en/master/intro/features.html#classification-component
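The fine-tuning head for this task is again a single dense layer over the pooled [CLS] vector. A minimal numpy sketch, assuming a 768-dimensional [CLS] vector (BERT-base) and the five RuSentiment classes; the weights here are random placeholders, not trained values:

import numpy as np

CLASSES = ["positive", "negative", "neutral", "speech_act", "skip"]  # RuSentiment label set
HIDDEN = 768                                                          # BERT-base hidden size

# Placeholder weights of the classification head (learned during fine-tuning in practice).
W = np.random.randn(HIDDEN, len(CLASSES)) * 0.02
b = np.zeros(len(CLASSES))

def classify(cls_vector):
    """Map the pooled [CLS] representation to a sentiment label."""
    logits = cls_vector @ W + b
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return CLASSES[int(np.argmax(probs))], probs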

Page 15

Question Answering on SDSJ Task B (SQuAD)

● Context:

In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. The main forms of precipitation include drizzle, rain, sleet, snow, graupel and hail... Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. Short, intense periods of rain in scattered locations are called “showers”.

● Question:

Where do water droplets collide with ice crystals to form precipitation?

● datasets: Stanford Question Answering Dataset (SQuAD), Natural Questions, SDSJ 2017 Task B
● SDSJ Task B: ~50k context-question-answer triplets (see the record sketch below)

SQuAD: https://rajpurkar.github.io/SQuAD-explorer/
Natural Questions: https://ai.google.com/research/NaturalQuestions
SDSJ 2017: https://sdsj.sberbank.ai/2017/ru/contest.html, http://docs.deeppavlov.ai/en/master/components/squad.html#sdsj-task-b
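The task is extractive: the answer is a span of the context. A minimal sketch of one SQuAD-style record built from the example above (field names follow the SQuAD JSON format; the exact SDSJ Task B schema may differ):

# One SQuAD-style training example: the answer is given as text plus its character
# offset inside the context, from which token-level start/end positions are derived.
context = ("In meteorology, precipitation is any product of the condensation of "
           "atmospheric water vapor that falls under gravity. Precipitation forms as "
           "smaller droplets coalesce via collision with other rain drops or ice "
           "crystals within a cloud.")
question = "Where do water droplets collide with ice crystals to form precipitation?"
answer_text = "within a cloud"

example = {
    "context": context,
    "question": question,
    "answers": [{"text": answer_text,
                 "answer_start": context.find(answer_text)}],  # character offset into the context
}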

Page 16

BERT for Question Answering
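This slide shows the standard BERT question-answering head (the figure is not captured in this transcript): two vectors score each context token as a possible answer start or end, and the best valid span is returned. A minimal numpy sketch of that span selection, assuming per-token start and end logits are already computed:

import numpy as np

def best_span(start_logits, end_logits, max_answer_len=30):
    """Pick the (start, end) token span maximising start + end scores, with end >= start."""
    best = (0, 0)
    best_score = -np.inf
    for start in range(len(start_logits)):
        for end in range(start, min(start + max_answer_len, len(end_logits))):
            score = start_logits[start] + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best, best_score

# Toy usage: 6 context tokens, highest-scoring span is tokens 3..4.
start_logits = np.array([0.1, 0.2, 0.0, 2.5, 0.3, 0.1])
end_logits   = np.array([0.0, 0.1, 0.2, 0.4, 2.1, 0.3])
print(best_span(start_logits, end_logits))   # ((3, 4), 4.6)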

Page 17

RuBERT for Question Answering on SDSJ Task B

http://docs.deeppavlov.ai/en/master/components/squad.html#sdsj-task-b

Page 18

Results

● ParaPhraser and SDSJ Task B: +4-6 F-1
● RuSentiment: +1 F-1 improvement over the previous state of the art

● ParaPhraser and SDSJ Task B share the same domains as RuBERT's training data (wiki, news)

● RuSentiment is more challenging due to domain shift, but we could still show good results

Page 19

Beyond this work

We trained a single BERT model for Slavic languages (ru, bg, cs, pl) in the same manner as for the Russian language and evaluated it on the BSNLP 2019 Shared Task on Multilingual Named Entity Recognition:

These results were obtained on the validation set.

Results are from Arkhipov M., Trofimova M., Kuratov Y., Sorokin A., "Tuning Multilingual Transformers for Named Entity Recognition on Slavic Languages".

Page 20

Results and conclusions

● We trained the RuBERT model for the Russian language
● We achieved significant improvements on several Russian datasets with the RuBERT model
● RuBERT, SlavicBERT and all pre-trained models are open-sourced, e.g.:

http://docs.deeppavlov.ai/en/master/components/bert.html
http://files.deeppavlov.ai/deeppavlov_data/bert/rubert_cased_L-12_H-768_A-12_v1.tar.gz

python -m deeppavlov install squad_ru_rubert

python -m deeppavlov download squad_ru_rubert

python -m deeppavlov interact/riseapi squad_ru_rubert

from deeppavlov import build_model, configs
model = build_model(configs.squad.squad_ru_rubert, download=True)
model(['DeepPavlov это библиотека для NLP и диалоговых систем.'], ['Что такое DeepPavlov?'])
>> [['библиотека для NLP и диалоговых систем'], [15], [2758812.25]]

Page 21

github.com/deepmipt/DeepPavlov
docs.deeppavlov.ai