BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding


Transcript of BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Page 1:

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Source : NAACL-HLT 2019

Speaker : Ya-Fang, Hsiao

Advisor : Jia-Ling, Koh

Date : 2019/09/02

Page 2:

CONTENTS

1. Introduction
2. Related Work
3. Method
4. Experiment
5. Conclusion

Page 3:

1. Introduction

Page 4:

Introduction

Bidirectional Encoder Representations from Transformers

Language Model

$P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, w_2, \ldots, w_{t-1})$
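To make the factorization concrete, here is a minimal Python sketch that scores a sentence with the chain rule; the constant toy conditional probability is an assumption for illustration only, not part of the paper.

```python
# Minimal sketch of the chain-rule factorization above.
def sentence_probability(tokens, cond_prob):
    """cond_prob(token, history) returns P(token | history)."""
    p, history = 1.0, []
    for tok in tokens:
        p *= cond_prob(tok, tuple(history))   # P(w_t | w_1, ..., w_{t-1})
        history.append(tok)
    return p

# Toy model: every token has probability 0.1 regardless of history (illustrative).
print(sentence_probability(["the", "man", "went"], lambda tok, hist: 0.1))  # ~0.001
```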

Pre-trained Language Model

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Page 5:

2. Related Work

Page 6:

Related Work

Pre-trained Language Model

Feature-based: ELMo

Fine-tuning: OpenAI GPT

Page 7:

Related Work

Pre-trained Language Model

Feature-based: ELMo

Fine-tuning: OpenAI GPT

Limitations:
1. Both rely on unidirectional language models.
2. Both share the same (unidirectional) objective function during pre-training.

BERT: Bidirectional Encoder Representations from Transformers

Masked Language Model (MLM)

Next Sentence Prediction (NSP)

Page 8:

《Attention is all you need》Vaswani et al. (NIPS2017)

Transformers

Encoder-Decoder

Sequence-to-sequence

RNN: hard to parallelize

Page 9:

Encoder-Decoder

《Attention is all you need》Vaswani et al. (NIPS2017)

Transformers

Page 10:

《Attention is all you need》Vaswani et al. (NIPS2017)

Transformers

Encoder-Decoder × 6 (stacked layers)

Self-attention layers can be computed in parallel

Page 11:

《Attention is all you need》Vaswani et al. (NIPS2017)

Transformers

Self-Attention

query (to match others)

key (to be matched)

value (information to be extracted)
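A minimal NumPy sketch of scaled dot-product self-attention, matching the query/key/value roles above; the dimensions and random weights are illustrative assumptions. The matrix form is what lets all positions be computed in parallel, unlike an RNN.

```python
# Minimal sketch of scaled dot-product self-attention (illustrative dimensions).
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq/Wk/Wv project X to queries, keys, values."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # each (seq_len, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # match every query against every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # extract information from the values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                         # 5 tokens, toy model dimension 16
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (5, 8)
```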

Page 12:

《Attention is all you need》Vaswani et al. (NIPS2017)

Transformers

Multi-Head Attention

Page 13:

Transformers 《Attention is all you need》Vaswani et al. (NIPS2017)

Page 14:

BERT_BASE (L=12, H=768, A=12, Parameters=110M)

BERT_LARGE (L=24, H=1024, A=16, Parameters=340M)

L = number of Transformer layers, H = hidden size, A = number of self-attention heads, feed-forward size = 4H

BERT: Bidirectional Encoder Representations from Transformers
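As a sanity check on the reported sizes, a rough parameter count can be derived from L, H, A and the 4H feed-forward size. The breakdown below is an approximation (it ignores layer-norm parameters and the pooler head), not an official count; the vocabulary size of 30,522 and maximum position of 512 are the published configuration values.

```python
# Rough, approximate parameter count for the two BERT configurations.
def approx_bert_params(L, H, vocab=30522, max_pos=512, seg=2):
    d_ff = 4 * H
    embed = (vocab + max_pos + seg) * H            # token + position + segment embeddings
    attn = 4 * (H * H + H)                         # Q, K, V and output projections
    ffn = (H * d_ff + d_ff) + (d_ff * H + H)       # two feed-forward layers
    return embed + L * (attn + ffn)

print(approx_bert_params(12, 768) / 1e6)    # ~109M, close to the reported 110M
print(approx_bert_params(24, 1024) / 1e6)   # ~334M, close to the reported 340M
```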

Page 15:

3. Method

Page 16:

Framework

Pre-training: the model is trained on unlabeled data over different pre-training tasks.

Fine-Tuning: all parameters are fine-tuned using labeled data from the downstream tasks.

Page 17:

Input

Token Embeddings: WordPiece embeddings with a 30,000-token vocabulary.

[CLS]: classification token prepended to every sequence.

[SEP]: separator token between sentences.

Segment Embeddings: learned embeddings indicating whether a token belongs to sentence A or sentence B.

Position Embeddings: learned positional embeddings.

Pre-training corpus: BooksCorpus and English Wikipedia.
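A minimal PyTorch-style sketch of how the three embeddings are summed into the input representation; the toy token ids and the omission of layer normalization and dropout are simplifying assumptions, not the exact implementation.

```python
# Minimal sketch: input representation = token + segment + position embeddings.
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    def __init__(self, vocab_size=30000, hidden=768, max_pos=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)    # WordPiece tokens
        self.segment = nn.Embedding(2, hidden)           # sentence A = 0, sentence B = 1
        self.position = nn.Embedding(max_pos, hidden)    # learned positions

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # per-token sum of the three embeddings (LayerNorm/dropout omitted)
        return self.token(token_ids) + self.segment(segment_ids) + self.position(positions)

emb = BertInputEmbeddings()
ids = torch.tensor([[101, 1996, 2158, 102, 2002, 102]])   # [CLS] ... [SEP] ... [SEP] (toy ids)
seg = torch.tensor([[0, 0, 0, 0, 1, 1]])
print(emb(ids, seg).shape)                                 # torch.Size([1, 6, 768])
```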

Page 18:

Pre-training

Two unsupervised tasks:

1. Masked Language Model (MLM)

2. Next Sentence Prediction (NSP)

Page 19:

Task 1. MLM: Masked Language Model

Hung-Yi Lee - BERT ppt

15% of all WordPiece tokens in each sequence are masked at random for prediction.

The selected token is replaced with:

(1) the [MASK] token 80% of the time;

(2) a random token 10% of the time;

(3) the unchanged i-th token 10% of the time.
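A minimal sketch of the 80%/10%/10% replacement rule; the special-token handling and toy vocabulary are assumptions for illustration.

```python
# Minimal sketch of MLM masking: pick ~15% of tokens, then apply 80/10/10.
import random

MASK, SPECIAL = "[MASK]", {"[CLS]", "[SEP]"}

def mask_tokens(tokens, vocab, mask_prob=0.15):
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or random.random() >= mask_prob:
            continue
        labels[i] = tok                        # the model must predict the original token
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK                   # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.choice(vocab)   # 10%: replace with a random token
        # else: 10% keep the token unchanged
    return inputs, labels

vocab = ["the", "man", "went", "to", "store", "milk"]
print(mask_tokens(["[CLS]", "the", "man", "went", "to", "the", "store", "[SEP]"], vocab))
```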

Page 20:

Task 2. NSP: Next Sentence Prediction

Hung-Yi Lee - BERT ppt

Input = [CLS] the man went to [MASK] store [SEP]

he bought a gallon [MASK] milk [SEP]

Label = IsNext

Input = [CLS] the man [MASK] to the store [SEP]

penguin [MASK] are flight ##less birds [SEP]

Label = NotNext
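A minimal sketch of how such training pairs can be built: 50% of the time sentence B is the actual next sentence (IsNext), otherwise a random sentence from another document (NotNext). The toy corpus and sampling details are illustrative assumptions.

```python
# Minimal sketch of constructing NSP training pairs.
import random

def make_nsp_pair(corpus, doc_idx, sent_idx):
    a = corpus[doc_idx][sent_idx]
    if random.random() < 0.5 and sent_idx + 1 < len(corpus[doc_idx]):
        return a, corpus[doc_idx][sent_idx + 1], "IsNext"      # actual next sentence
    other_doc = random.choice([d for i, d in enumerate(corpus) if i != doc_idx])
    return a, random.choice(other_doc), "NotNext"              # random sentence

corpus = [["the man went to the store", "he bought a gallon of milk"],
          ["penguins are flightless birds", "they live in the southern hemisphere"]]
print(make_nsp_pair(corpus, 0, 0))
```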

Page 21:

Fine-Tuning

Fine-Tuning: all parameters are fine-tuned using labeled data from the downstream tasks.

Page 22:

Task 1 (b): Single Sentence Classification Tasks

Input: a single sentence; output: a class.

The output representation of [CLS] is fed to a linear classifier (trained from scratch), while BERT itself is fine-tuned.

Examples: sentiment analysis, document classification.

Hung-Yi Lee - BERT ppt
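A minimal sketch of this setup: a linear classifier, trained from scratch, reads the final [CLS] representation while the encoder is fine-tuned end-to-end. `ToyEncoder` is a stand-in assumption for a real pre-trained BERT model.

```python
# Minimal sketch of single-sentence classification on top of [CLS].
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in for pre-trained BERT; returns (batch, seq_len, hidden) states."""
    def __init__(self, vocab=30000, hidden=768):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)

    def forward(self, token_ids):
        return self.emb(token_ids)

class SentenceClassifier(nn.Module):
    def __init__(self, encoder, hidden=768, num_classes=2):
        super().__init__()
        self.encoder = encoder                            # pre-trained, fine-tuned
        self.classifier = nn.Linear(hidden, num_classes)  # trained from scratch

    def forward(self, token_ids):
        states = self.encoder(token_ids)    # (batch, seq_len, hidden)
        cls = states[:, 0]                  # output at the [CLS] position
        return self.classifier(cls)         # class logits

model = SentenceClassifier(ToyEncoder())
logits = model(torch.tensor([[101, 1996, 3185, 2001, 2307, 102]]))  # [CLS] ... [SEP]
print(logits.shape)                                                  # torch.Size([1, 2])
```

The tagging task on the next slide applies the same kind of linear classifier to every token's output instead of only [CLS], and the sentence-pair task packs both sentences into one [CLS] ... [SEP] ... sequence before classifying from [CLS].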

Page 23:

Task 2 (d): Single Sentence Tagging Tasks

Input: a single sentence; output: a class for each token.

A linear classifier is applied to the output representation of every token.

Example: slot filling.

Hung-Yi Lee - BERT ppt

Page 24:

Task 3 (a): Sentence Pair Classification Tasks

Input: two sentences, packed as [CLS] sentence 1 [SEP] sentence 2; output: a class.

The output representation of [CLS] is fed to a linear classifier.

Example: natural language inference.

Hung-Yi Lee - BERT ppt

Page 25:

Task 4 (c): Question Answering Tasks

Document: $D = \{d_1, d_2, \cdots, d_N\}$

Query: $Q = \{q_1, q_2, \cdots, q_M\}$

The QA model takes $D$ and $Q$ and outputs two integers $(s, e)$; the answer is the span $A = \{d_s, \cdots, d_e\}$ in the document.

Example answers: $s = 17, e = 17$ (a one-token span) and $s = 77, e = 79$ (a three-token span).

Hung-Yi Lee - BERT ppt

Page 26:

Task 4 (c): Question Answering Tasks

The question and the document are packed into one sequence: [CLS] q1 q2 [SEP] d1 d2 d3.

A start vector, learned from scratch, is dot-producted with the output representation of every document token; a softmax over these scores selects the start position, here s = 2 (token d2).

Hung-Yi Lee - BERT ppt

Page 27:

Task 4 (c): Question Answering Tasks

An end vector, also learned from scratch, is used in the same way; the softmax over its dot products selects the end position, here e = 3 (token d3), so the predicted answer is "d2 d3".

Hung-Yi Lee - BERT ppt
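A minimal sketch of the span-prediction head described on these two slides: a start vector and an end vector, both learned from scratch, are dot-producted with every document token's output and softmaxed over positions. Names and shapes are illustrative assumptions.

```python
# Minimal sketch of extractive-QA span prediction with start/end vectors.
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    def __init__(self, hidden=768):
        super().__init__()
        self.start = nn.Parameter(torch.randn(hidden))    # start vector, learned from scratch
        self.end = nn.Parameter(torch.randn(hidden))      # end vector, learned from scratch

    def forward(self, doc_states):                         # (batch, doc_len, hidden)
        start_probs = torch.softmax(doc_states @ self.start, dim=-1)
        end_probs = torch.softmax(doc_states @ self.end, dim=-1)
        s = start_probs.argmax(dim=-1)                     # predicted start index
        e = end_probs.argmax(dim=-1)                       # predicted end index
        return s, e

head = SpanHead()
doc_states = torch.randn(1, 3, 768)        # toy BERT outputs for document tokens d1, d2, d3
print(head(doc_states))                    # e.g. start/end indices such as (1, 2) -> span d2..d3
```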

Page 28:

4. Experiment

Page 29:

Experiments: fine-tuning results on 11 NLP tasks

Page 30:

Implementation: LeeMeng - 進擊的BERT (PyTorch)

Page 31:

Implementation: LeeMeng - 進擊的BERT (PyTorch)

Page 32:

Implementation: LeeMeng - 進擊的BERT (PyTorch)

Page 33:

Implementation: LeeMeng - 進擊的BERT (PyTorch)

Page 34:

Implementation: LeeMeng - 進擊的BERT (PyTorch)

Page 35:

5. Conclusion

Page 36:

References

Development of language models http://bit.ly/nGram2NNLM

Language model pre-training methods (ELMo, OpenAI GPT, BERT) http://bit.ly/ELMo_OpenAIGPT_BERT

Attention Is All You Need http://bit.ly/AttIsAllUNeed

BERT paper http://bit.ly/BERTpaper

Hung-Yi Lee - Transformer (YouTube) http://bit.ly/HungYiLee_Transformer

The Illustrated Transformer http://bit.ly/illustratedTransformer

Transformer explained in detail http://bit.ly/explainTransformer

github/codertimo - BERT (PyTorch) http://bit.ly/BERT_pytorch

Implementing fake news classification http://bit.ly/implementpaircls

Pytorch.org BERT http://bit.ly/pytorchorgBERT