BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding


Transcript of BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Page 1:

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Source : NAACL-HLT 2019

Speaker : Ya-Fang, Hsiao

Advisor : Jia-Ling, Koh

Date : 2019/09/02

Page 2:

CONTENTS

1. Introduction
2. Related Work
3. Method
4. Experiment
5. Conclusion

Page 3:

1. Introduction

Page 4:

Introduction

Bidirectional Encoder Representations from Transformers

Language Model

$P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, w_2, \ldots, w_{t-1})$
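To make the factorization concrete, here is a minimal Python sketch that scores a sentence with the chain rule; the constant toy conditional probability is an assumption for illustration only, not part of the paper.

```python
# Minimal sketch of the chain-rule factorization above.
def sentence_probability(tokens, cond_prob):
    """cond_prob(token, history) returns P(token | history)."""
    p, history = 1.0, []
    for tok in tokens:
        p *= cond_prob(tok, tuple(history))   # P(w_t | w_1, ..., w_{t-1})
        history.append(tok)
    return p

# Toy model: every token has probability 0.1 regardless of history (illustrative).
print(sentence_probability(["the", "man", "went"], lambda tok, hist: 0.1))  # ~0.001
```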

Pre-trained Language Model

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Page 5:

2. Related Work

Page 6:

Related Work

Pre-trained Language Model

Feature-based: ELMo

Fine-tuning: OpenAI GPT

Page 7:

Related Work

Pre-trained Language Model

Feature-based: ELMo

Fine-tuning: OpenAI GPT

Limitations:
1. Both rely on unidirectional language models.
2. Both share the same (unidirectional) objective function during pre-training.

BERT: Bidirectional Encoder Representations from Transformers

Masked Language Model (MLM)

Next Sentence Prediction (NSP)

Page 8:

《Attention is all you need》Vaswani et al. (NIPS2017)

Transformers

Encoder-Decoder

Sequence-to-sequence

RNN: hard to parallelize

Page 9:

Encoder-Decoder

《Attention is all you need》Vaswani et al. (NIPS2017)

Transformers

Page 10:

《Attention is all you need》Vaswani et al. (NIPS2017)

Transformers

Encoder-Decoder × 6 (stacked layers)

Self-attention layers can be computed in parallel

Page 11:

《Attention is all you need》Vaswani et al. (NIPS2017)

Transformers

Self-Attention

query (to match others)

key (to be matched)

value (information to be extracted)
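A minimal NumPy sketch of scaled dot-product self-attention, matching the query/key/value roles above; the dimensions and random weights are illustrative assumptions. The matrix form is what lets all positions be computed in parallel, unlike an RNN.

```python
# Minimal sketch of scaled dot-product self-attention (illustrative dimensions).
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq/Wk/Wv project X to queries, keys, values."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # each (seq_len, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # match every query against every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # extract information from the values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                         # 5 tokens, toy model dimension 16
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (5, 8)
```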

Page 12:

《Attention is all you need》Vaswani et al. (NIPS2017)

Transformers

Multi-Head Attention

Page 13:

Transformers 《Attention is all you need》Vaswani et al. (NIPS2017)

Page 14:

BERT_BASE (L=12, H=768, A=12, Parameters=110M)

BERT_LARGE (L=24, H=1024, A=16, Parameters=340M)

L = number of Transformer layers, H = hidden size, A = number of self-attention heads, feed-forward size = 4H

BERT: Bidirectional Encoder Representations from Transformers
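As a sanity check on the reported sizes, a rough parameter count can be derived from L, H, A and the 4H feed-forward size. The breakdown below is an approximation (it ignores layer-norm parameters and the pooler head), not an official count; the vocabulary size of 30,522 and maximum position of 512 are the published configuration values.

```python
# Rough, approximate parameter count for the two BERT configurations.
def approx_bert_params(L, H, vocab=30522, max_pos=512, seg=2):
    d_ff = 4 * H
    embed = (vocab + max_pos + seg) * H            # token + position + segment embeddings
    attn = 4 * (H * H + H)                         # Q, K, V and output projections
    ffn = (H * d_ff + d_ff) + (d_ff * H + H)       # two feed-forward layers
    return embed + L * (attn + ffn)

print(approx_bert_params(12, 768) / 1e6)    # ~109M, close to the reported 110M
print(approx_bert_params(24, 1024) / 1e6)   # ~334M, close to the reported 340M
```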

Page 15:

3. Method

Page 16:

Framework

Pre-training: the model is trained on unlabeled data over different pre-training tasks.

Fine-Tuning: all parameters are fine-tuned using labeled data from the downstream tasks.

Page 17:

Input

Token Embeddings: WordPiece embeddings with a 30,000-token vocabulary.

[CLS]: classification token prepended to every sequence.

[SEP]: separator token between sentences.

Segment Embeddings: learned embeddings indicating whether a token belongs to sentence A or sentence B.

Position Embeddings: learned positional embeddings.

Pre-training corpus: BooksCorpus and English Wikipedia.
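A minimal PyTorch-style sketch of how the three embeddings are summed into the input representation; the toy token ids and the omission of layer normalization and dropout are simplifying assumptions, not the exact implementation.

```python
# Minimal sketch: input representation = token + segment + position embeddings.
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    def __init__(self, vocab_size=30000, hidden=768, max_pos=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)    # WordPiece tokens
        self.segment = nn.Embedding(2, hidden)           # sentence A = 0, sentence B = 1
        self.position = nn.Embedding(max_pos, hidden)    # learned positions

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # per-token sum of the three embeddings (LayerNorm/dropout omitted)
        return self.token(token_ids) + self.segment(segment_ids) + self.position(positions)

emb = BertInputEmbeddings()
ids = torch.tensor([[101, 1996, 2158, 102, 2002, 102]])   # [CLS] ... [SEP] ... [SEP] (toy ids)
seg = torch.tensor([[0, 0, 0, 0, 1, 1]])
print(emb(ids, seg).shape)                                 # torch.Size([1, 6, 768])
```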

Page 18:

Pre-training

Two unsupervised tasks:

1. Masked Language Model (MLM)

2. Next Sentence Prediction (NSP)

Page 19:

Task 1. MLM: Masked Language Model

Hung-Yi Lee - BERT ppt

15% of all WordPiece tokens in each sequence are masked at random for prediction.

The selected token is replaced with:

(1) the [MASK] token 80% of the time;

(2) a random token 10% of the time;

(3) the unchanged i-th token 10% of the time.
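A minimal sketch of the 80%/10%/10% replacement rule; the special-token handling and toy vocabulary are assumptions for illustration.

```python
# Minimal sketch of MLM masking: pick ~15% of tokens, then apply 80/10/10.
import random

MASK, SPECIAL = "[MASK]", {"[CLS]", "[SEP]"}

def mask_tokens(tokens, vocab, mask_prob=0.15):
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or random.random() >= mask_prob:
            continue
        labels[i] = tok                        # the model must predict the original token
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK                   # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.choice(vocab)   # 10%: replace with a random token
        # else: 10% keep the token unchanged
    return inputs, labels

vocab = ["the", "man", "went", "to", "store", "milk"]
print(mask_tokens(["[CLS]", "the", "man", "went", "to", "the", "store", "[SEP]"], vocab))
```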

Page 20:

Task 2. NSP: Next Sentence Prediction

Hung-Yi Lee - BERT ppt

Input = [CLS] the man went to [MASK] store [SEP]

he bought a gallon [MASK] milk [SEP]

Label = IsNext

Input = [CLS] the man [MASK] to the store [SEP]

penguin [MASK] are flight ##less birds [SEP]

Label = NotNext
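A minimal sketch of how such training pairs can be built: 50% of the time sentence B is the actual next sentence (IsNext), otherwise a random sentence from another document (NotNext). The toy corpus and sampling details are illustrative assumptions.

```python
# Minimal sketch of constructing NSP training pairs.
import random

def make_nsp_pair(corpus, doc_idx, sent_idx):
    a = corpus[doc_idx][sent_idx]
    if random.random() < 0.5 and sent_idx + 1 < len(corpus[doc_idx]):
        return a, corpus[doc_idx][sent_idx + 1], "IsNext"      # actual next sentence
    other_doc = random.choice([d for i, d in enumerate(corpus) if i != doc_idx])
    return a, random.choice(other_doc), "NotNext"              # random sentence

corpus = [["the man went to the store", "he bought a gallon of milk"],
          ["penguins are flightless birds", "they live in the southern hemisphere"]]
print(make_nsp_pair(corpus, 0, 0))
```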

Page 21:

Fine-Tuning

Fine-Tuning: all parameters are fine-tuned using labeled data from the downstream tasks.

Page 22:

Task 1 (b): Single Sentence Classification Tasks

Input: a single sentence; output: a class.

The output representation of [CLS] is fed to a linear classifier (trained from scratch), while BERT itself is fine-tuned.

Examples: sentiment analysis, document classification.

Hung-Yi Lee - BERT ppt
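A minimal sketch of this setup: a linear classifier, trained from scratch, reads the final [CLS] representation while the encoder is fine-tuned end-to-end. `ToyEncoder` is a stand-in assumption for a real pre-trained BERT model.

```python
# Minimal sketch of single-sentence classification on top of [CLS].
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in for pre-trained BERT; returns (batch, seq_len, hidden) states."""
    def __init__(self, vocab=30000, hidden=768):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)

    def forward(self, token_ids):
        return self.emb(token_ids)

class SentenceClassifier(nn.Module):
    def __init__(self, encoder, hidden=768, num_classes=2):
        super().__init__()
        self.encoder = encoder                            # pre-trained, fine-tuned
        self.classifier = nn.Linear(hidden, num_classes)  # trained from scratch

    def forward(self, token_ids):
        states = self.encoder(token_ids)    # (batch, seq_len, hidden)
        cls = states[:, 0]                  # output at the [CLS] position
        return self.classifier(cls)         # class logits

model = SentenceClassifier(ToyEncoder())
logits = model(torch.tensor([[101, 1996, 3185, 2001, 2307, 102]]))  # [CLS] ... [SEP]
print(logits.shape)                                                  # torch.Size([1, 2])
```

The tagging task on the next slide applies the same kind of linear classifier to every token's output instead of only [CLS], and the sentence-pair task packs both sentences into one [CLS] ... [SEP] ... sequence before classifying from [CLS].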

Page 23:

Task 2 (d): Single Sentence Tagging Tasks

Input: a single sentence; output: a class for each token.

A linear classifier is applied to the output representation of every token.

Example: slot filling.

Hung-Yi Lee - BERT ppt

Page 24:

Task 3 (a): Sentence Pair Classification Tasks

Input: two sentences, packed as [CLS] sentence 1 [SEP] sentence 2; output: a class.

The output representation of [CLS] is fed to a linear classifier.

Example: natural language inference.

Hung-Yi Lee - BERT ppt

Page 25:

Task 4 (c): Question Answering Tasks

Document: $D = \{d_1, d_2, \cdots, d_N\}$

Query: $Q = \{q_1, q_2, \cdots, q_M\}$

The QA model takes $D$ and $Q$ and outputs two integers $(s, e)$; the answer is the span $A = \{d_s, \cdots, d_e\}$ in the document.

Example answers: $s = 17, e = 17$ (a one-token span) and $s = 77, e = 79$ (a three-token span).

Hung-Yi Lee - BERT ppt

Page 26:

Task 4 (c): Question Answering Tasks

The question and the document are packed into one sequence: [CLS] q1 q2 [SEP] d1 d2 d3.

A start vector, learned from scratch, is dot-producted with the output representation of every document token; a softmax over these scores selects the start position, here s = 2 (token d2).

Hung-Yi Lee - BERT ppt

Page 27:

Task 4 (c): Question Answering Tasks

An end vector, also learned from scratch, is used in the same way; the softmax over its dot products selects the end position, here e = 3 (token d3), so the predicted answer is "d2 d3".

Hung-Yi Lee - BERT ppt
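A minimal sketch of the span-prediction head described on these two slides: a start vector and an end vector, both learned from scratch, are dot-producted with every document token's output and softmaxed over positions. Names and shapes are illustrative assumptions.

```python
# Minimal sketch of extractive-QA span prediction with start/end vectors.
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    def __init__(self, hidden=768):
        super().__init__()
        self.start = nn.Parameter(torch.randn(hidden))    # start vector, learned from scratch
        self.end = nn.Parameter(torch.randn(hidden))      # end vector, learned from scratch

    def forward(self, doc_states):                         # (batch, doc_len, hidden)
        start_probs = torch.softmax(doc_states @ self.start, dim=-1)
        end_probs = torch.softmax(doc_states @ self.end, dim=-1)
        s = start_probs.argmax(dim=-1)                     # predicted start index
        e = end_probs.argmax(dim=-1)                       # predicted end index
        return s, e

head = SpanHead()
doc_states = torch.randn(1, 3, 768)        # toy BERT outputs for document tokens d1, d2, d3
print(head(doc_states))                    # e.g. start/end indices such as (1, 2) -> span d2..d3
```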

Page 28:

4. Experiment

Page 29:

Experiments: fine-tuning results on 11 NLP tasks

Page 30:

Implementation: LeeMeng - 進擊的BERT (PyTorch)

Page 31:

Implementation: LeeMeng - 進擊的BERT (PyTorch)

Page 32:

Implementation: LeeMeng - 進擊的BERT (PyTorch)

Page 33:

Implementation: LeeMeng - 進擊的BERT (PyTorch)

Page 34:

Implementation: LeeMeng - 進擊的BERT (PyTorch)

Page 35:

5. Conclusion

Page 36:

References

Development of language models http://bit.ly/nGram2NNLM

Language model pre-training methods (ELMo, OpenAI GPT, BERT) http://bit.ly/ELMo_OpenAIGPT_BERT

Attention Is All You Need http://bit.ly/AttIsAllUNeed

BERT paper http://bit.ly/BERTpaper

Hung-Yi Lee - Transformer (YouTube) http://bit.ly/HungYiLee_Transformer

The Illustrated Transformer http://bit.ly/illustratedTransformer

Transformer explained in detail http://bit.ly/explainTransformer

github/codertimo - BERT (PyTorch) http://bit.ly/BERT_pytorch

Implementing fake news classification http://bit.ly/implementpaircls

Pytorch.org BERT http://bit.ly/pytorchorgBERT