BERT and its family
Transcript of slides: speech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/BERT train (v8).pdf
BERT and its family
Hung-yi Lee (李宏毅)
Pre-train: from text without annotation, learn a model that can read text.
Fine-tune: with task-specific data with annotation, turn the pre-trained model into task-specific models.
ELMo (Embeddings from Language Models)
BERT (Bidirectional Encoder Representations from Transformers)
ERNIE (Enhanced Representation through Knowledge Integration)
Big Bird (Big Binary Recursive Decoder?)
Grover (Generating aRticles by Only Viewing mEtadata Records)
BERT & PALs (Projected Attention Layers)
Outline
What is a pre-trained model
How to fine-tune
How to pre-train
Pre-trained Model
Pre-trained Model
Example input: 養 隻 狗 ("raise a dog"), one token per character.
Represent each token by an embedding vector. Tokens of the same type always get the same embedding, so the model is simply a table look-up: 狗 (dog), 貓 (cat), 雞 (chicken), 羊 (sheep), 豬 (pig), …… each map to one fixed vector.
Word2vec [Mikolov, et al., NIPS’13]; GloVe [Pennington, et al., EMNLP’14]
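To make the table look-up concrete, here is a minimal PyTorch sketch (the toy vocabulary, dimensions, and random initialization are illustrative assumptions; in practice the table would be filled with pre-trained word2vec/GloVe vectors):

```python
import torch
import torch.nn as nn

# Toy vocabulary; in practice it comes from the pre-training corpus.
vocab = {"dog": 0, "cat": 1, "chicken": 2, "sheep": 3, "pig": 4}
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

# The same token type always retrieves the same row of the table.
ids = torch.tensor([vocab["dog"], vocab["cat"], vocab["dog"]])
vectors = embed(ids)                        # shape: (3, 8)
print(torch.equal(vectors[0], vectors[2]))  # True: identical vectors for "dog"
```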
Pre-trained Model
The same idea applies with English words as tokens, but a word can also be broken into character or subword units (e.g., r e u s e) so that related words share parts of their representation.
FastText [Bojanowski, et al., TACL’17]
Pre-trained Model
With Chinese characters as tokens, the glyph of a character can itself be treated as an image and encoded with a CNN [Su, et al., EMNLP’17].
Pre-trained Model
Problem: the token 狗 (dog) in 養 隻 狗 ("raise a dog") and in 單 身 狗 ("single dog", slang for a single person) gets the same embedding, even though its meaning differs.
Pre-trained Model
What we actually want: the embedding of 狗 should be different in these two contexts.
Contextualized Word Embedding
Each token is embedded by a model that reads the whole sentence, so 狗 in 養 隻 狗 and 狗 in 單 身 狗 receive different vectors.
Pre-trained Model
A contextualized word embedding model reads the whole input (e.g., 養 隻 狗) through many layers, which can be:
• LSTM
• Self-attention layers
• Tree-based model (?) (Ref: https://youtu.be/z0uOq2wEGcc)
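A minimal sketch of the contextual idea, using a tiny randomly initialized self-attention encoder as a stand-in for a real pre-trained model (the vocabulary and sizes are assumptions): the vector produced for the same token now depends on its sentence.

```python
import torch
import torch.nn as nn

vocab = {"raise": 0, "a": 1, "dog": 2, "single": 3}
embed = nn.Embedding(len(vocab), 16)
layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
encoder.eval()  # disable dropout so any difference comes only from context

def contextual(tokens):
    ids = torch.tensor([[vocab[t] for t in tokens]])
    return encoder(embed(ids))[0]            # (seq_len, 16): one vector per token

v1 = contextual(["raise", "a", "dog"])[2]    # "dog" in the first context
v2 = contextual(["single", "dog"])[1]        # "dog" in the second context
print(torch.allclose(v1, v2))                # False: the embeddings differ
```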
Bigger Model
Pre-trained models keep getting bigger, e.g., Megatron-LM [Shoeybi, et al., arXiv’19] and Turing-NLG.
Source of image: https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
Smaller Model
• DistilBERT [Sanh, et al., NeurIPS workshop’19]
• TinyBERT [Jian, et al., arXiv’19]
• MobileBERT [Sun, et al., ACL’20]
• Q8BERT [Zafrir, et al., NeurIPS workshop 2019]
• ALBERT [Lan, et al., ICLR’20]
Smaller Model
• Network Compression
• Network Pruning
• Knowledge Distillation
• Parameter Quantization
• Architecture Design
Ref: https://youtu.be/dPp8rCAnU_A
Excellent reference: http://mitchgordon.me/machine/learning/2019/11/18/all-the-ways-to-compress-BERT.html
All of them have been tried.
Network Architecture
• Transformer-XL: Segment-Level Recurrence with State Reuse [Dai, et al., ACL’19]
• Reformer [Kitaev, et al., ICLR’20]
• Longformer [Beltagy, et al., arXiv’20]
Reformer and Longformer reduce the complexity of self-attention.
How to fine-tune
The input tokens w1 w2 w3 w4 go into the pre-trained model, and a task-specific layer is added on top for a specific NLP task.
NLP tasks
Input: one sentence, or multiple sentences.
Output: one class, a class for each token, copy from input, or a general sequence.
Input: one sentence vs. multiple sentences
For multiple sentences, concatenate them with a [SEP] token in between: Sentence 1 (w1 w2) [SEP] Sentence 2 (w3 w4 w5). Typical sentence pairs: Query + Document, Premise + Hypothesis.
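A small sketch of how a sentence pair is packed into one input, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (downloaded on first use; neither is part of the slides):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

query = "who wrote hamlet"
document = "hamlet was written by william shakespeare"

enc = tokenizer(query, document, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]))
# ['[CLS]', 'who', 'wrote', 'hamlet', '[SEP]', 'hamlet', 'was', ..., '[SEP]']
print(enc["token_type_ids"][0])   # 0 for the first sentence, 1 for the second
```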
Output: one class
Prepend a [CLS] token to the input ([CLS] w1 w2 w3). Either the model's representation of [CLS] is fed into a task-specific layer that outputs the class, or all the token representations are fed into the task-specific layer together.
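A minimal sketch of the [CLS]-based classifier, again assuming a Hugging Face BERT encoder; the three classes and the gold label are arbitrary examples:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(encoder.config.hidden_size, 3)   # task-specific layer

enc = tokenizer("this movie was great", return_tensors="pt")
hidden = encoder(**enc).last_hidden_state    # (1, seq_len, hidden_size)
cls_vector = hidden[:, 0]                    # representation of [CLS]
logits = classifier(cls_vector)              # (1, 3) class scores
loss = F.cross_entropy(logits, torch.tensor([2]))   # illustrative gold label
```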
Output: a class for each token
The task-specific layer is applied to the model's representation of every input token, producing one class per token (e.g., for sequence tagging).
Output: copy from input
• Extraction-based QA: the answer is a span of the document.
Document: D = d1, d2, ⋯, dN
Query: Q = q1, q2, ⋯, qM
The QA model reads D and Q and outputs two integers (s, e); the answer is the span A = ds, ⋯, de. For example, s = 77 and e = 79 means the answer is d77 d78 d79.
Copy from Input (BERT)
Input: [CLS] q1 q2 [SEP] d1 d2 d3 (question, then document). A task-specific vector is dot-producted with the model's representation of every document token, followed by a softmax over the document positions; the position with the highest probability is the start, e.g., s = 2.
Copy from Input (BERT)
A second task-specific vector is handled the same way (dot product with every document token, then softmax) to pick the end position, e.g., e = 3. With s = 2 and e = 3, the answer is "d2 d3".
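A minimal sketch of this start/end mechanism; the hidden states are random stand-ins for the pre-trained model's outputs, and the two task-specific vectors are the only new parameters:

```python
import torch
import torch.nn as nn

hidden_size, doc_len = 768, 3
doc_hidden = torch.randn(doc_len, hidden_size)       # stand-in for d1..d3 outputs

start_vec = nn.Parameter(torch.randn(hidden_size))   # task-specific "start" vector
end_vec = nn.Parameter(torch.randn(hidden_size))     # task-specific "end" vector

start_probs = torch.softmax(doc_hidden @ start_vec, dim=0)  # dot product + softmax
end_probs = torch.softmax(doc_hidden @ end_vec, dim=0)

s = int(torch.argmax(start_probs)) + 1   # 1-based index into the document
e = int(torch.argmax(end_probs)) + 1
print(f"answer span: d{s} .. d{e}")
```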
Output – General Sequence (v1)
• Seq2seq model: the pre-trained model is used as the encoder, reading the input sequence ([CLS] w1 w2 w3); a task-specific decoder attends to the encoder and generates the output sequence (w4 w5 <EOS>).
Output – General Sequence (v2)
Use the pre-trained model itself as the decoder: feed in the input sequence (w1 w2) followed by [SEP], then generate the output one token at a time (w3, w4, w5, …, <EOS>), feeding each generated token back in, with a task-specific layer applied at every step.
How to fine-tune
Two choices: (1) keep the pre-trained model fixed and use it as a feature extractor, training only the task-specific layer; or (2) fine-tune the whole thing, pre-trained model plus task-specific layer, which gives a gigantic model for the downstream task.
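A small sketch of the two choices, assuming a Hugging Face encoder and a simple task head (names and sizes are illustrative):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

encoder = AutoModel.from_pretrained("bert-base-uncased")
head = nn.Linear(encoder.config.hidden_size, 3)   # task-specific layer

# Choice 1: feature extractor -- freeze the pre-trained weights, train only the head.
for p in encoder.parameters():
    p.requires_grad = False
trainable = list(head.parameters())

# Choice 2: fine-tune everything -- the optimizer sees all parameters.
# trainable = list(encoder.parameters()) + list(head.parameters())

optimizer = torch.optim.AdamW(trainable, lr=3e-5)
```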
Adaptor [Houlsby, et al., ICML’19][Stickland, et al., ICML’19]
If the whole pre-trained model is fine-tuned, every downstream task ends up with its own full-size copy of the model, so a large amount of memory is needed.
Adaptor [Houlsby, et al., ICML’19][Stickland, et al., ICML’19]
With adaptors, only small adaptor (Apt) layers inserted into the pre-trained model, plus the task-specific layer, are updated during fine-tuning; every task shares the same pre-trained model and only stores its own adaptors.
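A minimal sketch of an adaptor in the spirit of [Houlsby, et al., ICML’19]: a bottleneck added inside the network with a residual connection, trained while the original weights stay frozen (all sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adaptor: down-project, nonlinearity, up-project, residual add."""
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

adapter = Adapter()
x = torch.randn(2, 10, 768)   # stand-in for one layer's hidden states
print(adapter(x).shape)       # torch.Size([2, 10, 768])

# During fine-tuning, only adaptor (and task head) parameters are optimized;
# the pre-trained model's own parameters stay frozen.
```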
[Houlsby, et al., ICML’19] Source of image: https://arxiv.org/abs/1902.00751
Weighted Features
Each layer of the whole pre-trained model produces features for the input w1 w2 w3: layer 1 gives x1, layer 2 gives x2, and so on. The feature passed to the task-specific layer is the weighted sum w1·x1 + w2·x2 (here w1, w2 denote scalar weights, not tokens), and these weights are learned in the downstream task.
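A small sketch of learning such layer weights, assuming a Hugging Face encoder that returns all hidden states; normalizing the weights with a softmax is an ELMo-style choice, not something the slide specifies:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

num_layers = encoder.config.num_hidden_layers + 1       # + the embedding layer
layer_weights = nn.Parameter(torch.zeros(num_layers))   # learned in the downstream task

enc = tokenizer("a single dog", return_tensors="pt")
hidden_states = encoder(**enc).hidden_states             # tuple of (1, seq, hidden)
stacked = torch.stack(hidden_states)                      # (num_layers, 1, seq, hidden)

w = torch.softmax(layer_weights, dim=0).view(-1, 1, 1, 1)
features = (w * stacked).sum(dim=0)                       # weighted sum over layers
```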
Why Pre-train Models?
• GLUE scores
Source of image: https://arxiv.org/abs/1905.00537
Why Fine-tune?
Source of image: https://arxiv.org/abs/1908.05620 [Hao, et al., EMNLP’19]
Why Fine-tune? (How to generate the figures below? https://youtu.be/XysGHdNOTbg)
Source of image: https://arxiv.org/abs/1908.05620 [Hao, et al., EMNLP’19]
How to Pre-train
Goal: from text without annotation, pre-train a model that can read text.
Pre-training by Translation
• Context Vector (CoVe): the pre-trained model is the encoder of a translation model. Input: language A (w1 w2 w3 w4); a decoder outputs language B (w5 w6 w7).
Drawback: this needs sentence pairs for languages A and B.
Self-supervised Learning
Supervised learning: the model maps an input x to a given label y.
Self-supervised learning: split the unannotated text x into x′ and x″; the model takes x′ as input and uses x″ as the label, so text without annotation is enough.
Predict Next Token
Input w1 w2 w3 w4; the model produces hidden states h1 h2 h3 h4, and at each position t it predicts the next token: ht goes through a linear transform and a softmax to give a distribution over the vocabulary, trained with cross entropy against the true next token wt+1 (here w2 w3 w4 w5).
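A minimal sketch of one next-token training step; the toy vocabulary and the LSTM stand in for the real model, while the linear transform, softmax, and cross entropy are as described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden = 1000, 128
embed = nn.Embedding(vocab_size, hidden)
lstm = nn.LSTM(hidden, hidden, batch_first=True)   # stand-in for the model
to_vocab = nn.Linear(hidden, vocab_size)           # linear transform before softmax

tokens = torch.randint(0, vocab_size, (1, 5))      # w1 .. w5 (toy data)
inputs, targets = tokens[:, :-1], tokens[:, 1:]    # predict w_{t+1} from w1..w_t

h, _ = lstm(embed(inputs))                         # hidden states h1 .. h4
logits = to_vocab(h)                               # (1, 4, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                    # exactly how LMs are trained
```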
Predict Next Token
This is exactly how we train language models (LM). Early pre-trained models of this kind are based on LSTMs:
• ELMo [Peters, et al., NAACL’18]
• Universal Language Model Fine-tuning (ULMFiT) [Howard, et al., ACL’18]
Predict Next Token
GPT-style models replace the LSTM with self-attention, with the constraint that each position can only attend to earlier tokens when predicting the next one (e.g., w4 from w1 w2 w3):
• GPT [Alec, et al., 2018], GPT-2 [Alec, et al., 2019]
• Megatron [Shoeybi, et al., arXiv’19]
• Turing-NLG
Predict Next Token: they can do generation.
https://talktotransformer.com/
Predict Next Token: they can do generation.
律師 (lawyer)
混亂 (chaos)
I forced a bot to watch over 1,000 hours of XXX
It's a meme! A human imitating a machine imitating humans!!!
"You shall know a word by the company it keeps" (John Rupert Firth)
When predicting w5 from w1 w2 w3 w4, the hidden state h4 encodes w4 and its left context only. How about the right context!?
Predict Next Token - Bidirectional (ELMo)
ELMo runs one LSTM left-to-right and another right-to-left over w1 … w7, so each token's representation combines both directions.
Masking Input: BERT [Devlin, et al., NAACL’19]
BERT is a Transformer with no limitation on self-attention (every position can attend to every other position). During pre-training, an input token (e.g., w2 in w1 w2 w3 w4) is replaced either by a special MASK token or by a random token, and the model must predict the original token.
Masking Input
The model uses the surrounding context (w1, w3, w4) to predict the missing token w2.
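A minimal sketch of this masked-token objective; toy sizes, a plain Transformer encoder standing in for BERT, and the paper's 80/10/10 masking recipe omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden, MASK_ID = 1000, 128, 0
embed = nn.Embedding(vocab_size, hidden)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
    num_layers=2)
to_vocab = nn.Linear(hidden, vocab_size)

tokens = torch.randint(1, vocab_size, (1, 6))   # w1 .. w6 (toy data)
masked = tokens.clone()
masked[0, 2] = MASK_ID                          # replace w3 with the MASK token

h = encoder(embed(masked))                      # every position sees the whole input
logits = to_vocab(h[0, 2])                      # prediction at the masked position
loss = F.cross_entropy(logits.unsqueeze(0), tokens[0, 2].unsqueeze(0))
loss.backward()
```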
Masking Input: is random masking good enough?
• Whole Word Masking (WWM) [Cui, et al., arXiv’19]
• Phrase-level & entity-level masking: Enhanced Representation through Knowledge Integration (ERNIE) [Sun, et al., ACL’19]
Source of image: https://arxiv.org/abs/1906.08101
SpanBERT [Joshi, et al., TACL’20]
Source of image: https://arxiv.org/abs/1907.10529
SpanBERT – Span Boundary Objective (SBO)
A contiguous span of tokens is masked (here w6 is missing from w1 … w10). The SBO head uses the representations at the boundary of the masked span together with a position index (here 3) to predict the missing token w6.
SpanBERT – Span Boundary Objective (SBO)
Another example: position index 2 recovers w5. Forcing the span boundary representations to carry information about the whole span may be useful in coreference?
XLNet [Yang, et al., NeurIPS’19] (built on Transformer-XL)
Language-model view: the tokens of 深 度 學 習 ("deep learning"; the digits mark their original positions) are shuffled into different orders, and the model still predicts the next token from the tokens it has seen so far.
XLNet [Yang, et al., NeurIPS’19]
Each training example uses a different random ordering, so the same target token is predicted from different subsets of its context.
XLNet [Yang, et al., NeurIPS’19]
BERT view: the token at the MASK position in 深 度 學 習 is predicted from only a randomly chosen part of the rest of the sentence, not from the whole sentence every time.
BERT cannot talk?
Generation means: given a partial sequence w1 w2 w3, predict the next token w4.
• LM-style models are born for this.
• BERT-style models have never seen a partial sequence during pre-training; they always read a complete (masked) sentence, e.g., w1 w2 w3 MASK.
Here we are limited to autoregressive models (non-autoregressive generation next time).
MASS / BART
• The pre-trained model is a typical seq2seq model: an encoder reads a corrupted version of the input (w1 w2 w3 w4), and a decoder, attending to the encoder, must reconstruct the original input.
MAsked Sequence to Sequence pre-training (MASS) [Song, et al., ICML’19]
Bidirectional and Auto-Regressive Transformers (BART) [Lewis, et al., arXiv’19]
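A minimal sketch of this denoising seq2seq objective with toy sizes; the real MASS/BART corruption schemes and details such as a start-of-sequence token are simplified away:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, MASK_ID = 1000, 64, 0
embed = nn.Embedding(vocab_size, d_model)
seq2seq = nn.Transformer(d_model=d_model, nhead=4, num_encoder_layers=2,
                         num_decoder_layers=2, batch_first=True)
to_vocab = nn.Linear(d_model, vocab_size)

original = torch.randint(1, vocab_size, (1, 6))   # w1 .. w6
corrupted = original.clone()
corrupted[0, 2:4] = MASK_ID                       # corrupt the encoder input

# Teacher forcing: the decoder sees the original shifted by one and reconstructs it.
decoder_in, target = original[:, :-1], original[:, 1:]
causal = seq2seq.generate_square_subsequent_mask(decoder_in.size(1))
out = seq2seq(embed(corrupted), embed(decoder_in), tgt_mask=causal)
loss = F.cross_entropy(to_vocab(out).reshape(-1, vocab_size), target.reshape(-1))
loss.backward()
```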
Input Corruption
Starting from the input A B [SEP] C D E, BART considers several ways of corrupting the encoder input:
• Masking tokens (as in MASS)
• Deletion (delete "D"): A B [SEP] C E
• Text Infilling: a whole span is replaced by a single mask
• Permutation: C D E [SEP] A B
• Rotation: D E A B [SEP] C
• Permutation / Rotation do not perform well.
• Text Infilling is consistently good.
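A small sketch of what these corruptions look like on a token list; the exact span sampling in BART is more involved, this only illustrates each transformation:

```python
import random

MASK = "[MASK]"

def delete_token(tokens, i):
    return tokens[:i] + tokens[i + 1:]

def mass_style_mask(tokens, i):
    return tokens[:i] + [MASK] + tokens[i + 1:]

def text_infilling(tokens, start, length):
    # Replace a whole span with a single MASK token.
    return tokens[:start] + [MASK] + tokens[start + length:]

def rotation(tokens, k):
    return tokens[k:] + tokens[:k]

def permutation(sentences):
    return random.sample(sentences, len(sentences))

tokens = ["A", "B", "[SEP]", "C", "D", "E"]
print(delete_token(tokens, 4))        # ['A', 'B', '[SEP]', 'C', 'E']
print(text_infilling(tokens, 3, 2))   # ['A', 'B', '[SEP]', '[MASK]', 'E']
print(rotation(tokens, 3))            # ['C', 'D', 'E', 'A', 'B', '[SEP]']
```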
BART/MASS vs. UniLM [Dong, et al., NeurIPS’19]
BART/MASS use two models: an encoder over the corrupted input (w1 w2 w3 w4) and a decoder, with attention, that reconstructs it. UniLM is a single model trained simultaneously as an encoder (BERT-style), a decoder (GPT-style), and a seq2seq model.
Source of image: https://arxiv.org/pdf/1905.03197.pdf
BERT, GPT, BART/MASS, UniLM
Replace or Not? ELECTRA
Efficiently Learning an Encoder that Classifies Token Replacements Accurately (ELECTRA)
Example: in "the chef ate the meal", the token "ate" has replaced the original "cooked". For every input position the model predicts whether the token was replaced (NO NO YES NO NO).
Predicting yes/no is easier than reconstruction, and every output position is used.
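A minimal sketch of replaced-token detection with toy sizes: a small masked-LM "generator" proposes a replacement, and the main model classifies every position as replaced or not (the generator's own MLM training step is omitted; as the next slide notes, this is not a GAN):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden, MASK_ID = 1000, 64, 0

def tiny_encoder():
    layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=2)

embed = nn.Embedding(vocab_size, hidden)
generator, discriminator = tiny_encoder(), tiny_encoder()
gen_head = nn.Linear(hidden, vocab_size)   # small BERT: predicts the masked token
disc_head = nn.Linear(hidden, 1)           # per-position replaced / original decision

original = torch.randint(1, vocab_size, (1, 5))
masked = original.clone()
masked[0, 2] = MASK_ID

# The small generator fills in the masked position.
gen_logits = gen_head(generator(embed(masked)))
filled = original.clone()
filled[0, 2] = gen_logits[0, 2].argmax()

# The main model answers YES/NO at every position: was this token replaced?
is_replaced = (filled != original).float()
disc_logits = disc_head(discriminator(embed(filled))).squeeze(-1)
loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)
loss.backward()
```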
Where does the replacement come from? A small BERT reads "the chef [MASK] the meal" and proposes a plausible word such as "ate"; that word is put into the sentence, and the main model again labels each position as replaced or not (NO NO YES NO NO).
Note: this is not a GAN.
Source of image: https://arxiv.org/abs/2003.10555
Sentence Level
So far the model produces a representation for each token; some tasks need a representation for the whole sequence.
You shall know a sentence by the company it keeps?
• Skip Thought: an encoder reads a sentence (w1 w2 w3) and a decoder is trained to generate the next sentence (w4 w5 w6 w7).
• Quick Thought: encode the current sentence and a candidate next sentence with encoders, and train the representations of truly consecutive sentences to have high similarity (Consecutive? Yes).
Input: [CLS] w1 w2 [SEP] w3 w4 w5; the output at [CLS] answers Yes/No.
• NSP (next sentence prediction): used in the original BERT; RoBERTa [Liu, et al., arXiv’19] found it not very helpful.
• SOP (sentence order prediction): the two segments are always consecutive, and the model predicts whether they are in the correct order; used in ALBERT.
• structBERT (Alice) [Wang, et al., ICLR’20]
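A small sketch of how NSP and SOP training pairs can be constructed from raw sentences; the documents, the 50/50 sampling, and the label convention are all illustrative assumptions:

```python
import random

doc = ["sentence one .", "sentence two .", "sentence three ."]
other_doc = ["an unrelated sentence ."]

def nsp_example(doc, other_doc):
    """Next sentence prediction: is the second segment the true next sentence?"""
    i = random.randrange(len(doc) - 1)
    if random.random() < 0.5:
        return (doc[i], doc[i + 1]), 1              # true next sentence
    return (doc[i], random.choice(other_doc)), 0    # sentence from another document

def sop_example(doc):
    """Sentence order prediction: are two consecutive sentences in the right order?"""
    i = random.randrange(len(doc) - 1)
    if random.random() < 0.5:
        return (doc[i], doc[i + 1]), 1              # correct order
    return (doc[i + 1], doc[i]), 0                  # swapped order

print(nsp_example(doc, other_doc))
print(sop_example(doc))
```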
T5 – Comparison
• Text-to-Text Transfer Transformer (T5)
• Colossal Clean Crawled Corpus (C4)
[Raffel, et al., arXiv’19]
Knowledge
• Enhanced Language RepresentatioN with Informative Entities (ERNIE): combine the pre-trained model with external knowledge. This is another story ……
Audio BERT
The same BERT-style idea applied to speech, e.g., the audio of 深 度 學 習 ("deep learning"). This is another story ……
Reference
• [Lewis, et al., arXiv’19] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer, BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, arXiv, 2019
• [Raffel, et al., arXiv’19] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, arXiv, 2019
• [Joshi, et al., TACL’20] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, Omer Levy, SpanBERT: Improving Pre-training by Representing and Predicting Spans, TACL, 2020
• [Song, et al., ICML’19] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu, MASS: Masked Sequence to Sequence Pre-training for Language Generation, ICML, 2019
• [Zafrir, et al., NeurIPS workshop 2019] Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat, Q8BERT: Quantized 8Bit BERT, NeurIPS workshop 2019
• [Houlsby, et al., ICML’19] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, Sylvain Gelly, Parameter-Efficient Transfer Learning for NLP, ICML, 2019
• [Hao, et al., EMNLP’19] Yaru Hao, Li Dong, Furu Wei, Ke Xu, Visualizing and Understanding the Effectiveness of BERT, EMNLP, 2019
• [Liu, et al., arXiv’19] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach, arXiv, 2019
• [Sanh, et al., NeurIPS workshop’19] Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, NeurIPS workshop, 2019
• [Jian, et al., arXiv’19] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu, TinyBERT: Distilling BERT for Natural Language Understanding, arXiv, 2019
• [Shoeybi, et al., arXiv’19] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, arXiv, 2019
• [Lan, et al., ICLR’20] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut, ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, ICLR, 2020
• [Kitaev, et al., ICLR’20] Nikita Kitaev, Lukasz Kaiser, Anselm Levskaya, Reformer: The Efficient Transformer, ICLR, 2020
• [Beltagy, et al., arXiv’20] Iz Beltagy, Matthew E. Peters, Arman Cohan, Longformer: The Long-Document Transformer, arXiv, 2020
• [Dai, et al., ACL’19] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov, Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context, ACL, 2019
• [Peters, et al., NAACL’18] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer, Deep contextualized word representations, NAACL, 2018
• [Sun, et al., ACL’20] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou, MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices, ACL, 2020
• [Pennington, et al., EMNLP’14] Jeffrey Pennington, Richard Socher, Christopher Manning, Glove: Global Vectors for Word Representation, EMNLP, 2014
• [Mikolov, et al., NIPS’13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean, Distributed Representations of Words and Phrases and their Compositionality, NIPS, 2013
• [Bojanowski, et al., TACL’17] Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information, TACL, 2017
• [Su, et al., EMNLP’17] Tzu-Ray Su, Hung-Yi Lee, Learning Chinese Word Representations From Glyphs Of Characters, EMNLP, 2017
• [Liu, et al., ACL’19] Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao, Multi-Task Deep Neural Networks for Natural Language Understanding, ACL, 2019
• [Stickland, et al., ICML’19] Asa Cooper Stickland, Iain Murray, BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning, ICML, 2019
• [Howard, et al., ACL’18] Jeremy Howard, Sebastian Ruder, Universal Language Model Fine-tuning for Text Classification, ACL, 2018
• [Alec, et al., 2018] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, Improving Language Understanding by Generative Pre-Training, 2018
• [Devlin, et al., NAACL’19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL, 2019
• [Alec, et al., 2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, Language Models are Unsupervised Multitask Learners, 2019
• [Wang, et al., ICLR’20] Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi Bao, Jiangnan Xia, Liwei Peng, Luo Si, StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding, ICLR, 2020
• [Yang, et al., NeurIPS’19] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le, XLNet: Generalized Autoregressive Pretraining for Language Understanding, NeurIPS, 2019
• [Cui, et al., arXiv’19] Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, Guoping Hu, Pre-Training with Whole Word Masking for Chinese BERT, arXiv, 2019
• [Sun, et al., ACL’19] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu, ERNIE: Enhanced Representation through Knowledge Integration, ACL, 2019
• [Dong, et al., NeurIPS’19] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon, Unified Language Model Pre-training for Natural Language Understanding and Generation, NeurIPS, 2019