Transformer & BERT
speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2019/Lecture... · 2019. 5. 31.
[Page 1]
Transformer
李宏毅
Hung-yi Lee
[Page 2]
Transformer Seq2seq model with “Self-attention”
[Page 3]
Using CNN to replace RNN

[Figure: an RNN layer maps the previous layer's outputs a^1, a^2, a^3, a^4 to the next layer's b^1, b^2, b^3, b^4 one step at a time. Hard to parallelize!]
[Page 4]
Using CNN to replace RNN

[Figure: a CNN layer maps a^1 ... a^4 to b^1 ... b^4. The RNN is hard to parallelize, but the CNN can be parallelized, and filters in higher layers can consider a longer span of the sequence.]
[Page 5]
Self-Attention

[Figure: a Self-Attention Layer, like an RNN layer, maps inputs a^1 ... a^4 to outputs b^1 ... b^4.]

b^1, b^2, b^3, b^4 can be computed in parallel.
Each b^i is obtained based on the whole input sequence.
You can try to replace anything that has been done by an RNN with self-attention.
[Page 6]
Self-attention ("Attention Is All You Need", https://arxiv.org/abs/1706.03762)

Each input x^i is embedded as a^i = W x^i, and from each a^i three vectors are computed:

q^i = W^q a^i  (query: to match others)
k^i = W^k a^i  (key: to be matched)
v^i = W^v a^i  (value: information to be extracted)
[Page 7]
Self-attention: take each query q and attend over every key k.

Scaled Dot-Product Attention: α_{1,i} = q^1 · k^i / √d, where d is the dimension of q and k.
[Page 8]
Self-attention: the scores α_{1,1}, α_{1,2}, α_{1,3}, α_{1,4} go through a softmax:

α̂_{1,i} = exp(α_{1,i}) / Σ_j exp(α_{1,j})
[Page 9]
Self-attention: weighting the values by the softmaxed scores gives the first output,

b^1 = Σ_i α̂_{1,i} v^i

which considers the whole sequence.
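The per-query computation above (scaled dot product, softmax, weighted sum) can be sketched in a few lines of NumPy. The function name and array layout are illustrative choices, not from the slides:

```python
import numpy as np

def attend_one_query(q1, K, V):
    """Compute b^1 from query q^1, keys K (rows k^i), values V (rows v^i)."""
    d = q1.shape[0]                                    # d is the dim of q and k
    alpha = K @ q1 / np.sqrt(d)                        # alpha_{1,i} = q^1 . k^i / sqrt(d)
    alpha_hat = np.exp(alpha) / np.exp(alpha).sum()    # softmax over all positions
    return alpha_hat @ V                               # b^1 = sum_i alpha_hat_{1,i} v^i
```

One sanity check: if all keys are identical, the softmax is uniform and b^1 is simply the mean of the value vectors.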
[Page 10]
Self-attention: likewise for the second output, take the query q^2 and attend over every key:

b^2 = Σ_i α̂_{2,i} v^i
[Page 11]
Self-attention, recapped:

[Figure: the Self-Attention Layer maps x^1 ... x^4 (via a^1 ... a^4) to b^1, b^2, b^3, b^4.]

b^1, b^2, b^3, b^4 can be computed in parallel.
[Page 12]
Self-attention in matrix form: stack a^1, a^2, a^3, a^4 as the columns of a matrix I. Then all the queries, keys, and values come from three matrix multiplications:

Q = W^q I  (columns q^1 ... q^4)
K = W^k I  (columns k^1 ... k^4)
V = W^v I  (columns v^1 ... v^4)
[Page 13]
Self-attention: the scores for query q^1 are dot products with every key (ignoring √d for simplicity):

α_{1,1} = q^1 · k^1,  α_{1,2} = q^1 · k^2,  α_{1,3} = q^1 · k^3,  α_{1,4} = q^1 · k^4

Stacking the keys as the rows of one matrix, [α_{1,1}, α_{1,2}, α_{1,3}, α_{1,4}]^T = [k^1, k^2, k^3, k^4]^T q^1, so all four scores come from a single matrix-vector product.
[Page 14]
Self-attention: putting all the queries side by side as the columns of Q, every score is computed in one matrix product:

A = K^T Q,  with entries α_{j,i} = k^i · q^j

A column-wise softmax of A gives Â, and b^2 = Σ_i α̂_{2,i} v^i as before.
[Page 15]
Self-attention: the outputs are one more matrix product:

O = V Â

where column j of O is b^j = Σ_i α̂_{j,i} v^i.
[Page 16]
Self-attention, summarized. In the end it is just a pile of matrix multiplications, which a GPU can accelerate:

Q = W^q I,  K = W^k I,  V = W^v I
A = K^T Q,  Â = softmax(A)  (column-wise)
O = V Â
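These matrix equations translate directly into NumPy. A minimal sketch (the function name and the √d scaling, which this summary slide omits, are my additions):

```python
import numpy as np

def self_attention(I, Wq, Wk, Wv):
    """I has a^1 ... a^N as columns; returns O with b^1 ... b^N as columns."""
    Q, K, V = Wq @ I, Wk @ I, Wv @ I           # Q = W^q I, K = W^k I, V = W^v I
    A = K.T @ Q / np.sqrt(Q.shape[0])          # A = K^T Q (scaled by sqrt(d))
    A_hat = np.exp(A) / np.exp(A).sum(axis=0)  # softmax over each column of A
    return V @ A_hat                           # O = V A_hat
```

Every step is a matrix multiplication over the whole sequence at once, which is exactly why this layer parallelizes where an RNN cannot.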
[Page 17]
Multi-head Self-attention (2 heads as example): each query, key, and value is split into per-head versions,

q^i = W^q a^i,  q^{i,1} = W^{q,1} q^i,  q^{i,2} = W^{q,2} q^i

and likewise for k^{i,1}, k^{i,2}, v^{i,1}, v^{i,2} (and for every other position j). Head 1 attends using only the head-1 queries, keys, and values, producing b^{i,1}.
[Page 18]
Multi-head Self-attention (2 heads as example): head 2 likewise attends using only q^{i,2}, k^{i,2}, v^{i,2} (with q^i = W^q a^i, q^{i,1} = W^{q,1} q^i, q^{i,2} = W^{q,2} q^i as before), producing b^{i,2}.
[Page 19]
Multi-head Self-attention (2 heads as example): the head outputs are concatenated and projected back down:

b^i = W^O [b^{i,1} ; b^{i,2}]
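A sketch of the 2-head case in NumPy, reusing single-head attention as a helper. All names, and the use of random projection matrices, are illustrative assumptions:

```python
import numpy as np

def attend(Q, K, V):
    """Single-head attention: columns of Q, K, V are per-position vectors."""
    A = K.T @ Q / np.sqrt(Q.shape[0])
    A_hat = np.exp(A) / np.exp(A).sum(axis=0)
    return V @ A_hat

def two_head_attention(Q, K, V, Wq1, Wq2, Wk1, Wk2, Wv1, Wv2, Wo):
    # split each q^i, k^i, v^i into two per-head versions via W^{q,1}, W^{q,2}, ...
    b1 = attend(Wq1 @ Q, Wk1 @ K, Wv1 @ V)   # head 1: columns are b^{i,1}
    b2 = attend(Wq2 @ Q, Wk2 @ K, Wv2 @ V)   # head 2: columns are b^{i,2}
    # concatenate the heads and project back: b^i = W^O [b^{i,1}; b^{i,2}]
    return Wo @ np.concatenate([b1, b2], axis=0)
```

Each head runs the same attention computation on its own smaller subspace; only W^O mixes the heads together.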
[Page 20]
Positional Encoding

• Self-attention by itself contains no position information.
• Original paper: each position i gets a unique positional vector e^i (hand-crafted, not learned from data), which is added to the embedding before computing q^i, k^i, v^i: the layer input is a^i = W x^i + e^i.
• Equivalent view: append to each x^i a one-hot vector p^i whose i-th dim is 1 (0 elsewhere). Splitting the weight matrix as W = [W^I  W^P] gives

W [x^i ; p^i] = W^I x^i + W^P p^i = a^i + e^i
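A quick NumPy check of that identity, with random (purely illustrative) W^I and W^P:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 3                              # embedding dim, sequence length
WI = rng.standard_normal((d, d))         # multiplies the word part x^i
WP = rng.standard_normal((d, n))         # multiplies the one-hot position part p^i
W = np.concatenate([WI, WP], axis=1)     # W = [W^I  W^P]

i = 1
x = rng.standard_normal(d)
p = np.zeros(n)
p[i] = 1.0                               # p^i: 1 in the i-th dim, 0 elsewhere

a_plus_e = W @ np.concatenate([x, p])    # W [x^i ; p^i]
assert np.allclose(a_plus_e, WI @ x + WP @ p)   # equals a^i + e^i
assert np.allclose(WP @ p, WP[:, i])     # e^i is just the i-th column of W^P
```

So "append a one-hot position" and "add a positional vector" are the same operation, with e^i living in the columns of W^P.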
[Page 21]
Source of image: http://jalammar.github.io/illustrated-transformer/

[Figure: visualization of the positional encoding vectors (the columns of W^P), with values ranging from -1 to 1.]
[Page 22]
Seq2seq with Attention

Review: https://www.youtube.com/watch?v=ZjfjPzXw6og&feature=youtu.be

[Figure: an encoder maps x^1 ... x^4 to h^1 ... h^4; a decoder emits o^1, o^2, o^3 from attention contexts c^1, c^2, c^3. Both the encoder and the decoder can be replaced by Self-Attention Layers.]
[Page 23]
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
[Page 24]
Transformer

Encoder and Decoder, using Chinese-to-English translation as an example: the encoder reads "機器學習" ("machine learning"); the decoder starts from <BOS>, outputs "machine", is fed "machine" back as input, and then outputs "learning".
[Page 25]
Transformer layer details:

• Residual connection: the sub-layer output b' is added to its input a, giving a + b'.
• Masked Multi-Head Attention in the decoder attends only on the already-generated sequence; the encoder-decoder attention attends on the input sequence.
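The decoder's masked attention can be sketched by blocking every score where the key position comes after the query position. This follows the matrix-form convention from the earlier slides (column j of A holds the scores for query j); the function name and the -1e9 fill value are my choices:

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Columns of Q, K, V are per-position vectors; causal (decoder) attention."""
    n = Q.shape[1]
    A = K.T @ Q / np.sqrt(Q.shape[0])     # A[i, j] = score of key i for query j
    blocked = np.tril(np.ones((n, n)), k=-1).astype(bool)   # key i > query j
    A = np.where(blocked, -1e9, A)        # not-yet-generated positions get ~zero weight
    A_hat = np.exp(A) / np.exp(A).sum(axis=0)               # column-wise softmax
    return V @ A_hat
```

With the mask in place, position 1 can only look at itself, position 2 at positions 1-2, and so on, matching left-to-right generation.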
• Layer Norm (https://arxiv.org/abs/1607.06450; see also https://www.youtube.com/watch?v=BZh1ltr5Rkg) normalizes each single example across its features to μ = 0, σ = 1. Batch Norm instead normalizes each feature across a batch to μ = 0, σ = 1, so it depends on the batch size.
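The distinction in NumPy, as a bare-bones sketch without the learned scale and bias parameters that real implementations add:

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    # normalize each example (row) across its own features: mu = 0, sigma = 1
    mu = X.mean(axis=1, keepdims=True)
    sigma = X.std(axis=1, keepdims=True)
    return (X - mu) / (sigma + eps)

def batch_norm(X, eps=1e-5):
    # normalize each feature (column) across the batch: mu = 0, sigma = 1
    mu = X.mean(axis=0, keepdims=True)
    sigma = X.std(axis=0, keepdims=True)
    return (X - mu) / (sigma + eps)
```

Because Layer Norm needs only one example at a time, it works regardless of batch size, which is one reason it is the choice inside the Transformer.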
[Page 26]
Attention Visualization
https://arxiv.org/abs/1706.03762
[Page 27]
Attention Visualization
The encoder self-attention distribution for the word “it” from the 5th to the 6th layer of a
Transformer trained on English to French translation (one of eight attention heads).
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
[Page 28]
Multi-head Attention
[Page 29]
Example Application

• If you can use seq2seq, you can use the Transformer.

https://arxiv.org/abs/1801.10198

[Figure: a Summarizer generating a summary from a Document Set.]
[Page 30]
Universal Transformer
https://ai.googleblog.com/2018/08/moving-beyond-translation-with.html
[Page 31]
https://arxiv.org/abs/1805.08318
Self-Attention GAN