Attention Is All You Need (Vaswani et al. 2017)
• Popularized self-attention
• Created the general-purpose Transformer architecture for sequence modeling
• Demonstrated computational savings over other models
Transformers: High-Level
• Sequence-to-sequence model with encoder and decoder
[Diagram: Encoder → Decoder]
Attention as Representations
• Attention is generally used to score existing encoder representations
• Why not use them as representations?
This movie rocks !
Self-Attention
• Every element sees itself in its context
• Attention weight corresponds to an “importance” signal
[Diagram: each token in “This movie rocks !” attends to every token in the same sentence, including itself]
Self-Attention: Formalized
• Score the energy between a query Q and a key K → scalar
• Use the softmaxed energies to take a weighted average of the values V → scalar × vector
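As a concrete reference, here is a minimal NumPy sketch of scaled dot-product attention as defined in the paper (function and variable names are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # energies: (n_queries, n_keys)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                                        # weighted average of the values
```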
Self-Attention: Example
• Score “she” (Q) against “Susan” and “the” (both K and V) in “Susan dropped the plate. She is clumsy”
[Diagram: the query “she” attends to the keys/values “the” and “Susan” with weights 0.3 and 0.7, respectively]
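A toy version of this example, where the 0.3/0.7 weights come from the slide and the value vectors are made up for illustration:

```python
import numpy as np

# Attention weights from the slide: "she" attends to "the" (0.3) and "Susan" (0.7)
weights = np.array([0.3, 0.7])            # over ["the", "Susan"]
V = np.array([[0.0, 1.0],                 # made-up value vector for "the"
              [2.0, 0.5]])                # made-up value vector for "Susan"
representation = weights @ V              # 0.3 * v_the + 0.7 * v_Susan = [1.4, 0.65]
```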
Masked Self-Attention
• Modeling temporality requires enforcing causal relationships
• Mask out illegal connections in self-attention map
[Diagram: causal self-attention map over “she went to the store”; each token may attend only to itself and earlier tokens]
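A minimal sketch of the causal mask in the same NumPy style as the earlier attention sketch (names are illustrative):

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Causal self-attention: position i may only attend to positions <= i."""
    n, d_k = Q.shape[0], K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    illegal = np.triu(np.ones((n, n), dtype=bool), k=1)        # True above the diagonal (future tokens)
    scores = np.where(illegal, -1e9, scores)                   # mask out illegal connections
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```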
Multi-Head Self-Attention
• Problem: Self-attention is just a weighted average; how do we model complex relationships?
[Diagram: “We went to the store at 7pm” feeds a single self-attention block with one set of Q, K, V]
Multi-Head Self-Attention
• Solution: Use multiple self-attention heads!
[Diagram: “We went to the store at 7pm” feeds several self-attention heads in parallel, each with its own Q, K, V]
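In the paper, each head applies its own learned projections of the input to form Q, K, and V; the head outputs are concatenated and projected back to the model dimension. A compact NumPy sketch, with the per-head projection matrices passed in explicitly (names and shapes are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo):
    """X: (seq_len, d_model); Wq/Wk/Wv: lists of per-head projections; Wo: output projection."""
    heads = []
    for Wq_h, Wk_h, Wv_h in zip(Wq, Wk, Wv):
        Q, K, V = X @ Wq_h, X @ Wk_h, X @ Wv_h                 # each head has its own Q, K, V
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))            # per-head attention weights
        heads.append(A @ V)                                    # per-head weighted averages
    return np.concatenate(heads, axis=-1) @ Wo                 # concat heads, project back to d_model
```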
Position-wise FFNN
• A feed-forward network mixes the multi-head self-attention outputs by operating on each position independently
[Diagram: for each position (“we”, “went”), the outputs of Heads 1–3 are concatenated and passed through the same FFNN, giving one hidden vector per position; the result is sequence length × hidden dim]
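A minimal sketch of the position-wise network; the paper defines it as FFN(x) = max(0, xW1 + b1)W2 + b2, applied identically at every position:

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply the same two-layer network to every position independently.
    X: (seq_len, d_model), e.g. the projected multi-head outputs."""
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2   # ReLU(X W1 + b1) W2 + b2, per position
```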
Positional Embeddings
• No convolutions or recurrence; sinusoids inject positional information into the model
• Each embedding is a function of position and dimension
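The paper's sinusoids are PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A NumPy sketch, assuming an even d_model:

```python
import numpy as np

def sinusoidal_positional_embeddings(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe
```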
Transformer: Full Model
Vaswani et al. (2017)
Results: MT
Vaswani et al. (2017)
Results: Constituency Parsing
Vaswani et al. (2017)
Why Transformers?
• Self-attention is flexible
• Highly modular and extensible
• Demonstrated empirical performance
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al. 2019)
• Deep bidirectional Transformer architecture for (masked) language modeling
• Advances SOTA on 11 NLP tasks including GLUE, MNLI, and SQuAD
Background: ELMo
• Revitalized research in pretraining: creating unsupervised tasks from large unlabeled corpora (e.g., word2vec)
[Diagram: ELMo over “this movie rocks !”: a Char CNN produces a word embedding for each token; a forward LSTM and a backward LSTM run over the sequence, and their states are combined into a contextual embedding per token]
BERT
• Deeply bidirectional, as opposed to ELMo (only a shallow concatenation of LMs)
• Introduces two pretraining tasks:
• Masked Language Modeling
• Next Sentence Prediction
Pretraining: Masked Language Modeling
• Problem: bidirectional language modeling not possible as each token “sees” itself in context
• Solution: introduce a cloze-style task where the model tries to predict the missing word ([MASK])
Inputs: [MASK] went [MASK] the store
Outputs: we went to the store
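A rough sketch of how such masked inputs can be generated; the 15% masking rate and the 80/10/10 split ([MASK] / random token / unchanged) follow the BERT paper, while the helper itself is illustrative:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Sketch of BERT-style masking: select ~15% of positions as prediction targets."""
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok                       # the model must predict the original token
            r = random.random()
            if r < 0.8:
                inputs[i] = "[MASK]"               # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(vocab)   # 10%: replace with a random token
            # 10%: keep the original token unchanged
    return inputs, targets

# e.g. mask_tokens("we went to the store".split(), vocab=["we", "went", "to", "the", "store"])
```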
Pretraining: Next Sentence Prediction
• To learn inter-sentential relationships, determine whether sentence B follows sentence A; negative pairs are created by randomly sampling sentence B
Sentence A: “I went to the store at 7pm.” / Sentence B: “The store had lots of fruit!”
Sentence A: “Selena Gomez is an American singer.” / Sentence B: “Variational autoencoders are cool.”
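A sketch of how such pairs might be built; the 50/50 split between true next sentences and randomly sampled ones follows the paper, while the helper itself is illustrative:

```python
import random

def make_nsp_example(doc_sentences, all_sentences):
    """Sketch: 50% of the time B is the true next sentence, otherwise a random one."""
    i = random.randrange(len(doc_sentences) - 1)   # assumes the document has >= 2 sentences
    sent_a = doc_sentences[i]
    if random.random() < 0.5:
        sent_b, label = doc_sentences[i + 1], "IsNext"
    else:
        sent_b, label = random.choice(all_sentences), "NotNext"
    return sent_a, sent_b, label
```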
BERT: Inputs
Devlin et al. (2019)
BERT: Pretraining
[Diagram: the input “[CLS] Segment A [SEP] Segment B [SEP]” passes through BERT; the [CLS] output makes the NSP prediction, and masked positions are predicted from their outputs]
• Sentence representations are stored in [CLS]
• Bidirectional representations are used to predict [MASK]
BERT: Fine-Tuning
[Diagram: the input “[CLS] Premise [SEP] Hypothesis [SEP]” passes through BERT; the [CLS] features feed an MLP that outputs class probabilities, e.g. Entailment 0.8, Contradiction 0.05, Neutral 0.15]
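Not part of the original slides, but for reference, this is roughly how such a fine-tuning setup looks with the Hugging Face transformers library (the checkpoint name and the example pair are illustrative):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

# Premise / hypothesis are packed as "[CLS] premise [SEP] hypothesis [SEP]"
enc = tokenizer("Susan dropped the plate.", "She is clumsy.", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits          # shape (1, 3): one score per NLI class
probs = logits.softmax(dim=-1)            # the classification head is then fine-tuned on labeled pairs
```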
Results: GLUE
Devlin et al. (2019)
Results: SQuAD
Devlin et al. (2019)
Ablation: Pretraining Tasks
• No NSP: BERT trained without next sentence prediction
• LTR & No NSP: regular left-to-right LM without next sentence prediction
Ablation: Model Size
• Increasing model capacity consistently improves accuracy; this is also consistent with subsequent work (e.g., GPT-2, RoBERTa)
2019: The Year of Pretraining
GPT-2 XLM XLNet RoBERTa
ELECTRA ALBERT T5 BART