Neural Machine Translation
Spring 2020 (2020-03-12)
CMPT 825: Natural Language Processing
!"#!"#$"%&$"'
Adapted from slides from Danqi Chen, Karthik Narasimhan, and Jetic Gu. (with some content from slides from Abigail See, Graham Neubig)
Course Logistics
‣ Project proposal is due today
‣ What problem are you addressing? Why is it interesting?
‣ What specific aspects will your project be on?
‣ Re-implement paper? Compare different methods?
‣ What data do you plan to use?
‣ What is your method?
‣ How do you plan to evaluate? What metrics?
Last time
• Statistical MT
• Word-based
• Phrase-based
• Syntactic
Neural Machine Translation
‣ A single neural network is used to translate from source to target
‣ Architecture: Encoder-Decoder
‣ Two main components:
‣ Encoder: Convert source sentence (input) into a vector/matrix
‣ Decoder: Convert encoding into a sentence in target language (output)
Sequence to Sequence learning (Seq2seq)
• Encode entire input sequence into a single vector (using an RNN)
• Decode one word at a time (again, using an RNN!)
• Beam search for better inference
• Learning is not trivial! (vanishing/exploding gradients)
(Sutskever et al., 2014)
Encoder
Sentence: This cat is cute
[Figure: an RNN encoder unrolled over the source sentence "This cat is cute"; the word embeddings x1, ..., x4 are fed through hidden states h0, ..., h4, and the final hidden state gives the encoded representation henc.]
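To make the picture concrete, here is a minimal PyTorch-style encoder sketch (not the exact model from the slides; the vocabulary, embedding, and hidden sizes are placeholder values):

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Minimal RNN encoder: embeds source tokens and returns all hidden
    states plus the final state h_enc (hypothetical sizes)."""
    def __init__(self, vocab_size=10000, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):
        # src_tokens: (batch, src_len) integer ids, e.g. for "This cat is cute"
        embedded = self.embedding(src_tokens)   # (batch, src_len, emb_dim)
        outputs, h_enc = self.rnn(embedded)     # outputs: all hidden states
        return outputs, h_enc                   # h_enc: (1, batch, hidden_dim)
```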
Decoder
Target sentence: <s> ce chat est mignon <e>
• A conditioned language model
[Figure: an RNN decoder initialized with henc; at each step it embeds the previous target word (<s>, ce, chat, est, mignon) and predicts the next output word (ce, chat, est, mignon, <e>).]
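A matching decoder sketch under the same assumptions (one step at a time, conditioned on the encoder's final state through the hidden-state argument):

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Minimal RNN decoder: a conditioned language model over target words
    (hypothetical sizes; matches the Encoder sketch above)."""
    def __init__(self, vocab_size=10000, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_tokens, hidden):
        # prev_tokens: (batch, 1) id of the previous target word (<s> at the first step)
        # hidden: (1, batch, hidden_dim), initialized with the encoder's h_enc
        embedded = self.embedding(prev_tokens)        # (batch, 1, emb_dim)
        output, hidden = self.rnn(embedded, hidden)   # one decoding step
        logits = self.out(output.squeeze(1))          # (batch, vocab_size) scores
        return logits, hidden
```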
Seq2seq training
‣ Similar to training a language model!
‣ Minimize cross-entropy loss:
   − ∑_{t=1}^{T} log P(y_t | y_1, …, y_{t−1}, x_1, …, x_n)
‣ Back-propagate gradients through both decoder and encoder
‣ Need a really big corpus
English: Machine translation is cool!
Russian: Машинный перевод - это круто!
36M sentence pairs
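With teacher forcing, this loss can be computed roughly as follows (a sketch that assumes the Encoder/Decoder classes above; tensor names are hypothetical):

```python
import torch.nn.functional as F

def seq2seq_loss(encoder, decoder, src_tokens, tgt_tokens):
    """Cross-entropy over the target sequence with teacher forcing.
    tgt_tokens: (batch, tgt_len), starting with <s> and ending with <e>."""
    _, hidden = encoder(src_tokens)
    loss = 0.0
    for t in range(tgt_tokens.size(1) - 1):
        prev = tgt_tokens[:, t:t + 1]            # gold previous word (teacher forcing)
        logits, hidden = decoder(prev, hidden)
        loss = loss + F.cross_entropy(logits, tgt_tokens[:, t + 1])
    return loss
```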
Seq2seq training
(slide credit: Abigail See)
Remember masking
Example mask for a batch of four sequences of different lengths:
1 1 1 1 0 0
1 0 0 0 0 0
1 1 1 1 1 1
1 1 1 0 0 0
Use masking to help compute loss for batched sequences
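One common way to apply such a mask when computing the batched loss (a sketch; assumes logits of shape (batch, len, vocab) and a 0/1 mask as above):

```python
import torch.nn.functional as F

def masked_cross_entropy(logits, targets, mask):
    """logits: (batch, len, vocab); targets, mask: (batch, len).
    Positions where mask == 0 (padding) contribute nothing to the loss."""
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    return (per_token * mask).sum() / mask.sum()
```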
Scheduled Sampling
(figure credit: Bengio et al, 2015)
Possible decay schedules (the probability of using the true y decays over training)
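For illustration, the decay schedules described in Bengio et al. (2015) can be sketched as below, where epsilon is the probability of feeding the gold previous word at training step i and k is a hyperparameter:

```python
import math

def exponential_decay(i, k=0.99):
    # epsilon_i = k**i, with k < 1
    return k ** i

def inverse_sigmoid_decay(i, k=100.0):
    # epsilon_i = k / (k + exp(i / k)), with k >= 1
    return k / (k + math.exp(i / k))
```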
How seq2seq changed the MT landscape
MT Progress
(source: Rico Sennrich)
(Wu et al., 2016)
NMT vs SMT
Pros
‣ Better performance
‣ Fluency
‣ Longer context
‣ Single NN optimized end-to-end
‣ Less engineering
‣ Works out of the box for many language pairs
Cons
‣ Requires more data and compute
‣ Less interpretable
‣ Hard to debug
‣ Uncontrollable
‣ Heavily dependent on data - could lead to unwanted biases
‣ More parameters
Seq2Seq for more than NMT
Task/Application    | Input               | Output
Machine Translation | French              | English
Summarization       | Document            | Short Summary
Dialogue            | Utterance           | Response
Parsing             | Sentence            | Parse tree (as sequence)
Question Answering  | Context + Question  | Answer
Cross-Modal Seq2Seq
Task/Application   | Input         | Output
Speech Recognition | Speech Signal | Transcript
Image Captioning   | Image         | Text
Video Captioning   | Video         | Text
Issues with vanilla seq2seq
‣ A single encoding vector, henc, needs to capture all the information about the source sentence (bottleneck)
‣ Longer sequences can lead to vanishing gradients
‣ Overfitting
Remember alignments?
Attention
‣ The neural MT equivalent of alignment models
‣ Key idea: At each time step during decoding, focus on a particular part of source sentence
‣ This depends on the decoder’s current hidden state (i.e. notion of what you are trying to decode)
‣ Usually implemented as a probability distribution over the hidden states of the encoder (h_i^enc)
Seq2seq with attention
(slide credit: Abigail See)
[Figure: at each decoder step, attention scores over the encoder hidden states give an attention distribution; its weighted sum (the attention output) is combined with the decoder hidden state to predict the next word. The prediction y1 can also be used as input for the next time step.]
Computing attention
‣ Encoder hidden states: h_1^enc, …, h_n^enc
‣ Decoder hidden state at time t: h_t^dec
‣ First, get attention scores for this time step (we will see what g is soon!):
   e^t = [g(h_1^enc, h_t^dec), …, g(h_n^enc, h_t^dec)]
‣ Obtain the attention distribution using softmax:
   α^t = softmax(e^t) ∈ ℝ^n
‣ Compute a weighted sum of encoder hidden states:
   a_t = ∑_{i=1}^{n} α_i^t h_i^enc ∈ ℝ^h
‣ Finally, concatenate with the decoder state and pass on to the output layer:
   [a_t; h_t^dec] ∈ ℝ^{2h}
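A minimal numpy sketch of these steps using dot-product scoring (shapes are assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(enc_states, dec_state):
    """enc_states: (n, h) encoder hidden states; dec_state: (h,) decoder state."""
    scores = enc_states @ dec_state          # e^t: (n,) dot-product scores
    alpha = softmax(scores)                  # attention distribution
    a_t = alpha @ enc_states                 # weighted sum: (h,)
    return np.concatenate([a_t, dec_state])  # [a_t; h_t^dec]: (2h,)
```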
Types of attention
‣ Assume encoder hidden states h_1, h_2, …, h_n and decoder hidden state z
1. Dot-product attention (assumes equal dimensions for h_i and z):
   e_i = g(h_i, z) = z^T h_i ∈ ℝ
2. Multiplicative attention: g(h_i, z) = z^T W h_i ∈ ℝ, where W is a weight matrix
3. Additive attention: g(h_i, z) = v^T tanh(W_1 h_i + W_2 z) ∈ ℝ, where W_1, W_2 are weight matrices and v is a weight vector
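The three scoring functions side by side, as a rough sketch (W, W1, W2, v stand for learned parameters and are just placeholders here):

```python
import numpy as np

def dot_score(h_i, z):
    return z @ h_i                            # z^T h_i

def multiplicative_score(h_i, z, W):
    return z @ W @ h_i                        # z^T W h_i

def additive_score(h_i, z, W1, W2, v):
    return v @ np.tanh(W1 @ h_i + W2 @ z)     # v^T tanh(W1 h_i + W2 z)
```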
Issues with vanilla seq2seq
‣ A single encoding vector, henc, needs to capture all the information about the source sentence (bottleneck)
‣ Longer sequences can lead to vanishing gradients
‣ Overfitting
Dropout
‣ Form of regularization for RNNs (and any NN in general)
‣ Idea: “Handicap” the NN by removing hidden units stochastically
‣ set each hidden unit in a layer to 0 with probability p during training (p = 0.5 usually works well)
‣ scale outputs by 1/(1 − p)
‣ hidden units are forced to learn more general patterns
‣ Test time: Use all activations (no need to rescale)
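A sketch of this scheme, often called inverted dropout (scale by 1/(1 − p) during training so test time needs no rescaling):

```python
import numpy as np

def dropout(h, p=0.5, train=True):
    """h: activations of a layer; p: probability of dropping a unit."""
    if not train:
        return h                              # test time: use all activations
    mask = (np.random.rand(*h.shape) >= p)    # keep each unit with prob 1 - p
    return h * mask / (1.0 - p)               # scale so the expected value is unchanged
```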
Handling large vocabularies
‣ Softmax can be expensive for large vocabularies
‣ English vocabulary size: 10K to 100K
P(y_i) = exp(w_i ⋅ h + b_i) / ∑_{j=1}^{|V|} exp(w_j ⋅ h + b_j)
The normalization over the whole vocabulary is expensive to compute.
Approximate Softmax
‣ Negative Sampling
‣ Structured softmax
‣ Embedding prediction
Negative Sampling
• Softmax is expensive when vocabulary size is large
(figure credit: Graham Neubig)
Negative Sampling
• Sample just a subset of the vocabulary as negatives
• We saw simple negative sampling in word2vec (Mikolov et al. 2013)
(figure credit: Graham Neubig)
Other ways to sample: Importance Sampling (Bengio and Senecal 2003), Noise Contrastive Estimation (Mnih & Teh 2012)
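A rough sketch of the negative-sampling idea (score the true word against only a few sampled negatives instead of the full vocabulary; this is a simplified word2vec-style objective, not any one of the estimators above):

```python
import numpy as np

def negative_sampling_loss(h, W, b, true_id, num_neg=5):
    """h: hidden state (d,); W: output word embeddings (|V|, d); b: biases (|V|,)."""
    V = W.shape[0]
    neg_ids = np.random.choice(V, size=num_neg)            # sampled negative words
    pos = np.log(1 / (1 + np.exp(-(W[true_id] @ h + b[true_id]))))   # true word up
    neg = np.log(1 / (1 + np.exp(W[neg_ids] @ h + b[neg_ids]))).sum()  # negatives down
    return -(pos + neg)
```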
Hierarchical softmax (Morin and Bengio 2005)
(figure credit: Quora)
Class based softmax
‣ Two-layer: cluster words into classes, predict the class and then predict the word within that class.
(Goodman 2001, Mikolov et al. 2011)
‣ Clusters can be based on frequency, random assignment, or word contexts.
(figure credit: Graham Neubig)
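The factorization behind the two-layer softmax, sketched with placeholder class assignments:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def two_layer_softmax_prob(h, class_W, word_W, c, j):
    """P(word) = P(class c | h) * P(word j within class c | h).
    class_W: (num_classes, d); word_W: list of (class_size_c, d) matrices."""
    p_class = softmax(class_W @ h)      # distribution over classes
    p_word = softmax(word_W[c] @ h)     # distribution over words in class c
    return p_class[c] * p_word[j]
```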
Embedding prediction
‣ Directly predict embeddings of outputs themselves
‣ What loss to use? (Kumar and Tsvetkov 2019)
‣ L2? Cosine?
‣ von Mises-Fisher distribution loss: make embeddings close on the unit ball
(slide credit: Graham Neubig)
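For illustration only (not the exact von Mises-Fisher formulation from Kumar and Tsvetkov 2019), a cosine-based embedding prediction loss could look like:

```python
import numpy as np

def cosine_embedding_loss(pred, target_emb):
    """pred: predicted output embedding; target_emb: embedding of the gold word.
    Loss is 1 - cosine similarity (0 when the vectors point the same way)."""
    cos = pred @ target_emb / (np.linalg.norm(pred) * np.linalg.norm(target_emb))
    return 1.0 - cos
```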
Generation
How can we use our model (decoder) to generate sentences?
• Sampling: Try to generate a random sentence according to the probability distribution
• Argmax: Try to generate the best sentence, the sentence with the highest probability
Decoding Strategies
‣ Ancestral sampling
‣ Greedy decoding
‣ Exhaustive search
‣ Beam search
Ancestral Sampling
• Randomly sample words one by one
• Provides diverse output (high variance)
(figure credit: Luong, Cho, and Manning)
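A sketch of ancestral sampling using the one-step decoder interface from earlier (names and ids are assumptions):

```python
import torch

def sample_sentence(decoder, h_enc, sos_id, eos_id, max_len=50):
    """Sample one word at a time from the model's distribution."""
    tokens, hidden = [sos_id], h_enc
    for _ in range(max_len):
        prev = torch.tensor([[tokens[-1]]])
        logits, hidden = decoder(prev, hidden)
        probs = torch.softmax(logits.squeeze(0), dim=-1)
        next_id = torch.multinomial(probs, num_samples=1).item()  # random sample
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```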
Greedy decoding
‣ Compute argmax at every step of decoder to generate word
‣ What’s wrong?
Exhaustive search?
‣ Find arg max_{y_1, …, y_T} P(y_1, …, y_T | x_1, …, x_n)
‣ Requires computing scores for all possible sequences
‣ O(V^T) complexity!
‣ Too expensive
Recall: Beam search (a middle ground)
‣ Key idea: At every step, keep track of the k most probable partial translations (hypotheses)
‣ Score of each hypothesis = log probability:
   ∑_{t=1}^{j} log P(y_t | y_1, …, y_{t−1}, x_1, …, x_n)
‣ Not guaranteed to be optimal
‣ More efficient than exhaustive search
Beam decoding
(slide credit: Abigail See)
Backtrack
(slide credit: Abigail See)
Beam decoding
‣ Different hypotheses may produce the ⟨eos⟩ (end) token at different time steps
‣ When a hypothesis produces ⟨eos⟩, stop expanding it and place it aside
‣ Continue beam search until:
‣ All k hypotheses produce ⟨eos⟩, OR
‣ We hit the max decoding limit T
‣ Select the top hypotheses using the normalized likelihood score:
   (1/T) ∑_{t=1}^{T} log P(y_t | y_1, …, y_{t−1}, x_1, …, x_n)
‣ Otherwise shorter hypotheses have higher scores
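A compact beam-search sketch over the same one-step decoder interface (hypothetical; keeps the k best partial hypotheses, sets finished ones aside, and length-normalizes at the end):

```python
import torch

def beam_search(decoder, h_enc, sos_id, eos_id, k=5, max_len=50):
    # each hypothesis: (log_prob, token_list, hidden_state)
    beams = [(0.0, [sos_id], h_enc)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens, hidden in beams:
            prev = torch.tensor([[tokens[-1]]])
            logits, new_hidden = decoder(prev, hidden)
            log_probs = torch.log_softmax(logits.squeeze(0), dim=-1)
            top_lp, top_ids = log_probs.topk(k)
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((score + lp, tokens + [idx], new_hidden))
        # keep only the k best partial hypotheses
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:k]:
            if cand[1][-1] == eos_id:
                finished.append(cand)      # set finished hypotheses aside
            else:
                beams.append(cand)
        if not beams:
            break
    finished.extend(beams)
    # normalize by length so shorter hypotheses are not unfairly favoured
    return max(finished, key=lambda c: c[0] / len(c[1]))[1]
```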
Under-translation
• Under-translation: crucial information is left untranslated; premature ⟨EOS⟩ generation
• Approaches: beam search, ensembles, coverage models
• Out-of-Vocabulary (OOV) Problem; rare word problem: frequent words are translated correctly, rare words are not
• In any corpus, word frequencies are exponentially unbalanced
[Figure: decoding "nobody expects the spanish inquisition !"; at the step where "inquisition" should be produced, the model assigns 'inquisition': 0.49 and <EOS>: 0.51, so greedy decoding emits <EOS> and the translation ends prematurely.]
Ensemble
• Similar to a voting mechanism, but with probabilities
• Multiple models with different parameters (usually from different checkpoints of the same training run)
• Use the output with the highest combined probability across all models
[Demo: three checkpoints (seq2seq_E047.pt, seq2seq_E048.pt, seq2seq_E049.pt) each decode "prof loves ...". Their next-word distributions differ ('kebab': 0.9, 'cheese': 0.05, 'pizza': 0.05 from one model; 'kebab': 0.3, 'cheese': 0.55, 'pizza': 0.05 from another; 'kebab': 0.3, 'cheese': 0.05, 'pizza': 0.55 from the third); combining them, the ensemble outputs "kebab".]
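A minimal sketch of probability-level ensembling (average the per-step distributions of several models and take the argmax; the toy numbers mirror the demo):

```python
import numpy as np

def ensemble_next_word(distributions):
    """distributions: list of per-model probability vectors over the vocabulary
    for the next word. Average them and pick the argmax."""
    avg = np.mean(np.stack(distributions), axis=0)
    return int(np.argmax(avg))

# Toy distributions from the demo (vocab order: kebab, cheese, pizza)
p1 = np.array([0.9, 0.05, 0.05])
p2 = np.array([0.3, 0.55, 0.05])
p3 = np.array([0.3, 0.05, 0.55])
print(ensemble_next_word([p1, p2, p3]))   # -> 0, i.e. "kebab"
```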
Existing challenges with NMT
‣ Out-of-vocabulary words
‣ Low-resource languages
‣ Long-term context
‣ Common sense knowledge (e.g. hot dog, paper jam)
‣ Fairness and bias
‣ Uninterpretable
Out-of-vocabulary words
• Out-of-Vocabulary (OOV) Problem; Rare word problem
• Frequent words are translated correctly, rare words are not
• In any corpus, word frequencies are exponentially unbalanced
1. https://en.wikipedia.org/wiki/Zipf%27s_law
Zipf’s Law
[Figure: word occurrence count (log scale, 1 to 1,000,000) against frequency rank (1 to 50,000)]
• rare words are exponentially less frequent than frequent words
• e.g. in HW4, 45% | 65% (src | tgt) of the unique words occur only once
Out-of-vocabulary words
• Out-of-Vocabulary (OOV) Problem; Rare word problem
• Copy Mechanisms
• Subword / character models
Copy Mechanism
[Figure: translating the source "prof liebt kebab"; the decoder produces "prof loves <UNK>", and at the <UNK> step the attention weights over the source are αt,1 = 0.1, αt,2 = 0.2, αt,3 = 0.7.]
• Sees <UNK> at step t
• looks at the attention weights αt
• replaces <UNK> with the source word f_{argmax_i αt,i} (here "kebab")
1. CL2015015 [Luong et al.] Addressing the Rare Word Problem in Neural Machine Translation
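A sketch of this UNK-replacement heuristic, assuming we already have the decoded words and the per-step attention weights over the source (the first two attention rows below are made up; only the last matches the slide):

```python
import numpy as np

def replace_unk(output_tokens, attention, source_tokens, unk="<UNK>"):
    """output_tokens: decoded target words; attention: (tgt_len, src_len) weights;
    source_tokens: source words. Replace each <UNK> with the most-attended source word."""
    result = []
    for t, word in enumerate(output_tokens):
        if word == unk:
            src_pos = int(np.argmax(attention[t]))   # argmax_i alpha_{t,i}
            result.append(source_tokens[src_pos])    # copy the source word
        else:
            result.append(word)
    return result

# Toy example based on the slide (attention rows for "prof" and "loves" are invented)
print(replace_unk(["prof", "loves", "<UNK>"],
                  np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.2, 0.7]]),
                  ["prof", "liebt", "kebab"]))   # -> ['prof', 'loves', 'kebab']
```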
*Pointer-Generator Copy Mechanism
• binary classifier switch pg decides whether to use the decoder output or the copy mechanism
• Sees <UNK> at step t, or pg([hdec_t; context_t]) ≤ 0.5
• looks at the attention weights αt
• replaces <UNK> with the source word f_{argmax_i αt,i}
1. CL2016032 [Gulcehre et al.] Pointing the Unknown Words
*Pointer-Based Dictionary Fusion
• Sees <UNK> at step t, or pg([hdec_t; context_t]) ≤ 0.5
• looks at the attention weights αt
• replaces <UNK> with the translation of the source word: dict(f_{argmax_i αt,i})
1. CL2019331 [Gū et al.] Pointer-based Fusion of Bilingual Lexicons into Neural Machine Translation
Subword / character models
• Character based seq2seq models
• Byte pair encoding (BPE)
• Subword units help model morphology
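A small sketch of BPE learning in the spirit of Sennrich et al.'s algorithm (simplified: repeatedly merge the most frequent adjacent symbol pair; the toy corpus is a common illustrative example):

```python
from collections import Counter

def learn_bpe(words, num_merges=10):
    """words: dict mapping a word (as a tuple of symbols) to its corpus frequency."""
    vocab = dict(words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])   # merge the pair
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = new_vocab.get(tuple(out), 0) + freq
        vocab = new_vocab
    return merges

# Toy corpus: "low" x5, "lower" x2, "newest" x6, "widest" x3
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
print(learn_bpe(corpus, num_merges=5))
```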
Existing challenges with NMT
‣ Out-of-vocabulary words
‣ Low-resource languages
‣ Long-term context
‣ Common sense knowledge (e.g. hot dog, paper jam)
‣ Fairness and bias
‣ Uninterpretable
Massively multilingual MT
(Arivazhagan et al., 2019)
‣ Train a single neural network on 103 languages paired with English (remember Interlingua?)
‣ Massive improvements on low-resource languages
Existing challenges with NMT
‣ Out-of-vocabulary words
‣ Low-resource languages
‣ Long-term context
‣ Common sense knowledge (e.g. hot dog, paper jam)
‣ Fairness and bias
‣ Uninterpretable
Next time
‣ Contextualized embeddings
‣ ELMo, BERT, and friends
‣ Transformers
‣ Multi-headed self-attention