Neural Machine Translation
CMPT 825: Natural Language Processing, Spring 2020 (2020-03-12)

Transcript of lecture slides.

Page 1

Neural Machine Translation

Spring 2020 · 2020-03-12

CMPT 825: Natural Language Processing

!"#!"#$"%&$"'

Adapted from slides from Danqi Chen, Karthik Narasimhan, and Jetic Gu. (with some content from slides from Abigail See, Graham Neubig)

Page 2

Course Logistics

‣ Project proposal is due today

‣ What problem are you addressing? Why is it interesting?

‣ What specific aspects will your project be on?

‣ Re-implement paper? Compare different methods?

‣ What data do you plan to use?

‣ What is your method?

‣ How do you plan to evaluate? What metrics?

Page 3

Last time

• Statistical MT

• Word-based

• Phrase-based

• Syntactic

Page 4

Neural Machine Translation

‣ A single neural network is used to translate from source to target

‣ Architecture: Encoder-Decoder

‣ Two main components:

‣ Encoder: Convert source sentence (input) into a vector/matrix

‣ Decoder: Convert encoding into a sentence in target language (output)

Page 5

Sequence to Sequence learning (Seq2seq)

• Encode entire input sequence into a single vector (using an RNN)

• Decode one word at a time (again, using an RNN!)

• Beam search for better inference

• Learning is not trivial! (vanishing/exploding gradients)

(Sutskever et al., 2014)
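To make the encoder-decoder pipeline concrete, here is a minimal PyTorch sketch; the module layout, GRU cells, and sizes are illustrative assumptions, not the lecture's reference implementation.

```python
# Hedged sketch of a seq2seq model: encode the source into one vector,
# then decode one step at a time with teacher forcing.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src, tgt_in):
        # Encode the entire source sequence into a single vector h_enc.
        _, h_enc = self.encoder(self.src_emb(src))
        # Decode conditioned on h_enc; tgt_in is the gold target shifted
        # right (starts with <s>), i.e. teacher forcing during training.
        dec_states, _ = self.decoder(self.tgt_emb(tgt_in), h_enc)
        return self.out(dec_states)  # logits over the target vocabulary

model = Seq2Seq(src_vocab=8000, tgt_vocab=8000)
logits = model(torch.randint(0, 8000, (2, 7)), torch.randint(0, 8000, (2, 6)))
print(logits.shape)  # torch.Size([2, 6, 8000])
```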

Page 6

Encoder

[Figure: RNN encoder unrolled over time; each word embedding x_t updates the hidden state h_t]

Sentence: This cat is cute

Page 7

Encoder

[Figure: the first word is embedded as x_1, and the encoder computes h_1 from h_0 and x_1]

Sentence: This cat is cute

Page 8

Encoder

[Figure: all words are embedded as x_1 … x_4, and the encoder computes hidden states h_1 … h_4]

Sentence: This cat is cute

Page 9

Encoder

[Figure: the final hidden state h_4 is taken as the encoded representation h_enc]

Sentence: This cat is cute

Page 10

Decoder

[Figure: RNN decoder conditioned on h_enc; inputs <s>, ce, chat, est, mignon (word embeddings x′_1 … x′_5) feed hidden states z_1 … z_5, each emitting an output o: ce, chat, est, mignon, <e>]

Page 11

Decoder

[Figure: same decoder; the first output y_1 ("ce") is produced from z_1 and fed in as the next input]

Page 12

Decoder

[Figure: same decoder; outputs y_1, y_2 ("ce", "chat") are produced and fed back as inputs]

Page 13

Decoder

• A conditioned language model

[Figure: the complete decoding run conditioned on h_enc; outputs y_1 … y_5 = ce, chat, est, mignon, <e>]

Page 14

Seq2seq training

‣ Similar to training a language model!

‣ Minimize cross-entropy loss:

‣ Back-propagate gradients through both decoder and encoder

‣ Need a really big corpus

−∑_{t=1}^{T} log P(y_t | y_1, …, y_{t−1}, x_1, …, x_n)

Example (36M sentence pairs):

English: Machine translation is cool!

Russian: Машинный перевод - это круто!

Page 15

Seq2seq training

(slide credit: Abigail See)

Page 16

Remember masking

1 1 1 1 0 0
1 0 0 0 0 0
1 1 1 1 1 1
1 1 1 0 0 0

(one row per sequence in the batch: 1 marks a real token, 0 marks padding)

Use masking to help compute loss for batched sequences
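A minimal sketch of how such a mask enters the loss; the padded-batch layout and pad id of 0 are assumptions.

```python
import torch
import torch.nn.functional as F

def masked_nll(logits, targets, pad_id=0):
    # logits: (batch, time, vocab); targets: (batch, time), padded with pad_id
    logp = F.log_softmax(logits, dim=-1)
    nll = -logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (batch, time)
    mask = (targets != pad_id).float()      # 1 for real tokens, 0 for padding
    return (nll * mask).sum() / mask.sum()  # average over real tokens only
```

The same effect can be had with F.cross_entropy(..., ignore_index=pad_id).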

Page 17

Scheduled Sampling

(figure credit: Bengio et al, 2015)

Possible decay schedules (the probability of using the true y decays over training)
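The decay schedules from Bengio et al. (2015) written out; the constants below are illustrative.

```python
import math

# Probability eps_i of feeding the TRUE previous word at training step i
# (otherwise the model's own sample is fed). Constants are example values.
def linear_decay(i, eps0=1.0, c=1e-4, eps_min=0.1):
    return max(eps_min, eps0 - c * i)

def exponential_decay(i, k=0.9999):        # k < 1
    return k ** i

def inverse_sigmoid_decay(i, k=1000.0):    # k >= 1
    return k / (k + math.exp(i / k))
```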

Page 18

How seq2seq changed the MT landscape

Page 19

MT Progress

(source: Rico Sennrich)

Page 20

(Wu et al., 2016)

Page 21

NMT vs SMT

Pros

‣ Better performance

‣ Fluency

‣ Longer context

‣ Single NN optimized end-to-end

‣ Less engineering

‣ Works out of the box for many language pairs

Cons

‣ Requires more data and compute

‣ Less interpretable

‣ Hard to debug

‣ Uncontrollable

‣ Heavily dependent on data - could lead to unwanted biases

‣ More parameters

Page 22

Seq2Seq for more than NMT

Task/Application Input Output

Machine Translation French English

Summarization Document Short Summary

Dialogue Utterance Response

Parsing Sentence Parse tree (as sequence)

Question Answering Context + Question Answer

Page 23

Cross-Modal Seq2Seq

Task/Application Input Output

Speech Recognition Speech Signal Transcript

Image Captioning Image Text

Video Captioning Video Text

Page 24

Issues with vanilla seq2seq

‣ A single encoding vector, h_enc, needs to capture all the information about the source sentence (bottleneck!)

‣ Longer sequences can lead to vanishing gradients

‣ Overfitting


Page 26

Remember alignments?

Page 27

Attention

‣ The neural MT equivalent of alignment models

‣ Key idea: At each time step during decoding, focus on a particular part of source sentence

‣ This depends on the decoder’s current hidden state (i.e. notion of what you are trying to decode)

‣ Usually implemented as a probability distribution over the hidden states of the encoder (h_i^enc)

Page 28

Seq2seq with attention

(slide credit: Abigail See)

Page 29

Seq2seq with attention

(slide credit: Abigail See)

Page 30

Seq2seq with attention

(slide credit: Abigail See)

Page 31

Seq2seq with attention

(slide credit: Abigail See)

Can also use y_1 as input for the next time step

Page 32

Seq2seq with attention

(slide credit: Abigail See)

Page 33

Computing attention

‣ Encoder hidden states: h_1^enc, …, h_n^enc

‣ Decoder hidden state at time t: h_t^dec

‣ First, get attention scores for this time step (we will see what g is soon!):
  e^t = [g(h_1^enc, h_t^dec), …, g(h_n^enc, h_t^dec)]

‣ Obtain the attention distribution using softmax:
  α^t = softmax(e^t) ∈ ℝ^n

‣ Compute weighted sum of encoder hidden states:
  a_t = ∑_{i=1}^{n} α_i^t h_i^enc ∈ ℝ^h

‣ Finally, concatenate with decoder state and pass on to output layer:
  [a_t; h_t^dec] ∈ ℝ^{2h}
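To ground this, here is a small sketch with dot-product scoring for g; the function and shapes are illustrative, following the slide's notation.

```python
# Hedged sketch: one step of attention with dot-product scoring.
import torch

def attend(h_enc, h_dec_t):
    # h_enc: (n, h) encoder states; h_dec_t: (h,) decoder state at time t
    e_t = h_enc @ h_dec_t                            # scores e^t: (n,)
    alpha_t = torch.softmax(e_t, dim=0)              # attention distribution: (n,)
    a_t = (alpha_t.unsqueeze(1) * h_enc).sum(dim=0)  # weighted sum: (h,)
    return torch.cat([a_t, h_dec_t])                 # [a_t; h_t^dec]: (2h,)

h_enc, h_dec_t = torch.randn(5, 512), torch.randn(512)
print(attend(h_enc, h_dec_t).shape)  # torch.Size([1024])
```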

Page 34

Types of attention

‣ Assume encoder hidden states h_1, h_2, …, h_n and decoder hidden state z

1. Dot-product attention (assumes equal dimensions for h_i and z):
   e_i = g(h_i, z) = z^T h_i ∈ ℝ

2. Multiplicative attention: g(h_i, z) = z^T W h_i ∈ ℝ, where W is a weight matrix

3. Additive attention: g(h_i, z) = v^T tanh(W_1 h_i + W_2 z) ∈ ℝ, where W_1, W_2 are weight matrices and v is a weight vector
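The three scoring functions written out in code; the dimensions are assumed placeholders (dot-product requires h_i and z to have equal dimension).

```python
import torch

d1, d2, d3 = 512, 512, 256
h_i, z = torch.randn(d1), torch.randn(d2)
W = torch.randn(d2, d1)                     # multiplicative weight matrix
W1, W2 = torch.randn(d3, d1), torch.randn(d3, d2)
v = torch.randn(d3)                         # additive weight vector

e_dot = z @ h_i                             # 1. dot-product attention
e_mul = z @ W @ h_i                         # 2. multiplicative attention
e_add = v @ torch.tanh(W1 @ h_i + W2 @ z)   # 3. additive attention
```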

Page 35

Issues with vanilla seq2seq

‣ A single encoding vector, h_enc, needs to capture all the information about the source sentence (bottleneck!)

‣ Longer sequences can lead to vanishing gradients

‣ Overfitting

Page 36

Dropout

‣ Form of regularization for RNNs (and any NN in general)

‣ Idea: "handicap" the NN by removing hidden units stochastically

‣ set each hidden unit in a layer to 0 with probability p during training (p = 0.5 usually works well)

‣ scale the remaining outputs by 1/(1 − p)

‣ hidden units are forced to learn more general patterns

‣ Test time: use all activations (no need to rescale)
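A from-scratch sketch of this "inverted dropout" recipe (illustrative; in practice one would use torch.nn.Dropout, which behaves the same way).

```python
import torch

def dropout(x, p=0.5, train=True):
    if not train or p == 0.0:
        return x                              # test time: use all activations
    mask = (torch.rand_like(x) > p).float()   # keep each unit with prob 1 - p
    return x * mask / (1.0 - p)               # rescale so expectations match test time
```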

Page 37

Handling large vocabularies

‣ Softmax can be expensive for large vocabularies

‣ English vocabulary size: 10K to 100K

P(y_i) = exp(w_i · h + b_i) / ∑_{j=1}^{|V|} exp(w_j · h + b_j)

(the denominator, a sum over all |V| words, is expensive to compute)

Page 38

Approximate Softmax

‣ Negative Sampling

‣ Structured softmax

‣ Embedding prediction

Page 39

Negative Sampling

• Softmax is expensive when vocabulary size is large

(figure credit: Graham Neubig)

Page 40

Negative Sampling

• Sample just a subset of the vocabulary as negative examples

• We saw simple negative sampling in word2vec (Mikolov et al., 2013)

(figure credit: Graham Neubig)

Other ways to sample: Importance Sampling (Bengio and Senecal, 2003); Noise Contrastive Estimation (Mnih and Teh, 2012)
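A word2vec-style sketch of negative sampling for the output layer; uniform sampling of negatives is an assumption here (word2vec samples from a smoothed unigram distribution).

```python
import torch
import torch.nn.functional as F

def neg_sampling_loss(h, W_out, gold_id, num_neg=16):
    # h: (hid,) decoder state; W_out: (|V|, hid) output embeddings
    neg_ids = torch.randint(0, W_out.size(0), (num_neg,))  # sampled negatives
    pos_score = W_out[gold_id] @ h
    neg_scores = W_out[neg_ids] @ h
    # binary logistic loss: push the gold word up, sampled words down,
    # instead of normalizing over all |V| words
    return -F.logsigmoid(pos_score) - F.logsigmoid(-neg_scores).sum()
```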

Page 42

Class-based softmax

‣ Two-layer: cluster words into classes, predict the class, then predict the word within the class. (Goodman 2001, Mikolov et al. 2011)

‣ Clusters can be based on frequency, random assignment, or word contexts.

(figure credit: Graham Neubig)
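A sketch of the two-layer factorization P(word | h) = P(class | h) · P(word | class, h); the fixed words-per-class layout is an assumption for clarity.

```python
import torch
import torch.nn as nn

class ClassBasedSoftmax(nn.Module):
    def __init__(self, hid, n_classes, words_per_class):
        super().__init__()
        self.cls = nn.Linear(hid, n_classes)                      # class scores
        self.word = nn.Linear(hid, n_classes * words_per_class)   # word scores
        self.n_classes, self.wpc = n_classes, words_per_class

    def log_prob(self, h, word_id):
        c, w = word_id // self.wpc, word_id % self.wpc
        logp_c = torch.log_softmax(self.cls(h), dim=-1)[c]
        # normalize only over the |V| / n_classes words in class c
        W_c = self.word.weight.view(self.n_classes, self.wpc, -1)[c]
        b_c = self.word.bias.view(self.n_classes, self.wpc)[c]
        logp_w = torch.log_softmax(W_c @ h + b_c, dim=-1)[w]
        return logp_c + logp_w
```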

Page 43

Embedding prediction

‣ Directly predict embeddings of outputs themselves

‣ What loss to use? (Kumar and Tsvetkov 2019)

‣ L2? Cosine?

‣ von Mises-Fisher distribution loss: makes embeddings close on the unit ball

(slide credit: Graham Neubig)

Page 44

Generation

How can we use our model (decoder) to generate sentences?

• Sampling: Try to generate a random sentence according to the probability distribution

• Argmax: Try to generate the best sentence, the sentence with the highest probability

Page 45

Decoding Strategies

‣ Ancestral sampling

‣ Greedy decoding

‣ Exhaustive search

‣ Beam search

Page 46

Ancestral Sampling

• Randomly sample words one by one

• Provides diverse output (high variance)

(figure credit: Luong, Cho, and Manning)

Page 47

Greedy decoding

‣ Compute argmax at every step of decoder to generate word

‣ What’s wrong?
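A sketch of the greedy loop; `step` is an assumed one-step decoder API returning logits and a new state. The problem: a locally best word can lock the decoder out of a globally better translation, with no way to backtrack.

```python
import torch

def greedy_decode(step, h_enc, sos_id, eos_id, max_len=50):
    y, state, out = sos_id, h_enc, []
    for _ in range(max_len):
        logits, state = step(y, state)
        y = int(torch.argmax(logits))   # locally best word; no backtracking
        if y == eos_id:
            break
        out.append(y)
    return out
```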

Page 48

Exhaustive search?

‣ Find arg max_{y_1,…,y_T} P(y_1, …, y_T | x_1, …, x_n)

‣ Requires computing all possible sequences

‣ O(V^T) complexity!

‣ Too expensive

Page 49

Recall: Beam search (a middle ground)

‣ Key idea: At every step, keep track of the k most probable partial translations (hypotheses)

‣ Score of each hypothesis = log probability:

∑_{t=1}^{j} log P(y_t | y_1, …, y_{t−1}, x_1, …, x_n)

‣ Not guaranteed to be optimal

‣ More efficient than exhaustive search
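A compact sketch under the same assumed one-step `step(y, state)` API (here returning log-probabilities and a new state); it keeps the k best partial hypotheses and length-normalizes finished ones, as described on the next slides.

```python
import torch

def beam_search(step, h_enc, sos_id, eos_id, k=5, max_len=50):
    beams = [([sos_id], h_enc, 0.0)]       # (tokens, state, sum of log-probs)
    finished = []
    for _ in range(max_len):
        candidates = []
        for toks, state, score in beams:
            logp, new_state = step(toks[-1], state)
            top_lp, top_ids = torch.topk(logp, k)   # expand k ways per beam
            for lp, y in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((toks + [y], new_state, score + lp))
        # keep only the k most probable partial translations
        beams = sorted(candidates, key=lambda c: c[2], reverse=True)[:k]
        open_beams = []
        for toks, state, score in beams:
            if toks[-1] == eos_id:
                finished.append((toks, score / len(toks)))  # length-normalized
            else:
                open_beams.append((toks, state, score))
        beams = open_beams
        if not beams:
            break
    if finished:
        return max(finished, key=lambda c: c[1])[0]
    return beams[0][0]
```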

Page 50

Beam decoding

(slide credit: Abigail See)

Page 51

Beam decoding

(slide credit: Abigail See)

Page 52

Beam decoding

(slide credit: Abigail See)

Page 53

Backtrack

(slide credit: Abigail See)

Page 54

Beam decoding

‣ Different hypotheses may produce the ⟨eos⟩ (end) token at different time steps

‣ When a hypothesis produces ⟨eos⟩, stop expanding it and place it aside

‣ Continue beam search until:

‣ All k hypotheses produce ⟨eos⟩, OR

‣ We hit the max decoding limit T

‣ Select the top hypothesis using the normalized likelihood score:

(1/T) ∑_{t=1}^{T} log P(y_t | y_1, …, y_{t−1}, x_1, …, x_n)

‣ Otherwise shorter hypotheses have higher scores

Page 55

Under translation

• Crucial information is left untranslated; premature generation of <EOS>

• Possible remedies: beam search, ensembles, coverage models

Page 56

• Out-of-Vocabulary (OOV) problem; rare word problem

• Frequent words are translated correctly; rare words are not

• In any corpus, word frequencies are exponentially unbalanced

• Under translation

• Crucial information is left untranslated; premature generation of <EOS>

[Figure: an RNN decoder translating "nobody expects the spanish inquisition !"; at the final step it assigns 'inquisition': 0.49 vs. <EOS>: 0.51, so it emits <EOS> and ends the translation prematurely]


Page 58

Ensemble

• Similar to a voting mechanism, but with probabilities

• Multiple models with different parameters (usually from different checkpoints of the same training instance)

• Use the output with the highest (e.g. averaged) probability across all models
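A sketch of one ensembled decoding step; each model is assumed to expose a one-step `step(y, state)` API, and the predicted distributions are averaged before taking the argmax.

```python
import torch

def ensemble_step(models, y_prev, states):
    probs, new_states = [], []
    for m, s in zip(models, states):
        logits, s2 = m.step(y_prev, s)          # assumed per-model step API
        probs.append(torch.softmax(logits, dim=-1))
        new_states.append(s2)
    avg = torch.stack(probs).mean(dim=0)        # average the distributions
    return int(torch.argmax(avg)), new_states   # word fed back to every model
```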

Page 59

Ensemble (demo)

[Figure: a single seq2seq RNN decodes "<SOS> prof loves cheese" into "prof loves cheese <EOS>"]

Page 60

Ensemble (demo)

[Figure: three copies of the same checkpoint (seq2seq_E049.pt) each decode "prof loves cheese <EOS>"]

Page 61

Ensemble (demo)

[Figure: three different checkpoints (seq2seq_E047.pt, seq2seq_E048.pt, seq2seq_E049.pt) decode "prof loves kebab", "prof loves cheese", and "prof loves pizza"]

Page 62

Ensemble (demo)

[Figure: at the third decoding step the checkpoints disagree: one assigns 'kebab': 0.9, 'cheese': 0.05, 'pizza': 0.05; another 'cheese': 0.55, 'kebab': 0.3, 'pizza': 0.05; the third 'pizza': 0.55, 'kebab': 0.3, 'cheese': 0.05]

Page 63

Ensemble (demo)

[Figure: averaging the three distributions makes 'kebab' the most probable word, so the ensemble outputs 'kebab']

Page 64

Ensemble (demo)

[Figure: the chosen word 'kebab' is fed back to all three models for the next decoding step]

Page 65

Existing challenges with NMT

‣ Out-of-vocabulary words

‣ Low-resource languages

‣ Long-term context

‣ Common sense knowledge (e.g. hot dog, paper jam)

‣ Fairness and bias

‣ Uninterpretable


Page 67

Out-of-vocabulary words

• Out-of-Vocabulary (OOV) Problem; Rare word problem

• Frequent words are translated correctly; rare words are not

• In any corpus, word frequencies are exponentially unbalanced

Page 68

OOV

Zipf's Law [1]

[Figure: log-log plot of occurrence count (1 to 1,000,000) against frequency rank (1 to 50,000)]

• Rare words are exponentially less frequent than frequent words

• e.g. in HW4, 45% | 65% (src | tgt) of the unique words occur only once

1. https://en.wikipedia.org/wiki/Zipf%27s_law

Page 69

Out-of-vocabulary words

• Out-of-Vocabulary (OOV) problem; rare word problem

• Copy mechanisms

• Subword / character models

Page 70

Copy Mechanism

[Figure: a seq2seq RNN decodes "<SOS> prof loves cheese" into "prof loves cheese <EOS>"]

1. CL2015015 [Luong et al.] Addressing the Rare Word Problem in Neural Machine Translation

Page 71

Copy Mechanism

[Figure: the decoder emits <UNK>; its attention over the source "prof liebt kebab" is α_{t,1} = 0.1, α_{t,2} = 0.2, α_{t,3} = 0.7, pointing at "kebab"]

1. CL2015015 [Luong et al.] Addressing the Rare Word Problem in Neural Machine Translation


Page 73

Copy Mechanism

[Figure: as above, with α_{t,1} = 0.1, α_{t,2} = 0.2, α_{t,3} = 0.7 pointing at "kebab"]

• The decoder sees <UNK> at step t

• It looks at the attention weights α_t

• It replaces <UNK> with the source word f_{argmax_i α_{t,i}}

1. CL2015015 [Luong et al.] Addressing the Rare Word Problem in Neural Machine Translation
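A sketch of this replacement rule, assuming the attention weights were saved during decoding; the toy example mirrors the slide's "prof liebt kebab" illustration.

```python
import torch

def replace_unk(output_words, attn, src_words, unk="<UNK>"):
    # attn: (tgt_len, src_len) attention weights recorded while decoding
    result = []
    for t, w in enumerate(output_words):
        if w == unk:
            i = int(torch.argmax(attn[t]))   # argmax_i of alpha_{t,i}
            w = src_words[i]                 # copy the aligned source word
        result.append(w)
    return result

src = ["prof", "liebt", "kebab"]
out = ["prof", "loves", "<UNK>"]
attn = torch.tensor([[0.8, 0.1, 0.1],
                     [0.1, 0.8, 0.1],
                     [0.1, 0.2, 0.7]])
print(replace_unk(out, attn, src))  # ['prof', 'loves', 'kebab']
```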

Page 74

*Pointer-Generator Copy Mechanism

[Figure: as before, but a binary classifier p_g switches at each step between using the decoder output and using the copy mechanism]

• binary classifier switch: use decoder output, or use copy mechanism

1. CL2016032 [Gulcehre et al.] Pointing the Unknown Words

Page 75

*Pointer-Generator Copy Mechanism

• The decoder sees <UNK> at step t, or p_g([h_t^dec; context_t]) ≤ 0.5

• It looks at the attention weights α_t

• It replaces <UNK> with the source word f_{argmax_i α_{t,i}}

1. CL2016032 [Gulcehre et al.] Pointing the Unknown Words

Page 76

*Pointer-Based Dictionary Fusion

• The decoder sees <UNK> at step t, or p_g([h_t^dec; context_t]) ≤ 0.5

• It looks at the attention weights α_t

• It replaces <UNK> with a dictionary translation of the source word: dict(f_{argmax_i α_{t,i}})

1. CL2019331 [Gū et al.] Pointer-based Fusion of Bilingual Lexicons into Neural Machine Translation

Page 77

Out-of-vocabulary words

• Out-of-Vocabulary (OOV) problem; rare word problem

• Copy mechanisms

• Subword / character models

Page 78

OOV

Page 79

Subword / character models

• Character-based seq2seq models

Page 80

Subword / character models

• Character-based seq2seq models

• Byte pair encoding (BPE)


Page 83

Subword / character models

• Character-based seq2seq models

• Byte pair encoding (BPE)

• Subword units help model morphology (a BPE sketch follows below)
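A compact sketch of BPE merge learning in the spirit of Sennrich et al. (2016): repeatedly merge the most frequent adjacent symbol pair. The toy corpus is illustrative.

```python
from collections import Counter

def learn_bpe(words, num_merges=10):
    # represent each word as a tuple of characters plus an end marker
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for sym, freq in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)        # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for sym, freq in vocab.items():         # apply the merge everywhere
            out, i = [], 0
            while i < len(sym):
                if i + 1 < len(sym) and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1]); i += 2
                else:
                    out.append(sym[i]); i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe(["low", "low", "lower", "newest", "newest", "widest"], 5))
```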

Page 84

Existing challenges with NMT

‣ Out-of-vocabulary words

‣ Low-resource languages

‣ Long-term context

‣ Common sense knowledge (e.g. hot dog, paper jam)

‣ Fairness and bias

‣ Uninterpretable

Page 85

Massively multilingual MT

(Arivazhagan et al., 2019)

‣ Train a single neural network on 103 languages paired with English (remember Interlingua?)

‣ Massive improvements on low-resource languages

Page 86

Existing challenges with NMT

‣ Out-of-vocabulary words

‣ Low-resource languages

‣ Long-term context

‣ Common sense knowledge (e.g. hot dog, paper jam)

‣ Fairness and bias

‣ Uninterpretable

Page 87

Next time

‣ Contextualized embeddings

‣ ELMo, BERT, and friends

‣ Transformers

‣ Multi-headed self-attention