Neural Machine Translation
Spring 2020 (2020-03-12)
CMPT 825: Natural Language Processing
!"#!"#$"%&$"'
Adapted from slides from Danqi Chen, Karthik Narasimhan, and Jetic Gu. (with some content from slides from Abigail See, Graham Neubig)
Course Logistics
‣ Project proposal is due today
‣ What problem are you addressing? Why is it interesting?
‣ What specific aspects will your project be on?
‣ Re-implement paper? Compare different methods?
‣ What data do you plan to use?
‣ What is your method?
‣ How do you plan to evaluate? What metrics?
Last time
• Statistical MT
• Word-based
• Phrase-based
• Syntactic
Neural Machine Translation
‣ A single neural network is used to translate from source to target
‣ Architecture: Encoder-Decoder
‣ Two main components:
‣ Encoder: Convert source sentence (input) into a vector/matrix
‣ Decoder: Convert encoding into a sentence in target language (output)
Sequence to Sequence learning (Seq2seq)
• Encode entire input sequence into a single vector (using an RNN)
• Decode one word at a time (again, using an RNN!)
• Beam search for better inference
• Learning is not trivial! (vanishing/exploding gradients)
(Sutskever et al., 2014)
Encoder
Sentence: This cat is cute
[Figure: an RNN encoder unrolled over the source sentence "This cat is cute"; the word embeddings x1, ..., x4 are fed through hidden states h0, ..., h4, and the final hidden state gives the encoded representation henc.]
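To make the picture concrete, here is a minimal PyTorch-style encoder sketch (not the exact model from the slides; the vocabulary, embedding, and hidden sizes are placeholder values):

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Minimal RNN encoder: embeds source tokens and returns all hidden
    states plus the final state h_enc (hypothetical sizes)."""
    def __init__(self, vocab_size=10000, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):
        # src_tokens: (batch, src_len) integer ids, e.g. for "This cat is cute"
        embedded = self.embedding(src_tokens)   # (batch, src_len, emb_dim)
        outputs, h_enc = self.rnn(embedded)     # outputs: all hidden states
        return outputs, h_enc                   # h_enc: (1, batch, hidden_dim)
```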
Decoder
Target sentence: <s> ce chat est mignon <e>
• A conditioned language model
[Figure: an RNN decoder initialized with henc; at each step it embeds the previous target word (<s>, ce, chat, est, mignon) and predicts the next output word (ce, chat, est, mignon, <e>).]
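A matching decoder sketch under the same assumptions (one step at a time, conditioned on the encoder's final state through the hidden-state argument):

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Minimal RNN decoder: a conditioned language model over target words
    (hypothetical sizes; matches the Encoder sketch above)."""
    def __init__(self, vocab_size=10000, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_tokens, hidden):
        # prev_tokens: (batch, 1) id of the previous target word (<s> at the first step)
        # hidden: (1, batch, hidden_dim), initialized with the encoder's h_enc
        embedded = self.embedding(prev_tokens)        # (batch, 1, emb_dim)
        output, hidden = self.rnn(embedded, hidden)   # one decoding step
        logits = self.out(output.squeeze(1))          # (batch, vocab_size) scores
        return logits, hidden
```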
Seq2seq training
‣ Similar to training a language model!
‣ Minimize cross-entropy loss:
   − ∑_{t=1}^{T} log P(y_t | y_1, …, y_{t−1}, x_1, …, x_n)
‣ Back-propagate gradients through both decoder and encoder
‣ Need a really big corpus
English: Machine translation is cool!
Russian: Машинный перевод - это круто!
36M sentence pairs
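With teacher forcing, this loss can be computed roughly as follows (a sketch that assumes the Encoder/Decoder classes above; tensor names are hypothetical):

```python
import torch.nn.functional as F

def seq2seq_loss(encoder, decoder, src_tokens, tgt_tokens):
    """Cross-entropy over the target sequence with teacher forcing.
    tgt_tokens: (batch, tgt_len), starting with <s> and ending with <e>."""
    _, hidden = encoder(src_tokens)
    loss = 0.0
    for t in range(tgt_tokens.size(1) - 1):
        prev = tgt_tokens[:, t:t + 1]            # gold previous word (teacher forcing)
        logits, hidden = decoder(prev, hidden)
        loss = loss + F.cross_entropy(logits, tgt_tokens[:, t + 1])
    return loss
```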
Seq2seq training
(slide credit: Abigail See)
Remember masking
Example mask for a batch of four sequences of different lengths:
1 1 1 1 0 0
1 0 0 0 0 0
1 1 1 1 1 1
1 1 1 0 0 0
Use masking to help compute loss for batched sequences
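One common way to apply such a mask when computing the batched loss (a sketch; assumes logits of shape (batch, len, vocab) and a 0/1 mask as above):

```python
import torch.nn.functional as F

def masked_cross_entropy(logits, targets, mask):
    """logits: (batch, len, vocab); targets, mask: (batch, len).
    Positions where mask == 0 (padding) contribute nothing to the loss."""
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    return (per_token * mask).sum() / mask.sum()
```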
Scheduled Sampling
(figure credit: Bengio et al, 2015)
Possible decay schedules (the probability of using the true y decays over training)
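For illustration, the decay schedules described in Bengio et al. (2015) can be sketched as below, where epsilon is the probability of feeding the gold previous word at training step i and k is a hyperparameter:

```python
import math

def exponential_decay(i, k=0.99):
    # epsilon_i = k**i, with k < 1
    return k ** i

def inverse_sigmoid_decay(i, k=100.0):
    # epsilon_i = k / (k + exp(i / k)), with k >= 1
    return k / (k + math.exp(i / k))
```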
How seq2seq changed the MT landscape
MT Progress
(source: Rico Sennrich)
(Wu et al., 2016)
NMT vs SMT
Pros
‣ Better performance
‣ Fluency
‣ Longer context
‣ Single NN optimized end-to-end
‣ Less engineering
‣ Works out of the box for many language pairs
Cons
‣ Requires more data and compute
‣ Less interpretable
‣ Hard to debug
‣ Uncontrollable
‣ Heavily dependent on data - could lead to unwanted biases
‣ More parameters
Seq2Seq for more than NMT
Task/Application    | Input               | Output
Machine Translation | French              | English
Summarization       | Document            | Short Summary
Dialogue            | Utterance           | Response
Parsing             | Sentence            | Parse tree (as sequence)
Question Answering  | Context + Question  | Answer
Cross-Modal Seq2Seq
Task/Application   | Input         | Output
Speech Recognition | Speech Signal | Transcript
Image Captioning   | Image         | Text
Video Captioning   | Video         | Text
Issues with vanilla seq2seq
‣ A single encoding vector, henc, needs to capture all the information about the source sentence (bottleneck)
‣ Longer sequences can lead to vanishing gradients
‣ Overfitting
Remember alignments?
Attention
‣ The neural MT equivalent of alignment models
‣ Key idea: At each time step during decoding, focus on a particular part of source sentence
‣ This depends on the decoder’s current hidden state (i.e. notion of what you are trying to decode)
‣ Usually implemented as a probability distribution over the hidden states of the encoder (h_i^enc)
Seq2seq with attention
(slide credit: Abigail See)
[Figure: at each decoder step, attention scores over the encoder hidden states give an attention distribution; its weighted sum (the attention output) is combined with the decoder hidden state to predict the next word. The prediction y1 can also be used as input for the next time step.]
Computing attention
‣ Encoder hidden states: h_1^enc, …, h_n^enc
‣ Decoder hidden state at time t: h_t^dec
‣ First, get attention scores for this time step (we will see what g is soon!):
   e^t = [g(h_1^enc, h_t^dec), …, g(h_n^enc, h_t^dec)]
‣ Obtain the attention distribution using softmax:
   α^t = softmax(e^t) ∈ ℝ^n
‣ Compute a weighted sum of encoder hidden states:
   a_t = ∑_{i=1}^{n} α_i^t h_i^enc ∈ ℝ^h
‣ Finally, concatenate with the decoder state and pass on to the output layer:
   [a_t; h_t^dec] ∈ ℝ^{2h}
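A minimal numpy sketch of these steps using dot-product scoring (shapes are assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(enc_states, dec_state):
    """enc_states: (n, h) encoder hidden states; dec_state: (h,) decoder state."""
    scores = enc_states @ dec_state          # e^t: (n,) dot-product scores
    alpha = softmax(scores)                  # attention distribution
    a_t = alpha @ enc_states                 # weighted sum: (h,)
    return np.concatenate([a_t, dec_state])  # [a_t; h_t^dec]: (2h,)
```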
Types of attention
‣ Assume encoder hidden states h_1, h_2, …, h_n and decoder hidden state z
1. Dot-product attention (assumes equal dimensions for h_i and z):
   e_i = g(h_i, z) = z^T h_i ∈ ℝ
2. Multiplicative attention: g(h_i, z) = z^T W h_i ∈ ℝ, where W is a weight matrix
3. Additive attention: g(h_i, z) = v^T tanh(W_1 h_i + W_2 z) ∈ ℝ, where W_1, W_2 are weight matrices and v is a weight vector
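The three scoring functions side by side, as a rough sketch (W, W1, W2, v stand for learned parameters and are just placeholders here):

```python
import numpy as np

def dot_score(h_i, z):
    return z @ h_i                            # z^T h_i

def multiplicative_score(h_i, z, W):
    return z @ W @ h_i                        # z^T W h_i

def additive_score(h_i, z, W1, W2, v):
    return v @ np.tanh(W1 @ h_i + W2 @ z)     # v^T tanh(W1 h_i + W2 z)
```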
Issues with vanilla seq2seq
‣ A single encoding vector, henc, needs to capture all the information about the source sentence (bottleneck)
‣ Longer sequences can lead to vanishing gradients
‣ Overfitting
Dropout
‣ Form of regularization for RNNs (and any NN in general)
‣ Idea: “Handicap” the NN by removing hidden units stochastically
‣ set each hidden unit in a layer to 0 with probability p during training (p = 0.5 usually works well)
‣ scale outputs by 1/(1 − p)
‣ hidden units are forced to learn more general patterns
‣ Test time: Use all activations (no need to rescale)
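A sketch of this scheme, often called inverted dropout (scale by 1/(1 − p) during training so test time needs no rescaling):

```python
import numpy as np

def dropout(h, p=0.5, train=True):
    """h: activations of a layer; p: probability of dropping a unit."""
    if not train:
        return h                              # test time: use all activations
    mask = (np.random.rand(*h.shape) >= p)    # keep each unit with prob 1 - p
    return h * mask / (1.0 - p)               # scale so the expected value is unchanged
```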
Handling large vocabularies
‣ Softmax can be expensive for large vocabularies
‣ English vocabulary size: 10K to 100K
P(y_i) = exp(w_i ⋅ h + b_i) / ∑_{j=1}^{|V|} exp(w_j ⋅ h + b_j)
The normalization over the whole vocabulary is expensive to compute.
Approximate Softmax
‣ Negative Sampling
‣ Structured softmax
‣ Embedding prediction
Negative Sampling
• Softmax is expensive when vocabulary size is large
(figure credit: Graham Neubig)
Negative Sampling
• Sample just a subset of the vocabulary as negatives
• We saw simple negative sampling in word2vec (Mikolov et al. 2013)
(figure credit: Graham Neubig)
Other ways to sample: Importance Sampling (Bengio and Senecal 2003), Noise Contrastive Estimation (Mnih & Teh 2012)
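A rough sketch of the negative-sampling idea (score the true word against only a few sampled negatives instead of the full vocabulary; this is a simplified word2vec-style objective, not any one of the estimators above):

```python
import numpy as np

def negative_sampling_loss(h, W, b, true_id, num_neg=5):
    """h: hidden state (d,); W: output word embeddings (|V|, d); b: biases (|V|,)."""
    V = W.shape[0]
    neg_ids = np.random.choice(V, size=num_neg)            # sampled negative words
    pos = np.log(1 / (1 + np.exp(-(W[true_id] @ h + b[true_id]))))   # true word up
    neg = np.log(1 / (1 + np.exp(W[neg_ids] @ h + b[neg_ids]))).sum()  # negatives down
    return -(pos + neg)
```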
Hierarchical softmax (Morin and Bengio 2005)
(figure credit: Quora)
Class based softmax
‣ Two-layer: cluster words into classes, predict the class and then predict the word within that class.
(Goodman 2001, Mikolov et al. 2011)
‣ Clusters can be based on frequency, random assignment, or word contexts.
(figure credit: Graham Neubig)
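The factorization behind the two-layer softmax, sketched with placeholder class assignments:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def two_layer_softmax_prob(h, class_W, word_W, c, j):
    """P(word) = P(class c | h) * P(word j within class c | h).
    class_W: (num_classes, d); word_W: list of (class_size_c, d) matrices."""
    p_class = softmax(class_W @ h)      # distribution over classes
    p_word = softmax(word_W[c] @ h)     # distribution over words in class c
    return p_class[c] * p_word[j]
```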
Embedding prediction
‣ Directly predict embeddings of outputs themselves
‣ What loss to use? (Kumar and Tsvetkov 2019)
‣ L2? Cosine?
‣ von Mises-Fisher distribution loss: make embeddings close on the unit ball
(slide credit: Graham Neubig)
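For illustration only (not the exact von Mises-Fisher formulation from Kumar and Tsvetkov 2019), a cosine-based embedding prediction loss could look like:

```python
import numpy as np

def cosine_embedding_loss(pred, target_emb):
    """pred: predicted output embedding; target_emb: embedding of the gold word.
    Loss is 1 - cosine similarity (0 when the vectors point the same way)."""
    cos = pred @ target_emb / (np.linalg.norm(pred) * np.linalg.norm(target_emb))
    return 1.0 - cos
```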
Generation
How can we use our model (decoder) to generate sentences?
• Sampling: Try to generate a random sentence according to the probability distribution
• Argmax: Try to generate the best sentence, the sentence with the highest probability
Decoding Strategies
‣ Ancestral sampling
‣ Greedy decoding
‣ Exhaustive search
‣ Beam search
Ancestral Sampling
• Randomly sample words one by one
• Provides diverse output (high variance)
(figure credit: Luong, Cho, and Manning)
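A sketch of ancestral sampling using the one-step decoder interface from earlier (names and ids are assumptions):

```python
import torch

def sample_sentence(decoder, h_enc, sos_id, eos_id, max_len=50):
    """Sample one word at a time from the model's distribution."""
    tokens, hidden = [sos_id], h_enc
    for _ in range(max_len):
        prev = torch.tensor([[tokens[-1]]])
        logits, hidden = decoder(prev, hidden)
        probs = torch.softmax(logits.squeeze(0), dim=-1)
        next_id = torch.multinomial(probs, num_samples=1).item()  # random sample
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```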
Greedy decoding
‣ Compute argmax at every step of decoder to generate word
‣ What’s wrong?
Exhaustive search?
‣ Find arg max_{y_1, …, y_T} P(y_1, …, y_T | x_1, …, x_n)
‣ Requires computing scores for all possible sequences
‣ O(V^T) complexity!
‣ Too expensive
Recall: Beam search (a middle ground)
‣ Key idea: At every step, keep track of the k most probable partial translations (hypotheses)
‣ Score of each hypothesis = log probability:
   ∑_{t=1}^{j} log P(y_t | y_1, …, y_{t−1}, x_1, …, x_n)
‣ Not guaranteed to be optimal
‣ More efficient than exhaustive search
Beam decoding
(slide credit: Abigail See)
Backtrack
(slide credit: Abigail See)
Beam decoding
‣ Different hypotheses may produce the ⟨eos⟩ (end) token at different time steps
‣ When a hypothesis produces ⟨eos⟩, stop expanding it and place it aside
‣ Continue beam search until:
‣ All k hypotheses produce ⟨eos⟩, OR
‣ We hit the max decoding limit T
‣ Select the top hypotheses using the normalized likelihood score:
   (1/T) ∑_{t=1}^{T} log P(y_t | y_1, …, y_{t−1}, x_1, …, x_n)
‣ Otherwise shorter hypotheses have higher scores
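A compact beam-search sketch over the same one-step decoder interface (hypothetical; keeps the k best partial hypotheses, sets finished ones aside, and length-normalizes at the end):

```python
import torch

def beam_search(decoder, h_enc, sos_id, eos_id, k=5, max_len=50):
    # each hypothesis: (log_prob, token_list, hidden_state)
    beams = [(0.0, [sos_id], h_enc)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens, hidden in beams:
            prev = torch.tensor([[tokens[-1]]])
            logits, new_hidden = decoder(prev, hidden)
            log_probs = torch.log_softmax(logits.squeeze(0), dim=-1)
            top_lp, top_ids = log_probs.topk(k)
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((score + lp, tokens + [idx], new_hidden))
        # keep only the k best partial hypotheses
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:k]:
            if cand[1][-1] == eos_id:
                finished.append(cand)      # set finished hypotheses aside
            else:
                beams.append(cand)
        if not beams:
            break
    finished.extend(beams)
    # normalize by length so shorter hypotheses are not unfairly favoured
    return max(finished, key=lambda c: c[0] / len(c[1]))[1]
```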
Under-translation
• Under-translation: crucial information is left untranslated; premature ⟨EOS⟩ generation
• Approaches: beam search, ensembles, coverage models
• Out-of-Vocabulary (OOV) Problem; rare word problem: frequent words are translated correctly, rare words are not
• In any corpus, word frequencies are exponentially unbalanced
[Figure: decoding "nobody expects the spanish inquisition !"; at the step where "inquisition" should be produced, the model assigns 'inquisition': 0.49 and <EOS>: 0.51, so greedy decoding emits <EOS> and the translation ends prematurely.]
Ensemble
• Similar to a voting mechanism, but with probabilities
• Multiple models with different parameters (usually from different checkpoints of the same training run)
• Use the output with the highest combined probability across all models
[Demo: three checkpoints (seq2seq_E047.pt, seq2seq_E048.pt, seq2seq_E049.pt) each decode "prof loves ...". Their next-word distributions differ ('kebab': 0.9, 'cheese': 0.05, 'pizza': 0.05 from one model; 'kebab': 0.3, 'cheese': 0.55, 'pizza': 0.05 from another; 'kebab': 0.3, 'cheese': 0.05, 'pizza': 0.55 from the third); combining them, the ensemble outputs "kebab".]
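A minimal sketch of probability-level ensembling (average the per-step distributions of several models and take the argmax; the toy numbers mirror the demo):

```python
import numpy as np

def ensemble_next_word(distributions):
    """distributions: list of per-model probability vectors over the vocabulary
    for the next word. Average them and pick the argmax."""
    avg = np.mean(np.stack(distributions), axis=0)
    return int(np.argmax(avg))

# Toy distributions from the demo (vocab order: kebab, cheese, pizza)
p1 = np.array([0.9, 0.05, 0.05])
p2 = np.array([0.3, 0.55, 0.05])
p3 = np.array([0.3, 0.05, 0.55])
print(ensemble_next_word([p1, p2, p3]))   # -> 0, i.e. "kebab"
```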
Existing challenges with NMT
‣ Out-of-vocabulary words
‣ Low-resource languages
‣ Long-term context
‣ Common sense knowledge (e.g. hot dog, paper jam)
‣ Fairness and bias
‣ Uninterpretable
Out-of-vocabulary words
• Out-of-Vocabulary (OOV) Problem; Rare word problem
• Frequent words are translated correctly, rare words are not
• In any corpus, word frequencies are exponentially unbalanced
1. https://en.wikipedia.org/wiki/Zipf%27s_law
Zipf’s Law
[Figure: word occurrence count (log scale, 1 to 1,000,000) against frequency rank (1 to 50,000)]
• rare words are exponentially less frequent than frequent words
• e.g. in HW4, 45% | 65% (src | tgt) of the unique words occur only once
Out-of-vocabulary words
• Out-of-Vocabulary (OOV) Problem; Rare word problem
• Copy Mechanisms
• Subword / character models
Copy Mechanism
[Figure: translating the source "prof liebt kebab"; the decoder produces "prof loves <UNK>", and at the <UNK> step the attention weights over the source are αt,1 = 0.1, αt,2 = 0.2, αt,3 = 0.7.]
• Sees <UNK> at step t
• looks at the attention weights αt
• replaces <UNK> with the source word f_{argmax_i αt,i} (here "kebab")
1. CL2015015 [Luong et al.] Addressing the Rare Word Problem in Neural Machine Translation
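A sketch of this UNK-replacement heuristic, assuming we already have the decoded words and the per-step attention weights over the source (the first two attention rows below are made up; only the last matches the slide):

```python
import numpy as np

def replace_unk(output_tokens, attention, source_tokens, unk="<UNK>"):
    """output_tokens: decoded target words; attention: (tgt_len, src_len) weights;
    source_tokens: source words. Replace each <UNK> with the most-attended source word."""
    result = []
    for t, word in enumerate(output_tokens):
        if word == unk:
            src_pos = int(np.argmax(attention[t]))   # argmax_i alpha_{t,i}
            result.append(source_tokens[src_pos])    # copy the source word
        else:
            result.append(word)
    return result

# Toy example based on the slide (attention rows for "prof" and "loves" are invented)
print(replace_unk(["prof", "loves", "<UNK>"],
                  np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.2, 0.7]]),
                  ["prof", "liebt", "kebab"]))   # -> ['prof', 'loves', 'kebab']
```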
*Pointer-Generator Copy Mechanism
• binary classifier switch pg decides whether to use the decoder output or the copy mechanism
• Sees <UNK> at step t, or pg([hdec_t; context_t]) ≤ 0.5
• looks at the attention weights αt
• replaces <UNK> with the source word f_{argmax_i αt,i}
1. CL2016032 [Gulcehre et al.] Pointing the Unknown Words
*Pointer-Based Dictionary Fusion
• Sees <UNK> at step t, or pg([hdec_t; context_t]) ≤ 0.5
• looks at the attention weights αt
• replaces <UNK> with the translation of the source word: dict(f_{argmax_i αt,i})
1. CL2019331 [Gū et al.] Pointer-based Fusion of Bilingual Lexicons into Neural Machine Translation
Subword / character models
• Character based seq2seq models
• Byte pair encoding (BPE)
• Subword units help model morphology
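A small sketch of BPE learning in the spirit of Sennrich et al.'s algorithm (simplified: repeatedly merge the most frequent adjacent symbol pair; the toy corpus is a common illustrative example):

```python
from collections import Counter

def learn_bpe(words, num_merges=10):
    """words: dict mapping a word (as a tuple of symbols) to its corpus frequency."""
    vocab = dict(words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])   # merge the pair
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = new_vocab.get(tuple(out), 0) + freq
        vocab = new_vocab
    return merges

# Toy corpus: "low" x5, "lower" x2, "newest" x6, "widest" x3
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
print(learn_bpe(corpus, num_merges=5))
```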
Existing challenges with NMT
‣ Out-of-vocabulary words
‣ Low-resource languages
‣ Long-term context
‣ Common sense knowledge (e.g. hot dog, paper jam)
‣ Fairness and bias
‣ Uninterpretable
Massively multilingual MT
(Arivazhagan et al., 2019)
‣ Train a single neural network on 103 languages paired with English (remember Interlingua?)
‣ Massive improvements on low-resource languages
Existing challenges with NMT
‣ Out-of-vocabulary words
‣ Low-resource languages
‣ Long-term context
‣ Common sense knowledge (e.g. hot dog, paper jam)
‣ Fairness and bias
‣ Uninterpretable
Next time
‣ Contextualized embeddings
‣ ELMo, BERT, and friends
‣ Transformers
‣ Multi-headed self-attention