MTMA16-Neural Machine Translation


Transcript of MTMA16-Neural Machine Translation

Page 1: MTMA16-Neural Machine Translation

Neural Machine Translation

Rico Sennrich

Institute for Language, Cognition and Computation, University of Edinburgh

May 18 2016

Rico Sennrich Neural Machine Translation 1 / 65

Page 2: MTMA16-Neural Machine Translation

Neural Machine Translation

Kyunghyun Cho, http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/

Rico Sennrich Neural Machine Translation 1 / 65

Page 3: MTMA16-Neural Machine Translation

1 neural network crash course

2 introduction to neural machine translation
   neural language models
   attentional encoder-decoder

3 recent research, opportunities and challenges in neural machine translation

Rico Sennrich Neural Machine Translation 2 / 65

Page 4: MTMA16-Neural Machine Translation

Neural Machine Translation

1 neural network crash course

2 introduction to neural machine translation
   neural language models
   attentional encoder-decoder

3 recent research, opportunities and challenges in neural machine translation

Rico Sennrich Neural Machine Translation 3 / 65

Page 5: MTMA16-Neural Machine Translation

Building block: artificial neurons

analogy to biological neurons
input ≈ dendrites

activation function ≈ neuron ’fires’ if voltage threshold is reached

output ≈ axon

© Christoph Burgmer CC-BY-SA-3.0, https://commons.wikimedia.org/wiki/File:ArtificialNeuronModel_english.png

Rico Sennrich Neural Machine Translation 4 / 65

Page 6: MTMA16-Neural Machine Translation

A simple neural network

neural networks can solve non-linear functions.

XOR truth table:

A B | output
0 0 | 0
0 1 | 1
1 0 | 1
1 1 | 0

[network diagram: inputs x1, x2; hidden units A, B, C; output y = D.
x1 feeds A (weight 1) and B (weight 0.5); x2 feeds B (weight 0.5) and C (weight 1);
A, B, C feed D with weights 1, −2, 1]

(neurons arranged in layers, and fire if input is ≥ 1)

Rico Sennrich Neural Machine Translation 5 / 65

Page 7: MTMA16-Neural Machine Translation

A simple neural network: math

neural networks can be implemented via matrix operations

network: same as on the previous slide (inputs x1, x2; hidden units A, B, C; output D)

w1 = [[1, 0], [0.5, 0.5], [0, 1]]

x = [x1, x2]ᵀ

h1 = [A, B, C]ᵀ

w2 = [1, −2, 1]

y = [D]

calculation of x ↦ y

ϕ(z) = z ≥ 1

h1 = ϕ(w1x)

y = ϕ(w2h1)

Rico Sennrich Neural Machine Translation 6 / 65

Page 8: MTMA16-Neural Machine Translation

A simple neural network: Python code

import numpy

# activation function: neuron fires if input >= 1
def phi(x):
    return numpy.greater_equal(x, 1).astype(int)

def nn(x):
    # weights of the XOR network from the previous slides
    w1 = numpy.array([[1, 0.5, 0], [0, 0.5, 1]])
    w2 = numpy.array([[1], [-2], [1]])
    h1 = phi(x.dot(w1))
    h2 = phi(h1.dot(w2))
    return h2

print(nn(numpy.array([1, 0])))

Rico Sennrich Neural Machine Translation 7 / 65

Page 9: MTMA16-Neural Machine Translation

Training a neural network

Gradient descent
requirements:
labelled training data (supervised learning)
differentiable objective function

in forward pass, compute network output

compare output to true label to compute error

move all parameters in direction that minimizes error

chain rule allows efficient computation of gradient for each parameter in backward pass
→ backpropagation of error

we approximate gradient on small minibatches to perform frequent updates
→ stochastic gradient descent
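As a rough illustration (not from the slides), a minimal numpy sketch of one stochastic gradient descent update for a single-layer network; the sigmoid output, cross-entropy-style gradient, toy minibatch and names such as sgd_step and lr are illustrative assumptions only:

import numpy

def sigmoid(z):
    return 1.0 / (1.0 + numpy.exp(-z))

def sgd_step(w, x, y_true, lr=0.1):
    # forward pass: compute network output
    y_pred = sigmoid(x.dot(w))
    # compare output to true label
    error = y_pred - y_true
    # backward pass: chain rule gives the gradient w.r.t. w
    grad = x.T.dot(error)
    # move parameters in the direction that minimizes the error
    return w - lr * grad

# one minibatch of two training examples
x = numpy.array([[0.0, 1.0], [1.0, 1.0]])
y = numpy.array([[1.0], [0.0]])
w = numpy.zeros((2, 1))
w = sgd_step(w, x, y)
print(w)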

Rico Sennrich Neural Machine Translation 8 / 65

Page 10: MTMA16-Neural Machine Translation

Activation functions

Activation functions
desirable:
differentiable (for stochastic gradient descent)
monotonic (for convexity)
non-linear activation functions essential to learn non-linear functions

[plot of activation functions over x ∈ [−3, 3]: identity (linear), sigmoid, tanh, rectified linear unit (ReLU)]
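For concreteness, an illustrative numpy sketch (not part of the original slides) of the four activation functions listed above:

import numpy

def identity(z):          # linear
    return z

def sigmoid(z):           # maps to (0, 1)
    return 1.0 / (1.0 + numpy.exp(-z))

def tanh(z):              # maps to (-1, 1)
    return numpy.tanh(z)

def relu(z):              # rectified linear unit
    return numpy.maximum(0.0, z)

z = numpy.linspace(-3, 3, 7)
for f in (identity, sigmoid, tanh, relu):
    print(f.__name__, f(z))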

Rico Sennrich Neural Machine Translation 9 / 65

Page 11: MTMA16-Neural Machine Translation

Further basics

hyperparameters:
number and size of layers
minibatch size
learning rate
...

initialisation of weight matrices

stopping criterion

regularization (dropout)

bias units (always-on input)

more complex architectures (recurrent/convolutional networks)

Rico Sennrich Neural Machine Translation 10 / 65

Page 12: MTMA16-Neural Machine Translation

Resources

Theano http://deeplearning.net/software/theano/

Torch http://torch.ch/

Tensorflow https://www.tensorflow.org/

toolkits provide useful abstractions for neural networks:
routines for n-dimensional arrays (tensors)
simple use of different linear algebra backends (CPU/GPU)
automatic differentiation

Rico Sennrich Neural Machine Translation 11 / 65

Page 13: MTMA16-Neural Machine Translation

Neural Machine Translation

1 neural network crash course

2 introduction to neural machine translation
   neural language models
   attentional encoder-decoder

3 recent research, opportunities and challenges in neural machine translation

Rico Sennrich Neural Machine Translation 12 / 65

Page 14: MTMA16-Neural Machine Translation

Language modelling

chain rule and Markov assumption
a sentence T of length n is a sequence w1, . . . , wn

p(T) = p(w1, . . . , wn)

     = ∏_{i=1}^{n} p(wi | w0, . . . , wi−1)        (chain rule)

     ≈ ∏_{i=1}^{n} p(wi | wi−k, . . . , wi−1)      (Markov assumption: n-gram model)

Rico Sennrich Neural Machine Translation 13 / 65

Page 15: MTMA16-Neural Machine Translation

N-gram language model with feedforward neural network

[Vaswani et al., 2013]

n-gram NNLM [Bengio et al., 2003]
input: context of n−1 previous words

output: probability distribution for next word

linear embedding layer with shared weights

one or several hidden layers

Rico Sennrich Neural Machine Translation 14 / 65

Page 16: MTMA16-Neural Machine Translation

Representing words as vectors

One-hot encoding
example vocabulary: ’man’, ’runs’, ’the’, ’.’

input/output for p(runs|the man):

x0 = [0, 0, 1, 0]ᵀ      (’the’)
x1 = [1, 0, 0, 0]ᵀ      (’man’)
ytrue = [0, 1, 0, 0]ᵀ   (’runs’)

size of input/output vector: vocabulary size
embedding layer is lower-dimensional and dense
smaller weight matrices
network learns to group similar words to similar point in vector space
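A tiny numpy sketch (illustrative, not from the slides) of one-hot encoding for this example vocabulary and of how an embedding matrix E turns the sparse vector into a dense one; E is random here purely for demonstration:

import numpy

vocab = ['man', 'runs', 'the', '.']

def one_hot(word):
    v = numpy.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

x0 = one_hot('the')   # [0, 0, 1, 0]
x1 = one_hot('man')   # [1, 0, 0, 0]

# embedding layer: |V| x d matrix; multiplying by a one-hot vector selects one row
d = 3
E = numpy.random.randn(len(vocab), d)
print(x0.dot(E))      # dense embedding of 'the'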

Rico Sennrich Neural Machine Translation 15 / 65

Page 17: MTMA16-Neural Machine Translation

Softmax activation function

softmax function

p(y = j|x) = exp(xj) / ∑_{k} exp(xk)

softmax function normalizes output vector to probability distribution
→ computational cost linear in vocabulary size (!)

ideally: probability 1 for correct word; 0 for rest

SGD with softmax output minimizes cross-entropy (and hence perplexity) of neural network
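A minimal numpy sketch of the softmax (the max-subtraction for numerical stability is an implementation detail, not part of the slide):

import numpy

def softmax(x):
    # subtract the maximum for numerical stability; the result sums to 1
    e = numpy.exp(x - numpy.max(x))
    return e / e.sum()

scores = numpy.array([2.0, 1.0, 0.1, -1.0])   # one score per vocabulary word
p = softmax(scores)
print(p, p.sum())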

Rico Sennrich Neural Machine Translation 16 / 65

Page 18: MTMA16-Neural Machine Translation

Feedforward neural language model: math

[Vaswani et al., 2013]

h1 = ϕ(W1 [Ex1; Ex2])

y = softmax(W2h1)
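One possible reading of these equations as an illustrative numpy sketch (assumed small dimensions, random weights, for demonstration only): the embeddings of the context words are concatenated, passed through one hidden layer, and projected to a softmax over the vocabulary.

import numpy

def softmax(x):
    e = numpy.exp(x - numpy.max(x))
    return e / e.sum()

V, d, h = 4, 3, 5                  # vocabulary size, embedding size, hidden size
E = numpy.random.randn(V, d)       # embedding matrix (shared across positions)
W1 = numpy.random.randn(h, 2 * d)  # concatenated embeddings -> hidden
W2 = numpy.random.randn(V, h)      # hidden -> output scores

def ffnn_lm(x1, x2):               # x1, x2: one-hot context word vectors
    e = numpy.concatenate([E.T.dot(x1), E.T.dot(x2)])
    h1 = numpy.tanh(W1.dot(e))
    return softmax(W2.dot(h1))     # distribution over the next word

x1 = numpy.eye(V)[2]               # e.g. 'the'
x2 = numpy.eye(V)[0]               # e.g. 'man'
print(ffnn_lm(x1, x2))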

Rico Sennrich Neural Machine Translation 17 / 65

Page 19: MTMA16-Neural Machine Translation

Feedforward neural language model in SMT

FFNLM
can be integrated as a feature in the log-linear SMT model [Schwenk et al., 2006]

costly due to matrix multiplications and softmax
solutions:
n-best reranking
variants of softmax (hierarchical softmax, self-normalization [NCE])
shallow networks; premultiplication of hidden layer

scale well to many input words→ models with source context [Devlin et al., 2014]

Rico Sennrich Neural Machine Translation 18 / 65

Page 20: MTMA16-Neural Machine Translation

Recurrent neural network language model (RNNLM)

RNNLM
motivation: condition on arbitrarily long context
→ no Markov assumption

we read in one word at a time, and update hidden state incrementally

hidden state is initialized as empty vector at time step 0
parameters:
embedding matrix E
feedforward matrices W1, W2
recurrent matrix U

hi = 0,                            if i = 0
hi = tanh(W1 Exi + U hi−1),        if i > 0

yi = softmax(W2 hi−1)
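An illustrative numpy sketch of this recurrence (random weights, assumed small dimensions, purely to make the equations concrete); each step reads one word, updates the hidden state, and predicts a distribution over the following word:

import numpy

def softmax(x):
    e = numpy.exp(x - numpy.max(x))
    return e / e.sum()

V, d, h = 4, 3, 5
E = numpy.random.randn(V, d)        # embedding matrix
W1 = numpy.random.randn(h, d)       # input -> hidden
U = numpy.random.randn(h, h)        # recurrent (hidden -> hidden)
W2 = numpy.random.randn(V, h)       # hidden -> output

def rnnlm_step(x_i, h_prev):
    # update hidden state from the current word and the previous state
    h_i = numpy.tanh(W1.dot(E.T.dot(x_i)) + U.dot(h_prev))
    # predict the following word from the updated state
    y_next = softmax(W2.dot(h_i))
    return h_i, y_next

h_state = numpy.zeros(h)            # hidden state at time step 0
for word_id in [2, 0]:              # e.g. 'the man'
    h_state, y = rnnlm_step(numpy.eye(V)[word_id], h_state)
print(y)                            # distribution over the next word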

Rico Sennrich Neural Machine Translation 19 / 65

Page 21: MTMA16-Neural Machine Translation

RNN variants

gated units
alternative to plain RNN

sigmoid layers σ act as “gates” that control flow of information

allows passing of information over long time
→ avoids vanishing gradient problem

strong empirical results
popular variants:
Long Short Term Memory (LSTM) (shown)
Gated Recurrent Unit (GRU)

Christopher Olah, http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Rico Sennrich Neural Machine Translation 20 / 65


Page 23: MTMA16-Neural Machine Translation

Neural Machine Translation

1 neural network crash course

2 introduction to neural machine translation
   neural language models
   attentional encoder-decoder

3 recent research, opportunities and challenges in neural machine translation

Rico Sennrich Neural Machine Translation 21 / 65

Page 24: MTMA16-Neural Machine Translation

Translation modelling

decomposition of translation problem (for NMT)
a source sentence S of length m is a sequence x1, . . . , xm
a target sentence T of length n is a sequence y1, . . . , yn

T* = argmax_T p(T|S)

p(T|S) = p(y1, . . . , yn | x1, . . . , xm)

       = ∏_{i=1}^{n} p(yi | y0, . . . , yi−1, x1, . . . , xm)

Rico Sennrich Neural Machine Translation 22 / 65

Page 25: MTMA16-Neural Machine Translation

Translating with RNNs

Encoder-decoder [Sutskever et al., 2014, Cho et al., 2014]
two RNNs (LSTM or GRU):
encoder reads input and produces hidden state representations
decoder produces output, based on last encoder hidden state

encoder and decoder are learned jointly
→ supervision signal from parallel text is backpropagated

Kyunghyun Cho, http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/

Rico Sennrich Neural Machine Translation 23 / 65

Page 26: MTMA16-Neural Machine Translation

Neural machine translation: information bottleneck

summary vector
last encoder hidden state “summarizes” source sentence

can fixed-size vector represent meaning of arbitrarily long sentence?

empirically, quality decreases for long sentences

reversing source sentence brings some improvement [Sutskever et al., 2014]

[Sutskever et al., 2014]

Rico Sennrich Neural Machine Translation 24 / 65

Page 27: MTMA16-Neural Machine Translation

Attentional encoder-decoder

encoder
goal: avoid bottleneck of summary vector

use bidirectional RNN, and concatenate forward and backward states → annotation vector hi
represent source sentence as vector of n annotations → variable-length representation

Kyunghyun Cho, http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/

Rico Sennrich Neural Machine Translation 25 / 65

Page 28: MTMA16-Neural Machine Translation

Attentional encoder-decoder

attention
problem: how to incorporate variable-length context into hidden state?

attention model computes context vector as weighted average of annotations

weights are computed by feedforward neural network with softmax activation

Kyunghyun Cho, http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/

Rico Sennrich Neural Machine Translation 26 / 65

Page 29: MTMA16-Neural Machine Translation

Attentional encoder-decoder: math

simplifications of model by [Bahdanau et al., 2015] (for illustration)
plain RNN instead of GRU
simpler output layer
we do not show bias terms

notation
W, U, E, C, V are weight matrices (of different dimensionality)
Ex: one-hot to embedding (e.g. 50000 × 512)
Wx: embedding to hidden (e.g. 512 × 1024)
Ux: hidden to hidden (e.g. 1024 × 1024)
C: context (2× hidden) to hidden (e.g. 2048 × 1024)
...
separate weight matrices for encoder and decoder (e.g. Ex and Ey)

input X of length Tx; output Y of length Ty

Rico Sennrich Neural Machine Translation 27 / 65

Page 30: MTMA16-Neural Machine Translation

Attentional encoder-decoder: math

encoder

→hj = 0,                                   if j = 0
→hj = tanh(→Wx Ex xj + →Ux →hj−1),         if j > 0

←hj = 0,                                   if j = Tx + 1
←hj = tanh(←Wx Ex xj + ←Ux ←hj+1),         if j ≤ Tx

hj = (→hj, ←hj)
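An illustrative numpy sketch of the bidirectional encoder (assumed small dimensions, random weights): a forward and a backward RNN pass over the source, with their states concatenated into annotation vectors.

import numpy

V, d, h = 4, 3, 5
Ex = numpy.random.randn(V, d)
Wf, Uf = numpy.random.randn(h, d), numpy.random.randn(h, h)   # forward RNN
Wb, Ub = numpy.random.randn(h, d), numpy.random.randn(h, h)   # backward RNN

def encode(source_ids):
    X = [Ex.T.dot(numpy.eye(V)[i]) for i in source_ids]        # word embeddings
    fwd, state = [], numpy.zeros(h)
    for x in X:                                                 # left to right
        state = numpy.tanh(Wf.dot(x) + Uf.dot(state))
        fwd.append(state)
    bwd, state = [], numpy.zeros(h)
    for x in reversed(X):                                       # right to left
        state = numpy.tanh(Wb.dot(x) + Ub.dot(state))
        bwd.insert(0, state)
    # annotation vector h_j = concatenation of forward and backward states
    return [numpy.concatenate([f, b]) for f, b in zip(fwd, bwd)]

annotations = encode([2, 0, 1])    # one 2h-dimensional vector per source word
print(len(annotations), annotations[0].shape)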

Rico Sennrich Neural Machine Translation 28 / 65

Page 31: MTMA16-Neural Machine Translation

Attentional encoder-decoder: math

decoder

si = tanh(Ws ←h1),                          if i = 0
si = tanh(Wy Ey yi + Uy si−1 + C ci),       if i > 0

ti = tanh(Uo si−1 + Vo Ey yi−1 + Co ci)

yi = softmax(Wo ti)

attention model

eij = vaᵀ tanh(Wa si−1 + Ua hj)

αij = softmax(eij)

ci = ∑_{j=1}^{Tx} αij hj
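An illustrative numpy sketch of the attention computation for one decoder step (assumed dimensions, random weights): alignment scores from a small feedforward network, softmax-normalized into weights, then a weighted average of the annotations.

import numpy

def softmax(x):
    e = numpy.exp(x - numpy.max(x))
    return e / e.sum()

h, h2 = 5, 10                       # decoder state size, annotation size (2x hidden)
Wa = numpy.random.randn(h, h)       # maps the previous decoder state
Ua = numpy.random.randn(h, h2)      # maps each annotation
va = numpy.random.randn(h)

def attention(s_prev, annotations):
    # e_ij = va^T tanh(Wa s_{i-1} + Ua h_j), one score per source position
    e = numpy.array([va.dot(numpy.tanh(Wa.dot(s_prev) + Ua.dot(h_j)))
                     for h_j in annotations])
    alpha = softmax(e)              # attention weights over source positions
    # context vector c_i = sum_j alpha_ij h_j
    c = sum(a * h_j for a, h_j in zip(alpha, annotations))
    return c, alpha

annotations = [numpy.random.randn(h2) for _ in range(3)]
c_i, alpha = attention(numpy.random.randn(h), annotations)
print(alpha, c_i.shape)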

Rico Sennrich Neural Machine Translation 29 / 65

Page 32: MTMA16-Neural Machine Translation

Attention model

attention model
side effect: we obtain alignment between source and target sentence
applications:
visualisation
replace unknown words with back-off dictionary [Jean et al., 2015]
...?

Kyunghyun Cho, http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/

Rico Sennrich Neural Machine Translation 30 / 65

Page 33: MTMA16-Neural Machine Translation

Attention model

attention model also works with images:

[Cho et al., 2015]

Rico Sennrich Neural Machine Translation 31 / 65

Page 34: MTMA16-Neural Machine Translation

Attention model

[Cho et al., 2015]

Rico Sennrich Neural Machine Translation 32 / 65

Page 35: MTMA16-Neural Machine Translation

Neural machine translation: decoding

decoding

exact search intractable: |vocab|^N candidates for each possible output length N
→ approximate search for best translation
given decoder state si, compute probability of each output word yi

sampling: pick a random word (considering probability)
greedy search: pick the most probable word
beam search: pick the k most probable words, and compute yi+1 for each hypothesis in beam

beam search with small beam (k ≈ 10) seems sufficient for neural machine translation

ensemble: compute probability distribution for next word with multiple models, and use (geometric) average
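A schematic beam-search sketch (illustrative only; next_word_probs stands in for one decoder step of an NMT model, and dummy_model is a made-up stand-in, not part of the slides):

import numpy

def beam_search(next_word_probs, start_state, k=10, max_len=50, eos=0):
    # each hypothesis: (log probability, output words so far, decoder state)
    beam = [(0.0, [], start_state)]
    for _ in range(max_len):
        candidates = []
        for logp, words, state in beam:
            if words and words[-1] == eos:          # finished hypothesis
                candidates.append((logp, words, state))
                continue
            probs, new_state = next_word_probs(state, words)
            for w in numpy.argsort(probs)[-k:]:     # k most probable next words
                candidates.append((logp + numpy.log(probs[w]),
                                   words + [int(w)], new_state))
        # keep only the k best hypotheses in the beam
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if all(ws and ws[-1] == eos for _, ws, _ in beam):
            break
    return beam[0][1]                                # best-scoring output

# toy usage: a dummy "model" that prefers word 3 twice, then EOS (= 0)
def dummy_model(state, words):
    if len(words) < 2:
        return numpy.array([0.1, 0.1, 0.1, 0.7]), state
    return numpy.array([0.9, 0.03, 0.03, 0.04]), state

print(beam_search(dummy_model, start_state=None, k=2))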

Rico Sennrich Neural Machine Translation 33 / 65

Page 36: MTMA16-Neural Machine Translation

Further Reading

secondary literature
lecture notes by Kyunghyun Cho: [Cho, 2015]

introduction to LSTM (and GRU):http://colah.github.io/posts/2015-08-Understanding-LSTMs/

“Statistical Machine Translation” by Philipp Koehn (unpublished 2nd edition)

Rico Sennrich Neural Machine Translation 34 / 65

Page 37: MTMA16-Neural Machine Translation

(A small selection of) Resources

feedforward neural LM toolkits
CSLM http://www-lium.univ-lemans.fr/cslm/

NPLM https://github.com/moses-smt/nplm

OXLM https://github.com/pauldb89/OxLM

NMT tools
dl4mt-tutorial (theano) https://github.com/nyu-dl/dl4mt-tutorial

(our branch: nematus https://github.com/rsennrich/nematus)

nmt.matlab https://github.com/lmthang/nmt.matlab

seq2seq (tensorflow) https://www.tensorflow.org/versions/r0.8/tutorials/seq2seq/index.html

Rico Sennrich Neural Machine Translation 35 / 65

Page 38: MTMA16-Neural Machine Translation

Do it yourself

sample files and instructions for training NMT model
https://github.com/rsennrich/wmt16-scripts

pre-trained models to test decoding (and for further experiments)
http://statmt.org/rsennrich/wmt16_systems/

please let me know about gaps in documentation

NMT tools (previous slide) may also contain instructions/samples

Rico Sennrich Neural Machine Translation 36 / 65

Page 39: MTMA16-Neural Machine Translation

Neural Machine Translation

1 neural network crash course

2 introduction to neural machine translation
   neural language models
   attentional encoder-decoder

3 recent research, opportunities and challenges in neural machine translation

Rico Sennrich Neural Machine Translation 37 / 65

Page 40: MTMA16-Neural Machine Translation

State of Neural MT

Attentional encoder-decoder networks are state of the art in MT

similar models used for other NLP tasks

Rico Sennrich Neural Machine Translation 38 / 65

Page 41: MTMA16-Neural Machine Translation

Attentional encoder-decoders (NMT) are SOTA

system          BLEU
uedin-nmt       34.2
metamind        32.3
NYU-UMontreal   30.8
cambridge       30.6
uedin-syntax    30.6
KIT/LIMSI       29.1
KIT             29.0
uedin-pbmt      28.4
jhu-syntax      26.6

Table: WMT16 results for EN→DE

system          BLEU
uedin-nmt       38.6
uedin-pbmt      35.1
jhu-pbmt        34.5
uedin-syntax    34.4
KIT             33.9
jhu-syntax      31.0

Table: WMT16 results for DE→EN

(legend: pure NMT / NMT component / other neural components)

Rico Sennrich Neural Machine Translation 39 / 65


Page 45: MTMA16-Neural Machine Translation

Attentional encoder-decoders (NMT) are SOTA

uedin-nmt        25.8
NYU-UMontreal    23.6
jhu-pbmt         23.6
cu-chimera       21.0
uedin-cu-syntax  20.9
cu-tamchyna      20.8
cu-TectoMT       14.7
cu-mergedtrees    8.2

Table: WMT16 results for EN→CS

uedin-pbmt       35.2
uedin-nmt        33.9
uedin-syntax     33.6
jhu-pbmt         32.2
LIMSI            31.0

Table: WMT16 results for RO→EN

uedin-nmt        31.4
jhu-pbmt         30.4
PJATK            28.3
cu-mergedtrees   13.3

Table: WMT16 results for CS→EN

QT21-HimL-SysComb  28.9
uedin-nmt          28.1
RWTH-SYSCOMB       27.1
uedin-pbmt         26.8
uedin-lmu-hiero    25.9
KIT                25.8
lmu-cuni           24.3
LIMSI              23.9
jhu-pbmt           23.5
usfd-rescoring     23.1

Table: WMT16 results for EN→RO

Rico Sennrich Neural Machine Translation 39 / 65

Page 46: MTMA16-Neural Machine Translation

Attentional encoder-decoders (NMT) are SOTA

uedin-nmt               26.0
amu-uedin               25.3
jhu-pbmt                24.0
LIMSI                   23.6
AFRL-MITLL              23.5
NYU-UMontreal           23.1
AFRL-MITLL-verb-annot   20.9

Table: WMT16 results for EN→RU

amu-uedin               29.1
NRC                     29.1
uedin-nmt               28.0
AFRL-MITLL              27.6
AFRL-MITLL-contrast     27.0

Table: WMT16 results for RU→EN

uedin-pbmt              23.4
uedin-syntax            20.4
PROMT-SMT               20.3
UH-factored             19.3
jhu-pbmt                19.1

Table: WMT16 results for FI→EN

abumatran-combo         17.4
abumatran-nmt           17.2
NYU-UMontreal           15.1
abumatran-pbsmt         14.6
jhu-pbmt                13.8
UH-factored             12.8
jhu-hltcoe              11.9
UUT                     11.6
aalto                   11.6

Table: WMT16 results for EN→FI

Rico Sennrich Neural Machine Translation 39 / 65

Page 47: MTMA16-Neural Machine Translation

Selected examples from WMT16

system       sentence
source       Unsere digitalen Leben haben die Notwendigkeit, stark, lebenslustig und erfolgreich zu erscheinen, verdoppelt [...]
reference    Our digital lives have doubled the need to appear strong, fun-loving and successful [...]
uedin-nmt    Our digital lives have doubled the need to appear strong, lifelike and successful [...]
uedin-pbsmt  Our digital lives are lively, strong, and to be successful, doubled [...]

Rico Sennrich Neural Machine Translation 40 / 65

Page 48: MTMA16-Neural Machine Translation

Selected examples from WMT16

system       sentence
source       Dort wurde er von dem Schläger und einer weiteren männlichen Person erneut angegriffen.
reference    There he was attacked again by his original attacker and another male.
uedin-nmt    There he was attacked again by the racket and another male person.
uedin-pbsmt  There, he was at the club and another male person attacked again.

[images illustrating the senses of “Schläger”: attacker, racket, (golf) club]
racket: https://www.flickr.com/photos/128067141@N07/15157111178 / CC BY 2.0
attacker: https://commons.wikimedia.org/wiki/File:Wikibully.jpg
golf club: https://commons.wikimedia.org/wiki/File:Golf_club,_Callawax_X-20_8_iron_-_III.jpg / CC-BY-SA-3.0

Rico Sennrich Neural Machine Translation 41 / 65


Page 52: MTMA16-Neural Machine Translation

Selected examples from WMT16

system       sentence
source       Ein Jahr später machten die Fed-Repräsentanten diese Kürzungen rückgängig.
reference    A year later, Fed officials reversed those cuts.
uedin-nmt    A year later, FedEx officials reversed those cuts.
uedin-pbsmt  A year later, the Fed representatives made these cuts.

Rico Sennrich Neural Machine Translation 42 / 65


Page 54: MTMA16-Neural Machine Translation

Selected examples from WMT16

system       sentence
source       Titelverteidiger ist Drittligaabsteiger SpVgg Unterhaching.
reference    The defending champions are SpVgg Unterhaching, who have been relegated to the third league.
uedin-nmt    Defending champion is third-round pick SpVgg Underhaching.
uedin-pbsmt  Title defender Drittligaabsteiger Week 2.

Rico Sennrich Neural Machine Translation 43 / 65


Page 56: MTMA16-Neural Machine Translation

Comparison between phrase-based and neural MT

pro neural MT
improved grammaticality [Neubig et al., 2015]
word order
insertion/deletion of function words
morphological agreement

pro phrase-based/syntax-based SMT
minor degradation in lexical choice? [Neubig et al., 2015]

others
rare/unseen words are problematic for both:
PBSMT suffers from data sparseness and noisy phrase alignment
(our) NMT system attempts subword-level translation

Rico Sennrich Neural Machine Translation 44 / 65

Page 57: MTMA16-Neural Machine Translation

Why is neural MT output more grammatical?

neural MT
end-to-end trained model

generalization via continuous space representation

output conditioned on full source text and target history

phrase-based SMT
log-linear combination of many “weak” features

data sparseness triggers back-off to smaller units

strong independence assumptions

Rico Sennrich Neural Machine Translation 45 / 65

Page 58: MTMA16-Neural Machine Translation

Efficiency

speed bottlenecks
matrix multiplication → use of highly parallel hardware (GPUs)
softmax (scales with vocabulary size). Solutions:
LMs: hierarchical softmax; noise-contrastive estimation; self-normalization
NMT: approximate softmax through subset of vocabulary [Jean et al., 2015]

NMT training vs. decoding (on fast GPU)
training: slow (1-3 weeks)

decoding: fast (100 000–500 000 sentences / day)*

*with NVIDIA Titan X and amuNN (https://github.com/emjotde/amunn)

Rico Sennrich Neural Machine Translation 46 / 65

Page 59: MTMA16-Neural Machine Translation

Open-vocabulary translation

Why is vocabulary size a problem?
size of one-hot input/output vector is linear in vocabulary size

large vocabularies are space inefficient

large output vocabularies are time inefficient

typical network vocabulary size: 30 000–100 000

What about out-of-vocabulary words?
training set vocabulary typically larger than network vocabulary (1 million words or more)
at translation time, we regularly encounter novel words:
names: Barack Obama
morph. complex words: Hand|gepäck|gebühr (’carry-on bag fee’)
numbers, URLs etc.

Rico Sennrich Neural Machine Translation 47 / 65

Page 60: MTMA16-Neural Machine Translation

Open-vocabulary translation

Solutions
copy unknown words, or translate with back-off dictionary [Jean et al., 2015, Luong et al., 2015b, Gülçehre et al., 2016]
→ works for names (if alphabet is shared), and 1-to-1 aligned words

use subword units (characters or others) for input/output vocabulary
→ model can learn translation of seen words on subword level
→ model can translate unseen words if translation is transparent

Rico Sennrich Neural Machine Translation 48 / 65

Page 61: MTMA16-Neural Machine Translation

Transparent translations

Transparent translations
some translations are semantically/phonologically transparent
→ no memorization needed; can be translated via subword units

morphologically complex words (e.g. compounds):
solar system (English)
Sonnen|system (German)
Nap|rendszer (Hungarian)

named entities:
Barack Obama (English; German)
Барак Обама (Russian)
バラク・オバマ (ba-ra-ku o-ba-ma) (Japanese)

cognates and loanwords:
claustrophobia (English)
Klaustrophobie (German)
Клаустрофобия (Russian)

many rare/unseen words belong to one of these categories

Rico Sennrich Neural Machine Translation 49 / 65

Page 62: MTMA16-Neural Machine Translation

Subword neural machine translation

Flat representation [Sennrich et al., 2015b, Chung et al., 2016]
sentence is a sequence of subword units

Hierarchical representation [Ling et al., 2015, Luong and Manning, 2016]

sentence is a sequence of words

words are a sequence of subword units

[Figure 2 from Ling et al. (2015), “Character-based Neural Machine Translation”: illustration of the C2W compositional model, which builds a word vector from its character sequence with a bidirectional LSTM]

open question: should attention be on level of words or subwords?

Rico Sennrich Neural Machine Translation 50 / 65

Page 63: MTMA16-Neural Machine Translation

Subword neural machine translation

Choice of subword unit
character-level: small vocabulary, long sequences

morphemes (?): hard to control vocabulary size

hybrid choice: shortlist of words, subwords for rare words

variable-length character n-grams: byte-pair encoding (BPE)

open research question which subword segmentation is best choice in terms of efficiency and effectiveness.

Rico Sennrich Neural Machine Translation 51 / 65

Page 64: MTMA16-Neural Machine Translation

Byte pair encoding for word segmentation

word segmentation with byte-pair encoding [Sennrich et al., 2015b]
actually a merge algorithm, starting from characters

iteratively replace most frequent pair of symbols (’A’,’B’) with ’AB’

apply on dictionary, not on full text (for efficiency)

output vocabulary: original vocabulary + one symbol per merge

dictionary:
’l o w </w>’ : 5
’l o w e r </w>’ : 2
’n e w e s t </w>’ : 6
’w i d e s t </w>’ : 3

learned merge operations:
(’e’, ’s’) : 9
(’es’, ’t’) : 9
(’est’, ’</w>’) : 9
(’l’, ’o’) : 7
(’lo’, ’w’) : 7
...
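A condensed Python sketch of the merge-learning loop described above (simplified, not the reference implementation); on the dictionary shown on this slide it produces exactly the listed merge operations:

import collections, re

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair ('A', 'B') with 'AB'."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

num_merges = 5
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)        # most frequent pair
    print(best, pairs[best])
    vocab = merge_pair(best, vocab)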

Rico Sennrich Neural Machine Translation 52 / 65

Page 65: MTMA16-Neural Machine Translation

Byte pair encoding for word segmentation

word segmentation with byte-pair encoding [Sennrich et al., 2015b]
actually a merge algorithm, starting from characters

iteratively replace most frequent pair of symbols (’A’,’B’) with ’AB’

apply on dictionary, not on full text (for efficiency)

output vocabulary: original vocabulary + one symbol per merge

’l o w </w>’ : 5
’l o w e r </w>’ : 2
’n e w es t </w>’ : 6
’w i d es t </w>’ : 3

(’e’, ’s’) : 9
(’es’, ’t’) : 9
(’est’, ’</w>’) : 9
(’l’, ’o’) : 7
(’lo’, ’w’) : 7
...

Rico Sennrich Neural Machine Translation 52 / 65

Page 66: MTMA16-Neural Machine Translation

Byte pair encoding for word segmentation

word segmentation with byte-pair encoding [Sennrich et al., 2015b]
actually a merge algorithm, starting from characters

iteratively replace most frequent pair of symbols (’A’,’B’) with ’AB’

apply on dictionary, not on full text (for efficiency)

output vocabulary: original vocabulary + one symbol per merge

’l o w </w>’ : 5
’l o w e r </w>’ : 2
’n e w est </w>’ : 6
’w i d est </w>’ : 3

(’e’, ’s’) : 9
(’es’, ’t’) : 9
(’est’, ’</w>’) : 9
(’l’, ’o’) : 7
(’lo’, ’w’) : 7
...

Rico Sennrich Neural Machine Translation 52 / 65

Page 67: MTMA16-Neural Machine Translation

Byte pair encoding for word segmentation

word segmentation with byte-pair encoding [Sennrich et al., 2015b]
actually a merge algorithm, starting from characters

iteratively replace most frequent pair of symbols (’A’,’B’) with ’AB’

apply on dictionary, not on full text (for efficiency)

output vocabulary: original vocabulary + one symbol per merge

’l o w </w>’ : 5
’l o w e r </w>’ : 2
’n e w est</w>’ : 6
’w i d est</w>’ : 3

(’e’, ’s’) : 9
(’es’, ’t’) : 9
(’est’, ’</w>’) : 9
(’l’, ’o’) : 7
(’lo’, ’w’) : 7
...

Rico Sennrich Neural Machine Translation 52 / 65

Page 68: MTMA16-Neural Machine Translation

Byte pair encoding for word segmentation

word segmentation with byte-pair encoding [Sennrich et al., 2015b]
actually a merge algorithm, starting from characters

iteratively replace most frequent pair of symbols (’A’,’B’) with ’AB’

apply on dictionary, not on full text (for efficiency)

output vocabulary: original vocabulary + one symbol per merge

’lo w </w>’ : 5
’lo w e r </w>’ : 2
’n e w est</w>’ : 6
’w i d est</w>’ : 3

(’e’, ’s’) : 9
(’es’, ’t’) : 9
(’est’, ’</w>’) : 9
(’l’, ’o’) : 7
(’lo’, ’w’) : 7
...

Rico Sennrich Neural Machine Translation 52 / 65

Page 69: MTMA16-Neural Machine Translation

Byte pair encoding for word segmentation

word segmentation with byte-pair encoding [Sennrich et al., 2015b]
actually a merge algorithm, starting from characters

iteratively replace most frequent pair of symbols (’A’,’B’) with ’AB’

apply on dictionary, not on full text (for efficiency)

output vocabulary: original vocabulary + one symbol per merge

’low </w>’ : 5
’low e r </w>’ : 2
’n e w est</w>’ : 6
’w i d est</w>’ : 3

(’e’, ’s’) : 9
(’es’, ’t’) : 9
(’est’, ’</w>’) : 9
(’l’, ’o’) : 7
(’lo’, ’w’) : 7
...

Rico Sennrich Neural Machine Translation 52 / 65

Page 70: MTMA16-Neural Machine Translation

Byte pair encoding for word segmentation

why BPE?
good trade-off between vocabulary size and text length

segmentation           # tokens   # types     # UNK
none                   100 m      1 750 000   1079
characters             550 m      3000        0
BPE (60k operations)   112 m      63 000      0

learned operations can be applied to unknown words
→ open-vocabulary

’l o w e s t </w>’

(’e’, ’s’) : 9
(’es’, ’t’) : 9
(’est’, ’</w>’) : 9
(’l’, ’o’) : 7
(’lo’, ’w’) : 7
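A sketch of applying learned merge operations to a new (possibly unseen) word, in the order they were learned, as the following slides illustrate step by step (simplified code, not the reference implementation):

def apply_bpe(word, merges):
    """Segment a word given as 'l o w e s t </w>' with learned merges."""
    symbols = word.split()
    for a, b in merges:                      # apply merges in learned order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]   # replace the pair with the merged symbol
            else:
                i += 1
    return symbols

merges = [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]
print(apply_bpe('l o w e s t </w>', merges))   # ['low', 'est</w>']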

Rico Sennrich Neural Machine Translation 53 / 65

Page 71: MTMA16-Neural Machine Translation

Byte pair encoding for word segmentation

why BPE?
good trade-off between vocabulary size and text length

segmentation           # tokens   # types     # UNK
none                   100 m      1 750 000   1079
characters             550 m      3000        0
BPE (60k operations)   112 m      63 000      0

learned operations can be applied to unknown words
→ open-vocabulary

’l o w es t </w>’

(’e’, ’s’) : 9
(’es’, ’t’) : 9
(’est’, ’</w>’) : 9
(’l’, ’o’) : 7
(’lo’, ’w’) : 7

Rico Sennrich Neural Machine Translation 53 / 65

Page 72: MTMA16-Neural Machine Translation

Byte pair encoding for word segmentation

why BPE?
good trade-off between vocabulary size and text length

segmentation           # tokens   # types     # UNK
none                   100 m      1 750 000   1079
characters             550 m      3000        0
BPE (60k operations)   112 m      63 000      0

learned operations can be applied to unknown words
→ open-vocabulary

’l o w est </w>’

(’e’, ’s’) : 9
(’es’, ’t’) : 9
(’est’, ’</w>’) : 9
(’l’, ’o’) : 7
(’lo’, ’w’) : 7

Rico Sennrich Neural Machine Translation 53 / 65

Page 73: MTMA16-Neural Machine Translation

Byte pair encoding for word segmentation

why BPE?
good trade-off between vocabulary size and text length

segmentation           # tokens   # types     # UNK
none                   100 m      1 750 000   1079
characters             550 m      3000        0
BPE (60k operations)   112 m      63 000      0

learned operations can be applied to unknown words
→ open-vocabulary

’l o w est</w>’

(’e’, ’s’) : 9
(’es’, ’t’) : 9
(’est’, ’</w>’) : 9
(’l’, ’o’) : 7
(’lo’, ’w’) : 7

Rico Sennrich Neural Machine Translation 53 / 65

Page 74: MTMA16-Neural Machine Translation

Byte pair encoding for word segmentation

why BPE?
good trade-off between vocabulary size and text length

segmentation           # tokens   # types     # UNK
none                   100 m      1 750 000   1079
characters             550 m      3000        0
BPE (60k operations)   112 m      63 000      0

learned operations can be applied to unknown words
→ open-vocabulary

’lo w est</w>’

(’e’, ’s’) : 9
(’es’, ’t’) : 9
(’est’, ’</w>’) : 9
(’l’, ’o’) : 7
(’lo’, ’w’) : 7

Rico Sennrich Neural Machine Translation 53 / 65

Page 75: MTMA16-Neural Machine Translation

Byte pair encoding for word segmentation

why BPE?
good trade-off between vocabulary size and text length

segmentation           # tokens   # types     # UNK
none                   100 m      1 750 000   1079
characters             550 m      3000        0
BPE (60k operations)   112 m      63 000      0

learned operations can be applied to unknown words
→ open-vocabulary

’low est</w>’

(’e’, ’s’) : 9
(’es’, ’t’) : 9
(’est’, ’</w>’) : 9
(’l’, ’o’) : 7
(’lo’, ’w’) : 7

Rico Sennrich Neural Machine Translation 53 / 65

Page 76: MTMA16-Neural Machine Translation

Byte pair encoding for word segmentation

system                  sentence
source                  health research institutes
reference               Gesundheitsforschungsinstitute
copy unknown words      Forschungsinstitute
character bigrams       Fo|rs|ch|un|gs|in|st|it|ut|io|ne|n
BPE (joint vocabulary)  Gesundheits|forsch|ungsin|stitute

source                  asinine situation
reference               dumme Situation
copy unknown words      asinine situation → UNK → asinine
character bigrams       as|in|in|e situation → As|in|en|si|tu|at|io|n
BPE (joint vocabulary)  as|in|ine situation → As|in|in-|Situation

Table: English→German translation example. “|” marks subword boundaries.

system                  sentence
source                  Mirzayeva
reference               Мирзаева (Mirzaeva)
copy unknown words      Mirzayeva → UNK → Mirzayeva
character bigrams       Mi|rz|ay|ev|a → Ми|рз|ае|ва (Mi|rz|ae|va)
BPE (joint vocabulary)  Mir|za|yeva → Мир|за|ева (Mir|za|eva)

Table: English→Russian translation example. “|” marks subword boundaries.

Rico Sennrich Neural Machine Translation 54 / 65

Page 77: MTMA16-Neural Machine Translation

Architecture variants

convolution network as encoder [Kalchbrenner and Blunsom, 2013]

TreeLSTM as encoder [Eriguchi et al., 2016]

modifications to attention mechanism [Luong et al., 2015a, Feng et al., 2016, Tu et al., 2016, Mi et al., 2016]
→ goal of better modelling distortion, coverage, etc.

reward symmetry between source-to-target and target-to-source attention [Cohn et al., 2016, Cheng et al., 2015]

Rico Sennrich Neural Machine Translation 55 / 65

Page 78: MTMA16-Neural Machine Translation

Sequence-level training

problem: at training time, target-side history is reliable; at test time, it is not.

solution: instead of using gold context, sample from the model to obtain target context [Shen et al., 2015, Ranzato et al., 2015, Bengio et al., 2015]

more efficient cross-entropy training remains in use to initialize weights

Rico Sennrich Neural Machine Translation 56 / 65

Page 79: MTMA16-Neural Machine Translation

Training data: monolingual

Why train on monolingual data?
cheaper to create/collect

parallel data is scarce for many language pairs

domain adaptation with in-domain monolingual data

Rico Sennrich Neural Machine Translation 57 / 65

Page 80: MTMA16-Neural Machine Translation

Training data: monolingual

Solutions/1
shallow fusion: rerank with language model [Gülçehre et al., 2015]

deep fusion: extra, LM-specific hidden layer [Gülçehre et al., 2015]

[Figure 1 from Gülçehre et al. (2015): graphical illustrations of the proposed fusion methods, (a) shallow fusion and (b) deep fusion]

[Gülçehre et al., 2015]

Rico Sennrich Neural Machine Translation 58 / 65

Page 81: MTMA16-Neural Machine Translation

Training data: monolingual

Solutions/2
decoder is already a language model. Train encoder-decoder with added monolingual data [Sennrich et al., 2015a]

ti = tanh(Uo si−1 + Vo Ey yi−1 + Co ci)

yi = softmax(Wo ti)

how do we get approximation of context vector ci?
dummy source context (moderately effective)
automatically back-translate monolingual data into source language

name                                     2014   2015
PBSMT [Haddow et al., 2015]              28.8   29.3
NMT [Gülçehre et al., 2015]              23.6   -
shallow fusion [Gülçehre et al., 2015]   23.7   -
deep fusion [Gülçehre et al., 2015]      24.0   -
NMT baseline                             25.9   26.7
+back-translated monolingual data        29.5   30.4

Table: DE→EN translation performance (BLEU) on WMT training/test sets.

Rico Sennrich Neural Machine Translation 59 / 65

Page 82: MTMA16-Neural Machine Translation

Training data: multilingual

Multi-source translation [Zoph and Knight, 2016]
we can condition on multiple input sentences

[Figure 2 from Zoph and Knight (2016): multi-source encoder-decoder model for MT, with two source sentences in different languages, one encoder per language, and combiners that merge the encoders’ final hidden and cell states]

benefits:
one source text may contain information that is underspecified in the other
→ possible quality gains

drawbacks:
we need multiple source sentences at training and decoding time

Rico Sennrich Neural Machine Translation 60 / 65

Page 83: MTMA16-Neural Machine Translation

Training data: multilingual

Multilingual models [Dong et al., 2015, Firat et al., 2016]
we can share layers of the model across language pairs

[Figure 2 from Dong et al. (2015): multi-task learning framework for multiple-target language translation, with a shared encoder and a separate decoder per target language]

benefits:
transfer learning from one language pair to the other
→ possible quality gains, especially for low-resourced language pairs
scalability: do we need N² − N independent models for N languages?
→ sharing of parameters allows linear growth

drawbacks:
generalization to language pairs with no training data untested

Rico Sennrich Neural Machine Translation 61 / 65

Page 84: MTMA16-Neural Machine Translation

Training data: other tasks

Multi-task models [Luong et al., 2016]
other tasks can be modelled with sequence-to-sequence models

we can share layers between translation and other tasks

[Figure 1 from Luong et al. (2016): sequence-to-sequence learning examples, machine translation (Sutskever et al., 2014) and constituent parsing (Vinyals et al., 2015a)]

Rico Sennrich Neural Machine Translation 62 / 65

Page 85: MTMA16-Neural Machine Translation

NMT as a component in log-linear models

Log-linear models
model ensembling is well-established

reranking output of phrase-based/syntax-based systems with NMT [Neubig et al., 2015]

incorporating NMT as a feature function into PBSMT [Junczys-Dowmunt et al., 2016]
→ results depend on relative performance of PBSMT and NMT

log-linear combination of different neural models
left-to-right and right-to-left [Liu et al., 2016]
source-to-target and target-to-source [Li and Jurafsky, 2016]

Rico Sennrich Neural Machine Translation 63 / 65

Page 86: MTMA16-Neural Machine Translation

Some future directions for (neural) MT research

(better) solutions to new(ish) problems
OOVs, coverage, efficiency...

new solutions to old problems
consider context beyond sentence boundary
reward semantic adequacy of translation
deal with underspecified input
   Chinese tense
   Spanish zero pronouns
   English politeness

new opportunities
one model for many language pairs?
tight integration with other NLP tasks

Rico Sennrich Neural Machine Translation 64 / 65

Page 87: MTMA16-Neural Machine Translation

Lab session this afternoon

hands-on session with loose guidance

theano (+CUDA) installation session

train your own NMT model
https://github.com/rsennrich/wmt16-scripts

try decoding with existing model
http://statmt.org/rsennrich/wmt16_systems/

Rico Sennrich Neural Machine Translation 65 / 65

Page 88: MTMA16-Neural Machine Translation

Thank you!

Rico Sennrich Neural Machine Translation 66 / 65

Page 89: MTMA16-Neural Machine Translation

Bibliography I

Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. (2015). Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. CoRR, abs/1506.03099.

Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A Neural Probabilistic Language Model. J. Mach. Learn. Res., 3:1137–1155.

Cheng, Y., Shen, S., He, Z., He, W., Wu, H., Sun, M., and Liu, Y. (2015). Agreement-based Joint Training for Bidirectional Attention-based Neural Machine Translation. CoRR, abs/1512.04650.

Cho, K. (2015). Natural Language Understanding with Distributed Representation. CoRR, abs/1511.07916.

Cho, K., Courville, A., and Bengio, Y. (2015). Describing Multimedia Content using Attention-based Encoder-Decoder Networks.

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.

Rico Sennrich Neural Machine Translation 67 / 65

Page 90: MTMA16-Neural Machine Translation

Bibliography II

Chung, J., Cho, K., and Bengio, Y. (2016). A Character-level Decoder without Explicit Segmentation for Neural Machine Translation. CoRR, abs/1603.06147.

Cohn, T., Hoang, C. D. V., Vymolova, E., Yao, K., Dyer, C., and Haffari, G. (2016). Incorporating Structural Alignment Biases into an Attentional Neural Translation Model. CoRR, abs/1601.01085.

Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., and Makhoul, J. (2014). Fast and Robust Neural Network Joint Models for Statistical Machine Translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1370–1380, Baltimore, Maryland. Association for Computational Linguistics.

Dong, D., Wu, H., He, W., Yu, D., and Wang, H. (2015). Multi-Task Learning for Multiple Language Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1723–1732, Beijing, China. Association for Computational Linguistics.

Eriguchi, A., Hashimoto, K., and Tsuruoka, Y. (2016). Tree-to-Sequence Attentional Neural Machine Translation. ArXiv e-prints.

Feng, S., Liu, S., Li, M., and Zhou, M. (2016). Implicit Distortion and Fertility Models for Attention-based Encoder-Decoder NMT Model. CoRR, abs/1601.03317.


Page 91: MTMA16-Neural Machine Translation

Bibliography III

Firat, O., Cho, K., and Bengio, Y. (2016). Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism. ArXiv e-prints.

Gülçehre, Ç., Ahn, S., Nallapati, R., Zhou, B., and Bengio, Y. (2016). Pointing the Unknown Words. CoRR, abs/1603.08148.

Gülçehre, Ç., Firat, O., Xu, K., Cho, K., Barrault, L., Lin, H., Bougares, F., Schwenk, H., and Bengio, Y. (2015). On Using Monolingual Corpora in Neural Machine Translation. CoRR, abs/1503.03535.

Haddow, B., Huck, M., Birch, A., Bogoychev, N., and Koehn, P. (2015). The Edinburgh/JHU Phrase-based Machine Translation Systems for WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 126–133, Lisbon, Portugal. Association for Computational Linguistics.

Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2015). On Using Very Large Target Vocabulary for Neural Machine Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10, Beijing, China. Association for Computational Linguistics.

Junczys-Dowmunt, M., Dwojak, T., and Sennrich, R. (2016). The AMU-UEDIN Submission to the WMT16 News Translation Task: Attention-based NMT Models as Feature Functions in Phrase-based SMT. In First Conference on Machine Translation (WMT16), Berlin, Germany.


Page 92: MTMA16-Neural Machine Translation

Bibliography IV

Kalchbrenner, N. and Blunsom, P. (2013). Recurrent Continuous Translation Models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle. Association for Computational Linguistics.

Li, J. and Jurafsky, D. (2016). Mutual Information and Diverse Decoding Improve Neural Machine Translation. CoRR, abs/1601.00372.

Ling, W., Trancoso, I., Dyer, C., and Black, A. W. (2015). Character-based Neural Machine Translation. ArXiv e-prints.

Liu, L., Utiyama, M., Finch, A., and Sumita, E. (2016). Agreement on Target-bidirectional Neural Machine Translation. In NAACL HLT 2016, San Diego, CA.

Luong, M., Le, Q. V., Sutskever, I., Vinyals, O., and Kaiser, L. (2016). Multi-task Sequence to Sequence Learning. In ICLR 2016.

Luong, M.-T. and Manning, C. D. (2016). Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models. ArXiv e-prints.

Luong, T., Pham, H., and Manning, C. D. (2015a). Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.


Page 93: MTMA16-Neural Machine Translation

Bibliography V

Luong, T., Sutskever, I., Le, Q., Vinyals, O., and Zaremba, W. (2015b). Addressing the Rare Word Problem in Neural Machine Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 11–19, Beijing, China. Association for Computational Linguistics.

Mi, H., Sankaran, B., Wang, Z., and Ittycheriah, A. (2016). A Coverage Embedding Model for Neural Machine Translation. ArXiv e-prints.

Neubig, G., Morishita, M., and Nakamura, S. (2015). Neural Reranking Improves Subjective Quality of Machine Translation: NAIST at WAT2015. In Proceedings of the 2nd Workshop on Asian Translation (WAT2015), pages 35–41, Kyoto, Japan.

Ranzato, M., Chopra, S., Auli, M., and Zaremba, W. (2015). Sequence Level Training with Recurrent Neural Networks. CoRR, abs/1511.06732.

Schwenk, H., Dechelotte, D., and Gauvain, J.-L. (2006). Continuous Space Language Models for Statistical Machine Translation. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 723–730, Sydney, Australia.

Sennrich, R., Haddow, B., and Birch, A. (2015a). Improving Neural Machine Translation Models with Monolingual Data. ArXiv e-prints.

Sennrich, R., Haddow, B., and Birch, A. (2015b). Neural Machine Translation of Rare Words with Subword Units. CoRR, abs/1508.07909.


Page 94: MTMA16-Neural Machine Translation

Bibliography VI

Shen, S., Cheng, Y., He, Z., He, W., Wu, H., Sun, M., and Liu, Y. (2015). Minimum Risk Training for Neural Machine Translation. CoRR, abs/1512.02433.

Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, pages 3104–3112, Montreal, Quebec, Canada.

Tu, Z., Lu, Z., Liu, Y., Liu, X., and Li, H. (2016). Coverage-based Neural Machine Translation. CoRR, abs/1601.04811.

Vaswani, A., Zhao, Y., Fossum, V., and Chiang, D. (2013). Decoding with Large-Scale Neural Language Models Improves Translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, pages 1387–1392, Seattle, Washington, USA.

Zoph, B. and Knight, K. (2016). Multi-Source Neural Translation. In NAACL HLT 2016.


Page 95: MTMA16-Neural Machine Translation

Hands-on session: install theano

theano depends on bleeding-edge numpy
my suggestion: to avoid version conflicts, install in a Python virtual environment

pip install --user virtualenv                        # install virtualenv
virtualenv virtual_environment                       # create environment
source virtual_environment/bin/activate              # start environment
pip install numpy numexpr cython tables theano ipdb  # install theano and its Python dependencies

you may need to install a BLAS library and other dependencies
on Debian Linux: sudo apt-get install liblapack-dev libblas-dev gfortran libhdf5-serial-dev

if you have an NVIDIA GPU, you should install CUDA (if you don't, training will be too slow)
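As a quick sanity check once installation is done (a minimal sketch, not part of the official instructions), compile a tiny symbolic function and see which device Theano is configured to use. Training scripts typically select the GPU via the THEANO_FLAGS environment variable (e.g. device=gpu,floatX=float32).

# minimal Theano smoke test: compile a tiny function and report the device
import theano
import theano.tensor as T

x = T.vector('x')
f = theano.function([x], 2 * x)

print(theano.config.device)    # 'gpu...' if CUDA is picked up, otherwise 'cpu'
print(f([1.0, 2.0, 3.0]))      # -> [ 2.  4.  6.]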


Page 96: MTMA16-Neural Machine Translation

Hands-on session: train your own model

sample scripts and config at https://github.com/rsennrich/wmt16-scripts

requirements:
mosesdecoder (just for preprocessing, no installation required)
git clone https://github.com/moses-smt/mosesdecoder
subword-nmt (for BPE segmentation; a short illustration follows below)
git clone https://github.com/rsennrich/subword-nmt
Nematus (DL4MT fork; for training NMT)
git clone https://www.github.com/rsennrich/nematus

set paths in shell scripts, then execute preprocess.sh, then train.sh

to train an actual model, use more data, and be patient
the script prints a status update after every 1000 minibatches → ≈ 30 min if CUDA is set up properly
(we train for 300000–600000 minibatches)
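Both hands-on tasks rely on subword-nmt for BPE segmentation. As a rough illustration of what that preprocessing step does (a sketch of the idea, not the subword-nmt implementation): each word is split into characters and the learned merge operations are applied in order, so frequent subwords are re-joined and rare words end up as smaller units. The merge list below is made up for the example.

# sketch of applying BPE merges to a single word (illustration only)
def apply_bpe(word, merges):
    symbols = list(word) + ['</w>']      # characters plus end-of-word marker
    for pair in merges:                  # merges: learned (left, right) pairs, in order
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i+1]) == pair:
                merged.append(symbols[i] + symbols[i+1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# hypothetical merge list; a real model learns these from the training data
print(apply_bpe('lower', [('l', 'o'), ('lo', 'w'), ('r', '</w>'), ('e', 'r</w>')]))
# -> ['low', 'er</w>']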


Page 97: MTMA16-Neural Machine Translation

Hands-on session: use pre-trained model

download model(s) from http://statmt.org/rsennrich/wmt16_systems/

requirements:
mosesdecoder (just for preprocessing, no installation required)
git clone https://github.com/moses-smt/mosesdecoder
subword-nmt (for BPE segmentation)
git clone https://github.com/rsennrich/subword-nmt
Nematus (DL4MT fork; for decoding)
git clone https://www.github.com/rsennrich/nematus

set paths in translate.sh, then execute:
echo "This is a test." | ./translate.sh
