Transcript of: Machine Translation – 04: Neural Machine Translation (homepages.inf.ed.ac.uk/rsennric/mt18/4.pdf)
Machine Translation 04: Neural Machine Translation
Rico Sennrich
University of Edinburgh
R. Sennrich MT – 2018 – 04 1 / 20
Overview

Last lecture:
- how do we represent language in neural networks?
- how do we treat language probabilistically (with neural networks)?

Today's lecture:
- how do we model translation with a neural network?
- how do we generate text from a probabilistic translation model?
Modelling Translation

Suppose that we have:
- a source sentence S of length m: (x_1, ..., x_m)
- a target sentence T of length n: (y_1, ..., y_n)

We can express translation as a probabilistic model:

T* = argmax_T p(T|S)

Expanding using the chain rule gives

p(T|S) = p(y_1, ..., y_n | x_1, ..., x_m) = ∏_{i=1}^{n} p(y_i | y_1, ..., y_{i−1}, x_1, ..., x_m)
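The chain-rule factorisation above can be sketched directly in code: the score of a full translation is just the sum of per-token log-probabilities. (A minimal NumPy sketch; the per-token probabilities would come from the model.)

```python
import numpy as np

def sequence_logprob(step_probs):
    """Chain-rule scoring: log p(T|S) = sum_i log p(y_i | y_<i, x_1..x_m).

    step_probs: one probability per target position, i.e. the model's
    p(y_i | y_1..y_{i-1}, x_1..x_m) for each generated token y_i.
    """
    return float(np.sum(np.log(step_probs)))
```

For example, a two-token translation with per-token probabilities 0.5 and 0.5 scores log 0.25. Summing log-probabilities also avoids the numerical underflow that multiplying raw probabilities would cause for long sentences.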
Differences Between Translation and Language Model

Target-side language model:

p(T) = ∏_{i=1}^{n} p(y_i | y_1, ..., y_{i−1})

Translation model:

p(T|S) = ∏_{i=1}^{n} p(y_i | y_1, ..., y_{i−1}, x_1, ..., x_m)

We could just treat the sentence pair as one long sequence, but:
- we do not care about p(S)
- we may want a different vocabulary and network architecture for the source text

→ Use separate RNNs for source and target.
Encoder-Decoder for Translation

[Figure: the encoder RNN reads the source "natürlich hat john spaß" (inputs x1–x4, hidden states h1–h4); its final state initialises the decoder RNN (states s1–s5), which generates the target "of course john has fun" (outputs y1–y5).]
Summary vector

The last encoder hidden state "summarises" the source sentence.

With multilingual training, we can potentially learn a language-independent meaning representation. [Sutskever et al., 2014]
Summary vector as information bottleneck

Problem: sentence length
- a fixed-size representation degrades as sentence length increases [Cho et al., 2014]
- reversing the source brings some improvement [Sutskever et al., 2014]

Solution: attention
- compute a context vector as a weighted average of the source hidden states
- weights are computed by a feed-forward network with a softmax activation
Encoder-Decoder with Attention

[Figure: bidirectional encoder states h1–h4 over the source "natürlich hat john spaß"; decoder states s1–s5 generate "of course john has fun". At each decoder step, attention weights (e.g. 0.7, 0.1, 0.1, 0.1, shifting to a different source position for each target word) form a weighted average of the encoder states that is fed to the decoder.]
Attentional encoder-decoder: Maths

Simplifications of the model of [Bahdanau et al., 2015] (for illustration):
- plain RNN instead of GRU
- simpler output layer
- we do not show bias terms
- the decoder follows a Look, Update, Generate strategy [Sennrich et al., 2017]

Details in https://github.com/amunmt/amunmt/blob/master/contrib/notebooks/dl4mt.ipynb

Notation:
- W, U, E, C, V are weight matrices (of different dimensionality):
  - E: one-hot to embedding (e.g. 50000 × 512)
  - W: embedding to hidden (e.g. 512 × 1024)
  - U: hidden to hidden (e.g. 1024 × 1024)
  - C: context (2× hidden) to hidden (e.g. 2048 × 1024)
  - V_o: hidden to one-hot (e.g. 1024 × 50000)
- separate weight matrices for encoder and decoder (e.g. E_x and E_y)
- input X of length T_x; output Y of length T_y
Attentional encoder-decoder: Maths

Encoder:

→h_j = 0                                        if j = 0
→h_j = tanh(→W_x E_x x_j + →U_x →h_{j−1})       if j > 0

←h_j = 0                                        if j = T_x + 1
←h_j = tanh(←W_x E_x x_j + ←U_x ←h_{j+1})       if j ≤ T_x

h_j = (→h_j, ←h_j)
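The bidirectional encoder above can be sketched in a few lines of NumPy (a plain tanh RNN, as in the simplified equations; matrix names mirror the notation, and all dimensions here are illustrative assumptions):

```python
import numpy as np

def bidirectional_encode(X, E, Wf, Uf, Wb, Ub):
    """Sketch of the bidirectional RNN encoder (plain tanh RNN).

    X:      list of source token indices x_1..x_Tx
    E:      embedding matrix (vocab x d_emb)
    Wf, Uf: forward weight matrices  (d_hid x d_emb, d_hid x d_hid)
    Wb, Ub: backward weight matrices (same shapes)
    Returns h_j = concatenation of forward and backward states.
    """
    d = Uf.shape[0]
    fwd, h = [], np.zeros(d)
    for x in X:                                 # left to right
        h = np.tanh(Wf @ E[x] + Uf @ h)
        fwd.append(h)
    bwd, h = [], np.zeros(d)
    for x in reversed(X):                       # right to left
        h = np.tanh(Wb @ E[x] + Ub @ h)
        bwd.append(h)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```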
Attentional encoder-decoder: Maths

Decoder:

s_0 = tanh(W_s ←h_1)
s_i = tanh(W_y E_y y_{i−1} + U_y s_{i−1} + C c_i)    for i > 0

t_i = tanh(U_o s_i + W_o E_y y_{i−1} + C_o c_i)
y_i = softmax(V_o t_i)

Attention model:

e_{ij} = v_a⊤ tanh(W_a s_{i−1} + U_a h_j)
α_{ij} = softmax(e_{ij})
c_i = ∑_{j=1}^{T_x} α_{ij} h_j
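One step of the attention computation above translates almost line-for-line into NumPy (a sketch under the simplified notation; W_a, U_a, v_a match the equations, and all dimensions are illustrative):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(s_prev, H, W_a, U_a, v_a):
    """One step of additive attention [Bahdanau et al., 2015].

    s_prev: previous decoder state s_{i-1}, shape (d_dec,)
    H:      encoder states h_1..h_Tx, shape (Tx, d_enc)
    Returns the context vector c_i and the weights alpha_i.
    """
    # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), computed for all j at once
    e = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a   # shape (Tx,)
    alpha = softmax(e)                               # attention weights
    c = alpha @ H                                    # weighted average of encoder states
    return c, alpha
```

The weights alpha sum to one, so c_i is a convex combination of the encoder states, as the equation for c_i requires.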
Attention model
- Side effect: we obtain an alignment between the source and target sentence.
- Information can also flow along the recurrent connections, so there is no guarantee that attention corresponds to alignment.
- Applications: visualisation; replacing unknown words with a back-off dictionary [Jean et al., 2015]; ...

Kyunghyun Cho: http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/
Attention model

The attention model also works with images: [Cho et al., 2015]
Application of the Encoder-Decoder Model

Scoring (a translation):

p(La, croissance, économique, s'est, ralentie, ces, dernières, années, . | Economic, growth, has, slowed, down, in, recent, years, .) = ?

Decoding (a source sentence):

Generate the most probable translation of a source sentence:

y* = argmax_y p(y | Economic, growth, has, slowed, down, in, recent, years, .)
Decoding

Exact search:
- generate every possible sentence T in the target language
- compute the score p(T|S) for each
- pick the best one

This is intractable: there are |vocab|^N possible translations for output length N.
→ we need an approximate search strategy
Decoding

Approximate search (1): greedy search
- at each time step, compute the probability distribution p(y_i | S, y_<i)
- select y_i according to some heuristic:
  - sampling: sample from p(y_i | S, y_<i)
  - greedy search: pick argmax_{y_i} p(y_i | S, y_<i)
- continue until we generate <eos>

[Figure: greedy decoding of "hello world ! <eos>", showing each token's probability (0.946, 0.957, 0.928, 0.999) and the cumulative cost]

- efficient, but suboptimal
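Greedy search is a short loop. In the sketch below, `step` is a hypothetical interface to the decoder that returns a distribution over the target vocabulary and the updated decoder state:

```python
import numpy as np

def greedy_decode(step, start_state, eos_id, max_len=50):
    """Greedy search: at each time step pick the most probable token.

    step(state, prev_token) is assumed to return (probs, new_state),
    where probs is the distribution p(y_i | S, y_<i).
    """
    tokens, state, prev = [], start_state, None
    for _ in range(max_len):
        probs, state = step(state, prev)
        prev = int(np.argmax(probs))      # greedy choice
        tokens.append(prev)
        if prev == eos_id:                # stop at end-of-sentence
            break
    return tokens
```

The loop commits to each token immediately, which is why greedy search can miss a translation whose first token looks slightly worse but whose continuation is much better.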
Decoding

Approximate search (2): beam search
- maintain a list of K hypotheses (the beam)
- at each time step, expand each hypothesis k: p(y_i^k | S, y_<i^k)
- select the K hypotheses with the highest total probability: ∏_i p(y_i^k | S, y_<i^k)

[Figure: beam search with K = 3 over the hypothesis space around "hello world ! <eos>", showing per-token probabilities and cumulative costs for the competing beams]

- relatively efficient: beam expansion is parallelisable
- currently the default search strategy in neural machine translation
- a small beam (K ≈ 10) offers a good speed-quality trade-off
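A minimal beam-search sketch, again with a hypothetical `step(state, prev_token)` decoder interface; scores are summed log-probabilities, and finished hypotheses (those that emitted `<eos>`) are collected separately:

```python
import numpy as np

def beam_search(step, start_state, eos_id, K=3, max_len=50):
    """Beam search: keep the K highest-scoring partial hypotheses.

    step(state, prev_token) returns (probs, new_state); hypothesis
    scores are summed log-probabilities of their tokens.
    """
    beams = [(0.0, [], start_state, None)]     # (score, tokens, state, prev)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, toks, state, prev in beams:
            probs, new_state = step(state, prev)
            # expand with the K best continuations of this hypothesis
            for tok in np.argsort(probs)[-K:]:
                candidates.append((score + np.log(probs[tok] + 1e-12),
                                   toks + [int(tok)], new_state, int(tok)))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:K]:            # keep the global top K
            (finished if cand[3] == eos_id else beams).append(cand)
        if not beams:                          # all survivors finished
            break
    finished.extend(beams)                     # unfinished hypotheses, if any
    return max(finished, key=lambda c: c[0])[1]
```

With K = 1 this reduces to greedy search; larger beams trade speed for the chance to recover from a locally suboptimal first choice.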
Ensembles
- combine the decisions of multiple classifiers by voting
- an ensemble will reduce error if these conditions are met:
  - the base classifiers are accurate
  - the base classifiers are diverse (they make different errors)
Ensembles in NMT
- vote at each time step to explore the same search space
  (better than decoding with one model and reranking its n-best list with the others)
- voting mechanism: typically the average (log-)probability:

  log P(y_i|S, y_<i) = (1/M) ∑_{m=1}^{M} log P_m(y_i|S, y_<i)

- requirements for voting at each time step:
  - same output vocabulary
  - same factorization of Y
  - but: the internal network architecture may differ
- we still use reranking in some situations, for example to combine left-to-right decoding and right-to-left decoding
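The voting rule above, averaging log-probabilities over M models at each time step, is a one-liner (a sketch; `model_probs` stacks each model's distribution for the current step):

```python
import numpy as np

def ensemble_logprobs(model_probs):
    """Average per-token log-probabilities over M models.

    model_probs: array of shape (M, vocab) holding each model's
    distribution p_m(y_i | S, y_<i) at the current time step.
    Returns the ensemble log-probabilities over the vocabulary.
    """
    return np.mean(np.log(np.asarray(model_probs)), axis=0)
```

The resulting vector can be plugged directly into greedy or beam search in place of a single model's log-probabilities, since all models share the output vocabulary.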
Further Reading

Required reading:
- Koehn, chapter 13.5

Optional reading:
- Sequence to Sequence Learning with Neural Networks (Sutskever, Vinyals, Le): https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
- Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau, Cho, Bengio): https://arxiv.org/pdf/1409.0473.pdf
Bibliography

Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Cho, K., Courville, A., and Bengio, Y. (2015). Describing Multimedia Content using Attention-based Encoder-Decoder Networks.

Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. (2014). On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar. Association for Computational Linguistics.

Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2015). On Using Very Large Target Vocabulary for Neural Machine Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10, Beijing, China. Association for Computational Linguistics.

Junczys-Dowmunt, M. and Grundkiewicz, R. (2016). Log-linear Combinations of Monolingual and Bilingual Neural Machine Translation Models for Automatic Post-Editing. In Proceedings of the First Conference on Machine Translation, pages 751–758, Berlin, Germany. Association for Computational Linguistics.

Sennrich, R., Firat, O., Cho, K., Birch, A., Haddow, B., Hitschler, J., Junczys-Dowmunt, M., Läubli, S., Miceli Barone, A. V., Mokry, J., and Nadejde, M. (2017). Nematus: a Toolkit for Neural Machine Translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65–68, Valencia, Spain.

Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112, Montreal, Quebec, Canada.