Transcript of "Revisiting Challenges in Neural Machine Translation"
homepages.inf.ed.ac.uk/rsennric/cwmt.pdf

Revisiting Challenges in Neural Machine Translation

Rico Sennrich

University of Edinburgh

Rico Sennrich Revisiting Challenges in NMT 1 / 49


Why Revisit Challenges Regularly?

guide research directions
NMT facts have expiration date

signpost: Ian Harding (CC BY-NC-SA 2.0)

Rico Sennrich Revisiting Challenges in NMT 1 / 49


Revisiting Challenges in NMT

1 Some Challenges in Neural MT: Long Sentences, Adequacy, Low-Resource Translation

2 Challenges Revisited: Long Sentences, Adequacy, Low-Resource Translation

3 Future Challenges

Rico Sennrich Revisiting Challenges in NMT 2 / 49


Encoder–Decoder Has Information Bottleneck

[Cho et al., 2014]

Rico Sennrich Revisiting Challenges in NMT 3 / 49


Attention Brings Improvement

[Bahdanau et al., 2015]

Rico Sennrich Revisiting Challenges in NMT 4 / 49


Still, Poor Performance Reported for Long Sentences

[Figure: BLEU scores with varying sentence length (source, subword count), neural vs. phrase-based. Caption (Figure 7, Koehn and Knowles 2017): Quality of translations based on sentence length. SMT outperforms NMT for sentences longer than 60 subword tokens. For very long sentences (80+) quality is much worse due to too-short output.]

3.5 Word Alignment

The key contribution of the attention model in neural machine translation (Bahdanau et al., 2015) was the imposition of an alignment of the output words to the input words. This takes the shape of a probability distribution over the input words, which is used to weight them in a bag-of-words representation of the input sentence.

Arguably, this attention model does not functionally play the role of a word alignment between the source and the target, at least not in the same way as its analog in statistical machine translation. While in both cases alignment is a latent variable that is used to obtain probability distributions over words or phrases, arguably the attention model has a broader role. For instance, when translating a verb, attention may also be paid to its subject and object, since these may disambiguate it. To further complicate matters, the word representations are products of bidirectional gated recurrent neural networks, which has the effect that each word representation is informed by the entire sentence context.

But there is a clear need for an alignment mechanism between source and target words. For instance, prior work used the alignments provided by the attention model to interpolate word translation decisions with traditional probabilistic dictionaries (Arthur et al., 2016), for the introduction of coverage and fertility models (Tu et al., 2016), etc.

[Figure 8 (Koehn and Knowles 2017): Word alignment for English–German, comparing the attention model states (green boxes with probability in percent if over 10) with alignments obtained from fast-align (blue outlines), for the sentence pair "relations between Obama and Netanyahu have been strained for years." / "die Beziehungen zwischen Obama und Netanjahu sind seit Jahren angespannt."]

But is the attention model in fact the proper means? To examine this, we compare the soft alignment matrix (the sequence of attention vectors) with word alignments obtained by traditional word alignment methods. We use incremental fast-align (Dyer et al., 2013) to align the input and output of the neural machine translation system.

See Figure 8 for an illustration. We compare the word attention states (green boxes) with the word alignments obtained with fast-align (blue outlines). For most words, these match up pretty well. Both attention states and fast-align alignment points are a bit fuzzy around the function words have-been/sind.

However, the attention model may settle on alignments that do not correspond with our intuition or alignment points obtained with fast-align. See Figure 9 for the reverse language direction, German–English. All the alignment points appear to be off by one position. We are not aware of any intuitive explanation for this divergent behavior; the translation quality is high for both systems.

We measure how well the soft alignments (attention model) of the NMT system match the alignments of fast-align with two metrics:

• a match score that checks for each output word if the aligned input word according to fast-align is indeed the input word that received the highest attention probability, and

• a probability mass score that sums up the probability mass given to the aligned input words.
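To make the two metrics concrete, here is a minimal sketch (illustrative only, not code from the slides or the paper; the function name and the toy inputs are assumptions) that computes both scores from an attention matrix and a set of fast-align links, assuming one aligned source position per target word:

```python
import numpy as np

def alignment_scores(attention, fastalign):
    """attention: (target_len, source_len) matrix of attention weights.
    fastalign: dict mapping target position -> aligned source position,
    as produced by an external aligner such as fast-align.
    Returns (match score, probability mass score), averaged over the
    target positions that have an alignment point."""
    matches, mass = [], []
    for t, s in fastalign.items():
        # match score: does the aligned source word receive the highest attention?
        matches.append(1.0 if np.argmax(attention[t]) == s else 0.0)
        # probability mass score: attention probability on the aligned source word
        mass.append(attention[t, s])
    return float(np.mean(matches)), float(np.mean(mass))

# toy example: 3 target words attending over 3 source words
attention = np.array([[0.8, 0.1, 0.1],
                      [0.2, 0.7, 0.1],
                      [0.3, 0.3, 0.4]])
print(alignment_scores(attention, {0: 0, 1: 1, 2: 2}))  # (1.0, ~0.63)
```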


[Koehn and Knowles, 2017]

Rico Sennrich Revisiting Challenges in NMT 5 / 49


Revisiting Challenges in NMT

1 Some Challenges in Neural MT: Long Sentences, Adequacy, Low-Resource Translation

2 Challenges Revisited: Long Sentences, Adequacy, Low-Resource Translation

3 Future Challenges

Rico Sennrich Revisiting Challenges in NMT 6 / 49


Adequacy vs. Fluency in WMT16 Evaluation

[Figure: WMT16 direct assessment results (adequacy and fluency, 0–100) for ONLINE-B vs. UEDIN-NMT on CS→EN, DE→EN, RO→EN and RU→EN. Adequacy is roughly on par (+1% for UEDIN-NMT); fluency is clearly higher for UEDIN-NMT (+13%).]

Rico Sennrich Revisiting Challenges in NMT 7 / 49


Human Evaluation in TraMOOC [Castilho et al., 2018]

comparison of NMT and PBSMT for EN→{DE,EL,PT,RU}

direct assessment:
NMT obtains higher fluency judgment than PBSMT: +10%
NMT only obtains small improvement in adequacy judgment: +1%

Error Annotation
category                  SMT    NMT    difference
inflectional morphology   2274   1799   -21%
word order                1098    691   -37%
omission                   421    362   -14%
addition                   314    265   -16%
mistranslation            1593   1552    -3%
"no issue"                 449    788   +75%

Rico Sennrich Revisiting Challenges in NMT 8 / 49


Revisiting Challenges in NMT

1 Some Challenges in Neural MT: Long Sentences, Adequacy, Low-Resource Translation

2 Challenges Revisited: Long Sentences, Adequacy, Low-Resource Translation

3 Future Challenges

Rico Sennrich Revisiting Challenges in NMT 9 / 49


Low-Resource Translation

[Figure: BLEU scores with varying amounts of training data (corpus size in English words), comparing phrase-based with big LM, phrase-based, and neural systems. Caption (Figure 3, Koehn and Knowles 2017): BLEU scores for English-Spanish systems trained on 0.4 million to 385.7 million words of parallel data. Quality for NMT starts much lower, outperforms SMT at about 15 million words, and even beats an SMT system with a big 2-billion-word in-domain language model under high-resource conditions.]

How do the data needs of SMT and NMT compare? NMT promises both to generalize better (exploiting word similarity in embeddings) and to condition on larger context (the entire input and all prior output words).

We built English-Spanish systems on WMT data,7 about 385.7 million English words paired with Spanish. To obtain a learning curve, we used 1/1024, 1/512, ..., 1/2, and all of the data. For SMT, the language model was trained on the Spanish part of each subset, respectively. In addition to an NMT and SMT system trained on each subset, we also used all additionally provided monolingual data for a big language model in contrastive SMT systems.

Results are shown in Figure 3. NMT exhibits a much steeper learning curve, starting with abysmal results (BLEU score of 1.6 vs. 16.4 for 1/1024 of the data), outperforming SMT 25.7 vs. 24.7 with 1/16 of the data (24.1 million words), and even beating the SMT system with a big language model with the full data set (31.1 for NMT, 28.4 for SMT, 30.4 for SMT+BigLM).

7 Spanish was last represented in 2013; we used data from http://statmt.org/wmt13/translation-task.html

Src:    A Republican strategy to counter the re-election of Obama
1/1024: Un organo de coordinacion para el anuncio de libre determinacion
1/512:  Lista de una estrategia para luchar contra la eleccion de hojas de Ohio
1/256:  Explosion realiza una estrategia divisiva de luchar contra las elecciones de autor
1/128:  Una estrategia republicana para la eliminacion de la reeleccion de Obama
1/64:   Estrategia siria para contrarrestar la reeleccion del Obama .
1/32:   + Una estrategia republicana para contrarrestar la reeleccion de Obama

Figure 4: Translations of the first sentence of the test set using NMT systems trained on varying amounts of training data. Under low-resource conditions, NMT produces fluent output unrelated to the input.

The contrast between the NMT and SMT learning curves is quite striking. While NMT is able to exploit increasing amounts of training data more effectively, it is unable to get off the ground with training corpus sizes of a few million words or less.

To illustrate this, see Figure 4. With 1/1024 of the training data, the output is completely unrelated to the input; some key words are properly translated with 1/512 and 1/256 of the data (estrategia for strategy, eleccion or elecciones for election), and starting with 1/64 the translations become respectable.

3.3 Rare Words

Conventional wisdom states that neural machine translation models perform particularly poorly on rare words (Luong et al., 2015; Sennrich et al., 2016b; Arthur et al., 2016), due in part to the smaller vocabularies used by NMT systems. We examine this claim by comparing performance on rare word translation between NMT and SMT systems of similar quality for German–English and find that NMT systems actually outperform SMT systems on translation of very infrequent words. However, both NMT and SMT systems do continue to have difficulty translating some infrequent words, particularly those belonging to highly-inflected categories.

For the neural machine translation model, we use a publicly available model8 with the training settings of Edinburgh’s WMT submission (Sennrich et al., 2016a). This was trained using Nematus.

8 https://github.com/rsennrich/wmt16-scripts/


[Koehn and Knowles, 2017]

Rico Sennrich Revisiting Challenges in NMT 10 / 49


Revisiting Challenges in NMT

1 Some Challenges in Neural MT: Long Sentences, Adequacy, Low-Resource Translation

2 Challenges Revisited: Long Sentences, Adequacy, Low-Resource Translation

3 Future Challenges

Rico Sennrich Revisiting Challenges in NMT 11 / 49


Why Are Long Sentences Hard?

different answers:
training on long sentences efficiently is challenging
→ training–test mismatch

locally normalized models have bias towards low-entropy states
→ outputs too short (</s>)

long-distance interactions may be challenging due to network path length (vanishing gradient)

...?

Rico Sennrich Revisiting Challenges in NMT 12 / 49


Long Sentences: Training–Test Mismatch

[Figure 7 from Koehn and Knowles (2017) again: BLEU scores with varying sentence length (source, subword count), neural vs. phrase-based; SMT outperforms NMT for sentences longer than 60 subword tokens.]


[Koehn and Knowles, 2017]

this is the uedin-2016 system, trained with a maximum length of 50 subwords!

Rico Sennrich Revisiting Challenges in NMT 13 / 49


How to Train on Long Sentences

Problem: time and memory increase with the longest sentence in a batch

Solutions
sort sentences of same length together [Sutskever et al., 2014]

adjust batch size depending on length [Johansen et al., 2016]

Table 1. Hyperparameter values used for training the char-to-char model, where Σsrc and Σtrg represent the number of classes in the source and target languages, respectively:
layer                     no. units
input alphabet size (X)   300
embedding sizes           256
char RNN (forward)        400
char RNN (backward)       400
attention                 300
char decoder              400
target alphabet size (T)  300

Table 2. Hyperparameter values used for training the char2word-to-char model (same notation):
layer                     no. units
input alphabet size (X)   300
embedding sizes           256
char RNN (forward)        400
spaces RNN (forward)      400
spaces RNN (backward)     400
attention                 300
char decoder              400
target alphabet size (T)  300

timesteps, which can result in a lot of wasted resources [Hannun et al., 2014] (see figure 4). Training translation models is further complicated by the fact that source and target sentences, while correlated, may have different lengths, and it is necessary to consider both when constructing batches in order to utilize computation power and RAM optimally.

To circumvent this issue, we start each epoch by shuffling all samples in the dataset and sorting them with a stable sorting algorithm according to both the source and target sentence lengths. This ensures that any two samples in the dataset that have almost the same source and target sentence lengths are located close to each other in the sorted list while the exact order of samples varies between epochs. To pack a batch we simply started adding samples from the sorted sample list to the batch, until we reached the maximal total allowed character threshold (which we set to 50,000) for the full batch with padding, after which we would start on a new batch. Finally all the batches are fed in random order to the model for training until all samples have been trained on, and a new epoch begins. Figure 5 illustrates what such dynamic batches might look like.

Fig. 4. A regular batch with random samples.

Fig. 5. Our dynamic batches of variable batch size and sequence length.
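As a rough illustration of the batching scheme just described (not code from the paper; the 50,000-character budget and the sort keys follow the description above, everything else is an assumption), a minimal sketch:

```python
import random

def pack_dynamic_batches(samples, max_chars=50000):
    """samples: list of (source, target) string pairs.
    Packs batches of variable size so that the padded character count
    (longest source + longest target, times batch size) stays under max_chars."""
    # shuffle, then stable-sort by source and target length, so samples of
    # similar lengths end up adjacent but in varying order each epoch
    random.shuffle(samples)
    samples.sort(key=lambda st: (len(st[0]), len(st[1])))

    batches, batch = [], []
    for src, trg in samples:
        candidate = batch + [(src, trg)]
        max_src = max(len(s) for s, _ in candidate)
        max_trg = max(len(t) for _, t in candidate)
        if batch and (max_src + max_trg) * len(candidate) > max_chars:
            batches.append(batch)          # budget exceeded: close the batch
            batch = [(src, trg)]
        else:
            batch = candidate
    if batch:
        batches.append(batch)

    random.shuffle(batches)                # feed batches in random order
    return batches
```

Sorting by length keeps padding small within a batch, while shuffling the packed batches preserves stochasticity across epochs.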

5.3. Results

5.3.1. Quantitative

The quantitative results of our models are illustrated in table 3. Notice that the char2word-to-char model outperforms the char-to-char model on all datasets (average 1.28 BLEU performance increase). This could be an indication that either having hierarchical, word-like representations on the encoder, or simply the fact that the encoder was significantly smaller, helps in NMT when using a character decoder with attention.

[Johansen et al., 2016]

Rico Sennrich Revisiting Challenges in NMT 14 / 49


Training on Long Sentences Matters

[Figure: BLEU score by source sentence length (number of subwords), bucketed from [1,10) up to [91,∞), comparing training with maximum length 50 vs. maximum length 200.]

Rico Sennrich Revisiting Challenges in NMT 15 / 49


Long Sentences: Label Bias

locally normalized models have label bias [Murray and Chiang, 2018]

[Figure 2 (Murray and Chiang 2018): log-probability by decoding timestep for the hypothesis "The British wom en won Olymp ic gold in p airs row ing" vs. the empty translation (</s>). A locally normalized model must determine, at each time step, a "budget" for the total remaining log-probability. In this example, the empty translation has initial position 622 in the beam; already by the third step of decoding, the correct translation has a lower score than the empty translation. However, using greedy search, a nonempty translation would be returned.]

3 Correcting Length

To address the brevity problem, many designers of NMT systems add corrections to the model. These corrections are often presented as modifications to the search procedure. But, in our view, the brevity problem is essentially a modeling problem, and these corrections should be seen as modifications to the model (Section 3.1). Furthermore, since the root of the problem is local normalization, our view is that these modifications should be trained as globally-normalized models (Section 3.2).

3.1 Models
Without any length correction, the standard model score (higher is better) is:

s(e) = Σ_{i=1}^{m} log P(e_i | e_{1:i-1})

To our knowledge, there are three methods in common use for adjusting the model to favor longer sentences.

Length normalization divides the score by m (Koehn and Knowles, 2017; Jean et al., 2015; Boulanger-Lewandowski et al., 2013):

s'(e) = s(e) / m

Google's NMT system (Wu et al., 2016) relies on a more complicated correction:

s'(e) = s(e) / [(5 + m)^α / (5 + 1)^α]

Finally, some systems add a constant word reward (He et al., 2016):

s'(e) = s(e) + γm

If γ = 0, this reduces to the baseline model. The advantage of this simple reward is that it can be computed on partial translations, making it easier to integrate into beam search.

3.2 Training
All of the above modifications can be viewed as modifications to the base model so that it is no longer a locally-normalized probability model.

To train this model, in principle, we should use something like the globally-normalized negative log-likelihood:

L = −log [ exp s'(e*) / Σ_e exp s'(e) ]

where e* is the reference translation. However, optimizing this is expensive, as it requires performing inference on every training example or heuristic approximations (Andor et al., 2016; Shen et al., 2016).

Alternatively, we can adopt a two-tiered model, familiar from phrase-based translation (Och and Ney, 2002), first training s and then training s' while keeping the parameters of s fixed, possibly on a smaller dataset. A variety of methods, like minimum error rate training (Och, 2003; He et al., 2016), are possible, but keeping with the globally-normalized negative log-likelihood, we obtain, for the constant word reward, the gradient:

∂L/∂γ = −|e*| + E[|e|]

If we approximate the expectation using the mode of the distribution, we get

∂L/∂γ ≈ −|e*| + |ê|

where ê is the 1-best translation. Then the stochastic gradient descent update is just the familiar perceptron rule:

γ ← γ + η (|e*| − |ê|)

→ (tunable) length penalty and variants result in simple globally normalized model
→ other methods to escape local normalization include reconstruction [Tu et al., 2017]
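To make the three corrections above concrete, here is a minimal rescoring sketch (illustrative only; the function and parameter names are assumptions, not code from Murray and Chiang) that applies each adjustment to the base log-probability of a finished hypothesis of length m:

```python
def corrected_score(log_prob, m, method="length_norm", alpha=0.6, gamma=0.2):
    """log_prob: base model score s(e), i.e. the sum of word log-probabilities.
    m: number of output tokens. Returns the adjusted score s'(e)."""
    if method == "length_norm":    # s'(e) = s(e) / m
        return log_prob / m
    if method == "gnmt":           # s'(e) = s(e) / ((5 + m)^alpha / (5 + 1)^alpha)
        return log_prob / (((5 + m) ** alpha) / ((5 + 1) ** alpha))
    if method == "word_reward":    # s'(e) = s(e) + gamma * m
        return log_prob + gamma * m
    return log_prob                # no correction

# toy example: a short vs. a long hypothesis with the same average log-probability
short = corrected_score(-4.0, m=4, method="word_reward")
long_ = corrected_score(-8.0, m=8, method="word_reward")
print(short, long_)  # the word reward shrinks the gap between short and long outputs
```

Under the two-tiered view above, γ itself can be tuned with the perceptron-style update γ ← γ + η(|e*| − |ê|), pushing output lengths toward reference lengths.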

Rico Sennrich Revisiting Challenges in NMT 16 / 49


Long-Distance Interactions

does network architecture affect learning of long-distance dependencies?

architectures

Figure 1: The Transformer - model architecture.

Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

3.2.1 Scaled Dot-Product Attention

We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries and keys of dimension dk, and values of dimension dv. We compute the dot products of the query with all keys, divide each by √dk, and apply a softmax function to obtain the weights on the values.

RNN/GRU/LSTM | convolution [Gehring et al., 2017] | self-attention [Vaswani et al., 2017]
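As a minimal sketch of the scaled dot-product attention just described (illustrative numpy code, not taken from the paper or from any of the toolkits mentioned on these slides):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).
    Returns (n_q, d_v): for each query, a weighted sum of the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # query-key compatibilities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V

# toy example: 2 queries attending over 3 key-value pairs
Q = np.random.randn(2, 4)
K = np.random.randn(3, 4)
V = np.random.randn(3, 8)
print(scaled_dot_product_attention(Q, K, V).shape)     # (2, 8)
```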

Rico Sennrich Revisiting Challenges in NMT 17 / 49


Long-Distance Interactions: Targeted Evaluation

evaluation with contrastive pairs: LingEval97 [Sennrich, EACL 2017]

                       sentence                                        prob.
English                [...] that the plan will be approved
German (correct)       [...], dass der Plan verabschiedet wird         0.1  ✓
German (contrastive)   * [...], dass der Plan verabschiedet werden     0.01

subject-verb agreement
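Scoring such contrastive pairs only requires the model's probability for each full target sentence; a minimal sketch of the evaluation loop (the score_sentence function is a stand-in for whatever toolkit computes per-sentence log-probabilities, not an actual Nematus or LingEval97 API):

```python
def contrastive_accuracy(pairs, score_sentence):
    """pairs: list of (source, correct_target, contrastive_target).
    score_sentence(source, target) -> log-probability of target given source.
    A pair counts as correct if the model prefers the correct target."""
    correct = 0
    for src, good, bad in pairs:
        if score_sentence(src, good) > score_sentence(src, bad):
            correct += 1
    return correct / len(pairs)
```

Because only existing sentences are scored (no decoding is involved), this isolates the modelling of a single phenomenon, here subject-verb agreement, from search and evaluation-metric effects.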

Rico Sennrich Revisiting Challenges in NMT 18 / 49


Long-Distance Interactions: RNN vs. GRU vs. LSTM

EN→DE WMT systems trained with Nematus

targeted evaluation of subject-verb agreement with Lingeval97

[Figure: subject-verb agreement accuracy by distance between subject and verb (0 to ≥ 16), for LSTM, GRU and plain RNN models; accuracy axis from 0.5 to 1.]

GRU/LSTM much more stable than RNN for long distances

Rico Sennrich Revisiting Challenges in NMT 19 / 49



Evaluating NMT Architectures [Tang, Müller, Rios, Sennrich, EMNLP 2018]

EN→DE WMT systems trained with Sockeye
targeted evaluation of subject-verb agreement with Lingeval97

[Figure: subject-verb agreement accuracy by distance (1 to >15) for RNNS2S, RNN-bideep, ConvS2S and Transformer; accuracy axis from 0.5 to 1.]

no evidence that Transformer or ConvS2S outperform LSTM for long-distance interactions

Rico Sennrich Revisiting Challenges in NMT 20 / 49



Long Sentences: Conclusions

strongest evidence for weakness of NMT on long sentences comes from old systems

discarding long sentences no longer necessary in NMT training
BLEU does not tell us why a system performs poorly on long sentences

are translations too short?
→ train on long sentences; use global scores
is grammaticality poor?
→ architectures matter, but long-distance interactions modelled well by GRU/LSTM and Transformer

Rico Sennrich Revisiting Challenges in NMT 21 / 49


Revisiting Challenges in NMT

1 Some Challenges in Neural MT: Long Sentences, Adequacy, Low-Resource Translation

2 Challenges Revisited: Long Sentences, Adequacy, Low-Resource Translation

3 Future Challenges

Rico Sennrich Revisiting Challenges in NMT 22 / 49


Targeted Analysis: Word Sense Disambiguation

system      sentence
source      Dort wurde er von dem Schläger und einer weiteren männl. Person erneut angegriffen.
reference   There he was attacked again by his original attacker and another male.
our NMT     There he was attacked again by the racket and another male person.
Google      There he was again attacked by the bat and another male person.

Schläger → attacker / racket / bat

racket: https://www.flickr.com/photos/128067141@N07/15157111178 / CC BY 2.0
attacker: https://commons.wikimedia.org/wiki/File:Wikibully.jpg
bat1: www.personalcreations.com / CC-BY-2.0
bat2: Hasitha Tudugalle https://commons.wikimedia.org/wiki/File:Flying-Fox-Bat.jpg / CC-BY-4.0

Rico Sennrich Revisiting Challenges in NMT 23 / 49



Word Sense Disambiguation [Rios, Mascarell, Sennrich, WMT 2017]

test set (ContraWSD)
35 ambiguous German nouns

2–4 senses per source noun

contrastive translation sets (1 or more contrastive translations)

≈ 100 test instances per sense → ≈ 7000 test instances

source: Also nahm ich meinen amerikanischen Reisepass und stellte mich in die Schlange für Extranjeros.

reference: So I took my U.S. passport and got in the line for Extranjeros.

contrastive: So I took my U.S. passport and got in the snake for Extranjeros.
contrastive: So I took my U.S. passport and got in the serpent for Extranjeros.

Rico Sennrich Revisiting Challenges in NMT 24 / 49


Word Sense Accuracy

[Figure: word sense accuracy as a function of sense frequency in the training data; left panel by absolute frequency (1 to 10^4), right panel by relative frequency (10^-3 to 1); accuracy axis from 0 to 1.]

WSD is challenging, especially for rare word senses

Rico Sennrich Revisiting Challenges in NMT 25 / 49



Word Sense Disambiguation: Measuring Progress [Rios, Müller, Sennrich, WMT 2018]

based on ContraWSD, but semi-automatic evaluation of 1-best output

evaluating all WMT 2018 submissions, plus systems from previous years
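One possible reading of "semi-automatic evaluation of 1-best output" is that each system's actual translation is checked against lists of acceptable and wrong-sense target words, and only unmatched cases go to a human annotator. A rough sketch of that idea (an assumption about the procedure, not the exact method of Rios et al.; word lists and names are illustrative):

```python
def judge_wsd(translation, correct_words, wrong_sense_words):
    """Classify one 1-best translation of a sentence containing an ambiguous noun.
    correct_words / wrong_sense_words: known target-side translations of the
    intended sense and of the wrong senses, respectively."""
    tokens = translation.lower().split()
    if any(w in tokens for w in correct_words):
        return "correct"
    if any(w in tokens for w in wrong_sense_words):
        return "wrong sense"
    return "needs manual check"   # the semi-automatic part: a human decides

print(judge_wsd("So I took my U.S. passport and got in the line for Extranjeros.",
                ["line", "queue"], ["snake", "serpent"]))  # correct
```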

Rico Sennrich Revisiting Challenges in NMT 26 / 49


Results: Word Sense Disambiguation (uedin systems)

[Figure: ContraWSD accuracy (%) of uedin systems: syntax-2016 81.3, nmt-2016 81.1, nmt-2017 86.3, nmt-2018 88.8.]

improvements to NMT system
2016: shallow RNN

2017: deep RNN; layer normalization; better ensembles; slightly more training data

2018: Transformer; more (noisy) training data

Rico Sennrich Revisiting Challenges in NMT 27 / 49


Results: Word Sense Disambiguation (selected systems)

[Figure: ContraWSD accuracy (%) of selected systems: RWTH-unsup 47.2, online-F 51.4, uedin-2016 81.1, uedin-2018 88.8, RWTH 93.6.]

WSD is a big challenge for unsupervised NMT and rule-based systems

all neural systems at WMT18 > 81%

big reduction in WSD errors in last 2 years

Rico Sennrich Revisiting Challenges in NMT 28 / 49


Evaluating NMT Architectures [Tang, Müller, Rios, Sennrich, EMNLP 2018]

comparing different architectures on same dataset

Transformer no better than RNN at long-distance agreement

interesting differences for word sense disambiguation:

BLEU (newstest2017):      RNN 30.1   CNN 30.4   Transformer 33.7
ContraWSD accuracy (%):   RNN 84.0   CNN 82.3   Transformer 90.3

Rico Sennrich Revisiting Challenges in NMT 29 / 49


Evaluating NMT Architectures [Tang, Müller, Rios, Sennrich, EMNLP 2018]

post-publication experiments:
models can be made more similar to Transformer with:

multihead attention
feedforward block
layer normalization
...

Transformer still ahead in WSD accuracy

BLEU (newstest2017):      RNN 32.2   CNN 32.2   Transformer 33.3
ContraWSD accuracy (%):   RNN 88.1   CNN 87.2   Transformer 89.1

Rico Sennrich Revisiting Challenges in NMT 30 / 49


Revisiting Challenges in NMT

1 Some Challenges in Neural MT: Long Sentences, Adequacy, Low-Resource Translation

2 Challenges Revisited: Long Sentences, Adequacy, Low-Resource Translation

3 Future Challenges

Rico Sennrich Revisiting Challenges in NMT 31 / 49


Low-Resource Translation

[Figure 3 from Koehn and Knowles (2017), repeated from slide 10: BLEU scores with varying amounts of training data for English-Spanish, phrase-based (with and without big LM) vs. neural.]

[Koehn and Knowles, 2017]

Rico Sennrich Revisiting Challenges in NMT 32 / 49


Low-Resource Translation

aspects worth revisiting:

phrase-based benefit from large LM
but NMT can also improve with monolingual data

there were general improvements in NMT
do they move the point where NMT outperforms SMT?

the NMT system was not optimized for low-resource NMT
does tuning the model to low-resource NMT help?

Rico Sennrich Revisiting Challenges in NMT 33 / 49


Low-Resource Translation: Monolingual Data

There is a large pool of methods to exploit monolingual data for NMT:

ensembling with LM [Gülçehre et al., 2015]

training objective: language modelling [Sennrich et al., 2016, Ramachandran et al., 2016]

training objective: autoencoders [Luong et al., 2016, Currey et al., 2017]

training objective: round-trip translation [Sennrich et al., 2016, He et al., 2016, Cheng et al., 2016]

unsupervised NMT [Artetxe et al., 2017, Lample et al., 2017]

similarly, parallel data from other language pairs can help [Zoph et al., 2016, Chen et al., 2017, Nguyen and Chiang, 2017]

...but how far can we get with just parallel data?

Rico Sennrich Revisiting Challenges in NMT 34 / 49



Low-Resource Translation: Experiments

setup
IWSLT14 German→English:

full set: 160 000 sentences (3.2M words)
smallest subset: 5000 sentences (100 000 words)

phrase-based SMT with Moses

neural MT with Nematus and BPE

baseline: hyperparameters similar to uedin@WMT16
shallow RNN, no dropout

comparison to [Koehn and Knowles, 2017]:
0.4 million to 385 million words of data (EN→ES WMT)

our experiments:
0.1 million to 3.2 million words of data (DE→EN IWSLT)

Rico Sennrich Revisiting Challenges in NMT 35 / 49


Low-Resource Translation: Experiments

general architecture improvements:
bideep RNN [Miceli Barone et al., 2017]

layer normalization [Ba et al., 2016]

label smoothing [Szegedy et al., 2016] (see the sketch after this list)

choices optimized for low-resource scenario:
dropout [Srivastava et al., 2014]

tied embeddings [Press and Wolf, 2017]

smaller BPE vocabulary size for smaller data sets
smaller batch size for smaller data sets
lexical model [Nguyen and Chiang, 2018]
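Of the regularizers listed above, label smoothing is easy to state precisely; a minimal sketch of a label-smoothed cross-entropy at one target position (illustrative, with ε as the only knob; not tied to Nematus or any other toolkit mentioned here):

```python
import numpy as np

def label_smoothed_nll(log_probs, target, eps=0.1):
    """log_probs: (vocab_size,) log-probabilities predicted for one target position.
    target: index of the reference token. With label smoothing, the target
    distribution puts 1 - eps on the reference token and eps spread uniformly
    over the vocabulary, which discourages over-confident predictions."""
    vocab_size = log_probs.shape[0]
    smoothed = np.full(vocab_size, eps / vocab_size)
    smoothed[target] += 1.0 - eps
    return -np.sum(smoothed * log_probs)

# toy example over a 5-word vocabulary
logits = np.array([2.0, 0.5, 0.1, -1.0, -2.0])
log_probs = logits - np.log(np.sum(np.exp(logits)))
print(label_smoothed_nll(log_probs, target=0))
```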

Rico Sennrich Revisiting Challenges in NMT 36 / 49


Low-Resource Translation: Results

[Koehn and Knowles, 2017] our experiments

[Left: Figure 3 from Koehn and Knowles (2017), as on slide 10: BLEU scores with varying amounts of training data for English-Spanish.]

[Right: our experiments (DE→EN IWSLT), BLEU by corpus size (English words), from largest to smallest subset: phrase-based SMT 24.3, 23.1, 20.6, 18.7; neural MT baseline 23, 16, 10.6, 1.2.]

Rico Sennrich Revisiting Challenges in NMT 37 / 49


Low-Resource Translation: Results

[Koehn and Knowles, 2017] our experiments

106 107 1080

10

20

30

21.823.4

24.926.2 26.9

27.9 28.6 29.2 29.630.1 30.4

16.418.1

19.621.2

22.223.5

24.726.1 26.9

27.8 28.6

1.6

7.2

11.914.7

18.2

22.4

25.727.4 29.2

30.3 31.1

Corpus Size (English Words)

BLEU Scores with Varying Amounts of Training Data

Phrase-Based with Big LMPhrase-Based

Neural

Figure 3: BLEU scores for English-Spanish sys-tems trained on 0.4 million to 385.7 millionwords of parallel data. Quality for NMT startsmuch lower, outperforms SMT at about 15 mil-lion words, and even beats a SMT system with abig 2 billion word in-domain language model un-der high-resource conditions.

How do the data needs of SMT and NMT com-pare? NMT promises both to generalize better (ex-ploiting word similary in embeddings) and condi-tion on larger context (entire input and all prioroutput words).

We built English-Spanish systems on WMTdata,7 about 385.7 million English words pairedwith Spanish. To obtain a learning curve, we used

11024 , 1

512 , ..., 12 , and all of the data. For SMT, the

language model was trained on the Spanish part ofeach subset, respectively. In addition to a NMTand SMT system trained on each subset, we alsoused all additionally provided monolingual datafor a big language model in contrastive SMT sys-tems.

Results are shown in Figure 3. NMT ex-hibits a much steeper learning curve, starting withabysmal results (BLEU score of 1.6 vs. 16.4 for

11024 of the data), outperforming SMT 25.7 vs.24.7 with 1

16 of the data (24.1 million words), andeven beating the SMT system with a big languagemodel with the full data set (31.1 for NMT, 28.4for SMT, 30.4 for SMT+BigLM).

7Spanish was last represented in 2013, we used data fromhttp://statmt.org/wmt13/translation-task.html

Src: A Republican strategy to counter the re-electionof Obama

11024

Un organo de coordinacion para el anuncio delibre determinacion

1512

Lista de una estrategia para luchar contra laeleccion de hojas de Ohio

1256

Explosion realiza una estrategia divisiva deluchar contra las elecciones de autor

1128

Una estrategia republicana para la eliminacionde la reeleccion de Obama

164

Estrategia siria para contrarrestar la reelecciondel Obama .

132

+ Una estrategia republicana para contrarrestar lareeleccion de Obama

Figure 4: Translations of the first sentence ofthe test set using NMT system trained on varyingamounts of training data. Under low resource con-ditions, NMT produces fluent output unrelated tothe input.

The contrast between the NMT and SMT learn-ing curves is quite striking. While NMT is able toexploit increasing amounts of training data moreeffectively, it is unable to get off the ground withtraining corpus sizes of a few million words orless.

To illustrate this, see Figure 4. With 11024 of the

training data, the output is completely unrelated tothe input, some key words are properly translatedwith 1

512 and 1256 of the data (estrategia for strat-

egy, eleccion or elecciones for election), and start-ing with 1

64 the translations become respectable.

3.3 Rare Words

Conventional wisdom states that neural machinetranslation models perform particularly poorly onrare words, (Luong et al., 2015; Sennrich et al.,2016b; Arthur et al., 2016) due in part to thesmaller vocabularies used by NMT systems. Weexamine this claim by comparing performance onrare word translation between NMT and SMTsystems of similar quality for German–Englishand find that NMT systems actually outperformSMT systems on translation of very infrequentwords. However, both NMT and SMT systemsdo continue to have difficulty translating someinfrequent words, particularly those belonging tohighly-inflected categories.

For the neural machine translation model, we use a publicly available model8 with the training settings of Edinburgh’s WMT submission (Sennrich et al., 2016a). This was trained using Ne-

8 https://github.com/rsennrich/wmt16-scripts/

[Chart: BLEU on the y-axis against corpus size in English words (10^6 to 10^8) on the x-axis, comparing neural MT optimized, phrase-based SMT, and neural MT baseline.]

Rico Sennrich Revisiting Challenges in NMT 37 / 49

Page 48: Revisiting Challenges in Neural Machine Translation

Low-Resource Translation: Results

[Chart: BLEU on the y-axis against corpus size in English words (10^5 to 10^8) on the x-axis, comparing neural MT optimized, phrase-based SMT, and neural MT baseline, now including additional, smaller corpus sizes.]

Rico Sennrich Revisiting Challenges in NMT 38 / 49

Page 49: Revisiting Challenges in Neural Machine Translation

Low-Resource Translation: Results

BLEU by corpus size (words)

system                                                         100k     3.2M
phrase-based SMT                                               14.1     24.3
NMT baseline                                                    0.0     23.0
 + dropout, tied embeddings, layer normalization,
   bideep RNN, label smoothing                                  6.0     30.0
 + reduce BPE vocabulary (14k → 2k symbols)                     9.3        -
 + reduce batch size (4k → 1k tokens)                           9.7     29.9
 + lexical model                                               11.0     29.5
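Read as a recipe, the table amounts to a set of hyperparameter choices for low-resource training. A toolkit-agnostic sketch of such a configuration is given below; the keys and the unspecified dropout/label-smoothing values are placeholders, not actual flags of Nematus, Marian, or any other toolkit:

```python
# Illustrative low-resource NMT configuration implied by the table above;
# names and unspecified values are placeholders, not real toolkit options.
low_resource_config = {
    "architecture": "bideep-rnn",      # deep bidirectional recurrent encoder-decoder
    "tied_embeddings": True,           # share target embeddings with the output layer
    "layer_normalization": True,
    "dropout": 0.3,                    # rate not given on the slide; placeholder
    "label_smoothing": 0.1,            # placeholder
    "bpe_vocab_size": 2000,            # reduced from 14k symbols
    "batch_size_tokens": 1000,         # reduced from 4k tokens
    "lexical_model": True,             # lexical translation model of Nguyen and Chiang (2018)
}
```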

Rico Sennrich Revisiting Challenges in NMT 39 / 49

Page 50: Revisiting Challenges in Neural Machine Translation

Low-Resource Translation: Conclusions

the balance between PBSMT and NMT for low-resource settings is shifting with

general improvements in NMT
careful choice of hyperparameters and architectures for the low-resource setting

it is no longer true that we cannot train NMT on less than 1M words...

...but low-resource machine translation remains a challenge

Rico Sennrich Revisiting Challenges in NMT 40 / 49

Page 51: Revisiting Challenges in Neural Machine Translation

Revisiting Challenges in NMT

1 Some Challenges in Neural MT
  Long Sentences
  Adequacy
  Low-Resource Translation

2 Challenges Revisited
  Long Sentences
  Adequacy
  Low-Resource Translation

3 Future Challenges

Rico Sennrich Revisiting Challenges in NMT 41 / 49

Page 52: Revisiting Challenges in Neural Machine Translation

Are There Any Challenges Left?

Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi
yonghui,schuster,zhifengc,qvl,[email protected]

Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, Jeffrey Dean

Abstract

Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference – sometimes prohibitively so in the case of very large data sets and large models. Several authors have also charged that NMT systems lack robustness, particularly when input sentences contain rare words. These issues have hindered NMT’s use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google’s Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using residual connections as well as attention connections from the decoder network to the encoder. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units (“wordpieces”) for both input and output. This method provides a good balance between the flexibility of “character”-delimited models and the efficiency of “word”-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. To directly optimize the translation BLEU scores, we consider refining the models by using reinforcement learning, but we found that the improvement in the BLEU scores did not reflect in the human evaluation. On the WMT’14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google’s phrase-based production system.
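The abstract mentions a length-normalization procedure and a coverage penalty in beam search. As a rough illustration of how such a hypothesis-rescoring function can look (a simplified sketch following the general form described in the GNMT paper; the function, parameter values, and attention format are illustrative assumptions):

```python
import math

def rescore(log_prob, attention, alpha=0.6, beta=0.2):
    """Length-normalized hypothesis score with a coverage penalty.

    log_prob:  total log-probability of a finished hypothesis
    attention: per-target-position attention distributions over source positions
    alpha, beta: strengths of length normalization and coverage penalty (illustrative values)
    """
    length_penalty = ((5.0 + len(attention)) / 6.0) ** alpha
    source_len = len(attention[0])
    # total attention mass each source position received across the output
    coverage = [sum(step[j] for step in attention) for j in range(source_len)]
    coverage_penalty = beta * sum(math.log(min(max(c, 1e-9), 1.0)) for c in coverage)
    return log_prob / length_penalty + coverage_penalty
```

Hypotheses with a higher rescored value are preferred at the end of beam search; the coverage term penalizes outputs that leave source words largely unattended.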

1 Introduction

Neural Machine Translation (NMT) [41, 2] has recently been introduced as a promising approach with the potential of addressing many shortcomings of traditional machine translation systems. The strength of NMT lies in its ability to learn directly, in an end-to-end fashion, the mapping from input text to associated output text. Its architecture typically consists of two recurrent neural networks (RNNs), one to consume the input text sequence and one to generate translated output text. NMT is often accompanied by an attention mechanism [2] which helps it cope effectively with long input sequences.

An advantage of Neural Machine Translation is that it sidesteps many brittle design choices in traditional phrase-based machine translation [26]. In practice, however, NMT systems used to be worse in accuracy than phrase-based translation systems, especially when training on very large-scale datasets as used for the very best publicly available translation systems. Three inherent weaknesses of Neural Machine Translation are [...]

arXiv:1609.08144v2 [cs.CL] 8 Oct 2016

...extraordinary claims require extraordinary evidence

Rico Sennrich Revisiting Challenges in NMT 42 / 49


Page 54: Revisiting Challenges in Neural Machine Translation

Achieving Human Parity

laudable...
follows best practices with WMT-style evaluation

data released for scientific scrutiny (outputs, references, rankings)

Rico Sennrich Revisiting Challenges in NMT 43 / 49

Page 55: Revisiting Challenges in Neural Machine Translation

Achieving Human Parity

...but warrants further scrutiny:
failure to reject the null hypothesis is not evidence of parity

alternative hypothesis: human raters prefer human translations at the document level
rationale:

context helps raters understand the text and spot semantic errors
discourse errors are invisible in sentence-level evaluation

Rico Sennrich Revisiting Challenges in NMT 43 / 49

Page 56: Revisiting Challenges in Neural Machine Translation

A Case for Document-level Evaluation [Läubli, Sennrich, Volk, EMNLP 2018]

can we reproduce Microsoft’s finding with a different evaluation protocol?

                       original evaluation     our evaluation
test set               WMT17                   WMT17 (native Chinese part)
system                 Microsoft COMBO-6       Microsoft COMBO-6
raters                 crowd-workers           professional translators
experimental unit      sentence                sentence / document
measurement            direct assessment       pairwise ranking
raters see reference   no                      no
raters see source      yes                     yes / no
ratings                ≥ 2,520 per system      ≈ 200 per setting
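With roughly 200 pairwise ratings per setting, one can check whether a preference is statistically significant with a simple two-sided sign test over the non-tied judgements. A minimal sketch follows; the counts in the example call are invented for illustration, not the study’s figures:

```python
from scipy.stats import binomtest  # SciPy >= 1.7

def pairwise_sign_test(wins_human, wins_mt):
    """Two-sided sign test on non-tied pairwise rankings (ties are discarded)."""
    return binomtest(wins_human, wins_human + wins_mt, p=0.5).pvalue

# e.g. 104 ratings preferred HUMAN and 74 preferred MT (22 ties): illustrative numbers only
print(pairwise_sign_test(104, 74))
```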

Rico Sennrich Revisiting Challenges in NMT 44 / 49

Page 57: Revisiting Challenges in Neural Machine Translation

Which Text is Better?

市民在日常出行中,发现爱车被陌生车辆阻碍了,在联系不上陌生车辆司机的情况下,可以使用“微信挪车”功能解决这一困扰。

8月11日起,西安交警微信服务号“西安交警”推出“微信挪车”服务。

这项服务推出后,日常生活中,市民如遇陌生车辆在驾驶人不在现场的情况下阻碍自己车辆行驶时,就可通过使用“微信挪车”功能解决此类问题。[...]

Members of the public who find their cars obstructed by unfamiliar vehicles during their daily journeys can use the "Twitter Move Car" feature to address this distress when the driver of the unfamiliar vehicle cannot be reached.

On August 11, Xi’an traffic police WeChat service number "Xi’an traffic police" launched "WeChat mobile" service.

With the launch of the service, members of the public can tackle such problems in their daily lives by using the "WeChat Move" feature when an unfamiliar vehicle obstructs the movement of their vehicle while the driver is not at the scene. [...]

A citizen whose car is obstructed by vehicle and is unable to contact the owner of the obstructing vehicle can use the "WeChat Move the Car" function to address the issue.

The Xi’an Traffic Police WeChat official account "Xi’an Jiaojing" released the "WeChat Move the Car" service since August 11.

Once the service was released, a fellow citizen whose car was obstructed by another vehicle and where the driver of the vehicle was not present, the citizen could use the "WeChat Move the Car" function to address the issue. [...]

Rico Sennrich Revisiting Challenges in NMT 45 / 49


Page 60: Revisiting Challenges in Neural Machine Translation

Evaluation Results: Bilingual Assessment

[Bar chart: winner in pairwise ranking (in %), sentence-level evaluation: MT 50.0, TIE 9.0, HUMAN 41.0.]

Rico Sennrich Revisiting Challenges in NMT 46 / 49

Page 61: Revisiting Challenges in Neural Machine Translation

Evaluation Results: Bilingual Assessment

[Bar chart: winner in pairwise ranking (in %). Sentence-level: MT 50.0, TIE 9.0, HUMAN 41.0; document-level: MT 37.0, TIE 11.0, HUMAN 52.0.]

Rico Sennrich Revisiting Challenges in NMT 46 / 49

Page 62: Revisiting Challenges in Neural Machine Translation

Evaluation Results: Monolingual Assessment

[Bar chart: winner in pairwise ranking (in %), sentence-level evaluation: MT 32.0, TIE 17.0, HUMAN 51.0.]

Rico Sennrich Revisiting Challenges in NMT 47 / 49

Page 63: Revisiting Challenges in Neural Machine Translation

Evaluation Results: Monolingual Assessment

[Bar chart: winner in pairwise ranking (in %). Sentence-level: MT 32.0, TIE 17.0, HUMAN 51.0; document-level: MT 22.0, TIE 29.0, HUMAN 50.0.]

Rico Sennrich Revisiting Challenges in NMT 47 / 49

Page 64: Revisiting Challenges in Neural Machine Translation

A Case for Document-level Evaluation

document-level ratings show significant preference for HUMAN

preference for HUMAN is even stronger in monolingual evaluation

Conclusions
distinguishing MT from human translations becomes harder with increasing quality

document-level evaluation shows some limitations of current NMT systems

Rico Sennrich Revisiting Challenges in NMT 48 / 49

Page 65: Revisiting Challenges in Neural Machine Translation

Conclusions

NMT has made tremendous progress in recent years
→ to make progress, we need to regularly re-evaluate its weaknesses

word sense disambiguation, long sentences, and low-resource settings are still challenging, but no longer embarrassingly bad

plenty of challenges remain, such as document-level translation

Rico Sennrich Revisiting Challenges in NMT 49 / 49

Page 66: Revisiting Challenges in Neural Machine Translation

Thank you for your attention

Resources

WSD Test Suite: https://github.com/a-rios/ContraWSD

Evaluation data on human parity: https://github.com/laeubli/parity

Rico Sennrich Revisiting Challenges in NMT 50 / 49

Page 67: Revisiting Challenges in Neural Machine Translation

Bibliography I

Artetxe, M., Labaka, G., Agirre, E., and Cho, K. (2017). Unsupervised Neural Machine Translation. CoRR, abs/1710.11041.

Ba, L. J., Kiros, R., and Hinton, G. E. (2016). Layer Normalization. CoRR, abs/1607.06450.

Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Castilho, S., Moorkens, J., Gaspari, F., Sennrich, R., Way, A., and Georgakopoulou, P. (2018). Evaluating MT for massive open online courses. Machine Translation.

Chen, Y., Liu, Y., Cheng, Y., and Li, V. O. (2017). A Teacher-Student Framework for Zero-Resource Neural Machine Translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1925–1935, Vancouver, Canada. Association for Computational Linguistics.

Cheng, Y., Xu, W., He, Z., He, W., Wu, H., Sun, M., and Liu, Y. (2016). Semi-Supervised Learning for Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1965–1974, Berlin, Germany.

Rico Sennrich Revisiting Challenges in NMT 51 / 49

Page 68: Revisiting Challenges in Neural Machine Translation

Bibliography II

Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. (2014). On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar. Association for Computational Linguistics.

Currey, A., Miceli Barone, A. V., and Heafield, K. (2017). Copied Monolingual Data Improves Low-Resource Neural Machine Translation. In Proceedings of the Second Conference on Machine Translation, pages 148–156, Copenhagen, Denmark. Association for Computational Linguistics.

Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. (2017). Convolutional Sequence to Sequence Learning. CoRR, abs/1705.03122.

Gülçehre, Ç., Firat, O., Xu, K., Cho, K., Barrault, L., Lin, H., Bougares, F., Schwenk, H., and Bengio, Y. (2015). On Using Monolingual Corpora in Neural Machine Translation. CoRR, abs/1503.03535.

He, D., Xia, Y., Qin, T., Wang, L., Yu, N., Liu, T., and Ma, W.-Y. (2016). Dual Learning for Machine Translation. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing Systems 29, pages 820–828. Curran Associates, Inc.

Johansen, A. R., Hansen, J. M., Obeid, E. K., Sønderby, C. K., and Winther, O. (2016). Neural Machine Translation with Characters and Hierarchical Encoding. CoRR, abs/1610.06550.

Rico Sennrich Revisiting Challenges in NMT 52 / 49

Page 69: Revisiting Challenges in Neural Machine Translation

Bibliography III

Koehn, P. and Knowles, R. (2017). Six Challenges for Neural Machine Translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39, Vancouver. Association for Computational Linguistics.

Lample, G., Denoyer, L., and Ranzato, M. (2017). Unsupervised Machine Translation Using Monolingual Corpora Only. CoRR, abs/1711.00043.

Läubli, S., Sennrich, R., and Volk, M. (2018). Has Neural Machine Translation Achieved Human Parity? A Case for Document-level Evaluation. In EMNLP 2018, Brussels, Belgium. Association for Computational Linguistics.

Luong, M., Le, Q. V., Sutskever, I., Vinyals, O., and Kaiser, L. (2016). Multi-task Sequence to Sequence Learning. In ICLR 2016.

Miceli Barone, A. V., Helcl, J., Sennrich, R., Haddow, B., and Birch, A. (2017). Deep Architectures for Neural Machine Translation. In Proceedings of the Second Conference on Machine Translation, Volume 1: Research Papers, Copenhagen, Denmark. Association for Computational Linguistics.

Murray, K. and Chiang, D. (2018). Correcting Length Bias in Neural Machine Translation. In Proceedings of the Third Conference on Machine Translation, pages 212–223, Brussels, Belgium. Association for Computational Linguistics.

Rico Sennrich Revisiting Challenges in NMT 53 / 49

Page 70: Revisiting Challenges in Neural Machine Translation

Bibliography IV

Nguyen, T. and Chiang, D. (2018). Improving Lexical Choice in Neural Machine Translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 334–343, New Orleans, Louisiana. Association for Computational Linguistics.

Nguyen, T. Q. and Chiang, D. (2017). Transfer Learning across Low-Resource, Related Languages for Neural Machine Translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 296–301, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Press, O. and Wolf, L. (2017). Using the Output Embedding to Improve Language Models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Valencia, Spain.

Ramachandran, P., Liu, P. J., and Le, Q. V. (2016). Unsupervised pretraining for sequence to sequence learning. arXiv preprint arXiv:1611.02683.

Rios, A., Mascarell, L., and Sennrich, R. (2017). Improving Word Sense Disambiguation in Neural Machine Translation with Sense Embeddings. In Proceedings of the Second Conference on Machine Translation, Volume 1: Research Papers, Copenhagen, Denmark.

Rios, A., Müller, M., and Sennrich, R. (2018). The Word Sense Disambiguation Test Suite at WMT18. In Proceedings of the Third Conference on Machine Translation, pages 594–602, Brussels, Belgium. Association for Computational Linguistics.

Rico Sennrich Revisiting Challenges in NMT 54 / 49

Page 71: Revisiting Challenges in Neural Machine Translation

Bibliography V

Sennrich, R. (2017). How Grammatical is Character-level Neural Machine Translation? Assessing MT Quality with Contrastive Translation Pairs. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Valencia, Spain.

Sennrich, R., Haddow, B., and Birch, A. (2016). Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15:1929–1958.

Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, pages 3104–3112, Montreal, Quebec, Canada.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016). Rethinking the Inception Architecture for Computer Vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826.

Tang, G., Müller, M., Rios, A., and Sennrich, R. (2018). Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures. In EMNLP 2018, Brussels, Belgium. Association for Computational Linguistics.

Rico Sennrich Revisiting Challenges in NMT 55 / 49

Page 72: Revisiting Challenges in Neural Machine Translation

Bibliography VI

Tu, Z., Liu, Y., Shang, L., Liu, X., and Li, H. (2017). Neural Machine Translation with Reconstruction. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4–9, 2017, San Francisco, California, USA, pages 3097–3103.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention Is All You Need. CoRR, abs/1706.03762.

Zoph, B., Yuret, D., May, J., and Knight, K. (2016). Transfer Learning for Low-Resource Neural Machine Translation. CoRR, abs/1604.02201.

Rico Sennrich Revisiting Challenges in NMT 56 / 49