Neural Machine Translation with Error Correction

Kaitao Song1*, Xu Tan2 and Jianfeng Lu1

1Nanjing University of Science and Technology, 2Microsoft Research Asia

{kt.song, lujf}@njust.edu.cn, xuta@microsoft.com

Abstract

Neural machine translation (NMT) generates the next target token given as input the previous ground-truth target tokens during training, but the previously generated target tokens during inference, which causes a discrepancy between training and inference as well as error propagation, and affects the translation accuracy. In this paper, we introduce an error correction mechanism into NMT, which corrects the error information in the previously generated tokens to better predict the next token. Specifically, we introduce two-stream self-attention from XLNet into the NMT decoder, where the query stream is used to predict the next token, while the content stream is used to correct the error information from the previously predicted tokens. We leverage scheduled sampling to simulate the prediction errors during training. Experiments on three IWSLT translation datasets and two WMT translation datasets demonstrate that our method achieves improvements over the Transformer baseline and scheduled sampling. Further experimental analyses also verify the effectiveness of our proposed error correction mechanism in improving translation quality.

1 Introduction

Neural machine translation (NMT) [Bahdanau et al., 2014; Sutskever et al., 2014; Vaswani et al., 2017; Song et al., 2018; Hassan et al., 2018] has witnessed great progress due to the development of deep learning. Popular NMT models adopt an encoder-attention-decoder framework, where the decoder generates the target tokens based on the previous tokens in an autoregressive manner. Despite their popularity, NMT models suffer from the discrepancy between training and inference and the consequent error propagation [Bengio et al., 2015; Ranzato et al., 2016; Wu et al., 2019]. During inference, the decoder predicts the next token given the previously generated tokens as input, which is discrepant from training, where the previous ground-truth tokens are used as input for next-token prediction. Consequently, the previously predicted tokens may contain errors, which cause error propagation and affect the prediction of the following tokens.

*Corresponding Author: Jianfeng Lu.

Previous works have tried different methods to solve the above issues. Some of them focus on simulating, during training, the data that occurs at inference, such as data as demonstration [Venkatraman et al., 2015], scheduled sampling [Bengio et al., 2015], sentence-level scheduled sampling [Zhang et al., 2019], or even predicting tokens in different directions [Wu et al., 2018; Tan et al., 2019]. While effective at exposing the model to inference-time prediction errors during training, these methods still use the possibly erroneous predicted tokens as the conditional information to predict the next token. Forcing the model to predict the correct next token given incorrect previous tokens can be particularly hard and misleading for optimization, and cannot effectively solve the training/inference discrepancy or the resulting error propagation.

In this paper, moving beyond scheduled sampling [Bengio et al., 2015], we propose a novel method that enables the model to correct the previously predicted tokens when predicting the next token. In this way, although the decoder may make prediction errors, the model learns to build correct representations layer by layer from the erroneous tokens as input, which is more precise for next-token prediction than directly relying on previous erroneous tokens as in scheduled sampling.

Specifically, we introduce two-stream self-attention, which was designed for language understanding in XLNet [Yang et al., 2019], into the NMT decoder to correct errors during translation. Two-stream self-attention was originally proposed for permutation language modeling and consists of two self-attention mechanisms: the content stream is exactly the same as the normal self-attention in the Transformer decoder and is used to build the representations of the previous tokens, while the query stream uses the positional embedding as input to decide the position of the next token to be predicted. In our work, we reinvent two-stream self-attention to support simultaneous correction and translation in NMT, where the content stream is used to correct the previously predicted tokens (correction), and the query stream is used to simultaneously predict the next token in the normal left-to-right order based on the corrected context (translation).

We conduct experiments on the IWSLT 2014 German-English, Spanish-English and Hebrew-English and the WMT 2014 English-German and WMT 2016 English-Romanian translation datasets to evaluate the effectiveness of our proposed error correction mechanism for NMT. Experimental results demonstrate that our method achieves improvements over the Transformer baseline on all tasks. Further experimental analyses also verify the effectiveness of error correction in improving translation accuracy.

[Figure 1: The discrepancy between training and inference in autoregressive sequence generation. (a) Training: the decoder takes the ground-truth tokens y_{t−1}, y_t, y_{t+1} as input. (b) Inference: the decoder takes its own previous predictions y'_{t−1}, y'_t, y'_{t+1} as input.]

Our contributions can be summarized as follows:

• To the best of our knowledge, we are the first to introduce an error correction mechanism into the translation process of NMT, with the help of the newly proposed two-stream self-attention.

• Experimental results on a variety of NMT datasets and further experimental analyses demonstrate that our method achieves improvements over the Transformer baseline and scheduled sampling, verifying the effectiveness of our correction mechanism.

2 Background

In this section, we introduce the background of our work, including the standard encoder-decoder framework, exposure bias and error propagation, and the two-stream self-attention mechanism.

Encoder-decoder framework. Given a sentence pair (x, y) ∈ (X, Y), the objective of an NMT model is to maximize the log-likelihood P(y | x; θ), where θ denotes the parameters of the NMT model. The objective is equivalent to a chain of conditional probabilities: P(y | x; θ) = ∏_{t=1}^{n} P(y_t | y_{<t}, x; θ), where n is the number of tokens in the target sequence y and y_{<t} denotes the target tokens before position t. The encoder-decoder structure [Sutskever et al., 2014; Cho et al., 2014] is the most common framework for the NMT task. It adopts an encoder to transform the source sentence x into contextual representations h, and a decoder to predict the next token y_t autoregressively based on the previous target tokens y_{<t} and h. Specifically, for the t-th prediction, the decoder feeds the last token y_{t−1} as input to predict the target token y_t. Besides, an encoder-decoder attention mechanism [Bahdanau et al., 2014] is used to bridge the connection between the source and target sentences.
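As a small illustration of this factorization, the sketch below scores a target sequence under a toy autoregressive model by summing per-token log-probabilities; toy_step is a hypothetical stand-in for a real decoder step and is not part of the paper.

```python
import math

def toy_step(prefix, x, vocab):
    """Hypothetical decoder step: returns a distribution P(y_t | y_<t, x)
    over the vocabulary (uniform here, just for illustration)."""
    return {w: 1.0 / len(vocab) for w in vocab}

def sentence_log_prob(y, x, vocab):
    """log P(y | x) = sum_t log P(y_t | y_<t, x): the chain-rule
    factorization used by autoregressive NMT decoders."""
    total = 0.0
    for t, token in enumerate(y):
        dist = toy_step(y[:t], x, vocab)  # condition on the prefix y_<t
        total += math.log(dist[token])
    return total

vocab = ["a", "small", "house", "<eos>"]
print(sentence_log_prob(["a", "small", "house", "<eos>"],
                        x=["ein", "kleines", "haus"], vocab=vocab))
```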

Exposure bias and error propagation. Exposure bias [Ranzato et al., 2016] is a troublesome problem in language generation. During the training stage, the model always takes ground-truth tokens as input. At test time, however, the decoder depends on its own previous predictions to infer the next token. Figure 1 illustrates this discrepancy between training and inference in autoregressive sequence generation. Once an incorrect token is predicted, the error accumulates along the inference process. To alleviate this problem, a common solution is to replace some ground-truth tokens with predicted tokens at training time, which is called scheduled sampling. However, scheduled sampling still cannot handle exposure bias perfectly, since it only attempts to predict the next ground-truth token given the incorrectly predicted tokens, but cannot reduce the negative effect of those incorrect tokens. Therefore, we propose a novel correction mechanism during translation to alleviate error propagation.
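To make the discrepancy in Figure 1 concrete, the sketch below contrasts the decoder inputs used at training time (the ground-truth prefix, i.e. teacher forcing) with those used at inference time (the model's own previous outputs); predict_next is a hypothetical one-step decoder used only for illustration.

```python
def predict_next(prefix, src):
    """Hypothetical one-step decoder: maps a target prefix and the source
    sentence to the next target token (a crude stand-in for a trained model)."""
    lexicon = {"ein": "a", "kleines": "small", "haus": "house"}
    return lexicon[src[len(prefix)]] if len(prefix) < len(src) else "<eos>"

src = ["ein", "kleines", "haus"]
tgt = ["a", "small", "house", "<eos>"]

# Training (teacher forcing): the input at step t is the ground-truth prefix
# y_<t, regardless of what the model predicted at earlier steps.
for t in range(len(tgt)):
    print("train step", t, "input:", tgt[:t], "-> predict:", predict_next(tgt[:t], src))

# Inference (free running): the input at step t is the model's own previous
# predictions, so an early mistake is fed back into the model and can propagate.
hyp = []
while not hyp or hyp[-1] != "<eos>":
    hyp.append(predict_next(hyp, src))
print("inference output:", hyp)
```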

Two-stream self-attention. XLNet is one of the well-known pre-trained methods [Radford et al., 2019; Devlin et al., 2019; Yang et al., 2019; Song et al., 2019] for natural language processing. It first proposed a two-stream self-attention mechanism, which consists of a content stream and a query stream, for permutation language modeling. For token y_t, the content stream can see the tokens y_{≤t}, while the query stream can only see the tokens y_{<t}. Benefiting from the two-stream self-attention mechanism, the model can predict the next token at any position by using the corresponding position embedding as the query, which enables permutation language modeling. Besides, two-stream self-attention also avoids inserting the [MASK] token into the conditioned context during pre-training, where the [MASK] token would bring a mismatch between pre-training and fine-tuning. In this paper, we leverage the advantages of two-stream self-attention to design a novel error correction mechanism for NMT.
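A minimal sketch of the visibility rule described above, written as boolean attention masks: at position t the content stream may attend to y_{≤t}, while the query stream may attend only to y_{<t}. This only illustrates the masks, not any particular model.

```python
import numpy as np

n = 5  # target length

# mask[i, j] == True means position i is allowed to attend to position j.
content_mask = np.tril(np.ones((n, n), dtype=bool))        # content stream sees y_<=t
query_mask = np.tril(np.ones((n, n), dtype=bool), k=-1)    # query stream sees y_<t only

print(content_mask.astype(int))
print(query_mask.astype(int))
# Row t of query_mask blocks column t, so the query stream can ask "what is
# the token at position t?" without ever observing it -- which is why no
# [MASK] token is needed in the conditioned context.
```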

3 Method

Previous NMT models [Bahdanau et al., 2014; Vaswani et al., 2017] generate the next token y'_t from the predicted distribution, i.e., y'_t ∼ P(y_t | y_{<t}, x; θ). If y'_t is predicted incorrectly and taken as the next decoder input, can we force the model to automatically build correct hidden representations that are close to those of the ground-truth token y_t? In that case, the subsequent token generation can build upon the corrected representations and become more precise. A natural idea is to optimize the model to maximize the correction probability P(y_t | y'_t, y_{<t}, x; θ) simultaneously with the probability of next-token prediction P(y_{t+1} | y'_t, y_{<t}, x; θ). However, the standard NMT decoder does not support this correction mechanism well. Inspired by the two-stream self-attention in XLNet [Yang et al., 2019], we leverage the content stream to maximize P(y_t | y'_t, y_{<t}, x; θ) and the query stream to maximize P(y_{t+1} | y'_t, y_{<t}, x; θ), which meets the requirements of simultaneous correction and translation.

In the following subsections, we first introduce the integration of two-stream self-attention into the NMT model, and then introduce the error correction mechanism for NMT based on two-stream self-attention.


[Figure 2: Illustrations of (a) the standard NMT, (b) NMT with two-stream self-attention, and (c) NMT with our proposed error correction mechanism based on two-stream self-attention. In (b), p_i denotes the position of the i-th token, and the query stream can attend to the content stream. In (c), y'_i denotes a token predicted by the model itself; the potential error tokens y'_2 and y'_4 are corrected into the ground-truth tokens y_2 and y_4.]

3.1 NMT with Two-Stream Self-Attention

Inspired by XLNet [Yang et al., 2019], we incorporate the idea of two-stream self-attention into the decoding of the NMT framework. Specifically, the encoder is the same as in the standard NMT model, while the decoder is equipped with two-stream self-attention, where the positional embedding is taken as input in the query stream for prediction and the content stream is used to build context representations. Different from XLNet, we make two modifications: 1) we remove permutation language modeling, since the decoder in NMT usually only uses left-to-right generation, and 2) we let the decoder predict the whole sequence rather than a partial sentence as in XLNet. Figure 2a and Figure 2b show the differences between the standard NMT and the NMT with two-stream self-attention.

We formulate the NMT with two-stream self-attention as follows. For the decoder, we feed the positions (p_1, ..., p_n) to the query stream to provide the position information for next-token prediction, and the sequence (y_1, ..., y_n) plus its positions (p_1, ..., p_n) to the content stream to build the contextual information. For the l-th layer, we define the hidden states of the query/content streams as q_t^l and c_t^l. The updates for the query and content streams are as follows:

q_t^{l+1} = Attention(Q = q_t^l, KV = {c_{<t}^l, h}; θ_{l+1}),    (1)

c_t^{l+1} = Attention(Q = c_t^l, KV = {c_{≤t}^l, h}; θ_{l+1}),    (2)

where h represents the hidden states from the encoder outputs and θ_{l+1} represents the parameters of the (l+1)-th layer; Q and KV represent the query and the key/value in self-attention. Both the query and content streams share the same model parameters, and the key and value states can be reused between the two streams. Finally, we feed the outputs of the query stream from the last layer into the output layer to calculate the log-probability for the next target-token prediction. During inference, we first predict the next token with the query stream and then update the content stream with the generated token. The order of the query and content streams does not affect the predictions, since the tokens in the query stream only depend on the previously generated tokens of the content stream.
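The sketch below is one possible PyTorch rendering of the per-layer updates in Eqs. (1) and (2), under simplifying assumptions: a single shared attention module for both streams, keys/values formed by concatenating the content states with the encoder outputs h, and no feed-forward sublayer, residual connections or layer normalization. It illustrates the attention pattern only and is not the authors' released implementation.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_tgt, n_src, batch = 512, 4, 6, 7, 2

# One attention module shared by both streams (they share parameters).
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

c = torch.randn(batch, n_tgt, d_model)   # content-stream states c_t^l
q = torch.randn(batch, n_tgt, d_model)   # query-stream states q_t^l (built from positions p_t)
h = torch.randn(batch, n_src, d_model)   # encoder outputs

kv = torch.cat([c, h], dim=1)            # keys/values: {c^l, h}

# Boolean masks over the concatenated memory (True = attention blocked).
blk_q = torch.triu(torch.ones(n_tgt, n_tgt, dtype=torch.bool))     # block c_>=t for the query stream
blk_c = torch.triu(torch.ones(n_tgt, n_tgt, dtype=torch.bool), 1)  # block c_>t for the content stream
src_ok = torch.zeros(n_tgt, n_src, dtype=torch.bool)               # encoder outputs always visible
query_mask = torch.cat([blk_q, src_ok], dim=1)
content_mask = torch.cat([blk_c, src_ok], dim=1)

# Eq. (1): q_t^{l+1} attends to {c_{<t}^l, h}.  Eq. (2): c_t^{l+1} attends to {c_{<=t}^l, h}.
q_next, _ = attn(q, kv, kv, attn_mask=query_mask)
c_next, _ = attn(c, kv, kv, attn_mask=content_mask)
print(q_next.shape, c_next.shape)
```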

3.2 Error Correction based on Two-Stream Self-Attention

Benefiting from two-stream self-attention, we can naturally introduce an error correction mechanism on the content stream. The content stream is originally designed to build the representations of the previous tokens, which are used by the query stream for next-token prediction. In order to correct errors, the content stream also needs to predict the correct tokens given incorrect tokens as input.

In order to simulate the prediction errors in the input of the content stream, we leverage scheduled sampling [Bengio et al., 2015] to randomly sample tokens either from the ground truth y = (y_1, ..., y_n) or from the previously predicted tokens y' = (y'_1, ..., y'_n) with a certain probability, forming the new inputs ỹ = (ỹ_1, ..., ỹ_n), where y'_t is sampled from the probability distribution P(y_t | ỹ_{<t}, x; θ). The input ỹ_t equals y_t with probability p(·), and otherwise equals y'_t. For each token y'_t (y'_t ≠ y_t) predicted by the query stream at step t, we force the content stream to predict its corresponding ground-truth token y_t again. The loss function for the error correction mechanism (ECM) is formulated as

L_ECM(y | ỹ, x; θ) = − ∑_{t=1}^{n} 1(ỹ_t ≠ y_t) · log P(y_t | ỹ_{≤t}, x; θ).    (3)

With this mechanism, the content stream can learn to gradually correct the hidden representations of error tokens toward their correct counterparts, layer by layer.
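A small sketch of how the mixed input sequence ỹ might be formed: each position keeps the ground-truth token with probability p and otherwise takes the model's own prediction y'_t (given here as a fixed hypothetical list). A smaller p than the schedule in Eq. (6), which keeps p at or above β, is used just so that substitutions are visible. The positions with ỹ_t ≠ y_t are exactly those selected by the indicator in Eq. (3).

```python
import random

def mix_inputs(ground_truth, predicted, p):
    """Scheduled sampling: keep y_t with probability p, otherwise use y'_t."""
    return [yt if random.random() < p else ypt
            for yt, ypt in zip(ground_truth, predicted)]

y = ["we", "are", "going", "home", "<eos>"]          # ground truth
y_pred = ["we", "were", "going", "house", "<eos>"]   # hypothetical model predictions
random.seed(0)
y_tilde = mix_inputs(y, y_pred, p=0.5)
print("mixed input:", y_tilde)

# Positions where the mixed input differs from the ground truth are the ones
# the content stream must map back to the correct token (indicator in Eq. (3)).
print("positions to correct:", [t for t, (a, b) in enumerate(zip(y_tilde, y)) if a != b])
```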

The query stream is still used to predict the next token given a random mixture of previously predicted tokens and ground-truth tokens. The negative log-likelihood (NLL) loss is formulated as

L_NLL(y | ỹ, x; θ) = − ∑_{t=1}^{n} log P(y_t | ỹ_{<t}, x; θ).    (4)

Finally, we combine the two loss functions as the final objective function for our method:

min L_NLL(y | ỹ, x; θ) + λ · L_ECM(y | ỹ, x; θ),    (5)

where λ is a hyperparameter that balances the NLL loss and the ECM loss.
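The following sketch shows one way Eqs. (3)-(5) could be computed from the two streams' output logits in PyTorch: a cross-entropy term over all positions for the query stream (the NLL loss), a cross-entropy term restricted to the corrupted positions for the content stream (the ECM loss), and their weighted combination. Tensor shapes, names and the averaging over corrected positions are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

batch, n, vocab = 2, 6, 100
lam = 1.0                                      # λ in Eq. (5)

y = torch.randint(vocab, (batch, n))           # ground-truth tokens y
y_tilde = y.clone()
y_tilde[:, 2] = (y_tilde[:, 2] + 1) % vocab    # pretend position 2 received a wrong sampled token

query_logits = torch.randn(batch, n, vocab)    # query stream: predicts y_t from y~_<t
content_logits = torch.randn(batch, n, vocab)  # content stream: re-predicts y_t from y~_<=t

# Eq. (4): NLL loss over all target positions (query stream).
nll = F.cross_entropy(query_logits.view(-1, vocab), y.view(-1))

# Eq. (3): ECM loss only where the input token was wrong, i.e. 1(y~_t != y_t).
wrong = (y_tilde != y).view(-1).float()
ce = F.cross_entropy(content_logits.view(-1, vocab), y.view(-1), reduction="none")
ecm = (ce * wrong).sum() / wrong.sum().clamp(min=1.0)  # Eq. (3) uses a sum; averaged here for scale

# Eq. (5): the final training objective.
loss = nll + lam * ecm
print(float(nll), float(ecm), float(loss))
```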

Figure 2c demonstrates the workflow of our proposed error correction mechanism. The difference between our error correction mechanism and naive scheduled sampling is that, once an error token is predicted, in scheduled sampling the model still learns to predict the next correct token given the error tokens as context, which could confuse the model and mislead it into learning incorrect prediction patterns. In contrast, with our error correction mechanism, next-token prediction is built upon representations that have been corrected by the content stream, making it easier to learn correct prediction patterns.

In our error correction mechanism, how to control the scheduled sampling probability p(·) and when to start sampling tokens are important factors for training. Previous work [Bengio et al., 2015] indicated that it is unsuitable to sample tokens from scratch during training, since the model is still under-fitting and the sampled tokens would be too erroneous. Inspired by OR-NMT [Zhang et al., 2019], we design a similar exponential decay function for the sampling probability p(·), but with more restrictions. The decay function is set as

p(s) = 1,    if s ≤ α,
p(s) = max(β, μ / (μ + exp((s − α) / μ))),    otherwise,    (6)

where s represents the training step, and α, β and μ are hyperparameters. The hyperparameter α is the step at which the model starts to sample tokens, and the hyperparameter β is the maximum probability for sampling.
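A direct transcription of Eq. (6), evaluated at the hyperparameter values reported later in Section 4.2 (α = 30000, β = 0.85, μ = 5000); it only implements the schedule itself, independent of any model.

```python
import math

def sampling_prob(s, alpha=30000, beta=0.85, mu=5000):
    """Eq. (6): probability of keeping the ground-truth token at training step s."""
    if s <= alpha:
        return 1.0
    return max(beta, mu / (mu + math.exp((s - alpha) / mu)))

for step in [0, 30000, 40000, 60000, 100000]:
    print(step, round(sampling_prob(step), 3))
```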

4 Experimental Setting

In this section, we introduce the experimental settings used to evaluate our proposed method, including the datasets, model configuration, training and evaluation.

4.1 Datasets

We conduct experiments on three IWSLT translation datasets (German, Spanish, Hebrew → English) and two WMT translation datasets (English → German, Romanian) to evaluate our method. In the following sections, we abbreviate English, German, Spanish, Hebrew and Romanian as "En", "De", "Es", "He" and "Ro".

IWSLT datasets. The IWSLT14 De→En dataset contains 160K and 7K sentence pairs in the training and valid sets. We concatenate TED.tst2010, TED.tst2011, TED.tst2012, TED.dev2010 and TED.dev2012 as the test set. The IWSLT14 Es→En and He→En datasets¹ contain 180K and 150K bilingual sentence pairs for training. We choose TED.tst2013 as the valid set and TED.tst2014 as the test set. During data preprocessing, we learn a 10K byte-pair encoding (BPE) [Sennrich et al., 2016] to handle the vocabulary.

WMT datasets. The WMT14 En→De and WMT16 En→Ro translation tasks contain 4.5M and 2.8M bilingual sentence pairs for training. Following previous work [Vaswani et al., 2017], we concatenate newstest2012 and newstest2013 as the valid set and choose newstest2014 as the test set for WMT14 En→De. For WMT16 En→Ro, we choose newsdev2016 as the valid set and newstest2016 as the test set. We learn 32K and 40K BPE codes to tokenize the WMT14 En→De and WMT16 En→Ro datasets, respectively.

4.2 Model Configuration

We choose the state-of-the-art Transformer [Vaswani et al., 2017] as the default model. For the IWSLT tasks, we use 6 Transformer blocks, where the number of attention heads, hidden size and filter size are 4, 512 and 1024; dropout is set to 0.3, and the parameter size is 39M. For the WMT tasks, we use 6 Transformer blocks, where the number of attention heads, hidden size and filter size are 16, 1024 and 4096, and the parameter size is 214M; dropout is set to 0.3 and 0.2 for En→De and En→Ro, respectively. For a fair comparison, we also list some results of the original NMT model without two-stream self-attention. For the decay function of the sampling probability, we set α, β and μ to 30000, 0.85 and 5000. The λ for L_ECM is tuned on the valid set, and the optimal choice is 1.0. To demonstrate the advantages of our method, we also prepare some strong baselines for reference, including Layer-wise Transformer [He et al., 2018], MIXER [Ranzato et al., 2016] on Transformer, and Tied Transformer [Xia et al., 2019].

4.3 Training and Evaluation

During training, we use Adam [Kingma and Ba, 2014] as the default optimizer with a linear decay of the learning rate. The IWSLT tasks are trained on a single NVIDIA P40 GPU for 100K steps, and the WMT tasks are trained on 8 NVIDIA P40 GPUs for 300K steps, where each GPU is filled with 4096 tokens. During inference, we use beam search to decode the results. The beam size and length penalty are set to 5 and 1.0 for each task, except WMT14 En→De, which uses a beam size of 4 and a length penalty of 0.6, following previous work [Vaswani et al., 2017]. All results are reported by multi-bleu². Our code is implemented on fairseq [Ott et al., 2019]³, and we will release it at https://github.com/StillKeepTry/ECM-NMT.

5 Results

In this section, we report our results on the three IWSLT tasks and two WMT tasks. Furthermore, we also study each hyperparameter used in our model and conduct ablation studies to evaluate our method.

¹IWSLT datasets can be downloaded from https://wit3.fbk.eu/archive/2014-01/texts.

²https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl

³https://github.com/pytorch/fairseq


Method                   De→En   Es→En   He→En
Tied Transformer         35.10   40.51   -
Layer-Wise Transformer   35.07   40.50   -
MIXER                    35.30   42.30   -
Transformer Baseline     34.78   41.78   35.32
Our method               35.70   43.05   36.49

Table 1: BLEU scores on the IWSLT14 translation tasks in different settings.

Method                   En→De   En→Ro
Tied Transformer         28.98   34.67
Layer-wise Transformer   29.01   34.43
MIXER                    28.68   34.10
Transformer Baseline     28.40   32.90
Our method               29.20   34.70

Table 2: BLEU scores on WMT14 En→De and WMT16 En→Ro.


5.1 Results on IWSLT14 Translation Tasks

The results on the IWSLT14 tasks are reported in Table 1. From Table 1, we find that our model with the correction mechanism outperforms the baseline by 0.89, 0.99 and 0.83 points, and the original NMT baseline by 0.92, 1.27 and 1.17 points, on De→En, Es→En and He→En, respectively. Note that our baseline is strong and comparable to current advanced systems. Even with such strong baselines, our method still achieves consistent improvements on all three tasks. These improvements also confirm the effectiveness of our method in correcting error information.

5.2 Results on WMT Translation Tasks

To validate the performance of our method on large-scale datasets, we also conduct experiments on WMT14 En→De and WMT16 En→Ro. The results are reported in Table 2. When incorporating the error correction mechanism into the NMT model, it achieves 29.20 and 34.70 BLEU, outperforming our baseline by 0.8 and 1.8 points on En→De and En→Ro, and is comparable to previous works. These significant improvements on two large-scale datasets also demonstrate the effectiveness and robustness of our method in addressing exposure bias. In addition, our approach is compatible with previous works, i.e., our method could achieve better performance if combined with other advanced structures.

5.3 Ablation Study

To demonstrate the necessity of each component of our method, we conduct a series of ablation studies on IWSLT14 De→En, IWSLT14 Es→En and WMT14 En→De. The results are shown in Table 3. When disabling the error correction mechanism (ECM), we find that the model accuracy decreases by 0.30, 0.50 and 0.48 points, respectively, on the different tasks.

Method            De→En   Es→En   En→De
Our method        35.70   43.05   29.20
-ECM              35.40   42.55   28.72
-ECM -SS          34.81   42.16   28.48
-ECM -SS -TSSA    34.78   41.81   28.40

Table 3: Ablation study of the different components of our model. The second, third and fourth columns are results on IWSLT14 De→En, IWSLT14 Es→En and WMT14 En→De. The second row is our method. The third row removes the error correction mechanism (ECM) from the second row. The fourth row further removes scheduled sampling (SS) from the third row. The last row further removes two-stream self-attention, i.e., the standard NMT. The prefix "-" means removing this part.

When further removing scheduled sampling (SS), the model accuracy drops to 34.81, 42.16 and 28.48 points on the three tasks. We observe that on the large-scale dataset the improvements from scheduled sampling are limited, while our model still achieves stable improvements, which demonstrates the effectiveness of the error correction mechanism. In addition, we also compare the original NMT model with the NMT model with two-stream self-attention (TSSA) to verify whether the two-stream self-attention mechanism itself contributes to the accuracy improvement. From Table 3, we find that the NMT model with TSSA performs slightly better than the original NMT model on the small-scale tasks, by 0.03-0.35 points, and is close to the accuracy of the original NMT model on the large-scale task. This phenomenon indicates that the improvement of our method mainly comes from the error correction rather than from two-stream self-attention. In summary, every component plays an indispensable role in our model.

5.4 Study of λ for L_ECM

To investigate the effect of λ in L_ECM on model accuracy, we conduct a series of experiments with different λ on the IWSLT14 De→En and WMT14 En→De datasets. The results are shown in Figure 3. It can be seen that the accuracy improves with the growth of λ. When λ ≥ 0.6, the model accuracy increases slowly. The model achieves the best accuracy at λ = 0.9 on IWSLT14 De→En and λ = 1.0 on WMT14 En→De. To avoid the heavy cost of tuning hyperparameters for different tasks, we set λ to 1.0 in all final experiments.

[Figure 3: Results on (a) IWSLT14 De→En and (b) WMT14 En→De with different λ.]


Method          Translation
Source (De)     in dem moment war es als ob ein filmregisseur einen bühnenwechsel verlangt hätte
Target (En)     at that moment it was as if a film director called for a set change
Baseline        it was like a movie director had demanded a shift at the moment
Baseline + SS   at the moment it was like a filmmaker had requested a change
Our method      at the moment it was as if a movie director had called for a change

Table 4: A translation case from the IWSLT14 De→En test set, generated by the baseline method, the baseline with scheduled sampling, and our method with error correction. Italics mark the mismatched translations.

[Figure 4: Results on IWSLT14 De→En with (a) different start steps α for token sampling and (b) different maximum sampling probabilities β.]


5.5 Study of Sampling Probability p(·)

In this subsection, we study to what extent the hyperparameters α and β in the sampling probability p(·) affect the model accuracy. We conduct experiments on the IWSLT14 De→En dataset to investigate when to start scheduled sampling (α) and the best choice of the maximum sampling probability (β). As can be seen from Figure 4, we have the following observations:

• From Figure 4a, it can be seen that starting to sample tokens before 10K steps results in worse accuracy than the baseline, which is consistent with the hypothesis mentioned in Section 3.2 that sampling tokens too early hurts accuracy. After 10K steps, our method gradually surpasses the baseline and achieves the best accuracy at 30K steps. After 30K steps, the model accuracy drops slightly. In summary, α = 30K is an optimal choice for starting token sampling.

• From Figure 4b, the model accuracy drops quickly when the maximum sampling probability is either too large (i.e., the sampled tokens are close to the ground truths) or too small (i.e., the sampled tokens are close to the predicted tokens). This phenomenon is also consistent with our analysis in Section 3.2. Therefore, we choose β = 0.85 as the maximum sampling probability.

5.6 Case Study

To better understand the advantages of our method in correcting error tokens, we present a translation case from the IWSLT14 De→En test set in Table 4. It can be seen that the baseline result deviates from the ground truth considerably, predicting only five correct tokens. When further adding scheduled sampling, the quality of the translation improves, but it still generates error tokens like "like a filmmaker had requested", which cause subsequent prediction errors. Finally, when using the error correction mechanism, the translation produced by our method is closer to the target sequence, with only tiny errors. Although a mismatched token "movie" is predicted, our model can still correct the following predicted sequence, e.g., "called for a change". The quality of this translation also confirms the effectiveness of our model in correcting error information and alleviating error propagation.

6 Conclusion

In this paper, we incorporated a novel error correction mechanism into neural machine translation, aiming to solve the error propagation problem in sequence generation. Specifically, we introduced two-stream self-attention into neural machine translation and further designed an error correction mechanism based on it, which is able to correct previously predicted errors while generating the next token. Experimental results on three IWSLT tasks and two WMT tasks demonstrate that our method outperforms previous methods, including scheduled sampling, and effectively alleviates the problem of error propagation. In the future, we expect to apply our method to other sequence generation tasks, e.g., text summarization and unsupervised neural machine translation, and to incorporate our error correction mechanism into other advanced structures.

Acknowledgments

This paper was supported by The National Key Research and Development Program of China under Grant 2017YFB1300205.

References

[Bahdanau et al., 2014] Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv e-prints, abs/1409.0473, September 2014.

[Bengio et al., 2015] Samy Bengio, Oriol Vinyals, Navdeep Jaitly and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS, pages 1171-1179, 2015.

[Cho et al., 2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, October 2014.

[Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, pages 4171-4186, 2019.

[Hassan et al., 2018] Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. Achieving human parity on automatic chinese to english news translation. arXiv preprint arXiv:1803.05567, 2018.

[He et al., 2018] Tianyu He, Xu Tan, Yingce Xia, Di He, Tao Qin, Zhibo Chen and Tie-Yan Liu. Layer-wise coordination between encoder and decoder for neural machine translation. In NIPS, pages 7944-7954, 2018.

[Kingma and Ba, 2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference for Learning Representations, 2014.

[Ott et al., 2019] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL, June 2019.

[Radford et al., 2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.

[Ranzato et al., 2016] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli and Wojciech Zaremba. Sequence level training with recurrent neural networks. In ICLR, 2016.

[Sennrich et al., 2016] Rico Sennrich, Barry Haddow and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, August 2016.

[Song et al., 2018] Kaitao Song, Xu Tan, Di He, Jianfeng Lu, Tao Qin and Tie-Yan Liu. Double path networks for sequence to sequence learning. In COLING, pages 3064-3074, August 2018.

[Song et al., 2019] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu and Tie-Yan Liu. MASS: Masked sequence to sequence pre-training for language generation. In ICML, pages 5926-5936, 2019.

[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104-3112, 2014.

[Tan et al., 2019] Xu Tan, Yingce Xia, Lijun Wu and Tao Qin. Efficient bidirectional neural machine translation. arXiv preprint arXiv:1908.09329, 2019.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998-6008, 2017.

[Venkatraman et al., 2015] Arun Venkatraman, Martial Hebert and J. Andrew Bagnell. Improving multi-step prediction of learned time series models. In AAAI, 2015.

[Wu et al., 2018] Lijun Wu, Xu Tan, Di He, Fei Tian, Tao Qin, Jianhuang Lai and Tie-Yan Liu. Beyond error propagation in neural machine translation: Characteristics of language also matter. arXiv preprint arXiv:1809.00120, 2018.

[Wu et al., 2019] Lijun Wu, Xu Tan, Tao Qin, Jianhuang Lai and Tie-Yan Liu. Beyond error propagation: Language branching also affects the accuracy of sequence generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(12):1868-1879, 2019.

[Xia et al., 2019] Yingce Xia, Tianyu He, Xu Tan, Fei Tian, Di He and Tao Qin. Tied transformers: Neural machine translation with shared encoder and decoder. In AAAI, pages 5466-5473, 2019.

[Yang et al., 2019] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In NIPS, pages 5754-5764, 2019.

[Zhang et al., 2019] Wen Zhang, Yang Feng, Fandong Meng, Di You and Qun Liu. Bridging the gap between training and inference for neural machine translation. In ACL, pages 4334-4343, July 2019.


Page 2: Neural Machine Translation with Error Correction · Neural machine translation (NMT) [Bahdanau et al., 2014; Sutskever et al., 2014; Vaswani et al., 2017; Song et al., 2018; Hassan

119910119905minus1 119910119905

119910119905 119910119905+1

119910119905+1

119910119905+2

(a) Training

119910119905minus1 119910119905prime

119910119905prime 119910119905+1

prime

119910119905+1prime

119910119905+2prime

(b) Inference

Figure 1 The discrepancy between training and inference in autore-gressive sequence generation

German and English-Romanian translation datasets to evalu-ate the effectiveness of our proposed error correction mech-anism for NMT Experimental results demonstrate that ourmethod achieves improvements over Transformer baseline onall tasks Further experimental analyses also verify the effec-tiveness of error correction to improve the translation accu-racy

Our contributions can be summarized as follows

bull To the best of our knowledge we are the first to intro-duce an error correction mechanism during the transla-tion process of NMT with the help of the newly pro-posed two-stream self-attention

bull Experimental results on a variety of NMT datasetsand further experimental analyses demonstrate that ourmethod achieves improvements over Transformer base-line and scheduled sampling verifying the effectivenessof our correction mechanism

2 BackgroundIn this section we introduce the background of our workincluding the standard encoder-decoder framework expo-sure bias and error propagation and two-stream self-attentionmechanism

Encoder-decoder framework Given a sentence pairx y isin (X Y) the objective of an NMT model is to maxi-mum the log-likelihood probability P(y|x θ) where θ is theparameters of NMT model The objective function is equiv-alent to a chain of the conditional probability P(y|x θ) =prodnt=1 P(yt|yltt x θ) where n is the number of tokens in tar-

get sequence y and yltt means the target tokens before posi-tion t The encoder-decoder structure [Sutskever et al 2014Cho et al 2014] is the most common framework to solvethe NMT task It adopts an encoder to transform the sourcesentence x as the contextual information h and a decoder topredict the next token yt based on the previous target tokensyltt and h autoregressively Specifically for the t-th tokenprediction the decoder feeds the last token ytminus1 as the inputto predict the target token yt Besides an encoder-decoder at-tention mechanism [Bahdanau et al 2014] is used to bridgethe connection between source and target sentence

Exposure bias and error propagation Exposurebias [Ranzato et al 2016] is a troublesome problem in thelanguage generation During the training stage for language

generation it always takes ground truth tokens as the modelinput However at the test stage the decoder dependson its previous predictions to infer next token Figure 1illustrates us an example of the discrepancy between trainingand inference in autoregressive sequence generation Oncean incorrect token is predicted the error will accumulatecontinuously along the inference process To alleviate thisproblem the common solution is to replace some groundtruth tokens by predicted tokens at the training time which isnamed as scheduled sampling However scheduled samplingstill cannot handle the exposure bias perfectly since it onlyattempts to predict the next ground truth according to theincorrect predicted tokens but cannot reduce the negativeeffect from incorrect tokens Therefore we propose a novelcorrection mechanism during translation to alleviate the errorpropagation

Two-stream self-attention XLNet is one of the famouspre-trained methods [Radford et al 2019 Devlin et al 2019Yang et al 2019 Song et al 2019] for natural language pro-cessing It first proposed a two-stream self-attention mecha-nism which consists of a content stream and a query streamfor permutation language modeling For token yt it can seetokens ylet in the content stream while only see tokens yltt inthe query stream Beneficial from two-stream self-attentionmechanism the model can predict the next token in any posi-tion with the corresponding position embedding as the queryin order to enable permutation language modeling BesidesTwo-stream self-attention also avoids the pre-trained modelto use [MASK] token into the conditioned part during thetraining where the [MASK] token would bring mismatch be-tween pre-training and fine-tuning In this paper we leveragethe advantages of two-stream self-attention to design a novelerror correction mechanism for NMT

3 MethodPrevious NMT models [Bahdanau et al 2014 Vaswani etal 2017] generate the next tokens yprimet from the probability(ie yprimet sim P(yt|yltt x θ)) If yprimet is predicted with errorand taken as the next decoder input can we force model toautomatically build correct hidden representations that areclose to the ground truth token yt In this case the conse-quent token generation can build upon the previous correctrepresentations and become more precise A natural ideais to optimize the model to maximize the correction prob-ability P(yt|yprimet yltt x θ) simultaneously when maximizingthe probability of next token prediction P(yt+1|yprimet yltt x θ)However the previous NMT decoder does not well supportthis correction mechanism Inspired by the two-stream self-attention in XLNet [Yang et al 2019] we leverage the con-tent stream to maximize P(yt|yprimet yltt x θ) and the querystream to maximize P(yt+1|yprimet yltt x θ) which can wellmeet the requirements of simultaneous correction and trans-lation

In order to introduce our method more clearly in the fol-lowing subsections we first introduce the integration of two-stream self-attention into NMT model and then introduce theerror correction mechanism for NMT based on two-streamself-attention

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)

3892

1199100 1199101 1199102 1199103 1199104

1199101 1199102 1199103 1199104 1199105

x1 x2 x3 x4 x5(a) Standard NMT

x1 x2 x3 x4 x5

p1 p2 p3 p4 p5

1199101 1199102 1199103 1199104 1199105

1199101 1199102 1199103 1199104 1199105

Content Stream

Query Stream

(b) NMT with two-stream self-attention pi means the position of i-thtoken The red dashed line means that the query stream can attend tothe content stream

x1 x2 x3 x4 x5

p1 p2 p3 p4 p5

1199101 1199102 1199103 1199104 1199105

1199101 1199102prime 1199103 1199104

prime 1199105

Content Stream

Query Stream

1199102 1199104

(c) NMT with our proposed error correction mechanism based ontwo-stream self-attention yprimei is predicted tokens by the model itselfThe cells in red color represent that the potential error token yprime2 andyprime4 are corrected into the ground truth tokens y2 and y4

Figure 2 The illustrations of standard NMT NMT with two-streamself-attention and our proposed error correction mechanism on two-stream self-attention

31 NMT with Two-Stream Self-Attention

Inspired by XLNet [Yang et al 2019] we incorporate theidea of two-stream self-attention to modify the decoding ofNMT framework Specifically the encoder is the same asthe standard NMT model and the decoder is incorporatedwith two-stream self-attention where the positional embed-ding is taken as input in query stream for prediction whilecontent stream is used to build context representations Dif-ferent from that in XLNet we make two modifications 1) weremove the permutation language modeling in XLNet sincethe decoder in NMT usually only uses left-to-right genera-tion and 2) we let the decoder to predict the whole sequence

rather than partial sentence in XLNet Figure 2a and Fig-ure 2b show the differences between the standard NMT andthe NMT with two-stream self-attention

We formulate the NMT with two-stream self-attention asfollows For the decoder we feed the positions p1 middot middot middot pnto the query stream to provide the position information for thenext token prediction and the sequence y1 middot middot middot yn plus itspositions p1 middot middot middot pn to the content stream to build con-textual information For the l-th layer we define the hiddenstates of querycontent streams as qlt and clt The updates forthe query and content streams are as follows

ql+1t = Attention(Q = qltKV = clltth θl+1) (1)

cl+1t = Attention(Q = cltKV = clleth θl+1) (2)

where h represents the hidden states from encoder outputsand θl+1 represents the parameters of the l + 1 layer Q andKV represents the query key and value in self-attention Boththe query and content stream share the same model parame-ters The states of key and value can be reused in both queryand content streams Finally we feed the outputs of the querystream from the last layer to calculate the log-probability forthe next target-token prediction During inference we firstpredict the next token with the query stream and then updatethe content stream with the generated token The order ofquery and content stream will not affect the predictions sincethe tokens in the query stream only depend on the previouslygenerated tokens of the content streams

32 Error Correction based on Two-StreamSelf-Attention

Benefiting from the two-stream self-attention we can natu-rally introduce an error correction mechanism on the contentstream The content stream is originally designed to buildthe representations of previous tokens which is used in querystream for next token prediction In order to correct errorsthe content stream also needs to predict the correct tokensgiven incorrect tokens as input

In order to simulate the prediction errors in the input of thecontent stream we also leverage scheduled sampling [Ben-gio et al 2015] to randomly sample tokens either from theground truth y = y1 middot middot middot yn or the previously predictedtokens yprime = yprime1 middot middot middot yprimen with a certain probability as thenew inputs y = y1 middot middot middot yn where yprimet is sampled fromthe probability distribution P(yt|yltt x θ) For inputs yt itequals to yt with a probability p(middot) otherwise yprimet For eachtoken yprimet (yprimet 6= yt) predicted by the query stream in step t weforce the content stream to predict its corresponding groundtruth token yt again The loss function for the error correctionmechanism (ECM) is formulated as

LECM(y|y x θ) = minusnsumt=1

1(yt 6= yt) log(P(yt|ylet x θ))

(3)In the mechanism the content stream can learn to graduallycorrect the hidden representations of error tokens toward thecorrect counterpart layer by layer

The query stream is still used to predict the the next to-ken given a random mixture of previous predicted tokens and

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)

3893

ground truth tokens The negative log-likelihood (NLL) lossis formulated as

LNLL(y|y x θ) = minusnsumt=1

log(P(yt|yltt x θ)) (4)

Finally we combine the two loss functions as the final objec-tive function for our method

minLNLL(y|y x θ) + λ middot LECM(y|y x θ) (5)

where λ is a hyperparameter to balance the NLL loss andECM loss

Figure 2c demonstrates the workflow of our proposed er-ror correction mechanism The difference between our errorcorrection mechanism and the naive scheduled sampling isthat once an error token is predicted in scheduled samplingthe model still learns to predict the next correct token givenerror tokens as context which could confuse the the modeland mislead to learn incorrect prediction patterns Howeverbased on our error correction mechanism the next token pre-diction is built upon the representations that are corrected bythe content stream and is more precise to learn predictionpatterns

In our error correction mechanism how to control thescheduled sampling probability p(middot) and when to sampletokens are important factors for the training Previousworks [Bengio et al 2015] indicated that it is unsuitableto sample tokens from scratch during the training since themodel is still under-fitting and the sampled tokens will be tooerroneous Inspired by OR-NMT [Zhang et al 2019] we de-sign a similar exponential decay function for sampling prob-ability p(middot) but with more restrictions The decay function isset as

p(s) =

1 s le αmax(β micro

micro+exp((sminusα)micro) ) otherwise (6)

where s represents the training step α β and micro are hyperpa-rameters The hyperparameter α means the step when modelstarts to sample tokens and hyperparameter β is the maxi-mum probability for sampling

4 Experimental SettingIn this section we introduce the experimental settings to eval-uate our proposed method including datasets model config-uration training and evaluation

41 DatasetsWe conduct experiments on three IWSLT translation datasets(German Spanish Hebrew rarr English) and two WMTtranslation datasets (English rarr German Romanian) toevaluate our method In the follow sections we abbrevi-ate English German Spanish Hebrew Romanian as ldquoEnrdquoldquoDerdquo ldquoEsrdquo ldquoHerdquo ldquoRordquo

IWSLT datasets For IWSLT14 DerarrEn it contains160K and 7K sentence pairs in training set and valid setWe concatenate TEDtst2010 TEDtst2011 TEDtst2012

TEDdev2010 and TEDtst2012 as the test set For IWLST14EsrarrEn and HerarrEn 1 they contain 180K and 150K bilingualdata for training We choose TEDtst2013 as the valid set andTEDtst2014 as the test set During the data preprocess welearn a 10K byte-pair-coding (BPE) [Sennrich et al 2016] tohandle the vocabularyWMT datasets WMT14 EnrarrDe and WMT16 EnrarrRotranslation tasks contain 45M and 28M bilingual data fortraining Following previous work [Vaswani et al 2017] weconcatenate newstest2012 and newstest2013 as the valid setand choose newstest2014 as the test set for WMT14 EnrarrDeFor WMT16 EnrarrRo we choose newsdev2016 as the validset and newstest2016 as the test set We learn 32K and40K BPE codes to tokenize WMT14 EnrarrDe and WMT16EnrarrRo dataset

42 Model ConfigurationWe choose the state-of-the-art Transformer [Vaswani et al2017] as the default model For IWSLT tasks we use 6 Trans-former blocks where attention heads hidden size and filtersize are 4 512 and 1024 Dropout is set as 03 for IWSLTtasks The parameter size is 39M For WMT tasks we use 6Transformer blocks where attention heads hidden size andfilter size are 16 1024 and 4096 And the parameter size is as214M Dropout is set as 03 and 02 for EnrarrDe and EnrarrRorespectively To make a fair comparison we also list someresults by the original NMT model without two-stream self-attention For the decay function of sampling probability weset a b and micro as 30000 085 and 5000 The λ for LECM

is tuned on the valid set and a optimal choice is 10 Tomanifest the advances of our method we also prepare somestrong baselines for reference including Layer-wise Trans-former [He et al 2018] MIXER [Ranzato et al 2016] onTransformer and Tied-Transformer [Xia et al 2019]

43 Training and EvaluationDuring training we use Adam [Kingma and Ba 2014] as thedefault optimizer with a linear decay of learning rate TheIWSLT tasks are trained on single NVIDIA P40 GPU for100K steps and the WMT tasks are trained with 8 NVIDIAP40 GPUs for 300K steps where each GPU is filled with4096 tokens During inference We use beam search to de-code results The beam size and length penalty is set as 5 and10 for each task except WMT14 EnrarrDe which use a beamsize of 4 and length penalty is 06 by following the previouswork [Vaswani et al 2017] All of the results are reportedby multi-bleu 2 Our code is implemented on fairseq [Ott etal 2019] 3 and we will release our code under this linkhttpsgithubcomStillKeepTryECM-NMT

5 ResultsIn this section we report our result on three IWSLT tasksand two WMT tasks Furthermore we also study each hyper-

1IWSLT datasets can be download from httpswit3fbkeuarchive2014-01texts

2httpsgithubcommoses-smtmosesdecoderblobmasterscriptsgenericmulti-bleuperl

3httpsgithubcompytorchfairseq

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)

3894

Method DerarrEn EsrarrEn HerarrEn

Tied Transformer 3510 4051Layer-Wise Transformer 3507 4050 -MIXER 3530 4230 -

Transformer Baseline 3478 4178 3532Our method 3570 4305 3649

Table 1 BLEU score on IWSLT14 Translation tasks in differentsetting

Method EnrarrDe EnrarrRo

Tied Transformer 2898 3467Layer-wise Transformer 2901 3443MIXER 2868 3410

Transformer Baseline 2840 3290Our method 2920 3470

Table 2 BLEU score on WMT14 EnrarrDe and WMT16 EnrarrRo

parameter used in our model and conduct ablation study toevaluate our method

51 Results on IWSLT14 Translation TasksThe results of IWLST14 tasks are reported in Table 1 FromTable 1 we find our model with correction mechanism out-performs baseline by 089 099 and 083 points and the orig-inal NMT baseline by 092 127 and 117 points on DerarrEnEsrarrEn and HerarrEn respectively Note that our baseline isstrong enough which is comparable to the current advancedsystems Even within such strong baselines our method stillsachieves consistent improvements in all three tasks Theseimprovements also confirm the effectiveness of our methodin correcting error information

52 Results on WMT Translation TasksIn order to validate the performance of our method on large-scale datasets we also conduct experiments on WMT14EnrarrDe and WMT16 EnrarrRo The results are reported inTable 2 We found when incorporating error correction mech-anism into the NMT model it can achieve 2920 and 3470BLEU score which outperforms our baseline by 08 and 16points in EnrarrDe and EnrarrRo and is comparable to the pre-vious works These significant improvements on two large-scale datasets also demonstrate the effectiveness and robust-ness of our method in solving exposure bias In additionour approach is also compatible with previous works ieour method can achieve better performance if combined withother advanced structure

5.3 Ablation Study
To demonstrate the necessity of each component in our method, we conduct a series of ablation studies on IWSLT14 De→En, Es→En and WMT14 En→De. The results are shown in Table 3. When disabling the error correction mechanism (ECM), we find that model accuracy decreases by 0.30, 0.48 and 0.50 points respectively on the different tasks.

Method            De→En   Es→En   En→De
Our method        35.70   43.05   29.20
-ECM              35.40   42.55   28.72
-ECM -SS          34.81   42.16   28.48
-ECM -SS -TSSA    34.78   41.81   28.40

Table 3: Ablation study of the different components in our model. The second, third and fourth columns are results on IWSLT14 De→En, Es→En and WMT14 En→De. The second row is our method. The third row equals the second row with the error correction mechanism (ECM) removed. The fourth row equals the third row with scheduled sampling (SS) removed. The last row equals the fourth row with two-stream self-attention removed, i.e., the standard NMT. The prefix "-" means removing this part.

When further removing scheduled sampling (SS), the model accuracy drops to 34.81, 42.16 and 28.48 points on the three tasks. We observe that on the large-scale dataset the improvement from scheduled sampling is limited, while our model still achieves a stable improvement, which demonstrates the effectiveness of the error correction mechanism. In addition, we also compare the original NMT with the NMT with two-stream self-attention (TSSA) to verify whether the two-stream self-attention mechanism itself contributes to model accuracy. From Table 3, we find that the NMT model with TSSA performs slightly better than the original NMT model on the small-scale tasks, by 0.03-0.35 points, and is close to the accuracy of the original NMT model on the large-scale task. This also indicates that the improvement of our method mainly comes from the error correction rather than from the two-stream self-attention. In summary, each component plays an indispensable role in our model.

5.4 Study of λ for L_ECM
To investigate the effect of λ in L_ECM on model accuracy, we conduct a series of experiments with different λ on the IWSLT14 De→En and WMT14 En→De datasets. The results are shown in Figure 3. The accuracy improves with the growth of λ; when λ ≥ 0.6, the model accuracy increases only slowly. The model achieves the best accuracy at λ = 0.9 on IWSLT14 De→En and λ = 1.0 on WMT14 En→De. To avoid the heavy cost of tuning hyperparameters for different tasks, we fix λ to 1.0 in all final experiments.
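To make the role of λ concrete, the sketch below shows one way the combined objective L_NLL + λ · L_ECM could be assembled from the two decoder streams. The tensor names and shapes are our own simplification of the method described in Section 3.2, not the released implementation.

```python
# Hedged sketch: combine the query-stream NLL loss with the content-stream
# ECM loss, which is only applied at positions whose sampled input token
# differs from the ground truth.
import torch
import torch.nn.functional as F

def combined_loss(query_logits, content_logits, targets, mixed_inputs, lambda_ecm=1.0):
    # query_logits, content_logits: [batch, length, vocab]
    # targets, mixed_inputs:        [batch, length]
    nll = F.cross_entropy(query_logits.transpose(1, 2), targets)  # translation loss

    wrong = (mixed_inputs != targets).float()                     # 1 where the input token was sampled wrongly
    per_token = F.cross_entropy(content_logits.transpose(1, 2), targets, reduction="none")
    ecm = (per_token * wrong).sum() / wrong.sum().clamp(min=1.0)  # correction loss

    return nll + lambda_ecm * ecm
```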

Figure 3: Results on IWSLT14 De→En (a) and WMT14 En→De (b) with different λ.


Method          Translation
Source (De)     in dem moment war es als ob ein filmregisseur einen buhnenwechsel verlangt hatte
Target (En)     at that moment it was as if a film director called for a set change
Baseline        it was like a movie director had demanded a shift at the moment
Baseline + SS   at the moment it was like a filmmaker had requested a change
Our method      at the moment it was as if a movie director had called for a change

Table 4: A translation case from the IWSLT14 De→En test set, generated by the baseline, the baseline with scheduled sampling, and our method with error correction. Italics mark the mismatched parts of the translations.

Figure 4: Results on IWSLT14 De→En with different start steps α for token sampling (a) and maximum sampling probabilities β (b).


5.5 Study of Sampling Probability p(·)
In this subsection, we study to what extent the hyperparameters α and β of the sampling probability p(·) affect model accuracy. We conduct experiments on the IWSLT14 De→En dataset to investigate when to start scheduled sampling (α) and the best choice of the maximum sampling probability (β). From Figure 4, we have the following observations:

• From Figure 4a, starting to sample tokens before 10K steps results in worse accuracy than the baseline, which is consistent with the hypothesis in Section 3.2 that sampling tokens too early hurts accuracy. After 10K steps, our method gradually surpasses the baseline and achieves the best accuracy at 30K steps; after 30K steps, the model accuracy drops slightly. In summary, α = 30K is an optimal choice for starting token sampling.

• From Figure 4b, the model accuracy drops quickly when the maximum sampling probability is either too large (i.e., the sampled tokens are close to the ground truth) or too small (i.e., the sampled tokens are close to the predicted tokens). This phenomenon is also consistent with our analysis in Section 3.2. Therefore, we choose β = 0.85 as the maximum sampling probability. A small sketch of the resulting schedule follows this list.
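As a concrete illustration of the α and β values studied above, the following sketch implements the exponential decay schedule for p(·) defined in Section 3.2 with the settings used in the experiments (α = 30K, β = 0.85, µ = 5000); the function name is our own.

```python
# p(s): probability of feeding the ground-truth token at training step s.
import math

def sampling_probability(step, alpha=30_000, beta=0.85, mu=5_000):
    if step <= alpha:
        return 1.0  # always feed ground truth before alpha steps
    return max(beta, mu / (mu + math.exp((step - alpha) / mu)))

# sampling_probability(10_000) == 1.0; by 100K steps the value has decayed
# to the floor beta = 0.85.
```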

5.6 Case Study
To better understand the advantages of our method in correcting error tokens, we present a translation case from IWSLT14 De→En in Table 4. The baseline result deviates too much from the ground truth and only predicts five correct tokens. When further adding scheduled sampling, the quality of the translation improves, but it still generates erroneous tokens such as "like a filmmaker had requested", which cause errors in the following predictions. Finally, with the error correction mechanism, the translation produced by our method is closer to the target sequence, with only tiny errors. Although a mismatched token "movie" is predicted, our model can still correct the following predicted sequence, e.g., "called for a change". The quality of our translation also confirms the effectiveness of our model in correcting error information and alleviating error propagation.

6 Conclusion
In this paper, we incorporated a novel error correction mechanism into neural machine translation, aiming to solve the error propagation problem in sequence generation. Specifically, we introduce two-stream self-attention into neural machine translation and further design an error correction mechanism based on two-stream self-attention, which is able to correct previously predicted errors while generating the next token. Experimental results on three IWSLT tasks and two WMT tasks demonstrate that our method outperforms previous methods, including scheduled sampling, and alleviates the problem of error propagation effectively. In the future, we expect to apply our method to other sequence generation tasks, e.g., text summarization and unsupervised neural machine translation, and to incorporate our error correction mechanism into other advanced structures.

Acknowledgments
This paper was supported by the National Key Research and Development Program of China under Grant 2017YFB1300205.

References
[Bahdanau et al., 2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv e-prints, abs/1409.0473, September 2014.
[Bengio et al., 2015] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS, pages 1171–1179, 2015.
[Cho et al., 2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, October 2014.
[Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, pages 4171–4186, 2019.
[Hassan et al., 2018] Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567, 2018.
[He et al., 2018] Tianyu He, Xu Tan, Yingce Xia, Di He, Tao Qin, Zhibo Chen, and Tie-Yan Liu. Layer-wise coordination between encoder and decoder for neural machine translation. In NIPS, pages 7944–7954, 2018.
[Kingma and Ba, 2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2014.
[Ott et al., 2019] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL, June 2019.
[Radford et al., 2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.
[Ranzato et al., 2016] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In ICLR, 2016.
[Sennrich et al., 2016] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, August 2016.
[Song et al., 2018] Kaitao Song, Xu Tan, Di He, Jianfeng Lu, Tao Qin, and Tie-Yan Liu. Double path networks for sequence to sequence learning. In COLING, pages 3064–3074, August 2018.
[Song et al., 2019] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MASS: Masked sequence to sequence pre-training for language generation. In ICML, pages 5926–5936, 2019.
[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112, 2014.
[Tan et al., 2019] Xu Tan, Yingce Xia, Lijun Wu, and Tao Qin. Efficient bidirectional neural machine translation. arXiv preprint arXiv:1908.09329, 2019.
[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
[Venkatraman et al., 2015] Arun Venkatraman, Martial Hebert, and J. Andrew Bagnell. Improving multi-step prediction of learned time series models. In AAAI, 2015.
[Wu et al., 2018] Lijun Wu, Xu Tan, Di He, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. Beyond error propagation in neural machine translation: Characteristics of language also matter. arXiv preprint arXiv:1809.00120, 2018.
[Wu et al., 2019] Lijun Wu, Xu Tan, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. Beyond error propagation: Language branching also affects the accuracy of sequence generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(12):1868–1879, 2019.
[Xia et al., 2019] Yingce Xia, Tianyu He, Xu Tan, Fei Tian, Di He, and Tao Qin. Tied transformers: Neural machine translation with shared encoder and decoder. In AAAI, pages 5466–5473, 2019.
[Yang et al., 2019] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In NIPS, pages 5754–5764, 2019.
[Zhang et al., 2019] Wen Zhang, Yang Feng, Fandong Meng, Di You, and Qun Liu. Bridging the gap between training and inference for neural machine translation. In ACL, pages 4334–4343, July 2019.


42 Model ConfigurationWe choose the state-of-the-art Transformer [Vaswani et al2017] as the default model For IWSLT tasks we use 6 Trans-former blocks where attention heads hidden size and filtersize are 4 512 and 1024 Dropout is set as 03 for IWSLTtasks The parameter size is 39M For WMT tasks we use 6Transformer blocks where attention heads hidden size andfilter size are 16 1024 and 4096 And the parameter size is as214M Dropout is set as 03 and 02 for EnrarrDe and EnrarrRorespectively To make a fair comparison we also list someresults by the original NMT model without two-stream self-attention For the decay function of sampling probability weset a b and micro as 30000 085 and 5000 The λ for LECM

is tuned on the valid set and a optimal choice is 10 Tomanifest the advances of our method we also prepare somestrong baselines for reference including Layer-wise Trans-former [He et al 2018] MIXER [Ranzato et al 2016] onTransformer and Tied-Transformer [Xia et al 2019]

43 Training and EvaluationDuring training we use Adam [Kingma and Ba 2014] as thedefault optimizer with a linear decay of learning rate TheIWSLT tasks are trained on single NVIDIA P40 GPU for100K steps and the WMT tasks are trained with 8 NVIDIAP40 GPUs for 300K steps where each GPU is filled with4096 tokens During inference We use beam search to de-code results The beam size and length penalty is set as 5 and10 for each task except WMT14 EnrarrDe which use a beamsize of 4 and length penalty is 06 by following the previouswork [Vaswani et al 2017] All of the results are reportedby multi-bleu 2 Our code is implemented on fairseq [Ott etal 2019] 3 and we will release our code under this linkhttpsgithubcomStillKeepTryECM-NMT

5 ResultsIn this section we report our result on three IWSLT tasksand two WMT tasks Furthermore we also study each hyper-

1IWSLT datasets can be download from httpswit3fbkeuarchive2014-01texts

2httpsgithubcommoses-smtmosesdecoderblobmasterscriptsgenericmulti-bleuperl

3httpsgithubcompytorchfairseq

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)

3894

Method DerarrEn EsrarrEn HerarrEn

Tied Transformer 3510 4051Layer-Wise Transformer 3507 4050 -MIXER 3530 4230 -

Transformer Baseline 3478 4178 3532Our method 3570 4305 3649

Table 1 BLEU score on IWSLT14 Translation tasks in differentsetting

Method EnrarrDe EnrarrRo

Tied Transformer 2898 3467Layer-wise Transformer 2901 3443MIXER 2868 3410

Transformer Baseline 2840 3290Our method 2920 3470

Table 2 BLEU score on WMT14 EnrarrDe and WMT16 EnrarrRo

parameter used in our model and conduct ablation study toevaluate our method

51 Results on IWSLT14 Translation TasksThe results of IWLST14 tasks are reported in Table 1 FromTable 1 we find our model with correction mechanism out-performs baseline by 089 099 and 083 points and the orig-inal NMT baseline by 092 127 and 117 points on DerarrEnEsrarrEn and HerarrEn respectively Note that our baseline isstrong enough which is comparable to the current advancedsystems Even within such strong baselines our method stillsachieves consistent improvements in all three tasks Theseimprovements also confirm the effectiveness of our methodin correcting error information

52 Results on WMT Translation TasksIn order to validate the performance of our method on large-scale datasets we also conduct experiments on WMT14EnrarrDe and WMT16 EnrarrRo The results are reported inTable 2 We found when incorporating error correction mech-anism into the NMT model it can achieve 2920 and 3470BLEU score which outperforms our baseline by 08 and 16points in EnrarrDe and EnrarrRo and is comparable to the pre-vious works These significant improvements on two large-scale datasets also demonstrate the effectiveness and robust-ness of our method in solving exposure bias In additionour approach is also compatible with previous works ieour method can achieve better performance if combined withother advanced structure

53 Ablation StudyTo demonstrate the necessity of each component in ourmethod we take a series of ablation study on our modelon IWSLT14 DerarrEn EsrarrEn and WMT14 EnrarrDe Theresults are shown in Table 3 When disabling error cor-rection mechanism (ECM) we find the model accuracy de-creases 030 048 and 050 points respectively in different

Method DerarrEn EsrarrEn EnrarrDe

Our method 3570 4305 2920-ECM 3540 4255 2872-ECM -SS 3481 4216 2848-ECM -SS -TSSA 3478 4181 2840

Table 3 Ablation study of different component in our model Thesecond third and forth column are results on IWSLT14 DerarrEnEsrarrEn and WMT14 EnrarrDe The second row is our method Thethird row is equal to the second row removing error correction mech-anism (ECM) The forth row is equal to the third row removingscheduled sampling (SS) The last raw is equal to the forth row re-moving two-stream self-attention ie the standard NMT The prefixldquo-rdquo means removing this part

tasks When further removing scheduled sampling (SS) themodel accuracy drop to 3481 4116 2848 points in threetasks We observe that in the large-scale dataset the improve-ments for scheduled sampling are limited while our modelstill achieves stable improvements in the large-scale datasetwhich proves the effectiveness of the error correction mech-anism In addition we also make a comparison between theoriginal NMT and the NMT with two-stream self-attention(TSSA) to verify whether two-stream self-attention mecha-nism will contribute to model accuracy improvement FromTable 3 we find the NMT model with TSSA performs slightlybetter than the original NMT model in the small-scale tasksby 003-035 points and is close to the accuracy of the originalNMT model in the large-scale tasks This phenomenon alsoexplains the improvement of our method is mainly brought bythe error correction rather than the two-stream self-attentionIn summary every component all plays an indispensable rolein our model

54 Study of λ for LECM

To investigate the effect of λ in LECM on model accuracy weconduct a series of experiments with different λ on IWSLT14DerarrEn and WMT14 EnrarrDe datasets The results areshown in Figure 3 It can be seen that the accuracy improveswith the growth of λ When λ ge 06 the model accuracy in-creases slowly The model achieves best accuracy at λ = 09in IWSLT14 DerarrEn and λ = 10 in WMT14 EnrarrDe To

(a) IWSLT14 DerarrEn (b) WMT14 EnrarrDe

Figure 3 Results on IWSLT14 DerarrEn and WMT14 EnrarrDe withdifferent λ

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)

3895

Method Translation

Source (De) in dem moment war es als ob ein filmregisseur einen buhnenwechsel verlangt hatte

Target (En) at that moment it was as if a film director called for a set change

Baseline it was like a movie director had demanded a shift at the moment

Baseline + SS at the moment it was like a filmmaker had requested a change

Our method at the moment it was as if a movie director had called for a change

Table 4 A translation case on IWSLT14 DerarrEn test set generated by the baseline method baseline with scheduled sampling and ourmethod with error correction The italic font means the mismatch translation

(a) Study of α (b) Study of β

Figure 4 Results on IWSLT14 DerarrEn with different start steps αfor token sampling and maximum sampling probabilities β

avoid the heavy cost in tuning hyperparameters for differenttasks we limit λ as 10 in all of final experiments

55 Study of Sampling Probability p(middot)In this subsection we study how the hyperparameters α and βin sampling probability p(middot) will affect the model accuracy towhat extent We conduct experiments on IWSLT14 DerarrEndataset to investigate when to start scheduled sampling (α)and the best choice of maximum sampling probability (β)As can be seen from Figure 4 we have the following obser-vations

bull From Figure 4a it can be seen that starting to sample to-kens before 10K steps results in worse accuracy than thebaseline which is consistent with the previous hypoth-esis as mentioned in Section 32 that sampling tokenstoo early will affect the accuracy After 10K steps ourmethod gradually surpasses the baseline and achievesthe best accuracy at 30K steps After 30K steps themodel accuracy drops a little In summary α = 30Kis an optimal choice to start token sampling

bull From Figure 4b the model accuracy decreases promptlyno matter when the maximum sampling probability istoo big (ie the sampled tokens is close to the groundtruths) or too small (ie the sampled tokens are close tothe predicted tokens) This phenomenon is also consis-tent with our previous analysis in Section 32 Thereforewe choose β = 085 as the maximum sampling proba-bility

56 Cases StudyTo better understand the advantages of our method in correct-ing error tokens we also prepare some translation cases inIWSLT14 DerarrEn as shown in Table 4 It can be seen thatthe baseline result deviates the ground truth too much whichonly predicts five correct tokens When further adding sched-uled sampling the quality of translation can be improvedbut still generates error tokens like ldquolike a filmmaker hadrequestedrdquo which cause the following error prediction Fi-nally when using error correction mechanism we find thatthe translation produced by our method is closer to the targetsequence with only tiny errors In our translation sentence al-though a mismatched token ldquomovierdquo is predicted our modelcan still correct the following predicted sequence like ldquocalledfor a changerdquo The quality of our translation also confirmsthe effectiveness of our model in correcting error informationand alleviating error propagation

6 ConclusionIn this paper we incorporated a novel error correction mecha-nism into neural machine translation which aims to solve theerror propagation problem in sequence generation Specifi-cally we introduce two-stream self-attention into neural ma-chine translation and further design a error correction mech-anism based on two-stream self-attention which is able tocorrect the previous predicted errors while generate the nexttoken Experimental results on three IWSLT tasks and twoWMT tasks demonstrate our method outperforms previousmethods including scheduled sampling and alleviates theproblem of error propagation effectively In the future we ex-pect to apply our method on other sequence generation taskseg text summarization unsupervised neural machine trans-lation and incorporate our error correction mechanism intoother advanced structures

AcknowledgmentsThis paper was supported by The National Key Re-search and Development Program of China under Grant2017YFB1300205

References[Bahdanau et al 2014] Dzmitry Bahdanau Kyunghyun

Cho and Yoshua Bengio Neural machine translation

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)

3896

by jointly learning to align and translate arXiv e-printsabs14090473 September 2014

[Bengio et al 2015] Samy Bengio Oriol Vinyals NavdeepJaitly and Noam Shazeer Scheduled sampling for se-quence prediction with recurrent neural networks InNIPS pages 1171ndash1179 2015

[Cho et al 2014] Kyunghyun Cho Bart van MerrienboerCaglar Gulcehre Dzmitry Bahdanau Fethi BougaresHolger Schwenk and Yoshua Bengio Learning phraserepresentations using RNN encoderndashdecoder for statisticalmachine translation In EMNLP oct 2014

[Devlin et al 2019] Jacob Devlin Ming-Wei Chang Ken-ton Lee and Kristina Toutanova BERT Pre-training ofdeep bidirectional transformers for language understand-ing In NAACL pages 4171ndash4186 2019

[Hassan et al 2018] Hany Hassan Anthony Aue ChangChen Vishal Chowdhary Jonathan Clark Christian Fe-dermann Xuedong Huang Marcin Junczys-DowmuntWilliam Lewis Mu Li et al Achieving human parityon automatic chinese to english news translation arXivpreprint arXiv180305567 2018

[He et al 2018] Tianyu He Xu Tan Yingce Xia Di He TaoQin Zhibo Chen and Tie-Yan Liu Layer-wise coordi-nation between encoder and decoder for neural machinetranslation In NIPS pages 7944ndash7954 2018

[Kingma and Ba 2014] Diederik P Kingma and Jimmy BaAdam A method for stochastic optimization In Interna-tional Conference for Learning Representations 2014

[Ott et al 2019] Myle Ott Sergey Edunov Alexei BaevskiAngela Fan Sam Gross Nathan Ng David Grangier andMichael Auli fairseq A fast extensible toolkit for se-quence modeling In NAACL June 2019

[Radford et al 2019] Alec Radford Jeffrey Wu RewonChild David Luan Dario Amodei and Ilya SutskeverLanguage models are unsupervised multitask learnersOpenAI Blog 1(8) 2019

[Ranzato et al 2016] MarcrsquoAurelio Ranzato Sumit ChopraMichael Auli and Wojciech Zaremba Sequence leveltraining with recurrent neural networks In ICLR 2016

[Sennrich et al 2016] Rico Sennrich Barry Haddow andAlexandra Birch Neural machine translation of rare wordswith subword units In ACL August 2016

[Song et al 2018] Kaitao Song Xu Tan Di He JianfengLu Tao Qin and Tie-Yan Liu Double path networks forsequence to sequence learning In COLING pages 3064ndash3074 August 2018

[Song et al 2019] Kaitao Song Xu Tan Tao Qin JianfengLu and Tie-Yan Liu Mass Masked sequence to se-quence pre-training for language generation In ICMLpages 5926ndash5936 2019

[Sutskever et al 2014] Ilya Sutskever Oriol Vinyals andQuoc V Le Sequence to sequence learning with neuralnetworks In Advances in Neural Information ProcessingSystems 27 pages 3104ndash3112 2014

[Tan et al 2019] Xu Tan Yingce Xia Lijun Wu and TaoQin Efficient bidirectional neural machine translationarXiv preprint arXiv190809329 2019

[Vaswani et al 2017] Ashish Vaswani Noam Shazeer NikiParmar Jakob Uszkoreit Llion Jones Aidan N GomezŁ ukasz Kaiser and Illia Polosukhin Attention is all youneed In NIPS pages 5998ndash6008 2017

[Venkatraman et al 2015] Arun Venkatraman MartialHebert and J Andrew Bagnell Improving multi-stepprediction of learned time series models In AAAI 2015

[Wu et al 2018] Lijun Wu Xu Tan Di He Fei Tian TaoQin Jianhuang Lai and Tie-Yan Liu Beyond error prop-agation in neural machine translation Characteristics oflanguage also matter arXiv preprint arXiv1809001202018

[Wu et al 2019] Lijun Wu Xu Tan Tao Qin Jianhuang Laiand Tie-Yan Liu Beyond error propagation Languagebranching also affects the accuracy of sequence genera-tion IEEEACM Transactions on Audio Speech and Lan-guage Processing 27(12)1868ndash1879 2019

[Xia et al 2019] Yingce Xia Tianyu He Xu Tan Fei TianDi He and Tao Qin Tied transformers Neural machinetranslation with shared encoder and decoder In AAAIpages 5466ndash5473 2019

[Yang et al 2019] Zhilin Yang Zihang Dai Yiming YangJaime Carbonell Russ R Salakhutdinov and Quoc VLe Xlnet Generalized autoregressive pretraining for lan-guage understanding In NIPS pages 5754ndash5764 2019

[Zhang et al 2019] Wen Zhang Yang Feng FandongMeng Di You and Qun Liu Bridging the gap betweentraining and inference for neural machine translation InACL pages 4334ndash4343 July 2019

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)

3897

  • Introduction
  • Background
  • Method
    • NMT with Two-Stream Self-Attention
    • Error Correction based on Two-Stream Self-Attention
      • Experimental Setting
        • Datasets
        • Model Configuration
        • Training and Evaluation
          • Results
            • Results on IWSLT14 Translation Tasks
            • Results on WMT Translation Tasks
            • Ablation Study
            • Study of for LECM
            • Study of Sampling Probability p()
            • Cases Study
              • Conclusion
Page 4: Neural Machine Translation with Error Correction · Neural machine translation (NMT) [Bahdanau et al., 2014; Sutskever et al., 2014; Vaswani et al., 2017; Song et al., 2018; Hassan

ground truth tokens The negative log-likelihood (NLL) lossis formulated as

LNLL(y|y x θ) = minusnsumt=1

log(P(yt|yltt x θ)) (4)

Finally we combine the two loss functions as the final objec-tive function for our method

minLNLL(y|y x θ) + λ middot LECM(y|y x θ) (5)

where λ is a hyperparameter to balance the NLL loss andECM loss

Figure 2c demonstrates the workflow of our proposed er-ror correction mechanism The difference between our errorcorrection mechanism and the naive scheduled sampling isthat once an error token is predicted in scheduled samplingthe model still learns to predict the next correct token givenerror tokens as context which could confuse the the modeland mislead to learn incorrect prediction patterns Howeverbased on our error correction mechanism the next token pre-diction is built upon the representations that are corrected bythe content stream and is more precise to learn predictionpatterns

In our error correction mechanism how to control thescheduled sampling probability p(middot) and when to sampletokens are important factors for the training Previousworks [Bengio et al 2015] indicated that it is unsuitableto sample tokens from scratch during the training since themodel is still under-fitting and the sampled tokens will be tooerroneous Inspired by OR-NMT [Zhang et al 2019] we de-sign a similar exponential decay function for sampling prob-ability p(middot) but with more restrictions The decay function isset as

p(s) =

1 s le αmax(β micro

micro+exp((sminusα)micro) ) otherwise (6)

where s represents the training step α β and micro are hyperpa-rameters The hyperparameter α means the step when modelstarts to sample tokens and hyperparameter β is the maxi-mum probability for sampling

4 Experimental SettingIn this section we introduce the experimental settings to eval-uate our proposed method including datasets model config-uration training and evaluation

41 DatasetsWe conduct experiments on three IWSLT translation datasets(German Spanish Hebrew rarr English) and two WMTtranslation datasets (English rarr German Romanian) toevaluate our method In the follow sections we abbrevi-ate English German Spanish Hebrew Romanian as ldquoEnrdquoldquoDerdquo ldquoEsrdquo ldquoHerdquo ldquoRordquo

IWSLT datasets For IWSLT14 DerarrEn it contains160K and 7K sentence pairs in training set and valid setWe concatenate TEDtst2010 TEDtst2011 TEDtst2012

TEDdev2010 and TEDtst2012 as the test set For IWLST14EsrarrEn and HerarrEn 1 they contain 180K and 150K bilingualdata for training We choose TEDtst2013 as the valid set andTEDtst2014 as the test set During the data preprocess welearn a 10K byte-pair-coding (BPE) [Sennrich et al 2016] tohandle the vocabularyWMT datasets WMT14 EnrarrDe and WMT16 EnrarrRotranslation tasks contain 45M and 28M bilingual data fortraining Following previous work [Vaswani et al 2017] weconcatenate newstest2012 and newstest2013 as the valid setand choose newstest2014 as the test set for WMT14 EnrarrDeFor WMT16 EnrarrRo we choose newsdev2016 as the validset and newstest2016 as the test set We learn 32K and40K BPE codes to tokenize WMT14 EnrarrDe and WMT16EnrarrRo dataset

42 Model ConfigurationWe choose the state-of-the-art Transformer [Vaswani et al2017] as the default model For IWSLT tasks we use 6 Trans-former blocks where attention heads hidden size and filtersize are 4 512 and 1024 Dropout is set as 03 for IWSLTtasks The parameter size is 39M For WMT tasks we use 6Transformer blocks where attention heads hidden size andfilter size are 16 1024 and 4096 And the parameter size is as214M Dropout is set as 03 and 02 for EnrarrDe and EnrarrRorespectively To make a fair comparison we also list someresults by the original NMT model without two-stream self-attention For the decay function of sampling probability weset a b and micro as 30000 085 and 5000 The λ for LECM

is tuned on the valid set and a optimal choice is 10 Tomanifest the advances of our method we also prepare somestrong baselines for reference including Layer-wise Trans-former [He et al 2018] MIXER [Ranzato et al 2016] onTransformer and Tied-Transformer [Xia et al 2019]

43 Training and EvaluationDuring training we use Adam [Kingma and Ba 2014] as thedefault optimizer with a linear decay of learning rate TheIWSLT tasks are trained on single NVIDIA P40 GPU for100K steps and the WMT tasks are trained with 8 NVIDIAP40 GPUs for 300K steps where each GPU is filled with4096 tokens During inference We use beam search to de-code results The beam size and length penalty is set as 5 and10 for each task except WMT14 EnrarrDe which use a beamsize of 4 and length penalty is 06 by following the previouswork [Vaswani et al 2017] All of the results are reportedby multi-bleu 2 Our code is implemented on fairseq [Ott etal 2019] 3 and we will release our code under this linkhttpsgithubcomStillKeepTryECM-NMT

5 ResultsIn this section we report our result on three IWSLT tasksand two WMT tasks Furthermore we also study each hyper-

1IWSLT datasets can be download from httpswit3fbkeuarchive2014-01texts

2httpsgithubcommoses-smtmosesdecoderblobmasterscriptsgenericmulti-bleuperl

3httpsgithubcompytorchfairseq

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)

3894

Method DerarrEn EsrarrEn HerarrEn

Tied Transformer 3510 4051Layer-Wise Transformer 3507 4050 -MIXER 3530 4230 -

Transformer Baseline 3478 4178 3532Our method 3570 4305 3649

Table 1 BLEU score on IWSLT14 Translation tasks in differentsetting

Method EnrarrDe EnrarrRo

Tied Transformer 2898 3467Layer-wise Transformer 2901 3443MIXER 2868 3410

Transformer Baseline 2840 3290Our method 2920 3470

Table 2 BLEU score on WMT14 EnrarrDe and WMT16 EnrarrRo

parameter used in our model and conduct ablation study toevaluate our method

51 Results on IWSLT14 Translation TasksThe results of IWLST14 tasks are reported in Table 1 FromTable 1 we find our model with correction mechanism out-performs baseline by 089 099 and 083 points and the orig-inal NMT baseline by 092 127 and 117 points on DerarrEnEsrarrEn and HerarrEn respectively Note that our baseline isstrong enough which is comparable to the current advancedsystems Even within such strong baselines our method stillsachieves consistent improvements in all three tasks Theseimprovements also confirm the effectiveness of our methodin correcting error information

52 Results on WMT Translation TasksIn order to validate the performance of our method on large-scale datasets we also conduct experiments on WMT14EnrarrDe and WMT16 EnrarrRo The results are reported inTable 2 We found when incorporating error correction mech-anism into the NMT model it can achieve 2920 and 3470BLEU score which outperforms our baseline by 08 and 16points in EnrarrDe and EnrarrRo and is comparable to the pre-vious works These significant improvements on two large-scale datasets also demonstrate the effectiveness and robust-ness of our method in solving exposure bias In additionour approach is also compatible with previous works ieour method can achieve better performance if combined withother advanced structure

53 Ablation StudyTo demonstrate the necessity of each component in ourmethod we take a series of ablation study on our modelon IWSLT14 DerarrEn EsrarrEn and WMT14 EnrarrDe Theresults are shown in Table 3 When disabling error cor-rection mechanism (ECM) we find the model accuracy de-creases 030 048 and 050 points respectively in different

Method DerarrEn EsrarrEn EnrarrDe

Our method 3570 4305 2920-ECM 3540 4255 2872-ECM -SS 3481 4216 2848-ECM -SS -TSSA 3478 4181 2840

Table 3 Ablation study of different component in our model Thesecond third and forth column are results on IWSLT14 DerarrEnEsrarrEn and WMT14 EnrarrDe The second row is our method Thethird row is equal to the second row removing error correction mech-anism (ECM) The forth row is equal to the third row removingscheduled sampling (SS) The last raw is equal to the forth row re-moving two-stream self-attention ie the standard NMT The prefixldquo-rdquo means removing this part

tasks When further removing scheduled sampling (SS) themodel accuracy drop to 3481 4116 2848 points in threetasks We observe that in the large-scale dataset the improve-ments for scheduled sampling are limited while our modelstill achieves stable improvements in the large-scale datasetwhich proves the effectiveness of the error correction mech-anism In addition we also make a comparison between theoriginal NMT and the NMT with two-stream self-attention(TSSA) to verify whether two-stream self-attention mecha-nism will contribute to model accuracy improvement FromTable 3 we find the NMT model with TSSA performs slightlybetter than the original NMT model in the small-scale tasksby 003-035 points and is close to the accuracy of the originalNMT model in the large-scale tasks This phenomenon alsoexplains the improvement of our method is mainly brought bythe error correction rather than the two-stream self-attentionIn summary every component all plays an indispensable rolein our model

54 Study of λ for LECM

To investigate the effect of λ in LECM on model accuracy weconduct a series of experiments with different λ on IWSLT14DerarrEn and WMT14 EnrarrDe datasets The results areshown in Figure 3 It can be seen that the accuracy improveswith the growth of λ When λ ge 06 the model accuracy in-creases slowly The model achieves best accuracy at λ = 09in IWSLT14 DerarrEn and λ = 10 in WMT14 EnrarrDe To

(a) IWSLT14 DerarrEn (b) WMT14 EnrarrDe

Figure 3 Results on IWSLT14 DerarrEn and WMT14 EnrarrDe withdifferent λ

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)

3895

Method Translation

Source (De) in dem moment war es als ob ein filmregisseur einen buhnenwechsel verlangt hatte

Target (En) at that moment it was as if a film director called for a set change

Baseline it was like a movie director had demanded a shift at the moment

Baseline + SS at the moment it was like a filmmaker had requested a change

Our method at the moment it was as if a movie director had called for a change

Table 4 A translation case on IWSLT14 DerarrEn test set generated by the baseline method baseline with scheduled sampling and ourmethod with error correction The italic font means the mismatch translation

(a) Study of α (b) Study of β

Figure 4 Results on IWSLT14 DerarrEn with different start steps αfor token sampling and maximum sampling probabilities β

avoid the heavy cost in tuning hyperparameters for differenttasks we limit λ as 10 in all of final experiments

55 Study of Sampling Probability p(middot)In this subsection we study how the hyperparameters α and βin sampling probability p(middot) will affect the model accuracy towhat extent We conduct experiments on IWSLT14 DerarrEndataset to investigate when to start scheduled sampling (α)and the best choice of maximum sampling probability (β)As can be seen from Figure 4 we have the following obser-vations

bull From Figure 4a it can be seen that starting to sample to-kens before 10K steps results in worse accuracy than thebaseline which is consistent with the previous hypoth-esis as mentioned in Section 32 that sampling tokenstoo early will affect the accuracy After 10K steps ourmethod gradually surpasses the baseline and achievesthe best accuracy at 30K steps After 30K steps themodel accuracy drops a little In summary α = 30Kis an optimal choice to start token sampling

bull From Figure 4b the model accuracy decreases promptlyno matter when the maximum sampling probability istoo big (ie the sampled tokens is close to the groundtruths) or too small (ie the sampled tokens are close tothe predicted tokens) This phenomenon is also consis-tent with our previous analysis in Section 32 Thereforewe choose β = 085 as the maximum sampling proba-bility

56 Cases StudyTo better understand the advantages of our method in correct-ing error tokens we also prepare some translation cases inIWSLT14 DerarrEn as shown in Table 4 It can be seen thatthe baseline result deviates the ground truth too much whichonly predicts five correct tokens When further adding sched-uled sampling the quality of translation can be improvedbut still generates error tokens like ldquolike a filmmaker hadrequestedrdquo which cause the following error prediction Fi-nally when using error correction mechanism we find thatthe translation produced by our method is closer to the targetsequence with only tiny errors In our translation sentence al-though a mismatched token ldquomovierdquo is predicted our modelcan still correct the following predicted sequence like ldquocalledfor a changerdquo The quality of our translation also confirmsthe effectiveness of our model in correcting error informationand alleviating error propagation

6 ConclusionIn this paper we incorporated a novel error correction mecha-nism into neural machine translation which aims to solve theerror propagation problem in sequence generation Specifi-cally we introduce two-stream self-attention into neural ma-chine translation and further design a error correction mech-anism based on two-stream self-attention which is able tocorrect the previous predicted errors while generate the nexttoken Experimental results on three IWSLT tasks and twoWMT tasks demonstrate our method outperforms previousmethods including scheduled sampling and alleviates theproblem of error propagation effectively In the future we ex-pect to apply our method on other sequence generation taskseg text summarization unsupervised neural machine trans-lation and incorporate our error correction mechanism intoother advanced structures

AcknowledgmentsThis paper was supported by The National Key Re-search and Development Program of China under Grant2017YFB1300205

References[Bahdanau et al 2014] Dzmitry Bahdanau Kyunghyun

Cho and Yoshua Bengio Neural machine translation

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)

3896

by jointly learning to align and translate arXiv e-printsabs14090473 September 2014

[Bengio et al 2015] Samy Bengio Oriol Vinyals NavdeepJaitly and Noam Shazeer Scheduled sampling for se-quence prediction with recurrent neural networks InNIPS pages 1171ndash1179 2015

[Cho et al 2014] Kyunghyun Cho Bart van MerrienboerCaglar Gulcehre Dzmitry Bahdanau Fethi BougaresHolger Schwenk and Yoshua Bengio Learning phraserepresentations using RNN encoderndashdecoder for statisticalmachine translation In EMNLP oct 2014

[Devlin et al 2019] Jacob Devlin Ming-Wei Chang Ken-ton Lee and Kristina Toutanova BERT Pre-training ofdeep bidirectional transformers for language understand-ing In NAACL pages 4171ndash4186 2019

[Hassan et al 2018] Hany Hassan Anthony Aue ChangChen Vishal Chowdhary Jonathan Clark Christian Fe-dermann Xuedong Huang Marcin Junczys-DowmuntWilliam Lewis Mu Li et al Achieving human parityon automatic chinese to english news translation arXivpreprint arXiv180305567 2018

[He et al 2018] Tianyu He Xu Tan Yingce Xia Di He TaoQin Zhibo Chen and Tie-Yan Liu Layer-wise coordi-nation between encoder and decoder for neural machinetranslation In NIPS pages 7944ndash7954 2018

[Kingma and Ba 2014] Diederik P Kingma and Jimmy BaAdam A method for stochastic optimization In Interna-tional Conference for Learning Representations 2014

[Ott et al 2019] Myle Ott Sergey Edunov Alexei BaevskiAngela Fan Sam Gross Nathan Ng David Grangier andMichael Auli fairseq A fast extensible toolkit for se-quence modeling In NAACL June 2019

[Radford et al 2019] Alec Radford Jeffrey Wu RewonChild David Luan Dario Amodei and Ilya SutskeverLanguage models are unsupervised multitask learnersOpenAI Blog 1(8) 2019

[Ranzato et al 2016] MarcrsquoAurelio Ranzato Sumit ChopraMichael Auli and Wojciech Zaremba Sequence leveltraining with recurrent neural networks In ICLR 2016

[Sennrich et al 2016] Rico Sennrich Barry Haddow andAlexandra Birch Neural machine translation of rare wordswith subword units In ACL August 2016

[Song et al 2018] Kaitao Song Xu Tan Di He JianfengLu Tao Qin and Tie-Yan Liu Double path networks forsequence to sequence learning In COLING pages 3064ndash3074 August 2018

[Song et al 2019] Kaitao Song Xu Tan Tao Qin JianfengLu and Tie-Yan Liu Mass Masked sequence to se-quence pre-training for language generation In ICMLpages 5926ndash5936 2019

[Sutskever et al 2014] Ilya Sutskever Oriol Vinyals andQuoc V Le Sequence to sequence learning with neuralnetworks In Advances in Neural Information ProcessingSystems 27 pages 3104ndash3112 2014

[Tan et al 2019] Xu Tan Yingce Xia Lijun Wu and TaoQin Efficient bidirectional neural machine translationarXiv preprint arXiv190809329 2019

[Vaswani et al 2017] Ashish Vaswani Noam Shazeer NikiParmar Jakob Uszkoreit Llion Jones Aidan N GomezŁ ukasz Kaiser and Illia Polosukhin Attention is all youneed In NIPS pages 5998ndash6008 2017

[Venkatraman et al 2015] Arun Venkatraman MartialHebert and J Andrew Bagnell Improving multi-stepprediction of learned time series models In AAAI 2015

[Wu et al 2018] Lijun Wu Xu Tan Di He Fei Tian TaoQin Jianhuang Lai and Tie-Yan Liu Beyond error prop-agation in neural machine translation Characteristics oflanguage also matter arXiv preprint arXiv1809001202018

[Wu et al 2019] Lijun Wu Xu Tan Tao Qin Jianhuang Laiand Tie-Yan Liu Beyond error propagation Languagebranching also affects the accuracy of sequence genera-tion IEEEACM Transactions on Audio Speech and Lan-guage Processing 27(12)1868ndash1879 2019

[Xia et al 2019] Yingce Xia Tianyu He Xu Tan Fei TianDi He and Tao Qin Tied transformers Neural machinetranslation with shared encoder and decoder In AAAIpages 5466ndash5473 2019

[Yang et al 2019] Zhilin Yang Zihang Dai Yiming YangJaime Carbonell Russ R Salakhutdinov and Quoc VLe Xlnet Generalized autoregressive pretraining for lan-guage understanding In NIPS pages 5754ndash5764 2019

[Zhang et al 2019] Wen Zhang Yang Feng FandongMeng Di You and Qun Liu Bridging the gap betweentraining and inference for neural machine translation InACL pages 4334ndash4343 July 2019

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)

3897

  • Introduction
  • Background
  • Method
    • NMT with Two-Stream Self-Attention
    • Error Correction based on Two-Stream Self-Attention
      • Experimental Setting
        • Datasets
        • Model Configuration
        • Training and Evaluation
          • Results
            • Results on IWSLT14 Translation Tasks
            • Results on WMT Translation Tasks
            • Ablation Study
            • Study of for LECM
            • Study of Sampling Probability p()
            • Cases Study
              • Conclusion
Page 5: Neural Machine Translation with Error Correction · Neural machine translation (NMT) [Bahdanau et al., 2014; Sutskever et al., 2014; Vaswani et al., 2017; Song et al., 2018; Hassan

Method DerarrEn EsrarrEn HerarrEn

Tied Transformer 3510 4051Layer-Wise Transformer 3507 4050 -MIXER 3530 4230 -

Transformer Baseline 3478 4178 3532Our method 3570 4305 3649

Table 1 BLEU score on IWSLT14 Translation tasks in differentsetting

Method EnrarrDe EnrarrRo

Tied Transformer 2898 3467Layer-wise Transformer 2901 3443MIXER 2868 3410

Transformer Baseline 2840 3290Our method 2920 3470

Table 2 BLEU score on WMT14 EnrarrDe and WMT16 EnrarrRo

parameter used in our model and conduct ablation study toevaluate our method

51 Results on IWSLT14 Translation TasksThe results of IWLST14 tasks are reported in Table 1 FromTable 1 we find our model with correction mechanism out-performs baseline by 089 099 and 083 points and the orig-inal NMT baseline by 092 127 and 117 points on DerarrEnEsrarrEn and HerarrEn respectively Note that our baseline isstrong enough which is comparable to the current advancedsystems Even within such strong baselines our method stillsachieves consistent improvements in all three tasks Theseimprovements also confirm the effectiveness of our methodin correcting error information

52 Results on WMT Translation TasksIn order to validate the performance of our method on large-scale datasets we also conduct experiments on WMT14EnrarrDe and WMT16 EnrarrRo The results are reported inTable 2 We found when incorporating error correction mech-anism into the NMT model it can achieve 2920 and 3470BLEU score which outperforms our baseline by 08 and 16points in EnrarrDe and EnrarrRo and is comparable to the pre-vious works These significant improvements on two large-scale datasets also demonstrate the effectiveness and robust-ness of our method in solving exposure bias In additionour approach is also compatible with previous works ieour method can achieve better performance if combined withother advanced structure

53 Ablation StudyTo demonstrate the necessity of each component in ourmethod we take a series of ablation study on our modelon IWSLT14 DerarrEn EsrarrEn and WMT14 EnrarrDe Theresults are shown in Table 3 When disabling error cor-rection mechanism (ECM) we find the model accuracy de-creases 030 048 and 050 points respectively in different

Method              De→En   Es→En   En→De

Our method          35.70   43.05   29.20
-ECM                35.40   42.55   28.72
-ECM -SS            34.81   42.16   28.48
-ECM -SS -TSSA      34.78   41.81   28.40

Table 3: Ablation study of the different components in our model. The second, third and fourth columns are results on IWSLT14 De→En, Es→En and WMT14 En→De. The second row is our method. The third row is the second row with the error correction mechanism (ECM) removed. The fourth row is the third row with scheduled sampling (SS) further removed. The last row is the fourth row with two-stream self-attention further removed, i.e., the standard NMT model. The prefix "-" means removing this part.

When further removing scheduled sampling (SS), the model accuracy drops to 34.81, 42.16 and 28.48 points on the three tasks. We observe that on the large-scale dataset the improvements from scheduled sampling are limited, while our model still achieves stable improvements there, which demonstrates the effectiveness of the error correction mechanism. In addition, we also compare the original NMT model with the NMT model equipped with two-stream self-attention (TSSA), to verify whether the two-stream self-attention mechanism itself contributes to the accuracy improvement. From Table 3, we find that the NMT model with TSSA performs slightly better than the original NMT model on the small-scale tasks, by 0.03-0.35 points, and is close to the accuracy of the original NMT model on the large-scale task. This also shows that the improvement of our method is mainly brought by the error correction rather than by the two-stream self-attention. In summary, every component plays an indispensable role in our model.
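For readers unfamiliar with the scheduled sampling (SS) component removed in this ablation, the following sketch illustrates, under simplified assumptions, how ground-truth and model-predicted tokens can be mixed as decoder inputs during training. It is a generic illustration of the technique of Bengio et al. [2015]; the tensor shapes and function name are ours, not the exact implementation used in our experiments.

import torch

def mix_decoder_inputs(gold_tokens: torch.Tensor,
                       predicted_tokens: torch.Tensor,
                       sample_prob: float) -> torch.Tensor:
    """Scheduled-sampling style mixing of decoder inputs.

    gold_tokens, predicted_tokens: LongTensor of shape (batch, seq_len).
    With probability `sample_prob`, each ground-truth input token is
    replaced by the model's own prediction from a previous forward pass,
    which simulates the prediction errors seen at inference time.
    """
    # Bernoulli mask: True means "use the predicted token at this position"
    use_pred = torch.rand_like(gold_tokens, dtype=torch.float) < sample_prob
    return torch.where(use_pred, predicted_tokens, gold_tokens)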

5.4 Study of λ for L_ECM

To investigate the effect of λ in L_ECM on model accuracy, we conduct a series of experiments with different λ on the IWSLT14 De→En and WMT14 En→De datasets. The results are shown in Figure 3. It can be seen that the accuracy improves with the growth of λ. When λ ≥ 0.6, the model accuracy increases slowly. The model achieves the best accuracy at λ = 0.9 on IWSLT14 De→En and λ = 1.0 on WMT14 En→De.

Figure 3: Results on IWSLT14 De→En (panel a) and WMT14 En→De (panel b) with different λ.


Method          Translation

Source (De)     in dem moment war es als ob ein filmregisseur einen buhnenwechsel verlangt hatte
Target (En)     at that moment it was as if a film director called for a set change
Baseline        it was like a movie director had demanded a shift at the moment
Baseline + SS   at the moment it was like a filmmaker had requested a change
Our method      at the moment it was as if a movie director had called for a change

Table 4: A translation case from the IWSLT14 De→En test set, generated by the baseline method, the baseline with scheduled sampling, and our method with error correction. Italic font marks the mismatched translations.

Figure 4: Results on IWSLT14 De→En with different start steps α for token sampling (panel a) and maximum sampling probabilities β (panel b).

To avoid the heavy cost of tuning hyperparameters for different tasks, we fix λ to 1.0 in all final experiments.
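For reference, a plausible form of the combined training objective is sketched below, assuming that L_ECM is simply added to the standard NMT cross-entropy loss with weight λ; the exact formulation is defined in the Method section, and the symbol L_total is our shorthand here:

\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{NMT}} + \lambda \cdot \mathcal{L}_{\mathrm{ECM}}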

5.5 Study of Sampling Probability p(·)
In this subsection, we study to what extent the hyperparameters α and β in the sampling probability p(·) affect the model accuracy. We conduct experiments on the IWSLT14 De→En dataset to investigate when to start scheduled sampling (α) and the best choice of the maximum sampling probability (β). As can be seen from Figure 4, we have the following observations (a sketch of one possible schedule is given after the list):

• From Figure 4a, it can be seen that starting to sample tokens before 10K steps results in worse accuracy than the baseline, which is consistent with the hypothesis mentioned in Section 3.2 that sampling tokens too early hurts accuracy. After 10K steps, our method gradually surpasses the baseline and achieves the best accuracy at 30K steps. After 30K steps, the model accuracy drops a little. In summary, α = 30K is an optimal choice to start token sampling.

• From Figure 4b, the model accuracy decreases rapidly when the maximum sampling probability is either too big (i.e., the sampled tokens are close to the ground truth) or too small (i.e., the sampled tokens are close to the predicted tokens). This phenomenon is also consistent with our previous analysis in Section 3.2. Therefore, we choose β = 0.85 as the maximum sampling probability.
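To make the roles of α and β concrete, the sketch below shows one possible schedule, under the assumption that p(t) is the probability of feeding a sampled (model-predicted) token instead of the ground truth at training step t, staying at zero before step α and then ramping up to the maximum β. The linear ramp and the ramp_steps parameter are illustrative assumptions; the exact functional form we use is defined in Section 3.2.

def sampling_probability(step: int, alpha: int = 30_000,
                         beta: float = 0.85, ramp_steps: int = 10_000) -> float:
    """Return an illustrative token-sampling probability p(step)."""
    if step < alpha:
        # before step alpha: pure teacher forcing, no sampled tokens
        return 0.0
    # after step alpha: increase linearly up to the maximum probability beta
    progress = min(1.0, (step - alpha) / ramp_steps)
    return beta * progress

# Example values under these assumptions:
# p(20K) = 0.0, p(30K) = 0.0, p(35K) = 0.425, p(40K and beyond) = 0.85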

5.6 Case Study
To better understand the advantages of our method in correcting error tokens, we also present some translation cases from IWSLT14 De→En, as shown in Table 4. It can be seen that the baseline result deviates from the ground truth too much and only predicts five correct tokens. When further adding scheduled sampling, the quality of the translation improves, but it still generates erroneous tokens such as "like a filmmaker had requested", which cause errors in the following predictions. Finally, when using the error correction mechanism, we find that the translation produced by our method is closer to the target sequence, with only tiny errors. In our translated sentence, although a mismatched token "movie" is predicted, our model can still correct the following predicted sequence, e.g., "called for a change". The quality of our translation also confirms the effectiveness of our model in correcting error information and alleviating error propagation.

6 Conclusion
In this paper, we incorporated a novel error correction mechanism into neural machine translation, which aims to solve the error propagation problem in sequence generation. Specifically, we introduce two-stream self-attention into neural machine translation and further design an error correction mechanism based on two-stream self-attention, which is able to correct previously predicted errors while generating the next token. Experimental results on three IWSLT tasks and two WMT tasks demonstrate that our method outperforms previous methods, including scheduled sampling, and alleviates the problem of error propagation effectively. In the future, we expect to apply our method to other sequence generation tasks, e.g., text summarization and unsupervised neural machine translation, and to incorporate our error correction mechanism into other advanced structures.

Acknowledgments
This paper was supported by The National Key Research and Development Program of China under Grant 2017YFB1300205.

References
[Bahdanau et al., 2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv e-prints, abs/1409.0473, September 2014.

[Bengio et al., 2015] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS, pages 1171–1179, 2015.

[Cho et al., 2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, October 2014.

[Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, pages 4171–4186, 2019.

[Hassan et al., 2018] Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. Achieving human parity on automatic chinese to english news translation. arXiv preprint arXiv:1803.05567, 2018.

[He et al., 2018] Tianyu He, Xu Tan, Yingce Xia, Di He, Tao Qin, Zhibo Chen, and Tie-Yan Liu. Layer-wise coordination between encoder and decoder for neural machine translation. In NIPS, pages 7944–7954, 2018.

[Kingma and Ba, 2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference for Learning Representations, 2014.

[Ott et al., 2019] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL, June 2019.

[Radford et al., 2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.

[Ranzato et al., 2016] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In ICLR, 2016.

[Sennrich et al., 2016] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, August 2016.

[Song et al., 2018] Kaitao Song, Xu Tan, Di He, Jianfeng Lu, Tao Qin, and Tie-Yan Liu. Double path networks for sequence to sequence learning. In COLING, pages 3064–3074, August 2018.

[Song et al., 2019] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MASS: Masked sequence to sequence pre-training for language generation. In ICML, pages 5926–5936, 2019.

[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112, 2014.

[Tan et al., 2019] Xu Tan, Yingce Xia, Lijun Wu, and Tao Qin. Efficient bidirectional neural machine translation. arXiv preprint arXiv:1908.09329, 2019.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.

[Venkatraman et al., 2015] Arun Venkatraman, Martial Hebert, and J. Andrew Bagnell. Improving multi-step prediction of learned time series models. In AAAI, 2015.

[Wu et al., 2018] Lijun Wu, Xu Tan, Di He, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. Beyond error propagation in neural machine translation: Characteristics of language also matter. arXiv preprint arXiv:1809.00120, 2018.

[Wu et al., 2019] Lijun Wu, Xu Tan, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. Beyond error propagation: Language branching also affects the accuracy of sequence generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(12):1868–1879, 2019.

[Xia et al., 2019] Yingce Xia, Tianyu He, Xu Tan, Fei Tian, Di He, and Tao Qin. Tied transformers: Neural machine translation with shared encoder and decoder. In AAAI, pages 5466–5473, 2019.

[Yang et al., 2019] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In NIPS, pages 5754–5764, 2019.

[Zhang et al., 2019] Wen Zhang, Yang Feng, Fandong Meng, Di You, and Qun Liu. Bridging the gap between training and inference for neural machine translation. In ACL, pages 4334–4343, July 2019.
