
Improving Neural Machine Translation with External Information
Thesis Proposal

Jindřich Helcl
Charles University, Faculty of Mathematics and Physics

Institute of Formal and Applied Linguistics
Malostranské náměstí 25, 118 00 Prague, Czech Republic

[email protected]

Abstract

Results of previous studies in the field suggest that neural machine translation (NMT) models can achieve better performance by exploiting information from external sources. In this thesis proposal, we categorize a variety of such enhanced models into four classes – multimodal, multi-source, linguistically co-supervised, and linguistically inspired – and summarize some of the approaches in each of these categories found in the literature. Part of this work is the development of a sequence-to-sequence learning toolkit designed for fast prototyping of various kinds of experiments. We report the results of experiments we conducted in the past, which were mainly based on multimodal models. In the future, we plan to focus on linguistically co-supervised models, which could make use of the abundance of available linguistic annotation.

1 Introduction

With the advent of deep learning, the model architectures we use for natural language processing (NLP) tasks have simplified to such a level that they do not use any external information, be it expert knowledge (e.g. linguistic annotations) or equivalent information expressed in a different modality (e.g. photos or sound). This means that we are losing valuable sources of linguistic information which, according to many studies, retain their value in a number of NLP tasks, such as machine translation (Sennrich and Haddow, 2016; Eriguchi et al., 2017), question answering (Zhang et al., 2017), or sentiment analysis (Matsumoto et al., 2005).

In this text, we categorize the models that exploit external information into multimodal, multi-source, linguistically co-supervised, and linguistically inspired. We present a selection of works from each of these categories.

We present our contributions to the category of multimodal models. We show the results of the experiments conducted in previous years, mainly as submissions to the shared tasks at the Conference on Machine Translation (WMT).

For our future research, we plan to contribute to the category of linguistically co-supervised models applied to neural machine translation (NMT). The main feature of linguistically co-supervised models is the use of linguistic annotations in the training process, rather than the imposition of hard constraints on the network architecture itself. This can be achieved, for example, in a multi-task learning scenario.

We see a particularly promising research direction in using our in-house annotated corpora, such as the Prague Czech-English Dependency Treebank (Hajič et al., 2012), which may demonstrate the usefulness of these high-value resources.

The text of this proposal is structured as follows. In Section 2, we describe the standard models used for NMT. Section 3 gives an overview of existing methods of incorporating linguistic information into deep learning models. Section 4 sums up the experiments already conducted and presents their results. In Section 5, we present a number of ideas for future experiments. We conclude in Section 6.

2 Neural Machine Translation

The goal of NMT is to train a neural network to read a sentence in the source language and generate a corresponding sentence in the target language. In a broader context, this task can be viewed as an instance of a sequence-to-sequence (S2S) learning problem (Sutskever et al., 2014).

Figure 1: Scheme of the classic sequence-to-sequence model by Sutskever et al. (2014).

In S2S learning, the source sequence X = (x_1, ..., x_{T_x}) is first encoded into a fixed-length vector h_{T_x} using a recurrent neural network (RNN). Next, in each step, the output of the previous step is fed to the decoding RNN, which updates its hidden state s_{i-1}, producing a new state s_i and a new output symbol y_i. This step is repeated until a special symbol <eos> is emitted.

Since plain RNNs can suffer from the vanishing gradient problem, they were replaced in the first NMT experiments with Long Short-Term Memory networks (LSTM; Hochreiter and Schmidhuber, 1997). Another RNN variant, which we used in our experiments, is the Gated Recurrent Unit network (GRU; Cho et al., 2014). Both of these variants introduce a gating system which mitigates the vanishing gradient problem in long-range dependencies through additive memory updates with linear derivatives.

The following equations describe a GRU-based sequence-to-sequence model:

h_j = \mathrm{GRU}_{enc}(h_{j-1}; E_{enc} x_j),    (1)
s_0 = \tanh(V_s h_{T_x} + b_s),    (2)
s_i = \mathrm{GRU}_{dec}(s_{i-1}; E_{dec} y_{i-1})    (3)

where h_0, ..., h_{T_x} are the hidden states of the encoder (h_0 being a zero vector), E_{enc} and E_{dec} are the so-called embedding matrices, which assign a real-valued vector to each word in the vocabulary, V_s and b_s are additional trainable parameters of the linear projection of h_{T_x}, and s_0, ..., s_{T_y} are the hidden states of the decoder. As y_0, we use a special <go> symbol that denotes the start of the sentence. T_x and T_y are the lengths of the source and target sequences, respectively. We use a semicolon to denote vector concatenation.

The actual output word is selected based on the output of the decoder network:

t_i = W_o (s_i; E_{dec} y_{i-1}) + b_o,    (4)
P(y_i | x, y_{<i}) \propto \exp t_i,    (5)
y_i = \mathrm{argmax}_y P(y | x, y_{<i}).    (6)
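To make the decoding loop concrete, the following is a minimal NumPy sketch of a single greedy decoding step implementing Equations 3–6. The code is our illustration, not part of the proposal; the function gru_dec and the trained parameters W_o, b_o, and the previous-word embedding are assumed given.

```python
import numpy as np

def softmax(t):
    # Numerically stable normalization of the logits (Equation 5).
    e = np.exp(t - t.max())
    return e / e.sum()

def greedy_decode_step(s_prev, y_prev_emb, gru_dec, W_o, b_o):
    # Equation 3: update the decoder state from the previous state
    # and the embedding of the previous output word.
    s_i = gru_dec(s_prev, y_prev_emb)
    # Equation 4: project the new state concatenated with the
    # previous embedding to logits over the output vocabulary.
    t_i = W_o @ np.concatenate([s_i, y_prev_emb]) + b_o
    # Equations 5-6: normalize and pick the most probable word index.
    return s_i, int(np.argmax(softmax(t_i)))
```

The loop stops once the returned index corresponds to the <eos> symbol.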

One of the most significant improvements over the baseline S2S model is the introduction of the attention mechanism (Bahdanau et al., 2014). Similarly to content-based addressing in Neural Turing Machines (Graves et al., 2014), the attention mechanism allows us to use context-sensitive information in the decoding phase.

In each decoder step i, the output of the attention mechanism is a context vector c_i, which is then concatenated to the input of the decoder RNN. Equations 3 and 4 are thus adjusted to work with the context vector as follows:

s_i = \mathrm{GRU}_{dec}(s_{i-1}; c_i; E_{dec} y_{i-1}),    (7)
t_i = W_o (s_i; c_i; E_{dec} y_{i-1}) + b_o.    (8)

In the i-th step of the decoder, the context vector c_i is defined as a weighted sum of the encoder states using the attention distribution \alpha_{ij}, which is in turn obtained by normalizing the attention energies e_{ij}:

e_{ij} = v_a^\top \tanh(W_a s_i + U_a h_j),    (9)
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})},    (10)
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j.    (11)

The trainable parameters W_a and U_a are projection matrices that transform the decoder and encoder states s_i and h_j into a common vector space, and v_a is a weight vector over the dimensions of this space. For the sake of clarity, bias terms (applied every time a vector is linearly projected using a weight matrix) are omitted.
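As an illustration, a NumPy sketch of Equations 9–11 might look as follows; the vectorized formulation and variable names are our assumptions:

```python
import numpy as np

def attention(s_i, H, W_a, U_a, v_a):
    """Bahdanau-style attention (Equations 9-11).

    s_i : decoder state, shape (d_dec,)
    H   : encoder states, shape (T_x, d_enc)
    W_a, U_a project both into a common space; v_a scores it.
    """
    # Equation 9: attention energies for every encoder state.
    e = np.tanh(s_i @ W_a.T + H @ U_a.T) @ v_a        # shape (T_x,)
    # Equation 10: normalize the energies into a distribution.
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # Equation 11: context vector as a weighted sum of encoder states.
    return alpha @ H
```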


En: A wall divided the city.
De 1: Eine Wand teilte die Stadt. ✗
De 2: Eine Mauer teilte die Stadt. ✓

Figure 2: An illustration of disambiguation using visual features. Example taken from Specia et al. (2016).

The model is trained by minimizing a loss function defined over the output probability distribution. A commonly used loss function for NMT models is the negative log-likelihood:

L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log P(y_i | x_i, y_{<i}, \theta)    (12)

where N is the number of training examples and \theta is the set of all trainable parameters.
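Assuming the model has already produced the per-word log-probabilities of the gold outputs, the corpus-level loss of Equation 12 reduces to a simple average; a small illustrative sketch:

```python
import numpy as np

def nll_loss(sentence_log_probs):
    """Negative log-likelihood (Equation 12).

    sentence_log_probs[n] is an array with log P(y_i | x_n, y_<i, theta)
    for every gold word of the n-th training sentence, as computed by
    the model's softmax output layer."""
    N = len(sentence_log_probs)
    return -sum(np.sum(lp) for lp in sentence_log_probs) / N
```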

3 Related Work

There is a significant amount of evidence suggesting that the use of external information can help improve NMT.

In this section, we divide these studies into four categories according to the type of the external information. First, Section 3.1 deals with multimodal translation – exploiting information provided in an additional modality, specifically images. Second, Section 3.2 describes approaches based on using linguistic annotations. Third, Section 3.3 presents studies that are linguistically inspired, although they do not use linguistic annotation data directly. Finally, Section 3.4 shows results of multi-source translation models.

3.1 Multimodal Translation

Multimodal translation is the task of generating the target sentence given a source sentence and additional information in a different modality. In this section, we focus on translation models enhanced with images. This means our task is to translate an image caption from one language to another, provided we can access the image itself.

Elliott et al. (2015) argue that including the image features in the model architecture can help the translation system to disambiguate between different meanings. Figure 2 gives an example where visual features provide such information for disambiguation.

In their work, they employ a S2S model as described in Equations 1–6 which is provided with the visual features obtained from the penultimate layer of the VGG-16 object recognition network (Simonyan and Zisserman, 2014). In each step of the encoder and/or decoder, they include the visual feature vector among the inputs of the RNN (i.e. they concatenate the visual feature vector with the rest of the inputs in Equations 1 and 3). Although they argue the models benefit from the added modality, their best model did not outperform the textual baseline in terms of BLEU/METEOR (Papineni et al., 2002; Denkowski and Lavie, 2011).

Xu et al. (2015) introduced an attention mechanism for image captioning – the task of generating a textual description of an image. Instead of attending to a set of states of an RNN encoder, they employ the same technique over the components of the last convolutional layer of the VGG-16 network. This way, they were able to extract context-sensitive image features relevant to a given RNN decoder state.

In their WMT16 submission, Caglayan et al. (2016a) tried to combine the attention distributions instead of combining the resulting context vectors. They perform the weighted sum from Equation 11 over the states of both modalities:

c_i = \sum_{j=1}^{T_x} \alpha^{txt}_{ij} h_j + \sum_{j=1}^{196} \alpha^{img}_{ij} v_j    (13)

where v_j is the j-th vector in the last convolutional layer (which contains 196 such vectors, corresponding to the image down-scaled to 14-by-14 regions). In this case, they made the strong assumption that the network can be trained in such a way that the encoder states and the states of the convolutional network occupy the same vector space. Perhaps as a consequence, the score of their multimodal MT system remained far below the text-only setup.
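In code, the shared-space weighted sum of Equation 13 could be sketched as follows (a minimal NumPy illustration; the variable names are ours):

```python
import numpy as np

def shared_space_context(alpha_txt, H_txt, alpha_img, V_img):
    """Equation 13: a single weighted sum over the states of both
    modalities. H_txt has shape (T_x, d), V_img shape (196, d);
    the shared dimensionality d embodies the strong assumption
    discussed in the text."""
    return alpha_txt @ H_txt + alpha_img @ V_img
```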

Soon after, Caglayan et al. (2016b) introduced the multimodal attention mechanism. In their version, they compute the context vectors for both the image and the source sentence in each decoder step. They adapt Equations 7 and 8 by supplying either the sum or the concatenation of the single-modal context vectors as c_i. In this work, they reported slight improvements over the textual baseline on the Multi30k dataset, which was made available for the WMT16 multimodal translation task (Elliott et al., 2016). A significant difference between this work and the previous approaches is that instead of the VGG-19 network, they use the ResNet-50 network for extracting the visual features (He et al., 2016).

More recent state-of-the-art results on the Multi30k dataset were achieved by Calixto et al. (2017). Their best-performing architecture uses the last fully-connected layer of the VGG-19 network for the initialization of the decoder and attends only to the states of the RNN (text) encoder.

Elliott and Kádár (2017) brought further improvements by introducing an "imagination" component to the neural network architecture. Given the source sentence, the network is trained to output the target sentence jointly with predicting the image vector. The model uses the visual information only as regularization and is thus able to use additional parallel data without accompanying images.

3.2 Linguistically Co-Supervised Models

Training of linguistically co-supervised models requires an external source of linguistic annotation. The linguistic annotation itself can be of any kind. In most of the published studies, we encounter the use of part-of-speech (POS) tags, syntax, or named entities.

Sennrich and Haddow (2016) introduce linguistic features (lemma, POS, and dependency label) as an additional input to the encoder, as previously applied to neural language models (Alexandrescu and Kirchhoff, 2006). In this scheme, each feature type (factor) has its own embedding matrix. The embedded factors are concatenated together with the word embedding (Equation 1). The rest of the architecture is identical to the standard NMT model described above. The results on the WMT16 test data show that using all factors together gives significant improvements over a strong baseline in terms of BLEU and chrF3 (Popović, 2015). The study also shows a contrastive comparison of the effect of different feature types on the translation quality.
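The factored input can be pictured as follows; this is an illustrative sketch, and the factor set and embedding matrices are our assumptions:

```python
import numpy as np

def embed_factors(word_id, lemma_id, pos_id, dep_id,
                  E_word, E_lemma, E_pos, E_dep):
    """Factored input in the spirit of Sennrich and Haddow (2016):
    each factor has its own embedding matrix, and the embedded
    factors are concatenated into a single encoder input vector
    that replaces the plain word embedding of Equation 1."""
    return np.concatenate([E_word[word_id], E_lemma[lemma_id],
                           E_pos[pos_id], E_dep[dep_id]])
```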

Similarly, Li et al. (2017) focus on providing source syntax information to the encoder. They propose three encoder models which operate on the input sentences combined with sequences of labels from linearized phrase-structure trees. First, the parallel encoder is composed of two RNNs – one processes the sequence of words, and the other processes the sequence of syntactic labels. The states of the word encoder are then concatenated with those syntax encoder states which correspond to the syntactic labels of the words (leaves of the tree). Second, the hierarchical encoder is also divided into two RNNs. Unlike the parallel encoder, the hierarchical encoder first processes the parse tree and then concatenates the tree-leaf RNN states with the embeddings of the corresponding words. After that, the second RNN is run over the updated embeddings. Finally, the mixed encoder is a single RNN that operates on a (mixed) sequence of both syntactic labels and words. This sequence is ordered so that the words are inserted into the network immediately after their corresponding labels. The mixed encoder is shown to have the best performance of all the proposed models. It outperforms the RNNSearch model of Bahdanau et al. (2014) on the NIST MT02–MT05 datasets.

Bastings et al. (2017) use so-called graph convolutional networks in the encoder. They use a dependency parser to preprocess the training data so that each word has its dependency label and a pointer to its head in the tree. The graph-convolutional encoder processes the input sequence in multiple layers. In each layer, the information from the dependent nodes is mixed with the state of the head. This way, the information from one node propagates through the dependency graph layer by layer.

The following approaches employ multi-task learning techniques. The loss function from Equation 12 can be changed to contain the joint probability of predictions for all the different types of labels.
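A common way to realize such a joint objective is a weighted sum of per-task losses; a schematic sketch follows (the weighting scheme is our assumption, not prescribed by the cited papers):

```python
def multitask_loss(translation_nll, aux_nlls, weights):
    # Translation loss (Equation 12) extended with weighted negative
    # log-likelihoods of the auxiliary labeling tasks (e.g. POS
    # tagging or parsing), all computed on the same training batch.
    return translation_nll + sum(w * l for w, l in zip(weights, aux_nlls))
```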

Eriguchi et al. (2017) found that training the model to parse and translate helps NMT. They present a hybrid model between NMT and the recurrent neural network grammar (RNNG; Dyer et al., 2016). The model replaces the parser's buffer with the decoder state, which enables the decoder to control the parser. For training the model, instead of using manually annotated parallel corpora, they use the SyntaxNet parser (Andor et al., 2016) to generate the training parse trees automatically. This model gains significant improvements over the NMT baseline on WMT newstest2016 on three of the four language pairs experimented with.

Another example of multi-task learning is the work of Tamchyna et al. (2017). Their NMT system works in two steps. First, they train the system to produce the lemma and POS tag of the target word (in a serialized form). Second, they use a deterministic generator to produce the target words from the lemmas and tags. In order to train the system, they preprocess the target side of the training data with morphological analyzers and add the lemma and POS information to the training corpus. With this method, they obtain significant improvements, especially in English-to-Czech translation, where the target language is morphologically rich.

Finally, Niehues and Cho (2017) systematically explore multi-task learning setups for three tasks – machine translation, POS tagging, and named-entity recognition. They experiment with three different levels of sharing model parts. First, the shared encoder model uses a single encoder for the source sentence and separate decoders and attention mechanisms for each of the tasks. Second, the shared attention model shares not only the encoder but also the attention mechanism across the tasks. This is achieved by sharing the U_a and v_a parameters from Equation 9. Third, the shared decoder model shares the encoder, the attention model, and the decoder for all the tasks. The only parts of the model not shared among the tasks are then the W_o and b_o parameters of Equation 8.

3.3 Linguistically Inspired Models

There are more ways of exploiting linguistic information than providing the training procedure directly with explicit annotations. The category of linguistically inspired models brings together methods that apply changes to the model architecture which are, to some extent, inspired by linguistic theory.

To tackle the problem of translating low-frequency words, Arthur et al. (2016) incorporate translation lexicons into NMT. The probabilities from the translation lexicon are first converted into predictive probabilities over the next word using the attention distribution (Equation 10). They present two ways of combining the lexicon predictive probabilities with the probabilities output by the model. First, the model bias approach introduces a hyper-parameter that controls how much the model is biased towards relying on the lexicon. Second, the linear interpolation approach uses a trainable parameter which controls the interpolation of the lexicon-based probabilities and the output probabilities.
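For instance, the linear interpolation variant could be sketched in a single line; the function form is our illustration:

```python
def interpolate_with_lexicon(p_model, p_lexicon, gamma):
    # Mix the lexicon-based predictive distribution with the model's
    # output distribution; gamma in (0, 1) is a trainable scalar
    # learned jointly with the rest of the model.
    return gamma * p_lexicon + (1.0 - gamma) * p_model
```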

Another linguistically inspired approach is source-side latent graph parsing (Hashimoto and Tsuruoka, 2017). In these experiments, a pre-trained LSTM dependency parser is used to process the input, and the parse tree is provided as input to a recurrent decoder with an attention mechanism. The parameters of the LSTM parser remain trainable during the training of the translation model, without using any additional annotated data.

3.4 Multi-source NMT

Multi-source NMT is the task of translating a sentence (which can be in one of multiple source languages) into a sentence in a common target language.

One of the first experiments with multi-source NMT was conducted by Zoph and Knight (2016). They use a trilingual parallel corpus to train a model that translates into English using a French and a German source sentence. In their paper, however, they do not report scores of the model when one of the source sentences is missing.

Johnson et al. (2016) present an extension of the Google NMT model (Wu et al., 2016) that works with many source and target languages. They describe separately the scenarios for many-to-one, one-to-many, and many-to-many translation models. In these models, they add a special token to the source sentence which specifies the target language (so that the encoder can adjust to this information). Interestingly, they found that for a fixed language pair, adding data with a different source language can help the translation quality of the original language pair.
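The preprocessing itself is a one-liner; a sketch follows, where the token format is our illustrative assumption, inspired by the commonly used <2xx> convention:

```python
def add_target_language_token(source_tokens, target_lang):
    # Mark the desired target language directly in the source
    # sequence, e.g. ["<2de>", "hello", "world"] for translation
    # into German.
    return ["<2%s>" % target_lang] + source_tokens
```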

4 Experiments

This section presents the experiments we have conducted so far and discusses the results. First, Section 4.1 describes our toolkit designed for fast prototyping of neural architectures. Section 4.2 gives an overview of our models submitted to the WMT16 multimodal translation and post-editing tasks. Section 4.3 presents our method for combining attention mechanisms in the multi-source setup. Section 4.4 presents the revisited models used in our submission to the WMT17 multimodal translation task.

4.1 Neural Monkey

We developed a toolkit for the fast prototyping of neural network models for our experiments. This section gives a brief description of the toolkit and its main features.

Neural Monkey (Helcl and Libovický, 2017a) is an open-source software package implemented in Python 3 using the TensorFlow library (Abadi et al., 2016). Its goal is to provide a higher-level API which enables users to quickly prototype and train new architectures without requiring knowledge of the implementation details. For that reason, we try to use building blocks that are as large and abstract as possible.

Unlike other toolkits, such as tfLearn¹ or Lasagne,² our building blocks are abstract objects (e.g. encoders or classifiers) rather than individual network layers. These objects are parametrized so that their actual properties (e.g. dimensionality of embeddings or hidden states, dropout probability, number of layers) can be set externally. This design decision allows us to keep the experiment configuration separate from the actual code, in a single comprehensive configuration file.

¹ https://github.com/tflearn/tflearn
² https://github.com/Lasagne/Lasagne

Neural Monkey is still under development, and its mission is to become an ever-growing collection of recent innovations in S2S learning, so that its users are able to easily try out the models on their specific tasks and datasets. With its simple experiment management, it is used as a ready-made, easily extensible toolkit for experiments with machine translation, image captioning, text classification, or scene text recognition.
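To illustrate the design philosophy (not the toolkit's actual API – the snippet below is a schematic Python sketch under our own naming), a building block receives all its hyper-parameters from a configuration that lives outside the code:

```python
# Hypothetical configuration, e.g. parsed from a standalone file.
config = {
    "encoder": {"rnn_size": 512, "embedding_size": 300, "dropout": 0.3},
    "decoder": {"rnn_size": 512, "embedding_size": 300, "dropout": 0.3},
}

class SentenceEncoder:
    """An abstract building block: a whole encoder, not a layer."""
    def __init__(self, rnn_size, embedding_size, dropout):
        # In the real toolkit, constructing such an object builds the
        # corresponding part of the TensorFlow computation graph.
        self.rnn_size = rnn_size
        self.embedding_size = embedding_size
        self.dropout = dropout

encoder = SentenceEncoder(**config["encoder"])
```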

4.2 WMT16 Multimodal Translation and Automatic Post-Editing Tasks

This section describes our submissions to the WMT16 automatic post-editing and multimodal translation shared tasks (Libovický et al., 2016). The goal of the automatic post-editing task is to automatically correct a machine-generated translation while also having access to the corresponding source-language sentence. As mentioned above, the goal of multimodal translation is to translate an image description from one language to another, with access to the image itself.

Our method uses the neural translation model with attention by Bahdanau et al. (2014) and extends it to include an arbitrary number of encoders (see Figure 3). Each input sentence enters the system simultaneously in several representations X^{(k)}. The encoder used for the k-th representation is either a bidirectional GRU network (for textual input) or the VGG-16 convolutional neural network (CNN).

The initial state of the decoder is computed as a weighted combination of the final states of all the encoders. As the final state of a CNN encoder, we use the activation vector from the penultimate layer of the network.

The attention is computed over each encoder separately as described in Equations 9, 10, and 11. The context vectors are concatenated prior to computing the distribution over the output vocabulary. Here is a revisited version of Equation 8 that reflects this approach:

t_i = W_o (s_i; E_{dec} y_{i-1}; c^{(1)}; \ldots; c^{(k)}) + b_o.    (14)

For the automatic post-editing shared task, we use two encoders. The first encoder processes the source sentence and the second encoder processes the machine-generated translation. Rather than training the model to generate the post-edited sentence directly, we trained it to output a sequence of edit operations. An edit operation can be either a word from the output vocabulary (in which case the word is inserted into the output), or one of two special symbols, keep or delete. To generate the final output, the edit operation sequence is applied to the machine-generated translation using a deterministic procedure.
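The deterministic application of the edit operations could look like the following sketch; the token names and the handling of trailing tokens are our assumptions:

```python
def apply_edit_ops(mt_tokens, edit_ops):
    # Walk through the machine translation and the predicted edit
    # sequence in parallel: 'keep' copies the next MT token, 'delete'
    # skips it, and any other symbol is inserted into the output.
    output, pos = [], 0
    for op in edit_ops:
        if op == "<keep>":
            if pos < len(mt_tokens):
                output.append(mt_tokens[pos])
            pos += 1
        elif op == "<delete>":
            pos += 1
        else:
            output.append(op)
    # Assumption: MT tokens not covered by any operation are kept.
    return output + mt_tokens[pos:]
```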

For the multimodal translation task, we first translated the training dataset with Moses (Koehn et al., 2007). Then, we trained a model with three encoders – one CNN encoder for the image (with fixed parameters), and two textual encoders for the source image description and the Moses translation.

Since the target language for both tasks was German, we also applied language-dependent text preprocessing. Before training, we split the contracted prepositions and articles (am ↔ an dem, zur ↔ zu der, ...) and separated some pronouns from their case endings (keinem ↔ kein -em, unserer ↔ unser -er, ...). We also tried splitting compound nouns into smaller units, but on the relatively small datasets we worked with, it did not bring any improvement.

Figure 3: Multi-encoder architecture used for the multimodal translation.
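A minimal sketch of this rule-based preprocessing follows; the tables below show only the examples named above, while the real rule sets are larger:

```python
# Contracted prepositions/articles and pronoun case-ending splits;
# the inverse mapping is applied after decoding to merge them back.
CONTRACTIONS = {"am": "an dem", "zur": "zu der"}
PRONOUN_SPLITS = {"keinem": "kein -em", "unserer": "unser -er"}

def preprocess(tokens):
    out = []
    for tok in tokens:
        low = tok.lower()
        if low in CONTRACTIONS:
            out.extend(CONTRACTIONS[low].split())
        elif low in PRONOUN_SPLITS:
            out.extend(PRONOUN_SPLITS[low].split())
        else:
            out.append(tok)
    return out
```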

In this submission, we also experimented with some improvements to S2S learning, such as scheduled sampling (Bengio et al., 2015), the noisy activation function (Gulcehre et al., 2016), and the linguistic coverage model (Tu et al., 2016). However, none of these methods was able to improve the performance of our systems.

4.3 Attention Combination Strategies for Multi-Source S2S Learning

In this section, we describe the flat and hierarchical attention combination strategies (Libovický and Helcl, 2017). We employ this architecture in S2S learning with multiple input sequences of various modalities and a single RNN decoder.

The prior approaches to multi-source S2S learning do not explicitly model the different importance of the inputs to the decoder (Firat et al., 2016; Zoph and Knight, 2016). An example motivating scenario is multimodal translation, where we might expect the image description to be the primary source of information, whereas the image features would help mainly with visual disambiguation.

We describe the two combination strategies in more detail.

Flat attention combination In this strategy, we project the states of all encoders into a common vector space and then compute the attention over the projected vectors.

The difference between the concatenation of the context vectors (as seen e.g. in Caglayan et al., 2016b) and the flat attention combination is that the \alpha^{(k)}_{ij} coefficients are computed jointly for all encoders (a modified Equation 10):

\alpha^{(k)}_{ij} = \frac{\exp(e^{(k)}_{ij})}{\sum_{n=1}^{N} \sum_{m=1}^{T_x^{(n)}} \exp(e^{(n)}_{im})}    (15)

where T_x^{(n)} is the length of the n-th encoder input sequence and e^{(k)}_{ij} is the attention energy of the j-th state of the k-th encoder in the i-th decoding step. These attention energies are computed according to Equation 9 – the parameters v_a and W_a in the equation are common for all the encoders, whereas the matrix U_a is encoder-specific and serves as a projection matrix from each encoder state space into a single shared vector space.

Since the states of the encoders occupy different vector spaces, they can have different dimensionality. Hence, the context vector cannot be computed directly as their weighted sum (Equation 11). Therefore, we project the encoder states into a single space using linear projections:

c_i = \sum_{k=1}^{N} \sum_{j=1}^{T_x^{(k)}} \alpha^{(k)}_{ij} U_c^{(k)} h^{(k)}_j    (16)

where U_c^{(k)} are additional trainable parameters. The projection matrices U_a^{(k)} and U_c^{(k)} project the states of one encoder into vector spaces of equal dimensionality. In our experiments, we tried both setting these projection matrices equal and training them separately.
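Putting Equations 15 and 16 together, a NumPy sketch of the flat combination might read as follows; the names and loop structure are our illustration:

```python
import numpy as np

def flat_attention(s_i, encoder_states, W_a, v_a, U_a_list, U_c_list):
    """Flat attention combination (Equations 15-16): the energies of
    all encoders are normalized jointly, and the encoder states are
    projected by U_c^(k) into a shared space before the weighted sum."""
    # Per-encoder energies (Equation 9 with encoder-specific U_a).
    energies = [np.tanh(s_i @ W_a.T + H @ U_a.T) @ v_a
                for H, U_a in zip(encoder_states, U_a_list)]
    # Equation 15: joint softmax over all encoder positions.
    flat = np.concatenate(energies)
    alphas = np.exp(flat - flat.max())
    alphas /= alphas.sum()
    # Equation 16: weighted sum of the projected encoder states.
    context, offset = 0.0, 0
    for H, U_c in zip(encoder_states, U_c_list):
        a = alphas[offset:offset + len(H)]
        context = context + a @ (H @ U_c.T)
        offset += len(H)
    return context
```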

Hierarchical attention combination This combination strategy divides the computation of the attention distribution into two steps: First, it computes the context vector for each encoder independently using Equations 9–11. Second, it projects the context vectors into a shared vector space (Equation 17), computes another distribution over the projected context vectors (Equation 18), and takes their corresponding weighted average (Equation 19):

e^{(k)}_i = v_b^\top \tanh(W_b s_i + U_b^{(k)} c^{(k)}_i),    (17)
\beta^{(k)}_i = \frac{\exp(e^{(k)}_i)}{\sum_{n=1}^{N} \exp(e^{(n)}_i)},    (18)
c_i = \sum_{k=1}^{N} \beta^{(k)}_i U_c^{(k)} c^{(k)}_i    (19)

where c^{(k)}_i is the context vector of the k-th encoder, the additional trainable parameters v_b and W_b are shared for all encoders, and U_b^{(k)} and U_c^{(k)} are encoder-specific projection matrices that can be either set equal or trained independently, similarly to the case of the flat attention combination.
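A corresponding NumPy sketch of Equations 17–19, assuming the per-encoder context vectors were already computed with the attention function of Equations 9–11:

```python
import numpy as np

def hierarchical_attention(s_i, contexts, W_b, v_b, U_b_list, U_c_list):
    """Hierarchical attention combination (Equations 17-19): the
    per-encoder context vectors c^(k) are scored against the decoder
    state, normalized into beta, and mixed after projection."""
    # Equation 17: one energy per encoder.
    e = np.array([np.tanh(W_b @ s_i + U_b @ c) @ v_b
                  for c, U_b in zip(contexts, U_b_list)])
    # Equation 18: normalize over the encoders.
    beta = np.exp(e - e.max())
    beta /= beta.sum()
    # Equation 19: weighted sum of the projected context vectors.
    return sum(b * (U_c @ c)
               for b, c, U_c in zip(beta, contexts, U_c_list))
```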

Of all the proposed methods, the best performing was the hierarchical attention combination with independently trained encoder projections (matrices U_a and U_c). In this setting, we were able to significantly outperform the concatenation baseline.

Both the hierarchical and the flat combination strategies provide an explicit way to interpret the different importance of each input. Figure 4 shows a hierarchical attention distribution. The figure also includes the sentinel gate (Lu et al., 2016), which allows the decoder not to attend to any of the input encoders. Experimenting with the sentinel gate, however, did not bring any improvements, and the details are beyond the scope of this proposal.

Source: a man sleeping in a green room on a couch .
Reference: ein Mann schläft in einem grünen Raum auf einem Sofa .
Output with attention: ein Mann schläft auf einem grünen Sofa in einem grünen Raum .

Figure 4: Visualization of hierarchical attention in MMT. Each column in the diagram corresponds to the weights of the encoders and the sentinel: (1) source, (2) image, (3) sentinel. Note that despite the overall low importance of the image encoder, it gets activated for the content words.

4.4 WMT17 Multimodal Translation Task

Our submission to this year's WMT multimodal shared task (Helcl and Libovický, 2017b) employs the hierarchical attention combination described above. Figure 5 shows the diagram of the model.

We expanded the training dataset using additional data acquired from several sources. First, we back-translated the German descriptions that were part of the Multi30k dataset (Elliott et al., 2016). This gave us six times more training data because, besides the German translations of the image descriptions, the Multi30k dataset also contains five independent German descriptions for each image. Second, using a language model trained on the image descriptions, we selected similar sentence pairs from the parallel SDEWAC corpus (Faaß and Eckart, 2013) and the German parts of WMT parallel corpora, such as EU Bookshop (Skadiņš et al., 2014), News Commentary (Tiedemann, 2012), and CommonCrawl (Smith et al., 2013).

We also reported a number of negative results. First, we tried to rescore the k-best hypotheses using a multimodal classifier. Second, we switched the optimization criterion from the negative log-likelihood to optimizing directly towards BLEU using self-critical training (Rennie et al., 2016). Despite the current negative results, we believe that these methods may be relevant for future development in the field.

Figure 5: An overall picture of the multimodal model using hierarchical attention combination on the input. Here, α and β are normalized coefficients computed by the attention models, and w_i is the i-th input to the decoder.

5 Future Work

In the following paragraphs, we outline our proposed future work. The main topic of the research is to explore ways of using external information to improve neural machine translation.

Explore multi-task learning solutions. In our future work, we aim to explore more multi-task learning schemes, which will serve as a platform for models that learn to predict target-side linguistic features. Not only does this have the potential to be beneficial for the primary task (i.e. translation), but it can also show which architectures are able to capture the linguistic information implicitly and which architectures need supervision from human annotation.

Specialize in Czech annotated data. We will focus our experiments on exploiting the abundance of high-quality annotated corpora available at our institute. As we see in the literature, for English, including linguistic annotation in the model has great potential to improve performance. We believe that the scale of the improvement will be even higher for a language with rich morphology and non-projective dependency structures, such as Czech. Moreover, Czech is a perfect candidate because the large amounts of data essential for conducting state-of-the-art deep learning experiments are available for it.

6 Conclusions

In this text, we explained the principles of neural machine translation. We further categorized the related work that exploits external information in translation architectures. We presented the experiments we conducted in the past and suggested ideas for future work.

In the future work proposal, we put emphasis on harnessing the manually annotated datasets which were created at our institute, in order to explore their potential for improving translation quality.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.

Andrei Alexandrescu and Katrin Kirchhoff. 2006. Factored neural language models. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers. Association for Computational Linguistics, Stroudsburg, PA, USA, NAACL-Short '06, pages 1–4. http://dl.acm.org/citation.cfm?id=1614049.1614050.

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 2442–2452. http://www.aclweb.org/anthology/P16-1231.

Philip Arthur, Graham Neubig, and Satoshi Nakamura. 2016. Incorporating discrete translation lexicons into neural machine translation. arXiv:1606.02006 [cs]. http://arxiv.org/abs/1606.02006.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473. http://arxiv.org/abs/1409.0473.

Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Sima'an. 2017. Graph convolutional encoders for syntax-aware neural machine translation. arXiv:1704.04675 [cs]. http://arxiv.org/abs/1704.04675.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam M. Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, NIPS. http://arxiv.org/abs/1506.03099.

Ozan Caglayan, Walid Aransa, Yaxing Wang, Marc Masana, Mercedes García-Martínez, Fethi Bougares, Loïc Barrault, and Joost van de Weijer. 2016a. Does multimodality help human and machine for translation and image captioning? arXiv:1605.09186 [cs], pages 627–633. https://doi.org/10.18653/v1/W16-2358.

Ozan Caglayan, Loïc Barrault, and Fethi Bougares. 2016b. Multimodal attention for neural machine translation. arXiv:1609.03976 [cs]. http://arxiv.org/abs/1609.03976.

Iacer Calixto, Qun Liu, and Nick Campbell. 2017. Incorporating global visual features into attention-based neural machine translation. CoRR abs/1701.06521. http://arxiv.org/abs/1701.06521.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Association for Computational Linguistics, Doha, Qatar, pages 103–111. http://www.aclweb.org/anthology/W14-4012.

Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, Edinburgh, United Kingdom, pages 85–91. http://www.aclweb.org/anthology/W11-2107.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, pages 199–209. http://www.aclweb.org/anthology/N16-1024.

Desmond Elliott, Stella Frank, and Eva Hasler. 2015. Multi-language image description with neural sequence models. CoRR abs/1510.04709.

Desmond Elliott, Stella Frank, Khalil Sima'an, and Lucia Specia. 2016. Multi30k: Multilingual English-German image descriptions. CoRR abs/1605.00459. http://arxiv.org/abs/1605.00459.

Desmond Elliott and Ákos Kádár. 2017. Imagination improves multimodal translation. CoRR abs/1705.04350. http://arxiv.org/abs/1705.04350.

Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. 2017. Learning to parse and translate improves neural machine translation. arXiv:1702.03525 [cs]. http://arxiv.org/abs/1702.03525.

Gertrud Faaß and Kerstin Eckart. 2013. SdeWaC – a corpus of parsable sentences from the web. In Language Processing and Knowledge in the Web, Springer, pages 61–68.

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, CA, USA, pages 866–875. http://www.aclweb.org/anthology/N16-1101.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. CoRR abs/1410.5401. http://arxiv.org/abs/1410.5401.

Caglar Gulcehre, Marcin Moczulski, Misha Denil, and Yoshua Bengio. 2016. Noisy activation functions. CoRR abs/1603.00391. http://arxiv.org/abs/1603.00391.

Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, Silvie Cinková, Eva Fučíková, Marie Mikulová, Petr Pajas, Jan Popelka, Jiří Semecký, Jana Šindlerová, Jan Štěpánek, Josef Toman, Zdeňka Urešová, and Zdeněk Žabokrtský. 2012. Prague Czech-English Dependency Treebank 2.0. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University. http://hdl.handle.net/11858/00-097C-0000-0015-8DAF-4.

Kazuma Hashimoto and Yoshimasa Tsuruoka. 2017. Neural machine translation with source-side latent graph parsing. arXiv:1702.02265 [cs]. http://arxiv.org/abs/1702.02265.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Jindřich Helcl and Jindřich Libovický. 2017a. Neural Monkey: An open-source tool for sequence learning. The Prague Bulletin of Mathematical Linguistics (107):5–17. https://doi.org/10.1515/pralin-2017-0001.

Jindřich Helcl and Jindřich Libovický. 2017b. CUNI system for the WMT17 multimodal translation task. In Proceedings of the Second Conference on Machine Translation. Association for Computational Linguistics, Copenhagen, Denmark, pages 450–457. http://www.aclweb.org/anthology/W17-4749.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9:1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's multilingual neural machine translation system: Enabling zero-shot translation. arXiv:1611.04558 [cs]. http://arxiv.org/abs/1611.04558.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions. Association for Computational Linguistics, Prague, Czech Republic, pages 177–180. http://www.aclweb.org/anthology/P07-2045.

Junhui Li, Deyi Xiong, Zhaopeng Tu, Muhua Zhu, Min Zhang, and Guodong Zhou. 2017. Modeling source syntax for neural machine translation. arXiv:1705.01020 [cs]. http://arxiv.org/abs/1705.01020.

Jindřich Libovický and Jindřich Helcl. 2017. Attention strategies for multi-source sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Vancouver, Canada.

Jindřich Libovický, Jindřich Helcl, Marek Tlustý, Ondřej Bojar, and Pavel Pecina. 2016. CUNI system for WMT16 automatic post-editing and multimodal translation tasks. In Proceedings of the First Conference on Machine Translation. Association for Computational Linguistics, Berlin, Germany, pages 646–654. http://www.aclweb.org/anthology/W/W16/W16-2361.

Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. 2016. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. CoRR abs/1612.01887. http://arxiv.org/abs/1612.01887.

Shotaro Matsumoto, Hiroya Takamura, and Manabu Okumura. 2005. Sentiment classification using word sub-sequences and dependency sub-trees. Springer.

Jan Niehues and Eunah Cho. 2017. Exploiting linguistic resources for neural machine translation using multi-task learning. arXiv:1708.00993 [cs]. http://arxiv.org/abs/1708.00993.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pages 311–318. https://doi.org/10.3115/1073083.1073135.

Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation. Association for Computational Linguistics, Lisbon, Portugal, pages 392–395. http://aclweb.org/anthology/W15-3049.

Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. 2016. Self-critical sequence training for image captioning. CoRR abs/1612.00563. http://arxiv.org/abs/1612.00563.

Rico Sennrich and Barry Haddow. 2016. Linguistic input features improve neural machine translation. arXiv:1606.02892 [cs]. http://arxiv.org/abs/1606.02892.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556. http://arxiv.org/abs/1409.1556.

Raivis Skadiņš, Jörg Tiedemann, Roberts Rozis, and Daiga Deksne. 2014. Billions of parallel words for free: Building and using the EU Bookshop corpus. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC-2014). European Language Resources Association (ELRA), Reykjavik, Iceland.

Jason R. Smith, Hervé Saint-Amand, Magdalena Plamada, Philipp Koehn, Chris Callison-Burch, and Adam Lopez. 2013. Dirt cheap web-scale parallel text from the Common Crawl. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Sofia, Bulgaria, pages 1374–1383. http://www.aclweb.org/anthology/P13-1135.

Lucia Specia, Stella Frank, Khalil Sima'an, and Desmond Elliott. 2016. First shared task on multimodal machine translation and crosslingual image description. Online: staff.fnwi.uva.nl/s.c.frank/mmt_wmt_slides.pdf.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, Curran Associates, Inc., pages 3104–3112. http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf.

Aleš Tamchyna, Marion Weller-Di Marco, and Alexander Fraser. 2017. Modeling target-side inflection in neural machine translation. arXiv:1707.06012 [cs]. http://arxiv.org/abs/1707.06012.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). European Language Resources Association (ELRA), Istanbul, Turkey.

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Coverage-based neural machine translation. CoRR abs/1601.04811. http://arxiv.org/abs/1601.04811.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144. http://arxiv.org/abs/1609.08144.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In David Blei and Francis Bach, editors, Proceedings of the 32nd International Conference on Machine Learning (ICML-15). JMLR Workshop and Conference Proceedings, Lille, France, pages 2048–2057. http://jmlr.org/proceedings/papers/v37/xuc15.pdf.

Junbei Zhang, Xiao-Dan Zhu, Qian Chen, Li-Rong Dai, Si Wei, and Hui Jiang. 2017. Exploring question understanding and adaptation in neural-network-based question answering. CoRR abs/1703.04617. http://arxiv.org/abs/1703.04617.

Barret Zoph and Kevin Knight. 2016. Multi-source neural translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, CA, USA, pages 30–34. http://www.aclweb.org/anthology/N16-1004.