Context-Aware Smoothing for Neural Machine Translation

Kehai Chen¹, Rui Wang², Masao Utiyama², Eiichiro Sumita² and Tiejun Zhao¹

¹Harbin Institute of Technology, China
²National Institute of Information and Communications Technology, Japan

The 8th International Joint Conference on Natural Language Processing



Content

• Motivation

• Related Works

• Context-Aware Representation

• NMT with Context-Aware Smoothing

• Experimental Results

• Conclusion

11/28/2017 1



Motivation 1: Polysemous words

Two bilingual parallel sentence pairs:

Src1 (pinyin): 他们 (tamen) 想 (xiang) 通过 (tongguo) 打 (da) 比赛 (bisai) 来 (lai) 解决 (jiejue) 矛盾 (maodun)
Trg1: They want to solve the dispute by playing the game

Src2 (pinyin): 他们 (tamen) 正在 (zhengzai) 因为 (yinwei) 争执 (zhengzhi) 而 (er) 打 (da) 对方 (duifang)
Trg2: They are beating each other for a dispute

The same source word 打 (da) aligns to "playing" in the first pair and to "beating" in the second: a word's meaning depends on its specific context, yet a single embedding v_da has to cover both senses.

h_j = f_enc(v_j, h_{j-1})

Learning a sentence-specific word representation v_j yields a better source representation, and in turn a better target translation.

Motivation 1: enhance the word representations of polysemous words.




Motivation 2: OOV words

[Diagram: attention-based encoder-decoder NMT with a word embedding layer, encoder layer, attention, and decoder layer; the gray parts indicate the NMT parameters affected by the OOV.]

• The source sentence includes an OOV, and a single vector v_unk represents all OOVs. This:
• breaks the structure of the sentence;
• yields a poor source representation;
• … ;
• and harms the prediction of the target word.



Related Works

• Translation granularity for NMT
--- Smaller translation granularities for OOVs: word, sub-word (BPE), character.
--- Sennrich et al. (2016), Costa-jussà and Fonollosa (2016), Li et al. (2016), …

• Source representation for NMT
--- RNN- or CNN-based encoders learn the source representation over a sequence of fixed word vectors.
--- Bahdanau et al. (2015), Sutskever et al. (2014), …

• This work focuses on enhancing the word embedding layer:
--- learning a sentence-specific representation for a polysemous or OOV word from its context words;
--- the context-aware representation enhances the word embedding layer and thereby improves translations (even though an RNN encoder can itself capture word context).


Context-Aware Representation

x1 x2 x3 x4 unk x6 x7 x8 x9

If there is an OOV "unk" (or a polysemous word) in the sentence, the surrounding word vectors v1 … v9 carry context information about it.

When people read a natural language sentence, especially one containing an OOV or polysemous word, they intuitively infer the meaning of such words from their context words.



Context-Aware Representation

x1 x2 x3 x4 x5 x6 x7 … xJ

• We define the context L_j of a source word x_j within a fixed-size window of 2n words (the n historical words and the n future words):

L_j = x_{j-n}, …, x_{j-1}, x_{j+1}, …, x_{j+n}

• Taking x_5 as an example, its context L_5 with n = 2 is:

L_5 = x_3, x_4, x_6, x_7
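This windowed context can be sketched in a few lines of Python (a minimal illustration; the function name `context_window` and the skip-at-boundary behavior are assumptions, not from the paper):

```python
def context_window(sentence, j, n):
    """Return the 2n-word context L_j around position j (0-indexed),
    excluding x_j itself; positions outside the sentence are skipped."""
    left = sentence[max(0, j - n):j]       # up to n historical words
    right = sentence[j + 1:j + 1 + n]      # up to n future words
    return left + right

# For x_5 (index 4) with n = 2 in a sentence x_1 ... x_8:
words = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]
print(context_window(words, 4, 2))  # ['x3', 'x4', 'x6', 'x7']
```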



Context-Aware Representation

Feedforward Context-of-Words Model (FCWM)

• Context words L_j of x_j: L_j = x_{j-n}, …, x_{j-1}, x_{j+1}, …, x_{j+n}
• Input layer (concatenation of the context word vectors):
L_j = [v_{j-n} : … : v_{j-1} : v_{j+1} : … : v_{j+n}]
• Non-linear output layer:
V_{L_j} = σ(W_1 L_j + b_1), written compactly as V_{L_j} = φ_1(L_j; θ_1)

Convolutional Context-of-Words Model (CCWM)

• Input layer (matrix of the context word vectors):
M = [v_{j-n}, …, v_{j-1}, v_{j+1}, …, v_{j+n}]
• Convolution layer (filter width k):
L = ψ(W_2 M + b_2), L = [L_1, …, L_{2n-k+1}]
• Pooling layer:
P_l = max[L_{2l-1}, L_{2l}], P = [P_1, …, P_{(2n-k+1)/2}]
• Non-linear output layer:
V_{L_j} = σ(W_3 ave(Σ_{l=1}^{(2n-k+1)/2} P_l) + b_3), written compactly as V_{L_j} = φ_2(L_j; θ_2)
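A minimal NumPy sketch of both context-of-words models may help make the layers concrete. The weight shapes, the sigmoid choice for σ, the tanh choice for ψ, and the handling of an odd leftover convolution window in the pooling step are all assumptions for illustration, not details from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fcwm(context_vecs, W1, b1):
    """FCWM: V_Lj = sigma(W1 . [v_{j-n} : ... : v_{j+n}] + b1)."""
    Lj = np.concatenate(context_vecs)          # input layer: concatenation
    return sigmoid(W1 @ Lj + b1)               # non-linear output layer

def ccwm(context_vecs, W2, b2, W3, b3, k=2):
    """CCWM: convolution over the context, pairwise max pooling,
    then an averaged non-linear output layer."""
    M = np.stack(context_vecs)                 # (2n, d) context matrix
    windows = [np.concatenate(M[i:i + k]) for i in range(len(M) - k + 1)]
    L = [np.tanh(W2 @ w + b2) for w in windows]             # convolution layer
    P = [np.maximum(L[2 * l], L[2 * l + 1])                 # pairwise max pool
         for l in range(len(L) // 2)]                       # (odd leftover dropped)
    return sigmoid(W3 @ np.mean(P, axis=0) + b3)            # output layer

# toy dimensions: n = 2 (four context words), embedding dim d = 4, filter k = 2
rng = np.random.default_rng(0)
d = 4
ctx = [rng.standard_normal(d) for _ in range(4)]
W1, b1 = rng.standard_normal((d, 4 * d)), np.zeros(d)
W2, b2 = rng.standard_normal((d, 2 * d)), np.zeros(d)
W3, b3 = rng.standard_normal((d, d)), np.zeros(d)
print(fcwm(ctx, W1, b1).shape, ccwm(ctx, W2, b2, W3, b3).shape)  # (4,) (4,)
```

Both functions map the 2n context vectors to a single d-dimensional context-aware vector, matching φ_1 and φ_2 above.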


NMT for OOV Smoothing: CARNMT-Enc

Encoder. Standard NMT:
h_j = f_enc(v_j, h_{j-1})
This work:
h_j = f_enc(v_j, h_{j-1})             if x_j ∈ V_s
h_j = f_enc(φ_e(L_{x_j}), h_{j-1})    if x_j ∉ V_s

Decoder. Standard NMT:
p(y_i | y_{<i}, x) = g(v_{y_{i-1}}, s_i, c_i)
This work:
p(y_i | y_{<i}, x) = g(v_{y_{i-1}}, s_i, c_i)        if y_{i-1} ∈ V_t
p(y_i | y_{<i}, x) = g(φ_d(L_{y_{i-1}}), s_i, c_i)   if y_{i-1} ∉ V_t

[Diagram: CARNMT-Enc — on the source side, the OOV's embedding is replaced by its context-aware representation φ_e(L_u) before entering the encoder.]
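The encoder-side switch above can be sketched as follows. All names here are hypothetical, and the averaging `phi_e` is only a stand-in for the trained FCWM/CCWM φ_e:

```python
import numpy as np

vocab = {"the", "cat", "sat", "on", "mat"}
embed = {w: np.full(3, float(i)) for i, w in enumerate(sorted(vocab))}

def phi_e(context):
    """Stand-in for the trained phi_e (FCWM/CCWM): here simply the
    average of the in-vocabulary context embeddings."""
    vecs = [embed[w] for w in context if w in vocab]
    return np.mean(vecs, axis=0)

def encoder_input(sentence, j, n=2):
    """CARNMT-Enc switch: v_j if x_j is in V_s, else phi_e(L_xj)."""
    word = sentence[j]
    if word in vocab:                          # x_j in V_s
        return embed[word]
    context = sentence[max(0, j - n):j] + sentence[j + 1:j + 1 + n]
    return phi_e(context)                      # x_j not in V_s: smoothed vector

sentence = ["the", "cat", "zorp", "on", "mat"]   # "zorp" is OOV
v = encoder_input(sentence, 2)                    # context-aware vector for "zorp"
```

The same switch, applied to y_{i-1} with φ_d on the target side, gives CARNMT-Dec.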


NMT for OOV Smoothing: CARNMT-Dec

[Diagram: CARNMT-Dec — the encoder and decoder equations are as on the previous slide; here only the target side is smoothed, feeding φ_d(L_u) to the decoder when the previous target word y_{i-1} is an OOV.]


NMT for OOV Smoothing: CARNMT-Both

[Diagram: CARNMT-Both — combines CARNMT-Enc and CARNMT-Dec, applying φ_e(L_u) on the source side and φ_d(L_u) on the target side.]


NMT for Smoothing All Words: CARNMT-ALL

Encoder. Standard NMT:
h_j = f_enc(v_j, h_{j-1})
This work:
h_j = f_enc(φ_e(L_{x_j}), h_{j-1})    for every source word x_j

Decoder. Standard NMT:
p(y_i | y_{<i}, x) = g(v_{y_{i-1}}, s_i, c_i)
This work:
p(y_i | y_{<i}, x) = g(φ_d(L_{y_{i-1}}), s_i, c_i)    for every target word

[Diagram: CARNMT-ALL — every source embedding v_j is concatenated with its context-aware representation v_{L_j} = φ_e(L_j), and every target embedding with φ_d(L'_i).]
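Since the diagram shows every embedding paired with its context-aware vector (v_j : v_{L_j}), the all-words case can be sketched as a concatenation over the whole sentence. The function name and the averaging stand-in for φ_e are assumptions for illustration:

```python
import numpy as np

def all_smooth_inputs(vecs, phi_e, n=2):
    """CARNMT-ALL sketch: every encoder input is the word's own embedding
    concatenated with its context-aware representation, [v_j : v_Lj]."""
    out = []
    for j, v in enumerate(vecs):
        ctx = vecs[max(0, j - n):j] + vecs[j + 1:j + 1 + n]  # window L_j
        out.append(np.concatenate([v, phi_e(ctx)]))
    return out

# stand-in phi_e: average of the context vectors (the real model uses FCWM/CCWM)
phi_e = lambda ctx: np.mean(ctx, axis=0)
vecs = [np.full(3, float(i)) for i in range(4)]
enhanced = all_smooth_inputs(vecs, phi_e)
print(enhanced[0].shape)  # (6,)
```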


Experimental Settings

• Training data: 1.42M Chinese-to-English parallel sentence pairs from the LDC corpus.
• NIST 2002 (MT02) is the validation set and NIST 2003–2008 (MT03–08) are the test sets; case-insensitive 4-gram NIST BLEU (Papineni et al., 2002) is the evaluation metric.
• Vocabulary size: 30k; maximum sentence length: 80; mini-batch size: 80; word embedding dimension: 620; hidden layer dimension: 1000; dropout on all layers; optimizer: Adadelta.
• Baselines: standard attentional NMT (Bahdanau et al., 2015); subword-based NMT (Sennrich et al., 2016); character-based NMT (Costa-jussà and Fonollosa, 2016); replacing unk with a semantically similar in-vocabulary word (Li et al., 2016).


Experimental Results: Chinese-to-English Translation

• Moses vs. NMT → both are strong baselines.
• CARNMT-Enc/Dec vs. Bahdanau et al. (2015) → our method effectively smooths the negative effect of OOVs (Motivation 2).
• CARNMT-Both vs. CARNMT-Enc/Dec → source-side smoothing is orthogonal to target-side smoothing.
• ALLSmooth vs. CARNMT-Both → in-vocabulary smoothing is also beneficial for NMT (Motivation 1).





Experimental Results: FCWM vs. CCWM

• The CCWM learns a semantic representation of the context directly to smooth the word vector, while the FCWM predicts a word's semantic representation from its context.


Experimental Results: Translation Quality for Sentences with Different Numbers of OOVs

Number of sentences per OOV count: 2306, 1827, 1121, 678, 391, 215, 123, 59, 37, 24, 29

• With no OOVs, ALLSmooth is better than the baseline (Bahdanau et al., 2015), while CARNMT-Enc and CARNMT-Dec perform similarly to it.
• As the number of OOVs increases, the gap between our methods and the other methods (except PBSMT) becomes larger, especially beyond five OOVs.
• When a sentence contains more than seven OOVs, PBSMT is better than all NMT models.





Experimental Results: Translation Example

• The negative effect of OOVs exists in NMT: the OOV "jiyuqi" itself is not translated, and the phrase "lizheng yousuo zuowei" (the red part of the English output) is not translated either.
• After smoothing the negative effect of the OOV, the model recovers the translation "striving to accomplish something" for "lizheng yousuo zuowei".


Conclusion

• Experimental results showed that the negative effect of OOVs decreases the translation performance of NMT, and that the existing RNN encoder cannot adequately address the problem.
• The learned context-aware representation (CAR) was integrated into the encoder to smooth word representations, which in turn enhanced the NMT decoder.
• The proposed method greatly alleviates the negative effect of OOVs and enhances the word representations of in-vocabulary words, thus improving translations.


Q&A

Thanks!