
Page 1: Recurrent neural networks for statistical machine translation (Neural Machine Translation) · 2016-11-14

Recurrent neural networks for statistical machine translation

(Neural Machine Translation)

Page 2

Outline

● Machine translation
● Statistical machine translation
● Traditional methods
● Why Neural Machine Translation (NMT)
● General sequence learning with RNNs
● Encoder-decoder approach
● Attention-based NMT
● Using monolingual corpora in NMT
● Character-based NMT

Page 3

Machine Translation

● Translate a source sentence F to a target sentence E

● Define set of rules (Rule based translation)

● We don’t even know the set of rules underlying a single language

● Statistical approaches attempt to automatically extract rules from a large corpus of text, either implicitly or explicitly

Page 4

Statistical Machine Translation

Corpora

e=(economic, growth, has, slowed, down, in, recent, years, .)

f=(La, croissance, économique, s'est, ralenti, ces, dernières, années, .)

Translation function

Page 5

Traditional methods

Page 6

Word-Based Model

● Words may not be the best candidates for the smallest translation units

● One word may translate to multiple words and vice versa

● A source word may have no corresponding word in the target translation, and vice versa

the house is very small

das Haus ist klitzeklein

Page 7

Phrase-Based Model

● Translate small word sequences at a time
● The foreign input is segmented into phrases
● Each phrase is translated into English
● Phrases are reordered

of course john has fun with the game

natuerlich hat john spass am spiel

Page 8

Phrase-Based Model (cont.)

● Fundamental equation of SMT

e_best = argmax_e p(e|f) = argmax_e p(f_1^I | e_1^I) · p_LM(e)

– Translation model: p(f_1^I | e_1^I)

– Language model: p_LM(e)

● Decomposition of the translation model

p(f_1^I | e_1^I) = ∏_{i=1}^{I} φ(f_i | e_i) · d(start_i − end_{i−1} − 1)

– Phrase translation probability φ

– Reordering probability d
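The decomposition above can be made concrete with a small sketch. The phrase table, probabilities, and the exponential distortion d(x) = α^|x| below are all hypothetical toy values for illustration, not from the slides:

```python
import math

# Hypothetical toy phrase table, for illustration only.
phrase_table = {("natuerlich", "of course"): 0.5, ("hat john", "john has"): 0.4}

def phrase_based_score(pairs, spans, lm_logprob, alpha=0.5):
    """Score one segmentation: sum of log phi(f_i|e_i) and
    log d(start_i - end_{i-1} - 1), with d(x) = alpha**|x|,
    plus the language-model log-probability log p_LM(e)."""
    log_score = lm_logprob
    prev_end = 0
    for (f_phrase, e_phrase), (start, end) in zip(pairs, spans):
        log_score += math.log(phrase_table[(f_phrase, e_phrase)])
        log_score += abs(start - prev_end - 1) * math.log(alpha)  # log d(...)
        prev_end = end
    return log_score
```

Monotone segmentations (start_i = end_{i−1} + 1) incur no distortion penalty, since d(0) = 1.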

Page 9

Phrase Translation Table

● Main knowledge source: table with phrase translations and their probabilities

● Example: phrase translations for natuerlich

Translation      Probability φ(e|f)
of course        0.5
naturally        0.3
of course ,      0.15
, of course ,    0.05

Page 10

Weighted Model

● Some sub-models may be more important than others

● Add weights

e_best = argmax_e ∏_{i=1}^{I} φ(f_i | e_i) · d(start_i − end_{i−1} − 1) · ∏_{i=1}^{|e|} p_LM(e_i | e_1 … e_{i−1})

e_best = argmax_e ∏_{i=1}^{I} φ(f_i | e_i)^{λ_φ} · d(start_i − end_{i−1} − 1)^{λ_d} · ∏_{i=1}^{|e|} p_LM(e_i | e_1 … e_{i−1})^{λ_LM}

with weights λ_φ, λ_d, λ_LM

Page 11

Weighted Model as Log-Linear Model

p(e, a | f) = exp{ λ_φ ∑_{i=1}^{I} log φ(f_i | e_i)
             + λ_d ∑_{i=1}^{I} log d(a_i − b_{i−1} − 1)
             + λ_LM ∑_{i=1}^{|e|} log p_LM(e_i | e_1 … e_{i−1}) }

Page 12

Log-Linear Model

● Such a weighted model is a log-linear model:

p(x) = exp ∑_{i=1}^{n} λ_i h_i(x)

● Feature functions

– number of feature functions: n = 3

– random variable: x = (e, f, start, end)

– feature function: h_1 = log φ

– feature function: h_2 = log d

– feature function: h_3 = log p_LM

● More features can be added

Page 13

Problems with traditional approaches

● Highly memory intensive

● The initial alignment makes a conditional independence assumption

● Word and phrase translation models only count co-occurrences of surface forms; they don't take word similarity into account

● Decoding phrase-based translation is highly non-trivial:

– word alignments + lexical reordering model

– language model

– phrase translations

Page 14

Evaluation

● BLEU (BiLingual Evaluation Understudy)

● N-gram overlap between machine translation output and reference translation

● Compute precision for n-grams of size 1 to 4

● Add a brevity penalty (for translations that are too short)

BLEU = min(1, output-length / reference-length) · (∏_{i=1}^{4} precision_i)^{1/4}
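The formula can be sketched directly in code. This is a minimal single-reference version (real BLEU implementations add clipping across multiple references and smoothing):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Brevity penalty times the geometric mean of 1..4-gram precisions."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0  # any zero precision zeroes the geometric mean
    bp = min(1.0, len(candidate) / len(reference))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

Running it on the System A / System B example from the next slide reproduces the 0% and roughly 52% scores.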

Page 15

Example

SYSTEM A: Israeli officials responsibility of airport safety

REFERENCE: Israeli officials are responsible for airport security

SYSTEM B: airport security Israeli officials are responsible

Metric               System A   System B
precision (1-gram)   3/6        6/6
precision (2-gram)   1/5        4/5
precision (3-gram)   0/4        2/4
precision (4-gram)   0/3        1/3
brevity penalty      6/7        6/7
BLEU                 0%         52%

Page 16

Multiple Reference Translations

● To account for variability, use multiple reference translations

– n-grams may match in any of the references

– the closest reference length is used

● Example

SYSTEM: Israeli officials responsibility of airport safety

REFERENCES:
Israeli officials are responsible for airport security
Israel is in charge of the security at this airport
The security work for this airport is the responsibility of the Israel government
Israeli side was in charge of the security of this airport

Page 17

General sequence learning using RNNs (Text analysis example)

Page 18

Unfolding in time

[Diagram: an RNN unfolded in time. At each step t, the input x_t enters through U, the hidden state s_t is updated from s_{t−1} through the recurrent weights W, and the output o_t is produced through V. The same weights U, W, V are tied across all time steps.]
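The unfolded computation is a short loop. This is a minimal sketch with toy dimensions; tanh is assumed as the nonlinearity, which the slide does not specify:

```python
import numpy as np

def rnn_forward(xs, U, W, V, s0):
    """Unrolled simple RNN: s_t = tanh(U x_t + W s_{t-1}), o_t = V s_t.
    The same U, W, V are reused (tied) at every time step."""
    s, outputs = s0, []
    for x in xs:
        s = np.tanh(U @ x + W @ s)
        outputs.append(V @ s)
    return outputs, s

# Toy dimensions: 3-dim inputs, 5-dim state, 2-dim outputs.
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), rng.normal(size=(2, 5))
outs, final_state = rnn_forward([rng.normal(size=3) for _ in range(4)],
                                U, W, V, np.zeros(5))
```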

Page 19

How RNN works

[Diagram: pipeline from 1-of-K coding w_i to continuous-space word representation s_i to recurrent state h_i, ending in a softmax over +ive / −ive]

e=(economic, growth, has, slowed, down, in, recent, years, .)

For example, predict the sentiment of the input document (+ive or −ive) with a softmax over the final state.

Page 20

Step 1: Word to one-hot vector (1)

[Diagram: 1-of-K coding w_i → continuous-space word representation s_i → recurrent state h_i]

e=(economic, growth, has, slowed, down, in, recent, years, .)

Page 21

Step 2: One-hot vector to continuous-space representation (2)

[Diagram: 1-of-K coding w_i → continuous-space word representation s_i → recurrent state h_i]

e=(economic, growth, has, slowed, down, in, recent, years, .)

s_i = E_{d×K} x_i

Projects the 1-of-K coded vector with a matrix E to a d-dimensional (typically 100–500) continuous word representation.
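Since x_i is one-hot, the projection s_i = E x_i amounts to selecting one column of E. A minimal sketch with toy sizes (the random E here stands in for a matrix that would be learned in training):

```python
import numpy as np

K, d = 8, 4                          # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(size=(d, K)) * 0.1    # E is learned during training in practice

def embed(word_index):
    """s_i = E x_i: the one-hot vector x_i selects one column of E."""
    x = np.zeros(K)
    x[word_index] = 1.0
    return E @ x
```

So `embed(i)` equals `E[:, i]` exactly, which is why embeddings are implemented as table lookups rather than matrix products.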

Page 22

Step 3: Sequence summarization by RNN (3)

[Diagram: 1-of-K coding w_i → continuous-space word representation s_i → recurrent state h_i]

e=(economic, growth, has, slowed, down, in, recent, years, .)

h_i = φ_θ(h_{i−1}, s_i)

The sequence of continuous vectors s_i is summarized by the RNN.

Page 23

Architecture is flexible

[Diagram: the same 1-of-K coding w_i → continuous-space word representation s_i → recurrent state h_i pipeline; the recurrent architecture itself can vary]

e=(economic, growth, has, slowed, down, in, recent, years, .)

Page 24

Summary sentence representation vectors

Sentence Representations from [Sutskever et al., 2014]. Similar sentences are close together

Page 25

Step 4: Use summarization for prediction

[Diagram: 1-of-K coding w_i → continuous-space word representation s_i → recurrent state h_i → softmax over +ive / −ive]

e=(economic, growth, has, slowed, down, in, recent, years, .)

Page 26

Range of applications

● image classification: fixed-sized input to fixed-sized output
● image captioning: sequence output
● sentiment analysis: sequence input
● machine translation: sequence input and sequence output
● video description: synced sequence input and output

Image: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Page 27

Neural Machine Translation

[Diagram: three architectures. SMT+Reranking-by-NN (Schwenk et al. 2006): input sentence → SMT → neural net → output sentence. SMT-NN (Devlin et al. 2014): input sentence → SMT with an integrated neural net → output sentence. Neural Machine Translation, different from both: input sentence → neural net → output sentence (Ñeco & Forcada, 1997; Kalchbrenner et al., 2013; Cho et al., 2014; Sutskever et al., 2014).]

Page 28

Encoder-Decoder Architecture for Machine Translation

Page 29

Encoder-Decoder Approach

Input sentence → Encoder → fixed-size representation → Decoder → Output sentence

(Ñeco & Forcada, 1997; Kalchbrenner et al., 2013; Cho et al., 2014; Sutskever et al., 2014)

RNN Encoder-Decoder (Cho et al. 2014)

[Diagram: the encoder reads x_1, x_2, …, x_L into hidden states h_1, h_2, …, h_L; the final state serves as the representation; the decoder states s_0, s_1, …, s_{T−1} emit y_1, y_2, …, y_T]

Page 30

[Diagram: neural machine translation system. Encoder: 1-of-K coding w_i → continuous-space word representation s_i → recurrent state h_i. Decoder: recurrent state z_i → word probability p_i → word sample u_i.]

f=(La, croissance, économique, s'est, ralenti, ces, dernières, années, .)

e=(economic, growth, has, slowed, down, in, recent, years, .)

Page 31

The Encoder

Page 32

Step 1: Word to one-hot vector (1)

[Diagram: encoder — 1-of-K coding w_i → continuous-space word representation s_i → recurrent state h_i]

f=(La, croissance, économique, s'est, ralenti, ces, dernières, années, .)

Page 33

Step 2: One-hot vector to continuous-space representation (2)

[Diagram: encoder — 1-of-K coding w_i → continuous-space word representation s_i → recurrent state h_i]

s_i = E_{d×K} x_i

Projects the 1-of-K coded vector with a matrix E to a d-dimensional (typically 100–500) continuous word representation.

s_i is updated to maximize the translation performance.

f=(La, croissance, économique, s'est, ralenti, ces, dernières, années, .)

Page 34

Step 3: Sequence summarization by RNN (3)

[Diagram: encoder — 1-of-K coding w_i → continuous-space word representation s_i → recurrent state h_i]

h_i = φ_θ(h_{i−1}, s_i)

The sequence of continuous vectors s_i is summarized by the RNN.

f=(La, croissance, économique, s'est, ralenti, ces, dernières, années, .)

Page 35

Summary sentence representation vectors

Sentence Representations from [Sutskever et al., 2014]. Similar sentences are close together

Page 36

The Decoder

Page 37

Step 1: Compute the internal hidden state of the decoder

[Diagram: decoder — recurrent state z_i → word probability p_i → word sample u_i]

(1) z_i = φ_θ'(h_T, u_{i−1}, z_{i−1})

e=(economic, growth, has, slowed, down, in, recent, years, .)

Page 38

Step 2: Next-word probability

[Diagram: decoder — recurrent state z_i → word probability p_i → word sample u_i]

(2) e(k) = w_k^T z_i + b_k

p(w_i = k | w_1, w_2, …, w_{i−1}, h_T) = exp(e(k)) / ∑_j exp(e(j))

Softmax: score and normalize the target words.

e=(economic, growth, has, slowed, down, in, recent, years, .)
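This scoring-and-normalizing step can be sketched in a few lines (toy dimensions; the max subtraction is a standard numerical-stability trick, not part of the slide's formula):

```python
import numpy as np

def next_word_probs(z_i, W_out, b):
    """e(k) = w_k^T z_i + b_k for every target word k, then a softmax
    turns the scores into p(w_i = k | history, h_T)."""
    e = W_out @ z_i + b        # one score per vocabulary entry
    e = e - e.max()            # subtract the max for numerical stability
    p = np.exp(e)
    return p / p.sum()

rng = np.random.default_rng(1)
z = rng.normal(size=6)                             # decoder state z_i
W_out, b = rng.normal(size=(10, 6)), np.zeros(10)  # toy 10-word vocabulary
p = next_word_probs(z, W_out, b)
```

The result is a proper distribution over the vocabulary, from which the next word can be sampled or chosen greedily.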

Page 39

Step 3: Sample the next word (3)

[Diagram: decoder — recurrent state z_i → word probability p_i → word sample u_i]

e=(economic, growth, has, slowed, down, in, recent, years, .)

Page 40

Training

Page 41

Training: Maximum Likelihood Estimation

● Given a parallel corpus D of training examples

D = {(x^1, y^1), (x^2, y^2), …, (x^N, y^N)}

● Compute log P(y^n | x^n, θ)

● Log-likelihood of the training corpus

L(D, θ) = (1/N) ∑_{n=1}^{N} log P(y^n | x^n, θ)

● Maximize the log-likelihood using stochastic gradient descent (SGD)

Page 42

Problem with simple encoder-decoder architectures

Page 43

Soft Attention Mechanism for Neural Machine Translation

Page 44

Bidirectional recurrent neural networks for encoding a source sentence

[Diagram: encoder — 1-of-K coding w_i → continuous-space word representation s_i → bidirectional recurrent state h_i, which reads the source sentence in both directions]

f=(La, croissance, économique, s'est, ralenti, ces, dernières, années, .)

Page 45

Attention Mechanism

● To decide the i-th target word, calculate a relevance score between the i-th target word and every source word

● The attention mechanism is implemented by a neural network

Page 46

Attention Mechanism (step 1)

[Diagram: decoder recurrent state z_i and word sample u_i; attention mechanism over annotation vectors h_1 … h_J with attention weights a_j, ∑_j a_j = 1]

(1) e_j = W_a z_{i−1} + U_a h_j

e=(economic, growth, has, slowed, down, in, recent, years, .)

Page 47

Attention Mechanism (step 2)

[Diagram: decoder recurrent state z_i and word sample u_i; attention mechanism over annotation vectors h_1 … h_J with attention weights a_j, ∑_j a_j = 1]

(2) α_j = exp(e_j) / ∑_{j'} exp(e_{j'})

e=(economic, growth, has, slowed, down, in, recent, years, .)

Page 48

Attention Mechanism (step 3)

[Diagram: decoder recurrent state z_i and word sample u_i; attention mechanism over annotation vectors h_1 … h_J with attention weights a_j, ∑_j a_j = 1]

(3) c_i = ∑_{j=1}^{T} α_j h_j

e=(economic, growth, has, slowed, down, in, recent, years, .)
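The three steps fit in one function. A minimal sketch: the slide's score e_j = W_a z_{i−1} + U_a h_j is taken here as scalar-valued (W_a and U_a as row vectors); Bahdanau et al. additionally pass the sum through tanh and a projection vector:

```python
import numpy as np

def attention_context(z_prev, H, Wa, Ua):
    """Step 1: scores e_j = Wa . z_prev + Ua . h_j for every annotation h_j.
    Step 2: a softmax over the scores gives weights alpha_j (summing to 1).
    Step 3: context c_i = sum_j alpha_j h_j, a weighted summary of the source."""
    e = np.array([Wa @ z_prev + Ua @ h for h in H])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    c = (alpha[:, None] * H).sum(axis=0)
    return alpha, c

rng = np.random.default_rng(0)
H = rng.normal(size=(9, 5))     # one annotation vector per source word
z_prev = rng.normal(size=6)     # previous decoder state z_{i-1}
alpha, c = attention_context(z_prev, H, rng.normal(size=6), rng.normal(size=5))
```

Because the weights are recomputed for every target word, the decoder can focus on different source positions at each step instead of relying on a single fixed-size representation.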

Page 49

Effective Approaches to Attention-based Neural Machine Translation (Minh-Thang Luong et al., 2015)

Page 50

Global attentional model

[Diagram: global attention — all source hidden states h_s are compared with the target state h_t to produce global alignment weights a_t; these give the context vector c_t, which the attention layer combines with h_t into h̃_t to predict y_t]

Page 51

Local attentional model

[Diagram: local attention — the model first predicts an aligned position p_t, then computes local alignment weights a_t over a window of source states h_s around p_t; the context vector c_t is combined with h_t into h̃_t to predict y_t]

Page 52

Input-feeding approach

[Diagram: input-feeding — the attentional vector h̃_t is fed back as an input to the next decoder step, so alignment decisions can take past alignments into account. Example sequence: source A B C D <eos>, target X Y Z <eos>]

Page 53

Results

WMT’14 English-German results

Page 54

Results (cont)

WMT’15 English-German results

Page 55

Word-Character Models

Page 56

Another Character-based NMT

Character-based word embedding

Marta R. Costa-jussà and José A. R. Fonollosa

Page 57

Yet another (fully) Character-Level NMT

Fully Character-Level Neural Machine Translation without Explicit Segmentation (Nov 2016), Jason Lee, Kyunghyun Cho, Thomas Hofmann

Page 58

Using monolingual corpora in NMT(C. Gulcehre et al., 2015)

Page 59

Integrating Language Model into the decoder

● Pre-train the NMT model (on parallel corpora) and an RNNLM (Mikolov et al., 2011) (on larger monolingual corpora) separately

● Shallow fusion. At each time step:

– the TM proposes a set of candidate words

– TM score: "candidate sentences" + TM scores of the candidate words

– select the top K

– rescore these hypotheses with the weighted sum of the scores from the NMT and the RNNLM

– the score of the new word:

log p(y_t = k) = log p_TM(y_t = k) + β log p_LM(y_t = k)
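The rescoring step is just a weighted sum of log-probabilities. A minimal sketch; the candidate words and scores below are hypothetical:

```python
def shallow_fusion_rescore(log_p_tm, log_p_lm, beta):
    """log p(y_t = k) = log p_TM(y_t = k) + beta * log p_LM(y_t = k);
    returns the candidate words ranked by the combined score."""
    combined = {k: log_p_tm[k] + beta * log_p_lm[k] for k in log_p_tm}
    return sorted(combined, key=combined.get, reverse=True)

# Hypothetical top-K candidate scores from the NMT and the RNNLM.
log_p_tm = {"years": -0.4, "decades": -1.2, "ages": -2.0}
log_p_lm = {"years": -0.5, "decades": -0.3, "ages": -3.0}
ranking = shallow_fusion_rescore(log_p_tm, log_p_lm, beta=0.5)
```

Since the two models are trained separately, β is the single knob controlling how strongly the monolingual LM influences the translation.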

Page 60

Shallow fusion (diagram)

[Diagram: shallow fusion — the translation model state s_t^TM produces "candidate sentences" + translation model scores; the language model state s_t^LM rescores them ("language model rescoring") with weight β to produce y_t from the context c_t and the history y_≤(i)]

Page 61

Deep fusion

● Deep Fusion

– integrate the RNNLM and the decoder of the NMT by concatenating their hidden states next to each other

– controller mechanism (gate)

p(y_t | y_<t, x) ∝ exp( y_t^T (W_o f_o(s_{t−1}^TM, s_{t−1}^LM, y_{t−1}, c_t) + b_o) )

g_t = σ(v_g^T s_t^LM + b_g)
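The controller gate is a single scalar sigmoid over the LM state. A minimal sketch of just the gating step (toy dimensions, random vectors standing in for learned parameters):

```python
import numpy as np

def gated_lm_state(s_lm, v_g, b_g):
    """Controller gate g_t = sigmoid(v_g^T s_t^LM + b_g); the gated vector
    g_t * s_t^LM is what gets concatenated with the decoder state s_t^TM."""
    g = 1.0 / (1.0 + np.exp(-(v_g @ s_lm + b_g)))
    return g, g * s_lm

rng = np.random.default_rng(2)
s_lm = rng.normal(size=4)              # hypothetical RNNLM hidden state
g, gated = gated_lm_state(s_lm, rng.normal(size=4), 0.0)
```

Because g_t is learned, the model can dial the LM's contribution up or down per time step instead of using one fixed weight as in shallow fusion.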

Page 62

Deep fusion (diagram)

[Diagram: deep fusion — at each step t, the decoder state s_t^TM and the language-model state s_t^LM are combined, with the LM contribution scaled by the gate g_t, to produce y_t from the context c_t; the same structure repeats at step t+1]

Page 63

Results

Results on the Zh-En task:

          SMS/CHAT          CTS
          Dev     Test      Dev     Test
PB        15.5    14.73     21.94   21.68
+CSLM     16.02   15.25     23.05   22.79
HPB       15.33   14.71     21.45   21.43
+CSLM     15.93   15.8      22.61   22.17
NMT       17.32   17.36     23.4    23.59
Shallow   16.59   16.42     22.7    22.83
Deep      17.58   17.64     23.78   23.5

          De-En             Cs-En
          Dev     Test      Dev     Test
NMT       25.51   23.61     21.47   21.89
Shallow   25.53   23.69     21.95   22.18
Deep      25.88   24        22.49   22.36

Results for the De-En and Cs-En translation tasks on the WMT'15 dataset.

Page 64

Results (cont)

                              Development set   Test set
                              dev2010   tst2010   tst2011   tst2012   tst2013   Test 2014
Previous Best (Single)        15.33     17.14     18.62     18.88     18.88     -
Previous Best (Combination)   -         17.34     18.93     18.7      18.7      -
NMT                           14.5      18.01     18.77     19.86     19.86     18.64
NMT+LM (Shallow)              14.44     17.99     18.8      19.87     19.87     18.66
NMT+LM (Deep)                 15.69     19.34     20.23     21.34     21.34     20.56

Results on Tr-En

Future work: domain adaptation of the language model may further improve translations.

The performance improvement from incorporating an external LM was highly dependent on the domain similarity between the monolingual corpus and the target task.

Page 65

References

● Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
● Cho, Kyunghyun et al. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).
● Cho, Kyunghyun, Aaron Courville, and Yoshua Bengio. "Describing Multimedia Content using Attention-based Encoder-Decoder Networks." arXiv preprint arXiv:1507.01053 (2015).
● Graves, Alex, Greg Wayne, and Ivo Danihelka. "Neural Turing Machines." arXiv preprint arXiv:1410.5401 (2014).
● Gulcehre, Caglar et al. "On Using Monolingual Corpora in Neural Machine Translation." arXiv preprint arXiv:1503.03535 (2015).
● Kalchbrenner, Nal, and Phil Blunsom. "Recurrent Continuous Translation Models." EMNLP 2013: 1700-1709.
● Koehn, Philipp. Statistical Machine Translation. Cambridge University Press, 2009.
● Pascanu, Razvan et al. "How to construct deep recurrent neural networks." arXiv preprint arXiv:1312.6026 (2013).
● Schwenk, Holger. "Continuous space language models." Computer Speech & Language 21.3 (2007): 492-518.
● Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in Neural Information Processing Systems 2014: 3104-3112.
● http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-with-gpus/
● Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective Approaches to Attention-based Neural Machine Translation." arXiv preprint arXiv:1508.04025 (2015).
● Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. "BLEU: a method for automatic evaluation of machine translation." Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.