A Scalable Decoder for Parsing-based Machine Translation with Equivalent Language Model State Maintenance
Zhifei Li and Sanjeev Khudanpur, Johns Hopkins University

Page 1:

A Scalable Decoder for Parsing-based Machine Translation with Equivalent Language Model State Maintenance

Zhifei Li and Sanjeev Khudanpur, Johns Hopkins University

Page 2:

JOSHUA: a scalable open-source parsing-based MT decoder

Written in the Java language
Chart parsing, beam and cube pruning (Chiang, 2007)
K-best extraction over a hypergraph
m-gram LM integration
Parallel decoding
Distributed LM (Zhang et al., 2006; Brants et al., 2007)
Equivalent LM state maintenance (new!)
We plan to add more functions soon

Page 3:

Chart-parsing

Grammar formalism: synchronous context-free grammar (SCFG)
Chart parsing is bottom-up. The parser maintains a chart, which contains an array of cells (or bins); each cell maintains a list of items. Parsing starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved.
The hypotheses are stored in a hypergraph.
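The chart organization described above can be summarized in a short sketch. This is a minimal illustration in Java (the language JOSHUA is written in), with hypothetical class names and rule matching reduced to a toy span-width test; it is not the actual JOSHUA code.

```java
import java.util.*;

/** Minimal sketch of the bottom-up chart described above: cells indexed by source span,
 *  each holding the items proved over that span. Names and rule matching are illustrative. */
class ChartParsingSketch {
    record Item(String nonterminal, int start, int end) {}
    record Rule(String lhs, int spanWidth) {}            // toy rule: covers exactly spanWidth words

    public static void main(String[] args) {
        int n = 4;                                        // source sentence length
        List<Rule> rules = List.of(new Rule("X", 1), new Rule("X", 2), new Rule("S", 4));
        Map<String, List<Item>> chart = new HashMap<>();  // key "i,j" -> cell for span [i, j)

        // Axioms first (width-1 spans), then wider spans, until the goal span [0, n).
        for (int width = 1; width <= n; width++) {
            for (int i = 0; i + width <= n; i++) {
                int j = i + width;
                List<Item> cell = new ArrayList<>();
                for (Rule r : rules) {
                    if (r.spanWidth() == width) {         // an "inference rule" proves a new item
                        cell.add(new Item(r.lhs(), i, j));
                    }
                }
                chart.put(i + "," + j, cell);
            }
        }
        boolean goalProved = chart.get("0," + n).stream()
                                  .anyMatch(it -> it.nonterminal().equals("S"));
        System.out.println("goal item proved: " + goalProved);
    }
}
```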

Page 4:

Hypergraph

[Figure: hypergraph for the source sentence 垫子0 上1 的2 猫3 ("the cat on the mat"); nodes are items, hyperedges connect antecedent items to a consequent item via a rule.]

Items (source span with left/right LM state):
X | 3, 4 | a cat | NA
X | 0, 2 | the mat | NA
X | 0, 4 | a cat | the mat
X | 0, 4 | the mat | a cat
Goal item: S

Rules on the hyperedges:
X → (猫, a cat)
X → (垫子 上, the mat)
X → (X0 的 X1, X0 X1)
X → (X0 的 X1, X0 's X1)
X → (X0 的 X1, X1 of X0)
X → (X0 的 X1, X1 on X0)
S → (X0, X0)
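A rough sketch of the data structures behind such a hypergraph, assuming hypothetical class names rather than JOSHUA's real ones: each item records its nonterminal, source span, and left/right LM state words, and each hyperedge records the rule applied plus pointers to its antecedent items.

```java
import java.util.List;

/** Sketch of hypergraph structures implied by the slide; names are illustrative. */
class HypergraphSketch {
    /** A node: an item identified by nonterminal, source span, and left/right LM state words. */
    record Item(String nonterminal, int start, int end,
                List<String> leftLmState, List<String> rightLmState,
                List<HyperEdge> incomingEdges) {}

    /** A hyperedge: roughly a rule plus pointers to the antecedent items it combined. */
    record HyperEdge(String rule, List<Item> antecedents, double score) {}

    public static void main(String[] args) {
        Item cat = new Item("X", 3, 4, List.of("a", "cat"), List.of(), List.of());
        Item mat = new Item("X", 0, 2, List.of("the", "mat"), List.of(), List.of());
        HyperEdge e = new HyperEdge("X -> (X0 的 X1, X1 on X0)", List.of(mat, cat), -2.3);
        Item full = new Item("X", 0, 4, List.of("a", "cat"), List.of("the", "mat"), List.of(e));
        System.out.println(full);
    }
}
```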

Page 5:

Hypergraph and Trees

[Figure: four derivation trees extracted from the hypergraph for 垫子0 上1 的2 猫3. Each tree uses S → (X0, X0), X → (猫, a cat), and X → (垫子 上, the mat), plus one of four rules for 的:]

X → (X0 的 X1, X0 X1)      yields "the mat a cat"
X → (X0 的 X1, X0 's X1)   yields "the mat 's a cat"
X → (X0 的 X1, X1 on X0)   yields "a cat on the mat"
X → (X0 的 X1, X1 of X0)   yields "a cat of the mat"

Page 6:

How to Integrate an m-gram LM?

[Figure: LM integration while translating the source sentence 奥运会0 将1 在2 中国3 的4 北京5 举行。6 as "the olympic game will be held in beijing of china ."]

Rules:
X → (奥运会, the olympic game)
X → (北京, beijing)
X → (中国, china)
X → (X0 的 X1, X1 of X0)
X → (将 在 X0 举行。, will be held in X0 .)
S → (X0, X0)
S → (S0 X1, S0 X1)
S → (<s> S0 </s>, <s> S0 </s>)

Items (source span with left/right LM state):
X | 5, 6 | beijing | NA
X | 3, 4 | china | NA
X | 3, 6 | beijing of | of china        (new 3-gram: beijing of china)
X | 0, 1 | the olympic | olympic game
X | 1, 7 | will be | china .            (new 3-grams: will be held, be held in, held in beijing, in beijing of)
S | 0, 1 | the olympic | olympic game
S | 0, 7 | the olympic | china .
S | 0, 7 | <s> the | . </s>

Three functions: accumulate probability, estimate future cost, state extraction.
Example: accumulated probability 0.04 = 0.4 * 0.2 * 0.5; future probability P(beijing of) = 0.01; estimated total probability 0.01 * 0.04 = 0.004.
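The three functions listed above might be sketched as follows. The LanguageModel interface and all method names here are assumptions for illustration, not JOSHUA's actual API: accumulate scores only the m-grams completed when two translation fragments are concatenated, futureCost scores the left-state words with whatever shortened context is available, and extractState keeps the first and last m-1 words.

```java
import java.util.*;

/** Sketch of the three LM-integration functions (accumulate probability, estimate future
 *  cost, state extraction). Interface and names are illustrative, not JOSHUA's API. */
class LmIntegrationSketch {
    interface LanguageModel {
        double prob(List<String> context, String word);   // P(word | up to m-1 preceding words)
        int order();                                       // m
    }

    /** Accumulate probability: score only the m-grams completed by concatenating the two
     *  fragments, i.e. those whose context crosses the boundary between left and right. */
    static double accumulate(LanguageModel lm, List<String> left, List<String> right) {
        List<String> joined = new ArrayList<>(left);
        joined.addAll(right);
        int m = lm.order();
        double p = 1.0;
        for (int i = left.size(); i < joined.size() && i < left.size() + m - 1; i++) {
            List<String> ctx = joined.subList(Math.max(0, i - m + 1), i);
            p *= lm.prob(ctx, joined.get(i));              // e.g. the new 3-gram "beijing of china"
        }
        return p;
    }

    /** Estimate future cost: score the left-state words with shortened context,
     *  e.g. P(beijing of) = P(beijing) * P(of | beijing). */
    static double futureCost(LanguageModel lm, List<String> leftState) {
        double p = 1.0;
        for (int i = 0; i < leftState.size(); i++) {
            p *= lm.prob(leftState.subList(0, i), leftState.get(i));
        }
        return p;
    }

    /** State extraction: keep the first and last m-1 words of the item's translation
     *  (a real decoder marks the right state NA when the string is shorter than m-1). */
    static List<List<String>> extractState(List<String> words, int m) {
        int k = Math.min(m - 1, words.size());
        return List.of(words.subList(0, k), words.subList(words.size() - k, words.size()));
    }

    public static void main(String[] args) {
        LanguageModel lm = new LanguageModel() {
            public double prob(List<String> context, String word) { return 0.1; }  // stub LM
            public int order() { return 3; }
        };
        System.out.println(accumulate(lm, List.of("will", "be"), List.of("held", "in", "beijing")));
        System.out.println(futureCost(lm, List.of("beijing", "of")));
        System.out.println(extractState(List.of("the", "olympic", "game", "will", "be"), 3));
    }
}
```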

Page 7:

Equivalent State Maintenance: overview

In a straightforward implementation, different LM state words lead to different items, for example:

X → (在 X0 的 X1 下, below X1 of X0)      X | 0, 3 | below cat | some rat
X → (在 X0 的 X1 下, below X1 of X0)      X | 0, 3 | below cats | many rat
X → (在 X0 的 X1 下, under X1 of X0)      X | 0, 3 | under cat | some rat
X → (在 X0 的 X1 下, below X1 of X0)      X | 0, 3 | below cat | many rat
X → (在 X0 的 X1 下, under the X1 of X0)
X → (在 X0 的 X1 下, below the X1 of X0)

We merge multiple items into a single item by replacing some LM state words with an asterisk wildcard, e.g.:

X | 0, 3 | below * | * rat

By merging items, we can explore a larger hypothesis space using less time.
We only merge items when the length l of the English span satisfies l ≥ m-1.
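One way to realize this merging, sketched with hypothetical class names: use the reduced LM states (with wildcards) as part of the item signature, so items that differ only in wildcarded words collide in the chart cell and are recombined. A real decoder would also keep all incoming hyperedges for k-best extraction rather than only the best score.

```java
import java.util.*;

/** Sketch of item merging: items with the same (nonterminal, span, reduced LM states)
 *  signature are recombined into one chart entry. Names are illustrative. */
class ItemMergingSketch {
    record Signature(String nt, int start, int end,
                     List<String> leftState, List<String> rightState) {}
    record Item(Signature sig, double bestScore) {}

    static final String WILDCARD = "*";

    /** Cell keyed by signature: two items that differ only in wildcarded words collide here. */
    static Map<Signature, Item> cell = new HashMap<>();

    static void addItem(Signature sig, double score) {
        cell.merge(sig, new Item(sig, score),
                   (oldIt, newIt) -> newIt.bestScore() > oldIt.bestScore() ? newIt : oldIt);
    }

    public static void main(String[] args) {
        // "below cat | some rat" and "below cats | many rat" both reduce to "below * | * rat"
        Signature merged = new Signature("X", 0, 3,
                                         List.of("below", WILDCARD), List.of(WILDCARD, "rat"));
        addItem(merged, -4.2);
        addItem(merged, -3.7);   // recombined: only the better score is kept
        System.out.println(cell.get(merged).bestScore());   // -3.7
    }
}
```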

Page 8:

Back-off Parameterization of m-gram LMs

LM probability computation: if the m-gram is listed, use its probability directly; otherwise back off, multiplying the backoff weight of the context by the probability under the shortened context.

Observations:
A larger m leads to more backoff.
The default backoff weight is 1: for an m-gram whose backoff weight is not listed, β(·) = 1.

Example bigram entries (log10 probability, bigram, log10 backoff weight):
-4.250922  party files
-4.741889  party filled
-4.250922  party finance     -0.1434139
-4.741889  party financed
-4.741889  party finances    -0.2361806
-4.741889  party financially
-3.33127   party financing   -0.1119054
-3.277455  party finished    -0.4362795
-4.012205  party fired
-4.741889  party fires
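The back-off computation implied by these entries can be sketched as below. This is a minimal illustration of the standard recursion, with map-based storage standing in for JOSHUA's actual LM code: use the listed probability if the m-gram exists, otherwise multiply the context's backoff weight (defaulting to 1, i.e. 0 in log10 space) by the probability under the shortened context.

```java
import java.util.*;

/** Minimal sketch of back-off m-gram probability computation over ARPA-style entries.
 *  The maps and keys are illustrative; a real LM uses tries and stays in log space. */
class BackoffLmSketch {
    // "party finance" -> log10 probability; "party finance" -> log10 backoff weight
    static Map<String, Double> logProb = new HashMap<>();
    static Map<String, Double> logBackoff = new HashMap<>();

    /** log10 P(word | context words), backing off one context word at a time. */
    static double logP(List<String> context, String word) {
        String key = String.join(" ", context) + (context.isEmpty() ? "" : " ") + word;
        if (logProb.containsKey(key)) {
            return logProb.get(key);                     // m-gram is listed
        }
        if (context.isEmpty()) {
            return -99.0;                                // unseen unigram: floor value
        }
        // Back off: beta(context) * P(word | shorter context); unlisted beta defaults to 1 (log10 = 0).
        double beta = logBackoff.getOrDefault(String.join(" ", context), 0.0);
        return beta + logP(context.subList(1, context.size()), word);
    }

    public static void main(String[] args) {
        logProb.put("party finance", -4.250922);
        logBackoff.put("party finance", -0.1434139);
        logProb.put("committee", -3.5);                  // made-up unigram for the demo
        // The trigram "party finance committee" is not listed, so it backs off twice:
        // beta(party finance) + beta(finance) + logP(committee) = -0.1434139 + 0 + (-3.5)
        System.out.println(logP(List.of("party", "finance"), "committee"));
    }
}
```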

Page 9:

Equivalent State Maintenance: Right-side

state words              state prefix             IS-A-PREFIX   equivalent state   future words
e_{l-2} e_{l-1} e_l      e_{l-2} e_{l-1} e_l      no            * e_{l-1} e_l      e_{l+1} e_{l+2} e_{l+3} ...
* e_{l-1} e_l            e_{l-1} e_l              no            * * e_l            e_{l+1} e_{l+2} e_{l+3} ...
* * e_l                  e_l                      no            * * *              e_{l+1} e_{l+2} e_{l+3} ...

For the case of a 4-gram LM:
P(e_{l+1} | e_{l-2} e_{l-1} e_l) = P(e_{l+1} | e_{l-1} e_l) β(e_{l-2} e_{l-1} e_l) = P(e_{l+1} | e_{l-1} e_l),
since the backoff weight is one; the result is independent of e_{l-2}.

IS-A-PREFIX(e_{l-1} e_l) = no implies IS-A-PREFIX(e_{l-1} e_l e_{l+1}) = no.

Why not right to left? Whether a word can be ignored depends on both its left and right sides, which complicates the procedure.
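A sketch of the right-side reduction, with isAPrefix standing in for the IS-A-PREFIX lookup above (an assumed helper, not a documented JOSHUA method):

```java
import java.util.*;

/** Sketch of right-side state reduction: scan the right LM state left to right and wildcard
 *  words that cannot influence probabilities of future (following) words. */
class RightStateReduction {
    interface NgramTable {
        /** True iff some listed n-gram begins with exactly these words. */
        boolean isAPrefix(List<String> words);
    }

    static List<String> reduce(List<String> rightState, NgramTable lm) {
        List<String> state = new ArrayList<>(rightState);
        // While the remaining (non-wildcard) words are not a prefix of any listed n-gram,
        // the oldest remaining word cannot matter: every future m-gram backs off past it
        // with a default backoff weight of one, so replace it with "*".
        int i = 0;
        while (i < state.size() && !lm.isAPrefix(state.subList(i, state.size()))) {
            state.set(i, "*");
            i++;
        }
        return state;
    }

    public static void main(String[] args) {
        NgramTable lm = words -> false;   // stub: nothing is a prefix, so everything reduces
        System.out.println(reduce(List.of("in", "the", "box"), lm));   // [*, *, *]
    }
}
```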

Page 10:

Equivalent State Maintenance: Left-side

future words             state words    state suffix    IS-A-SUFFIX   equivalent state
... e_{-2} e_{-1} e_0    e_1 e_2 e_3    e_1 e_2 e_3     no            e_1 e_2 *
... e_{-2} e_{-1} e_0    e_1 e_2 *      e_1 e_2         no            e_1 * *
... e_{-2} e_{-1} e_0    e_1 * *        e_1             no            * * *

For the case of a 4-gram LM:
P(e_3 | e_0 e_1 e_2) = P(e_3 | e_1 e_2) β(e_0 e_1 e_2).
The remaining backoff factor β(e_0 e_1 e_2) is independent of e_3, so P(e_3 | e_1 e_2) can be finalized now and e_3 dropped from the state. Similarly:
P(e_2 | e_{-1} e_0 e_1) = P(e_2 | e_1) β(e_0 e_1) β(e_{-1} e_0 e_1)

Finalized probability:
P(e_1 | e_{-2} e_{-1} e_0) = P(e_1) β(e_0) β(e_{-1} e_0) β(e_{-2} e_{-1} e_0)

Remember to factor in the backoff weights later.

Why not left to right? Whether a word can be ignored depends on both its left and right sides, which complicates the procedure.
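And the mirror-image left-side reduction, with isASuffix standing in for the IS-A-SUFFIX lookup (again an assumed helper rather than a documented JOSHUA method):

```java
import java.util.*;

/** Sketch of left-side state reduction: scan the left LM state right to left and wildcard
 *  words whose probabilities no longer depend on unknown preceding words. */
class LeftStateReduction {
    interface NgramTable {
        /** True iff some listed n-gram ends with exactly these words. */
        boolean isASuffix(List<String> words);
    }

    static List<String> reduce(List<String> leftState, NgramTable lm) {
        List<String> state = new ArrayList<>(leftState);
        // While the remaining (non-wildcard) words are not a suffix of any listed n-gram,
        // the newest remaining word's probability can be finalized now, e.g.
        // P(e3 | e0 e1 e2) = P(e3 | e1 e2) * beta(e0 e1 e2); the beta factors that still
        // depend on future words are remembered and multiplied in later.
        int i = state.size() - 1;
        while (i >= 0 && !lm.isASuffix(state.subList(0, i + 1))) {
            state.set(i, "*");
            i--;
        }
        return state;
    }

    public static void main(String[] args) {
        NgramTable lm = words -> words.size() <= 2;   // stub: only short n-grams are suffixes
        System.out.println(reduce(List.of("the", "olympic", "game"), lm));   // [the, olympic, *]
    }
}
```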

Page 11:

Equivalent State Maintenance: summary

[Table: original cost function vs. modified cost function, covering the finalized probability, the estimated probability, and state extraction.]

Page 12:

Experimental Results: Decoding Speed

System training: the task is Chinese-to-English translation. Sub-sampling a bitext of about 3M sentence pairs yields 570k sentence pairs. LM training data: Gigaword and the English side of the bitext.

Decoding speed: number of rules 3M; number of m-grams 49M.
38 times faster than the baseline!

Page 13:

Experimental Results: Distributed LM

Distributed language model: eight 7-gram LMs. Decoding speed: 12.2 sec/sent.

Page 14:

Experimental Results: Equivalent LM States

[Figure: search effort versus search quality, with and without equivalent LM state maintenance.]

Sparse LM: a 7-gram LM built on about 19M words.
Dense LM: a 3-gram LM built on about 130M words. With the dense LM, equivalent LM state maintenance is slower than the regular method: backoff happens less frequently, and suffix/prefix information lookup is inefficient.

Page 15:

Summary

We describe a scalable parsing-based MT decoder. The decoder has been successfully used for decoding millions of sentences in a large-scale discriminative training task.

We propose a method to maintain equivalent LM states.

The decoder is available at http://www.cs.jhu.edu/~zfli/

Page 16:

Acknowledgements

Thanks to Philip Resnik for letting me use the UMD Python decoder.
Thanks to UMD MT group members for very helpful discussions.
Thanks to David Chiang for Hiero and his original implementation in Python.

Page 17:

Thank you!

Page 18:
Page 19:

Grammar Formalism

Synchronous context-free grammar (SCFG):
Ts: a set of source-language terminal symbols
Tt: a set of target-language terminal symbols
N: a shared set of nonterminal symbols
A set of rules of the form shown below; a typical rule is also given below.
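A standard statement of the SCFG rule form, consistent with the example rules used on the earlier slides:

```latex
% X is a nonterminal; \gamma and \alpha are strings over nonterminals and source/target
% terminals; \sim is a one-to-one correspondence between the nonterminals of \gamma and \alpha.
% (Rendering the Chinese terminal requires CJK support.)
\[
  X \;\rightarrow\; \langle \gamma,\ \alpha,\ \sim \rangle,
  \qquad \gamma \in (N \cup T_s)^{*},\quad \alpha \in (N \cup T_t)^{*}
\]
% A typical rule, taken from the examples on the earlier slides:
\[
  X \;\rightarrow\; \langle\, X_{0}\ \text{的}\ X_{1},\ \ X_{1}\ \text{of}\ X_{0} \,\rangle
\]
```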

Page 20:

Chart-parsing

Grammar formalism: synchronous context-free grammar (SCFG)
The decoding task is defined as shown below.
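A hedged reconstruction of the decoding objective: a standard formulation for SCFG-based decoding with an integrated m-gram LM, which these slides appear to follow, selects the best derivation of the source sentence and reads off its target yield:

```latex
% Assumes amsmath; D(f) denotes the set of SCFG derivations of the source sentence f,
% w(r) the weight of rule r, and e(d) the target yield of derivation d.
\[
  d^{*} \;=\; \operatorname*{arg\,max}_{d \in D(f)}
        \;\Bigl(\prod_{r \in d} w(r)\Bigr) \cdot P_{\mathrm{LM}}\bigl(e(d)\bigr),
  \qquad e^{*} \;=\; e(d^{*}).
\]
```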

Chart parsing maintains a chart, which contains an array of cells (or bins); each cell maintains a list of items. Parsing starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved.
The hypotheses are stored in a structure called a hypergraph.

Page 21:

m-gram LM Integration: Three Functions

Accumulate probability, estimate future cost, state extraction.

[Cost function: formulas for the finalized probability, the estimated probability, and state extraction.]

Page 22:

Parallel and Distributed Decoding

Parallel decoding: divide the test set into multiple parts; each part is decoded by a separate thread; the threads share the language/translation models in memory.

Distributed language model (DLM):
Training: divide the corpora into multiple parts, train an LM on each part, then find the optimal weights among the LMs by maximizing the likelihood of a dev set.
Decoding: load the LMs into different servers; the decoder remotely calls the servers to obtain the probabilities, then interpolates the probabilities on the fly. To save communication overhead, a cache is maintained (see the sketch below).
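A sketch of the distributed-LM client side described above, with hypothetical names (LmServer, lookup) rather than JOSHUA's actual remote API: the decoder queries each LM server, interpolates with the tuned weights, and caches the result to save communication overhead.

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of a distributed-LM client: query several LM servers, interpolate their
 *  probabilities with tuned weights, and cache results to save communication. */
class DistributedLmClientSketch {
    interface LmServer {
        /** Returns P(word | context) from one remotely hosted LM. */
        double lookup(List<String> context, String word);
    }

    private final List<LmServer> servers;
    private final double[] weights;  // interpolation weights tuned on a dev set
    private final Map<String, Double> cache = new ConcurrentHashMap<>();

    DistributedLmClientSketch(List<LmServer> servers, double[] weights) {
        this.servers = servers;
        this.weights = weights;
    }

    /** Interpolated probability, with a cache keyed by the full m-gram. */
    double prob(List<String> context, String word) {
        String key = String.join(" ", context) + " " + word;
        return cache.computeIfAbsent(key, k -> {
            double p = 0.0;
            for (int i = 0; i < servers.size(); i++) {
                p += weights[i] * servers.get(i).lookup(context, word);  // remote call
            }
            return p;
        });
    }
}
```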

Page 23:

Chart-parsing

The decoding task is defined as on Page 20.
Chart parsing maintains a chart, which contains an array of cells (or bins); each cell maintains a list of items. Parsing starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved.
The hypotheses are stored in a structure called a hypergraph.
State of an item: source span, left-side nonterminal symbol, and left/right LM states.
Decoding complexity

Page 24:

Hypergraph

A hypergraph consists of a set of nodes and a set of hyperedges; in parsing, they correspond to items and deductive steps, respectively. Roughly, a hyperedge can be thought of as a rule with pointers to its antecedent items.
State of an item: source span, left-side nonterminal symbol, and left/right LM states.

[Figure: the hypergraph for 垫子0 上1 的2 猫3 from Page 4, repeated with item states labeled.]

Items:
X | 3, 4 | a cat | NA
X | 0, 2 | the mat | NA
X | 0, 4 | a cat | the mat
X | 0, 4 | the mat | a cat
Goal item: S

Rules on the hyperedges:
X → (猫, a cat)
X → (垫子 上, the mat)
X → (X0 的 X1, X0 X1)
X → (X0 的 X1, X0 's X1)
X → (X0 的 X1, X1 of X0)
X → (X0 的 X1, X1 on X0)
S → (X0, X0)