A Scalable Decoder for Parsing-based Machine Translation with Equivalent Language Model State Maintenance
Zhifei Li and Sanjeev Khudanpur, Johns Hopkins University

Page 1:

A Scalable Decoder for Parsing-based Machine Translation with Equivalent Language Model State Maintenance

Zhifei Li and Sanjeev Khudanpur, Johns Hopkins University

Page 2:

JOSHUA: a scalable open-source parsing-based MT decoder

Written in the Java language
Chart parsing, beam and cube pruning (Chiang, 2007)
K-best extraction over a hypergraph
m-gram LM integration
Parallel decoding
Distributed LM (Zhang et al., 2006; Brants et al., 2007)
Equivalent LM state maintenance (new!)
We plan to add more functions soon

Page 3:

Chart-parsing

Grammar formalism: synchronous context-free grammar (SCFG)
Chart parsing is bottom-up. The parser maintains a chart, which contains an array of cells (or bins); each cell maintains a list of items. Parsing starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved.
The hypotheses are stored in a hypergraph.
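The chart organization described above can be summarized in a short sketch. This is a minimal illustration in Java (the language JOSHUA is written in), with hypothetical class names and rule matching reduced to a toy span-width test; it is not the actual JOSHUA code.

```java
import java.util.*;

/** Minimal sketch of the bottom-up chart described above: cells indexed by source span,
 *  each holding the items proved over that span. Names and rule matching are illustrative. */
class ChartParsingSketch {
    record Item(String nonterminal, int start, int end) {}
    record Rule(String lhs, int spanWidth) {}            // toy rule: covers exactly spanWidth words

    public static void main(String[] args) {
        int n = 4;                                        // source sentence length
        List<Rule> rules = List.of(new Rule("X", 1), new Rule("X", 2), new Rule("S", 4));
        Map<String, List<Item>> chart = new HashMap<>();  // key "i,j" -> cell for span [i, j)

        // Axioms first (width-1 spans), then wider spans, until the goal span [0, n).
        for (int width = 1; width <= n; width++) {
            for (int i = 0; i + width <= n; i++) {
                int j = i + width;
                List<Item> cell = new ArrayList<>();
                for (Rule r : rules) {
                    if (r.spanWidth() == width) {         // an "inference rule" proves a new item
                        cell.add(new Item(r.lhs(), i, j));
                    }
                }
                chart.put(i + "," + j, cell);
            }
        }
        boolean goalProved = chart.get("0," + n).stream()
                                  .anyMatch(it -> it.nonterminal().equals("S"));
        System.out.println("goal item proved: " + goalProved);
    }
}
```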

Page 4:

Hypergraph

[Figure: hypergraph for the source sentence 垫子0 上1 的2 猫3 ("the cat on the mat"); nodes are items, hyperedges connect antecedent items to a consequent item via a rule.]

Items (source span with left/right LM state):
X | 3, 4 | a cat | NA
X | 0, 2 | the mat | NA
X | 0, 4 | a cat | the mat
X | 0, 4 | the mat | a cat
Goal item: S

Rules on the hyperedges:
X → (猫, a cat)
X → (垫子 上, the mat)
X → (X0 的 X1, X0 X1)
X → (X0 的 X1, X0 's X1)
X → (X0 的 X1, X1 of X0)
X → (X0 的 X1, X1 on X0)
S → (X0, X0)
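A rough sketch of the data structures behind such a hypergraph, assuming hypothetical class names rather than JOSHUA's real ones: each item records its nonterminal, source span, and left/right LM state words, and each hyperedge records the rule applied plus pointers to its antecedent items.

```java
import java.util.List;

/** Sketch of hypergraph structures implied by the slide; names are illustrative. */
class HypergraphSketch {
    /** A node: an item identified by nonterminal, source span, and left/right LM state words. */
    record Item(String nonterminal, int start, int end,
                List<String> leftLmState, List<String> rightLmState,
                List<HyperEdge> incomingEdges) {}

    /** A hyperedge: roughly a rule plus pointers to the antecedent items it combined. */
    record HyperEdge(String rule, List<Item> antecedents, double score) {}

    public static void main(String[] args) {
        Item cat = new Item("X", 3, 4, List.of("a", "cat"), List.of(), List.of());
        Item mat = new Item("X", 0, 2, List.of("the", "mat"), List.of(), List.of());
        HyperEdge e = new HyperEdge("X -> (X0 的 X1, X1 on X0)", List.of(mat, cat), -2.3);
        Item full = new Item("X", 0, 4, List.of("a", "cat"), List.of("the", "mat"), List.of(e));
        System.out.println(full);
    }
}
```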

Page 5:

Hypergraph and Trees

[Figure: four derivation trees extracted from the hypergraph for 垫子0 上1 的2 猫3. Each tree uses S → (X0, X0), X → (猫, a cat), and X → (垫子 上, the mat), plus one of four rules for 的:]

X → (X0 的 X1, X0 X1)      yields "the mat a cat"
X → (X0 的 X1, X0 's X1)   yields "the mat 's a cat"
X → (X0 的 X1, X1 on X0)   yields "a cat on the mat"
X → (X0 的 X1, X1 of X0)   yields "a cat of the mat"

Page 6:

How to Integrate an m-gram LM?

[Figure: LM integration while translating the source sentence 奥运会0 将1 在2 中国3 的4 北京5 举行。6 as "the olympic game will be held in beijing of china ."]

Rules:
X → (奥运会, the olympic game)
X → (北京, beijing)
X → (中国, china)
X → (X0 的 X1, X1 of X0)
X → (将 在 X0 举行。, will be held in X0 .)
S → (X0, X0)
S → (S0 X1, S0 X1)
S → (<s> S0 </s>, <s> S0 </s>)

Items (source span with left/right LM state):
X | 5, 6 | beijing | NA
X | 3, 4 | china | NA
X | 3, 6 | beijing of | of china        (new 3-gram: beijing of china)
X | 0, 1 | the olympic | olympic game
X | 1, 7 | will be | china .            (new 3-grams: will be held, be held in, held in beijing, in beijing of)
S | 0, 1 | the olympic | olympic game
S | 0, 7 | the olympic | china .
S | 0, 7 | <s> the | . </s>

Three functions: accumulate probability, estimate future cost, state extraction.
Example: accumulated probability 0.04 = 0.4 * 0.2 * 0.5; future probability P(beijing of) = 0.01; estimated total probability 0.01 * 0.04 = 0.004.
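The three functions listed above might be sketched as follows. The LanguageModel interface and all method names here are assumptions for illustration, not JOSHUA's actual API: accumulate scores only the m-grams completed when two translation fragments are concatenated, futureCost scores the left-state words with whatever shortened context is available, and extractState keeps the first and last m-1 words.

```java
import java.util.*;

/** Sketch of the three LM-integration functions (accumulate probability, estimate future
 *  cost, state extraction). Interface and names are illustrative, not JOSHUA's API. */
class LmIntegrationSketch {
    interface LanguageModel {
        double prob(List<String> context, String word);   // P(word | up to m-1 preceding words)
        int order();                                       // m
    }

    /** Accumulate probability: score only the m-grams completed by concatenating the two
     *  fragments, i.e. those whose context crosses the boundary between left and right. */
    static double accumulate(LanguageModel lm, List<String> left, List<String> right) {
        List<String> joined = new ArrayList<>(left);
        joined.addAll(right);
        int m = lm.order();
        double p = 1.0;
        for (int i = left.size(); i < joined.size() && i < left.size() + m - 1; i++) {
            List<String> ctx = joined.subList(Math.max(0, i - m + 1), i);
            p *= lm.prob(ctx, joined.get(i));              // e.g. the new 3-gram "beijing of china"
        }
        return p;
    }

    /** Estimate future cost: score the left-state words with shortened context,
     *  e.g. P(beijing of) = P(beijing) * P(of | beijing). */
    static double futureCost(LanguageModel lm, List<String> leftState) {
        double p = 1.0;
        for (int i = 0; i < leftState.size(); i++) {
            p *= lm.prob(leftState.subList(0, i), leftState.get(i));
        }
        return p;
    }

    /** State extraction: keep the first and last m-1 words of the item's translation
     *  (a real decoder marks the right state NA when the string is shorter than m-1). */
    static List<List<String>> extractState(List<String> words, int m) {
        int k = Math.min(m - 1, words.size());
        return List.of(words.subList(0, k), words.subList(words.size() - k, words.size()));
    }

    public static void main(String[] args) {
        LanguageModel lm = new LanguageModel() {
            public double prob(List<String> context, String word) { return 0.1; }  // stub LM
            public int order() { return 3; }
        };
        System.out.println(accumulate(lm, List.of("will", "be"), List.of("held", "in", "beijing")));
        System.out.println(futureCost(lm, List.of("beijing", "of")));
        System.out.println(extractState(List.of("the", "olympic", "game", "will", "be"), 3));
    }
}
```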

Page 7:

Equivalent State Maintenance: overview

In a straightforward implementation, different LM state words lead to different items, for example:

X → (在 X0 的 X1 下, below X1 of X0)      X | 0, 3 | below cat | some rat
X → (在 X0 的 X1 下, below X1 of X0)      X | 0, 3 | below cats | many rat
X → (在 X0 的 X1 下, under X1 of X0)      X | 0, 3 | under cat | some rat
X → (在 X0 的 X1 下, below X1 of X0)      X | 0, 3 | below cat | many rat
X → (在 X0 的 X1 下, under the X1 of X0)
X → (在 X0 的 X1 下, below the X1 of X0)

We merge multiple items into a single item by replacing some LM state words with an asterisk wildcard, e.g.:

X | 0, 3 | below * | * rat

By merging items, we can explore a larger hypothesis space using less time.
We only merge items when the length l of the English span satisfies l ≥ m-1.
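One way to realize this merging, sketched with hypothetical class names: use the reduced LM states (with wildcards) as part of the item signature, so items that differ only in wildcarded words collide in the chart cell and are recombined. A real decoder would also keep all incoming hyperedges for k-best extraction rather than only the best score.

```java
import java.util.*;

/** Sketch of item merging: items with the same (nonterminal, span, reduced LM states)
 *  signature are recombined into one chart entry. Names are illustrative. */
class ItemMergingSketch {
    record Signature(String nt, int start, int end,
                     List<String> leftState, List<String> rightState) {}
    record Item(Signature sig, double bestScore) {}

    static final String WILDCARD = "*";

    /** Cell keyed by signature: two items that differ only in wildcarded words collide here. */
    static Map<Signature, Item> cell = new HashMap<>();

    static void addItem(Signature sig, double score) {
        cell.merge(sig, new Item(sig, score),
                   (oldIt, newIt) -> newIt.bestScore() > oldIt.bestScore() ? newIt : oldIt);
    }

    public static void main(String[] args) {
        // "below cat | some rat" and "below cats | many rat" both reduce to "below * | * rat"
        Signature merged = new Signature("X", 0, 3,
                                         List.of("below", WILDCARD), List.of(WILDCARD, "rat"));
        addItem(merged, -4.2);
        addItem(merged, -3.7);   // recombined: only the better score is kept
        System.out.println(cell.get(merged).bestScore());   // -3.7
    }
}
```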

Page 8:

Back-off Parameterization of m-gram LMs

LM probability computation: if the m-gram is listed, use its probability directly; otherwise back off, multiplying the backoff weight of the context by the probability under the shortened context.

Observations:
A larger m leads to more backoff.
The default backoff weight is 1: for an m-gram whose backoff weight is not listed, β(·) = 1.

Example bigram entries (log10 probability, bigram, log10 backoff weight):
-4.250922  party files
-4.741889  party filled
-4.250922  party finance     -0.1434139
-4.741889  party financed
-4.741889  party finances    -0.2361806
-4.741889  party financially
-3.33127   party financing   -0.1119054
-3.277455  party finished    -0.4362795
-4.012205  party fired
-4.741889  party fires
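The back-off computation implied by these entries can be sketched as below. This is a minimal illustration of the standard recursion, with map-based storage standing in for JOSHUA's actual LM code: use the listed probability if the m-gram exists, otherwise multiply the context's backoff weight (defaulting to 1, i.e. 0 in log10 space) by the probability under the shortened context.

```java
import java.util.*;

/** Minimal sketch of back-off m-gram probability computation over ARPA-style entries.
 *  The maps and keys are illustrative; a real LM uses tries and stays in log space. */
class BackoffLmSketch {
    // "party finance" -> log10 probability; "party finance" -> log10 backoff weight
    static Map<String, Double> logProb = new HashMap<>();
    static Map<String, Double> logBackoff = new HashMap<>();

    /** log10 P(word | context words), backing off one context word at a time. */
    static double logP(List<String> context, String word) {
        String key = String.join(" ", context) + (context.isEmpty() ? "" : " ") + word;
        if (logProb.containsKey(key)) {
            return logProb.get(key);                     // m-gram is listed
        }
        if (context.isEmpty()) {
            return -99.0;                                // unseen unigram: floor value
        }
        // Back off: beta(context) * P(word | shorter context); unlisted beta defaults to 1 (log10 = 0).
        double beta = logBackoff.getOrDefault(String.join(" ", context), 0.0);
        return beta + logP(context.subList(1, context.size()), word);
    }

    public static void main(String[] args) {
        logProb.put("party finance", -4.250922);
        logBackoff.put("party finance", -0.1434139);
        logProb.put("committee", -3.5);                  // made-up unigram for the demo
        // The trigram "party finance committee" is not listed, so it backs off twice:
        // beta(party finance) + beta(finance) + logP(committee) = -0.1434139 + 0 + (-3.5)
        System.out.println(logP(List.of("party", "finance"), "committee"));
    }
}
```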

Page 9:

Equivalent State Maintenance: Right-side

state words              state prefix             IS-A-PREFIX   equivalent state   future words
e_{l-2} e_{l-1} e_l      e_{l-2} e_{l-1} e_l      no            * e_{l-1} e_l      e_{l+1} e_{l+2} e_{l+3} ...
* e_{l-1} e_l            e_{l-1} e_l              no            * * e_l            e_{l+1} e_{l+2} e_{l+3} ...
* * e_l                  e_l                      no            * * *              e_{l+1} e_{l+2} e_{l+3} ...

For the case of a 4-gram LM:
P(e_{l+1} | e_{l-2} e_{l-1} e_l) = P(e_{l+1} | e_{l-1} e_l) β(e_{l-2} e_{l-1} e_l) = P(e_{l+1} | e_{l-1} e_l),
since the backoff weight is one; the result is independent of e_{l-2}.

IS-A-PREFIX(e_{l-1} e_l) = no implies IS-A-PREFIX(e_{l-1} e_l e_{l+1}) = no.

Why not right to left? Whether a word can be ignored depends on both its left and right sides, which complicates the procedure.
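A sketch of the right-side reduction, with isAPrefix standing in for the IS-A-PREFIX lookup above (an assumed helper, not a documented JOSHUA method):

```java
import java.util.*;

/** Sketch of right-side state reduction: scan the right LM state left to right and wildcard
 *  words that cannot influence probabilities of future (following) words. */
class RightStateReduction {
    interface NgramTable {
        /** True iff some listed n-gram begins with exactly these words. */
        boolean isAPrefix(List<String> words);
    }

    static List<String> reduce(List<String> rightState, NgramTable lm) {
        List<String> state = new ArrayList<>(rightState);
        // While the remaining (non-wildcard) words are not a prefix of any listed n-gram,
        // the oldest remaining word cannot matter: every future m-gram backs off past it
        // with a default backoff weight of one, so replace it with "*".
        int i = 0;
        while (i < state.size() && !lm.isAPrefix(state.subList(i, state.size()))) {
            state.set(i, "*");
            i++;
        }
        return state;
    }

    public static void main(String[] args) {
        NgramTable lm = words -> false;   // stub: nothing is a prefix, so everything reduces
        System.out.println(reduce(List.of("in", "the", "box"), lm));   // [*, *, *]
    }
}
```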

Page 10:

Equivalent State Maintenance: Left-side

future words             state words    state suffix    IS-A-SUFFIX   equivalent state
... e_{-2} e_{-1} e_0    e_1 e_2 e_3    e_1 e_2 e_3     no            e_1 e_2 *
... e_{-2} e_{-1} e_0    e_1 e_2 *      e_1 e_2         no            e_1 * *
... e_{-2} e_{-1} e_0    e_1 * *        e_1             no            * * *

For the case of a 4-gram LM:
P(e_3 | e_0 e_1 e_2) = P(e_3 | e_1 e_2) β(e_0 e_1 e_2).
The remaining backoff factor β(e_0 e_1 e_2) is independent of e_3, so P(e_3 | e_1 e_2) can be finalized now and e_3 dropped from the state. Similarly:
P(e_2 | e_{-1} e_0 e_1) = P(e_2 | e_1) β(e_0 e_1) β(e_{-1} e_0 e_1)

Finalized probability:
P(e_1 | e_{-2} e_{-1} e_0) = P(e_1) β(e_0) β(e_{-1} e_0) β(e_{-2} e_{-1} e_0)

Remember to factor in the backoff weights later.

Why not left to right? Whether a word can be ignored depends on both its left and right sides, which complicates the procedure.
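And the mirror-image left-side reduction, with isASuffix standing in for the IS-A-SUFFIX lookup (again an assumed helper rather than a documented JOSHUA method):

```java
import java.util.*;

/** Sketch of left-side state reduction: scan the left LM state right to left and wildcard
 *  words whose probabilities no longer depend on unknown preceding words. */
class LeftStateReduction {
    interface NgramTable {
        /** True iff some listed n-gram ends with exactly these words. */
        boolean isASuffix(List<String> words);
    }

    static List<String> reduce(List<String> leftState, NgramTable lm) {
        List<String> state = new ArrayList<>(leftState);
        // While the remaining (non-wildcard) words are not a suffix of any listed n-gram,
        // the newest remaining word's probability can be finalized now, e.g.
        // P(e3 | e0 e1 e2) = P(e3 | e1 e2) * beta(e0 e1 e2); the beta factors that still
        // depend on future words are remembered and multiplied in later.
        int i = state.size() - 1;
        while (i >= 0 && !lm.isASuffix(state.subList(0, i + 1))) {
            state.set(i, "*");
            i--;
        }
        return state;
    }

    public static void main(String[] args) {
        NgramTable lm = words -> words.size() <= 2;   // stub: only short n-grams are suffixes
        System.out.println(reduce(List.of("the", "olympic", "game"), lm));   // [the, olympic, *]
    }
}
```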

Page 11:

Equivalent State Maintenance: summary

[Table: original cost function vs. modified cost function, covering the finalized probability, the estimated probability, and state extraction.]

Page 12:

Experimental Results: Decoding Speed

System training: the task is Chinese-to-English translation. Sub-sampling a bitext of about 3M sentence pairs yields 570k sentence pairs. LM training data: Gigaword and the English side of the bitext.

Decoding speed: number of rules 3M; number of m-grams 49M.
38 times faster than the baseline!

Page 13:

Experimental Results: Distributed LM

Distributed language model: eight 7-gram LMs. Decoding speed: 12.2 sec/sent.

Page 14:

Experimental Results: Equivalent LM States

[Figure: search effort versus search quality, with and without equivalent LM state maintenance.]

Sparse LM: a 7-gram LM built on about 19M words.
Dense LM: a 3-gram LM built on about 130M words. With the dense LM, equivalent LM state maintenance is slower than the regular method: backoff happens less frequently, and suffix/prefix information lookup is inefficient.

Page 15:

Summary

We describe a scalable parsing-based MT decoder. The decoder has been successfully used for decoding millions of sentences in a large-scale discriminative training task.

We propose a method to maintain equivalent LM states.

The decoder is available at http://www.cs.jhu.edu/~zfli/

Page 16:

Acknowledgements

Thanks to Philip Resnik for letting me use the UMD Python decoder.
Thanks to UMD MT group members for very helpful discussions.
Thanks to David Chiang for Hiero and his original implementation in Python.

Page 17:

Thank you!

Page 18:
Page 19:

Grammar Formalism

Synchronous context-free grammar (SCFG):
Ts: a set of source-language terminal symbols
Tt: a set of target-language terminal symbols
N: a shared set of nonterminal symbols
A set of rules of the form shown below; a typical rule is also given below.
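A standard statement of the SCFG rule form, consistent with the example rules used on the earlier slides:

```latex
% X is a nonterminal; \gamma and \alpha are strings over nonterminals and source/target
% terminals; \sim is a one-to-one correspondence between the nonterminals of \gamma and \alpha.
% (Rendering the Chinese terminal requires CJK support.)
\[
  X \;\rightarrow\; \langle \gamma,\ \alpha,\ \sim \rangle,
  \qquad \gamma \in (N \cup T_s)^{*},\quad \alpha \in (N \cup T_t)^{*}
\]
% A typical rule, taken from the examples on the earlier slides:
\[
  X \;\rightarrow\; \langle\, X_{0}\ \text{的}\ X_{1},\ \ X_{1}\ \text{of}\ X_{0} \,\rangle
\]
```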

Page 20:

Chart-parsing

Grammar formalism: synchronous context-free grammar (SCFG)
The decoding task is defined as shown below.
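A hedged reconstruction of the decoding objective: a standard formulation for SCFG-based decoding with an integrated m-gram LM, which these slides appear to follow, selects the best derivation of the source sentence and reads off its target yield:

```latex
% Assumes amsmath; D(f) denotes the set of SCFG derivations of the source sentence f,
% w(r) the weight of rule r, and e(d) the target yield of derivation d.
\[
  d^{*} \;=\; \operatorname*{arg\,max}_{d \in D(f)}
        \;\Bigl(\prod_{r \in d} w(r)\Bigr) \cdot P_{\mathrm{LM}}\bigl(e(d)\bigr),
  \qquad e^{*} \;=\; e(d^{*}).
\]
```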

Chart parsing maintains a chart, which contains an array of cells (or bins); each cell maintains a list of items. Parsing starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved.
The hypotheses are stored in a structure called a hypergraph.

Page 21:

m-gram LM Integration: Three Functions

Accumulate probability, estimate future cost, state extraction.

[Cost function: formulas for the finalized probability, the estimated probability, and state extraction.]

Page 22:

Parallel and Distributed Decoding

Parallel decoding: divide the test set into multiple parts; each part is decoded by a separate thread; the threads share the language/translation models in memory.

Distributed language model (DLM):
Training: divide the corpora into multiple parts, train an LM on each part, then find the optimal weights among the LMs by maximizing the likelihood of a dev set.
Decoding: load the LMs into different servers; the decoder remotely calls the servers to obtain the probabilities, then interpolates the probabilities on the fly. To save communication overhead, a cache is maintained (see the sketch below).
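A sketch of the distributed-LM client side described above, with hypothetical names (LmServer, lookup) rather than JOSHUA's actual remote API: the decoder queries each LM server, interpolates with the tuned weights, and caches the result to save communication overhead.

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of a distributed-LM client: query several LM servers, interpolate their
 *  probabilities with tuned weights, and cache results to save communication. */
class DistributedLmClientSketch {
    interface LmServer {
        /** Returns P(word | context) from one remotely hosted LM. */
        double lookup(List<String> context, String word);
    }

    private final List<LmServer> servers;
    private final double[] weights;  // interpolation weights tuned on a dev set
    private final Map<String, Double> cache = new ConcurrentHashMap<>();

    DistributedLmClientSketch(List<LmServer> servers, double[] weights) {
        this.servers = servers;
        this.weights = weights;
    }

    /** Interpolated probability, with a cache keyed by the full m-gram. */
    double prob(List<String> context, String word) {
        String key = String.join(" ", context) + " " + word;
        return cache.computeIfAbsent(key, k -> {
            double p = 0.0;
            for (int i = 0; i < servers.size(); i++) {
                p += weights[i] * servers.get(i).lookup(context, word);  // remote call
            }
            return p;
        });
    }
}
```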

Page 23:

Chart-parsing

The decoding task is defined as on Page 20.
Chart parsing maintains a chart, which contains an array of cells (or bins); each cell maintains a list of items. Parsing starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved.
The hypotheses are stored in a structure called a hypergraph.
State of an item: source span, left-side nonterminal symbol, and left/right LM states.
Decoding complexity

Page 24:

Hypergraph

A hypergraph consists of a set of nodes and a set of hyperedges; in parsing, they correspond to items and deductive steps, respectively. Roughly, a hyperedge can be thought of as a rule with pointers to its antecedent items.
State of an item: source span, left-side nonterminal symbol, and left/right LM states.

[Figure: the hypergraph for 垫子0 上1 的2 猫3 from Page 4, repeated with item states labeled.]

Items:
X | 3, 4 | a cat | NA
X | 0, 2 | the mat | NA
X | 0, 4 | a cat | the mat
X | 0, 4 | the mat | a cat
Goal item: S

Rules on the hyperedges:
X → (猫, a cat)
X → (垫子 上, the mat)
X → (X0 的 X1, X0 X1)
X → (X0 的 X1, X0 's X1)
X → (X0 的 X1, X1 of X0)
X → (X0 的 X1, X1 on X0)
S → (X0, X0)