A Syntax-Driven Bracketing Model for Phrase-Based Translation
Deyi Xiong, et al.
ACL 2009
把 7 月 11 日 設立 為 航海 節
Introduction
• Machine Translation
– Chinese to English
– Chinese
• 把 7 月 11 日 設立 為 航海 節
• An ideal case:
to establish July 11 as Sailing Festival day
Wrong Linguistic Structure
• 航海 節 is a syntactic constituent
把 7 月 11 日 設立 為 航海 節
to set up for navigation on July 11 knots
A Naive Solution
• Employ syntactic constraints
– Fully respect linguistic structures
把 今天 設立 為 航海 節
A Naive Solution (2)
• Unfortunately, it damages performance
– Non-syntactic translations are sometimes useful
establish today as Sailing Festival day
Syntax-Driven Bracketing Model
• SDB model
• Translation unit is more important
– Whether it is syntactic or non-syntactic
• Include but not limited to constituent matching/violation
• Protect the strength of the phrase-based system
Translation Unit
• Bracketable source phrase and its corresponding translation
• Bracketable
– A source phrase is bracketable
• Its translation is contiguous
– A pair of neighboring phrases is bracketable
• Their translations are contiguous after they are combined
establish today as
Translation Unit Examples
• Bracketable
把 今天 設立 為
establish today as
• 把 今天 設立 and 為 are bracketable
• 把 今天 設立 為 is bracketable
Translation Unit Examples (2)
• Unbracketable
• 設立 and 為 are unbracketable
• 設立 為 is unbracketable
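The bracketable and unbracketable examples can be checked mechanically once a word alignment is given. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the alignment for 把 今天 設立 為 / establish today as (with 把 left unaligned) is assumed for illustration. It also assumes that "contiguous" means no target word inside the translated span is linked to a source word outside the phrase.

```python
def is_bracketable(alignment, start, end):
    """Return True if source span [start, end] (inclusive) has a contiguous translation.

    alignment is a set of (source_index, target_index) links.
    """
    targets = {t for s, t in alignment if start <= s <= end}
    if not targets:
        # Assumption: a span with no aligned word is not counted as bracketable.
        return False
    lo, hi = min(targets), max(targets)
    # Contiguous: every link landing inside [lo, hi] starts inside the source span.
    return all(start <= s <= end for s, t in alignment if lo <= t <= hi)


def pair_is_bracketable(alignment, start, split, end):
    """Neighboring phrases [start, split] and [split + 1, end] are treated as
    bracketable when the combined span has a contiguous translation."""
    return is_bracketable(alignment, start, end)


# Source: 把(0) 今天(1) 設立(2) 為(3); target: establish(0) today(1) as(2).
# Assumed links: 今天-today, 設立-establish, 為-as.
ALIGNMENT = {(1, 1), (2, 0), (3, 2)}

examples = {
    "把 今天 設立 為": is_bracketable(ALIGNMENT, 0, 3),
    "把 今天 設立 + 為": pair_is_bracketable(ALIGNMENT, 0, 2, 3),
    "設立 為": is_bracketable(ALIGNMENT, 2, 3),
    "設立 + 為": pair_is_bracketable(ALIGNMENT, 2, 2, 3),
}
```

Under the assumed alignment, 設立 為 fails because "today", which sits between "establish" and "as", is linked to 今天 outside the phrase.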
Bracketing Instances Extraction
• Extract bracketable and unbracketable instances from training data
– Aligned sentence pair + parsed source sentence
• Estimate whether a source phrase is bracketable at run time
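The run-time estimate can be pictured as a two-class classifier over the extracted instances. The sketch below assumes a maximum entropy (log-linear) model, in line with the ME trainer listed in the experiment setup; the function name, the feature strings and the weights are all hypothetical and not taken from the paper.

```python
import math


def bracketable_probability(active_features, weights):
    """P(bracketable | features) under a two-class log-linear model.

    weights maps (feature, label) pairs to real values; missing pairs count as 0.
    """
    labels = ("bracketable", "unbracketable")
    scores = {
        label: sum(weights.get((f, label), 0.0) for f in active_features)
        for label in labels
    }
    z = sum(math.exp(score) for score in scores.values())
    return math.exp(scores["bracketable"]) / z


# Hypothetical weights, for illustration only.
toy_weights = {("CBMF=exact", "bracketable"): 1.0}
p = bracketable_probability(["CBMF=exact"], toy_weights)
```

With no informative features the model returns 0.5, so only features with learned weights move the estimate.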
SDB Features
Rule Features
• Rule Features (RF)
– CFG rule
– Horizontal context
Rule Features (2)
S1: ADVP AD
S2: VP VV AS NP
S: VP ADVP VP
Path Features
• Path features (PF)
– Path to roots
• S1 to the root of S
• S2 to the root of S
• S to the root of this tree
– Vertical context
Path Features (2)
S1: ADVP VP
S2: VP VP
S: VP IP
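The rule and path features listed on these slides can be read off a parse tree directly. The sketch below is an illustration under assumptions: the `Node` class and both function names are hypothetical, and the toy tree is reconstructed only so that it agrees with the labels shown on the slides (the real tree may have more nodes, for example further children under IP).

```python
class Node:
    """A parse-tree node that keeps a link to its parent."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def rule_feature(node):
    """CFG rule at a node: its label followed by its children's labels."""
    return " ".join([node.label] + [c.label for c in node.children])


def path_feature(node, ancestor):
    """Labels on the path from node up to ancestor, both included."""
    labels = [node.label]
    while node is not ancestor:
        node = node.parent
        if node is None:
            raise ValueError("ancestor is not above node")
        labels.append(node.label)
    return " ".join(labels)


# Assumed toy tree: IP -> VP, VP -> ADVP VP, ADVP -> AD, VP -> VV AS NP.
s1 = Node("ADVP", [Node("AD")])
s2 = Node("VP", [Node("VV"), Node("AS"), Node("NP")])
s = Node("VP", [s1, s2])
root = Node("IP", [s])

rule_features = {"S1": rule_feature(s1), "S2": rule_feature(s2), "S": rule_feature(s)}
path_features = {
    "S1": path_feature(s1, s),    # S1 to the root of S
    "S2": path_feature(s2, s),    # S2 to the root of S
    "S": path_feature(s, root),   # S to the root of the tree
}
```

Rule features look sideways at a node's children (horizontal context), while path features look upward through its ancestors (vertical context).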
Constituent Boundary Matching Features
• Constituent Boundary Matching Features (CBMF)
– Exact match
• Source phrase exactly covers the boundaries of its subtree
– Inside match
• Source phrase covers a sequence of subtrees of its tree
– Crossing match
• Source phrase crosses a subtree boundary of its tree
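The three match types can be told apart from constituent spans alone. This is a minimal sketch of one reading of the definitions above, not the authors' code: the function name is hypothetical, spans are half-open word indices, and the constituent set is an assumed toy parse (AD VV AS NP under VP -> ADVP VP) that includes single-word constituents.

```python
def boundary_match(span, constituents):
    """Classify a source span against the constituent spans of a parse tree.

    span and every element of constituents are half-open (start, end) pairs.
    """
    i, j = span
    if span in constituents:
        return "exact"
    for a, b in constituents:
        # The span and the constituent overlap, but neither contains the other.
        if a < i < b < j or i < a < j < b:
            return "crossing"
    # No boundary is crossed, so the span is a run of adjacent subtrees.
    return "inside"


# Assumed toy parse over four words: AD(0) VV(1) AS(2) NP(3).
CONSTITUENTS = {
    (0, 4),                          # IP and the outer VP
    (0, 1),                          # ADVP / AD
    (1, 4),                          # inner VP
    (1, 2), (2, 3), (3, 4),          # VV, AS, NP
}

matches = {
    "VV AS NP": boundary_match((1, 4), CONSTITUENTS),
    "VV AS": boundary_match((1, 3), CONSTITUENTS),
    "AD VV": boundary_match((0, 2), CONSTITUENTS),
}
```

The "inside" fallback relies on single-word constituents being present in the set; without them a span could be labeled inside without actually being tiled by subtrees.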
Constituent Boundary Matching Features (3)
Exact match
Inside match
Crossing match
Integration into Phrase-based MT
• The SDB model estimates the probability that a source phrase is bracketable
– Whether it can be translated as a unit
• Integrated into a BTG MT system
– Bracketing Transduction Grammar (Wu, 1997)
Straight: 把 今天 設立 為 → establish today as
Inverted: 把 今天 設立 為 → as establish today
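In a BTG system, two neighboring source phrases are merged with their translations kept in the same order (straight) or swapped (inverted). The sketch below illustrates only that merge step; the function name is hypothetical, and splitting the source into 把 今天 設立 and 為 with the translations "establish today" and "as" is assumed for illustration.

```python
def merge(left_translation, right_translation, order):
    """Combine the translations of two neighboring source phrases.

    order is "straight" (keep source order) or "inverted" (swap on the target side).
    """
    if order == "straight":
        return left_translation + right_translation
    if order == "inverted":
        return right_translation + left_translation
    raise ValueError("order must be 'straight' or 'inverted'")


left = ["establish", "today"]   # assumed translation of 把 今天 設立
right = ["as"]                  # assumed translation of 為

straight = " ".join(merge(left, right, "straight"))
inverted = " ".join(merge(left, right, "inverted"))
```

The SDB probability would then score whether the merged source span should be translated as a unit at all; how that score enters the decoder is not shown here.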
Experiment
• Comparing models
– Baseline: BTG system
– XP+ (Marton and Resnik, 2008)
• NP, VP, PP, ADVP, …
• Penalize each violation of a syntactic boundary (soft constraint)
– UniSDB
• Only S features
– BiSDB
• S1, S2 and S features
Experiment (2)
• Chinese parser
– Lexicalized PCFG parser (Xiong et al., 2005)
• Parallel corpus
– FBIS corpus
• Word alignment
– GIZA++
• Four-gram language model
– Built with SRILM
– Xinhua section of the English Gigaword corpus
• Maximum Entropy (ME) trainer
– Zhang (2004)
Result
• SDB receives the largest feature weight
– This implies its impact on the decoder
Baseline features (common for phrase-based systems)
XP+ and SDB
Result (2)
• NIST MT-05 test set
– Improvement of 1.67 BLEU over baseline
– Improvement of 0.59 BLEU over XP+
Result (3)
• On top of CBMF, adding rule and path features achieves further improvement
• BiSDB is consistently better than UniSDB
– Inner contexts (S1 and S2) are useful
XP+ and SDB
• Same
– Both consider syntactic constituents
• Different
– XP+ only punishes non-syntactic source phrases
– SDB is able to encourage a non-syntactic phrase if it is bracketable
Conclusion
• The SDB model predicts whether a source phrase can be translated as a unit.
• Appropriate constituent violations are helpful
– Because the model better inherits the strength of the phrase-based approach