A Hierarchical Phrase-Based Model for Statistical Machine Translation
Author: David Chiang
Presented by Achim Ruopp
Formulas/illustrations/numbers extracted from referenced papers
Outline
• Phrase Order in Phrase-based Statistical MT
• Using synchronous CFGs to solve the issue
• Integrating the idea into an SMT system
• Results
• Conclusions
• Future work
• My Thoughts/Questions
Phrase Order in Phrase-based Statistical MT
• Example from [Chiang2005]:
Phrase Order in Phrase-based Statistical MT
• Translation of the example with a phrase-based SMT system (Pharaoh, [Koehn2004])
  [Aozhou] [shi] [yu] [Bei Han] [you] [bangjiao]1 [de shaoshu guojia zhiyi]
  [Australia] [is] [dipl. rels.]1 [with] [North Korea] [is] [one of the few countries]
• Uses learned phrase translations
• Accomplishes local phrase reordering
• Fails on the overall reordering of phrases
• Not only applicable to Chinese, but also to Japanese (SOV order) and German (scrambling)
Idea: Rules for Subphrases
• Motivation: "phrases are good for learning reorderings of words, we can use them to learn reorderings of phrases as well"
• Rules with "placeholders" for subphrases
  – <yu [1] you [2], have [2] with [1]>
• Learned automatically from bitext without syntactic annotation
• Formally syntax-based but not linguistically syntax-based
  – "the result sometimes resembles a syntactician's grammar but often does not"
Synchronous CFGs
• Developed in the 60’s for programming-language compilation [Aho1969]
• Separate tutorial by Chiang describing them [Chiang2005b]
• In NLP, synchronous CFGs have been used for
  – Machine translation
  – Semantic interpretation
Synchronous CFGs
• Like CFGs, but productions have two right-hand sides
  – Source side
  – Target side
  – Related through linked non-terminal symbols
• E.g. VP → <V[1] NP[2], NP[2] V[1]>
• One-to-one correspondence between linked non-terminals
• A non-terminal of type X is always linked to one of the same type
• Productions are applied in parallel to linked non-terminals on both sides
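The parallel rewriting described above can be sketched in a few lines. The grammar format and the two lexical rules below are my own invention for illustration; only the VP production comes from the slide.

```python
# Toy synchronous-CFG derivation: both sides of each production are
# rewritten in parallel through linked non-terminals.

GRAMMAR = {
    # VP -> <V[1] NP[2], NP[2] V[1]> : the slide's example production
    "VP": ([("V", 1), ("NP", 2)], [("NP", 2), ("V", 1)]),
    # Hypothetical lexical rules (not from the paper)
    "V": (["you"], ["have"]),
    "NP": (["bangjiao"], ["dipl. rels."]),
}

def derive(symbol):
    """Expand `symbol`, rewriting linked non-terminals on both sides in parallel."""
    src_rhs, tgt_rhs = GRAMMAR[symbol]
    # Each link index is expanded exactly once, so source and target share
    # the same subderivation (the one-to-one correspondence of links).
    subs = {link: derive(sym)
            for sym, link in (x for x in src_rhs if isinstance(x, tuple))}

    def realize(rhs, side):
        out = []
        for x in rhs:
            if isinstance(x, tuple):          # linked non-terminal
                out.extend(subs[x[1]][side])
            else:                             # terminal token
                out.append(x)
        return out

    return realize(src_rhs, 0), realize(tgt_rhs, 1)

src, tgt = derive("VP")
print(" ".join(src), "->", " ".join(tgt))  # you bangjiao -> dipl. rels. have
```

Because the substitution map is shared by both sides, the verb and object swap order in the target exactly as the production dictates.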
Synchronous CFGs
Synchronous CFGs
• Limitations
  – No Chomsky normal form
    • Has implications for the complexity of the decoder
  – Only limited closure under composition
  – Sister-reordering only
Model
• Using the log-linear model [Och2002]
  – Presented by Bill last week
Model – Rule Features
• Rule weight: w(X → <γ,α>) = Π_i φ_i(X → <γ,α>)^λi
• P(γ|α) and P(α|γ)
• Lexical weights Pw(γ|α) and Pw(α|γ)
  – Estimate how well the words in α translate the words in γ
• Phrase penalty exp(1)
  – Allows the model to learn a preference for longer or shorter derivations
• Exception: glue rule weights
  – w(S → <X[1], X[1]>) = 1
  – w(S → <S[1] X[2], S[1] X[2]>) = exp(-λg)
  – λg controls the model's preference for hierarchical phrases over serial phrase combination
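As a concrete toy illustration of the log-linear product, the sketch below computes one rule weight plus the two glue-rule weights. Every feature value and every λ here is a made-up number, not taken from the paper.

```python
import math

# Toy computation of the log-linear rule weight
#   w(X -> <γ,α>) = Π_i φ_i(X -> <γ,α>)^λ_i
# All feature values and λ's are invented for illustration.

def rule_weight(features, lambdas):
    """Product of feature values raised to their feature weights."""
    return math.prod(phi ** lambdas[name] for name, phi in features.items())

lambdas = {"p_g_a": 1.0, "p_a_g": 0.5, "lex_g_a": 0.5, "lex_a_g": 0.5,
           "phrase_penalty": 0.2}
features = {"p_g_a": 0.3,              # P(γ|α)
            "p_a_g": 0.4,              # P(α|γ)
            "lex_g_a": 0.25,           # Pw(γ|α)
            "lex_a_g": 0.2,            # Pw(α|γ)
            "phrase_penalty": math.e}  # the constant exp(1) feature

w = rule_weight(features, lambdas)

# Glue rules are the exception and bypass the feature product:
lambda_g = 0.5                       # invented value
w_glue_init = 1.0                    # w(S -> <X[1], X[1]>) = 1
w_glue_concat = math.exp(-lambda_g)  # w(S -> <S[1] X[2], S[1] X[2]>) = exp(-λg)
```

A small λ on the phrase-penalty feature lets tuning push the model toward fewer or more rule applications per derivation.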
Model – Additional Features
• w(D) = Π_{<r,i,j> ∈ D} w(r) × Plm(e)^λlm × exp(-λwp |e|)
• Language model and word penalty are separated out from the rule weights
  – Notational convenience
  – Conceptually cleaner (necessary for polynomial-time decoding)
• Derivation D
  – Set of triples <r,i,j>: apply grammar rule r to rewrite a non-terminal spanning f(D) from position i to j
  – Ambiguous
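The derivation weight above is just a product of three kinds of factors, which the sketch below evaluates on invented numbers (the rule weights, LM probability, and both λ's are all assumptions, not values from the paper).

```python
import math

# Toy evaluation of the derivation weight
#   w(D) = Π_{<r,i,j> in D} w(r) × Plm(e)^λlm × exp(-λwp |e|)

def derivation_weight(rule_weights, p_lm, e_length, lam_lm, lam_wp):
    """Product of the per-rule weights times the separated-out language
    model and word-penalty features."""
    return (math.prod(rule_weights)      # one factor per <r,i,j> in D
            * p_lm ** lam_lm             # language model, weighted
            * math.exp(-lam_wp * e_length))  # word penalty on |e|

w = derivation_weight(rule_weights=[0.4, 0.9, 0.7],
                      p_lm=1e-6, e_length=8,
                      lam_lm=1.0, lam_wp=0.1)
```

Keeping Plm and the word penalty outside the rule product is what the slide means by "separated out": they depend on the whole output string e, not on any single rule.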
Training
• Training starts from a symmetrized, word-aligned corpus
• Adopted from [Och2004] and [Koehn2003]
  – How to get from a one-directional alignment to a symmetric alignment
  – How to find initial phrase pairs
• An alternative would be Marcu & Wong 2002, which Ping presented [Marcu2002]
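The symmetrization step can be sketched as intersecting the two one-directional alignments and then growing with neighboring points from their union. This is a simplification of the grow-diag-style heuristics the slide refers to, and the example alignments are invented; alignments here are sets of (source_index, target_index) pairs.

```python
# Simplified alignment symmetrization: intersection, then union-based growing.

def symmetrize(src2tgt, tgt2src):
    result = set(src2tgt & tgt2src)        # high-precision starting point
    union = src2tgt | tgt2src
    added = True
    while added:                           # grow toward the union
        added = False
        for (i, j) in union - result:
            # accept a union point only if it touches an accepted point
            if any((i + di, j + dj) in result
                   for di in (-1, 0, 1) for dj in (-1, 0, 1)):
                result.add((i, j))
                added = True
    return result

forward = {(0, 0), (1, 2), (2, 1)}
backward = {(0, 0), (1, 2), (3, 3)}
print(sorted(symmetrize(forward, backward)))  # [(0, 0), (1, 2), (2, 1)]
```

The isolated point (3, 3) is dropped because it neighbors nothing in the intersection, which is how the heuristic trades recall for precision.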
Training
• Unfortunately, the scheme leads
  – To a large number of rules
  – With false ambiguity
• Grammar is filtered to balance grammar size and performance
  – Five filter criteria, e.g.
    • Rules produce at most two non-terminals
    • Initial phrase length limited to 10
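The two named criteria can be sketched as a simple predicate over extracted rules. The rule format, with "X1"/"X2" placeholder tokens on the source side, is my own assumption, and the paper's remaining three criteria are not reproduced here.

```python
# Sketch of the grammar filter for the two criteria named on the slide.

MAX_NONTERMINALS = 2
MAX_INITIAL_PHRASE_LEN = 10

def keep_rule(src_side, initial_phrase_len):
    """Return True if the rule passes both filters."""
    nonterminals = sum(1 for tok in src_side if tok.startswith("X"))
    if nonterminals > MAX_NONTERMINALS:
        return False
    return initial_phrase_len <= MAX_INITIAL_PHRASE_LEN

rules = [
    (["yu", "X1", "you", "X2"], 6),    # kept
    (["X1", "X2", "X3", "de"], 4),     # dropped: three non-terminals
    (["X1", "zhiyi"], 12),             # dropped: initial phrase too long
]
kept = [src for src, length in rules if keep_rule(src, length)]
```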
Decoding
• Our good old friend - the CKY parser
• Enhanced with
  – Beam search
  – Postprocessor to map French derivations to English derivations
• e = e(argmax_{D s.t. f(D) = f} w(D))
Results
• Baseline
  – Pharaoh [Koehn2003], [Koehn2004]
  – Minimum error rate training on the BLEU measure
• Hierarchical model
  – 2.2 million rules after filtering down from 24 million
  – 7.5% relative improvement
• Additional constituent feature
  – Additional feature favoring syntactic parses
  – Trained on 250k sentences of the Penn Chinese Treebank
  – Improved accuracy only on the development set
Learned Feature Weights
• Word = word penalty
• Phr = phrase penalty (pp)
• λg penalizes glue rules much less than λpp does regular rules
  – i.e. "This suggests that the model will prefer serial combination of phrases, unless some other factor supports the use of hierarchical phrases"
Conclusions
• Hierarchical phrase pairs can be learned from data without syntactic annotation
• Hierarchical phrase pairs improve translation accuracy significantly
• Added syntactic information (constituent feature) did not provide a statistically significant gain
Future Work
• Move to a more syntactically motivated grammar
• Reducing grammar size to allow more aggressive training settings
My Thoughts/Questions
• Really interesting approach to bring "syntactic" information into SMT
• Example sentence was not translated correctly
  – Missing words are problematic
• Can phrase reordering also be learned by lexicalized phrase reordering models [Och2004]?
• Why did the constituent feature only improve accuracy on the development set, but not on the test set?
• Does data sparseness influence the learned feature weights?
• What syntactic features are already built into Pharaoh?
References
• [Aho1969] Aho, A. V. and J. D. Ullman. 1969. Syntax directed translations and the pushdown assembler. Journal of Computer and System Sciences, 3:37–56.
• [Chiang2005] Chiang, David. 2005. A Hierarchical Phrase-Based Model for Statistical Machine Translation. In Proceedings of ACL 2005, pages 263–270.
• [Chiang2005b] http://www.umiacs.umd.edu/~resnik/ling645_fa2005/notes/synchcfg.pdf
• [Koehn2003] Koehn, Philipp. 2003. Noun Phrase Translation. Ph.D. thesis, University of Southern California.
• [Koehn2004] Koehn, Philipp. 2004. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Proceedings of the Sixth Conference of the Association for Machine Translation in the Americas, pages 115–124.
• [Marcu2002] Marcu, Daniel and William Wong. 2002. A phrase-based, joint probability model for statistical machine translation. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 133–139.
• [Och2002] Och, Franz Josef and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the ACL, pages 295–302.
• [Och2004] Och, Franz Josef and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30:417–449.