Coping with Surprise: Multiple CMU MT Approaches
Alon Lavie
Lori Levin, Jaime Carbonell, Alex Waibel, Stephan Vogel, Ralf Brown, Robert Frederking
Language Technologies Institute
Carnegie Mellon University
Joint work with: Katharina Probst, Erik Peterson, Joy Zhang, Fei Huang, Alicia Tribble, Ariadna Font-Llitjos, Rachel Reynolds, Richard Cohen
August 5, 2003 TIDES PI Meeting/ SLE 2
Main Hindi SLE Efforts
• Data Collection
  – Elicited Data Collection
  – Data from contacts in India
  – Web Crawling
• Language Processing Utilities
  – Morphology
  – Encoding identification and conversion
• MT system development
  – XFER system
  – SMT system
  – EBMT system
Elicited Data Collection
• Goal: Acquire high quality word aligned Hindi-English data to support XFER system development (grammar learning)
• Recruited team of ~20 bilingual speakers at CMU and in India
• Extracted a corpus of phrases (NPs and PPs) from Brown Corpus section of Penn TreeBank
• Controlled Elicitation Corpus (typologically diverse, limited vocabulary) also translated into Hindi
• Resulting in total of 17589 word aligned translated phrases (~50KB)
The CMU Elicitation Tool
Elicited Data Collection
• Problems and issues:
  – English-to-Hindi direction allowed us to use the Penn TreeBank to extract accurate phrases
  – However, bilingual informants were not accustomed to typing Hindi, which led to typos
  – This limits the utility of the data, with less effect on accuracy
  – Using the WSJ portion of the PennTB may have been a better fit for genre
Main CMU Contributions to SLE Shared Resources
• Elicited Data Corpus (~50KB)
• Indian Government Parallel Text ERDC.tgz (338 MB)
• CMU Phrase Lexicon Joyphrase.gz (3.5 MB)
• Cleaned IBM lexicon ibmlex-cleaned.txt.gz (1.5 MB)
• CMU Aligned Sentences CMU-aligned-sentences.tar.gz (1.3 MB)
• CMU Phrases and sentences CMU-phrases+sentences.zip (468 KB)
• Bilingual Named Entity List IndiaTodayLPNETranslists.tar.gz (54KB)
Web Crawling:
• Most sites with possible parallel texts had Hindi in proprietary encodings
• Osho http://www.osho.com/Content.cfm?Language=Hindi
Hindi Morphological Analyzer
• http://www.iiit.net/ltrc/morph/index.htm
• High quality and high coverage morphological analyzer from IIIT
  – Input: full inflected forms (RomanWX encoding)
  – Output: root form + collection of features
• Installing as a local server required some effort, e.g. UTF-8 to RomanWX conversion
• Used primarily in our XFER system
Other Hindi Processing Utilities
• Encoding identification and conversion tools
  – Built two automatic encoding identifiers, used for web data collection
  – Located and installed encoding converters from a variety of encodings
  – Most widely used was UTF-8 to RomanWX
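As a rough illustration of what encoding identification involves, the sketch below checks whether a byte string decodes as UTF-8 and consists mostly of Devanagari code points (Unicode block U+0900 to U+097F). It is not one of the CMU identifiers mentioned above; the function name `looks_like_utf8_devanagari` and the 0.5 threshold are assumptions made for this example.

```python
def looks_like_utf8_devanagari(data: bytes, threshold: float = 0.5) -> bool:
    """Heuristic check: valid UTF-8 whose visible characters are mostly Devanagari."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so possibly a legacy or proprietary encoding.
        return False
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    # Count characters that fall inside the Devanagari Unicode block.
    devanagari = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return devanagari / len(chars) >= threshold


print(looks_like_utf8_devanagari("हिन्दी भाषा".encode("utf-8")))
print(looks_like_utf8_devanagari(b"plain English text"))
```

Pages in font-specific proprietary encodings would typically fail this check, and would need a dedicated classifier and converter of the kind listed above.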
XFER System for Hindi
• Three passes:
  – match against phrase-to-phrase entries (full-forms, no morphology)
  – morphologically analyze input words and match against lexicon
    • matches feed into manual and learned transfer rules
  – match original word against lexicon - provides word-to-word translation as fall-back for input not otherwise covered
• Simple decoding: greedy left-to-right search that prefers longer input segments: NIST 5.35
• “Strong” decoding with lattices+LM: NIST 5.47
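The simple decoding strategy can be pictured as a greedy longest-match search over a phrase table. The following is a minimal sketch under that reading, not the XFER decoder itself; `greedy_decode`, the toy lexicon with placeholder symbols, and the pass-through of uncovered words are all assumptions.

```python
from typing import Dict, List, Tuple


def greedy_decode(tokens: List[str], lexicon: Dict[Tuple[str, ...], str]) -> List[str]:
    """Translate left to right, always taking the longest matching input segment."""
    max_len = max((len(key) for key in lexicon), default=1)
    output: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate segment first, then progressively shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            segment = tuple(tokens[i:i + n])
            if segment in lexicon:
                output.append(lexicon[segment])
                i += n
                break
        else:
            # No entry covers this word: pass it through untranslated.
            output.append(tokens[i])
            i += 1
    return output


# Hypothetical toy entries using placeholder symbols, not taken from the CMU lexicon.
toy_lexicon = {
    ("a", "b"): "AB",
    ("a",): "A",
    ("b",): "B",
    ("c",): "C",
}
print(greedy_decode(["a", "b", "c", "d"], toy_lexicon))  # ['AB', 'C', 'd']
```

A greedy search commits to each segment without looking ahead, whereas the lattice plus language model approach can compare alternative segmentations.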
Examples of Learned Rules

{NP,14244}
;;Score:0.0429
NP::NP [N] -> [DET N]
(
(X1::Y2)
)

{NP,14434}
;;Score:0.0040
NP::NP [ADJ CONJ ADJ N] -> [ADJ CONJ ADJ N]
(
(X1::Y1) (X2::Y2)
(X3::Y3) (X4::Y4)
)

{PP,4894}
;;Score:0.0470
PP::PP [NP POSTP] -> [PREP NP]
(
(X2::Y1)
(X1::Y2)
)
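To make the alignment notation concrete, the sketch below applies the constituent alignments of a rule such as the PP rule above, reading `(X2::Y1)` as "source constituent 2 fills target position 1". The `apply_alignments` helper and the example strings are hypothetical. Target positions with no aligned source constituent (such as DET in the first NP rule) simply keep their category label, which is a simplification.

```python
import re
from typing import List


def apply_alignments(source_parts: List[str], target_pattern: List[str], alignments: str) -> List[str]:
    """Place already-translated source constituents into target positions via Xi::Yj pairs."""
    result = list(target_pattern)  # unaligned slots keep their category label
    for x, y in re.findall(r"X(\d+)::Y(\d+)", alignments):
        result[int(y) - 1] = source_parts[int(x) - 1]
    return result


# PP::PP [NP POSTP] -> [PREP NP] with (X2::Y1) (X1::Y2): postposition moves to the front.
print(apply_alignments(["the house", "in"], ["PREP", "NP"], "(X2::Y1) (X1::Y2)"))

# NP::NP [N] -> [DET N] with (X1::Y2): DET has no aligned source constituent.
print(apply_alignments(["book"], ["DET", "N"], "(X1::Y2)"))
```

The first call yields the constituents in English order (preposition before noun phrase). The second leaves the DET slot open, which a full system would have to fill by other means.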
SMT System for Hindi
• Resources
  – Trained on commonly available bilingual corpora
  – Used bilingual Hindi-English dictionary
  – Named Entities
  – 70 million word English LM
• CMU SMT System
  – Tuned on ISI devtest data
  – Monotone decoding, as reordering did not result in improvement on this test set
  – Mixed casing based on Named Entities and simple rules
• NIST score: 6.74
EBMT System for Hindi
• Training data: same as SMT + a few hand-written equivalence class generalizations
• English LM built from APW portion of GigaWord Corpus (600M words)
• Encoding variation: raw training data in a variety of different encodings all converted to UTF-8 (already supported by EBMT)
• Preprocessing of example phrases to improve word matching:
  – Match Hindi possessive with English 's
• NIST Score: 5.98
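One way to picture the possessive preprocessing is to split the English clitic 's into its own token, so that a word-level matcher can pair it with a separate Hindi possessive marker. This sketch is an assumption about the general idea, not the EBMT system's actual preprocessing, and `split_possessive` is a hypothetical helper.

```python
import re


def split_possessive(sentence: str) -> str:
    """Detach the English clitic 's from the preceding word so it becomes a separate token."""
    return re.sub(r"(\w)'s\b", r"\1 's", sentence)


print(split_possessive("the minister's speech"))  # the minister 's speech
```

The pattern also splits contractions such as "it's", so a real pipeline would need additional rules to tell the two uses apart.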
A Truly Limited Data Scenario for Hindi-to-English
• Put together a scenario with very miserly data resources:
  – Elicited Data corpus: 17589 phrases
  – Cleaned portion (top 12%) of LDC dictionary: ~2725 Hindi words (23612 translation pairs)
  – Manually acquired resources during the SLE:
    • 500 manual bigram translations
    • 72 manually written phrase transfer rules
    • 105 manually written postposition rules
    • 48 manually written time expression rules
• No additional parallel text!!
• Results presented tomorrow…
Other CMU Contributions to SLE Shared Resources
FOUND RESOURCES not on LDC Website: [From TidesSLList Archive website]
• Vogel email 6/2
  – Hindi Language Resources: http://www.cs.colostate.edu/~malaiya/hindilinks.html
  – General Information on Hindi Script: http://www.latrobe.edu.au/indiangallery/devanagari.htm
  – Dictionaries at: http://www.iiit.net/ltrc/Dictionaries/Dict_Frame.html
  – English to Hindi dictionary in different formats: http://sanskrit.gde.to/hindi/
  – A small English to Urdu dictionary: http://www.cs.wisc.edu/~navin/india/urdu.dictionary
  – The Bible at: http://www.gospelcom.net/ibs/bibles/
  – The Emille Project: http://www.emille.lancs.ac.uk/home.htm
  – [Hardcopy phrasebook references]
  – A Monthly Newsletter of Vigyan Prasar: http://www.vigyanprasar.com/dream/index.asp
  – Morphological Analyser: http://www.iiit.net/ltrc/morph/index.htm
Other CMU Contributions to SLE Shared Resources
FOUND RESOURCES not on LDC Website: (cont.) [From TidesSLList Archive website]
• Tribble email, via Vogel 6/2. Possible parallel websites:
  – http://www.bbc.co.uk (English)
  – http://www.bbc.co.uk/urdu/ (Hindi)
  – http://sify.com/news_info/news/
  – http://sify.com/hindi/
  – http://in.rediff.com/index.html (English)
  – http://www.rediff.com/hindi/index.html (Hindi)
  – http://www.indiatoday.com/itoday/index.html
  – http://www.indiatodayhindi.com
• Vogel email 6/2
  – http://us.rediff.com/index.html
  – http://www.rediff.com/hindi/index.html [Already listed]
  – http://www.niharonline.com/
  – http://www.niharonline.com/hindi/index.html
  – http://www.boloji.com/hindi/index.html
  – http://www.boloji.com/hindi/hindi/index.htm
  – The Gita Supersite http://www.gitasupersite.iitk.ac.in/
  – Press Information Bureau, Government of India
    • English: http://pib.nic.in/
    • Hindi: http://pib.nic.in/urdu/hindimain.html
Other CMU Contributions to SLE Shared Resources
FOUND RESOURCES not on LDC Website: (cont.) [From TidesSLList Archive website]
• 6/20 Parallel Hindi/English webpages:
  – GAIL (Natural Gas Co.) http://gail.nic.in/ UTF-8. [Found by CMU undergrad Web team] [Mike Maxwell, LDC, found it at the same time.]
SHARED PROCESSED RESOURCES NOT ON LDC WEBSITE: [From TidesSLList Archive website:]
• Frederking email 6/3 [announced], 6/4 [provided]
  – Ralf Brown's idenc encoding classifier
• Frederking email 6/5
  – PDF extractions from LanguageWeaver URLs:
    http://progress.is.cs.cmu.edu/surprise/Hindi/ParDoc/06-04-2003/English/
    http://progress.is.cs.cmu.edu/surprise/Hindi/ParDoc/06-04-2003/Hindi/
• Frederking email 6/5
  – Richard Wang's Perl ident.pl encoding classifier and ISCII-UTF8.pl converter
• Frederking email 6/11
  – Erik Peterson here has put together a Perl wrapper for the IIIT Morphology package, so that the input can be UTF-8: http://progress.is.cs.cmu.edu/surprise/morph_wrapper.tar.gz
Other CMU Contributions to SLE Shared Resources
SHARED PROCESSED RESOURCES NOT ON LDC WEBSITE: (cont.) [From TidesSLList Archive website:]
• Levin email 6/13
  – Directory of Elicited Word-Aligned English-Hindi Translated Phrases: http://progress.is.cs.cmu.edu/surprise/Elicited-Data/
• Frederking email 6/20
  – Undecoded but believed to be parallel webpages: http://progress.is.cs.cmu.edu/surprise/merged_urls.txt
  – PDF extractions from same: http://progress.is.cs.cmu.edu/surprise/merged_urls/
• Frederking email 6/24
  – Several individual parallel webpages; sites may have more:
    www.commerce.nic.in/setup.htm
    www.commerce.nic.in/hindi/setup.html
    mohfw.nic.in/kk/95/books1.htm
    mohfw.nic.in/oph.htm
    wwww.mp.nic.in