Coping with Surprise: Multiple CMU MT Approaches Alon Lavie Lori Levin, Jaime Carbonell, Alex...

Coping with Surprise: Multiple CMU MT Approaches

Alon Lavie
Lori Levin, Jaime Carbonell, Alex Waibel, Stephan Vogel, Ralf Brown, Robert Frederking
Language Technologies Institute
Carnegie Mellon University

Joint work with: Katharina Probst, Erik Peterson, Joy Zhang, Fei Huang, Alicia Tribble, Ariadna Font-Llitjos, Rachel Reynolds, Richard Cohen

August 5, 2003 TIDES PI Meeting/ SLE 2

Main Hindi SLE Efforts

• Data Collection
  – Elicited Data Collection
  – Data from contacts in India
  – Web Crawling
• Language Processing Utilities
  – Morphology
  – Encoding identification and conversion
• MT system development
  – XFER system
  – SMT system
  – EBMT system


Elicited Data Collection

• Goal: acquire high-quality, word-aligned Hindi-English data to support XFER system development (grammar learning)
• Recruited a team of ~20 bilingual speakers at CMU and in India
• Extracted a corpus of phrases (NPs and PPs) from the Brown Corpus section of the Penn TreeBank
• Controlled Elicitation Corpus (typologically diverse, limited vocabulary) also translated into Hindi
• Result: a total of 17589 word-aligned translated phrases (~50 KB)


The CMU Elicitation Tool


Elicited Data Collection

• Problems and issues:
  – The English-to-Hindi direction allowed us to use the Penn TreeBank to extract accurate phrases
  – However, the bilingual informants were not accustomed to typing Hindi, which led to typos
  – This limits the utility of the data, with less effect on accuracy
  – Using the WSJ portion of the Penn TreeBank may have been a better fit for genre


Main CMU Contributions to SLE Shared Resources

• Elicited Data Corpus (~50 KB)
• Indian Government Parallel Text: ERDC.tgz (338 MB)
• CMU Phrase Lexicon: Joyphrase.gz (3.5 MB)
• Cleaned IBM lexicon: ibmlex-cleaned.txt.gz (1.5 MB)
• CMU Aligned Sentences: CMU-aligned-sentences.tar.gz (1.3 MB)
• CMU Phrases and sentences: CMU-phrases+sentences.zip (468 KB)
• Bilingual Named Entity List: IndiaTodayLPNETranslists.tar.gz (54 KB)

Web Crawling:
• Most sites with possible parallel texts had Hindi in proprietary encodings
• Osho: http://www.osho.com/Content.cfm?Language=Hindi


Hindi Morphological Analyzer

• http://www.iiit.net/ltrc/morph/index.htm

• High quality and high coverage morphological analyzer from IIIT
  – Input: full inflected forms (RomanWX encoding)
  – Output: root form + collection of features
• Installing as a local server required some effort, e.g. UTF-8 to RomanWX conversion

• Used primarily in our XFER system


Other Hindi Processing Utilities

• Encoding identification and conversion tools
  – Built two automatic encoding identifiers, used for web data collection
  – Located and installed encoding converters from a variety of encodings
  – Most widely used was UTF-8 to RomanWX
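As a rough illustration of what an automatic encoding identifier has to decide for crawled Hindi pages, here is a minimal sketch. It is not the classifier used in this work: the function names, the 0.5 threshold, and the three output labels are all assumptions made for the example. It only separates valid UTF-8 Devanagari from everything else, so pages in ISCII or in proprietary font encodings simply come back as "unknown".

```python
def devanagari_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Devanagari block (U+0900-U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return hits / len(chars)


def guess_encoding(raw: bytes, threshold: float = 0.5) -> str:
    """Return 'utf-8-devanagari', 'utf-8-other', or 'unknown' (labels are hypothetical)."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: could be ISCII or a proprietary font encoding.
        return "unknown"
    if devanagari_ratio(text) >= threshold:
        return "utf-8-devanagari"
    return "utf-8-other"
```

A page classified as "unknown" would still need a dedicated converter before it could be used as parallel text.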


XFER System for Hindi

• Three passes:
  – match against phrase-to-phrase entries (full forms, no morphology)
  – morphologically analyze input words and match against lexicon
    • matches feed into manual and learned transfer rules
  – match original word against lexicon: provides word-to-word translation as fall-back for input not otherwise covered
• Simple decoding: greedy left-to-right search that prefers longer input segments: NIST 5.35
• “Strong” decoding with lattices + LM: NIST 5.47
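The simple decoding strategy can be pictured with a small sketch. This is an illustration under assumptions, not the XFER decoder itself: the lattice is modelled as a hypothetical dictionary from input spans to a single candidate translation, and uncovered words are passed through as placeholders.

```python
from typing import Dict, List, Tuple

# Hypothetical lattice: maps an input span (start, end) to one candidate translation.
Lattice = Dict[Tuple[int, int], str]


def greedy_decode(n_words: int, lattice: Lattice) -> List[str]:
    """Cover positions 0..n_words left to right, taking the longest available span each time."""
    output: List[str] = []
    pos = 0
    while pos < n_words:
        # Try the longest span starting at pos first.
        for end in range(n_words, pos, -1):
            if (pos, end) in lattice:
                output.append(lattice[(pos, end)])
                pos = end
                break
        else:
            # No entry covers this word: emit a placeholder and move on.
            output.append(f"<unk:{pos}>")
            pos += 1
    return output


lattice = {(0, 2): "in the house", (0, 1): "house", (1, 2): "in", (2, 3): "is"}
print(greedy_decode(4, lattice))  # ['in the house', 'is', '<unk:3>']
```

The two-word span wins over the two single-word entries because longer segments are tried first. A lattice decoder with a language model would instead score alternative paths against each other.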


Examples of Learned Rules

{NP,14244}
;;Score:0.0429
NP::NP [N] -> [DET N]
(
(X1::Y2)
)

{NP,14434}
;;Score:0.0040
NP::NP [ADJ CONJ ADJ N] -> [ADJ CONJ ADJ N]
(
(X1::Y1) (X2::Y2)
(X3::Y3) (X4::Y4)
)

{PP,4894}
;;Score:0.0470
PP::PP [NP POSTP] -> [PREP NP]
(
(X2::Y1)
(X1::Y2)
)
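To make the alignment notation concrete, the sketch below reads the (Xi::Yj) constraints of a rule and places already-translated source constituents into target order. It is a toy reading of the notation, not the XFER engine: the function names are hypothetical, and target slots without an alignment (such as the inserted DET in the first rule) are simply left empty.

```python
import re
from typing import Dict, List, Optional


def parse_alignments(body: str) -> Dict[int, int]:
    """Map target position j to source position i for each (Xi::Yj) constraint."""
    return {int(y): int(x) for x, y in re.findall(r"\(X(\d+)::Y(\d+)\)", body)}


def apply_rule(target_cats: List[str], body: str, translated: List[str]) -> List[Optional[str]]:
    """Place translated source constituents (1-indexed by X) into target order.

    Target slots with no alignment (e.g. an inserted DET) are left as None.
    """
    y_to_x = parse_alignments(body)
    result: List[Optional[str]] = []
    for j in range(1, len(target_cats) + 1):
        x = y_to_x.get(j)
        result.append(translated[x - 1] if x is not None else None)
    return result


# PP rule: [NP POSTP] -> [PREP NP] swaps the two constituents.
print(apply_rule(["PREP", "NP"], "((X2::Y1)(X1::Y2))", ["the house", "in"]))
# ['in', 'the house']
```

For the first NP rule, `apply_rule(["DET", "N"], "((X1::Y2))", ["book"])` returns `[None, "book"]`, leaving the determiner slot to be filled by some other mechanism.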


SMT System for Hindi

• Resources
  – Trained on commonly available bilingual corpora
  – Used bilingual Hindi-English dictionary
  – Named Entities
  – 70 million word English LM
• CMU SMT System
  – Tuned on ISI devtest data
  – Monotone decoding, as reordering did not result in improvement on this test set
  – Mixed casing based on Named Entities and simple rules
• NIST score: 6.74
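The slide does not say which casing rules were used, so the following is only a guess at the general shape of such a step: capitalise tokens found in a named-entity list, plus the first token of the sentence. The function name and the single sentence-initial rule are assumptions for illustration.

```python
from typing import Iterable, List


def restore_case(tokens: List[str], named_entities: Iterable[str]) -> List[str]:
    """Capitalise named-entity tokens and the first token of the sentence."""
    ne_lower = {ne.lower() for ne in named_entities}
    cased = []
    for i, tok in enumerate(tokens):
        if tok.lower() in ne_lower or i == 0:
            # Upper-case only the first character; leave the rest untouched.
            cased.append(tok[:1].upper() + tok[1:])
        else:
            cased.append(tok)
    return cased


print(restore_case("the minister visited delhi".split(), ["Delhi"]))
# ['The', 'minister', 'visited', 'Delhi']
```

A real system would also need to handle multi-word entities and acronyms, which this sketch ignores.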


EBMT System for Hindi

• Training data: same as SMT + a few hand-written equivalence class generalizations
• English LM built from APW portion of GigaWord Corpus (600M words)
• Encoding variation: raw training data in a variety of different encodings, all converted to UTF-8 (already supported by EBMT)
• Preprocessing of example phrases to improve word matching:
  – Match Hindi possessive with English 's
• NIST Score: 5.98


A Truly Limited Data Scenario for Hindi-to-English

• Put together a scenario with very miserly data resources:
  – Elicited Data corpus: 17589 phrases
  – Cleaned portion (top 12%) of LDC dictionary: ~2725 Hindi words (23612 translation pairs)
  – Manually acquired resources during the SLE:
    • 500 manual bigram translations
    • 72 manually written phrase transfer rules
    • 105 manually written postposition rules
    • 48 manually written time expression rules
• No additional parallel text!!
• Results presented tomorrow…


Other CMU Contributions to SLE Shared Resources

FOUND RESOURCES not on LDC Website: [From TidesSLList Archive website]

• Vogel email 6/2
  – Hindi Language Resources: http://www.cs.colostate.edu/~malaiya/hindilinks.html
  – General Information on Hindi Script: http://www.latrobe.edu.au/indiangallery/devanagari.htm
  – Dictionaries at: http://www.iiit.net/ltrc/Dictionaries/Dict_Frame.html
  – English to Hindi dictionary in different formats: http://sanskrit.gde.to/hindi/
  – A small English to Urdu dictionary: http://www.cs.wisc.edu/~navin/india/urdu.dictionary
  – The Bible at: http://www.gospelcom.net/ibs/bibles/
  – The Emille Project: http://www.emille.lancs.ac.uk/home.htm
  – [Hardcopy phrasebook references]
  – A Monthly Newsletter of Vigyan Prasar: http://www.vigyanprasar.com/dream/index.asp
  – Morphological Analyser: http://www.iiit.net/ltrc/morph/index.htm



FOUND RESOURCES not on LDC Website: (cont.) [From TidesSLList Archive website]

• Tribble email, via Vogel 6/2. Possible parallel websites:
  – http://www.bbc.co.uk (English)
  – http://www.bbc.co.uk/urdu/ (Hindi)
  – http://sify.com/news_info/news/
  – http://sify.com/hindi/
  – http://in.rediff.com/index.html (English)
  – http://www.rediff.com/hindi/index.html (Hindi)
  – http://www.indiatoday.com/itoday/index.html
  – http://www.indiatodayhindi.com
• Vogel email 6/2
  – http://us.rediff.com/index.html
  – http://www.rediff.com/hindi/index.html [Already listed]
  – http://www.niharonline.com/
  – http://www.niharonline.com/hindi/index.html
  – http://www.boloji.com/hindi/index.html
  – http://www.boloji.com/hindi/hindi/index.htm
  – The Gita Supersite: http://www.gitasupersite.iitk.ac.in/
  – Press Information Bureau, Government of India
    • English: http://pib.nic.in/
    • Hindi: http://pib.nic.in/urdu/hindimain.html




FOUND RESOURCES not on LDC Website: (cont.) [From TidesSLList Archive website]

• 6/20 Parallel Hindi/English webpages:
  – GAIL (Natural Gas Co.) http://gail.nic.in/ UTF-8. [Found by CMU undergrad Web team] [Mike Maxwell, LDC, found it at the same time.]

SHARED PROCESSED RESOURCES NOT ON LDC WEBSITE: [From TidesSLList Archive website:]

• Frederking email 6/3 [announced], 6/4 [provided]
  – Ralf Brown's idenc encoding classifier
• Frederking email 6/5
  – PDF extractions from LanguageWeaver URLs:
    http://progress.is.cs.cmu.edu/surprise/Hindi/ParDoc/06-04-2003/English/
    http://progress.is.cs.cmu.edu/surprise/Hindi/ParDoc/06-04-2003/Hindi/
• Frederking email 6/5
  – Richard Wang's Perl ident.pl encoding classifier and ISCII-UTF8.pl converter
• Frederking email 6/11
  – Erik Peterson here has put together a Perl wrapper for the IIIT Morphology package, so that the input can be UTF-8: http://progress.is.cs.cmu.edu/surprise/morph_wrapper.tar.gz



SHARED PROCESSED RESOURCES NOT ON LDC WEBSITE: (cont.) [From TidesSLList Archive website:]

• Levin email 6/13
  – Directory of Elicited Word-Aligned English-Hindi Translated Phrases: http://progress.is.cs.cmu.edu/surprise/Elicited-Data/
• Frederking email 6/20
  – Undecoded but believed to be parallel webpages: http://progress.is.cs.cmu.edu/surprise/merged_urls.txt
  – PDF extractions from same: http://progress.is.cs.cmu.edu/surprise/merged_urls/
• Frederking email 6/24
  – Several individual parallel webpages; sites may have more:
    www.commerce.nic.in/setup.htm
    www.commerce.nic.in/hindi/setup.html
    mohfw.nic.in/kk/95/books1.htm
    mohfw.nic.in/oph.htm
    wwww.mp.nic.in
