December 13, 2008 FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA http://terpconnect.umd.edu/~oard Slides from: Leah Larkey, Mike Maxwell, Franz Josef Och, David Yarowsky Ideas from: Just about all of “Team TIDES”

Page 1:

December 13, 2008 FIRE

Not So Surprising Anymore: Hindi from TIDES to FIRE

Douglas W. Oard and Tan Xu

University of Maryland, USA
http://terpconnect.umd.edu/~oard

Slides from: Leah Larkey, Mike Maxwell, Franz Josef Och, David Yarowsky
Ideas from: Just about all of “Team TIDES”

Page 2:

A Very Brief History of NLP

• 1966: ALPAC
  – Refocus investment on enabling technologies

• 1990: IBM’s Candide MT system
  – Credible data-driven approaches

• 1999: TIDES
  – Translation, Detection, Extraction, Summarization

Page 3:

Surprise Language Framework

• English-only Users / Docs in language X

• Zero-resource start (treasure hunt)

• Sharply time constrained (29 days)

• Character-coded text

• Research-oriented

• Intense team-based collaboration

Page 4:

Schedule

              Cebuano         Hindi
Announce:     Mar 5           Jun 1
Test Data:    (not given)     Jun 27
Stop Work:    Mar 14          Jun 30
Newsletter:   April           August
Talks:        May 30 (HLT)    Aug 5 (TIDES PI)
Papers:       (not given)     October (TALIP)

Page 5:

300-Language Survey

Page 6:

• Five evaluated tasks
  – Automatic CLIR (English queries)
  – Topic tracking (English examples, event-based)
  – Machine translation into English
  – English “Headline” generation
  – Entity tagging (five MUC types)

• Several useful components
  – POS tags, morphology, time expressions, parsing

• Several demonstration systems
  – Interactive CLIR (two systems)
  – Cross-language QA (English Q, Translated A)
  – Machine translation (+ Translation elicitation)
  – Cross-document entity tracking

Page 7:

16 Participating Teams

Cebuano + Hindi

USC-ISI

Maryland

NYU

Johns Hopkins

Sheffield

U Penn-LDC

CMU

UC Berkeley

MITRE

Hindi Only

U Mass

Alias-i

BBN

IBM

CUNY-Queens

K-A-T (Colorado)

Navy-SPAWAR

Page 8:

Innovation Cycle

[Diagram. Labels shown: Resource Harvesting (Books, Web, People; Lexicons, Corpora); Systems (Translation, Detection, Extraction, Summarization); Research Results; Capture Process Knowledge; Coordination (Strategy, Push, Organize, Talk); Time]

Page 9:

10-Day Cebuano Pre-Exercise

Page 10:

Hindi Participants

[Matrix of teams by task area; cell contents not recoverable]

Teams: Alias-i, UC Berkeley, BBN, CMU, CUNY, Johns Hopkins, IBM, ISI, LDC, MITRE, NYU, SPAWAR, U. Sheffield, U. Massachusetts, U. Maryland

Task areas: Resource Generation, Detection, Extraction, Summarization, Translation

Page 11:

Hindi Resources

• Much more data available than for Cebuano

• Data collected by all project participants
  – Web pages, News, Handbooks, Manually created, …
  – Dictionaries

• Major problems:
  – Many non-standard encodings
  – Often no converters available
  – Available converters often did not work properly

• Huge effort: data conversion and cleaning

• Resulting bilingual corpus: 4.2 million words

Page 12:

Hindi Translation Elicitation Server
Johns Hopkins University (David Yarowsky)

• People voluntarily translated large numbers of Hindi news sentences for nightly prizes at a novel Johns Hopkins University website

• Performance is measured by BLEU score on 20% randomly interspersed test sentences
  – Allows an immediate way to rank and reward quality translations and exclude junk

• Result: 300,000 words of perfectly sentence-aligned bitext (exactly on genre) for 1-2 cents/word within ~5 days
  – Much cheaper than 25 cents/word for translation services or 5 cents/word for a prior MT group’s recruitment of local students

• Sample interface:
  – User (English) translations typed here… and here ….
  – User choice of 2-3 encoding alternatives

• Observed exponential growth in usage (before prizes ended)
  – Viral advertising via family, friends, newsgroups, …
  – $0 in recruitment, advertising, and administrative costs

• Nightly incentive rewards given automatically via amazon.com gift certificates to email addresses (any $ amount, no fee)
  – No need for hiring overhead
  – Rewards only given for proven high-quality work already performed (prizes, not salary)
  – Immediate positive feedback encourages continued use

• Direct, immediate access to a worldwide labor market fluent in the source language

Page 13:

MT Challenges

• Lexicon coverage
  – Hindi morphology
  – Transliteration of names

• Hindi word order:
  – SOV vs. SVO

• Training data inconsistencies, misalignments

• Incomplete tuning cycle
  – Same data/same model would give better results from better tuning of model parameters

Page 14:

Example Translation

• Indonesian City of Bali in October last year in the bomb blast in the case of imam accused India of the sea on Monday began to be averted. The attack on getting and its plan to make the charges and decide if it were found guilty, he death sentence of May. Indonesia of the police said that the imam sea bomb blasts in his hand claim to be accepted. A night Club and time in the bomb blast in more than 200 people were killed and several injured were in which most foreign nationals. …

Page 15:

MT Results Overview - Hindi

[Bar chart: Percent Human Cased NIST r3n4 score (axis 50 to 90) for: best competing, ISI public, ISI public+, ISI unrestricted, ISI late, Human 6, Human 5]

Results in NIST evaluation: 7.43 Cased NIST (7.80 uncased)

Page 16:

Comparison to other languages

Language pair      Words of training data     NIST score   Relative to human NIST
Cebuano-English    1.3M (w/o Bible: 400K)     ?            ?
Hindi-English      4.2M                       7.4          73%
Chinese-English    150M                       9.0          80%
Arabic-English     120M                       10.1         89%

Note: different (news) test corpora, NIST scores incomparable

Page 17:

Hindi Week 1: Porting

• Monday
  – 2,973 BBC documents (UTF-8)
  – Batch CLIR (no stem, 2/3 known items rank 1)

• Tuesday
  – MIRACLE (“ITRANS”, gloss)
  – Stemmer (implemented from a paper)

• Wednesday
  – BBC CLIR collection (19 topics, known item)

• Friday
  – Parallel text (Bible: 900k words, Web: 4k words)
  – Devanagari OCR system

Page 18:

Hindi Weeks 2/3/4: Exploration

• N-grams (trigrams best for UTF-8)
• Relative Average Term Frequency (Kwok)
• Scanned bilingual dictionary (Oxford)
• More topics for test collection (29)
• Weighted structured queries (IBM lexicon)
• Alternative stemmers (U Mass, Berkeley)
• Blind relevance feedback
• Transliteration
• Noun phrase translation
• MIRACLE integration (ISI MT, BBN headlines)
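
The n-gram indexing mentioned in the first bullet can be sketched as follows. This is a minimal illustration, not the team's implementation: it assumes overlapping n-grams taken within whitespace-separated tokens and counted in Unicode code points (so a Devanagari vowel sign counts as one position). The function name `char_ngrams` is hypothetical.

```python
def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return overlapping character n-grams for each whitespace-separated token.

    Assumption: n-grams are counted in Unicode code points, which is what
    Python's str indexing uses. Tokens shorter than n are kept whole.
    """
    grams: list[str] = []
    for token in text.split():
        if len(token) <= n:
            grams.append(token)
        else:
            grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


print(char_ngrams("hindi news"))  # ['hin', 'ind', 'ndi', 'new', 'ews']
```

Indexing these n-grams in place of whole words is one way to get partial matches between morphological variants without a stemmer.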

Page 19:

[Timeline diagram of Hindi resources and systems, June 2 to June 24. Legend: University of Maryland, LDC, ISI, Other Resources. Items and date markers are listed in extraction order; the diagram's links between them are not recoverable.]

• IIIT Shabdanjali Dictionary in ISCII
• IIIT Shabdanjali Dictionary in UTF-8
• Original BBC-Hindi News Collection in HTML
• Cleaned BBC-Hindi News Collection in UTF-8
• Human Translated BBC News Documents
• Initial BLEU test collection
• Eng-Hindi CLIR Test Collection: 19 queries
• Hindi Bible
• Word Aligned Bible
• First MT System
• Paper: A Lightweight Stemmer for Hindi
• Hindi Stemmer
• First version CLIR system
• Second version CLIR system

June 2, June 3, June 4, June 5

• First version of Internet Archive (IA) Web parallel
• Hindi Morphological Analysers
• Converter between ISCII and UTF-8
• Converter from UTF-8 Devanagari to hexadecimal and to ITRANS
• Eng-Hindi dict with POS tags

June 6, June 9, June 10, June 11

• Small web parallel corpus
• Transliterated Hindi Bible
• ITRANS Hindi Bible
• Full Hindi OCR System
• Expanding Coverage of Gloss Translation
• Scored translation lexicon
• Third version CLIR system
• Master Dictionary Version 0.7
• Cleaned Master Dictionary Version 0.7
• Relevance Judgment of 19 queries
• Fourth version CLIR system
• Hindi stemmer in UTF-8 hext
• Small BBC word alignment
• Eng-Hindi CLIR Test Collection: 29 queries
• OCR System
• JHU
• 2nd version BBC Small word alignment

June 12, June 13, June 16

• BBC Small word alignment in UTF-8

June 17

• ISI Probabilistic Lexicon of June 13
• Berkeley Probabilistic Dictionaries of June 13
• Master Dictionary By Source Version 0.7 (only IIIT party)
• Scored translation lexicon version 2
• LDC Sentence Aligned Parallel Texts Collections
• Word alignment of LDC Parallel Texts
• Auto Word Aligned Hindi Bible
• EMILLE Corpus Version 0.1

June 18

• OCRed Oxford Hindi-English Dictionary

June 19

• Complete Machine Translated BBC Collection
• BBN Revised Hindi stemmer in UTF-8 hext

June 20

• Scored translation lexicon version 2.1
• Cleaned Complete Machine Translated BBC Collection
• XML Format OCRed Oxford Hindi-English Dictionary
• XML Format OCRed Oxford Hindi-English Dictionary Version 3.0

June 23

• ISI EMILLE Word Alignment June 23
• ISI EMILLE Word Alignment June 18
• Scored Lexicon

June 24

• BBN/UMD Topic Lists
• Second Version BBN/UMD Topic Lists

Page 20:

Formative Evaluation

[Chart: Mean Reciprocal Rank (y-axis, 0 to 0.8) against Day (= Date - 1) (x-axis, 0 to 30)]
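
Mean reciprocal rank, the measure tracked in this formative evaluation, averages 1/rank of the known item over all topics. The sketch below is illustrative only. Treating a known item that was not retrieved as contributing zero is an assumption rather than something the slides state, and the function name is hypothetical.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over topics.

    ranks[i] is the 1-based rank of the known item for topic i,
    or None if it was not retrieved (assumed to score 0).
    """
    if not ranks:
        raise ValueError("at least one topic is required")
    return sum((1.0 / r) if r else 0.0 for r in ranks) / len(ranks)


# Hypothetical ranks for three topics: found at rank 1, rank 2, and rank 4.
print(round(mean_reciprocal_rank([1, 2, 4]), 4))  # 0.5833
```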

Page 21:

Lessons Learned

• We learned more from 2 languages than 1
  – Simple techniques worked for Cebuano
  – Hindi needed more (encoding, MT, transliteration)

• Usable systems can be built in a month
  – Parallel text for MT is the pacing item

• Broad collaboration yielded useful insights

Page 22:

Our FIRE-2008 Goals

• Evaluate Surprise Language resources
  – IBM and LDC translation lexicons
  – Berkeley Stemmer

• Compare CLIR techniques
  – Probabilistic Structured Queries (PSQ)
  – Derived Aggregated Meaning Matching (DAMM)

Page 23:

Comparing Test Collections

                      FIRE-2008 Test Collection   Surprise Language Test Collection
Query language        English                     English
Doc language          Hindi                       Hindi
Topics                50                          15
Documents             95,215                      41,697
Avg rel docs/topic    68                          41

Page 24:

Monolingual Baselines

[Bar chart: Mean Average Precision (axis 0.0 to 0.5) for UMD, UMD-BRF, UMass, UMass-BRF. Bar labels shown: 0.37, 0.47, 0.47, 0.38. Legend: Our FIRE-2008 Training (TDN); 2003 Surprise Language (TDNS)]

15 Surprise Language topics

Page 25:

A Ranking Function: Okapi BM25

$$
\sum_{e \in Q}
\left[\log\frac{N - df(e) + 0.5}{df(e) + 0.5}\right]
\left[\frac{2.2 \cdot tf(e, d_k)}{0.3 + 0.9 \cdot \frac{dl(d_k)}{avdl} + tf(e, d_k)}\right]
\left[\frac{8 \cdot qtf(e)}{7 + qtf(e)}\right]
$$

where:
• e: query term
• Q: query
• d_k: document
• df(e): document frequency
• tf(e, d_k): term frequency
• dl(d_k): document length
• avdl: average document length
• qtf(e): term frequency in query
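
The ranking function on this slide can be sketched in a few lines. This is a minimal illustration using the slide's constants (2.2, 0.3, 0.9, 8 and 7), which correspond to the usual BM25 settings k1 = 1.2, b = 0.75 and k3 = 7. Taking N as the number of documents in the collection and using the natural logarithm are assumptions, since the slide labels neither. The function and parameter names are hypothetical.

```python
import math


def bm25_score(query_tf, doc_tf, df, num_docs, doc_len, avg_doc_len):
    """Okapi BM25 score of one document for one query.

    query_tf: term -> frequency in the query (qtf)
    doc_tf:   term -> frequency in the document (tf)
    df:       term -> document frequency in the collection
    num_docs: assumed meaning of N (collection size)
    """
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0.0)
        if tf == 0.0:
            continue  # a term absent from the document contributes nothing
        idf = math.log((num_docs - df[term] + 0.5) / (df[term] + 0.5))
        tf_part = (2.2 * tf) / (0.3 + 0.9 * doc_len / avg_doc_len + tf)
        qtf_part = (8.0 * qtf) / (7.0 + qtf)
        score += idf * tf_part * qtf_part
    return score


# Hypothetical numbers: one query term occurring 3 times in an
# average-length document, and in 10 of 1,000 documents overall.
print(bm25_score({"flood": 1}, {"flood": 3}, {"flood": 10}, 1000, 100, 100))
```

Because tf and df enter only as numbers, the same function works unchanged when they are estimated values rather than raw counts.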

Page 26:

Estimating TF and DF for Query Terms

$$
tf(e_i, d_k) = \sum_{f_j} p(e_i \leftrightarrow f_j) \cdot tf(f_j, d_k)
$$

$$
df(e_i) = \sum_{f_j} p(e_i \leftrightarrow f_j) \cdot df(f_j)
$$

where e_i is a query term, f_j is a document-language term, d_k is a document, and p(e_i ↔ f_j) is the translation probability linking e_i and f_j.

Example: query term e1 with four translations

Translation   Probability   tf(f_j, d_k)   df(f_j)
f1            0.4           20             50
f2            0.3           5              40
f3            0.2           2              30
f4            0.1           50             200

tf: 0.4*20 + 0.3*5 + 0.2*2 + 0.1*50 = 14.9

df: 0.4*50 + 0.3*40 + 0.2*30 + 0.1*200 = 58
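
The worked example on this slide can be reproduced with a short sketch: the TF and DF of an English query term are estimated as probability-weighted sums over its translations. The numbers are the slide's own; the function and variable names are hypothetical.

```python
def estimate_tf_df(translation_probs, doc_tf, collection_df):
    """Estimate tf and df of a query term from its translations.

    translation_probs: translation f_j -> translation probability
    doc_tf:            f_j -> term frequency in the document
    collection_df:     f_j -> document frequency in the collection
    """
    tf = sum(p * doc_tf.get(f, 0) for f, p in translation_probs.items())
    df = sum(p * collection_df.get(f, 0) for f, p in translation_probs.items())
    return tf, df


probs = {"f1": 0.4, "f2": 0.3, "f3": 0.2, "f4": 0.1}
tf_in_doc = {"f1": 20, "f2": 5, "f3": 2, "f4": 50}
df_in_collection = {"f1": 50, "f2": 40, "f3": 30, "f4": 200}

tf_e1, df_e1 = estimate_tf_df(probs, tf_in_doc, df_in_collection)
print(round(tf_e1, 1), round(df_e1, 1))  # 14.9 58.0
```

These estimates then stand in for the raw counts in the ranking function.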

Page 27:

Bidirectional Translation

wonders of ancient world (CLEF Topic 151)

Unidirectional:
se//0.31 demande//0.24 demander//0.08 peut//0.07 merveilles//0.04 question//0.02 savoir//0.02 on//0.02 bien//0.01 merveille//0.01 pourrait//0.01 si//0.01 sur//0.01 me//0.01 t//0.01 emerveille//0.01 ambition//0.01 merveilleusement//0.01 veritablement//0.01 cinq//0.01 hier//0.01

Bidirectional:
merveilles//0.92 merveille//0.03 emerveille//0.03 merveilleusement//0.02
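
One plausible way to obtain a sharper bidirectional distribution like this one is to multiply the translation probabilities from the two directions and renormalize. The slides do not state the exact combination rule, so the sketch below is an assumption. The forward values for "se" and "merveilles" come from the unidirectional list above; the reverse probabilities are invented for illustration, and all names are hypothetical.

```python
def bidirectional_probs(p_f_given_e, p_e_given_f, e):
    """Combine both translation directions for the source term e.

    Assumed rule (not confirmed by the slides): score each candidate f
    by p(f|e) * p(e|f), drop zero scores, then renormalize to sum to 1.
    """
    products = {}
    for f, p_fe in p_f_given_e.get(e, {}).items():
        p_ef = p_e_given_f.get(f, {}).get(e, 0.0)
        if p_fe * p_ef > 0.0:
            products[f] = p_fe * p_ef
    total = sum(products.values())
    return {f: v / total for f, v in products.items()} if total else {}


forward = {"wonders": {"se": 0.31, "merveilles": 0.04}}
# Invented reverse-direction probabilities, for illustration only.
reverse = {"se": {"wonders": 0.0001}, "merveilles": {"wonders": 0.9}}

print(bidirectional_probs(forward, reverse, "wonders"))
```

A frequent word such as "se" can score well in one direction through alignment noise but rarely translates back to the query term, so the product suppresses it.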

Page 28:

Surprise Language Translation Lexicons

Source       Translation pairs   English words   Hindi words
LDC (dict)   69,195              21,842          33,251
IBM (stat)   181,110             50,141          77,517
ISI (stat)   512,248             65,366          97,275

p(h|e)

p(e|h)

40%

60%

Page 29:

Synonym Sets as Models of Term Meaning

[Diagram. English terms: bush, grass. Chinese terms: 布什, 草丛, 大麻. Meanings shown with their Chinese translations: George W. Bush (乔治·布什), shrubbery (草丛), grass lawn (草坪), marijuana/grass (大麻). Edge weights shown: 0.7, 0.3, 0.8, 0.2, 0.6, 1.0, 0.4, 1.0]

Page 30:

“Meaning Matching” Variants

[Table; the row-to-cell alignment is not recoverable]

Columns: Icon; Query translation knowledge?; Document translation knowledge?; Query language synsets?; Document language synsets?; Pre-aligned synsets?

Variants: FAMM, DAMM, PAMMq, PAMMd, IMM, PSQ, APSQ, PDT, APDT

Icons, in extraction order: (Q) (D); (Q) (D); (Q) D; Q (D); Q D; Q D; Q (D); Q D; (Q) D

Page 31:

Pruning Translations

Translations, with probabilities:
f1 (0.32), f2 (0.21), f3 (0.11), f4 (0.09), f5 (0.08), f6 (0.05), f7 (0.04), f8 (0.03), f9 (0.03), f10 (0.02), f11 (0.01), f12 (0.01)

Cumulative Probability Threshold   Translations kept
0.0                                f1
0.1                                f1
0.2                                f1
0.3                                f1
0.4                                f1 f2
0.5                                f1 f2
0.6                                f1 f2 f3
0.7                                f1 f2 f3 f4
0.8                                f1 f2 f3 f4 f5
0.9                                f1 f2 f3 f4 f5 f6 f7
1.0                                f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12
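
The pruning rule illustrated on this slide can be sketched as follows: translations are taken in decreasing order of probability until their cumulative probability reaches the threshold, and at least one is always kept. The probabilities are the slide's. The function name and the small tolerance `eps`, added to absorb floating-point rounding, are assumptions.

```python
def prune_translations(probs, threshold, eps=1e-9):
    """Keep the most probable translations until their cumulative
    probability reaches `threshold`; always keep at least one."""
    kept = []
    cumulative = 0.0
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    for f, p in ranked:
        kept.append(f)
        cumulative += p
        if cumulative + eps >= threshold:
            break
    return kept


probs = {"f1": 0.32, "f2": 0.21, "f3": 0.11, "f4": 0.09, "f5": 0.08,
         "f6": 0.05, "f7": 0.04, "f8": 0.03, "f9": 0.03, "f10": 0.02,
         "f11": 0.01, "f12": 0.01}

for t in (0.0, 0.4, 0.6, 0.9, 1.0):
    print(t, prune_translations(probs, t))
```

Whether the surviving probabilities are renormalized after pruning is not stated on the slide, so the sketch returns only the retained translations.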

Page 32:

Comparing PSQ and DAMM

[Chart: MAP: CLIR/Monolingual (y-axis, 0% to 100%) against Cumulative Probability Threshold (x-axis, 0.0 to 1.0), for PSQ and DAMM]

15 Surprise Language topics, TDN queries

Page 33:

1/3 of Topics Improve w/DAMM

[Chart: MAP: DAMM-PSQ (y-axis, -0.2 to 0.2)]

15 Surprise Language topics, TDN queries

Page 34:

Official CLIR Results

[Bar chart: Mean Average Precision (axis 0.0 to 0.4) for: clir-EH-umd-man0 (UCB Stemmer); clir-EH-umd-man1 (YASS Stemmer); clir-EH-umd-man2 (UCB Stemmer, Pre-Trans BRF); clir-EH-median; mono-HH-best]

50 FIRE-2008 topics, TDN queries

Page 35:

Comparing Stemmers

[Chart: MAP: YASS-Berkeley (y-axis, -1.0 to 1.0). Regions labeled “YASS Stemmer Better” and “Berkeley Stemmer Better”]

50 FIRE-2008 Topics, TDN queries

Page 36:

Best (Overall) CLIR Run

[Chart: AP: clir-EH-umd-man2 - Median (y-axis, -1.0 to 1.0). Regions labeled “clir-EH-umd-man2 Better” and “Median Better”]

41 FIRE-2008 topics with ≥ 5 relevant documents, TDN queries

Page 37:

Cross-Language “Retrieval”

[Diagram: Query → Query Translation → Translated Query → Search → Ranked List]

Page 38:

Interactive Translingual Search

[Diagram. Labels shown: Query Formulation, Query, Query Translation, Translated Query, Search, Ranked List, Selection, Document, Examination, Document, Use, Query Reformulation, MT, Translated “Headlines”, English Definitions]

Page 39:

UMass Interactive Hindi CLIR

Page 40:

MIRACLE Design Goals

• Value-added interactive search
  – Regardless of available resources

• Maximize the value of minimal resources
  – Bilingual term list + Comparable English text

• Leverage other available resources
  – Parallel text, morphology, MT, summarization

Page 41: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.
Page 42: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.
Page 43: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.
Page 44: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.
Page 45: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.

Summary

• Larger Hindi test collection– Prerequisite for insightful failure analysis

• Surprise Language resources were useful– Translation lexicons– Berkeley stemmer (combine with YASS?)

• DAMM is robust with weaker resources

Page 46:

Looking Forward

• Shared resources
  – Test collections
  – Translation lexicons (or parallel corpora)
  – Stemmers

• System infrastructure
  – IL variants of Indri/Terrier/Zettair/Lucene

• Community-based cycle of innovation
  – Students are our most important “result”

Page 47:

For More Information

• Team TIDES newsletter
  – http://language.cnri.reston.va.us/TeamTIDES.html
  – Cebuano: April 2003
  – Hindi: October 2003

• Papers
  – NAACL/HLT 2003
  – MT Summit 2003
  – ACM TALIP Special Issues (Jun/Sep 2003)