22 August 2003 CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of...
![Page 1: 22 August 2003CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland.](https://reader038.fdocuments.us/reader038/viewer/2022110401/56649e035503460f94aedabc/html5/thumbnails/1.jpg)
22 August 2003 CLEF 2003
The 2003 TIDES Surprise Language Exercise
Douglas W. Oard
University of Maryland
![Page 2: 22 August 2003CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland.](https://reader038.fdocuments.us/reader038/viewer/2022110401/56649e035503460f94aedabc/html5/thumbnails/2.jpg)
Outline
• Thinking out of the box
• Some results
• Lessons Learned
![Page 3: 22 August 2003CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland.](https://reader038.fdocuments.us/reader038/viewer/2022110401/56649e035503460f94aedabc/html5/thumbnails/3.jpg)
Surprise Language Framework
• Zero-resource start (treasure hunt)
• Time constrained (10 or 29 days)
• English Users / Documents in language X
• Character-coded text
• Research-oriented
• Intensely collaborative (team-based)
![Page 4: 22 August 2003CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland.](https://reader038.fdocuments.us/reader038/viewer/2022110401/56649e035503460f94aedabc/html5/thumbnails/4.jpg)
Schedule
| Milestone | Cebuano | Hindi |
| --- | --- | --- |
| Announce | Mar 5 | Jun 1 |
| Test Data | | Jun 27 |
| Stop Work | Mar 14 | Jun 30 |
| Newsletter | April | August |
| Talks | May 30 (HLT) | Aug 5 (TIDES PI) |
| Papers | | Aug 15 (TALIP) |
![Page 5: 22 August 2003CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland.](https://reader038.fdocuments.us/reader038/viewer/2022110401/56649e035503460f94aedabc/html5/thumbnails/5.jpg)
16 Participating Teams
Cebuano and Hindi
ISI
Maryland
NYU
Johns Hopkins
Sheffield
LDC
CMU
UC Berkeley
MITRE
Hindi Only
U Mass
Alias-i
BBN
IBM
CUNY
KAT
SPAWAR
![Page 6: 22 August 2003CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland.](https://reader038.fdocuments.us/reader038/viewer/2022110401/56649e035503460f94aedabc/html5/thumbnails/6.jpg)
• Five evaluated tasks
– Automatic CLIR (English queries)
– Topic tracking (English examples, event-based)
– Machine translation into English
– English “Headline” generation
– Entity tagging (five MUC types)
• Several useful components
– POS tags, morphology, time expressions, parsing
• Several demonstration systems
– Interactive CLIR (two systems)
– Cross-language QA (English Q, Translated A)
– Machine translation (+ Translation elicitation)
– Cross-document entity tracking
![Page 7: 22 August 2003CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland.](https://reader038.fdocuments.us/reader038/viewer/2022110401/56649e035503460f94aedabc/html5/thumbnails/7.jpg)
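Hindi Participants
[Figure: matrix of participating sites against task areas; the cell contents did not survive extraction.]
Sites: Alias-i, UC Berkeley, BBN, CMU, CUNY, Johns Hopkins, IBM, ISI, LDC, MITRE, NYU, SPAWAR, U. Sheffield, U. Massachusetts, U. Maryland
Task areas: Resource Generation, Detection, Extraction, Summarization, Translation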
![Page 8: 22 August 2003CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland.](https://reader038.fdocuments.us/reader038/viewer/2022110401/56649e035503460f94aedabc/html5/thumbnails/8.jpg)
[Diagram; surviving labels: Translation, Detection, Extraction, Summarization; Books, Web, People; Lexicons, Corpora; Time; Resource Harvesting; Systems; Research Results; Capture Process Knowledge; Innovation Cycle; Coordination; Strategy, Push, Organize, Talk.]
![Page 9: 22 August 2003CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland.](https://reader038.fdocuments.us/reader038/viewer/2022110401/56649e035503460f94aedabc/html5/thumbnails/9.jpg)
![Page 10: 22 August 2003CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland.](https://reader038.fdocuments.us/reader038/viewer/2022110401/56649e035503460f94aedabc/html5/thumbnails/10.jpg)
The Synchronization Challenge
![Page 11: 22 August 2003CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland.](https://reader038.fdocuments.us/reader038/viewer/2022110401/56649e035503460f94aedabc/html5/thumbnails/11.jpg)
Cebuano MT Results
[Figure: bar chart of BLEU (%), axis from 0 to 12, for systems built from different combinations of training resources. Legend: Bible, Cebuano book, Dict, Melamed, News.]
![Page 12: 22 August 2003CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland.](https://reader038.fdocuments.us/reader038/viewer/2022110401/56649e035503460f94aedabc/html5/thumbnails/12.jpg)
Cebuano Interactive CLIR
• Starting Point: iCLEF 2002 system (German)
– Interface: “synonyms”/examples (parallel)/MT
– Back end: InQuery/Pirkola’s method
• 3-day porting effort
– Cebuano indexing (no stemming)
– One-best gloss translation (bilingual term list)
• Informal Evaluation
– 2 Cebuano native speakers (at ISI)
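Pirkola's method, named above as the back-end query strategy, treats all translations of one query term as a single synonym set: their term frequencies are summed within each document, and the document frequency is the number of documents containing any of them. The following is a minimal sketch of those two statistics only, assuming a toy in-memory collection; the function name `pirkola_stats` and the placeholder tokens are hypothetical, and this is not the InQuery implementation.

```python
from collections import Counter

def pirkola_stats(translations, docs):
    """Return (per-document tf list, df) for one query term whose
    translations are treated as a single synonym set."""
    tfs = []
    for doc in docs:
        counts = Counter(doc)
        # Sum the frequencies of every alternative translation in this document
        tfs.append(sum(counts[t] for t in translations))
    # A document counts once toward df if it contains any translation
    df = sum(1 for tf in tfs if tf > 0)
    return tfs, df

# Hypothetical toy data: tokenized documents and placeholder translations
docs = [
    ["a", "b", "a"],
    ["c", "d"],
    ["b", "e"],
]
tfs, df = pirkola_stats({"a", "b"}, docs)
print(tfs, df)
```

The combined tf and df can then be plugged into whatever term-weighting formula the retrieval engine uses, so a term with many translations is not weighted more heavily than a term with one.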
![Page 13: 22 August 2003CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland.](https://reader038.fdocuments.us/reader038/viewer/2022110401/56649e035503460f94aedabc/html5/thumbnails/13.jpg)
Hindi syntax is generally very “regular”
• Subject – Object – Verb is the preferred order
– John saw Mary. = जॉन ने मेरी को देखा ।
• Presence of (occasionally deleted) case markers often permits reordering
– John saw Mary. = मेरी को जॉन ने देखा ।
• English (or western) punctuation is pervasive in many modern texts
– John said, “I am here” = जॉन ने कहा, “मैं यहाँ हूँ”
• The subject may be omitted in some contexts
– A: Where is John? B: [He] went home.
– अ: जॉन कहाँ है? ब: [वह] घर चला गया ।
![Page 14: 22 August 2003CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland.](https://reader038.fdocuments.us/reader038/viewer/2022110401/56649e035503460f94aedabc/html5/thumbnails/14.jpg)
Hindi Encoding
• Text encoding for storage and transmission and text rendering for display and printing are separated
• Which syllable constituents get their own code-points?
– Several 8-bit encodings:
    • After assigning a code point to each stand-alone vowel and full consonant, and to half-consonants and vowels within a syllable, spare code-points get used for assorted/frequent CC clusters.
– Unicode UTF-16: Only stand-alone vowels, full consonants and vowels within syllables have their own code-points. All half consonants are realized by a “full consonant + halant” sequence
• Choice of the “grammar” for syllable construction and rendering?
– Several 8-bit encodings write the code-points in display order, simplifying the rendering program
– Unicode writes them in pronunciation order, making for a considerably more complex display program
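The two Unicode points above can be checked directly from the code points. The sketch below uses Python's standard `unicodedata` module; the helper `describe` and the two sample strings are illustrative choices, not taken from the slides. Unicode names the halant "VIRAMA".

```python
import unicodedata

def describe(text):
    """List (code point, Unicode name) for each character in stored order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# Conjunct "ksha": the half consonant is stored as KA + VIRAMA (halant),
# followed by the full consonant SSA
conjunct = "\u0915\u094D\u0937"

# Syllable "ki": the vowel sign I is drawn to the left of KA on screen,
# but it is stored after KA (pronunciation order)
syllable = "\u0915\u093F"

for label, text in (("conjunct", conjunct), ("syllable", syllable)):
    print(label, describe(text))
```

A renderer therefore has to reorder and merge glyphs at display time, which is the extra complexity the slide refers to.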
![Page 15: 22 August 2003CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland.](https://reader038.fdocuments.us/reader038/viewer/2022110401/56649e035503460f94aedabc/html5/thumbnails/15.jpg)
Hindi Week 1: Porting
• Monday
– 2,973 BBC documents (UTF-8)
– Batch CLIR (no stemming, 2/3 known items rank 1)
• Tuesday
– MIRACLE (“ITRANS”, gloss)
– Stemmer (implemented from a paper)
• Wednesday
– BBC CLIR collection (19 topics, known item)
• Friday
– Parallel text (Bible: 900k words, Web: 4k words)
– Devanagari OCR system
![Page 16: 22 August 2003CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland.](https://reader038.fdocuments.us/reader038/viewer/2022110401/56649e035503460f94aedabc/html5/thumbnails/16.jpg)
Hindi Weeks 2/3/4: Exploration
• N-grams (trigrams best for UTF-8)
• Relative Average Term Frequency (Kwok)
• Scanned bilingual dictionary (Oxford)
• More topics for test collection (29)
• Weighted structured queries (IBM lexicon)
• Alternative stemmers (U Mass, Berkeley)
• Blind relevance feedback
• Transliteration
• Noun phrase translation
• MIRACLE integration (ISI MT, BBN headlines)
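As an illustration of the n-gram indexing idea in the list above, overlapping character trigrams can stand in for stemmed words as indexing terms. This is a minimal sketch that assumes n-grams are taken over Unicode code points of each whitespace-delimited token; the slides do not say whether the original experiments counted code points or bytes, and `char_ngrams` is a hypothetical helper.

```python
def char_ngrams(token, n=3):
    """Return the overlapping character n-grams of a token.
    Tokens no longer than n are kept whole."""
    if len(token) <= n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]

# A six-code-point Devanagari word (DA, KA, VIRAMA, SSA, VOWEL SIGN I, NNA)
word = "\u0926\u0915\u094D\u0937\u093F\u0923"
grams = char_ngrams(word)
print(len(grams), grams)
```

Because vowel signs and the virama are separate code points, a trigram can cut through the middle of a syllable; that is harmless for matching as long as queries and documents are segmented the same way.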
![Page 17: 22 August 2003CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland.](https://reader038.fdocuments.us/reader038/viewer/2022110401/56649e035503460f94aedabc/html5/thumbnails/17.jpg)
![Page 18: 22 August 2003CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland.](https://reader038.fdocuments.us/reader038/viewer/2022110401/56649e035503460f94aedabc/html5/thumbnails/18.jpg)
![Page 19: 22 August 2003CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland.](https://reader038.fdocuments.us/reader038/viewer/2022110401/56649e035503460f94aedabc/html5/thumbnails/19.jpg)
Formative Evaluation
[Figure: Mean Reciprocal Rank (y-axis, 0 to 0.8) plotted against Day (x-axis, 0 to 30, where Day = Date-1).]
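For a known-item test collection, mean reciprocal rank averages 1/rank of the single correct document over all topics, counting 0 when the document is not retrieved. A minimal sketch, with hypothetical ranks rather than results from the exercise:

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the known item for each topic,
    or None when the item was not retrieved."""
    if not ranks:
        raise ValueError("need at least one topic")
    # Topics where the item was missed contribute 0 to the sum
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Hypothetical ranks for four topics
example = [1, 1, 2, None]
print(mean_reciprocal_rank(example))
```

A single number per day makes it cheap to track whether each day's changes helped, which is the role the plot above plays.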
![Page 20: 22 August 2003CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland.](https://reader038.fdocuments.us/reader038/viewer/2022110401/56649e035503460f94aedabc/html5/thumbnails/20.jpg)
Transliteration
• Importance: Names, loan words
– दक्षिण कोरिया (Dakshin Korea)
• Pronunciation crosswalk English->Hindi
– English pronunciation (Festival)
– Overgenerate Hindi characters (hand-built rules)
    • Doctor => d aa k t ax r OR d ao k t ax r
– Rank n-best using bigrams (Hindi name list)
• Treat as alternate translations for CLIR
– Pirkola’s method
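The "rank n-best using bigrams" step can be sketched as a character bigram model trained on a name list and used to score overgenerated candidates. Everything below is an assumption made for illustration: add-one smoothing, the boundary markers, the function names, and the romanized placeholder strings standing in for Hindi spellings. The slides do not describe the model actually used.

```python
import math
from collections import Counter

def train_bigrams(names):
    """Count character bigrams (with boundary markers) over a name list."""
    bigrams, unigrams = Counter(), Counter()
    for name in names:
        chars = ["^"] + list(name) + ["$"]
        for a, b in zip(chars, chars[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    vocab = {c for name in names for c in name} | {"^", "$"}
    return bigrams, unigrams, len(vocab)

def log_score(candidate, model):
    """Add-one smoothed log-probability of a candidate string."""
    bigrams, unigrams, v = model
    chars = ["^"] + list(candidate) + ["$"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + v))
        for a, b in zip(chars, chars[1:])
    )

def rank_nbest(candidates, model, n=5):
    """Return the n highest-scoring candidates, best first."""
    ranked = sorted(candidates, key=lambda c: log_score(c, model), reverse=True)
    return ranked[:n]

# Hypothetical romanized stand-ins for a Hindi name list and candidates
model = train_bigrams(["daktar", "dakshin", "dilli"])
ranked = rank_nbest(["xqzzqx", "daktar"], model)
print(ranked)
```

Raw log-probabilities favor shorter strings, so a real system comparing candidates of different lengths would probably normalize by length.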
![Page 21: 22 August 2003CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland.](https://reader038.fdocuments.us/reader038/viewer/2022110401/56649e035503460f94aedabc/html5/thumbnails/21.jpg)
Some Challenges
• Formative evaluation
• Synchronize variable-rate efforts
– Soccer, not football
• Integration
• Capturing lessons learned
– See the forest, not just the trees
![Page 22: 22 August 2003CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland.](https://reader038.fdocuments.us/reader038/viewer/2022110401/56649e035503460f94aedabc/html5/thumbnails/22.jpg)
For More Information
• TIDES Newsletter
– Cebuano: April
– Hindi: August
• Papers
– NAACL/HLT Short paper
– MT Summit (late Sep)
– ACM TALIP Special Issue
• Demonstration systems
– Contact individual sites