Machine Translation for Digital Library Content
Carnegie Mellon University
Jaime Carbonell et al.
Background
1. State of the Art in MT
2. The General MT Challenge
3. MT for Low-Density Languages
4. Active Crowd-Sourcing for MT
Relevant MT Paradigms
1. Statistical/Example-Based MT
2. Context-Based MT
3. Learning Transfer Rules
4. Active Data Collection
5th ICUDL – November 2009
The Multilingual Bookstack
English
Arabic
Spanish
Chinese
Telugu
+ Multiple other languages, including rare and ancient ones
Multi-lingual Challenges Are Not New
The Great Library + Tower of Babel = ?
But the Technological Solutions Are New
Coping with Massive Multilinguality
• Multilingual Content
  – Scanning, indexing, interlingual metadata, …
  – Within scope of UDL
• Multilingual Access
  – Pre-translation upon acquisition?
    • Need to store content N times (N languages)
  – On-demand translation upon query?
    • Needs fast translation, cross-lingual retrieval
• Requisite Technology
  – Accurate OCR, cross-lingual search
  – Accurate machine translation, beyond common languages
Challenges for Machine Translation
• Ambiguity Resolution
  – Lexical, phrasal, structural
• Structural divergence
  – Reordering, vanishing/appearing words, …
• Inflectional morphology
  – Spanish: 40+ verb conjugations; Arabic has more
• Training Data
  – Bilingual corpora, aligned corpora, annotated corpora
• Evaluation
  – Automated (BLEU, METEOR, TER) vs. HTER vs. …
Context Needed to Resolve Ambiguity
Example: English → Japanese
  Power line – densen (電線)
  Subway line – chikatetsu (地下鉄)
  (Be) on line – onrain (オンライン)
  (Be) on the line – denwachuu (電話中)
  Line up – narabu (並ぶ)
  Line one's pockets – kanemochi ni naru (金持ちになる)
  Line one's jacket – uwagi o nijuu ni suru (上着を二重にする)
  Actor's line – serifu (セリフ)
  Get a line on – joho o eru (情報を得る)

Sometimes local context suffices (as above), so n-grams help . . . but sometimes not
CONTEXT: More is Better
• Examples requiring longer-range context:
  – "The line for the new play extended for 3 blocks."
  – "The line for the new play was changed by the scriptwriter."
  – "The line for the new play got tangled with the other props."
  – "The line for the new play better protected the quarterback."
• Challenges:
  – Short n-grams (3–4 words) are insufficient
  – Requires more general semantics and domain specificity
Low Density Languages
• 6,900 languages in 2000 – Ethnologue: www.ethnologue.com/ethno_docs/distribution.asp?by=area
• 77 (1.2%) have over 10M speakers
  – 1st is Chinese, 5th is Bengali, 11th is Javanese
• 3,000 have over 10,000 speakers each
• 3,000 may survive past 2100
• 5X to 10X number of dialects
• Number of languages in some interesting countries:
  – Pakistan: 77; India: 400
  – North Korea: 1; Indonesia: 700
Some Linguistics Maps
Some (very) LD Languages in North America
Anishinaabe (Ojibwe, Potawatomi, Odawa) – Great Lakes
Additional Challenges for Rare and Ancient Languages
• Morpho-syntactics is plentiful
  – Beyond inflection: verb incorporation, agglomeration, …
• Data is scarce
  – Insignificant bilingual or annotated data
• Fluent computational linguists are scarce
  – Field linguists know LD languages best
• Standardization is scarce
  – Orthographic, dialectal, rapid evolution, …
Morpho-Syntactics & Multi-Morphemics
• Iñupiaq (North Slope Alaska, Lori Levin)
  – Tauqsiġñiaġviŋmuŋniaŋitchugut.
  – 'We won't go to the store.'
• Kalaallisut (Greenlandic, Per Langaard, p.c.)
  – Pittsburghimukarthussaqarnavianngilaq
  – Pittsburgh+PROP+Trim+SG+kar+tuq+ssaq+qar+naviar+nngit+v+IND+3SG
  – "It is not likely that anyone is going to Pittsburgh"
Morphotactics in Iñupiaq
Type Token Curve for Mapudungun
[Figure: type/token curves for Mapudungun and Spanish; x-axis: tokens in thousands (0–1,500); y-axis: types in thousands (0–140)]

• 400,000+ speakers
• Mostly bilingual
• Mostly in Chile
• Varieties: Pewenche, Lafkenche, Nguluche, Huilliche
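The curve above can be reproduced for any corpus by counting distinct word forms (types) as tokens accumulate; a minimal sketch, leaving tokenization abstract and sampling at an arbitrary interval:

```python
def type_token_curve(tokens, step=1000):
    """Record (tokens seen, distinct types seen) after every `step` tokens.
    Morphologically rich languages yield far steeper curves."""
    seen, curve = set(), []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve
```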
Paradigms for Machine Translation
[Diagram: the MT pyramid. Direct approaches (SMT, EBMT, CBMT, …) at the base; Transfer Rules in the middle; Interlingua at the apex. Analysis side: Syntactic Parsing, Semantic Analysis; generation side: Sentence Planning, Text Generation. Source (e.g. Pashto) → Target (e.g. English)]
Which MT Paradigms are Best? Towards Filling the Table
            Large T   Med T   LD T
  Large S   SMT       ???     ???
  Med S     ???       ???     ???
  LD S      ???       ???     ???

• DARPA MT: Large S → Large T
  – Arabic → English; Chinese → English
An Evolutionary Tree of MT Paradigms
[Timeline, 1950–2010: Decoding MT; Transfer MT → Large-scale Transfer MT → Larger-Scale Transfer MT; Interlingua MT; Analogy MT → Example-based MT → Context-Based MT; Statistical MT → Phrasal SMT → Stat Syntax MT, Stat Transfer MT]
Parallel Text: Requiring Less is Better (Requiring None is Best)
• Challenge
  – There is just not enough parallel text to approach human-quality MT for major language pairs (we need ~100X to ~10,000X more)
  – Much parallel text is not on-point (not on domain)
  – LD languages or distant pairs have very little parallel text
• CBMT Approach [Abir, Carbonell, Sofizade, …]
  – Requires no parallel text, no transfer rules . . .
  – Instead, CBMT needs:
    • A fully-inflected bilingual dictionary
    • A (very large) target-language-only corpus
    • A (modest) source-language-only corpus [optional, but preferred]
CBMT System
[System diagram. Components: source-language and target-language parsers; n-gram segmenter; overlap-based decoder; bilingual dictionary; indexed resources (target corpora, optional source corpora); n-gram builders (translation model); Flooder (non-parallel text method); cross-language n-gram database; cache database; n-gram connector handling candidate, approved, and stored n-gram pairs; gazetteers; edge locker; TTR; substitution requests]
Step 1: Source Sentence Chunking
• Segment the source sentence into overlapping n-grams via a sliding window
• Typical n-gram length: 4 to 9 terms
• Each term is a word or a known phrase
• Any sentence length (for the BLEU test: average 27, shortest 8, longest 66 words)
S1 S2 S3 S4 S5 S6 S7 S8 S9
S1 S2 S3 S4 S5
S2 S3 S4 S5 S6
S3 S4 S5 S6 S7
S4 S5 S6 S7 S8
S5 S6 S7 S8 S9
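The sliding window above can be sketched in a few lines (window fixed at 5 for illustration; the real system varies it from 4 to 9 and treats known multi-word phrases as single terms):

```python
def chunk(terms, n=5):
    """Produce overlapping n-grams via a sliding window.
    Sentences no longer than the window yield a single chunk."""
    if len(terms) <= n:
        return [terms]
    return [terms[i:i + n] for i in range(len(terms) - n + 1)]
```

For the nine-term sentence S1 … S9 this reproduces the five windows shown above.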
Step 2: Dictionary Lookup
• Using the fully-inflected bilingual dictionary, list all possible target translations for each source word or phrase

  Source word-string: S2 S3 S4 S5 S6
  Target word lists (the Flooding Set):
    S2: T2-a T2-b T2-c T2-d
    S3: T3-a T3-b T3-c
    S4: T4-a T4-b T4-c T4-d T4-e
    S5: T5-a
    S6: T6-a T6-b T6-c
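A toy version of the lookup, with a hypothetical stand-in for the fully-inflected bilingual dictionary (passing unknown terms through untranslated is an assumption made here for simplicity):

```python
# Hypothetical stand-in for the fully-inflected bilingual dictionary.
BILINGUAL_DICT = {
    "S2": ["T2-a", "T2-b", "T2-c", "T2-d"],
    "S3": ["T3-a", "T3-b", "T3-c"],
    "S4": ["T4-a", "T4-b", "T4-c", "T4-d", "T4-e"],
    "S5": ["T5-a"],
    "S6": ["T6-a", "T6-b", "T6-c"],
}

def flooding_set(source_ngram, dictionary=BILINGUAL_DICT):
    """One target word-list per source term; together they form the Flooding Set."""
    return [dictionary.get(term, [term]) for term in source_ngram]
```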
Step 3: Search Target Text
• Using the Flooding Set, search the target text for word-strings containing one word from each group
• Find the maximum number of words from the Flooding Set in the minimum-length word-string
  – Words or phrases can be in any order
  – Ignore function words in the initial step (T5 is a function word in this example)

  Flooding Set: {T2-a T2-b T2-c T2-d} {T3-a T3-b T3-c} {T4-a T4-b T4-c T4-d T4-e} {T5-a} {T6-a T6-b T6-c}
Step 3: Search Target Text (Examples)

Scanning the target corpus (T(x) marks words outside the Flooding Set):
  … T3-b T(x) T2-d T(x) T(x) T6-c …   → Target Candidate 1
  … T4-a T6-b T(x) T2-c T3-a …        → Target Candidate 2
  … T3-c T2-b T4-e T5-a T6-a …        → Target Candidate 3

Reintroduce function words after the initial match (e.g. T5)
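A linear-scan sketch of the flooding search over a tokenized target corpus. The window bound, the scoring tuple, and the handling of function words are simplifications; the production system searches an indexed corpus and defers function words to a later pass:

```python
def flood_search(flooding_set, corpus, max_window=8):
    """Find the target word-string covering the most Flooding Set groups
    (one word per group, in any order) in the fewest words."""
    # Map each candidate target word to the group(s) it could translate.
    word_to_groups = {}
    for g, alternatives in enumerate(flooding_set):
        for w in alternatives:
            word_to_groups.setdefault(w, set()).add(g)
    best_key, best_span = (0, 0), []
    for start in range(len(corpus)):
        covered = set()
        for end in range(start, min(start + max_window, len(corpus))):
            covered = covered | word_to_groups.get(corpus[end], set())
            # Rank by coverage first (recall), then by shortness (precision).
            key = (len(covered), -(end - start + 1))
            if key > best_key:
                best_key, best_span = key, corpus[start:end + 1]
    return best_span
```

On the example above, the contiguous run T3-c T2-b T4-e T5-a T6-a covers all five groups in five words and wins.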
Step 4: Score Word-String Candidates
• Scoring of candidates based on:
  – Proximity (minimize extraneous words in the target n-gram: precision)
  – Number of word matches (maximize coverage: recall)
  – Regular words given more weight than function words
  – Combine results (e.g., optimize F1 or p-norm or …)

  Target word-string candidate       Proximity  Word matches  "Regular" words  Total
  T3-b T(x) T2-d T(x) T(x) T6-c      3rd        3rd           3rd              3rd
  T4-a T6-b T(x) T2-c T3-a           1st        2nd           1st              2nd
  T3-c T2-b T4-e T5-a T6-a           1st        1st           1st              1st
Step 5: Select Candidates Using Overlap (propagate context over the entire sentence)

  Word-String 1 candidates:
    T3-b T(x3) T2-d T(x5) T(x6) T6-c
    T(x2) T4-a T6-b T(x3) T2-c
    T(x1) T3-c T2-b T4-e
  Word-String 2 candidates:
    T(x1) T2-d T3-c T(x2) T4-b
    T4-a T6-b T(x3) T2-c T3-a
    T3-c T2-b T4-e T5-a T6-a
  Word-String 3 candidates:
    T6-b T(x11) T2-c T3-a T(x9)
    T6-b T(x3) T2-c T3-a T(x8)
    T2-b T4-e T5-a T6-a T(x8)
Step 5: Select Candidates Using Overlap

  Alternative 1:
    T(x2) T4-a T6-b T(x3) T2-c
          T4-a T6-b T(x3) T2-c T3-a
               T6-b T(x3) T2-c T3-a T(x8)
    = T(x2) T4-a T6-b T(x3) T2-c T3-a T(x8)

  Alternative 2:
    T(x1) T3-c T2-b T4-e
          T3-c T2-b T4-e T5-a T6-a
               T2-b T4-e T5-a T6-a T(x8)
    = T(x1) T3-c T2-b T4-e T5-a T6-a T(x8)

Best translations selected via maximal overlap
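The maximal-overlap connection can be sketched as a greedy left-to-right join (the real decoder searches over many candidate combinations rather than committing to one per position):

```python
def overlap_len(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def join_by_overlap(ngrams):
    """Connect adjacent target n-grams by maximal overlap,
    propagating context across the whole sentence."""
    joined = list(ngrams[0])
    for ng in ngrams[1:]:
        k = overlap_len(joined, ng)
        joined.extend(ng[k:])
    return joined
```

Joining the three word-strings of Alternative 2 above reproduces T(x1) T3-c T2-b T4-e T5-a T6-a T(x8).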
A (Simple) Real Example of Overlap
  N-grams generated from Flooding:
    A United States soldier
    United States soldier died
    soldier died and two others
    died and two others were injured
    two others were injured Monday

  N-grams connected via Overlap:
    A United States soldier died and two others were injured Monday

  Systran (for comparison):
    A soldier of the wounded United States died and other two were east Monday

  Flooding gives n-gram fidelity; Overlap gives long-range fidelity
System Scores

BLEU scores (4 reference translations); Spanish systems use the same Spanish test set:

  Systran Spanish:                    0.3859
  Google Arabic ('06 NIST):           0.5137
  Google Chinese ('06 NIST):          0.5551
  SDL Spanish:                        0.5610
  Google Spanish ('08 top language):  0.7189
  CBMT Spanish (blind):               0.7447
  CBMT Spanish (non-blind):           0.7533

  Human scoring range: up to ~0.85
Historical CBMT Scoring

[Chart: BLEU scores, Apr 2004 through 2008, on a 0.3–0.85 axis. Scores rise from .3743 (Apr 2004) to .7447 (blind) and .7533 (non-blind) by 2008; intermediate points include .5670, .5953, .6129, .6144, .6165, .6267, .6374, .6393, .6456, .6645, .6694, .6929, .6950, .7059, .7276, .7354, .7365. The human scoring range sits near the top of the axis.]
An Example
• Un soldado de Estados Unidos murió y otros dos resultaron heridos este lunes por el estallido de un artefacto explosivo improvisado en el centro de Bagdad, dijeron funcionarios militares estadounidenses
• CBMT: A United States soldier died and two others were injured monday by the explosion of an improvised explosive device in the heart of Baghdad, American military officials said.
• Systran: A soldier of the wounded United States died and other two were east Monday by the outbreak from an improvised explosive device in the center of Bagdad, said American military civil employees

BTW: Google's translation is identical to CBMT's
Stat-Transfer MT: Research Goals (Lavie, Carbonell, Levin, Vogel & Students)
• Long-term research agenda (since 2000) focused on developing a unified framework for MT that addresses the core fundamental weaknesses of previous approaches:
  – Representation: explore richer formalisms that can capture complex divergences between languages
  – Ability to handle morphologically complex languages
  – Methods for automatically acquiring MT resources from available data and combining them with manual resources
  – Ability to address both rich- and poor-resource scenarios
• Main research funding sources: NSF (AVENUE and LETRAS projects) and DARPA (GALE)
Stat-XFER: List of Ingredients
• Framework: Statistical search-based approach with syntactic translation transfer rules that can be acquired from data but also developed and extended by experts
• SMT-Phrasal Base: Automatic Word and Phrase translation lexicon acquisition from parallel data
• Transfer-rule Learning: apply ML-based methods to automatically acquire syntactic transfer rules for translation between the two languages
• Elicitation: use bilingual native informants to produce a small high-quality word-aligned bilingual corpus of translated phrases and sentences
• Rule Refinement: refine the acquired rules via a process of interaction with bilingual informants
• XFER + Decoder:
  – XFER engine produces a lattice of possible transferred structures at all levels
  – Decoder searches and selects the best-scoring combination
Stat-XFER MT Approach
[Diagram: the same MT pyramid, now with Statistical-XFER at the transfer-rule level; Direct approaches (SMT, EBMT) at the base, Interlingua at the apex. Source (e.g. Urdu) → Target (e.g. English)]
Avenue Architecture
[System diagram with four stages: Elicitation (elicitation corpus + elicitation tool → word-aligned parallel corpus); Rule Learning (learning module + morphology analyzer → learned transfer rules); Rule Refinement (translation correction tool driving a rule refinement module); and the Run-Time System (run-time transfer system + decoder, applying learned transfer rules, handcrafted rules, and lexical resources to map input text to output text)]
Syntax-driven Acquisition Process
Automatic Process for Extracting Syntax-driven Rules and Lexicons from sentence-parallel data:
1. Word-align the parallel corpus (GIZA++)
2. Parse the sentences independently for both languages
3. Tree-to-tree Constituent Alignment:
   a) Run our new Constituent Aligner over the parsed sentence pairs
   b) Enhance alignments with additional Constituent Projections
4. Extract all aligned constituents from the parallel trees
5. Extract all derived synchronous transfer rules from the constituent-aligned parallel trees
6. Construct a “data-base” of all extracted parallel constituents and synchronous rules with their frequencies and model them statistically (assign them relative-likelihood probabilities)
PFA Node Alignment Algorithm Example
• Any constituent or sub-constituent is a candidate for alignment
• Triggered by word/phrase alignments
• Tree structures can be highly divergent
PFA Node Alignment Algorithm Example
• Tree-tree aligner enforces equivalence constraints and optimizes over terminal alignment scores (words/phrases)
• Resulting aligned nodes are highlighted in the figure
• Transfer rules are partially lexicalized and read off the tree
Targeted Data Acquisition
• Elicitation Corpus for Structural Diversity
  – Field Linguist in a Box (Levin et al)
  – Automated selection of next elicitations
• Active Learning
  – To select maximally informative sentences to translate
    • E.g. with max-frequency n-grams in the monolingual corpus
    • E.g. maximally different from the current bilingual corpus
  – To select hand alignments
  – Use Amazon's Mechanical Turk or a Web game
    • Multi-source independent verification
  – Elicit just bilingual dictionaries (for CBMT)
Active Learning for MT
[Loop diagram: an Active Learner samples sentences S from a source-language corpus; an Expert Translator returns pairs (S,T); the sampled corpus is added to the parallel corpus, which feeds the Model Trainer and the MT System]

Active Crowd Translation
[Loop diagram (ACT Framework): as above, but each sampled sentence S goes to multiple crowd translators, yielding (S,T1), (S,T2), …, (S,Tn); Sentence Selection chooses what to translate and Translation Selection chooses which crowd translation to keep]
Active Learning Strategy: Diminishing Density Weighted Diversity Sampling
density(S) = [ Σ_{x ∈ Phrases(S)} P(x/U) · e^(−count(x/L)) ] / |Phrases(S)|

diversity(S) = [ Σ_{x ∈ Phrases(S)} count(x) · δ(x) ] / |Phrases(S)|,
  where δ(x) = 1 if x ∉ L and 0 if x ∈ L

Score(S) = (1 + β²) · density(S) · diversity(S) / (β² · density(S) + diversity(S))

(U: the unlabeled source corpus; L: the labeled, i.e. already-translated, corpus)
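A sketch of the sampling score along these lines, under stated simplifications: `p_unlabeled` holds phrase probabilities estimated from the unlabeled corpus, density is discounted by how often each phrase already appears in the labeled data, and the diversity term is reduced to the fraction of phrases absent from the labeled data:

```python
import math

def ddd_score(sentence_phrases, p_unlabeled, counts_labeled, beta=1.0):
    """Diminishing Density Weighted Diversity score for one candidate
    sentence (a simplified sketch of the formulas above)."""
    n = len(sentence_phrases)
    if n == 0:
        return 0.0
    # Density: frequent-in-U phrases, exponentially discounted as they
    # accumulate in the labeled (translated) corpus L.
    density = sum(p_unlabeled.get(x, 0.0) * math.exp(-counts_labeled.get(x, 0))
                  for x in sentence_phrases) / n
    # Diversity (simplified): fraction of phrases not yet covered by L.
    diversity = sum(1 for x in sentence_phrases if x not in counts_labeled) / n
    if density == 0.0 and diversity == 0.0:
        return 0.0
    b2 = beta * beta
    # F-beta-style combination of the two criteria.
    return (1 + b2) * density * diversity / (b2 * density + diversity)
```

A sentence whose phrases are all already translated scores 0, so the sampler keeps moving to novel material.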
Experiments:
• Language pair: Spanish–English
• Iterations: 20; batch size: 1,000 sentences each
• Translation: Moses phrase-based SMT
• Development set: 343 sentences; test set: 506 sentences
• Graph: X: performance (BLEU); Y: data (thousands of words)
Translation Selection from Crowd Data
• Translation Reliability
  – Compare translations for agreement with each other
  – Flexible matching is the key
• Translator Reliability
• Translation Selection
  – Maximal agreement criterion
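The maximal-agreement criterion with flexible matching can be sketched with fuzzy string similarity; `difflib` here stands in for whatever matcher the system actually uses:

```python
from difflib import SequenceMatcher

def select_translation(candidates):
    """Pick the crowd translation that agrees most with the others,
    using flexible (fuzzy) matching instead of exact string equality."""
    def agreement(t):
        return sum(SequenceMatcher(None, t.lower(), o.lower()).ratio()
                   for o in candidates if o is not t)
    return max(candidates, key=agreement)
```

An outlier translation agrees poorly with everything else and is never selected; near-duplicates reinforce each other. Per-translator reliability weights could multiply into the agreement sum.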
Which MT Paradigms are Best? Some Tentative Answers
            Large T        Med T         LD T
  Large S   SMT, CBMT      SMT, SXFER    RMT?
  Med S     CBMT, SMT      SXFER, SMT    ???
  LD S      CBMT, SXFER    SXFER, RMT?   ???

• Unidirectional: LD → E (intelligence)
• Bi-directional: LD ↔ E (theater ops)
• LDA ↔ LDB
Test of Time: 3K years ago, none of today's languages existed. 3K years from now?