![Page 1: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/1.jpg)
Jaime Carbonell (www.cs.cmu.edu/~jgc) Language Technologies Institute
Carnegie Mellon University
July 2016
Massively Multilingual
Language Technologies
Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, Pilot, Entrepreneur,…
![Page 2: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/2.jpg)
The Many Faces of Alex Waibel
“Official” Alex
“Worried” Alex
“Happier” Alex
“Happiest” Alex
![Page 3: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/3.jpg)
Werner Heisenberg v Alex Waibel
o Heisenberg Uncertainty Principle:
o It is impossible to measure location and momentum of an object precisely and simultaneously
o But… it can seem like we can measure them precisely if above the quantum level
o Waibel Uncertainty Principle:
o It is impossible to measure location and activity of Alex precisely and simultaneously
o But… it can seem like Alex is in multiple places doing multiple things at once
![Page 4: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/4.jpg)
Bridging the Linguistic Divide
o Until recently: Focus was Digital Divide
   • The “digirati”: internet connected, laptop, …
   • Smartphones → democratization in access
o Currently: The Linguistic Divide Rules
   • 6,900 languages: Google addresses 1%
   • Almost all information is in the 1%
   • How to democratize linguistic access?
![Page 5: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/5.jpg)
Multilingual Activities at CMU
o 1975: Speech research started at CMU (Harpy, Hearsay)
o 1986: Started Center for Machine Translation → LTI in 1996
o 1989: Knowledge-Based MT (domain specific, high accuracy)
o 1990: Multilingual speech (Sphinx, Janus)
o 1991: Example-Based MT
o 1992: Speech-to-speech MT (cStar)
o 1995: Statistical MT (Janus)
o 2000: MT for Low-Resource Languages
o 2006: Context-Based MT
o 2012: Linguistic-Core MT (for low resource languages)
o 45 Languages: English, Spanish, French, Japanese, German, Arabic, Korean, Chinese, Urdu/Hindi, Russian, Mapudungun, …
![Page 6: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/6.jpg)
Low Resource Languages
o 6,900 languages in 2012 – Ethnologue www.ethnologue.com/ethno_docs/distribution.asp?by=area
o Only 77 (1.2%) have over 10M speakers
   • 1st Chinese, 5th Arabic, 7th Bengali, 10th Javanese
o 3,000 have over 10,000 speakers each
o 3,000 may not survive past 2100
o 5X to 10X number of dialects (35 for Arabic)
o # of languages in some interesting countries:
   • Afghanistan: 52, Pakistan: 77, India: 400
   • North Korea: 1, Indonesia: 700
![Page 7: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/7.jpg)
Some Linguistics Maps
![Page 8: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/8.jpg)
Some (very) LD Languages in the US
Anishinaabe (Ojibwe, Potawatame, Odawa) Great Lakes
![Page 9: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/9.jpg)
Challenges for General MT
o Ambiguity Resolution
   • Lexical, phrasal, structural
o Structural divergence
   • Reordering, vanishing/appearing words, …
o Inflectional morphology
   • Spanish has 40+ verb conjugations; Arabic has more
   • Mapudungun, Iñupiaq, … → agglutinative
o Training Data
   • Bilingual corpora, aligned corpora, annotated corpora, bilingual dictionaries
o Human informants
   • Trained linguists, lexicographers, translators
   • Untrained bilingual speakers (e.g. crowdsourcing)
o Evaluation
   • Automated (BLEU, METEOR, TER) vs HTER vs …
![Page 10: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/10.jpg)
Context Needed to Resolve Ambiguity
Example: English → Japanese
Power line – densen (電線)
Subway line – chikatetsu (地下鉄)
(Be) on line – onrain (オンライン)
(Be) on the line – denwachuu (電話中)
Line up – narabu (並ぶ)
Line one’s pockets – kanemochi ni naru (金持ちになる)
Line one’s jacket – uwagi o nijuu ni suru (上着を二重にする)
Actor’s line – serifu (セリフ)
Get a line on – joho o eru (情報を得る)
Sometimes local context suffices (as above) → n-grams help … but sometimes not
![Page 11: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/11.jpg)
CONTEXT: More is Better
o Examples requiring longer-range context:
   • “The line for the new play extended for 3 blocks.”
   • “The line for the new play was changed by the scriptwriter.”
   • “The line for the new play got tangled with the other props.”
   • “The line for the new play better protected the quarterback.”
o Challenges:
   • Short n-grams (3-4 words) insufficient
   • Requires more general syntax & semantics
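The four example sentences can be used to check the claim about short n-grams directly. The sketch below is only an illustration, not anything from the talk: the helper names `context_window` and `smallest_disambiguating_window` are hypothetical, and tokenization is a plain whitespace split. It looks for the smallest symmetric window around "line" at which the four contexts stop being identical.

```python
def context_window(sentence, target, n):
    """Return the tokens within n positions of the first occurrence of target."""
    tokens = sentence.lower().rstrip(".").split()
    i = tokens.index(target)
    return tuple(tokens[max(0, i - n): i + n + 1])


SENTENCES = [
    "The line for the new play extended for 3 blocks.",
    "The line for the new play was changed by the scriptwriter.",
    "The line for the new play got tangled with the other props.",
    "The line for the new play better protected the quarterback.",
]


def smallest_disambiguating_window(sentences, target):
    """Smallest n at which every sentence has a distinct window around target."""
    longest = max(len(s.split()) for s in sentences)
    for n in range(1, longest + 1):
        windows = {context_window(s, target, n) for s in sentences}
        if len(windows) == len(sentences):
            return n
    return None


if __name__ == "__main__":
    print(smallest_disambiguating_window(SENTENCES, "line"))
```

The first word that differs sits five tokens after "line", so an n-gram would have to span six tokens just to see it. A distinct window is only a necessary condition: the differing word still has to be informative about the sense, which is where syntax and semantics come in.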
![Page 12: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/12.jpg)
Additional Challenges for LD MT
o Morpho-syntactics is plentiful
   • Beyond inflection: verb-incorporation, agglutination, …
o Data is scarce
   • Insignificant bilingual or annotated data
o Fluent computational linguists are scarce
   • Field linguists know LD languages best
o Standardization is scarce
   • Orthographic, dialectal, rapid evolution, …
![Page 13: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/13.jpg)
Morpho-Syntactics & Multi-Morphemics
o Iñupiaq (North Slope Alaska, Lori Levin)
   • Tauqsiġñiaġviŋmuŋniaŋitchugut.
   • ‘We won’t go to the store.’
o Kalaallisut (Greenlandic, Per Langaard)
   • Pittsburghimukarthussaqarnavianngilaq
   • Pittsburgh+PROP+Trim+SG+kar+tuq+ssaq+qar+naviar+nngit+v+IND+3SG
   • "It is not likely that anyone is going to Pittsburgh"
![Page 14: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/14.jpg)
Morphotactics in Iñupiaq
![Page 15: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/15.jpg)
Type-Token Curve for Mapudungun
[Figure: type-token curves for Mapudungun and Spanish. X-axis: Tokens, in Thousands (0 to 1,500); Y-axis: Types, in Thousands (0 to 140)]
• 400,000+ speakers
• Mostly bilingual
• Mostly in Chile

• Pewenche
• Lafkenche
• Nguluche
• Huilliche
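A type-token curve like the one on this slide plots how many distinct word forms (types) have been seen after reading a given number of running words (tokens). The following is a minimal sketch of how such a curve could be computed. The function name `type_token_curve` and the toy input are assumptions for illustration; the corpora behind the slide's plot are not reproduced here.

```python
def type_token_curve(tokens, step):
    """Return (tokens_seen, distinct_types_seen) pairs, sampled every `step` tokens.

    `tokens` is a list of word forms in corpus order.
    """
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        # Sample at regular intervals, and always at the end of the corpus.
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve


if __name__ == "__main__":
    toy_corpus = "a b a c b d".split()  # placeholder data, not a real corpus
    print(type_token_curve(toy_corpus, step=2))
```

In general, a language that packs many morphemes into one word would be expected to keep producing new types for longer, giving a steeper curve than a language with less productive morphology.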
![Page 16: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/16.jpg)
22/09/16
Paradigms for Machine Translation
[Diagram: paths from Source (e.g. Pashto) to Target (e.g. English). Labels: Direct (SMT, EBMT, CBMT, …); Transfer Rules; Syntactic Parsing; Semantic Analysis; Interlingua; Sentence Planning; Text Generation]
![Page 17: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/17.jpg)
Evolutionary Tree of MT Paradigms
[Figure: tree of MT paradigms on a timeline marked 1950, 1980, 2010. Nodes: Transfer MT; Decoding MT; Analogy MT; Interlingua MT; Example-based MT; Large-scale TMT; Context-Based MT; Statistical MT; Phrasal SMT; Transfer MT w stat phrases; Stat MT on syntax struct.]
![Page 18: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/18.jpg)
Stat-Transfer (STMT): List of Ingredients
o Framework: Statistical search-based approach with syntactic translation transfer rules that can be acquired from data but also developed and extended by experts
o SMT-Phrasal Base: Automatic Word and Phrase translation lexicon acquisition from parallel data
o Transfer-rule Learning: apply ML-based methods to automatically acquire syntactic transfer rules for translation between the two languages
o Elicitation: use bilingual native informants to produce a small high-quality word-aligned bilingual corpus of translated phrases and sentences
o Rule Refinement: refine the acquired rules via a process of interaction with bilingual informants
o XFER + Decoder:
   • XFER engine produces a lattice of possible transferred structures at all levels
   • Decoder searches and selects the best scoring combination
![Page 19: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/19.jpg)
Stat-Transfer (ST) MT Approach
[Diagram: paths from Source (e.g. Urdu) to Target (e.g. English). Labels: Direct (SMT, EBMT); Transfer Rules; Statistical-XFER; Syntactic Parsing; Semantic Analysis; Interlingua; Sentence Planning; Text Generation]
![Page 20: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/20.jpg)
Avenue/Letras STMT Architecture
[Diagram: AVENUE/LETRAS architecture in four stages: Elicitation, Rule Learning, Run-Time System, Rule Refinement. Components: Elicitation Tool; Elicitation Corpus; Word-Aligned Parallel Corpus; Learning Module; Learned Transfer Rules; Handcrafted rules; Morphology; Morphology Analyzer; Lexical Resources; Run Time Transfer System; Decoder; Translation Correction Tool; Rule Refinement Module; INPUT TEXT; OUTPUT TEXT]
![Page 21: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/21.jpg)
Syntax-driven Acquisition Process
Automatic Process for Extracting Syntax-driven Rules and Lexicons from sentence-parallel data:
1. Word-align the parallel corpus (GIZA++)
2. Parse the sentences independently for both languages
3. Tree-to-tree Constituent Alignment:
   a) Run our new Constituent Aligner over the parsed sentence pairs
   b) Enhance alignments with additional Constituent Projections
4. Extract all aligned constituents from the parallel trees
5. Extract all derived synchronous transfer rules from the constituent-aligned parallel trees
6. Construct a “data-base” of all extracted parallel constituents and synchronous rules with their frequencies and model them statistically (assign them relative-likelihood probabilities)
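The last step of the acquisition process, turning extraction frequencies into relative-likelihood probabilities, can be sketched as plain relative-frequency estimation. This is a minimal illustration under stated assumptions: the slide does not say which side the probabilities are conditioned on, so the sketch conditions on the source side, and the function name `relative_likelihoods` and the toy rule strings are hypothetical.

```python
from collections import Counter, defaultdict


def relative_likelihoods(extracted_pairs):
    """Estimate P(target | source) from a list of extracted (source, target) pairs.

    Each pair stands for one extracted constituent pair or synchronous rule;
    repeated pairs represent repeated extractions from the corpus.
    """
    pair_counts = Counter(extracted_pairs)
    source_totals = defaultdict(int)
    for (src, _tgt), c in pair_counts.items():
        source_totals[src] += c
    # Assumption: condition on the source side of each pair.
    return {
        (src, tgt): c / source_totals[src]
        for (src, tgt), c in pair_counts.items()
    }


if __name__ == "__main__":
    toy = [  # placeholder rules, not taken from any real system
        ("NP -> DET N", "NP -> N DET"),
        ("NP -> DET N", "NP -> N DET"),
        ("NP -> DET N", "NP -> DET N"),
        ("VP -> V NP", "VP -> NP V"),
    ]
    for pair, p in relative_likelihoods(toy).items():
        print(pair, round(p, 3))
```

A real system would likely also smooth these estimates and score in the other direction; neither is shown here.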
![Page 22: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/22.jpg)
PFA Node Alignment Algorithm Example
• Any constituent or sub-constituent is a candidate for alignment
• Triggered by word/phrase alignments
• Tree Structures can be highly divergent
![Page 23: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/23.jpg)
PFA Node Alignment Algorithm Example
• Tree-tree aligner enforces equivalence constraints and optimizes over terminal alignment scores (words/phrases)
• Resulting aligned nodes are highlighted in figure
• Transfer rules are partially lexicalized and read off tree.
![Page 24: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/24.jpg)
The Setting
o MURI Languages
   • Kinyarwanda: Bantu (7.5M speakers)
   • Malagasy: Malayo-Polynesian (14.5M)
   • Swahili: Bantu (5M native, 150M 2nd/3rd)
o Swahili: Anamwona “he is seeing him/her” → Morpho-syntactics
![Page 25: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/25.jpg)
Active Learning for MT (Vamshi, Carbonell, Vogel)
[Diagram: active learning loop. Components: Source Language Corpus (monolingual source corpus); Active Learner; Expert Translator; Parallel corpus; Model Trainer; MT System. Edge labels: S; S,T]
![Page 26: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/26.jpg)
ACT Framework: Active Crowd Translation
[Diagram: crowd translation loop. Components: Source Language Corpus; Sentence Selection; Translation Selection; Model Trainer; MT System. Edge labels: S; S,T1; S,T2; …; S,Tn]
![Page 27: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/27.jpg)
Active Learning Strategy: Diminishing Density Weighted Diversity Sampling
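$$\mathrm{density}(S) = \frac{\sum_{x \in \mathrm{Phrases}(S)} P(x/U) \cdot e^{-\lambda \cdot \mathrm{count}(x/L)}}{|\mathrm{Phrases}(S)|}$$

$$\mathrm{diversity}(S) = \frac{\sum_{x \in \mathrm{Phrases}(S)} \alpha \cdot \mathrm{count}(x)}{|\mathrm{Phrases}(S)|}, \qquad \alpha = \begin{cases} 1 & \text{if } x \notin L \\ 0 & \text{if } x \in L \end{cases}$$

$$\mathrm{Score}(S) = \frac{(1+\beta^2)\,\mathrm{density}(S) \cdot \mathrm{diversity}(S)}{\beta^2\,\mathrm{density}(S) + \mathrm{diversity}(S)}$$

Experiments:
• Language Pair: Spanish-English
• Batch Size: 1000 sentences each
• Translation: Moses Phrase SMT
• Development Set: 343 sentences
• Test Set: 506 sentences
• Graph: X: Performance (BLEU), Y: Data (Thousand words)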
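The sampling strategy named in this slide's title scores each candidate sentence S by combining a density term (how representative its phrases are of the unlabeled pool U, discounted by how often they already occur in the labeled set L) with a diversity term (how many of its phrases are new to L), using an F-measure-style combination weighted by β. The sketch below is one possible reading, not the authors' implementation. It assumes that P(x/U) is a precomputed phrase probability in U, that count(x) in the diversity term is the phrase's frequency in U, and that phrases are word n-grams up to length 3; all function and parameter names are hypothetical.

```python
import math


def phrases(sentence, max_n=3):
    """All word n-grams of the sentence up to length max_n (an assumed phrase definition)."""
    tokens = sentence.split()
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def density(s_phrases, p_unlabeled, count_labeled, lam=1.0):
    """Mean of P(x/U) * exp(-lam * count(x/L)) over the sentence's phrases."""
    if not s_phrases:
        return 0.0
    total = sum(
        p_unlabeled.get(x, 0.0) * math.exp(-lam * count_labeled.get(x, 0))
        for x in s_phrases
    )
    return total / len(s_phrases)


def diversity(s_phrases, count_unlabeled, count_labeled):
    """Mean of alpha * count(x), with alpha = 1 only for phrases absent from L."""
    if not s_phrases:
        return 0.0
    total = sum(
        count_unlabeled.get(x, 0)  # assumption: count(x) is the frequency in U
        for x in s_phrases
        if count_labeled.get(x, 0) == 0
    )
    return total / len(s_phrases)


def score(d, u, beta=1.0):
    """Weighted harmonic combination of density d and diversity u."""
    denom = beta ** 2 * d + u
    return 0.0 if denom == 0 else (1 + beta ** 2) * d * u / denom
```

In an active learning round, each sentence in the unlabeled pool would be scored this way, the top batch sent for translation, and the labeled counts updated so that already-covered phrases contribute less in the next round.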
![Page 28: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/28.jpg)
Translation Selection from Mechanical Turk
• Translator Reliability
• Translation Selection:
![Page 29: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/29.jpg)
ARIEL: Universal Typological Compendium
o Curation: ~2000 typological features across ~2200 languages
o Distance functions: Phylogenetic, Geopolitical, Typological
[Figure: axes labeled Languages and Sources]
ARIEL Project -- DARPA LORELEI Kickoff
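The slide lists a typological distance function without defining it. One simple way such a distance could be defined over a feature compendium is the fraction of differing values among the features recorded for both languages. The sketch below shows only that generic idea and is not a description of the ARIEL project's actual functions; the function name, feature names, and values are all hypothetical.

```python
def typological_distance(feats_a, feats_b):
    """Fraction of differing values among features known for both languages.

    Each argument maps feature name -> value, with None for an unknown value.
    Returns None when the two languages share no known features.
    """
    shared = [
        f for f in feats_a
        if f in feats_b and feats_a[f] is not None and feats_b[f] is not None
    ]
    if not shared:
        return None
    differing = sum(1 for f in shared if feats_a[f] != feats_b[f])
    return differing / len(shared)


if __name__ == "__main__":
    # Placeholder feature vectors for two unnamed languages.
    lang_a = {"word_order": "SOV", "has_case": True, "tone": None}
    lang_b = {"word_order": "SVO", "has_case": True, "tone": False}
    print(typological_distance(lang_a, lang_b))
```

Skipping unknown values matters for a compendium of this kind, since coverage across roughly 2000 features and 2200 languages is unlikely to be complete.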
![Page 30: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/30.jpg)
Lexical Transfer Example (Arabic → Swahili)
![Page 31: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/31.jpg)
Attention history: [Figure: attention over the tokens “I'd like a beer”, ending with STOP]
![Page 32: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/32.jpg)
Concluding Remarks
o Massively multilingual research (MT, speech, dialog)
   • Of crucial importance for humanity
   • Waibel has been at the very core
o Research Directions
   • Combining linguistics and statistics
   • Paradigms for cross-language scalability
   • Transfer learning and proactive learning
   • Applications: disaster relief, education, eCom, …
![Page 33: Massively Multilingual Language Technologies · Massively Multilingual Language Technologies Building Bridges – Breaking Barriers A Tribute to Alex Waibel, Professor, ... " Mapudungun,](https://reader030.fdocuments.us/reader030/viewer/2022040802/5e3cbbfee4b84c5b642d1e4a/html5/thumbnails/33.jpg)
THANK YOU!