Data Collection and Language Technologies for Mapudungun
![Page 1: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/1.jpg)
Data Collection and Language Technologies for Mapudungun
Lori Levin, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie
Language Technologies Institute
Carnegie Mellon University
Eliseo Cañulef
Instituto de Estudios Indígenas
Universidad de La Frontera
Carolina Huenchullán
Ministerio de Educación
Chile
Presented by Ariadna Font-Llitjos
Language Technologies Institute
Carnegie Mellon University
![Page 2: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/2.jpg)
Overview
• Chile’s programs in bilingual and multicultural education
• The AVENUE project at Carnegie Mellon University
• The Mapudungun corpus
• Plans for Example-Based Machine Translation
• Plans for Rule-Based Machine Translation
![Page 3: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/3.jpg)
Bilingual and Intercultural Education in Chile
• Eight ethnic groups: Mapuche, Aymara, Rapa Nui (Pascuense), Likay Antai, Quechua, Colla, Kawashkar (Alacalufe), Yamana (Yagan).
• Make education culturally and linguistically relevant.
• Languages of instruction are native language and second language (Spanish).
• Community involvement in curriculum design.
![Page 4: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/4.jpg)
AVENUE: Automatic Voice Enabled Natural language Understanding Environment
• Affordable machine translation for languages with scarce resources.
– No large corpus in electronic form
– Few or no native speakers trained in computational linguistics
![Page 5: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/5.jpg)
AVENUE: Omnivorous MT
• AVENUE can consume whatever resources are available
– EBMT: if a parallel corpus is available
– Human-Engineered MT: if a human computational linguist is available
– Seeded Version Space Learning for automatic acquisition of transfer rules: if no corpus or computational linguist is available
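The resource-driven choice on this slide can be pictured as a simple dispatch. The sketch below is illustrative only: the function name `choose_engines` and its boolean flags are hypothetical and are not part of AVENUE.

```python
from __future__ import annotations


def choose_engines(has_parallel_corpus: bool,
                   has_computational_linguist: bool) -> list[str]:
    """Return the MT approaches the slide pairs with the available resources."""
    engines = []
    if has_parallel_corpus:
        engines.append("EBMT")
    if has_computational_linguist:
        engines.append("Human-Engineered MT")
    if not engines:
        # Neither resource is available: learn transfer rules from elicited data.
        engines.append("Seeded Version Space Learning")
    return engines
```

With both resources present the sketch returns both engines, which matches the "omnivorous" idea of using everything that is available.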
![Page 6: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/6.jpg)
Mapudungun
• Language of the Mapuche
– Over 900,000 Mapuche in Chile and Argentina
• Words contain several morphemes including multiple open class items.
• Still spoken by a majority of Mapuche
• Still spoken as a first language
• Competing orthographies
• Some vocabulary loss
• Some written literature, newsletters and textbooks
![Page 7: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/7.jpg)
The Mapudungun Corpora
• First step toward:
– Corpus-based machine translation
– Authentic corpus for instructional purposes
• Written corpus
• Spoken corpus
![Page 8: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/8.jpg)
The Written Mapudungun Corpus
• Existing texts were entered in electronic form and translated into Spanish:
– Memorias de Pascual Coña: the life story of a Mapuche leader written by Ernesto Wilhelm de Moessbach.
– Las Ultimas Familias by Tomás Guevara.
– Nuestros Pueblos newspaper published by Corporación Nacional de Desarrollo Indígena (CONADI).
• Total of around 200,000 words
![Page 9: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/9.jpg)
The Spoken Mapudungun Corpus
• Recorded with Sony DAT recorder and digital stereo microphone.
• Downloaded with CoolEdit
• Transcribed with TransEdit
– Alignment of audio and transcript for speech recognition
![Page 10: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/10.jpg)
The Spoken Mapudungun Corpus
• All sessions were scheduled and recorded by a native speaker interviewer
• Subject matter: primary and preventive health
– Limited domain for higher quality machine translation
– People were asked to describe their experiences with an illness and how it was treated by modern or traditional medicine
![Page 11: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/11.jpg)
The Spoken Mapudungun Corpus
• Speakers:
– 21-75 years old; most 40-65
– Fully native speakers
– Some auxiliary nurses for rural areas in Chilean Public health system
– Some machi:
• Did not reveal specialized knowledge
![Page 12: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/12.jpg)
The Mapudungun Spoken Corpus
• Dialects:
– Lafkenche, Nguluche, Pewenche
– Williche will be recorded at a later stage of the project
• more morpho-syntactic differences from the other dialects
![Page 13: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/13.jpg)
The Mapudungun Spoken Corpus
• Orthography:
– Pan-dialectal:
• 32 phones
• Some are dialectal variants of each other
– Supra-dialectal
• 28 letters covering the 32 phones
– Typable on Spanish keyboard with some diacritics such as apostrophes
– Use Spanish letters for phonemes that sound like Spanish phonemes
![Page 14: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/14.jpg)
Plans for Machine Translation
• Example-Based MT
• Seeded Version Space Learning for automated acquisition of transfer rules
![Page 15: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/15.jpg)
Example-Based MT
• Insert one of Ralf’s slides
![Page 16: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/16.jpg)
Automated Acquisition of Transfer Rules
• Elicitation Tool
• Seeded Version Space Learning
• Run-time transfer system for MT
![Page 17: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/17.jpg)
Chinese-English Transfer Rule for Yes-No Questions
S::S : [NP VP MA] -> [AUX NP VP]
(
(x1::y2) ; set alignments
(x2::y3)
((x0 subj) = x1) ; create Chinese f-structure
((x0 subj case) = nom) ; Chinese has no case, so add it
((x0 act) = quest) ; set speech act to question
(x0 = x2) ; create Chinese f-structure
((y1 form) = do) ; set base form of AUX to "do"
; proper form will be selected based on subj-verb agreement
((y3 vform) =c inf) ; verb must be infinitive
((y1 agr) = (y2 agr)) ; subject and "do" must agree
)
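To make the reordering concrete, here is a minimal sketch of how a rule like this one could be applied to a pre-parsed sentence. It is not the AVENUE run-time transfer system: the function `apply_yes_no_rule`, the `TOY_LEXICON`, and the pinyin input are all assumptions introduced for illustration, and f-structure handling is reduced to a single agreement flag.

```python
# Toy lexicon: the pinyin keys and English glosses are illustrative assumptions.
# Verb phrases are stored in the infinitive, mirroring ((y3 vform) =c inf).
TOY_LEXICON = {"ta": "he", "xihuan mao": "like cats"}


def apply_yes_no_rule(source, subj_agr="3sg"):
    """Apply S::S : [NP VP MA] -> [AUX NP VP] to a list of (label, text) pairs."""
    if [label for label, _ in source] != ["NP", "VP", "MA"]:
        return None  # the left-hand side of the rule does not match
    x1, x2 = source[0][1], source[1][1]  # x3 (MA) has no alignment and is dropped
    # (y1 form) = do; the surface form follows the subject's agreement features.
    y1 = "does" if subj_agr == "3sg" else "do"
    y2 = TOY_LEXICON[x1]  # alignment x1::y2
    y3 = TOY_LEXICON[x2]  # alignment x2::y3
    return [("AUX", y1), ("NP", y2), ("VP", y3)]


result = apply_yes_no_rule([("NP", "ta"), ("VP", "xihuan mao"), ("MA", "ma")])
print(" ".join(text for _, text in result))  # prints "does he like cats"
```

The question particle is consumed by the rule and an auxiliary is generated in its place, which is the core of what the alignments and the `(y1 form)` constraint express.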
![Page 18: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/18.jpg)
Example of Seed Rule and Generalization
• Pair 1: the man::der mann
• Pair 2: the woman::die frau
![Page 19: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/19.jpg)
Seed Rule 1 Seed Rule 2 Generalization
Det N Det N Det N Det N Det N Det N
X1::Y1 X1::Y1 X1::Y1
X2::Y2 X2::Y2 X2::Y2
((X1 AGR) = *3-SING) ((X1 AGR) = *3-SING) ((X1 AGR) = *3-SING)
((X1 DEF) = *DEF) ((X1 DEF) = *DEF) ((X1 DEF) = *DEF)
((X2 AGR) = *3-SING) ((X2 AGR) = *3-SING) ((X2 AGR) = *3-SING)
((X2 COUNT) = +) ((X2 COUNT) = +) ((X2 COUNT) = +)
((Y1 AGR) = *3-SING) ((Y1 AGR) = *3-SING) ((Y1 AGR) = *3-SING)
((Y1 CASE) = *NOM) ((Y1 CASE) = (*NOT* *GEN *DAT))
((Y1 DEF) = *DEF) ((Y1 DEF) = *DEF) ((Y1 DEF) = *DEF)
((Y2 GENDER) = *M) ((Y2 GENDER) = *F) ((Y2 GENDER) = *F)
((Y2 AGR) = *3-SING) ((Y2 AGR) = *3-SING) ((Y2 AGR) = *3-SING)
((Y2 CASE) = *NOM)
((Y2 GENDER) = *M) ((Y2 GENDER) = *F) ((Y2 GENDER) = (Y1 GENDER))
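One way to read the Generalization column is as a merge of the two seed rules: constraints with identical values are kept, and values that differ between the rules but match each other inside each rule are replaced by an agreement constraint. The sketch below illustrates that reading only. It is a simplification, not the Seeded Version Space Learning algorithm itself; the function `generalize`, the dictionary encoding, and the assumption that each seed rule records gender for both Y1 and Y2 are mine.

```python
def generalize(rule_a, rule_b):
    """Merge two seed rules given as dicts mapping (variable, feature) -> value."""
    shared = rule_a.keys() & rule_b.keys()
    # Keep every constraint on which the two seed rules agree.
    general = {k: rule_a[k] for k in shared if rule_a[k] == rule_b[k]}
    differing = sorted(k for k in shared if rule_a[k] != rule_b[k])
    # Where two variables carry the same feature and their values match each
    # other within every seed rule, propose agreement instead of a fixed value.
    for i, (var1, feat1) in enumerate(differing):
        for var2, feat2 in differing[i + 1:]:
            if feat1 == feat2 and all(
                rule[(var1, feat1)] == rule[(var2, feat2)]
                for rule in (rule_a, rule_b)
            ):
                general[(var2, feat2)] = ("same-as", var1, feat1)
    return general


# Target-side constraints only; Y1 GENDER is an assumed entry (see lead-in).
seed_1 = {("Y1", "DEF"): "*DEF", ("Y1", "CASE"): "*NOM",
          ("Y1", "GENDER"): "*M", ("Y2", "GENDER"): "*M"}
seed_2 = {("Y1", "DEF"): "*DEF", ("Y1", "CASE"): "(*NOT* *GEN *DAT)",
          ("Y1", "GENDER"): "*F", ("Y2", "GENDER"): "*F"}
print(generalize(seed_1, seed_2))
```

In this sketch the differing case values are simply dropped, and the gender values become the equivalent of ((Y2 GENDER) = (Y1 GENDER)).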
![Page 20: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/20.jpg)
Elicitation Tool
![Page 21: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/21.jpg)
Elicitation Process
• Bilingual informant
• Literate in the elicitation language and the elicited language
• Translate sentences
• Align words
![Page 22: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/22.jpg)
Elicitation Corpus: Excerpt
He has sold both of his cars. (English prompt)
El ha vendido sus dos automóviles (Spanish prompt)
fey weluiñi epu awtu (Mapudungun provided by informant)

He can move both of his thumbs.
El puede mover sus dos pulgares
fey pepi newüleliñi epu fütrarumechangüll

He loves both of his sisters.
El ama a sus dos hermanas
fey poyeyñi epu deya

He loves both of his brothers.
El ama a sus dos hermanos
fey poyeyñi epu peñi
![Page 23: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/23.jpg)
Elicitation Corpus
• Compositional:
– Small phrases are elicited first and then are combined into larger phrases
– For learnability
• Minimal Pairs:
– Sentences that differ in only one feature (e.g., number of the subject)
– For automatic feature detection
• If the minimal pair differs only in the number of the subject, and the verbs are different in the two sentences, the language may have agreement in number between subjects and verbs.
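The minimal-pair test described above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's feature-detection code: the `Elicited` record, the `detect_agreement` function, and the feature names are hypothetical, and the Spanish verb forms are stand-in data rather than items from the elicitation corpus.

```python
from dataclasses import dataclass


@dataclass
class Elicited:
    features: dict  # features of the prompt, e.g. {"subj_number": "sg"}
    verb: str       # verb form found in the informant's translation


def detect_agreement(a, b, feature):
    """Return True/False if (a, b) is a minimal pair on `feature`, else None."""
    keys = a.features.keys() | b.features.keys()
    diff = {k for k in keys if a.features.get(k) != b.features.get(k)}
    if diff != {feature}:
        return None  # not a minimal pair on this feature, so no conclusion
    # Different verb forms suggest the verb may agree with the feature.
    return a.verb != b.verb


# Stand-in data: Spanish "canta" / "cantan" for a singular vs. plural subject.
singular = Elicited({"subj_number": "sg", "tense": "present"}, "canta")
plural = Elicited({"subj_number": "pl", "tense": "present"}, "cantan")
print(detect_agreement(singular, plural, "subj_number"))  # prints True
```

A `True` result is only a hypothesis, in line with the slide's wording that the language "may have" agreement.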
![Page 24: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/24.jpg)
Elicitation Corpus: Current Coverage
• 864 Sentences (pilot corpus)
• Transitive and intransitive sentences
• Animate and inanimate subjects and objects
• Definite and indefinite subjects and objects
• Present/ongoing and past/completed
• Singular, plural, and dual nouns
• Simple noun phrases with definiteness, modifiers
• Possessive noun phrases
![Page 25: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/25.jpg)
Elicitation Corpus: Future Work
• Probst and Levin (2002)
– Pitfalls of automated elicitation
• Automatic Branching and skipping:
– Automatically skip parts of the corpus depending on what features have been detected
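Branching and skipping could work roughly as follows: each part of the corpus names the feature it presupposes, and parts are dropped once that feature has been tested and found absent. The sketch is hypothetical throughout; `remaining_sections`, the section names, and the feature labels are assumptions, since the slides describe this only as future work.

```python
def remaining_sections(sections, detected):
    """Filter (name, prerequisite) sections using detected features.

    `detected` maps a feature name to True or False; features that have not
    been tested yet are absent from the mapping and never cause a skip.
    """
    keep = []
    for name, prerequisite in sections:
        if prerequisite is not None and detected.get(prerequisite) is False:
            continue  # the language lacks this feature, so skip the section
        keep.append(name)
    return keep


# Hypothetical corpus outline and detection results.
corpus = [
    ("simple noun phrases", None),
    ("dual subjects", "dual_number"),
    ("gender agreement", "grammatical_gender"),
]
print(remaining_sections(corpus, {"dual_number": True, "grammatical_gender": False}))
```

Treating untested features as "keep" errs on the side of eliciting too much rather than skipping data that might matter.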
![Page 26: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/26.jpg)
Status of automated rule learning
• Preliminary results
– Learned some compositional rules for German
• Current work:
– Interaction of compositional rules
– Seed rule generation
– Generalization and verification of seed rule hypothesis
![Page 27: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/27.jpg)
Status of Transfer Rule System
• Preliminary experiments on Chinese-English MT
• Integrated into a multi-engine system with Example-Based MT
![Page 28: Data Collection and Language Technologies for Mapudungun](https://reader036.fdocuments.us/reader036/viewer/2022062322/56814a42550346895db75e4a/html5/thumbnails/28.jpg)
Tools for Field Linguists?
• Can feature detection and automatically learned rules be useful for alerting a field worker to possibly interesting data?
• Can automated elicitation with branching and skipping be helpful?