The AVENUE Project Data Elicitation System
Lori Levin
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Joint work with
• Dr. Jeff Good
• Dr. Robert Frederking
• Alison Alvarez
Outline
• The AVENUE MT project
– Including a list of languages we have worked on
• The elicitation tool
– Including which kinds of fonts it works for
• The elicitation corpus
– Including which languages it has been translated into
• Tools for building and revising elicitation corpora
MT Approaches
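[Figure: MT approaches arranged from direct translation up to interlingua, illustrated with “Mi chiamo Lori”]
• Source: Mi chiamo Lori
• Syntactic Parsing: Pronoun-acc-1-sg chiamare-1sg N
• Semantic Analysis
• Interlingua: introduce-self
• Sentence Planning
• Text Generation: [np poss-1sg “name”] BE-pres N
• Target: My name is Lori
• Transfer Rules
• Direct: SMT, EBMT
• AVENUE: Automate Rule Learning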
AVENUE Machine Translation System
Type information
Synchronous Context Free Rules
Alignments
x-side constraints
y-side constraints
xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))
;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
  (X1::Y1)
  (X1::Y3)
  (X2::Y4)
  (X3::Y2)
  ((X1 AGR) = *3-SING)
  ((X1 DEF) = *DEF)
  ((X3 AGR) = *3-SING)
  ((X3 COUNT) = +)
  ((Y1 DEF) = *DEF)
  ((Y3 DEF) = *DEF)
  ((Y2 AGR) = *3-SING)
  ((Y2 GENDER) = (Y4 GENDER))
)
Jaime Carbonell (PI), Alon Lavie (Co-PI), Lori Levin (Co-PI)
Rule learning: Katharina Probst
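The alignment part of the rule above can be illustrated with a minimal Python sketch. It only shows how the pairs (X1::Y1)(X1::Y3)(X2::Y4)(X3::Y2) place source constituents into target slots; the function name `reorder` is hypothetical, and the sketch ignores lexical transfer and all constraints, so it is not the AVENUE transfer engine.

```python
# Alignments from the slide's rule: (X1::Y1)(X1::Y3)(X2::Y4)(X3::Y2).
# Each pair maps a 1-based source (X) position to a 1-based target (Y) position.
ALIGNMENTS = [(1, 1), (1, 3), (2, 4), (3, 2)]


def reorder(source_constituents, alignments, target_length):
    """Place each source constituent into the target slots it is aligned to."""
    target = [None] * target_length
    for x, y in alignments:
        target[y - 1] = source_constituents[x - 1]
    return target


# Source side [DET ADJ N] for "the old man"; the target side has four slots.
source = ["the", "old", "man"]
slots = reorder(source, ALIGNMENTS, target_length=4)
print(slots)  # ['the', 'man', 'the', 'old']
```

The slot order mirrors the Hebrew target side [DET N DET ADJ] (ha-ish ha-zaqen); the determiner appears twice because X1 is aligned to both Y1 and Y3.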
AVENUE
• Rules can be written by hand or learned automatically.
• Hybrid
– Rule-based transfer
– Statistical decoder
– Multi-engine combinations with SMT and EBMT
AVENUE systems (Small and experimental, but tested on unseen data)
• Hebrew-to-English
– Alon Lavie, Shuly Wintner, Katharina Probst
– Hand-written and automatically learned
– Automatic rules trained on 120 sentences perform slightly better than about 20 hand-written rules.
• Hindi-to-English
– Lavie, Peterson, Probst, Levin, Font, Cohen, Monson
– Automatically learned
– Performs better than SMT when training data is limited to 50K words
AVENUE systems (Small and experimental, but tested on unseen data)
• English-to-Spanish
– Ariadna Font Llitjos
– Hand-written, automatically corrected
• Mapudungun-to-Spanish
– Roberto Aranovich and Christian Monson
– Hand-written
• Dutch-to-English
– Simon Zwarts
– Hand-written
Outline
• The AVENUE MT project
• The elicitation tool
• The questionnaire
• Tools for building questionnaires
Elicitation
• Get data from someone who is
– Bilingual
– Literate
• With consistent spelling
– Not experienced with linguistics
English-Hindi Example
Elicitation Tool: Erik Peterson
English-Chinese Example
Note: Translator has to insert spaces between words in Chinese.
English-Arabic Example
Outline
• The AVENUE MT project
• The elicitation tool
• The elicitation corpus
• Tools for building elicitation corpora
Size of Questionnaire
• Around 3200 sentences
• 20K words
EC Sample: clause level
• Mary is writing a book for John.
• Who let him eat the sandwich?
• Who had the machine crush the car?
• They did not make the policeman run.
• Mary had not blinked.
• The policewoman was willing to chase the boy.
• Our brothers did not destroy files.
• He said that there is not a manual.
• The teacher who wrote a textbook left.
• The policeman chased the man who was a thief.
• Mary began to work.
• Tense, aspect, transitivity, animacy
• Questions, causation and permission
• Interaction of lexical and grammatical aspect
• Volitionality
• Embedded clauses and sequence of tense
• Relative clauses
• Phase aspect
EC Sample: noun phrase level
• The man quit in November.
• The man works in the afternoon.
• The balloon floated over the library.
• The man walked over the platform.
• The man came out from among the group of boys.
• The long weekly meeting ended.
• The large bus to the post office broke down.
• The second man laughed.
• All five boys laughed.
• Temporal and locative meanings
• Quantifiers
• Numbers
• Combinations of different types of modifiers
– My book
• Possession, definiteness
– A book of mine
• Possession, indefiniteness
Organization into Minimal Pairs
srcsent: Tú caíste.
tgtsent: Eymi ütrünagimi.
aligned: ((1,1),(2,2))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) fell
srcsent: Tú estás cayendo.
tgtsent: Eymi petu ütrünagimi.
aligned: ((1,1),(2 3,2 3))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) are falling
srcsent: Tú caíste.
tgtsent: Eymi ütrunagimi.
aligned: ((1,1),(2,2))
context: tú = María [femenino, 2a persona del singular]
comment: You (Mary) fell
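To make the idea of a minimal pair concrete, here is a small Python sketch that holds two of the records above and reports which target tokens differ between them. The `Record` class and the helper names are hypothetical, chosen for illustration; the actual elicitation tool's data model is not described on these slides.

```python
from dataclasses import dataclass


@dataclass
class Record:
    srcsent: str
    tgtsent: str
    context: str


def tokens(sentence):
    """Split a sentence on whitespace after dropping a final period."""
    return sentence.rstrip(".").split()


def target_differences(a, b):
    """Return the target tokens that appear in one record but not the other."""
    return sorted(set(tokens(a.tgtsent)) ^ set(tokens(b.tgtsent)))


fell = Record("Tú caíste.", "Eymi ütrünagimi.", "tú = Juan")
falling = Record("Tú estás cayendo.", "Eymi petu ütrünagimi.", "tú = Juan")

print(target_differences(fell, falling))  # ['petu']
```

Because the two source sentences differ only in aspect, the extra target token is a candidate marker for that difference.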
Feature Detection: Spanish
The girl saw a red book.
((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
La niña vió un libro rojo
A girl saw a red book
((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
Una niña vió un libro rojo
I saw the red book
((1,1)(2,2)(3,3)(4,5)(5,4))
Yo vi el libro rojo
I saw a red book.
((1,1)(2,2)(3,3)(4,5)(5,4))
Yo vi un libro rojo
Feature: definiteness
Values: definite, indefinite
Function-of-*: subj, obj
Marked-on-head-of-*: no
Marked-on-dependent: yes
Marked-on-governor: no
Marked-on-other: no
Add/delete-word: no
Change-in-alignment: no
Feature Detection: Chinese
A girl saw a red book.
((1,2)(2,2)(3,3)(3,4)(4,5)(5,6)(5,7)(6,8))
有 一个 女人 看见 了 一本 红色 的 书 。
The girl saw a red book.
((1,1)(2,1)(3,3)(3,4)(4,5)(5,6)(6,7))
女人 看见 了 一本 红色的 书
Feature: definiteness
Values: definite, indefinite
Function-of-*: subject
Marked-on-head-of-*: no
Marked-on-dependent: no
Marked-on-governor: no
Add/delete-word: yes
Change-in-alignment: no
Feature Detection: Chinese
I saw the red book
((1, 3)(2, 4)(2, 5)(4, 1)(5, 2))
红色的 书, 我 看见 了
I saw a red book.
((1,1)(2,2)(2,3)(2, 4)(4,5)(5,6))
我 看见 了 一本 红色的 书 。
Feature: definiteness
Values: definite, indefinite
Function-of-*: object
Marked-on-head-of-*: no
Marked-on-dependent: no
Marked-on-governor: no
Add/delete-word: yes
Change-in-alignment: yes
Feature Detection: Hebrew
A girl saw a red book.
((2,1) (3,2)(5,4)(6,3))
ילדה ראתה ספר אדום
The girl saw a red book
((1,1)(2,1)(3,2)(5,4)(6,3))
הילדה ראתה ספר אדום
I saw a red book.
((2,1)(4,3)(5,2))
ראיתי ספר אדום
I saw the red book.
((2,1)(3,3)(3,4)(4,4)(5,3))
ראיתי את הספר האדום
Feature: definiteness
Values: definite, indefinite
Function-of-*: subj, obj
Marked-on-head-of-*: yes
Marked-on-dependent: yes
Marked-on-governor: no
Add-word: no
Change-in-alignment: no
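Two of the detection outputs, Add/delete-word and Change-in-alignment, can be approximated with a crude comparison of the two members of a minimal pair. The Python sketch below is an assumption about how such a check might look, not the project's detection algorithm; `parse_alignment` and `compare_pair` are hypothetical names. Comparing raw alignment sets over-reports change whenever an added word merely shifts positions (as in the Chinese subject example), so a real detector would need to normalize for that.

```python
import re


def parse_alignment(text):
    """Turn '((1,1)(2,2)(5,6))' into a set of (source, target) index pairs."""
    pairs = re.findall(r"\((\d+)\s*,\s*(\d+)\)", text)
    return {(int(s), int(t)) for s, t in pairs}


def compare_pair(example_a, example_b):
    """Compare two (target sentence, alignment string) members of a minimal pair."""
    (tgt_a, align_a), (tgt_b, align_b) = example_a, example_b
    return {
        "add/delete-word": len(tgt_a.split()) != len(tgt_b.split()),
        "change-in-alignment": parse_alignment(align_a) != parse_alignment(align_b),
    }


# Spanish subject pair: "The girl ..." vs "A girl ..."
spanish = compare_pair(
    ("La niña vió un libro rojo", "((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))"),
    ("Una niña vió un libro rojo", "((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))"),
)

# Chinese object pair: "I saw the red book" vs "I saw a red book."
chinese = compare_pair(
    ("红色的 书, 我 看见 了", "((1, 3)(2, 4)(2, 5)(4, 1)(5, 2))"),
    ("我 看见 了 一本 红色的 书 。", "((1,1)(2,2)(2,3)(2, 4)(4,5)(5,6))"),
)

print(spanish)  # both values False
print(chinese)  # both values True
```

For these two pairs the results agree with the slides: no word or alignment change for Spanish definiteness, and both for the Chinese object.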
Feature Detection Feeds into…
• Corpus Navigation: which minimal pairs to pursue next.
– Don’t pursue gender in Mapudungun
– Do pursue definiteness in Hebrew
• Morphology Learning:
– Morphological learner identifies the forms of the morphemes
– Feature detection identifies the functions
• Rule learning:
– Rule learner will have to learn a constraint for each morpho-syntactic marker that is discovered
• E.g., Adjectives and nouns agree in gender, number, and definiteness in Hebrew.
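The corpus navigation decision can be sketched as a simple filter over detection results. This Python example is only an illustration under stated assumptions: `should_pursue` is a hypothetical name, the Hebrew record copies the values from the Hebrew detection slide, and the Mapudungun gender record is invented to represent a feature for which no marking was found.

```python
MARKING_KEYS = [
    "Marked-on-head-of-*",
    "Marked-on-dependent",
    "Marked-on-governor",
    "Add/delete-word",
    "Change-in-alignment",
]


def should_pursue(detection):
    """Pursue a feature further only if detection found some way it is expressed."""
    return any(detection.get(key) == "yes" for key in MARKING_KEYS)


hebrew_definiteness = {
    "Marked-on-head-of-*": "yes",
    "Marked-on-dependent": "yes",
    "Marked-on-governor": "no",
    "Add/delete-word": "no",
    "Change-in-alignment": "no",
}

# Hypothetical record: assumes detection found no marking of gender at all.
mapudungun_gender = {key: "no" for key in MARKING_KEYS}

print(should_pursue(hebrew_definiteness))  # True
print(should_pursue(mapudungun_gender))    # False
```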
Languages
• The set of feature structures with English sentences has been delivered to the Linguistic Data Consortium as part of the Reflex program.
• Translated (by LDC) into:
– Thai
– Bengali
• Plans to translate into:
– Seven “strategic” languages per year for five years.
• As one small part of a language pack (BLARK) for each language.
Languages
• Spanish version in progress at New Mexico State University (Helmreich and Cowie)
– Plans to translate into Guarani
• Portuguese version in progress in Brazil (Marcello Modesto)
– Plans to translate into Karitiana
• 200 speakers
• Plans to translate into Inupiaq (Kaplan and MacLean)
Previous Elicitation Work
• Pilot corpus
– Around 900 sentences
– No feature structures
• Mapudungun
– Two partial translations
• Quechua
– Three translations
• Aymara
– Seven translations
• Hebrew
• Hindi
– Several translations
• Dutch
Feature Structures
• The EC is actually a corpus of feature structures that happen to have English or Spanish sentences attached to them.
Bengali example with feature structure
srcsent: The large bus to the post office broke down.
context:
tgtsent:
((actor ((modifier ((mod-role mod-descriptor)(mod-role role-loc-general-to))) (np-identifiability identifiable)(np-specificity specific)(np-biological-gender bio-gender-n/a)(np-animacy anim-inanimate)(np-person person-third)(np-function fn-actor)(np-general-type common-noun-type)(np-number num-sg)(np-pronoun-exclusivity inclusivity-n/a)(np-pronoun-antecedent antecedent-n/a)(np-distance distance-neutral)))
(c-general-type declarative-clause)(c-my-causer-intentionality intentionality-n/a)(c-comparison-type comparison-n/a)(c-relative-tense relative-n/a)(c-our-boundary boundary-n/a)(c-comparator-function comparator-n/a)(c-causee-control control-n/a)(c-our-situations situations-n/a)(c-comparand-type comparand-n/a)(c-causation-directness directness-n/a)(c-source source-neutral)(c-causee-volitionality volition-n/a)(c-assertiveness assertiveness-neutral)(c-solidarity solidarity-neutral)(c-polarity polarity-positive)(c-v-grammatical-aspect gram-aspect-neutral)(c-adjunct-clause-type adjunct-clause-type-n/a)(c-v-phase-aspect phase-aspect-neutral)(c-v-lexical-aspect activity-accomplishment)(c-secondary-type secondary-neutral)(c-event-modality event-modality-none)(c-function fn-main-clause)(c-minor-type minor-n/a)(c-copula-type copula-n/a)(c-v-absolute-tense past)(c-power-relationship power-peer)(c-our-shared-subject shared-subject-n/a)(c-question-gap gap-n/a))
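Feature structures in this parenthesized notation can be read with a few lines of code. The following Python sketch is a generic s-expression reader written for illustration; it is not the project's Feature Structure Viewer, and the function names are hypothetical. The sample string is a shortened structure that reuses feature names from the example above.

```python
def tokenize(text):
    """Split a parenthesized feature structure into '(' , ')' and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse one parenthesized expression into nested Python lists."""
    token = tokens.pop(0)
    if token != "(":
        return token
    items = []
    while tokens[0] != ")":
        items.append(parse(tokens))
    tokens.pop(0)  # drop the closing ")"
    return items


def clause_features(structure):
    """Collect top-level (name value) pairs whose value is a plain symbol."""
    return {
        item[0]: item[1]
        for item in structure
        if len(item) == 2 and isinstance(item[1], str)
    }


text = (
    "((actor ((np-number num-sg)(np-person person-third)))"
    "(c-polarity polarity-positive)(c-v-absolute-tense past))"
)
fs = parse(tokenize(text))
print(clause_features(fs))
# {'c-polarity': 'polarity-positive', 'c-v-absolute-tense': 'past'}
```

Nested lists are used rather than dictionaries for the embedded roles because a name can repeat inside one structure, as mod-role does in the modifier above.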
Why feature structures?
• Decide what grammatical meaning to elicit.
• Represent it in a feature structure.
• Formulate an English or Spanish sentence that expresses that meaning.
– We can use the same corpus of feature structures for several elicitation languages
• Have the informant translate it.
Grammatical meanings vs syntactic categories
• Features and values are based on a collection of grammatical meanings
– Many of which are similar to the grammatemes of the Prague Treebanks
Grammatical Meanings
YES
• Semantic Roles
• Identifiability
• Specificity
• Time
– Before, after, or during time of speech
• Modality
NO
• Case
• Voice
• Determiners
• Auxiliary verbs
Grammatical Meanings
YES
• How is identifiability expressed?
– Determiner
– Word order
– Optional case marker
– Optional verb agreement
• How is specificity expressed?
• How are generics expressed?
• How are predicate nominals marked?
NO
• How are English determiners translated?
– The boy cried.
– The lion is a fierce beast.
– I ate a sandwich.
– He is a soldier.
• Il est soldat.
Argument Roles
• Actor
• Undergoer
• Predicate and predicatee
– The woman is the manager.
• Recipient
– I gave a book to the students.
• Beneficiary
– I made a phone call for Sam.
Why not subject and object?
• Languages use their voice systems for different purposes.
• Mapudungun obligatorily uses an inverse marked verb when third person acts on first or second person.
– Verb agrees with undergoer
– Undergoer exhibits other subjecthood properties
– Actor may be object.
• Yes: How are actor and undergoer encoded in combination with other semantic features like adversity (Japanese) and person (Mapudungun)?
• No: How is English voice translated into another language?
Argument Roles
• Accompaniment
– With someone
– With pleasure
• Material
– (out) of wood
• About 20 more roles
– From the Lingua checklist; Comrie & Smith (1977)
– Many also found in tectogrammatical representations in the Prague Treebanks
• Around 80 locative relations
– From Lingua checklist
• Many temporal relations
Noun Phrase Features
• Person
• Number
• Biological gender
• Animacy
• Distance (for deictics)
• Identifiability
• Specificity
• Possession
• Other semantic roles
– Accompaniment, material, location, time, etc.
• Type
– Proper, common, pronoun
• Cardinals
• Ordinals
• Quantifiers
• Given and new information
– Not used yet because of limited context in the elicitation tool.
Clause level features
• Tense
• Aspect
– Lexical, grammatical, phase
• Type
– Declarative, open-q, yes-no-q
• Function
– Main, argument, adjunct, relative
• Source
– Hearsay, first-hand, sensory, assumed
• Assertedness
– Asserted, presupposed, wanted
• Modality
– Permission, obligation
– Internal, external
Other clause types (Constructions)
• Causative
– Make/let/have someone do something
• Predication
– May be expressed with or without an overt copula.
• Existential
– There is a problem.
• Impersonal
– One doesn’t smoke in restaurants in the US.
• Lament
– If only I had read the paper.
• Conditional
• Comparative
• Etc.
Outline
• The AVENUE MT project
• The elicitation tool
• The elicitation corpus
• Tools for elicitation corpora
Mar 1, 2006
Tools for Creating Elicitation Corpora
[Figure: pipeline for creating elicitation corpora]
• Feature Specification: list of semantic features and values
• Feature Maps: which combinations of features and values are of interest
– Clause-Level, Noun-Phrase, Tense & Aspect, Modality, …
• Feature Structure Sets
• Reverse Annotated Feature Structure Sets: add English sentences
• The Corpus
• Sampling: Smaller Corpus
• Tools: XML Schema, XSLT Script, Combination Formalism, Feature Structure Viewer
Feature Specification
• Defines Features and their values
• Sets default values for features
• Specifies feature requirements and restrictions
• Written in XML
Feature Specification
Feature: c-copula-type
(a copula is a verb like “be”; some languages do not have copulas)
Values
copula-n/a
  Restrictions: 1. ~(c-secondary-type secondary-copula)
  Notes:
copula-role
  Restrictions: 1. (c-secondary-type secondary-copula)
  Notes: 1. A role is something like a job or a function. "He is a teacher" "This is a vegetable peeler"
copula-identity
  Restrictions: 1. (c-secondary-type secondary-copula)
  Notes: 1. "Clark Kent is Superman" "Sam is the teacher"
copula-location
  Restrictions: 1. (c-secondary-type secondary-copula)
  Notes: 1. "The book is on the table" There is a long list of locative relations later in the feature specification.
copula-description
  Restrictions: 1. (c-secondary-type secondary-copula)
  Notes: 1. A description is an attribute. "The children are happy." "The books are long."
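The restrictions listed for c-copula-type can be checked mechanically. The Python sketch below transcribes them into a table and tests a flat set of clause features against it; the table layout and the name `copula_type_allowed` are assumptions made for illustration, since the actual specification is written in XML and its processing code is not shown here.

```python
# Restrictions for c-copula-type values, transcribed from the slide.
# Each entry is (feature, value, must_hold); must_hold=False encodes the "~" negation.
RESTRICTIONS = {
    "copula-n/a": ("c-secondary-type", "secondary-copula", False),
    "copula-role": ("c-secondary-type", "secondary-copula", True),
    "copula-identity": ("c-secondary-type", "secondary-copula", True),
    "copula-location": ("c-secondary-type", "secondary-copula", True),
    "copula-description": ("c-secondary-type", "secondary-copula", True),
}


def copula_type_allowed(features):
    """Return True if the c-copula-type value satisfies its restriction."""
    feature, value, must_hold = RESTRICTIONS[features["c-copula-type"]]
    holds = features.get(feature) == value
    return holds == must_hold


ok = {"c-copula-type": "copula-role", "c-secondary-type": "secondary-copula"}
bad = {"c-copula-type": "copula-n/a", "c-secondary-type": "secondary-copula"}

print(copula_type_allowed(ok))   # True
print(copula_type_allowed(bad))  # False
```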
Feature Maps
• Some features interact in the grammar
– English –s reflects person and number of the subject and tense of the verb.
– In expressing the English present progressive tense, the auxiliary verb is in a different place in a question and a statement:
• He is running.
• Is he running?
• We need to check many, but not all combinations of features and values.
• Using unlimited feature combinations leads to an unmanageable number of sentences
Feature Combination Template
((predicatee
  ((np-general-type pronoun-type common-noun-type)
   (np-person person-first person-second person-third)
   (np-number num-sg num-pl)
   (np-biological-gender bio-gender-male bio-gender-female)))
 {[(predicate ((np-general-type common-noun-type)(np-person person-third)))(c-copula-type role)]
  [(predicate ((adj-general-type quality-type)(c-copula-type attributive)))]
  [(predicate ((np-general-type common-noun-type)(np-person person-third)(c-copula-type identity)))]}
 (c-secondary-type secondary-copula)
 (c-polarity #all)
 (c-general-type declarative)
 (c-speech-act sp-act-state)
 (c-v-grammatical-aspect gram-aspect-neutral)
 (c-v-lexical-aspect state)
 (c-v-absolute-tense past present future)
 (c-v-phase-aspect durative))
Summarizes 288 feature structures, which are automatically generated.
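Expanding a template amounts to taking a Cartesian product over the listed values and discarding combinations that violate restrictions. The Python sketch below shows one way the total of 288 could arise. It rests on two assumptions that the slide does not state: that `#all` for c-polarity expands to two values, and that common nouns are restricted to third person. It is not the project's combination formalism.

```python
from itertools import product

np_types = ["pronoun-type", "common-noun-type"]
persons = ["person-first", "person-second", "person-third"]
numbers = ["num-sg", "num-pl"]
genders = ["bio-gender-male", "bio-gender-female"]
copula_alternatives = ["role", "attributive", "identity"]  # the three [...] options
tenses = ["past", "present", "future"]
# Assumption: "#all" for c-polarity expands to these two values.
polarities = ["polarity-positive", "polarity-negative"]


def allowed(np_type, person):
    # Assumption (not stated on the slide): common nouns are third person only.
    return np_type == "pronoun-type" or person == "person-third"


unrestricted = list(
    product(np_types, persons, numbers, genders, copula_alternatives, polarities, tenses)
)
restricted = [combo for combo in unrestricted if allowed(combo[0], combo[1])]

print(len(unrestricted), len(restricted))  # 432 288
```

Under these assumptions the predicatee contributes 16 combinations (12 for pronouns, 4 for common nouns), and 16 × 3 × 2 × 3 = 288.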
Adding Sentences to Feature Structures
srcsent: Mary was not a leader.
context: Translate this as though it were spoken to a peer co-worker;
((actor ((np-function fn-actor)(np-animacy anim-human)(np-biological-gender bio-gender-female)(np-general-type proper-noun-type)(np-identifiability identifiable)(np-specificity specific)…))
(pred ((np-function fn-predicate-nominal)(np-animacy anim-human)(np-biological-gender bio-gender-female)(np-general-type common-noun-type)(np-specificity specificity-neutral)…))
(c-v-lexical-aspect state)(c-copula-type copula-role)(c-secondary-type secondary-copula)(c-solidarity solidarity-neutral)(c-v-grammatical-aspect gram-aspect-neutral)(c-v-absolute-tense past)(c-v-phase-aspect phase-aspect-neutral)(c-general-type declarative-clause)(c-polarity polarity-negative)(c-my-causer-intentionality intentionality-n/a)(c-comparison-type comparison-n/a)(c-relative-tense relative-n/a)(c-our-boundary boundary-n/a)…)
Difficult Issues in Adding Sentences
• Have to remember that the grammatical meanings don’t correspond exactly to English morphemes.
– Identifiability and specificity vs the and a
– Modality, tense, aspect vs auxiliary verbs
• The meaning has to be clear to a translator.
– If English is going to be the source language for translation, the clearest way to say something may not be the most common way it is said in real text or conversation.
Hard Problems
• Expressing meanings that are not grammaticalized in English.
– Evidentiality:
• He stole the bread.
• Context: Translate this as if you do not have first hand knowledge. In English, we might say, “They say that he stole the bread” or “I hear that he stole the bread.”
Hard Problems
• Reverse annotating things that can be said in several ways in English.
– Impersonals:
• One doesn’t smoke here.
• You don’t smoke here.
• They don’t smoke here.
• There’s no smoking here.
• Credit cards aren’t accepted.
– Problem in the Reflex corpus because space was limited.
Evaluation
• Current funding has not covered evaluation of the questionnaire.
– Except for informal observations as it was translated into several languages.
• Does it elicit the meanings it was intended to elicit?
– Informal observation: usually
• Is it useful for machine translation?
Navigation
• Currently, feature combinations are specified by a human.
• Plan to work in active learning mode.
– Build seed questionnaire
– Translate some data
– Do some learning
– Identify most valuable pieces of information to get next
– Generate an RTB for those pieces of information
– Translate more
– Learn more
– Generate more, etc.
Summary
• Feature Specification:
– lists features and values
– Grammatical meanings
• Feature Combinations
• Set of Feature Structures
• Add English or Spanish Sentences
• Get a translation and word alignment from a bilingual, literate informant