Semi-Automatic Learning of Transfer Rules for Machine Translation of Low-Density
Languages
Katharina Probst
April 5, 2002
Overview of the talk
Introduction and Motivation
Overview of the AVENUE project
Elicitation of bilingual data
Rule Learning
  Seed Generation
  Seeded Version Space Learning
Conclusions and Future Work
Introduction and Motivation
Basic idea: opening up Machine Translation to minority languages
Scarce resources for minority languages:
  Bilingual text
  Monolingual text
  Target language grammar
Due to scarce resources, statistical and example-based methods will likely not perform as well
Our approach: a system that elicits the necessary information about the target language from a bilingual informant
The elicited information is used in conjunction with any other available target language information to learn syntactic transfer rules
System overview
[System architecture diagram; components:]
Learning Module: User, Elicitation Process, SVS Learning Process, Transfer Rules
Run-Time Module: SL Input, SL Parser, Transfer Engine, TL Generator, EBMT Engine, Unifier Module, TL Output
Overview of the talk
Introduction and Motivation
Overview of the AVENUE project
Elicitation of bilingual data
Rule Learning
  Seed Generation
  Seeded Version Space Learning
Conclusions and Future Work
Elicitation
Elicitation is the process of presenting a bilingual speaker with sets of sentences. The user translates the sentences and specifies how the words align.
The elicitation process serves multiple purposes:
  Collection of data
  Feature detection
Feature Detection
Feature detection is a process by which the learning module answers questions such as "Does the target language mark number on nouns?"
The elicitation corpus is organized in minimal pairs, i.e. pairs of sentences that differ in only one feature. For example:
1. You (John) are falling. [2nd person m, subj, present tense]
2. You (Mary) are falling. [2nd person f, subj, present tense]
3. You (Mary) fell. [2nd person f, subj, past tense]
Sentences 1 and 2 and sentences 2 and 3 are minimal pairs.
By comparing the translations of "you" across such pairs, the system gets indications of whether the distinguishing feature (here gender, or tense) is marked in the target language.
The results of feature detection will be used to guide the system in navigating the elicitation corpus, eliminating parts of it based on Implicational Universals.
The results will also be used by the rule learning module.
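As a toy illustration of the comparison step, the following sketch checks whether the translations of the word under study differ across a minimal pair. The function name and the Spanish-like data are invented for illustration; the real module operates over the full elicitation corpus.

```python
def detect_feature(translation_a, translation_b, feature):
    """Compare the informant's translations of the word under study in a
    minimal pair; if they differ, the feature distinguishing the pair is
    likely marked in the target language."""
    if translation_a != translation_b:
        return (feature, "likely marked")
    return (feature, "likely unmarked")

# Hypothetical Spanish-like data: "You (John)" and "You (Mary)" both come
# back as "tu", suggesting the 2nd-person pronoun does not mark gender.
hint = detect_feature("tu", "tu", "gender")
```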
More on the elicitation corpus
Eliciting data from bilingual informants entails a number of challenges:
1. The bilingual informant him/herself
2. Morphology and the lexicon
3. Learning grammatical features
4. Compositional elicitation
5. Elicitation of non-compositional data
6. Verb subcategorization
7. Alignment issues
8. Bias towards the source language
Overview of the talk
Introduction and Motivation
Overview of the AVENUE project
Elicitation of bilingual data
Rule Learning
  Seed Generation
  Seeded Version Space Learning
Conclusions and Future Work
Rule Learning in the AVENUE project - Introduction
The goal is to semi-automatically (i.e. with the help of the user) infer syntactic transfer rules
Rule learning can be divided into two main steps:
Seed Generation: The system produces an initial "guess" at a transfer rule based on only one sentence. The produced rule is quite specific to the input sentence.
Version Space Learning: Here, the system takes the seed rules and generalizes them.
Transfer rule formalism
A transfer rule (TR) consists of the following components:
1. The source language sentence and target language sentence that the TR was produced from
2. Word alignments
3. Phrase information such as NP, S, …
4. Part-of-speech sequences for source and target language
5. X-side constraints, i.e. constraints on the source language. These are used for parsing.
6. Y-side constraints, i.e. constraints on the target language. These are used for generation.
7. XY-constraints, i.e. constraints that transfer features from the source to the target language. These are used for transfer.
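The components above can be sketched as a single container type. The field names and types here are illustrative assumptions, not the AVENUE-internal representation.

```python
from dataclasses import dataclass, field

@dataclass
class TransferRule:
    """Illustrative container for the seven components of a transfer rule."""
    sl_sentence: str                # 1. source language sentence
    tl_sentence: str                #    target language sentence
    alignments: list                # 2. word alignments, e.g. [(0, 0), (1, 1)]
    phrase_type: str                # 3. phrase information, e.g. "NP", "S"
    sl_pos: list                    # 4. source POS sequence
    tl_pos: list                    #    target POS sequence
    x_constraints: list = field(default_factory=list)   # 5. parsing (SL side)
    y_constraints: list = field(default_factory=list)   # 6. generation (TL side)
    xy_constraints: list = field(default_factory=list)  # 7. transfer (SL -> TL)

# Hypothetical English/Spanish noun phrase:
rule = TransferRule("the man", "el hombre", [(0, 0), (1, 1)],
                    "NP", ["DET", "N"], ["DET", "N"])
```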
Seed Generation
Type of Information  | Source of Information
SL, TL sentence      | Informant
Alignment            | Informant
Phrase information   | Elicitation corpus; same as SL on TL
SL POS sequence      | English parse (c,f)
TL POS sequence      | English parse, TL dictionary
X-side constraints   | English parse (f)
Y-side constraints   | English parse, list of projecting features, TL dictionary
XY constraints       | ---
A word on compositionality
Basic idea: if you produce a transfer rule for a sentence, and there already exist transfer rules that can translate parts of the sentence, why not use them?
Adjust the alignments, part-of-speech sequences, and the constraints accordingly.
The trickiest part is finding new constraints that cannot be part of the lower-level rule but are necessary to translate correctly in the context of the full sentence.
Clustering
Seed rules are "clustered" into groups whose members warrant an attempt at merging
Clustering criteria: POS sequences, Phrase information, Alignments
Main reason for clustering: divide the large version space into a number of smaller version spaces and run the algorithm on each version space separately
Possible danger: rules that should be considered together (such as "the man", "men") will not be grouped together
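A minimal sketch of the clustering step under the criteria named above, representing seed rules as plain dicts (a simplification of the transfer-rule formalism):

```python
from collections import defaultdict

def cluster_seed_rules(rules):
    """Group seed rules by POS sequences, phrase information, and
    alignments; each resulting group is a candidate cluster for merging."""
    clusters = defaultdict(list)
    for r in rules:
        key = (tuple(r["sl_pos"]), tuple(r["tl_pos"]),
               r["phrase"], tuple(r["alignments"]))
        clusters[key].append(r)
    return list(clusters.values())

# Hypothetical seed rules: the first two share all clustering keys.
seeds = [
    {"sl_pos": ["DET", "N"], "tl_pos": ["DET", "N"],
     "phrase": "NP", "alignments": [(0, 0), (1, 1)]},
    {"sl_pos": ["DET", "N"], "tl_pos": ["DET", "N"],
     "phrase": "NP", "alignments": [(0, 0), (1, 1)]},
    {"sl_pos": ["N"], "tl_pos": ["N"],
     "phrase": "NP", "alignments": [(0, 0)]},
]
```

Each smaller version space can then be processed independently, as described above.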
The Version Space
A set of seed rules in a cluster defines a version space as follows: the seed rules form the specific boundary (S). A virtual rule with the same POS sequences, alignments, and phrase information, but no constraints, forms the general boundary (G):
G boundary: virtual rule with no constraints
S boundary: seed rules
Between the boundaries: generalizations of the seed rules, more specific than the rule in G
The partial ordering of rules in the version space
A rule TR2 is said to be strictly more general than another rule TR1 if the set of f-structures that satisfy TR2 is a proper superset of the set of f-structures that satisfy TR1. TR2 is said to be equivalent to TR1 if the set of f-structures that satisfies TR1 is the same as the set that satisfies TR2.
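Treating the sets of satisfying f-structures as plain Python sets (the real system would compute these from the rules' constraints), the ordering can be sketched as:

```python
def strictly_more_general(fstructs_tr1, fstructs_tr2):
    """TR2 is strictly more general than TR1 iff the f-structures
    satisfying TR2 form a proper superset of those satisfying TR1."""
    return fstructs_tr2 > fstructs_tr1     # proper-superset test

def equivalent(fstructs_tr1, fstructs_tr2):
    """TR1 and TR2 are equivalent iff the same f-structures satisfy both."""
    return fstructs_tr1 == fstructs_tr2
```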
We have defined three operations that move a transfer rule to a strictly more general rule
Generalization operations
Operation 1: delete value constraint, e.g. ((X1 agr) = *3pl) → NULL
Operation 2: delete agreement constraint, e.g. ((X1 agr) = (X2 agr)) → NULL
Operation 3: merge two value constraints into an agreement constraint, e.g. ((X1 agr) = *3pl), ((X2 agr) = *3pl) → ((X1 agr) = (X2 agr))
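A minimal sketch of the three operations, modeling constraints as (path, value) or (path, path) tuples; this encoding is an assumption for illustration, not the system's actual format.

```python
def delete_constraint(constraints, target):
    """Operations 1 and 2: drop a value or agreement constraint (-> NULL)."""
    return [c for c in constraints if c != target]

def merge_value_constraints(constraints, c1, c2):
    """Operation 3: two value constraints sharing a value, e.g.
    ((X1 agr) = *3pl) and ((X2 agr) = *3pl), become the single
    agreement constraint ((X1 agr) = (X2 agr))."""
    (path1, val1), (path2, val2) = c1, c2
    assert val1 == val2, "Operation 3 requires matching values"
    kept = [c for c in constraints if c not in (c1, c2)]
    return kept + [(path1, path2)]

# Constraints as (path, value) or (path, path) tuples, e.g.:
c1 = (("X1", "agr"), "*3pl")   # value constraint ((X1 agr) = *3pl)
c2 = (("X2", "agr"), "*3pl")   # value constraint ((X2 agr) = *3pl)
```

Each operation only removes or weakens constraints, so each moves a rule to a strictly more general one, as required by the partial ordering above.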
Merging two transfer rules
At the heart of the seeded version space learning algorithm is the merging of two transfer rules (TR1 and TR2) into a more general rule (TR3):
1. All constraints that appear in both TR1 and TR2 are inserted into TR3 and removed from TR1 and TR2.
2. Perform all instances of Operation 3 on TR1 and TR2 separately.
3. Repeat step 1.
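The three-step merge can be sketched over the same tuple encoding of constraints used above; both function names are invented for illustration.

```python
def op3_all(constraints):
    """Exhaustively apply Operation 3: pairs of value constraints
    ((path, value) tuples) sharing a value become agreement constraints."""
    out, pending = [], list(constraints)
    while pending:
        c = pending.pop(0)
        is_value = isinstance(c[1], str)      # value constraints hold a "*..." string
        partner = next((p for p in pending
                        if is_value and isinstance(p[1], str) and p[1] == c[1]),
                       None)
        if partner:
            pending.remove(partner)
            out.append((c[0], partner[0]))    # new agreement constraint
        else:
            out.append(c)
    return out

def merge_rules(tr1, tr2):
    """Three-step merge of two rules' constraint lists into TR3."""
    # Step 1: constraints present in both rules go straight into TR3.
    tr3 = [c for c in tr1 if c in tr2]
    tr1 = [c for c in tr1 if c not in tr3]
    tr2 = [c for c in tr2 if c not in tr3]
    # Step 2: apply Operation 3 within each rule separately.
    tr1, tr2 = op3_all(tr1), op3_all(tr2)
    # Step 3 (repeat step 1): newly shared agreement constraints join TR3.
    tr3 += [c for c in tr1 if c in tr2]
    return tr3
```

For instance, merging a rule with *3pl agreement on X1 and X2 against one with *2sg agreement on both yields the single shared agreement constraint ((X1 agr) = (X2 agr)).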
Seeded Version Space Algorithm
1. Remove duplicate rules from the S boundary
2. Try to merge each pair of transfer rules
3. A merge is successful only if the CSet (set of covered sentences, i.e. sentences that are translated correctly) of the merged rule is a superset of the union of the CSets of the two unmerged rules
4. Pick the successful merge that optimizes an evaluation criterion
5. Repeat until no more merges are found
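The loop can be sketched as follows; `cset` and `merge` are hypothetical callables standing in for the real coverage computation and rule merging.

```python
def svs_learn(seed_rules, cset, merge):
    """Sketch of the seeded version space loop (steps 1-5 above).
    cset(rule) returns the set of sentences the rule translates correctly;
    merge(r1, r2) builds the merged, more general rule."""
    rules = list(dict.fromkeys(seed_rules))          # 1. remove duplicates
    while True:
        success = None
        for i in range(len(rules)):                  # 2. try each pair
            for j in range(i + 1, len(rules)):
                cand = merge(rules[i], rules[j])
                # 3. a merge succeeds only if no coverage is lost
                if cset(cand) >= cset(rules[i]) | cset(rules[j]):
                    # 4. with lossless merges, any success shrinks the
                    #    rule set by one, so take the first one found
                    success = (rules[i], rules[j], cand)
                    break
            if success:
                break
        if success is None:                          # 5. no more merges
            return rules
        r1, r2, cand = success
        rules = [r for r in rules if r not in (r1, r2)] + [cand]
```

With toy string "rules" whose CSet is just their character set, all seeds collapse into a single general rule.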
Evaluating a set of transfer rules
Initial thought: evaluate a merge based on the "goodness" of the new rule, i.e. its CSet, and on the size of the rule set.
Goal: maximize coverage and minimize the size of the rule set.
Currently: merges are only successful if there is no loss in coverage, so the size of the rule set is the only criterion used.
Future (1): Coverage should be measured on a test set.
Future (2): Relax the constraint that a successful merge cannot result in loss of coverage.
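Once lossy merges are allowed, coverage and rule-set size must be traded off explicitly. One hypothetical way to express the combined criterion is a sort key (the function and its weighting are illustrative assumptions, not the system's actual evaluation):

```python
def score_rule_set(rules, csets):
    """Hypothetical scorer: prefer larger total coverage, then fewer rules.
    csets maps each rule to its CSet; the tuple is a max-sort key."""
    coverage = set().union(*(csets[r] for r in rules)) if rules else set()
    return (len(coverage), -len(rules))
```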
Overview of the talk
Introduction and Motivation
Overview of the AVENUE project
Elicitation of bilingual data
Rule Learning
  Seed Generation
  Seeded Version Space Learning
Conclusions and Future Work
Conclusions and Future Work
Novel approach to data-driven MT: less data, more encoded linguistic knowledge
Still in the first stages, so the system is under heavy development and subject to major changes
Current work: compositionality
Future work includes:
  Expanding coverage
  Addressing (much) more complex constructions
  Eliminating some assumptions