Hybridity in MT: Experiments on the Europarl Corpus
Declan Groves24th May,
NCLT Seminar Series 2006
Outline
• Example-Based Machine Translation
  – Marker-Based EBMT
• Statistical Machine Translation
  – Phrasal Extraction
• Experiments:
  – Data Sources Used
  – EBMT vs PBSMT
  – Hybrid System Experiments
    • Improving the EBMT lexicon
    • Making use of merged data sets
• Conclusions
• Future Work
Example-Based MT
• As with SMT, makes use of information extracted from a sententially-aligned bilingual corpus. In general:
  – SMT only uses parameters, throws away the data
  – EBMT makes use of linguistic units directly
• During translation:
  1. Source side of the bitext is searched for close matches
  2. Source-target sub-sentential links are determined
  3. Relevant target fragments are retrieved and recombined to derive the final translation
Example-Based MT: An Example
• Assumes an aligned bilingual corpus of examples against which input text is matched
• Best match is found using a similarity metric based on word co-occurrence, POS, generalized templates and bilingual dictionaries (exact and fuzzy matching)

Given the corpus:
The shop is open on Monday → Le magasin est ouvert Lundi
John went to the swimming pool → Jean est allé à la piscine
The butcher’s is next to the baker’s → La boucherie est à côté de la boulangerie

• Identify and isolate useful fragments. We can now translate:
on Monday → Lundi
John went to → Jean est allé à
the baker’s → la boulangerie

• Recombination depends on the nature of the examples used
Marker-Based EBMT
“The Marker Hypothesis states that all natural languages have a closed set of specific words or morphemes which appear in a limited set of grammatical contexts and which signal that context.” (Green, 1979)
• Universal psycholinguistic constraint: languages are marked for syntactic structure at the surface level by a closed set of lexemes or morphemes
• Use a set of closed-class marker words to segment aligned source and target sentences during a pre-processing stage
Determiners <DET>
Quantifiers <QUANT>
Prepositions <PREP>
Conjunctions <CONJ>
Wh-Adverbs <WRB>
Possessive Pronouns <POSS>
Personal Pronouns <PRON>
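As an illustration of the tagging step, here is a minimal Python sketch. The marker lexicon below is a tiny hypothetical subset invented for the example, not the system's actual closed-class word lists.

```python
# Toy marker lexicon: a small illustrative subset of closed-class words.
MARKERS = {
    "the": "<DET>", "a": "<DET>",
    "some": "<QUANT>", "all": "<QUANT>",
    "to": "<PREP>", "of": "<PREP>", "on": "<PREP>",
    "and": "<CONJ>", "or": "<CONJ>",
    "where": "<WRB>", "when": "<WRB>",
    "his": "<POSS>", "her": "<POSS>",
    "you": "<PRON>", "he": "<PRON>",
}

def tag_markers(sentence):
    """Prefix each marker word with its marker category tag."""
    out = []
    for word in sentence.split():
        tag = MARKERS.get(word.lower())
        out.append(f"{tag} {word}" if tag else word)
    return " ".join(out)

print(tag_markers("you click apply to view the effect of the selection"))
# → <PRON> you click apply <PREP> to view <DET> the effect <PREP> of <DET> the selection
```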
Marker-Based EBMT
• Source-target sentence pairs are marked with their marker categories:
EN: <PRON> you click apply <PREP> to view <DET> the effect <PREP> of <DET> the selection
FR: <PRON> vous cliquez sur appliquer <PREP> pour visualiser <DET> l’ effet <PREP> de <DET> la sélection
• Aligned source-target chunks are created by segmenting the sentences based on these marker tags, along with cognate and word co-occurrence information:
<PRON> you click apply : <PRON> vous cliquez sur appliquer
<PREP> to view : <PREP> pour visualiser
<DET> the effect : <DET> l’effet
<PREP> of the selection : <PREP> de la sélection
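The monolingual half of this segmentation can be sketched as follows. This is only a sketch: the real system additionally uses cognate and co-occurrence information to align the source and target chunks, and the rule for folding a marker-only chunk (e.g. "<PREP> of") into its neighbour is a simplifying assumption.

```python
import re

def segment_at_markers(tagged):
    """Split a marker-tagged sentence into chunks at marker tags; a chunk
    that is only a tag plus its marker word (no content word) is folded
    into the following chunk so every chunk carries content."""
    raw = [p.strip() for p in re.split(r"(?=<[A-Z]+>)", tagged) if p.strip()]
    chunks, carry = [], ""
    for piece in raw:
        if carry:
            # attach the marker-only chunk, dropping the inner tag
            piece = carry + " " + re.sub(r"^<[A-Z]+>\s*", "", piece)
            carry = ""
        if piece.startswith("<") and len(piece.split()) <= 2:
            carry = piece  # e.g. "<PREP> of": no content word yet
        else:
            chunks.append(piece)
    if carry:
        chunks.append(carry)
    return chunks

tagged = ("<PRON> you click apply <PREP> to view "
          "<DET> the effect <PREP> of <DET> the selection")
print(segment_at_markers(tagged))
# → ['<PRON> you click apply', '<PREP> to view', '<DET> the effect', '<PREP> of the selection']
```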
Marker-Based EBMT
• Chunks containing only one non-marker word in both source and target languages can then be used to extract a word-level lexicon:
<PREP> to : <PREP> pour
<LEX> view : <LEX> visualiser
<LEX> effect : <LEX> effet
<DET> the : <DET> l
<PREP> of : <PREP> de
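A sketch of this extraction rule in Python. The `is_marker` predicate and the positional pairing of marker words across languages are simplifying assumptions for illustration; the actual system's pairing criteria are not shown in the slides.

```python
def extract_lexicon(chunk_pairs, is_marker):
    """From aligned chunks whose source and target sides each contain
    exactly one non-marker (content) word, read off word-level entries:
    marker words pair up positionally, and the two content words pair
    up under a <LEX> tag."""
    lexicon = []
    for src, tgt in chunk_pairs:
        s_words = [w for w in src.split() if not w.startswith("<")]
        t_words = [w for w in tgt.split() if not w.startswith("<")]
        s_content = [w for w in s_words if not is_marker(w)]
        t_content = [w for w in t_words if not is_marker(w)]
        if len(s_content) == 1 and len(t_content) == 1:
            tag = src.split()[0]
            s_markers = [w for w in s_words if is_marker(w)]
            t_markers = [w for w in t_words if is_marker(w)]
            # pairing marker words in order is a simplifying assumption
            for sm, tm in zip(s_markers, t_markers):
                lexicon.append((f"{tag} {sm}", f"{tag} {tm}"))
            lexicon.append((f"<LEX> {s_content[0]}", f"<LEX> {t_content[0]}"))
    return lexicon

is_marker = lambda w: w.lower() in {"to", "pour", "the", "l'", "of", "de"}
pairs = [("<PREP> to view", "<PREP> pour visualiser"),
         ("<DET> the effect", "<DET> l' effet")]
print(extract_lexicon(pairs, is_marker))
# → [('<PREP> to', '<PREP> pour'), ('<LEX> view', '<LEX> visualiser'),
#    ("<DET> the", "<DET> l'"), ('<LEX> effect', '<LEX> effet')]
```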
• In a final pre-processing stage, we produce a set of generalized marker templates by replacing marker words with their tags:
<PRON> click apply : <PRON> cliquez sur appliquer
<PREP> view : <PREP> visualiser
<DET> effect : <DET> effet
<PREP> the selection : <PREP> la sélection
• Any marker tag pair can now be inserted at the appropriate tag location.
• More general examples add flexibility to the matching process and improve coverage (and quality).
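Template generalisation amounts to deleting the marker word that follows each tag, leaving the tag as an open slot. A one-line sketch, assuming the marker word is always the token immediately after its tag:

```python
import re

def make_template(chunk):
    """Generalise a chunk: delete the marker word that follows each tag,
    leaving the tag itself as an open slot for any word of that class."""
    return re.sub(r"(<[A-Z]+>)\s+\S+", r"\1", chunk)

print(make_template("<PRON> you click apply"))   # → <PRON> click apply
print(make_template("<PREP> of the selection"))  # → <PREP> the selection
```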
Marker-Based EBMT
• During translation:
  – Resources are searched from maximal (specific source-target sentence pairs) to minimal context (word-for-word translation)
  – Retrieved example translation candidates are recombined, along with their weights, based on source sentence order
  – System outputs an n-best list of translations
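The maximal-to-minimal back-off can be sketched as below. The three tables are hypothetical dicts; the real system recombines multiple chunk matches with weights and produces an n-best list, which this sketch collapses into single lookups.

```python
def translate(source, sentence_memory, chunk_memory, lexicon):
    """Back off from the most specific resource to the least: a full
    sentence match, then a chunk match, then word-for-word lookup."""
    if source in sentence_memory:
        return sentence_memory[source]
    if source in chunk_memory:
        return chunk_memory[source]
    # word-for-word fallback; unknown words pass through untranslated
    return " ".join(lexicon.get(w, w) for w in source.split())

chunks = {"on monday": "lundi"}
lexicon = {"the": "le", "shop": "magasin"}
print(translate("on monday", {}, chunks, lexicon))  # → lundi
print(translate("the shop", {}, chunks, lexicon))   # → le magasin
```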
Phrase-Based SMT
• Translation models now estimate both word-to-word and phrasal translation probabilities (allowing, in addition, many-to-one and many-to-many word mappings)
• Phrases incorporate some idea of syntax
  – Able to capture more meaningful relationships between words within phrases
• In order to extract phrases, we can make use of word alignments
SMT Phrasal Extraction
• Perform word alignment in both source-target and target-source directions
• Take the intersection of the uni-directional alignments
  – Produces a set of highly confident word alignments
• Extend the intersection iteratively into the union by adding adjacent alignments within the alignment space (Och & Ney, 2003; Koehn et al., 2003)
• Extract all possible phrases from sentence pairs which correspond to these alignments (possibly including full sentences)
• Phrase probabilities can be calculated from relative frequencies
  – Phrases and their probabilities make up the phrase translation table (translation model)
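The symmetrisation and the relative-frequency estimate can be sketched as follows. This is a simplified version of the refined method cited above: it grows the intersection with any adjacent union point, without the directional and final-point refinements of the full heuristic, and the alignments here are toy data.

```python
from collections import Counter

def symmetrise(src2tgt, tgt2src):
    """Start from the intersection of the two directional word alignments
    and iteratively add union points adjacent to an accepted point."""
    union = src2tgt | tgt2src
    aligned = set(src2tgt & tgt2src)
    added = True
    while added:
        added = False
        for i, j in sorted(union - aligned):
            if {(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)} & aligned:
                aligned.add((i, j))
                added = True
    return aligned

def phrase_probs(phrase_pairs):
    """Relative-frequency estimate p(target | source) over extracted pairs."""
    pair_counts = Counter(phrase_pairs)
    src_counts = Counter(s for s, _ in phrase_pairs)
    return {(s, t): c / src_counts[s] for (s, t), c in pair_counts.items()}

# Toy directional alignments over a 3-word sentence pair
s2t = {(0, 0), (1, 1), (2, 2)}
t2s = {(0, 0), (1, 1), (1, 2)}
print(symmetrise(s2t, t2s))  # intersection {(0,0),(1,1)} grown to all four points
print(phrase_probs([("of the", "de la"), ("of the", "de la"), ("of the", "de l'")]))
```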
Experiments: Data Resources
• Made use of French-English training and testing sets of the Europarl corpus (Koehn, 2005)
• Extracted training data from designated training sets, filtering based on sentence length and relative sentence length.
# sentence pairs    # words
78K                 1.49M
156K                2.98M
322K                6.12M
• For testing, randomly extracted 5000 sentences from the Europarl common test set
  – Avg. sentence lengths: 20.5 words (French), 19.0 words (English)
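The filtering step can be sketched as below. The thresholds are illustrative assumptions; the talk does not state the actual cut-offs used.

```python
def keep_pair(src, tgt, max_len=40, max_ratio=1.5):
    """Keep a sentence pair only if both sides are non-empty, neither
    exceeds max_len words, and their lengths are within max_ratio of
    each other (thresholds are illustrative, not the talk's values)."""
    ls, lt = len(src.split()), len(tgt.split())
    if ls == 0 or lt == 0 or ls > max_len or lt > max_len:
        return False
    return max(ls, lt) / min(ls, lt) <= max_ratio

print(keep_pair("the shop is open", "le magasin est ouvert"))  # → True
print(keep_pair("yes", "oui oui oui oui oui"))                 # → False
```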
EBMT vs PBSMT
• Compared the performance of our Marker-Based EBMT system against that of a PBSMT system built using:
  – Pharaoh phrase-based decoder (Koehn, 2003)
  – SRI LM toolkit
  – Refined alignment strategy (Och & Ney, 2003)
• Trained on incremental data sets, tested on 5000 sentence test set
• Performed translation for French-English and English-French
EBMT vs PBSMT: French-English
[Bar charts: BLEU, Precision, Recall and WER for EBMT vs PBSMT at 78K, 156K and 322K training pairs]
• Doubling the amount of data improves performance across the board for both EBMT and PBSMT
• PBSMT system clearly outperforms the EBMT system, on average achieving a 0.07 higher BLEU score
• PBSMT achieves a significantly lower WER (e.g. 68.55 vs. 82.43 for the 322K data set)
• Increasing the amount of training data results in:
  – 3-5% relative BLEU increase for PBSMT
  – 6.2% to 10.3% relative BLEU score improvement for EBMT
EBMT vs PBSMT: English-French
[Bar charts: BLEU, Precision, Recall and WER for EBMT vs PBSMT at 78K, 156K and 322K training pairs]
• PBSMT continues to outperform the EBMT system by some distance
  – e.g. 0.1933 vs. 0.1488 BLEU score, 0.518 vs. 0.4578 Recall for the 322K data set
• Difference between scores is somewhat less for English-French than for French-English
  – EBMT system performance is much more consistent across both directions
  – PBSMT system performs 2% BLEU score worse (10% relative) for English-French than for French-English
• French-English is ‘easier’
  – Fewer agreement errors and problems with boundary friction, e.g. le → the (French-English), but the → le, la, les, l’ (English-French)
Hybrid System Experiments
• Decided to merge elements of EBMT marker-based alignments with PBSMT phrases and words induced via GIZA++
• A number of hybrid systems:
  – LEX-EBMT: replaced the EBMT lexicon with higher-quality PBSMT word alignments, to lower WER
  – H-EBMT vs H-PBSMT: merged PBSMT words and phrases with EBMT data (words and phrases) and passed the resulting data to the baseline EBMT and baseline PBSMT systems
  – EBMT-LM and H-EBMT-LM: reranked the output of the EBMT and H-EBMT systems using the PBSMT system’s equivalent language model
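The data-merging idea behind H-EBMT and H-PBSMT can be sketched as pooling the two sets of aligned pairs and re-estimating probabilities over the merged counts. This is an illustrative sketch of the merging only, not the actual systems' estimation.

```python
from collections import Counter

def merge_tables(ebmt_pairs, smt_pairs):
    """Pool two multisets of aligned chunk/phrase pairs and re-estimate
    p(target | source) by relative frequency over the merged counts."""
    merged = Counter(ebmt_pairs) + Counter(smt_pairs)
    src_totals = Counter()
    for (s, _), c in merged.items():
        src_totals[s] += c
    return {(s, t): c / src_totals[s] for (s, t), c in merged.items()}

ebmt = [("to view", "pour visualiser")]
smt = [("to view", "pour visualiser"), ("to view", "pour voir")]
table = merge_tables(ebmt, smt)
print(table)  # p('pour visualiser' | 'to view') = 2/3, p('pour voir' | 'to view') = 1/3
```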
Hybrid Experiments: French-English
[Bar charts: BLEU scores at 78K, 156K and 322K for EBMT, LEX-EBMT, H-EBMT, H-EBMT-LM, PBSMT and H-PBSMT]
• Use of the improved lexicon (LEX-EBMT) leads to only slight improvements (average relative increase of 2.9% BLEU)
• Adding the hybrid data improves on the baselines for both EBMT (H-EBMT) and PBSMT (H-PBSMT)
  – H-PBSMT system achieves a higher BLEU score trained on 78K and 156K than the PBSMT system trained on twice as much data
• The addition of the language model to the H-EBMT system helps guide word order after lexical selection and thus improves results further
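LM reranking of an n-best list can be sketched as below: pick the candidate maximising an interpolation of the system's translation score and a language-model score. The linear interpolation and its weight are illustrative assumptions, as are the toy scores.

```python
def rerank(nbest, lm_score, lm_weight=0.5):
    """Return the candidate from an n-best list [(translation, score)]
    that maximises a weighted mix of translation and LM scores."""
    def combined(item):
        translation, trans_score = item
        return (1 - lm_weight) * trans_score + lm_weight * lm_score(translation)
    return max(nbest, key=combined)[0]

# Toy n-best list and a toy "language model" that favours fluent word order
nbest = [("le magasin ouvert est", 0.55), ("le magasin est ouvert", 0.50)]
lm = {"le magasin ouvert est": 0.1, "le magasin est ouvert": 0.9}.get
print(rerank(nbest, lm))  # → le magasin est ouvert
```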
Hybrid Experiments: English-French
[Bar charts: BLEU scores at 78K, 156K and 322K for EBMT, LEX-EBMT, H-EBMT, H-EBMT-LM, PBSMT and H-PBSMT]
• We see similar results for English-French as for French-English
• Using the hybrid data set we get a 15% average relative increase in BLEU score for the EBMT system, and 6.2% for the H-PBSMT system over its baseline
• The H-EBMT system performs almost as well as the baseline system trained on over 4 times the amount of data
Conclusions
• In Groves & Way (2005), we showed how an EBMT system outperforms a PBSMT system when trained on the Sun Microsystems data set
• This time around, the baseline PBSMT system achieves higher quality than all variants of the EBMT system
  – Heterogeneous Europarl vs. homogeneous Sun data
  – Chunk coverage is lower on the Europarl data set: 6% of translations produced using chunks alone (Sun) vs. 1% on Europarl
  – EBMT system considered 13 words on average for direct translation
• Significant improvements seen when using the higher-quality lexicon
• Improvements also seen when the LM is introduced
• H-PBSMT system able to outperform the baseline PBSMT system
• Further gains to be made from hybrid corpus-based approaches
Future Work
• Automatic detection of marker words
• Plan to increase levels of hybridity:
  – Code a simple EBMT decoder, factoring in the marker-based recombination approach along with probabilities (rather than weights)
  – Use exact sentence matching as in EBMT, along with statistical weighting of knowledge sources
  – Integration of generalized templates into the PBSMT system
  – Use Good-Turing methods to assign probabilities to fuzzy matching
    • Often a fuzzy chunk match may be preferable to a word-for-word translation
• Plan to code a robust, wide-coverage statistical EBMT system
  – Make use of EBMT principles in a statistically-driven system