Czech-English Word Alignment
Ondřej Bojar ([email protected]), Magdalena Prokopová ([email protected])
Institute of Formal and Applied Linguistics, ÚFAL MFF, Charles University in Prague
Motivation
Full text, acknowledgement and the list of references in the proceedings of LREC 2006.
Manual Annotation
Automatic Word Alignment
Types of connections used to compare annotations:

                                  Possible, Sure, Phrasal    Connection of Any Type
Annotator   A1                    15,476                     15,399
            A2                    16,631                     16,246
Mismatch    A1 but not A2          2,343                      1,146
            A2 but not A1          3,498                      1,714
Relative mismatch                 18.2 %                      9.0 %
                                          Intersection (1-1)        Union (n-n)
                                          Prec    Rec    AER        Prec    Rec    AER
Baseline                                  97.4    57.6   27.4       65.9    86.7   25.5
Lemmas                                    97.9    75.0   15.0       77.1    89.8   17.2
Lemmas + Numbers                          97.9    75.2   14.8       77.5    89.9   17.0
Lemmas + Singletons backed off with POS   97.4    75.8   14.6       77.8    88.5   17.4
Humans      GIZA++        Baseline          Improved
                          en      cs        en      cs
Problems    Problems      14.3    15.5      14.3    15.5
Problems    OK             0.1     0.1       0.2     0.1
OK          Problems      38.6    35.7      25.2    25.0
OK          OK            46.9    48.7      60.4    59.4
Problematic Words                  Problematic Parts of Speech
English          Czech             English           Czech
361   to         319   ,           679   IN          1348   N
259   the        271   se          519   DT          1283   V
159   of         146   v           510   NN           661   R
143   a          112   na          386   PRP          505   P
124   ,           74   o           361   TO           448   Z
107   be          61   že          327   VB           398   A
 99   it          55   .           310   JJ           280   D
 95   that        47   a           245   RB           192   J
 84   in          41   bude        216   NNP           59   C
 80   by          37   k           199   VBN           22   T
  …                …                 …                  …
                                        Czech       English
Sentences                                    21,141
Running Words                           475,719     494,349
Running Words without Punctuation       404,523     439,304
Baseline              Vocabulary         57,085      30,770
                      Singletons         31,458      14,637
Lemmas                Vocabulary         28,007      25,000
                      Singletons         13,009      11,873
Lemmas + Singletons   Vocabulary         15,041      13,150
                      Singletons             12           2
Where GIZA++ Fails, Humans Were Often in Trouble, Too
Details about the Prague Czech-English Dependency Treebank
• Two human annotations compared against each other.
• GIZA++ compared against golden alignments (i.e. merged human annotations).
Out of all the positions where GIZA++ failed, 38% were problematic for humans, too. The improvement thanks to lemmatization is not observed on words that are difficult for humans anyway.
• Source: Wall Street Journal section of the Penn Treebank
• Translated sentence-by-sentence to Czech.
GIZA++ is used twice (Cs->En and En->Cs). The two guessed alignments can then be merged using union, intersection or possibly other techniques.
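The merging step above can be sketched in a few lines; this is an illustration of the union/intersection symmetrization idea, not the actual GIZA++ pipeline, and it assumes each directional alignment has already been converted to a set of (source position, target position) pairs:

```python
# Symmetrizing two directional word alignments (sketch, not GIZA++ itself).
# Both inputs are sets of (src_pos, tgt_pos) pairs in the same orientation.

def merge_alignments(src2tgt, tgt2src, method="intersection"):
    if method == "intersection":
        return src2tgt & tgt2src   # keep only links both runs agree on
    if method == "union":
        return src2tgt | tgt2src   # keep every link either run guessed
    raise ValueError("unknown method: " + method)

# Toy example with hypothetical alignments:
a = {(0, 0), (1, 2), (2, 1)}
b = {(0, 0), (2, 1), (3, 3)}
print(sorted(merge_alignments(a, b)))             # [(0, 0), (2, 1)]
print(sorted(merge_alignments(a, b, "union")))    # [(0, 0), (1, 2), (2, 1), (3, 3)]
```

Intersection yields high-precision 1-1 links, while union yields high-recall n-n links, matching the two column groups in the results table.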
Motivation to manually annotate word alignment:
• to create evaluation data for automatic alignment methods
• to learn more about inter-annotator agreement and the limits of the task
• both annotators mark a sure connection → required connection
• one of the annotators chooses a sure connection and the other any other connection type → required connection
• at least one of the annotators chooses any connection type → allowed connection
• otherwise → connection not allowed
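The merging rules above can be sketched as follows. This is an illustrative implementation, assuming each annotator's output is a dict mapping word pairs to a connection type ("sure", "possible", "phrasal"), with unconnected pairs absent:

```python
# Merging two manual annotations into a gold standard (sketch of the
# rules above; data layout is an assumption, not the original format).

def merge_annotations(a1, a2):
    required, allowed = set(), set()
    for pair in set(a1) | set(a2):
        t1, t2 = a1.get(pair), a2.get(pair)
        if t1 == "sure" and t2 == "sure":
            required.add(pair)            # both sure -> required
        elif (t1 == "sure" and t2) or (t2 == "sure" and t1):
            required.add(pair)            # sure + any other type -> required
        else:
            allowed.add(pair)             # at least one connection -> allowed
    return required, allowed
```

For evaluation, the allowed set is usually taken to include the required links as well, so that required links never count as superfluous.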
Two annotators independently annotated 515 sentences using 3 main connection types:
• the word has no counterpart (null)
• the words can be possibly linked (possible)
• the words are translations of each other (sure)
Additionally, some segments could have been marked as phrasal translations:
• whole phrases correspond, but not the individual words (phrasal)
Top Ten Problematic Words and POSes
Steps in statistical machine translation:
The mismatch rate is relatively high, but it drops to half if differences in connection type are disregarded.
Preprocessing of the input text such as lemmatization significantly reduces data sparseness (see the table Details about the PCEDT below) and helps to achieve better alignments:
English Penn Treebank Tag-Set: IN - preposition or subordinating conjunction, DT - determiner, NN - common noun (singular or mass), PRP - personal pronoun, TO - to, VB - verb (base form), JJ - adjective, NNP - proper noun (singular), VBN - verb (past participle).
Czech Tag-Set: N - noun, V - verb, R - preposition, P - pronoun, Z - punctuation or sentence border, A - adjective, D - adverb, J - conjunction, C - number, T - particle.
• Verbs and their belongings, including the negative particle.
• English articles in cases where the rule “connect to the Czech governing noun” cannot be clearly applied.
• Punctuation: commas are used more frequently in Czech; the dollar symbol ($) is almost always translated and thus rarely repeated in Czech.
Most Frequent Problematic Cases
Sentence-parallel corpus → Automatic word alignment → Phrase extraction
Evaluation metrics: precision penalizes superfluous connections (connections generated automatically but not even allowed); recall penalizes forgotten required connections. Alignment error rate (AER) combines precision and recall.
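These metrics can be sketched directly from their standard definitions (Och and Ney, 2003): with A the automatic links, S the required ("sure") links, and P the allowed links (S a subset of P), precision = |A∩P|/|A|, recall = |A∩S|/|S|, and AER = 1 − (|A∩S| + |A∩P|)/(|A| + |S|):

```python
# Word-alignment evaluation metrics (standard definitions; the set-based
# data layout is an assumption for illustration).
# A = automatic links, S = required links, P = allowed links (S ⊆ P).

def precision(A, P):
    return len(A & P) / len(A)      # superfluous links lower precision

def recall(A, S):
    return len(A & S) / len(S)      # forgotten required links lower recall

def aer(A, S, P):
    return 1 - (len(A & S) + len(A & P)) / (len(A) + len(S))
```

Note that AER is 0 only when every required link is found and no disallowed link is proposed.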
GIZA++ (Och and Ney, 2003) automatically creates asymmetric alignments (1 source word connected to n target words).
The test set for GIZA++ was created by merging the two human annotations:
The following table displays the percentage of tokens where there was a match (OK) or mismatch (Problems) in the respective languages:
Phrase table ~ a translation dictionary of multi-word expressions, extracted from word-to-word alignments.
Baseline (raw input text)                 Zisk      se     vyšvihl      na     117   milionů   dolarů
Lemmas                                    zisk      se-1   vyšvihnout   na-1   117   milion    dolar
Lemmas + Numbers                          zisk      se-1   vyšvihnout   na-1   NUM   milion    dolar
Lemmas + Singletons backed off with POS   zisk      se-1   VERB         na-1   117   milion    dolar
Gloss                                     Revenue   refl   soared       to     117   million   dollar
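The preprocessing variants above can be sketched as a single pass over lemmatized input. This is an illustration under assumptions: the lemmatizer output, POS tags, and the precomputed singleton set are all hypothetical stand-ins, not the original toolchain:

```python
# Sketch of the preprocessing variants: replace digit tokens with NUM
# and/or back off singleton lemmas to their POS tag.
import re

def preprocess(lemmas, pos_tags, singletons,
               numbers_to_NUM=False, singletons_to_pos=False):
    out = []
    for lemma, pos in zip(lemmas, pos_tags):
        if numbers_to_NUM and re.fullmatch(r"\d+", lemma):
            out.append("NUM")           # collapse all numbers into one token
        elif singletons_to_pos and lemma in singletons:
            out.append(pos)             # back off rare words to their POS
        else:
            out.append(lemma)
    return out

lemmas = ["zisk", "se-1", "vyšvihnout", "na-1", "117", "milion", "dolar"]
pos    = ["NOUN", "REFL", "VERB", "PREP", "NUM", "NOUN", "NOUN"]
print(preprocess(lemmas, pos, set(), numbers_to_NUM=True))
print(preprocess(lemmas, pos, {"vyšvihnout"}, singletons_to_pos=True))
```

Both transformations shrink the vocabulary (see the PCEDT statistics table), which is what reduces data sparseness for GIZA++.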
Results of automatic word alignment: