
Transcript of Czech-English Word Alignment, Ondřej Bojar (obo@cuni.cz), Magdalena Prokopová (magda.prokopova@gmail.com)

Page 1: Czech-English Word Alignment, Ondřej Bojar (obo@cuni.cz), Magdalena Prokopová (magda.prokopova@gmail.com)


Institute of Formal and Applied Linguistics, ÚFAL MFF, Charles University in Prague

Motivation

Full text, acknowledgements and the list of references are in the proceedings of LREC 2006.

Manual Annotation

Automatic Word Alignment

Types of connections used to compare annotations:

                              Possible, Sure, Phrasal    Connection of Any Type
Annotator A1                  15,476                     15,399
Annotator A2                  16,631                     16,246
Mismatch: A1 but not A2        2,343                      1,146
Mismatch: A2 but not A1        3,498                      1,714
Relative mismatch               18.2 %                      9.0 %
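The poster does not state how the relative mismatch is computed; one reading consistent with the numbers above is the total of one-sided mismatches divided by the total number of connections drawn by the two annotators:

\text{relative mismatch} = \frac{|A_1 \setminus A_2| + |A_2 \setminus A_1|}{|A_1| + |A_2|} = \frac{2343 + 3498}{15476 + 16631} \approx 18.2\,\%

The same formula applied to the "Connection of Any Type" column gives (1,146 + 1,714) / (15,399 + 16,246) ≈ 9.0 %.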

Results of automatic word alignment (precision, recall and AER in %):

                                           Intersection (1-1)        Union (n-n)
                                           Prec   Rec    AER         Prec   Rec    AER
Baseline                                   97.4   57.6   27.4        65.9   86.7   25.5
Lemmas                                     97.9   75.0   15.0        77.1   89.8   17.2
Lemmas + Numbers                           97.9   75.2   14.8        77.5   89.9   17.0
Lemmas + Singletons backed off with POS    97.4   75.8   14.6        77.8   88.5   17.4

Humans      GIZA++       Baseline (en / cs)    Improved (en / cs)
Problems    Problems        14.3 / 15.5           14.3 / 15.5
Problems    OK               0.1 /  0.1            0.2 /  0.1
OK          Problems        38.6 / 35.7           25.2 / 25.0
OK          OK              46.9 / 48.7           60.4 / 59.4

Problematic Words                     Problematic Parts of Speech
English            Czech              English            Czech
361   to           319   ,            679   IN           1348   N
259   the          271   se           519   DT           1283   V
159   of           146   v            510   NN            661   R
143   a            112   na           386   PRP           505   P
124   ,             74   o            361   TO            448   Z
107   be            61   že           327   VB            398   A
 99   it            55   .            310   JJ            280   D
 95   that          47   a            245   RB            192   J
 84   in            41   bude         216   NNP            59   C
 80   by            37   k            199   VBN            22   T
  …                  …                  …                   …

                                           Czech       English
Sentences                                        21,141
Running Words                              475,719     494,349
Running Words without Punctuation          404,523     439,304
Baseline
  Vocabulary                                57,085      30,770
  Singletons                                31,458      14,637
Lemmas
  Vocabulary                                28,007      25,000
  Singletons                                13,009      11,873
Lemmas + Singletons backed off with POS
  Vocabulary                                15,041      13,150
  Singletons                                    12           2
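A minimal sketch of how the vocabulary and singleton counts above could be obtained from a tokenized (or lemmatized) corpus; the function name and input format are illustrative, not taken from the poster:

from collections import Counter

def vocab_stats(sentences):
    # sentences: iterable of token (or lemma) lists, one list per sentence
    counts = Counter(tok for sent in sentences for tok in sent)
    vocabulary = len(counts)                                # number of distinct types
    singletons = sum(1 for c in counts.values() if c == 1)  # types seen exactly once
    return vocabulary, singletons

# e.g. vocab_stats([["zisk", "se", "vyšvihl"], ["zisk", "rostl"]]) returns (4, 3)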

Where GIZA++ Fails, Humans Were Often in Trouble, Too

Details about the Prague Czech-English Dependency Treebank

• Two human annotations compared against each other.
• GIZA++ compared against golden alignments (i.e. merged human annotations).

Out of all the positions where GIZA++ failed, 38% were problematic for humans. The improvement from lemmatization is not observed on words that are difficult for humans anyway.

• Source: Wall Street Journal section of the Penn Treebank
• Translated sentence-by-sentence to Czech.

GIZA++ is run twice (cs->en and en->cs). The two guessed alignments can be merged using union, intersection or possibly other techniques.
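A minimal sketch of the intersection and union merging of the two asymmetric runs, with alignments represented as sets of (Czech position, English position) pairs; the function name is illustrative, not taken from the poster:

def symmetrize(cs_to_en, en_to_cs):
    # cs_to_en: set of (cs_pos, en_pos) pairs from the cs->en run
    # en_to_cs: set of (en_pos, cs_pos) pairs from the en->cs run,
    #           flipped below so both sets use (cs_pos, en_pos)
    flipped = {(cs, en) for (en, cs) in en_to_cs}
    intersection = cs_to_en & flipped   # high precision, keeps only 1-1 links
    union = cs_to_en | flipped          # high recall, allows n-n links
    return intersection, union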

Motivation to manually annotate word alignment:
• to create evaluation data for automatic alignment methods
• to learn more about inter-annotator agreement and the limits of the task

• both annotators mark a sure connection → required connection
• one annotator chooses a sure connection and the other any other connection type → required connection
• at least one annotator chooses any connection type → allowed connection
• otherwise → connection not allowed
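A minimal sketch of these merging rules, assuming each annotator's output is a dict mapping a word pair to its connection type ('sure', 'possible' or 'phrasal'); the names and data layout are illustrative, not taken from the poster:

def merge_annotations(a1, a2):
    # a1, a2: dict (en_word_id, cs_word_id) -> 'sure' | 'possible' | 'phrasal'
    required, allowed = set(), set()
    for pair in set(a1) | set(a2):
        t1, t2 = a1.get(pair), a2.get(pair)
        if 'sure' in (t1, t2) and None not in (t1, t2):
            required.add(pair)   # both sure, or one sure and the other any other type
        allowed.add(pair)        # any connection by at least one annotator is allowed
    return required, allowed     # required is a subset of allowed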

Two annotators independently annotated 515 sentences using 3 main connection types:
• the word has no counterpart (null)
• the words can be possibly linked (possible)
• the words are translations of each other (sure)

Additionally, some segments could have been marked as phrasal translations:
• whole phrases correspond, but not the individual words (phrasal)

Top Ten Problematic Words and POSes

Steps in statistical machine translation: sentence-parallel corpus → automatic word alignment → word-to-word alignments → phrase extraction → phrase table (~ a translation dictionary of multi-word expressions).

The mismatch rate is relatively high, but it drops to about a half if the differences in connection type are disregarded.

Preprocessing of the input text, such as lemmatization, significantly reduces data sparseness (see the table Details about the Prague Czech-English Dependency Treebank) and helps to achieve better alignments (see the results of automatic word alignment above).

English Penn Treebank Tag-Set: IN - Preposition or subordinating conjunction, DT - Determiner, NN - Noun, common, singular or mass, PRP - Pronoun, personal, TO - to, VB - Verb, base form, JJ - Adjective, RB - Adverb, NNP - Noun, proper, singular, VBN - Verb, past participle.
Czech Tag-Set: N - Noun, V - Verb, R - Preposition, P - Pronoun, Z - Punctuation, sentence border, A - Adjective, D - Adverb, J - Conjunction, C - Number, T - Particle.

Most Frequent Problematic Cases
• Verbs and their accompanying words, including the negative particle.
• English articles in cases where the rule “connect to the Czech governing noun” cannot be clearly applied.
• Punctuation: commas are used more frequently in Czech; the dollar symbol ($) is almost always translated and thus rarely repeated in Czech.


Evaluation metrics: Precision penalizes superfluous connections (connections generated automatically but not even allowed), recall penalizes forgotten required connections. Alignment-error rate (AER) is a combination of precision and recall.
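The poster does not spell the metrics out; the standard definitions from Och and Ney (2003), with A the automatically produced alignment, S the required (sure) connections and P the allowed (possible) connections of the golden alignment, are:

\text{precision} = \frac{|A \cap P|}{|A|}, \qquad \text{recall} = \frac{|A \cap S|}{|S|}, \qquad \text{AER} = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}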

GIZA++ (Och and Ney, 2003) automatically creates asymmetric alignments (1 source word connected to n target words).

The test set for GIZA++ was created by merging the two human annotations (see the connection-merging rules listed above).

The table under Where GIZA++ Fails, Humans Were Often in Trouble, Too displays the percentage of tokens where there was a match (OK) or mismatch (Problems) in the respective languages.


Baseline (raw input text)                  Zisk      se     vyšvihl      na     117   milionů   dolarů
Lemmas                                     zisk      se-1   vyšvihnout   na-1   117   milion    dolar
Lemmas + Numbers                           zisk      se-1   vyšvihnout   na-1   NUM   milion    dolar
Lemmas + Singletons backed off with POS    zisk      se-1   VERB         na-1   117   milion    dolar
Gloss                                      Revenue   refl   soared       to     117   million   dollar
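A minimal sketch of the two extra preprocessing steps illustrated above, applied on top of lemmatization; the regular expression and the function names are illustrative, not taken from the poster:

import re

def bucket_numbers(lemmas):
    # "Lemmas + Numbers": collapse every numeric token into the single token NUM
    return ["NUM" if re.fullmatch(r"\d+([.,]\d+)?", tok) else tok for tok in lemmas]

def back_off_singletons(lemmas, pos_tags, lemma_counts):
    # "Lemmas + Singletons backed off with POS": a lemma seen only once in the
    # whole corpus is replaced by its part-of-speech tag (e.g. vyšvihnout -> VERB)
    return [pos if lemma_counts[tok] == 1 else tok
            for tok, pos in zip(lemmas, pos_tags)]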
