Machine Translation 2, Autumn 2008, Lecture 17, 4 Sep 2008.
Machine Translation- 2
Autumn 2008
Lecture 17
4 Sep 2008
Statistical Machine Translation
Goal: Given foreign sentence f:
“Maria no dio una bofetada a la bruja verde”
Find the most likely English translation e: “Maria did not slap the green witch”
Statistical Machine Translation
Most likely English translation e is given by:
ê = argmax_e P(e | f)
P(e|f) estimates the conditional probability of any e given f
What makes a good translation
Translators often talk about two factors we want to maximize:
Faithfulness or fidelity: How close is the meaning of the translation to the meaning of the original? (Even better: does the translation cause the reader to draw the same inferences as the original would have?)
Fluency or naturalness: How natural the translation is, just considering its fluency in the target language
Statistical MT Systems
[Diagram: a Spanish sentence is mapped to broken English and then to fluent English; statistical analysis of Spanish/English bilingual text trains the first step, and statistical analysis of English text trains the second]
Que hambre tengo yo
What hunger have I, Hungry I am so, I am so hungry, Have I that hunger, …
I am so hungry
Statistical MT Systems
Que hambre tengo yo I am so hungry
Translation Model P(s|e)
Language Model P(e)
Decoding algorithm: argmax_e P(e) * P(s|e)
Statistical MT: Faithfulness and Fluency formalized! Best translation T̂ of a source sentence S:
Developed by researchers who were originally in speech recognition at IBM
Called the IBM model
T̂ = argmax_T fluency(T) * faithfulness(T, S)
Three Problems for Statistical MT
Language model: Given an English string e, assigns P(e) by formula; good English string -> high P(e); random word sequence -> low P(e)
Translation model: Given a pair of strings <f,e>, assigns P(f | e) by formula; <f,e> look like translations -> high P(f | e); <f,e> don’t look like translations -> low P(f | e)
Decoding algorithm: Given a language model, a translation model, and a new sentence f … find translation e maximizing P(e) * P(f | e)
Parallel Corpus
Example from DE-News (8/1/1996)
English German
Diverging opinions about planned tax reform
Unterschiedliche Meinungen zur geplanten Steuerreform
The discussion around the envisaged major tax reform continues .
Die Diskussion um die vorgesehene grosse Steuerreform dauert an .
The FDP economics expert , Graf Lambsdorff , today came out in favor of advancing the enactment of significant parts of the overhaul , currently planned for 1999 .
Der FDP - Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus , wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen .
Word-Level Alignments
Given a parallel sentence pair we can link (align) words or phrases that are translations of each other:
Parallel Resources
Newswire: DE-News (German-English), Hong-Kong News, Xinhua News (Chinese-English)
Government: Canadian-Hansards (French-English), Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish), UN Treaties (Russian, English, Arabic, …)
Manuals: PHP, KDE, OpenOffice (all from OPUS, many languages)
Web pages: STRAND project (Philip Resnik)
Sentence Alignment
If document De is translation of document Df how do we find the translation for each sentence?
The n-th sentence in De is not necessarily the translation of the n-th sentence in document Df
In addition to 1:1 alignments, there are also 1:0, 0:1, 1:n, and n:1 alignments
Approximately 90% of the sentence alignments are 1:1
Sentence Alignment (cont’d)
There are several sentence alignment algorithms:
Align (Gale & Church): Aligns sentences based on their character length (shorter sentences tend to have shorter translations than longer sentences). Works astonishingly well
Char-align (Church): Aligns based on shared character sequences. Works fine for similar languages or technical domains
K-Vec (Fung & Church): Induces a translation lexicon from the parallel texts based on the distribution of foreign-English word pairs.
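The length-based idea behind Gale & Church can be sketched as a small dynamic program over alignment "beads" (1:1, 1:0, 0:1, 2:1, 1:2). This is an illustrative simplification: the real algorithm scores beads with a probabilistic cost derived from a Gaussian model of length ratios, whereas the absolute character-length difference and fixed bead penalty below are assumptions of this sketch.

```python
def align_sentences(src, tgt, penalty=10):
    """Toy Gale & Church-style sentence alignment: DP over beads
    (1:1, 1:0, 0:1, 2:1, 1:2), scoring each bead by the character-length
    mismatch of the grouped sentences plus a penalty for non-1:1 beads."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    bead_types = [(1, 1, 0), (1, 0, penalty), (0, 1, penalty),
                  (2, 1, penalty), (1, 2, penalty)]
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == INF:
                continue
            for di, dj, pen in bead_types:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                ls = sum(len(s) for s in src[i:ni])
                lt = sum(len(t) for t in tgt[j:nj])
                c = best[i][j] + abs(ls - lt) + pen
                if c < best[ni][nj]:
                    best[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Walk the back pointers to recover the bead sequence.
    out, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        out.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return out[::-1]

src = ["The old man is happy.", "He has fished many times.",
       "His wife talks to him.", "The sharks await."]
tgt = ["El viejo está feliz porque ha pescado muchos veces.",
       "Su mujer habla con él.", "Los tiburones esperan."]
beads = align_sentences(src, tgt)
print(beads[0])  # ([0, 1], [0]): the first two English sentences align 2:1
```

On this toy example even the crude length cost recovers the 2:1 merge of the first two English sentences into one Spanish sentence.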
Computing Translation Probabilities
Given a parallel corpus we can estimate P(e | f). The maximum likelihood estimation of P(e | f) is: freq(e,f)/freq(f)
Way too specific to get any reasonable frequencies! Vast majority of unseen data will have zero counts!
P(e | f) could be re-defined as:
P(e | f) ≈ ∏_j max_{ei} P(ei | fj)
Problem: The English words maximizing P(e | f) might not result in a readable sentence
Computing Translation Probabilities (cont’d)
We can account for adequacy: each foreign word translates into its most likely English word
We cannot guarantee that this will result in a fluent English sentence
Solution: transform P(e | f) with Bayes’ rule: P(e | f) = P(e) P(f | e) / P(f)
P(f | e) accounts for adequacy P(e) accounts for fluency
Decoding
The decoder combines the evidence from P(e) and P(f | e) to find the sequence e that is the best translation:
ê = argmax_e P(e | f) = argmax_e P(f | e) P(e)
The choice of word e’ as translation of f’ depends on the translation probability P(f’ | e’) and on the context, i.e. other English words preceding e’
Noisy Channel Model for Translation
Noisy Channel Model
Generative story:
Generate e with probability P(e)
Pass e through the noisy channel
Out comes f with probability P(f|e)
Translation task: Given f, deduce the most likely e that produced f, or:
ê = argmax_e P(e | f)
Translation Model
How to model P(f|e)? Learn parameters of P(f|e) from a bilingual corpus S of
sentence pairs <ei,fi> :
< e1,f1 > = <the blue witch, la bruja azul>
< e2,f2 > = <green, verde>
…
< eS,fS > = <the witch, la bruja>
Translation Model
Insufficient data in parallel corpus to estimate P(f|e) at the sentence level (Why?)
Decompose process of translating e -> f into small steps whose probabilities can be estimated
Translation Model
English sentence e = e1…el
Foreign sentence f = f1…fm
Alignment A = {a1…am}, where aj ∈ {0…l}
A indicates which English word generates each foreign word
Alignments
e: “the blue witch”
f: “la bruja azul”
A = {1,3,2} (intuitively “good” alignment)
Alignments
e: “the blue witch”
f: “la bruja azul”
A = {1,1,1} (intuitively “bad” alignment)
Alignments
e: “the blue witch”
f: “la bruja azul”
(illegal alignment!)
Alignments
Question: how many possible alignments are there for a given e and f, where |e| = l and |f| = m?
Alignments
Question: how many possible alignments are there for a given e and f, where |e| = l and |f| = m?
Answer: Each foreign word can align with any one of |e| = l words, or it can remain unaligned
Each foreign word has (l + 1) choices for an alignment, and there are |f| = m foreign words
So, there are (l+1)^m alignments for a given e and f
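The count above is easy to verify by brute force: each foreign position independently picks one of l+1 English positions (0 standing for NULL/unaligned), so the alignments are exactly the Cartesian product {0..l}^m.

```python
from itertools import product

def all_alignments(l, m):
    """Enumerate every alignment for |e| = l, |f| = m: each foreign
    position j independently picks a_j in {0..l}, where 0 means the
    NULL word (unaligned). Total count: (l+1)**m."""
    return list(product(range(l + 1), repeat=m))

# "the blue witch" / "la bruja azul": l = 3, m = 3
A = all_alignments(3, 3)
print(len(A))          # (3+1)**3 = 64
print((1, 3, 2) in A)  # True: the intuitively "good" alignment is among them
```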
Alignments
Question: If all alignments are equally likely, what is the probability of any one alignment, given e?
Alignments
Question: If all alignments are equally likely, what is the probability of any one alignment, given e?
Answer: P(A|e) = p(|f| = m) * 1/(l+1)^m
If we assume that p(|f| = m) is uniform over all possible values of |f|, then we can let p(|f| = m) = C
P(A|e) = C/(l+1)^m
Generative Story
e: “blue witch”
f: “bruja azul”
How do we get from e to f?
Language Modeling
Determines the probability of some English sequence e1…el of length l
P(e) is hard to estimate directly, unless l is very small:
P(e1…el) = P(e1) * ∏_{i=2…l} P(ei | e1…ei−1)
P(e) is normally approximated as:
P(e1…el) ≈ P(e1) P(e2 | e1) * ∏_{i=3…l} P(ei | ei−m…ei−1)
where m is the size of the context, i.e. the number of previous words that are considered, normally m=2 (tri-gram language model)
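The trigram approximation above reduces LM training to counting: P(ei | ei−2, ei−1) = c(ei−2, ei−1, ei) / c(ei−2, ei−1). A minimal MLE sketch (no smoothing, which any real LM would need; the padding symbols are my own convention):

```python
from collections import defaultdict

def train_trigram_lm(sentences):
    """MLE trigram model: P(w | w1 w2) = count(w1,w2,w) / count(w1,w2).
    Sentences are padded with <s> <s> ... </s>. No smoothing: unseen
    trigrams get probability 0, so this is only a sketch."""
    tri = defaultdict(int)
    bi = defaultdict(int)
    for sent in sentences:
        words = ["<s>", "<s>"] + sent + ["</s>"]
        for a, b, c in zip(words, words[1:], words[2:]):
            tri[(a, b, c)] += 1
            bi[(a, b)] += 1
    def p(w, w1, w2):
        return tri[(w1, w2, w)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return p

p = train_trigram_lm([["i", "am", "so", "hungry"], ["i", "am", "happy"]])
print(p("am", "<s>", "i"))  # 1.0: "am" always follows "<s> i" here
print(p("so", "i", "am"))   # 0.5: "i am" continues with "so" or "happy"
```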
Translation Modeling
Determines the probability that the foreign word f is a translation of the English word e
How to compute P(f | e) from a parallel corpus?
Statistical approaches rely on the co-occurrence of e and f in the parallel data: If e and f tend to co-occur in parallel sentence pairs, they are likely to be translations of one another
Finding Translations in a Parallel Corpus
Into which foreign words f, …, f’ does e translate? Commonly, four factors are used:
How often do e and f co-occur? (translation)
How likely is a word occurring at position i to translate into a word occurring at position j? (distortion) For example: English is a verb-second language, whereas German is a verb-final language
How likely is e to translate into more than one word? (fertility) For example: defeated can translate into eine Niederlage erleiden
How likely is a foreign word to be spuriously generated? (null translation)
Translation Steps
IBM Models 1–5
Model 1: Bag of words; unique local maxima; efficient EM algorithm (Models 1–2)
Model 2: General alignment: a(e_pos | f_pos, e_length, f_length)
Model 3: Fertility: n(k | e); no full EM, count only neighbors (Models 3–5); deficient (Models 3–4)
Model 4: Relative distortion, word classes
Model 5: Extra variables to avoid deficiency
IBM Model 1
Model parameters:
T(fj | eaj) = translation probability of foreign word given English word that generated it
IBM Model 1
Generative story:
Given e:
Pick m = |f|, where all lengths m are equally probable
Pick A with probability P(A|e) = 1/(l+1)^m, since all alignments are equally likely given l and m
Pick f1…fm with probability
P(f | A, e) = ∏_{j=1…m} T(fj | eaj)
where T(fj | eaj) is the translation probability of fj given the English word it is aligned to
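The generative story multiplies out directly. A sketch of the resulting joint probability P(f, A | e), using a hand-made toy translation table (the probabilities in T below are assumptions for illustration, not learned values):

```python
def p_f_and_a_given_e(f, a, e, T, C=1.0):
    """IBM Model 1 joint probability of a foreign sentence and one
    alignment: P(f, A | e) = C/(l+1)^m * prod_{j=1..m} T(f_j | e_{a_j}).
    a is 1-based into e; a_j = 0 means the NULL word."""
    e_null = ["NULL"] + list(e)
    l, m = len(e), len(f)
    p = C / (l + 1) ** m
    for j in range(m):
        p *= T[(f[j], e_null[a[j]])]
    return p

# Toy table (assumed values)
T = {("la", "the"): 0.9, ("bruja", "witch"): 0.8, ("azul", "blue"): 0.7}
p = p_f_and_a_given_e(["la", "bruja", "azul"], [1, 3, 2],
                      ["the", "blue", "witch"], T)
print(p)  # 0.9 * 0.8 * 0.7 / 4**3
```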
IBM Model 1 Example
e: “blue witch”
IBM Model 1 Example
e: “blue witch”
f: “f1 f2”
Pick m = |f| = 2
IBM Model 1 Example
e: “blue witch”
f: “f1 f2”
Pick A = {2,1} with probability 1/(l+1)^m
IBM Model 1 Example
e: “blue witch”
f: “bruja f2”
Pick f1 = “bruja” with probability t(bruja|witch)
IBM Model 1 Example
e: “blue witch”
f: “bruja azul”
Pick f2 = “azul” with probability t(azul|blue)
IBM Model 1: Parameter Estimation
How does this generative story help us to estimate P(f|e) from the data?
Since the model for P(f|e) contains the parameter T(fj | eaj ), we first need to estimate T(fj | eaj )
IBM Model 1: Parameter Estimation
How to estimate T(fj | eaj) from the data?
If we had the data and the alignments A, along with P(A|f,e), then we could estimate T(fj | eaj) using expected counts as follows:
T(fj | eaj) = Count(fj, eaj) / ∑_{f′} Count(f′, eaj)
IBM Model 1: Parameter Estimation
How to estimate P(A|f,e)?
P(A|f,e) = P(A,f|e) / P(f|e)
But
P(f|e) = ∑_A P(A, f | e)
So we need to compute P(A,f|e)… This is given by the Model 1 generative story:
P(A, f | e) = C/(l+1)^m * ∏_{j=1…m} T(fj | eaj)
IBM Model 1 Example
e: “the blue witch”
f: “la bruja azul”
P(A|f,e) = P(f,A|e) / P(f|e) =
[ C/4^3 * t(la|the) * t(bruja|witch) * t(azul|blue) ] / [ ∑_A C/4^3 * ∏_j T(fj | eaj) ]
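Because the constant C/(l+1)^m appears in both numerator and denominator, it cancels, and for short sentences the posterior can be computed by brute-force enumeration of all (l+1)^m alignments. A sketch with an assumed toy translation table:

```python
from itertools import product

def alignment_posterior(f, e, T):
    """P(A | f, e) under Model 1 by enumeration:
    P(A | f, e) = prod_j T(f_j|e_{a_j}) / sum_{A'} prod_j T(f_j|e_{a'_j}).
    Index 0 is the NULL word; T is a toy table, not learned values."""
    e_null = ["NULL"] + list(e)
    scores = {}
    for a in product(range(len(e) + 1), repeat=len(f)):
        p = 1.0
        for j, aj in enumerate(a):
            p *= T.get((f[j], e_null[aj]), 0.0)
        scores[a] = p
    z = sum(scores.values())
    return {a: p / z for a, p in scores.items()} if z else scores

# Assumed toy table
T = {("bruja", "witch"): 0.8, ("azul", "blue"): 0.7,
     ("bruja", "blue"): 0.1, ("azul", "witch"): 0.1,
     ("bruja", "NULL"): 0.05, ("azul", "NULL"): 0.05}
post = alignment_posterior(["bruja", "azul"], ["blue", "witch"], T)
print(max(post, key=post.get))  # (2, 1): bruja->witch, azul->blue
```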
IBM Model 1: Parameter Estimation
So, in order to estimate P(f|e), we first need to estimate the model parameter
T(fj | eaj )
In order to compute T(fj | eaj ) , we need to estimate P(A|f,e)
And in order to compute P(A|f,e), we need to estimate T(fj | eaj )…
IBM Model 1: Parameter Estimation
Training data is a set of pairs <ei, fi>
Log likelihood of training data given model parameters is:
∑_i log P(fi | ei) = ∑_i log ∑_A P(A | ei) * P(fi | A, ei)
To maximize log likelihood of training data given model parameters, use EM:
hidden variable = alignments A
model parameters = translation probabilities T
EM
Initialize model parameters T(f|e)
Calculate alignment probabilities P(A|f,e) under current values of T(f|e)
Calculate expected counts from alignment probabilities
Re-estimate T(f|e) from these expected counts
Repeat until log likelihood of training data converges to a maximum
IBM Model 1 Example
Parallel ‘corpus’:
the dog :: le chien
the cat :: le chat
Step 1+2 (collect candidates and initialize uniformly):
P(le | the) = P(chien | the) = P(chat | the) = 1/3
P(le | dog) = P(chien | dog) = P(chat | dog) = 1/3
P(le | cat) = P(chien | cat) = P(chat | cat) = 1/3
P(le | NULL) = P(chien | NULL) = P(chat | NULL) = 1/3
IBM Model 1 Example
Step 3: Iterate
NULL the dog :: le chien
j=1
total = P(le | NULL) + P(le | the) + P(le | dog) = 1
tc(le | NULL) += P(le | NULL)/total = 0 + 0.333/1 = 0.333
tc(le | the) += P(le | the)/total = 0 + 0.333/1 = 0.333
tc(le | dog) += P(le | dog)/total = 0 + 0.333/1 = 0.333
j=2
total = P(chien | NULL) + P(chien | the) + P(chien | dog) = 1
tc(chien | NULL) += P(chien | NULL)/total = 0 + 0.333/1 = 0.333
tc(chien | the) += P(chien | the)/total = 0 + 0.333/1 = 0.333
tc(chien | dog) += P(chien | dog)/total = 0 + 0.333/1 = 0.333
IBM Model 1 Example
NULL the cat :: le chat
j=1
total = P(le | NULL) + P(le | the) + P(le | cat) = 1
tc(le | NULL) += P(le | NULL)/total = 0.333 + 0.333/1 = 0.666
tc(le | the) += P(le | the)/total = 0.333 + 0.333/1 = 0.666
tc(le | cat) += P(le | cat)/total = 0 + 0.333/1 = 0.333
j=2
total = P(chat | NULL) + P(chat | the) + P(chat | cat) = 1
tc(chat | NULL) += P(chat | NULL)/total = 0 + 0.333/1 = 0.333
tc(chat | the) += P(chat | the)/total = 0 + 0.333/1 = 0.333
tc(chat | cat) += P(chat | cat)/total = 0 + 0.333/1 = 0.333
IBM Model 1 Example
Re-compute translation probabilities:
total(the) = tc(le | the) + tc(chien | the) + tc(chat | the) = 0.666 + 0.333 + 0.333 = 1.333
P(le | the) = tc(le | the)/total(the) = 0.666 / 1.333 = 0.5
P(chien | the) = tc(chien | the)/total(the) = 0.333 / 1.333 = 0.25
P(chat | the) = tc(chat | the)/total(the) = 0.333 / 1.333 = 0.25
total(dog) = tc(le | dog) + tc(chien | dog) = 0.333 + 0.333 = 0.666
P(le | dog) = tc(le | dog)/total(dog) = 0.333 / 0.666 = 0.5
P(chien | dog) = tc(chien | dog)/total(dog) = 0.333 / 0.666 = 0.5
IBM Model 1 Example
Iteration 2: NULL the dog :: le chien
j=1
total = P(le | NULL) + P(le | the) + P(le | dog) = 0.5 + 0.5 + 0.5 = 1.5
tc(le | NULL) += P(le | NULL)/total = 0 + 0.5/1.5 = 0.333
tc(le | the) += P(le | the)/total = 0 + 0.5/1.5 = 0.333
tc(le | dog) += P(le | dog)/total = 0 + 0.5/1.5 = 0.333
j=2
total = P(chien | NULL) + P(chien | the) + P(chien | dog) = 0.25 + 0.25 + 0.5 = 1
tc(chien | NULL) += P(chien | NULL)/total = 0 + 0.25/1 = 0.25
tc(chien | the) += P(chien | the)/total = 0 + 0.25/1 = 0.25
tc(chien | dog) += P(chien | dog)/total = 0 + 0.5/1 = 0.5
IBM Model 1 Example
NULL the cat :: le chat
j=1
total = P(le | NULL) + P(le | the) + P(le | cat) = 0.5 + 0.5 + 0.5 = 1.5
tc(le | NULL) += P(le | NULL)/total = 0.333 + 0.5/1.5 = 0.666
tc(le | the) += P(le | the)/total = 0.333 + 0.5/1.5 = 0.666
tc(le | cat) += P(le | cat)/total = 0 + 0.5/1.5 = 0.333
j=2
total = P(chat | NULL) + P(chat | the) + P(chat | cat) = 0.25 + 0.25 + 0.5 = 1
tc(chat | NULL) += P(chat | NULL)/total = 0 + 0.25/1 = 0.25
tc(chat | the) += P(chat | the)/total = 0 + 0.25/1 = 0.25
tc(chat | cat) += P(chat | cat)/total = 0 + 0.5/1 = 0.5
IBM Model 1 Example
Re-compute translations (iteration 2):
total(the) = tc(le | the) + tc(chien | the) + tc(chat | the) = 0.666 + 0.25 + 0.25 = 1.166
P(le | the) = tc(le | the)/total(the) = 0.666 / 1.166 = 0.571
P(chien | the) = tc(chien | the)/total(the) = 0.25 / 1.166 = 0.214
P(chat | the) = tc(chat | the)/total(the) = 0.25 / 1.166 = 0.214
total(dog) = tc(le | dog) + tc(chien | dog) = 0.333 + 0.5 = 0.833
P(le | dog) = tc(le | dog)/total(dog) = 0.333 / 0.833 = 0.4
P(chien | dog) = tc(chien | dog)/total(dog) = 0.5 / 0.833 = 0.6
IBM Model 1 Example
After 5 iterations:
P(le | NULL) = 0.755608028335301
P(chien | NULL) = 0.122195985832349
P(chat | NULL) = 0.122195985832349
P(le | the) = 0.755608028335301
P(chien | the) = 0.122195985832349
P(chat | the) = 0.122195985832349
P(le | dog) = 0.161943319838057
P(chien | dog) = 0.838056680161943
P(le | cat) = 0.161943319838057
P(chat | cat) = 0.838056680161943
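The whole training loop fits in a few lines. The sketch below follows the recipe from the preceding slides (uniform initialization over the foreign vocabulary, a NULL source word, expected counts, renormalize); the function and variable names are my own, but running it on the toy corpus reproduces the translation table shown above after 5 iterations.

```python
from collections import defaultdict

def train_model1(corpus, iterations=5):
    """IBM Model 1 EM training with a NULL source word.
    corpus: list of (english_words, foreign_words) pairs.
    Returns a dict-like t with t[(f, e)] = T(f | e)."""
    f_vocab = {f for (_, fs) in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform init
    for _ in range(iterations):
        count = defaultdict(float)    # expected counts tc(f | e)
        total_e = defaultdict(float)
        for e_sent, f_sent in corpus:
            e_sent = ["NULL"] + e_sent
            for f in f_sent:
                # normalize over the English words in this sentence pair
                z = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total_e[e] += c
        # M-step: renormalize expected counts into probabilities
        t = defaultdict(float,
                        {(f, e): c / total_e[e] for (f, e), c in count.items()})
    return t

corpus = [(["the", "dog"], ["le", "chien"]),
          (["the", "cat"], ["le", "chat"])]
t = train_model1(corpus, iterations=5)
print(round(t[("le", "the")], 4))     # 0.7556, matching the slide
print(round(t[("chien", "dog")], 4))  # 0.8381, matching the slide
```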
IBM Model 1 Recap
IBM Model 1 allows for an efficient computation of translation probabilities
No notion of fertility, i.e., it’s possible that the same English word is the best translation for all foreign words
No positional information, i.e., depending on the language pair, there might be a tendency that words occurring at the beginning of the English sentence are more likely to align to words at the beginning of the foreign sentence
IBM Model 2
Model parameters:
T(fj | eaj) = translation probability of foreign word fj given English word eaj that generated it
d(i | j, l, m) = distortion probability, or probability that fj is aligned to ei, given l and m
IBM Model 3
Model parameters:
T(fj | eaj) = translation probability of foreign word fj given English word eaj that generated it
r(j | i, l, m) = reverse distortion probability, or probability of position j for fj, given its alignment to ei, l, and m
n(ei) = fertility of word ei, or number of foreign words aligned to ei
p1 = probability of generating a foreign word by alignment with the NULL English word
IBM Model 3
IBM Model 3 offers two additional features compared to IBM Model 1:
How likely is an English word e to align to k foreign words (fertility)?
Positional information (distortion): how likely is a word in position i to align to a word in position j?
IBM Model 3: Fertility
The best Model 1 alignment could be that a single English word aligns to all foreign words
This is clearly not desirable and we want to constrain the number of words an English word can align to
Fertility models a probability distribution that word e aligns to k words: n(k, e)
Consequence: translation probabilities cannot be computed independently of each other anymore
IBM Model 3 has to work with full alignments; note there are up to (l+1)^m different alignments
IBM Model 3
Generative Story:
Choose fertilities for each English word
Insert spurious words according to probability of being aligned to the NULL English word
Translate English words -> foreign words
Reorder words according to reverse distortion probabilities
IBM Model 3 Example
Consider the following example from [Knight 1999]: Maria did not slap the green witch
IBM Model 3 Example
Maria did not slap the green witch
Maria not slap slap slap the green witch
Choose fertilities: phi(Maria) = 1
IBM Model 3 Example
Maria did not slap the green witch
Maria not slap slap slap the green witch
Maria not slap slap slap NULL the green witch
Insert spurious words: p(NULL)
IBM Model 3 Example
Maria did not slap the green witch
Maria not slap slap slap the green witch
Maria not slap slap slap NULL the green witch
Maria no dio una bofetada a la verde bruja
Translate words: t(verde|green)
IBM Model 3 Example
Maria no dio una bofetada a la verde bruja
Maria no dio una bofetada a la bruja verde
Reorder words
IBM Model 3
For Models 1 and 2: We can compute exact EM updates
For Models 3 and 4: Exact EM updates cannot be efficiently computed
Use best alignments from previous iterations to initialize each successive model
Explore only the subspace of potential alignments that lies within the same neighborhood as the initial alignments
IBM Model 4
Model parameters: Same as Model 3, except it uses a more complicated model of reordering (for details, see Brown et al. 1993)
IBM Model 1 + Model 3
Iterating over all possible alignments is computationally infeasible
Solution: Compute the best alignment with Model 1 and change some of the alignments to generate a set of likely alignments (pegging)
Model 3 takes this restricted set of alignments as input
Pegging
Given an alignment a we can derive additional alignments from it by making small changes:
Changing a link (j,i) to (j,i’)
Swapping a pair of links (j,i) and (j’,i’) to (j,i’) and (j’,i)
The resulting set of alignments is called the neighborhood of a
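The two operations above (change one link, swap two links) can be sketched directly; representing an alignment as the vector a with a[j] ∈ {0..l}:

```python
def neighborhood(a, l):
    """All alignments reachable from a by changing one link or swapping
    two links. a[j] in {0..l} gives the English position (0 = NULL)
    for foreign word j. Includes a itself (the no-op change)."""
    out = {tuple(a)}
    m = len(a)
    for j in range(m):                 # moves: re-link foreign word j
        for i in range(l + 1):
            b = list(a)
            b[j] = i
            out.add(tuple(b))
    for j in range(m):                 # swaps: exchange two links
        for k in range(j + 1, m):
            b = list(a)
            b[j], b[k] = b[k], b[j]
            out.add(tuple(b))
    return out

nb = neighborhood([1, 2], 2)
print(len(nb))  # 6 distinct alignments, including (1, 2) itself
```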
IBM Model 3: Distortion
The distortion factor determines how likely it is that an English word in position i aligns to a foreign word in position j, given the lengths of both sentences:
d(j | i, l, m)
Note, positions are absolute positions
Deficiency
Problem with IBM Model 3: It assigns probability mass to impossible strings
Well-formed string: “This is possible”
Ill-formed but possible string: “This possible is”
Impossible string:
Impossible strings are due to distortion values that generate different words at the same position
Impossible strings can still be filtered out in later stages of the translation process
Limitations of IBM Models
Only 1-to-N word mapping
Handling fertility-zero words (difficult for decoding)
Almost no syntactic information
Word classes
Relative distortion
Long-distance word movement
Fluency of the output depends entirely on the English language model
Decoding
How to translate new sentences?
A decoder uses the parameters learned on a parallel corpus:
Translation probabilities
Fertilities
Distortions
In combination with a language model the decoder generates the most likely translation
Standard algorithms can be used to explore the search space (A*, greedy searching, …)
Similar to the traveling salesman problem
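As a deliberately tiny illustration of the greedy end of that spectrum, the sketch below translates word by word, picking at each step the English word that maximizes T(f|e) times a bigram LM score. This is an assumption-laden toy (real decoders search over reorderings, fertilities, and hypotheses with beam or A* search; both tables here are made up):

```python
def greedy_decode(f_sent, T, lm_bigram):
    """Toy greedy decoder: for each foreign word, pick the English word e
    maximizing T(f | e) * P(e | previous English word). Monotone only:
    no reordering, no fertility, no NULL handling."""
    e_sent, prev = [], "<s>"
    for f in f_sent:
        cands = [e for (ff, e) in T if ff == f]
        best = max(cands,
                   key=lambda e: T[(f, e)] * lm_bigram.get((prev, e), 1e-6))
        e_sent.append(best)
        prev = best
    return e_sent

# Assumed toy tables
T = {("la", "the"): 0.7, ("la", "it"): 0.3, ("bruja", "witch"): 0.9}
lm = {("<s>", "the"): 0.5, ("<s>", "it"): 0.4,
      ("the", "witch"): 0.3, ("it", "witch"): 0.3}
print(greedy_decode(["la", "bruja"], T, lm))  # ['the', 'witch']
```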
Three Problems for Statistical MT
Language model: Given an English string e, assigns P(e) by formula; good English string -> high P(e); random word sequence -> low P(e)
Translation model: Given a pair of strings <f,e>, assigns P(f | e) by formula; <f,e> look like translations -> high P(f | e); <f,e> don’t look like translations -> low P(f | e)
Decoding algorithm: Given a language model, a translation model, and a new sentence f … find translation e maximizing P(e) * P(f | e)
Slide from Kevin Knight
The Classic Language Model: Word N-Grams
Goal of the language model -- choose among:
He is on the soccer field
He is in the soccer field
Is table the on cup the
The cup is on the table
Rice shrine
American shrine
Rice company
American company
Slide from Kevin Knight
Intuition of phrase-based translation (Koehn et al. 2003)
Generative story has three steps
1) Group words into phrases
2) Translate each phrase
3) Move the phrases around
Generative story again
1) Group English source words into phrases e1, e2, …, en
2) Translate each English phrase ei into a Spanish phrase fj. The probability of doing this is φ(fj | ei)
3) Then (optionally) reorder each Spanish phrase
We do this with a distortion probability: a measure of distance between the positions of a corresponding phrase in the 2 languages
“What is the probability that a phrase in position X in the English sentence moves to position Y in the Spanish sentence?”
Distortion probability
The distortion probability is parameterized by ai − bi−1
Where ai is the start position of the foreign (Spanish) phrase generated by the ith English phrase ei.
And bi−1 is the end position of the foreign (Spanish) phrase generated by the (i−1)th English phrase ei−1.
We’ll call the distortion probability d(ai − bi−1). And we’ll have a really stupid model:
d(ai − bi−1) = α^|ai − bi−1|
Where α is some small constant.
Final translation model for phrase-based MT
Let’s look at a simple example with no distortion:
P(F | E) = ∏_{i=1…l} φ(fi | ei) * d(ai − bi−1)
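Scoring a segmented, aligned sentence pair with this model is a straight product over phrase pairs. A sketch using the toy distortion d(x) = α^|x| from the slides, with an assumed hand-made phrase table and b0 taken as 0 (both assumptions of this sketch):

```python
def phrase_model_score(beads, phi, alpha=0.5):
    """Phrase-based P(F | E) = prod_i phi(f_i | e_i) * d(a_i - b_{i-1}),
    with the slides' toy distortion d(x) = alpha**|x| and b_0 = 0.
    Each bead is (english_phrase, foreign_phrase, a_i, b_i), where a_i/b_i
    are the 1-based start/end positions of the foreign phrase."""
    p, prev_end = 1.0, 0
    for e, f, start, end in beads:
        p *= phi[(f, e)] * alpha ** abs(start - prev_end)
        prev_end = end
    return p

# Assumed toy phrase table and segmentation
phi = {("maría", "mary"): 0.9, ("bruja verde", "green witch"): 0.5}
beads = [("mary", "maría", 1, 1), ("green witch", "bruja verde", 2, 3)]
p = phrase_model_score(beads, phi)
print(p)  # 0.9 * 0.5**1 * 0.5 * 0.5**1 = 0.1125
```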
Phrase-based MT
Language model: P(E)
Translation model: P(F|E)
How to train the model
Decoder: finding the sentence E that is most probable
Training P(F|E)
What we mainly need to train is φ(fj | ei)
Suppose we had a large bilingual training corpus (a bitext) in which each English sentence is paired with a Spanish sentence
And suppose we knew exactly which phrase in Spanish was the translation of which phrase in the English. We call this a phrase alignment
If we had this, we could just count-and-divide: φ(f | e) = count(f, e) / ∑_{f′} count(f′, e)
But we don’t have phrase alignments
What we have instead are word alignments:
Getting phrase alignments
To get phrase alignments:
1) We first get word alignments
2) Then we “symmetrize” the word alignments into phrase alignments
How to get Word Alignments
Word alignment: a mapping between the source words and the target words in a set of parallel sentences.
Restriction: each foreign word comes from exactly 1 English word
Advantage: represent an alignment by the index of the English word that the French word comes from
Alignment above is thus 2,3,4,5,6,6,6
One addition: spurious words
A word in the foreign sentence that doesn’t align with any word in the English sentence is called a spurious word. We model these by pretending they are generated by an English word e0:
More sophisticated models of alignment
Computing word alignments: IBM Model 1
For phrase-based machine translation we want a word-alignment to extract a set of phrases
A word alignment algorithm gives us P(F,E)
We want this to train our phrase probabilities φ(fj | ei) as part of P(F|E)
But a word-alignment algorithm can also be part of a mini-translation model itself.
IBM Model 1
IBM Model 1
How does the generative story assign P(F|E) for a Spanish sentence F?
Terminology:
Suppose we had done steps 1 and 2, i.e. we already knew the Spanish length J and the alignment A (and English source E):
Let’s formalize steps 1 and 2
We want P(A|E) of an alignment A (of length J) given an English sentence E
IBM Model 1 makes the (very) simplifying assumption that each alignment is equally likely.
How many possible alignments are there between English sentence of length I and Spanish sentence of length J?
Hint: Each Spanish word must come from one of the English source words (or the NULL word)
(I+1)^J
Let’s assume the probability of choosing length J is a small constant epsilon
Model 1 continued
Prob of choosing a length and then one of the possible alignments:
Combining with step 3:
The total probability of a given foreign sentence F:
Decoding
How do we find the best A?
Training alignment probabilities
Step 1: get a parallel corpus
Hansards: Canadian parliamentary proceedings, in French and English
Hong Kong Hansards: English and Chinese
Step 2: sentence alignment
Step 3: use EM (Expectation-Maximization) to train word alignments
Step 1: Parallel corpora
English German
Diverging opinions about planned tax reform
Unterschiedliche Meinungen zur geplanten Steuerreform
The discussion around the envisaged major tax reform continues .
Die Diskussion um die vorgesehene grosse Steuerreform dauert an .
The FDP economics expert , Graf Lambsdorff , today came out in favor of advancing the enactment of significant parts of the overhaul , currently planned for 1999 .
Der FDP - Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus , wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen .
Example from DE-News (8/1/1996)
Slide from Christof Monz
Step 2: Sentence Alignment
The old man is happy. He has fished many times. His wife talks to him. The fish are jumping. The sharks await.
Intuition:
- use length in words or chars
- together with dynamic programming
- or use a simpler MT model
El viejo está feliz porque ha pescado muchos veces. Su mujer habla con él. Los tiburones esperan.
Slide from Kevin Knight
Sentence Alignment
1. The old man is happy.
2. He has fished many times.
3. His wife talks to him.
4. The fish are jumping.
5. The sharks await.
El viejo está feliz porque ha pescado muchos veces.
Su mujer habla con él. Los tiburones esperan.
Slide from Kevin Knight
Sentence Alignment
1. The old man is happy.
2. He has fished many times.
3. His wife talks to him.
4. The fish are jumping.
5. The sharks await.
El viejo está feliz porque ha pescado muchos veces.
Su mujer habla con él.
Los tiburones esperan.
Slide from Kevin Knight
Sentence Alignment
1. The old man is happy. He has fished many times.
2. His wife talks to him.
3. The sharks await.
El viejo está feliz porque ha pescado muchos veces.
Su mujer habla con él.
Los tiburones esperan.
Note that unaligned sentences are thrown out, and sentences are merged in n-to-m alignments (n, m > 0).
Slide from Kevin Knight
Step 3: word alignments
It turns out we can bootstrap alignments from a sentence-aligned bilingual corpus
We use the Expectation-Maximization or EM algorithm
EM for training alignment probs
… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …
All word alignments equally likely
All P(french-word | english-word) equally likely
Slide from Kevin Knight
EM for training alignment probs
… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …
“la” and “the” observed to co-occur frequently,so P(la | the) is increased.
Slide from Kevin Knight
EM for training alignment probs
… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …
“house” co-occurs with both “la” and “maison”, butP(maison | house) can be raised without limit, to 1.0,
while P(la | house) is limited because of “the”
(pigeonhole principle)
Slide from Kevin Knight
EM for training alignment probs
… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …
settling down after another iteration
Slide from Kevin Knight
EM for training alignment probs
… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …
Inherent hidden structure revealed by EM training! For details, see:
• Section 24.6.1 in the chapter
• “A Statistical MT Tutorial Workbook” (Knight, 1999)
• “The Mathematics of Statistical Machine Translation” (Brown et al, 1993)
• Software: GIZA++
Slide from Kevin Knight
Statistical Machine Translation
… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …
P(juste | fair) = 0.411
P(juste | correct) = 0.027
P(juste | right) = 0.020
…
new French sentence
Possible English translations, to be rescored by language model
Slide from Kevin Knight
A more complex model: IBM Model 3 (Brown et al., 1993)
Mary did not slap the green witch
n(3|slap)
Mary not slap slap slap the green witch
p-Null
Mary not slap slap slap NULL the green witch
t(la|the)
Maria no dió una bofetada a la verde bruja
d(j|i)
Maria no dió una bofetada a la bruja verde
Generative approach:
Probabilities can be learned from raw bilingual text.
How do we evaluate MT? Human tests for fluency
Rating tests: Give the raters a scale (1 to 5) and ask them to rate
Or distinct scales for Clarity, Naturalness, Style
Or check for specific problems:
Cohesion (lexical chains, anaphora, ellipsis): hand-checking for cohesion
Well-formedness: 5-point scale of syntactic correctness
Comprehensibility tests: noise test, multiple-choice questionnaire
Readability tests: cloze
How do we evaluate MT? Human tests for fidelity
Adequacy: Does it convey the information in the original? Ask raters to rate on a scale
Bilingual raters: give them source and target sentence, ask how much information is preserved
Monolingual raters: give them target + a good human translation
Informativeness: Task based: is there enough info to do some task? Give raters multiple-choice questions about content
Evaluating MT: Problems
Asking humans to judge sentences on a 5-point scale for 10 factors takes time and $$$ (weeks or months!)
We can’t build language engineering systems if we can only evaluate them once every quarter!!!!
We need a metric that we can run every time we change our algorithm.
It would be OK if it wasn’t perfect, but just tended to correlate with the expensive human metrics, which we could still run quarterly.
Bonnie Dorr
Automatic evaluation
Miller and Beebe-Center (1958)
Assume we have one or more human translations of the source passage
Compare the automatic translation to these human translations
Bleu
NIST
Meteor
Precision/Recall
BiLingual Evaluation Understudy (BLEU; Papineni, 2001)
Automatic Technique, but…
Requires the pre-existence of Human (Reference) Translations
Approach:
Produce corpus of high-quality human translations
Judge “closeness” numerically (word-error rate)
Compare n-gram matches between candidate translation and 1 or more reference translations
http://www.research.ibm.com/people/k/kishore/RC22176.pdf
Slide from Bonnie Dorr
Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.
BLEU Evaluation Metric(Papineni et al, ACL-2002)
• N-gram precision (score is between 0 and 1)
– What percentage of machine n-grams can be found in the reference translation?
– An n-gram is a sequence of n words
– Not allowed to use the same portion of the reference translation twice (can’t cheat by typing out “the the the the the”)
• Brevity penalty
– Can’t just type out a single word “the” (precision 1.0!)
Amazingly hard to “game” the system (i.e., find a way to change machine output so that BLEU goes up, but quality doesn’t)
Slide from Bonnie Dorr
BLEU Evaluation Metric (Papineni et al., ACL-2002)
• BLEU4 formula (counts n-grams up to length 4):

exp(1.0 · log p1 + 0.5 · log p2 + 0.25 · log p3 + 0.125 · log p4 − max(words-in-reference / words-in-machine − 1, 0))

p1 = 1-gram precision
p2 = 2-gram precision
p3 = 3-gram precision
p4 = 4-gram precision
Slide from Bonnie Dorr
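As a sketch, the slide's formula can be written directly in Python. The function name `bleu4_slide` and the precision inputs are illustrative; note that standard BLEU-4 (Papineni et al., 2002) weights all four log-precisions equally at 0.25, whereas the slide uses decreasing weights.

```python
import math

def bleu4_slide(precisions, ref_len, cand_len):
    """BLEU-4 as written on the slide: weighted log-precisions with
    weights 1.0, 0.5, 0.25, 0.125, plus the brevity term
    -max(ref_len/cand_len - 1, 0) inside the exponent."""
    p1, p2, p3, p4 = precisions
    score = (1.0 * math.log(p1) + 0.5 * math.log(p2)
             + 0.25 * math.log(p3) + 0.125 * math.log(p4)
             - max(ref_len / cand_len - 1, 0))
    return math.exp(score)

# Perfect precisions and matched lengths give a score of 1.0;
# a candidate half the reference length is penalized by exp(-1).
print(bleu4_slide((1, 1, 1, 1), 10, 10))
print(bleu4_slide((1, 1, 1, 1), 20, 10))
```

Because the brevity term is `max(r/c - 1, 0)`, it is zero whenever the candidate is at least as long as the reference, matching the clipped `exp(1 - r/c)` penalty of the BLEU paper.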
Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .
Reference translation 3: The US International Airport of Guam and its office has received an email from a self-claimed Arabian millionaire named Laden , which threatens to launch a biochemical attack on such public places as airport . Guam authority has been on alert .
Reference translation 4: US Guam International Airport and its office received an email from Mr. Bin Laden and other rich businessman from Saudi Arabia . They said there would be biochemistry air raid to Guam Airport and other public places . Guam needs to be in high precaution about this matter .
Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places .
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.
Multiple Reference Translations
Slide from Bonnie Dorr
Bleu Comparison
Chinese-English Translation Example:
Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.
Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
Slide from Bonnie Dorr
How Do We Compute BLEU Scores?
Intuition: “What percentage of words in the candidate occurred in some human translation?”
Proposal: count up the number of candidate translation words (unigrams) that appear in any reference translation, and divide by the total number of words in the candidate translation.
But we can’t just count the total number of overlapping n-grams!
Candidate: the the the the the the
Reference 1: The cat is on the mat
Solution: a reference word should be considered exhausted after a matching candidate word is identified.
Slide from Bonnie Dorr
“Modified n-gram precision”
For each word compute:
(1) the maximum number of times it occurs in any single reference translation
(2) the number of times it occurs in the candidate translation
Instead of using count #2 directly, use the minimum of #2 and #1, i.e., clip the counts at the maximum for the reference translations.
Now use that modified count, and divide by the number of candidate words.
Slide from Bonnie Dorr
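The clipping procedure above can be sketched in a few lines of Python (the function names are illustrative). Run on the Chinese-English example's Candidate #1, it reproduces the slide's 17/18 unigram and 10/17 bigram answers.

```python
from collections import Counter
from fractions import Fraction

def ngrams(tokens, n):
    """All contiguous n-word sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram count is capped
    at the maximum number of times it appears in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return Fraction(clipped, sum(cand_counts.values()))

cand1 = ("it is a guide to action which ensures that the military "
         "always obeys the commands of the party").split()
refs = [
    "it is a guide to action that ensures that the military will forever heed party commands".split(),
    "it is the guiding principle which guarantees the military forces always being under the command of the party".split(),
    "it is the practical guide for the army always to heed the directions of the party".split(),
]
print(modified_precision(cand1, refs, 1))  # 17/18
print(modified_precision(cand1, refs, 2))  # 10/17
```

Tokens are lowercased here so that the candidate's "party" can match the references' "Party", as the slide's per-word counts assume.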
Modified Unigram Precision: Candidate #1
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
It(1) is(1) a(1) guide(1) to(1) action(1) which(1) ensures(1) that(2) the(4) military(1) always(1) obeys(0) the commands(1) of(1) the party(1)
What’s the answer?
17/18
Slide from Bonnie Dorr
Modified Unigram Precision: Candidate #2
It(1) is(1) to(1) insure(0) the(4) troops(0) forever(1) hearing(0) the activity(0) guidebook(0) that(2) party(1) direct(0)
What’s the answer?
8/14
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
Slide from Bonnie Dorr
Modified Bigram Precision: Candidate #1
It is(1) is a(1) a guide(1) guide to(1) to action(1) action which(0) which ensures(0) ensures that(1) that the(1) the military(1) military always(0) always obeys(0) obeys the(0) the commands(0) commands of(0) of the(1) the party(1)
What’s the answer?
10/17
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
Slide from Bonnie Dorr
Modified Bigram Precision: Candidate #2
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
It is(1) is to(0) to insure(0) insure the(0) the troops(0) troops forever(0) forever hearing(0) hearing the(0) the activity(0) activity guidebook(0) guidebook that(0) that party(0) party direct(0)
What’s the answer?
1/13
Slide from Bonnie Dorr
Catching Cheaters
Reference 1: The cat is on the mat
Reference 2: There is a cat on the mat
Candidate: the the the the the the the
the(2) the(0) the(0) the(0) the(0) the(0) the(0) — “the” appears at most twice in a single reference, so the clipped count is 2
What’s the unigram answer?
2/7
What’s the bigram answer?
0/7
Slide from Bonnie Dorr
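A minimal check of this cheater example (variable names are illustrative):

```python
from collections import Counter

ref1 = "the cat is on the mat".split()
ref2 = "there is a cat on the mat".split()
cand = ["the"] * 7  # degenerate candidate: "the the the the the the the"

# "the" occurs at most twice in any single reference (reference 1),
# so the seven candidate "the"s contribute a clipped count of only 2.
clip = max(Counter(ref1)["the"], Counter(ref2)["the"])
clipped_count = min(cand.count("the"), clip)
print(clipped_count, "/", len(cand))  # 2 / 7

# No candidate bigram "the the" appears in either reference,
# so the modified bigram precision is 0.
```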
Bleu distinguishes human from machine translations
Slide from Bonnie Dorr
BLEU problems with sentence length
Candidate: of the
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
Problem: modified unigram precision is 2/2, bigram 1/1!
Solution: brevity penalty; prefers candidate translations which are the same length as one of the references
Slide from Bonnie Dorr
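The brevity penalty can be sketched as follows, using the clipped exp(1 - r/c) form from Papineni et al. (2002); the function name is illustrative. For the two-word candidate “of the” against an 18-word reference, the penalty drives the otherwise perfect score toward zero.

```python
import math

def brevity_penalty(cand_len, ref_len):
    """Brevity penalty: 1 if the candidate is at least as long as the
    reference, exp(1 - ref_len/cand_len) otherwise."""
    if cand_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / cand_len)

# "of the" (2 words) vs. an 18-word reference: the unigram and bigram
# precisions are perfect (2/2 and 1/1), but the penalty is tiny.
print(brevity_penalty(2, 18))  # exp(-8), about 0.000335
```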
BLEU Tends to Predict Human Judgments

[Figure: scatter plot of NIST score (a variant of BLEU) against human judgments of Adequacy and Fluency, both axes running from -2.5 to 2.5, with linear fits for Adequacy and Fluency; R² = 88.0% and R² = 90.2%.]

slide from G. Doddington (NIST)
Summary
Intro and a little history
Language Similarities and Divergences
Four main MT Approaches:
Transfer, Interlingua, Direct, Statistical
Evaluation
Classes
LINGUIST 139M/239M. Human and Machine Translation. (Martin Kay)
CS 224N. Natural Language Processing (Chris Manning)