Lexical alignment: IBM models 1 and 2
Transcript of slides: uva-slpl.github.io/nlp2/resources/slides/ibm.pdf
Lexical alignment: IBM models 1 and 2
MLE via EM for categorical distributions

Wilker Aziz

April 11, 2017
Translation data

Let's assume we are confronted with a new language, and luckily we managed to obtain some sentence-aligned data:

the black dog        � ~
the nice dog         � ∪
the black cat        � ~
a dog chasing a cat  � / �

Is there anything we could say about this language?

1 / 29
Translation by analogy

the black dog        � ~
the nice dog         � ∪
the black cat        � ~
a dog chasing a cat  � / �

A few hypotheses:

- � ⇐⇒ dog
- � ⇐⇒ cat
- ~ ⇐⇒ black
- nouns seem to precede adjectives
- determiners are probably not expressed
- chasing may be expressed by / and perhaps this language is OVS
- or perhaps chasing is realised by a verb with swapped arguments

2 / 29
Probabilistic lexical alignment models

This lecture is about operationalising this intuition:

- through a probabilistic learning algorithm
- for a non-probabilistic approach see, for example, [Lardilleux and Lepage, 2009]

3 / 29
Content
Lexical alignment
Mixture models
IBM model 1
IBM model 2
Remarks
4 / 29
Word-to-word alignments

Imagine you are given a text:

the black dog        o cão preto
the nice dog         o cão amigo
the black cat        o gato preto
a dog chasing a cat  um cão perseguindo um gato

Now imagine the French words were replaced by placeholders:

the black dog        F1 F2 F3
the nice dog         F1 F2 F3
the black cat        F1 F2 F3
a dog chasing a cat  F1 F2 F3 F4 F5

and suppose our task is to have a model explain the original data by generating each French word from exactly one English word.

5 / 29
Generative story

For each sentence pair independently,

1. observe an English sentence e_1, …, e_m and a French sentence length n
2. for each French word position j from 1 to n
   2.1 select an English position a_j
   2.2 conditioned on the English word e_{a_j}, generate f_j

We have introduced an alignment, which is not directly visible in the data.

6 / 29
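The generative story above can be sketched as a small simulation. The lexical table `lex` below is hypothetical (made up for illustration); the alignment choice is uniform over English positions, as in the story:

```python
import random

# Hypothetical lexical table: lex[e][f] plays the role of P(f | e).
lex = {
    "the": {"o": 1.0},
    "black": {"preto": 1.0},
    "dog": {"cao": 0.9, "cachorro": 0.1},
}

def generate(english, n, rng):
    """For each French position j: pick an aligned English position a_j
    uniformly (step 2.1), then draw f_j from the lexical distribution
    of the chosen English word (step 2.2)."""
    french, alignment = [], []
    for j in range(n):
        a_j = rng.randrange(len(english))
        e = english[a_j]
        f_j = rng.choices(list(lex[e]), weights=list(lex[e].values()))[0]
        french.append(f_j)
        alignment.append(a_j)
    return french, alignment

rng = random.Random(0)
f, a = generate(["the", "black", "dog"], 3, rng)
```

The alignment `a` is produced alongside `f`, but in real data only `f` would be observed — which is exactly the missing-data problem the rest of the lecture addresses.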
Data augmentation

Observations:

the black dog        o cão preto

Imagine the data is made of pairs (a_j, f_j), i.e. links e_{a_j} → f_j:

the black dog    (1, the → o) (3, dog → cão) (2, black → preto)
the black dog    (1, the → o) (1, the → cão) (1, the → preto)
the black dog    (a_1, e_{a_1} → f_1) (a_2, e_{a_2} → f_2) (a_3, e_{a_3} → f_3)

7 / 29
Content
Lexical alignment
Mixture models
IBM model 1
IBM model 2
Remarks
8 / 29
Mixture models: generative story

(Graphical model: latent z generates x; plate over m observations.)

- c mixture components
- each defines a distribution over the same data space X
- plus a distribution over the components themselves

Generative story:

1. select a mixture component z ∼ P(Z)
2. generate an observation from it x ∼ P(X|Z = z)

9 / 29
Mixture models: likelihood

Incomplete-data likelihood:

P(x_1^m) = ∏_{i=1}^m P(x_i)                                      (1)
         = ∏_{i=1}^m ∑_{z=1}^c P(X = x_i, Z = z)                 (2)
           (the summand is the complete-data likelihood)
         = ∏_{i=1}^m ∑_{z=1}^c P(Z = z) P(X = x_i | Z = z)       (3)

10 / 29
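Equation (3) is just a per-observation sum over components, multiplied across observations. A numeric sketch with made-up parameters:

```python
# Made-up mixture: c = 2 components over X = {"a", "b"}.
p_z = [0.5, 0.5]
p_x_given_z = [{"a": 0.2, "b": 0.8}, {"a": 0.7, "b": 0.3}]

def marginal(x):
    """Incomplete-data likelihood of one observation, eq. (3)."""
    return sum(p_z[z] * p_x_given_z[z][x] for z in range(len(p_z)))

def likelihood(data):
    """Product over iid observations, eq. (1)."""
    result = 1.0
    for x in data:
        result *= marginal(x)
    return result

print(marginal("a"))   # 0.5*0.2 + 0.5*0.7 ≈ 0.45
```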
Interpretation

Missing data:

- let Z take one of c mixture components
- assume the data consists of pairs (x, z)
- x is always observed
- z is always missing

Inference: posterior distribution over possible Z for each x

P(Z = z | X = x) = P(Z = z, X = x) / ∑_{z'=1}^c P(Z = z', X = x)                           (4)
                 = P(Z = z) P(X = x | Z = z) / ∑_{z'=1}^c P(Z = z') P(X = x | Z = z')      (5)

11 / 29
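Equation (5) in code, for the same kind of toy mixture (all parameters made up):

```python
p_z = [0.5, 0.5]
p_x_given_z = [{"a": 0.2, "b": 0.8}, {"a": 0.7, "b": 0.3}]

def posterior(x):
    """P(Z = z | X = x) via Bayes' rule, eq. (5)."""
    joint = [p_z[z] * p_x_given_z[z][x] for z in range(len(p_z))]
    total = sum(joint)           # the denominator of eq. (5)
    return [j / total for j in joint]

post = posterior("a")   # [0.1/0.45, 0.35/0.45] ≈ [0.222, 0.778]
```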
Non-identifiability

Different parameter settings, same distribution.

Suppose X = {a, b} and c = 2, and let P(Z = 1) = P(Z = 2) = 0.5:

Z      X = a   X = b
1      0.2     0.8
2      0.7     0.3
P(X)   0.45    0.55

Z      X = a   X = b
1      0.7     0.3
2      0.2     0.8
P(X)   0.45    0.55

This is a problem for parameter estimation by hill climbing.

12 / 29
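The two tables can be checked directly: swapping the component labels leaves the marginal P(X) unchanged, which is why a hill climber cannot distinguish the two solutions:

```python
# Both parameter settings from the slide, with P(Z=1) = P(Z=2) = 0.5.
setting_1 = [{"a": 0.2, "b": 0.8}, {"a": 0.7, "b": 0.3}]
setting_2 = [{"a": 0.7, "b": 0.3}, {"a": 0.2, "b": 0.8}]

def p_x(components, x):
    """Marginal P(X = x) under a uniform component prior."""
    return sum(0.5 * comp[x] for comp in components)

for x in ("a", "b"):
    assert abs(p_x(setting_1, x) - p_x(setting_2, x)) < 1e-12
```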
Maximum likelihood estimation

Suppose a dataset D = {x^(1), x^(2), …, x^(m)}, and suppose P(X) is one of a parametric family with parameters θ.

Likelihood of iid observations:

P(D) = ∏_{i=1}^m P_θ(X = x^(i))

The log-likelihood (score) function is

l(θ) = ∑_{i=1}^m log P_θ(X = x^(i))

and then we choose

θ* = argmax_θ l(θ)

13 / 29
MLE for categorical: estimation from fully observed data

Suppose we have complete data:

- D_complete = {(x^(1), z^(1)), …, (x^(m), z^(m))}

Then, for a categorical distribution

P(X = x | Z = z) = θ_{z,x}

and with n(z, x | D_complete) = count of (z, x) in D_complete, the MLE solution is

θ_{z,x} = n(z, x | D_complete) / ∑_{x'} n(z, x' | D_complete)

14 / 29
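The count-and-normalise MLE is a few lines of code. The complete dataset below is hypothetical:

```python
from collections import Counter

# Hypothetical complete data: (x, z) pairs.
data = [("a", 1), ("b", 1), ("b", 1), ("a", 2)]

counts = Counter((z, x) for x, z in data)     # n(z, x | D_complete)
z_totals = Counter(z for _, z in data)        # sum over x' per z

theta = {(z, x): n / z_totals[z] for (z, x), n in counts.items()}
# e.g. theta[(1, "b")] = 2/3
```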
MLE for categorical: estimation from incomplete data

Expectation-Maximisation algorithm [Dempster et al., 1977]

E-step:

- for every observation x, imagine that every possible latent assignment z happened with probability P_θ(Z = z | X = x)

D_completed = {(x, Z = 1), …, (x, Z = c) : x ∈ D}

15 / 29
MLE for categorical: estimation from incomplete data

Expectation-Maximisation algorithm [Dempster et al., 1977]

M-step:

- re-estimate θ so as to climb the likelihood surface
- for categorical distributions P(X = x | Z = z) = θ_{z,x}, with z and x categorical, 0 ≤ θ_{z,x} ≤ 1 and ∑_{x ∈ X} θ_{z,x} = 1

θ_{z,x} = E[n(z → x | D_completed)] / ∑_{x'} E[n(z → x' | D_completed)]                                                          (6)
        = [∑_{i=1}^m ∑_{z'} P(z' | x^(i)) 1_z(z') 1_x(x^(i))] / [∑_{i=1}^m ∑_{x'} ∑_{z'} P(z' | x^(i)) 1_z(z') 1_{x'}(x^(i))]    (7)
        = [∑_{i=1}^m P(z | x^(i)) 1_x(x^(i))] / [∑_{i=1}^m ∑_{x'} P(z | x^(i)) 1_{x'}(x^(i))]                                    (8)

16 / 29
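One EM iteration for a categorical mixture, combining the E-step posteriors with the M-step normalisation of expected counts, eq. (6). The toy data and initialisation are made up; the mixing weights P(Z) are also re-estimated here, which the slides leave implicit:

```python
from collections import defaultdict

data = ["a", "b", "b", "a", "b"]
c = 2
p_z = [0.5, 0.5]
theta = [{"a": 0.6, "b": 0.4}, {"a": 0.3, "b": 0.7}]

def em_step(p_z, theta):
    exp_counts = [defaultdict(float) for _ in range(c)]
    z_counts = [0.0] * c
    for x in data:
        # E-step: posterior responsibility of each component for x.
        joint = [p_z[z] * theta[z][x] for z in range(c)]
        total = sum(joint)
        for z in range(c):
            r = joint[z] / total
            exp_counts[z][x] += r
            z_counts[z] += r
    # M-step: normalise expected counts, eq. (6).
    new_p_z = [z_counts[z] / len(data) for z in range(c)]
    new_theta = [{x: n / z_counts[z] for x, n in exp_counts[z].items()}
                 for z in range(c)]
    return new_p_z, new_theta

p_z, theta = em_step(p_z, theta)
```

Iterating `em_step` climbs the incomplete-data likelihood to a local optimum.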
Content
Lexical alignment
Mixture models
IBM model 1
IBM model 2
Remarks
17 / 29
IBM1: a constrained mixture model

(Graphical model: e_0^m and the length n are observed; for each of the n French positions, a latent alignment a generates f; plate over S sentence pairs.)

Constrained mixture model:

- mixture components are English words
- but only English words that appear in the English sentence can be assigned
- a_j acts as an indicator for the mixture component that generates French word f_j
- e_0 is occupied by a special NULL component

18 / 29
Parameterisation

Alignment distribution: uniform

P(A | M = m, N = n) = 1 / (m + 1)   (9)

Lexical distribution: categorical

P(F | E = e) = Cat(F | θ_e)   (10)

- where θ_e ∈ R^{v_F} (v_F is the French vocabulary size)
- 0 ≤ θ_{e,f} ≤ 1
- ∑_f θ_{e,f} = 1

19 / 29
IBM1: incomplete-data likelihood

P(f_1^n | e_0^m) = ∑_{a_1=0}^m ⋯ ∑_{a_n=0}^m P(f_1^n, a_1^n | e_0^m)                     (11)
                 = ∑_{a_1=0}^m ⋯ ∑_{a_n=0}^m ∏_{j=1}^n P(a_j | m, n) P(f_j | e_{a_j})    (12)
                 = ∏_{j=1}^n ∑_{a_j=0}^m P(a_j | m, n) P(f_j | e_{a_j})                  (13)

20 / 29
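The rearrangement from (12) to (13), pushing the sums inside the product, can be verified numerically on a toy configuration (all probabilities made up):

```python
from itertools import product

m, n = 2, 2                     # English positions 0..m, French length n
p_align = 1.0 / (m + 1)         # uniform alignment distribution, eq. (9)
p_lex = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]   # p_lex[i][j] = P(f_j | e_i)

# Eq. (12): explicit sum over all (m+1)^n alignment vectors.
brute = 0.0
for a in product(range(m + 1), repeat=n):
    term = 1.0
    for j in range(n):
        term *= p_align * p_lex[a[j]][j]
    brute += term

# Eq. (13): product of per-position sums — linear instead of exponential.
factored = 1.0
for j in range(n):
    factored *= sum(p_align * p_lex[i][j] for i in range(m + 1))

assert abs(brute - factored) < 1e-12
```

The factorisation is what makes exact inference in IBM1 tractable.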
IBM1: posterior

Posterior:

P(a_1^n | f_1^n, e_0^m) = P(f_1^n, a_1^n | e_0^m) / P(f_1^n | e_0^m)   (14)

Factorised:

P(a_j | f_1^n, e_0^m) = P(a_j | m, n) P(f_j | e_{a_j}) / ∑_{i=0}^m P(i | m, n) P(f_j | e_i)   (15)

21 / 29
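Equation (15) in code. Because IBM1's alignment distribution is uniform, P(i | m, n) is a constant and cancels between numerator and denominator. The lexical table `t` is hypothetical:

```python
# Hypothetical lexical table t[e][f] ~ P(f | e); "NULL" plays e_0.
t = {
    "NULL": {"o": 0.1, "preto": 0.1},
    "the": {"o": 0.7, "preto": 0.1},
    "black": {"o": 0.1, "preto": 0.6},
}

def link_posterior(j, english, french, t):
    """P(A_j = i | f_1^n, e_0^m) for each English position i, eq. (15).
    The uniform alignment term 1/(m+1) cancels in the normalisation."""
    scores = [t[e].get(french[j], 0.0) for e in english]
    total = sum(scores)
    return [s / total for s in scores]

post = link_posterior(0, ["NULL", "the", "black"], ["o", "preto"], t)
```

Here "o" aligns most probably to "the", since `post[1] = 0.7 / 0.9`.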
MLE via EM

E-step:

E[n(e → f | A_1^n)] = ∑_{a_1=0}^m ⋯ ∑_{a_n=0}^m P(a_1^n | f_1^n, e_0^m) n(e → f | a_1^n)                          (16)
                    = ∑_{a_1=0}^m ⋯ ∑_{a_n=0}^m [∏_{j=1}^n P(a_j | f_1^n, e_0^m)] ∑_{j=1}^n 1_e(e_{a_j}) 1_f(f_j)  (17)
                    = ∑_{j=1}^n ∑_{i=0}^m P(A_j = i | f_1^n, e_0^m) 1_e(e_i) 1_f(f_j)                              (18)

M-step:

θ_{e,f} = E[n(e → f | A_1^n)] / ∑_{f'} E[n(e → f' | A_1^n)]   (19)

22 / 29
EM algorithm

Repeat until convergence to a local optimum:

1. For each sentence pair
   1.1 compute the posterior per alignment link
   1.2 accumulate fractional counts
2. Normalise the counts for each English word

23 / 29
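The whole loop fits in a short sketch. It runs IBM1 EM on the toy corpus from the earlier slides (diacritics dropped), with uniform initialisation and the special NULL word prepended to each English sentence:

```python
from collections import defaultdict

corpus = [
    ("the black dog".split(), "o cao preto".split()),
    ("the nice dog".split(), "o cao amigo".split()),
    ("the black cat".split(), "o gato preto".split()),
    ("a dog chasing a cat".split(), "um cao perseguindo um gato".split()),
]

f_vocab = {f for _, fs in corpus for f in fs}
uniform = 1.0 / len(f_vocab)
# t[e][f] approximates P(f | e); uniform initialisation.
t = defaultdict(lambda: dict.fromkeys(f_vocab, uniform))

for _ in range(20):
    counts = defaultdict(float)     # expected counts E[n(e -> f)]
    e_totals = defaultdict(float)
    for es, fs in corpus:
        es = ["NULL"] + es          # e_0 is the special NULL word
        for f in fs:
            # Step 1.1: posterior per alignment link; the uniform
            # alignment distribution 1/(m+1) cancels out.
            total = sum(t[e][f] for e in es)
            for e in es:
                p = t[e][f] / total
                counts[(e, f)] += p   # step 1.2: fractional counts
                e_totals[e] += p
    # Step 2: normalise the counts for each English word.
    t = defaultdict(lambda: dict.fromkeys(f_vocab, 0.0))
    for (e, f), n in counts.items():
        t[e][f] = n / e_totals[e]
```

After a few iterations the table concentrates on the intuitive pairs, e.g. `t["dog"]["cao"]` grows while `t["dog"]["o"]` shrinks, because "the" claims "o".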
Content
Lexical alignment
Mixture models
IBM model 1
IBM model 2
Remarks
Alignment distribution
Positional distribution: P(A_j | M = m, N = n) = Cat(A_j | λ_{j,m,n})
I one distribution for each tuple (j,m, n)
I support must include length of longest English sentence
I extremely over-parameterised!
Jump distribution [Vogel et al., 1996]
I define a jump function δ(a_j, j, m, n) = a_j − ⌊j · m/n⌋
I P(A_j | m, n) = Cat(∆ | λ), where ∆ = δ(A_j, j, m, n)
I ∆ takes values from −longest to +longest
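The jump function above can be sketched directly; the 1-based French position j and the flooring convention used here are assumptions of this sketch:

```python
import math

def jump(aj, j, m, n):
    """delta(a_j, j, m, n) = a_j - floor(j * m / n): the offset of the chosen
    English position a_j from the 'diagonal' position for French position j."""
    return aj - math.floor(j * m / n)

# m = 10 English words, n = 5 French words; French position j = 2
# sits on the diagonal at English position floor(2 * 10 / 5) = 4.
print(jump(4, 2, 10, 5))   # 0 (no jump: exactly on the diagonal)
print(jump(7, 2, 10, 5))   # 3 (jump three positions to the right)
```

Sharing one categorical over jump sizes across all (j, m, n) is what removes the over-parameterisation of the positional distribution.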
Content
Lexical alignment
Mixture models
IBM model 1
IBM model 2
Remarks
Note on terminology: source/target vs French/English
From an alignment model perspective all that matters is
I we condition on one language and generate the other
I in IBM models terminology, we condition on English and generate French
From a noisy channel perspective, where we want to translate a source sentence f_1^n into some target sentence e_1^m

I Bayes' rule decomposes p(e_1^m | f_1^n) ∝ p(f_1^n | e_1^m) p(e_1^m)

I train p(e_1^m) and p(f_1^n | e_1^m) independently

I language model: p(e_1^m)

I alignment model: p(f_1^n | e_1^m)

I note that the alignment model conditions on the target sentence (English) and generates the source sentence (French)
Limitations of IBM1-2
I too strong independence assumptions
I categorical parameterisation suffers from data sparsity
I EM suffers from local optima
Extensions
Fertility, distortion, and concepts [Brown et al., 1993]
Dirichlet priors and posterior inference [Mermer and Saraclar, 2011]
I + no Null words [Schulz et al., 2016]
I + HMM and efficient sampler [Schulz and Aziz, 2016]
Log-linear distortion parameters and variational Bayes [Dyer et al., 2013]

First-order dependency (HMM) [Vogel et al., 1996]

I E-step requires dynamic programming [Baum and Petrie, 1966]
References I
L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, 37:1554–1563, 1966.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2):263–311, June 1993. ISSN 0891-2017. URL http://dl.acm.org/citation.cfm?id=972470.972474.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1–38, 1977.
References II
Chris Dyer, Victor Chahuneau, and Noah A. Smith. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia, June 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/N13-1073.

Adrien Lardilleux and Yves Lepage. Sampling-based multilingual alignment. In Proceedings of the International Conference RANLP-2009, pages 214–218, Borovets, Bulgaria, September 2009. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/R09-1040.
References III
Coskun Mermer and Murat Saraclar. Bayesian word alignment for statistical machine translation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 182–187, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-2032.

Philip Schulz and Wilker Aziz. Fast collocation-based Bayesian HMM word alignment. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3146–3155, Osaka, Japan, December 2016. The COLING 2016 Organizing Committee. URL http://aclweb.org/anthology/C16-1296.
References IV
Philip Schulz, Wilker Aziz, and Khalil Sima’an. Word alignment without null words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 169–174, Berlin, Germany, August 2016. Association for Computational Linguistics. URL http://anthology.aclweb.org/P16-2028.

Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics - Volume 2, COLING ’96, pages 836–841, Stroudsburg, PA, USA, 1996. Association for Computational Linguistics. doi: 10.3115/993268.993313. URL http://dx.doi.org/10.3115/993268.993313.