Lexical alignment: IBM models 1 and 2
MLE via EM for categorical distributions
Wilker Aziz
April 11, 2017
Translation data
Let's assume we are confronted with a new language, and luckily we managed to obtain some sentence-aligned data:

the black dog    � ~
the nice dog    � ∪
the black cat    � ~
a dog chasing a cat    � / �

Is there anything we could say about this language?
Translation by analogy
the black dog    � ~
the nice dog    � ∪
the black cat    � ~
a dog chasing a cat    � / �

A few hypotheses:

- � ⇐⇒ dog
- � ⇐⇒ cat
- ~ ⇐⇒ black
- nouns seem to precede adjectives
- determiners are probably not expressed
- chasing may be expressed by /, and perhaps this language is OVS
- or perhaps chasing is realised by a verb with swapped arguments
Probabilistic lexical alignment models
This lecture is about operationalising this intuition

- through a probabilistic learning algorithm
- for a non-probabilistic approach see, for example, [Lardilleux and Lepage, 2009]
Content
Lexical alignment
Mixture models
IBM model 1
IBM model 2
Remarks
Word-to-word alignments
Imagine you are given a text
the black dog    o cao preto
the nice dog    o cao amigo
the black cat    o gato preto
a dog chasing a cat    um cao perseguindo um gato
Word-to-word alignments
Now imagine the French words were replaced by placeholders:

the black dog    F1 F2 F3
the nice dog    F1 F2 F3
the black cat    F1 F2 F3
a dog chasing a cat    F1 F2 F3 F4 F5

and suppose our task is to have a model explain the original data by generating each French word from exactly one English word.
Generative story
For each sentence pair independently,

1. observe an English sentence $e_1, \cdots, e_m$ and a French sentence length $n$
2. for each French word position $j$ from 1 to $n$
   2.1 select an English position $a_j$
   2.2 conditioned on the English word $e_{a_j}$, generate $f_j$

We have introduced an alignment, which is not directly visible in the data.
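As a concrete illustration, here is a minimal Python sketch of this generative story; the lexical distributions and the example sentence are hypothetical toy values, not part of the lecture:

```python
import random

# Hypothetical toy lexical distributions P(f | e), for illustration only.
LEX = {
    "the":   {"o": 0.9, "um": 0.1},
    "black": {"preto": 1.0},
    "dog":   {"cao": 1.0},
}

def generate_french(english, n):
    """Generate n French words, each from exactly one English word."""
    french, alignment = [], []
    for j in range(n):
        a_j = random.randrange(len(english))           # select an English position
        e = english[a_j]                               # the aligned English word
        words, probs = zip(*LEX[e].items())
        f_j = random.choices(words, weights=probs)[0]  # generate f_j ~ P(F | E = e)
        alignment.append(a_j)
        french.append(f_j)
    return french, alignment

print(generate_french(["the", "black", "dog"], n=3))
```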
Data augmentation
Observations:

the black dog    o cao preto

Imagine the data is made of pairs $(a_j, f_j)$, i.e. alignment links $e_{a_j} \to f_j$:

the black dog    $(1, \text{the} \to \text{o})\ (3, \text{dog} \to \text{cao})\ (2, \text{black} \to \text{preto})$
the black dog    $(1, \text{the} \to \text{o})\ (1, \text{the} \to \text{cao})\ (1, \text{the} \to \text{preto})$
the black dog    $(a_1, e_{a_1} \to f_1)\ (a_2, e_{a_2} \to f_2)\ (a_3, e_{a_3} \to f_3)$
Mixture models: generative story
[graphical model: latent component $z$ generates observation $x$; repeated for $m$ observations]

- $c$ mixture components
- each defines a distribution over the same data space $\mathcal{X}$
- plus a distribution over components themselves

Generative story:

1. select a mixture component $z \sim P(Z)$
2. generate an observation from it $x \sim P(X \mid Z = z)$
Mixture models: likelihood
Incomplete-data likelihood:

$P(x_1^m) = \prod_{i=1}^{m} P(x_i)$  (1)

$= \prod_{i=1}^{m} \sum_{z=1}^{c} \underbrace{P(X = x_i, Z = z)}_{\text{complete-data likelihood}}$  (2)

$= \prod_{i=1}^{m} \sum_{z=1}^{c} P(Z = z)\, P(X = x_i \mid Z = z)$  (3)
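A minimal numpy sketch of equation (3); the parameter values are hypothetical toy numbers:

```python
import numpy as np

# Hypothetical toy parameters: c = 2 components, two outcomes {a=0, b=1}.
prior = np.array([0.5, 0.5])                 # P(Z)
theta = np.array([[0.2, 0.8],                # P(X | Z = 1)
                  [0.7, 0.3]])               # P(X | Z = 2)

def incomplete_data_likelihood(data):
    """P(x_1^m) = prod_i sum_z P(Z=z) P(X=x_i | Z=z)."""
    marginals = theta[:, data]               # shape (c, m): P(x_i | z)
    per_obs = prior @ marginals              # sum over z for each observation
    return per_obs.prod()                    # product over observations

print(incomplete_data_likelihood(np.array([0, 1, 1])))  # e.g. x = (a, b, b)
```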
Interpretation
Missing data:

- let $Z$ take one of $c$ mixture components
- assume data consists of pairs $(x, z)$
- $x$ is always observed
- $z$ is always missing

Inference: posterior distribution over possible $Z$ for each $x$

$P(Z = z \mid X = x) = \dfrac{P(Z = z, X = x)}{\sum_{z'=1}^{c} P(Z = z', X = x)}$  (4)

$= \dfrac{P(Z = z)\, P(X = x \mid Z = z)}{\sum_{z'=1}^{c} P(Z = z')\, P(X = x \mid Z = z')}$  (5)
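Continuing the numpy sketch above, the posterior in equation (5) is just a normalised vector of joint probabilities:

```python
def posterior(x):
    """P(Z = z | X = x) for every component z (equation 5)."""
    joint = prior * theta[:, x]       # P(Z = z) P(X = x | Z = z), shape (c,)
    return joint / joint.sum()        # normalise over components

print(posterior(0))                   # responsibilities for observing x = a
```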
Non-identifiability
Different parameter settings, same distribution.

Suppose $\mathcal{X} = \{a, b\}$ and $c = 2$, and let $P(Z=1) = P(Z=2) = 0.5$:

Z       X = a   X = b
1       0.2     0.8
2       0.7     0.3
P(X)    0.45    0.55

Z       X = a   X = b
1       0.7     0.3
2       0.2     0.8
P(X)    0.45    0.55

This is a problem for parameter estimation by hillclimbing.
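A quick numeric check that both settings induce the same marginal, using the toy numbers from the tables above:

```python
import numpy as np

prior = np.array([0.5, 0.5])
theta_1 = np.array([[0.2, 0.8], [0.7, 0.3]])   # first setting
theta_2 = np.array([[0.7, 0.3], [0.2, 0.8]])   # components swapped

print(prior @ theta_1)   # [0.45 0.55]
print(prior @ theta_2)   # [0.45 0.55] -- identical marginal P(X)
```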
Maximum likelihood estimation
Suppose a dataset $\mathcal{D} = \{x^{(1)}, x^{(2)}, \cdots, x^{(m)}\}$, and suppose $P(X)$ belongs to a parametric family with parameters $\theta$.

The likelihood of iid observations is

$P(\mathcal{D}) = \prod_{i=1}^{m} P_\theta(X = x^{(i)})$

the log-likelihood (score) is

$\ell(\theta) = \sum_{i=1}^{m} \log P_\theta(X = x^{(i)})$

and we choose

$\theta^\star = \arg\max_\theta\ \ell(\theta)$
MLE for categorical: estimation from fully observed data
Suppose we have complete data

$\mathcal{D}_{\text{complete}} = \{(x^{(1)}, z^{(1)}), \ldots, (x^{(m)}, z^{(m)})\}$

Then, for a categorical distribution

$P(X = x \mid Z = z) = \theta_{z,x}$

and $n(z, x \mid \mathcal{D}_{\text{complete}})$ = count of $(z, x)$ in $\mathcal{D}_{\text{complete}}$, the MLE solution is

$\theta_{z,x} = \dfrac{n(z, x \mid \mathcal{D}_{\text{complete}})}{\sum_{x'} n(z, x' \mid \mathcal{D}_{\text{complete}})}$
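In code, the complete-data MLE is just count-and-normalise; a small sketch with hypothetical $(z, x)$ pairs:

```python
from collections import Counter

# Hypothetical complete data: (z, x) pairs.
complete_data = [(1, "a"), (1, "b"), (1, "b"), (2, "a"), (2, "a"), (2, "b")]

counts = Counter(complete_data)                      # n(z, x)
totals = Counter(z for z, _ in complete_data)        # sum_x' n(z, x')
theta = {(z, x): c / totals[z] for (z, x), c in counts.items()}

print(theta)  # {(1,'a'): 1/3, (1,'b'): 2/3, (2,'a'): 2/3, (2,'b'): 1/3}
```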
MLE for categorical: estimation from incomplete data
Expectation-Maximisation algorithm [Dempster et al., 1977]
E-step:

- for every observation $x$, imagine that every possible latent assignment $z$ happened with probability $P_\theta(Z = z \mid X = x)$

$\mathcal{D}_{\text{completed}} = \{(x, Z = 1), \ldots, (x, Z = c) : x \in \mathcal{D}\}$
M-step:

- re-estimate $\theta$ so as to climb the likelihood surface
- for categorical distributions $P(X = x \mid Z = z) = \theta_{z,x}$, where $z$ and $x$ are categorical, $0 \le \theta_{z,x} \le 1$ and $\sum_{x \in \mathcal{X}} \theta_{z,x} = 1$:

$\theta_{z,x} = \dfrac{\mathbb{E}[n(z \to x \mid \mathcal{D}_{\text{completed}})]}{\sum_{x'} \mathbb{E}[n(z \to x' \mid \mathcal{D}_{\text{completed}})]}$  (6)

$= \dfrac{\sum_{i=1}^{m} \sum_{z'} P(z' \mid x^{(i)})\, \mathbb{1}_z(z')\, \mathbb{1}_x(x^{(i)})}{\sum_{i=1}^{m} \sum_{x'} \sum_{z'} P(z' \mid x^{(i)})\, \mathbb{1}_z(z')\, \mathbb{1}_{x'}(x^{(i)})}$  (7)

$= \dfrac{\sum_{i=1}^{m} P(z \mid x^{(i)})\, \mathbb{1}_x(x^{(i)})}{\sum_{i=1}^{m} \sum_{x'} P(z \mid x^{(i)})\, \mathbb{1}_{x'}(x^{(i)})}$  (8)
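Putting the E-step and M-step together, a minimal EM loop for a categorical mixture might look as follows; the data and initialisation are hypothetical, and the component prior is kept fixed for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
c, v = 2, 3                                  # components, outcomes
data = np.array([0, 1, 1, 2, 0, 2, 1])       # hypothetical observations

prior = np.full(c, 1.0 / c)                  # P(Z), kept fixed here
theta = rng.dirichlet(np.ones(v), size=c)    # random init of P(X | Z)

for _ in range(50):
    # E-step: posterior responsibilities P(Z = z | x_i), shape (c, m)
    joint = prior[:, None] * theta[:, data]
    post = joint / joint.sum(axis=0)
    # M-step: expected counts, then normalise per component (equation 8)
    counts = np.zeros((c, v))
    for x in range(v):
        counts[:, x] = post[:, data == x].sum(axis=1)
    theta = counts / counts.sum(axis=1, keepdims=True)

print(theta)
```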
IBM1: a constrained mixture model
[graphical model: for each of $S$ sentence pairs, English words $e_0^m$ and length $n$ are observed; each French word $f_j$ is generated from the component selected by $a_j$]

Constrained mixture model:

- mixture components are English words
- but only English words that appear in the English sentence can be assigned
- $a_j$ acts as an indicator for the mixture component that generates French word $f_j$
- $e_0$ is occupied by a special Null component
Parameterisation
Alignment distribution: uniform

$P(A \mid M = m, N = n) = \dfrac{1}{m + 1}$  (9)

Lexical distribution: categorical

$P(F \mid E = e) = \mathrm{Cat}(F \mid \theta_e)$  (10)

- where $\theta_e \in \mathbb{R}^{v_F}$
- $0 \le \theta_{e,f} \le 1$
- $\sum_f \theta_{e,f} = 1$
IBM1: incomplete-data likelihood
Incomplete-data likelihood:

$P(f_1^n \mid e_0^m) = \sum_{a_1=0}^{m} \cdots \sum_{a_n=0}^{m} P(f_1^n, a_1^n \mid e_0^m)$  (11)

$= \sum_{a_1=0}^{m} \cdots \sum_{a_n=0}^{m} \prod_{j=1}^{n} P(a_j \mid m, n)\, P(f_j \mid e_{a_j})$  (12)

$= \prod_{j=1}^{n} \sum_{a_j=0}^{m} P(a_j \mid m, n)\, P(f_j \mid e_{a_j})$  (13)
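The swap of product and sum in equation (13) makes the likelihood cheap to compute, as in this sketch; the lexical table `theta` and the sentence pair are hypothetical:

```python
def ibm1_likelihood(french, english, theta):
    """P(f_1^n | e_0^m) under IBM1, equation (13).

    english includes the Null word at position 0;
    theta[e][f] is the lexical parameter P(f | e).
    """
    m = len(english) - 1              # English length excluding Null
    p = 1.0
    for f in french:                  # product over French positions j
        p *= sum(theta[e].get(f, 0.0) for e in english) / (m + 1)
    return p

theta = {"NULL": {"o": 0.1}, "the": {"o": 0.8},
         "black": {"preto": 0.7}, "dog": {"cao": 0.9}}
print(ibm1_likelihood(["o", "cao", "preto"], ["NULL", "the", "black", "dog"], theta))
```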
IBM1: posterior
Posterior:

$P(a_1^n \mid f_1^n, e_0^m) = \dfrac{P(f_1^n, a_1^n \mid e_0^m)}{P(f_1^n \mid e_0^m)}$  (14)

which factorises over alignment links:

$P(a_j \mid f_1^n, e_0^m) = \dfrac{P(a_j \mid m, n)\, P(f_j \mid e_{a_j})}{\sum_{i=0}^{m} P(i \mid m, n)\, P(f_j \mid e_i)}$  (15)
MLE via EM
E-step:

$\mathbb{E}[n(e \to f \mid A_1^n)] = \sum_{a_1=0}^{m} \cdots \sum_{a_n=0}^{m} P(a_1^n \mid f_1^n, e_0^m)\, n(e \to f \mid a_1^n)$  (16)

$= \sum_{a_1=0}^{m} \cdots \sum_{a_n=0}^{m} P(a_1^n \mid f_1^n, e_0^m) \sum_{j=1}^{n} \mathbb{1}_e(e_{a_j})\, \mathbb{1}_f(f_j)$  (17)

$= \sum_{j=1}^{n} \sum_{i=0}^{m} P(A_j = i \mid f_1^n, e_0^m)\, \mathbb{1}_e(e_i)\, \mathbb{1}_f(f_j)$  (18)

M-step:

$\theta_{e,f} = \dfrac{\mathbb{E}[n(e \to f \mid A_1^n)]}{\sum_{f'} \mathbb{E}[n(e \to f' \mid A_1^n)]}$  (19)
EM algorithm
Repeat until convergence to a local optimum:

1. For each sentence pair
   1.1 compute the posterior per alignment link
   1.2 accumulate fractional counts
2. Normalise the counts for each English word

A minimal sketch of this loop is given below.
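The sketch assumes a tiny hypothetical corpus; since IBM1's alignment distribution is uniform, it cancels in the posterior ratio of equation (15):

```python
from collections import defaultdict

def ibm1_em(corpus, iterations=10):
    """Train IBM1 lexical parameters theta[e][f] by EM.

    corpus: list of (french_words, english_words) pairs;
    a NULL token is prepended to every English sentence.
    """
    corpus = [(f, ["NULL"] + e) for f, e in corpus]
    # uniform initialisation over co-occurring word pairs
    theta = defaultdict(dict)
    for f_sent, e_sent in corpus:
        for f in f_sent:
            for e in e_sent:
                theta[e][f] = 1.0
    for _ in range(iterations):
        counts = defaultdict(lambda: defaultdict(float))
        for f_sent, e_sent in corpus:
            for f in f_sent:
                # E-step: posterior per alignment link (equation 15);
                # the uniform alignment probability cancels in the ratio
                norm = sum(theta[e][f] for e in e_sent)
                for e in e_sent:
                    counts[e][f] += theta[e][f] / norm
        # M-step: normalise expected counts per English word (equation 19)
        for e, cf in counts.items():
            total = sum(cf.values())
            for f in cf:
                theta[e][f] = cf[f] / total
    return theta

corpus = [(["o", "cao", "preto"], ["the", "black", "dog"]),
          (["o", "gato", "preto"], ["the", "black", "cat"])]
theta = ibm1_em(corpus)
print(max(theta["dog"], key=theta["dog"].get))  # most likely translation of 'dog'
```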
Alignment distribution (IBM model 2)

Positional distribution:

$P(A_j \mid M = m, N = n) = \mathrm{Cat}(A \mid \lambda_{j,m,n})$

- one distribution for each tuple $(j, m, n)$
- support must include the length of the longest English sentence
- extremely over-parameterised!

Jump distribution [Vogel et al., 1996]:

- define a jump function $\delta(a_j, j, m, n) = a_j - \left\lfloor j \frac{m}{n} \right\rfloor$
- $P(A_j \mid m, n) = \mathrm{Cat}(\Delta \mid \lambda)$
- $\Delta$ takes values from $-$longest to $+$longest
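A sketch of the jump parameterisation: a single shared categorical over jump values replaces one positional distribution per $(j, m, n)$; the variable names and the uniform initialisation are hypothetical:

```python
import math

def jump(a_j, j, m, n):
    """delta(a_j, j, m, n) = a_j - floor(j * m / n)."""
    return a_j - math.floor(j * m / n)

# One shared categorical over jumps:
# P(A_j = a_j | m, n) = lambda_[jump(a_j, j, m, n)]
longest = 50
lambda_ = {d: 1.0 / (2 * longest + 1) for d in range(-longest, longest + 1)}

print(jump(a_j=3, j=2, m=6, n=4))  # jump of 0: position 3 vs expected floor(2*6/4) = 3
```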
Note on terminology: source/target vs French/English
From an alignment model perspective, all that matters is:

- we condition on one language and generate the other
- in IBM-model terminology, we condition on English and generate French

From a noisy channel perspective, where we want to translate a source sentence $f_1^n$ into some target sentence $e_1^m$:

- Bayes' rule decomposes $p(e_1^m \mid f_1^n) \propto p(f_1^n \mid e_1^m)\, p(e_1^m)$
- train $p(e_1^m)$ and $p(f_1^n \mid e_1^m)$ independently
- language model: $p(e_1^m)$
- alignment model: $p(f_1^n \mid e_1^m)$
- note that the alignment model conditions on the target sentence (English) and generates the source sentence (French)
Limitations of IBM1-2
- too strong independence assumptions
- the categorical parameterisation suffers from data sparsity
- EM suffers from local optima
Extensions
- Fertility, distortion, and concepts [Brown et al., 1993]
- Dirichlet priors and posterior inference [Mermer and Saraclar, 2011]
  - + no Null words [Schulz et al., 2016]
  - + HMM and efficient sampler [Schulz and Aziz, 2016]
- Log-linear distortion parameters and variational Bayes [Dyer et al., 2013]
- First-order dependency (HMM) [Vogel et al., 1996]
  - E-step requires dynamic programming [Baum and Petrie, 1966]
References

L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, 37:1554–1563, 1966.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2):263–311, June 1993. URL http://dl.acm.org/citation.cfm?id=972470.972474.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1–38, 1977.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of NAACL-HLT 2013, pages 644–648, Atlanta, Georgia, June 2013. URL http://www.aclweb.org/anthology/N13-1073.

Adrien Lardilleux and Yves Lepage. Sampling-based multilingual alignment. In Proceedings of RANLP 2009, pages 214–218, Borovets, Bulgaria, September 2009. URL http://www.aclweb.org/anthology/R09-1040.

Coskun Mermer and Murat Saraclar. Bayesian word alignment for statistical machine translation. In Proceedings of ACL-HLT 2011, pages 182–187, Portland, Oregon, USA, June 2011. URL http://www.aclweb.org/anthology/P11-2032.

Philip Schulz and Wilker Aziz. Fast collocation-based Bayesian HMM word alignment. In Proceedings of COLING 2016, pages 3146–3155, Osaka, Japan, December 2016. URL http://aclweb.org/anthology/C16-1296.

Philip Schulz, Wilker Aziz, and Khalil Sima'an. Word alignment without null words. In Proceedings of ACL 2016 (Volume 2: Short Papers), pages 169–174, Berlin, Germany, August 2016. URL http://anthology.aclweb.org/P16-2028.

Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-based word alignment in statistical translation. In Proceedings of COLING 1996, pages 836–841, Stroudsburg, PA, USA, 1996. URL http://dx.doi.org/10.3115/993268.993313.