
Lexical alignment: IBM models 1 and 2
MLE via EM for categorical distributions

Wilker Aziz

April 11, 2017


Translation data

Let’s assume we are confronted with a new language and luckily we managed to obtain some sentence-aligned data

the black dog          � ~
the nice dog           � ∪
the black cat          � ~
a dog chasing a cat    � / �

Is there anything we could say about this language?


Translation by analogy

the black dog          � ~
the nice dog           � ∪
the black cat          � ~
a dog chasing a cat    � / �

A few hypotheses:

- � ⇐⇒ dog
- � ⇐⇒ cat
- ~ ⇐⇒ black
- nouns seem to precede adjectives
- determiners are probably not expressed
- chasing may be expressed by / and perhaps this language is OVS
- or perhaps chasing is realised by a verb with swapped arguments


Probabilistic lexical alignment models

This lecture is about operationalising this intuition

- through a probabilistic learning algorithm
- for a non-probabilistic approach see for example [Lardilleux and Lepage, 2009]


Content

Lexical alignment

Mixture models

IBM model 1

IBM model 2

Remarks


Word-to-word alignments

Imagine you are given a text

the black dog          o cao preto
the nice dog           o cao amigo
the black cat          o gato preto
a dog chasing a cat    um cao perseguindo um gato


Word-to-word alignments

Now imagine the French words were replaced by placeholders

the black dog F1 F2 F3

the nice dog F1 F2 F3

the black cat F1 F2 F3

a dog chasing a cat F1 F2 F3 F4 F5

and suppose our task is to have a model explain the original data by generating each French word from exactly one English word


Generative story

For each sentence pair independently,

1. observe an English sentence e_1, ..., e_m and a French sentence length n
2. for each French word position j from 1 to n
   2.1 select an English position a_j
   2.2 conditioned on the English word e_{a_j}, generate f_j

We have introduced an alignment, which is not directly visible in the data
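To make the story concrete, here is a minimal Python sketch (not from the slides) that samples a French sentence given an English one; the toy lexical table `lex` and the choice of vocabulary are assumptions for illustration only.

```python
import random

# Hypothetical lexical translation table: P(f | e).
lex = {
    "the":   {"o": 1.0},
    "black": {"preto": 1.0},
    "dog":   {"cao": 1.0},
}

def generate_french(english, n):
    """Sample a French sentence of length n following the generative story."""
    french, alignment = [], []
    for j in range(n):
        a_j = random.randrange(len(english))          # select an English position
        e = english[a_j]
        candidates = list(lex[e].items())
        f_j = random.choices([f for f, _ in candidates],
                             weights=[p for _, p in candidates])[0]  # generate f_j from P(f | e_{a_j})
        french.append(f_j)
        alignment.append(a_j)
    return french, alignment

print(generate_french(["the", "black", "dog"], n=3))
```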


Data augmentation

Observations:

the black dog    o cao preto

Imagine the data is made of pairs (a_j, f_j), i.e. e_{a_j} → f_j


the black dog    (1, the → o)   (3, dog → cao)   (2, black → preto)

the black dog    (1, the → o)   (1, the → cao)   (1, the → preto)

the black dog    (a_1, e_{a_1} → f_1)   (a_2, e_{a_2} → f_2)   (a_3, e_{a_3} → f_3)


Content

Lexical alignment

Mixture models

IBM model 1

IBM model 2

Remarks


Mixture models: generative story

[Plate diagram: latent component z generates observation x; replicated m times]

- c mixture components
- each defines a distribution over the same data space X
- plus a distribution over the components themselves

Generative story

1. select a mixture component z ∼ P (Z)

2. generate an observation from it x ∼ P (X|Z = z)


Mixture models: likelihood

[Plate diagram: latent component z generates observation x; replicated m times]

Incomplete-data likelihood

P(x_1^m) = \prod_{i=1}^{m} P(x_i)                                                                      (1)
         = \prod_{i=1}^{m} \sum_{z=1}^{c} \underbrace{P(X = x_i, Z = z)}_{\text{complete-data likelihood}}   (2)
         = \prod_{i=1}^{m} \sum_{z=1}^{c} P(Z = z) P(X = x_i | Z = z)                                   (3)


Interpretation

Missing data

- Let Z take one of c mixture components
- Assume data consists of pairs (x, z)
- x is always observed
- z is always missing

Inference: posterior distribution over possible Z for each x

P(Z = z | X = x) = \frac{P(Z = z, X = x)}{\sum_{z'=1}^{c} P(Z = z', X = x)}                          (4)
                 = \frac{P(Z = z) P(X = x | Z = z)}{\sum_{z'=1}^{c} P(Z = z') P(X = x | Z = z')}     (5)
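As a concrete illustration, a minimal sketch (my own, not from the slides) that evaluates the incomplete-data likelihood of equation (3) and the posterior responsibilities of equation (5) for a small categorical mixture; the numbers are made up.

```python
# Two-component categorical mixture over X = {"a", "b"} (hypothetical parameters).
p_z = [0.5, 0.5]                        # P(Z = z)
p_x_given_z = [
    {"a": 0.2, "b": 0.8},               # P(X | Z = 1)
    {"a": 0.7, "b": 0.3},               # P(X | Z = 2)
]

def marginal(x):
    """Incomplete-data likelihood of one observation: sum_z P(z) P(x | z)."""
    return sum(p_z[z] * p_x_given_z[z][x] for z in range(len(p_z)))

def posterior(x):
    """Posterior responsibilities P(Z = z | X = x), equation (5)."""
    joint = [p_z[z] * p_x_given_z[z][x] for z in range(len(p_z))]
    total = sum(joint)
    return [j / total for j in joint]

print(marginal("a"))    # 0.5*0.2 + 0.5*0.7 = 0.45
print(posterior("a"))   # [0.222..., 0.777...]
```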


Non-identifiability

Different parameter settings, same distribution

Suppose X = {a, b} and c = 2, and let P(Z = 1) = P(Z = 2) = 0.5

Setting 1:
  Z      X = a   X = b
  1      0.2     0.8
  2      0.7     0.3
  P(X)   0.45    0.55

Setting 2:
  Z      X = a   X = b
  1      0.7     0.3
  2      0.2     0.8
  P(X)   0.45    0.55

Problem for parameter estimation by hill climbing
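A quick check (my own, not from the slides) confirming that the two parameter settings above induce the same marginal P(X):

```python
def marginal(p_z, p_x_given_z, x):
    """P(X = x) = sum_z P(Z = z) P(X = x | Z = z)."""
    return sum(pz * pxz[x] for pz, pxz in zip(p_z, p_x_given_z))

p_z = [0.5, 0.5]
setting_1 = [{"a": 0.2, "b": 0.8}, {"a": 0.7, "b": 0.3}]
setting_2 = [{"a": 0.7, "b": 0.3}, {"a": 0.2, "b": 0.8}]

for x in ("a", "b"):
    # Both settings give P(a) = 0.45 and P(b) = 0.55: the components were merely swapped.
    print(x, marginal(p_z, setting_1, x), marginal(p_z, setting_2, x))
```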


Maximum likelihood estimation

Suppose a dataset D = {x^{(1)}, x^{(2)}, ..., x^{(m)}}
Suppose P(X) is one of a parametric family with parameters θ

Likelihood of iid observations

P(D) = \prod_{i=1}^{m} P_\theta(X = x^{(i)})

the score function is

l(\theta) = \sum_{i=1}^{m} \log P_\theta(X = x^{(i)})

then we choose

\theta^\star = \arg\max_\theta l(\theta)


MLE for categorical: estimation from fully observed data

Suppose we have complete data

- D_complete = {(x^{(1)}, z^{(1)}), ..., (x^{(m)}, z^{(m)})}

Then, for a categorical distribution

P(X = x | Z = z) = \theta_{z,x}

and n(z, x | D_complete) = count of (z, x) in D_complete

MLE solution:

\theta_{z,x} = \frac{n(z, x | D_complete)}{\sum_{x'} n(z, x' | D_complete)}
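In code, the complete-data MLE is just count-and-normalise; a minimal sketch with a made-up complete dataset:

```python
from collections import Counter, defaultdict

# Hypothetical complete data: (x, z) pairs with both variables observed.
data = [("a", 1), ("b", 1), ("b", 1), ("a", 2), ("a", 2), ("b", 2)]

counts = Counter((z, x) for x, z in data)     # n(z, x | D_complete)
totals = defaultdict(float)                   # sum_x' n(z, x' | D_complete)
for (z, _), n in counts.items():
    totals[z] += n

theta = {(z, x): n / totals[z] for (z, x), n in counts.items()}
print(theta)   # e.g. theta[(1, "b")] = 2/3 and theta[(2, "a")] = 2/3
```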


MLE for categorical: estimation from incomplete data

Expectation-Maximisation algorithm [Dempster et al., 1977]

E-step:

- for every observation x, imagine that every possible latent assignment z happened with probability P_\theta(Z = z | X = x)

D_completed = {(x, Z = 1), ..., (x, Z = c) : x ∈ D}


MLE for categorical: estimation from incomplete data

Expectation-Maximisation algorithm [Dempster et al., 1977]

M-step:

- re-estimate θ so as to climb the likelihood surface
- for categorical distributions P(X = x | Z = z) = \theta_{z,x}, where z and x are categorical, 0 ≤ \theta_{z,x} ≤ 1 and \sum_{x \in X} \theta_{z,x} = 1

\theta_{z,x} = \frac{E[n(z \to x | D_completed)]}{\sum_{x'} E[n(z \to x' | D_completed)]}                                                       (6)

            = \frac{\sum_{i=1}^{m} \sum_{z'} P(z' | x^{(i)}) \, 1_z(z') \, 1_x(x^{(i)})}{\sum_{i=1}^{m} \sum_{x'} \sum_{z'} P(z' | x^{(i)}) \, 1_z(z') \, 1_{x'}(x^{(i)})}   (7)

            = \frac{\sum_{i=1}^{m} P(z | x^{(i)}) \, 1_x(x^{(i)})}{\sum_{i=1}^{m} \sum_{x'} P(z | x^{(i)}) \, 1_{x'}(x^{(i)})}                   (8)
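Putting the E-step and M-step together, here is a minimal EM sketch for a mixture of categorical distributions (my own illustration; the toy data, number of components, and random initialisation are assumptions):

```python
import random
from collections import defaultdict

def em_categorical_mixture(data, c, iters=50, seed=0):
    """EM for a mixture of c categorical distributions over the symbols in `data`."""
    rng = random.Random(seed)
    symbols = sorted(set(data))
    p_z = [1.0 / c] * c
    # Random initialisation of P(x | z), normalised per component.
    theta = []
    for _ in range(c):
        w = [rng.random() for _ in symbols]
        s = sum(w)
        theta.append({x: wi / s for x, wi in zip(symbols, w)})

    for _ in range(iters):
        # E-step: posterior responsibilities P(z | x) for each observation.
        counts_z = [0.0] * c
        counts_zx = [defaultdict(float) for _ in range(c)]
        for x in data:
            joint = [p_z[z] * theta[z][x] for z in range(c)]
            total = sum(joint)
            for z in range(c):
                r = joint[z] / total
                counts_z[z] += r
                counts_zx[z][x] += r
        # M-step: re-estimate parameters from expected counts (eq. 8).
        p_z = [counts_z[z] / len(data) for z in range(c)]
        theta = [{x: counts_zx[z][x] / counts_z[z] for x in symbols} for z in range(c)]
    return p_z, theta

data = ["a", "a", "b", "b", "b", "a", "b", "b"]
print(em_categorical_mixture(data, c=2))
```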


Content

Lexical alignment

Mixture models

IBM model 1

IBM model 2

Remarks


IBM1: a constrained mixture model

[Plate diagram: for each of S sentence pairs, observe e_0^m (with lengths m, n); for j = 1..n, a latent a_j selects the English word that generates f_j]

Constrained mixture model

- mixture components are English words
- but only English words that appear in the English sentence can be assigned
- a_j acts as an indicator for the mixture component that generates French word f_j
- e_0 is occupied by a special Null component


Parameterisation

Alignment distribution: uniform

P(A | M = m, N = n) = \frac{1}{m + 1}                 (9)

Lexical distribution: categorical

P(F | E = e) = Cat(F | \theta_e)                      (10)

- where \theta_e \in R^{v_F}
- 0 ≤ \theta_{e,f} ≤ 1
- \sum_f \theta_{e,f} = 1


IBM1: incomplete-data likelihood

Incomplete-data likelihood

P(f_1^n | e_0^m) = \sum_{a_1=0}^{m} \cdots \sum_{a_n=0}^{m} P(f_1^n, a_1^n | e_0^m)                          (11)
                 = \sum_{a_1=0}^{m} \cdots \sum_{a_n=0}^{m} \prod_{j=1}^{n} P(a_j | m, n) P(f_j | e_{a_j})   (12)
                 = \prod_{j=1}^{n} \sum_{a_j=0}^{m} P(a_j | m, n) P(f_j | e_{a_j})                           (13)
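Equation (13) says the likelihood factorises over French positions, so it can be computed without enumerating alignments. A small sketch under assumed toy parameters (the lexical table `t` and the sentences are illustrative only, not from the slides):

```python
import math

# Hypothetical lexical probabilities t[e][f] = P(f | e); "NULL" plays the role of e_0.
t = {
    "NULL":  {"o": 0.1, "cao": 0.1, "preto": 0.1},
    "the":   {"o": 0.7, "cao": 0.1, "preto": 0.1},
    "black": {"o": 0.1, "cao": 0.1, "preto": 0.6},
    "dog":   {"o": 0.1, "cao": 0.7, "preto": 0.1},
}

def ibm1_log_likelihood(english, french):
    """log P(f_1^n | e_0^m) under IBM model 1 with a uniform alignment distribution (eq. 13)."""
    e = ["NULL"] + english            # e_0 is the Null word
    m = len(english)
    ll = 0.0
    for f_j in french:
        ll += math.log(sum((1.0 / (m + 1)) * t[e_i][f_j] for e_i in e))
    return ll

print(ibm1_log_likelihood(["the", "black", "dog"], ["o", "cao", "preto"]))
```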


IBM1: posterior

Posterior

P(a_1^n | f_1^n, e_0^m) = \frac{P(f_1^n, a_1^n | e_0^m)}{P(f_1^n | e_0^m)}                                   (14)

Factorised

P(a_j | f_1^n, e_0^m) = \frac{P(a_j | m, n) P(f_j | e_{a_j})}{\sum_{i=0}^{m} P(i | m, n) P(f_j | e_i)}       (15)


MLE via EM

E-step:

E[n(e \to f | A_1^n)] = \sum_{a_1=0}^{m} \cdots \sum_{a_n=0}^{m} P(a_1^n | f_1^n, e_0^m) \, n(e \to f | a_1^n)                                      (16)
                      = \sum_{a_1=0}^{m} \cdots \sum_{a_n=0}^{m} \prod_{j'=1}^{n} P(a_{j'} | f_1^n, e_0^m) \sum_{j=1}^{n} 1_e(e_{a_j}) \, 1_f(f_j)   (17)
                      = \sum_{j=1}^{n} \sum_{i=0}^{m} P(A_j = i | f_1^n, e_0^m) \, 1_e(e_i) \, 1_f(f_j)                                              (18)

M-step:

\theta_{e,f} = \frac{E[n(e \to f | A_1^n)]}{\sum_{f'} E[n(e \to f' | A_1^n)]}    (19)


EM algorithm

Repeat until convergence to a local optimum

1. For each sentence pair

   1.1 compute the posterior per alignment link
   1.2 accumulate fractional counts

2. Normalise counts for each English word
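Below is a minimal sketch of this whole loop for IBM model 1 with a uniform alignment distribution. It is my own illustration rather than a reference implementation; the toy parallel corpus and the use of a `NULL` token for e_0 are assumptions.

```python
from collections import defaultdict

# Toy sentence-aligned corpus (English, French); purely illustrative.
corpus = [
    (["the", "black", "dog"], ["o", "cao", "preto"]),
    (["the", "nice", "dog"],  ["o", "cao", "amigo"]),
    (["the", "black", "cat"], ["o", "gato", "preto"]),
]

def train_ibm1(corpus, iterations=10):
    """EM for IBM model 1: uniform alignments, categorical lexical distribution."""
    f_vocab = {f for _, fs in corpus for f in fs}
    # Initialise t(f | e) uniformly.
    t = defaultdict(lambda: defaultdict(lambda: 1.0 / len(f_vocab)))

    for _ in range(iterations):
        counts = defaultdict(lambda: defaultdict(float))   # expected n(e -> f)
        totals = defaultdict(float)                        # sum_f expected n(e -> f)
        for english, french in corpus:
            e = ["NULL"] + english                         # e_0 is the Null word
            for f_j in french:
                # E-step: posterior of each alignment link (eq. 15); the uniform
                # alignment prior 1/(m+1) cancels in the normalisation.
                norm = sum(t[e_i][f_j] for e_i in e)
                for e_i in e:
                    p = t[e_i][f_j] / norm
                    counts[e_i][f_j] += p
                    totals[e_i] += p
        # M-step: normalise fractional counts per English word (eq. 19).
        for e_i, row in counts.items():
            for f_j, c in row.items():
                t[e_i][f_j] = c / totals[e_i]
    return t

t = train_ibm1(corpus)
print(max(t["dog"], key=t["dog"].get))   # expected to converge towards "cao"
```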


Content

Lexical alignment

Mixture models

IBM model 1

IBM model 2

Remarks


Alignment distribution

Positional distribution

P(A_j | M = m, N = n) = Cat(A_j | \lambda_{j,m,n})

- one distribution for each tuple (j, m, n)
- support must include the length of the longest English sentence
- extremely over-parameterised!

Jump distribution [Vogel et al., 1996]

- define a jump function \delta(a_j, j, m, n) = a_j - \lfloor j \cdot m / n \rfloor
- P(A_j | m, n) = Cat(\Delta | \lambda)
- \Delta takes values from -longest to +longest
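A tiny sketch of the jump parameterisation (illustrative; the sentence lengths are made up): all (j, m, n) configurations share a single categorical over jump values, so the number of parameters no longer grows with sentence length.

```python
import math

def jump(a_j, j, m, n):
    """delta(a_j, j, m, n) = a_j - floor(j * m / n): how far the chosen English
    position deviates from the 'diagonal' position for French position j."""
    return a_j - math.floor(j * m / n)

# A French position j = 2 in a sentence pair with m = 6, n = 4:
# the diagonal position is floor(2 * 6 / 4) = 3.
print(jump(3, 2, 6, 4))   # 0: exactly on the diagonal
print(jump(5, 2, 6, 4))   # +2: jump two positions to the right
```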


Content

Lexical alignment

Mixture models

IBM model 1

IBM model 2

Remarks


Note on terminology: source/target vs French/English

From an alignment model perspective all that matters is

- we condition on one language and generate the other
- in IBM models terminology, we condition on English and generate French

From a noisy channel perspective, where we want to translate a source sentence f_1^n into some target sentence e_1^m

- Bayes rule decomposes p(e_1^m | f_1^n) ∝ p(f_1^n | e_1^m) p(e_1^m)
- train p(e_1^m) and p(f_1^n | e_1^m) independently
- language model: p(e_1^m)
- alignment model: p(f_1^n | e_1^m)
- note that the alignment model conditions on the target sentence (English) and generates the source sentence (French)


Limitations of IBM1-2

- too strong independence assumptions
- categorical parameterisation suffers from data sparsity
- EM suffers from local optima


Extensions

Fertility, distortion, and concepts [Brown et al., 1993]

Dirichlet priors and posterior inference [Mermer and Saraclar, 2011]

- + no Null words [Schulz et al., 2016]
- + HMM and efficient sampler [Schulz and Aziz, 2016]

Log-linear distortion parameters and variational Bayes [Dyer et al., 2013]

First-order dependency (HMM) [Vogel et al., 1996]

- E-step requires dynamic programming [Baum and Petrie, 1966]


References

L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, 37:1554–1563, 1966.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2):263–311, June 1993. ISSN 0891-2017. URL http://dl.acm.org/citation.cfm?id=972470.972474.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1–38, 1977.


Chris Dyer, Victor Chahuneau, and Noah A. Smith. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia, June 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/N13-1073.

Adrien Lardilleux and Yves Lepage. Sampling-based multilingual alignment. In Proceedings of the International Conference RANLP-2009, pages 214–218, Borovets, Bulgaria, September 2009. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/R09-1040.


Coskun Mermer and Murat Saraclar. Bayesian word alignment for statistical machine translation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 182–187, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-2032.

Philip Schulz and Wilker Aziz. Fast collocation-based Bayesian HMM word alignment. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3146–3155, Osaka, Japan, December 2016. The COLING 2016 Organizing Committee. URL http://aclweb.org/anthology/C16-1296.


Philip Schulz, Wilker Aziz, and Khalil Sima’an. Word alignment without null words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 169–174, Berlin, Germany, August 2016. Association for Computational Linguistics. URL http://anthology.aclweb.org/P16-2028.

Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics - Volume 2, COLING ’96, pages 836–841, Stroudsburg, PA, USA, 1996. Association for Computational Linguistics. doi: 10.3115/993268.993313. URL http://dx.doi.org/10.3115/993268.993313.
