
A derivation of the sampling formulas for

An Entity-Topic Model for Entity Linking [Han+ EMNLP-CoNLL12]

and

A Context-Aware Topic Model for Statistical Machine Translation [Su+ ACL15]

Tomonari MASADA @ Nagasaki University

September 17, 2015

The full joint distribution is obtained as follows.

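$$
\begin{aligned}
& p(\mathbf{m}, \mathbf{w}, \mathbf{z}, \mathbf{e}, \mathbf{a}, \boldsymbol{\theta}, \boldsymbol{\phi}, \boldsymbol{\psi}, \boldsymbol{\xi} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\iota}) \\
&= \prod_{d=1}^{D} \Big[ p(\mathbf{m}_d \mid \mathbf{e}_d, \boldsymbol{\psi})\, p(\mathbf{e}_d \mid \mathbf{z}_d, \boldsymbol{\phi})\, p(\mathbf{z}_d \mid \boldsymbol{\theta}_d)\, p(\mathbf{w}_d \mid \mathbf{a}_d, \boldsymbol{\xi})\, p(\mathbf{a}_d \mid \mathbf{e}_d) \Big] \\
&\quad \cdot \prod_{d=1}^{D} p(\boldsymbol{\theta}_d \mid \boldsymbol{\alpha}) \cdot \prod_{k=1}^{K} p(\boldsymbol{\phi}_k \mid \boldsymbol{\beta}) \cdot \prod_{t=1}^{T} p(\boldsymbol{\psi}_t \mid \boldsymbol{\gamma}) \cdot \prod_{t=1}^{T} p(\boldsymbol{\xi}_t \mid \boldsymbol{\iota}) \\
&= \prod_{d=1}^{D} \Big[ \Big\{ \prod_{i=1}^{M_d} p(m_{di} \mid \boldsymbol{\psi}_{e_{di}})\, p(e_{di} \mid \boldsymbol{\phi}_{z_{di}})\, p(z_{di} \mid \boldsymbol{\theta}_d) \Big\} \Big\{ \prod_{n=1}^{N_d} p(w_{dn} \mid \boldsymbol{\xi}_{a_{dn}})\, p(a_{dn} \mid \mathbf{e}_d) \Big\} \Big] \\
&\quad \cdot \prod_{d=1}^{D} p(\boldsymbol{\theta}_d \mid \boldsymbol{\alpha}) \cdot \prod_{k=1}^{K} p(\boldsymbol{\phi}_k \mid \boldsymbol{\beta}) \cdot \prod_{t=1}^{T} p(\boldsymbol{\psi}_t \mid \boldsymbol{\gamma}) \cdot \prod_{t=1}^{T} p(\boldsymbol{\xi}_t \mid \boldsymbol{\iota}) \\
&= \prod_{d=1}^{D} \Big[ \Big\{ \prod_{i=1}^{M_d} \prod_{k=1}^{K} \prod_{t=1}^{T} \big( \psi_{t,m_{di}}\, \phi_{k,t}\, \theta_{d,k} \big)^{\Delta(z_{di}=k \,\wedge\, e_{di}=t)} \Big\} \Big\{ \prod_{n=1}^{N_d} \prod_{t=1}^{T} \Big( \xi_{t,w_{dn}} \frac{\sum_{i=1}^{M_d} \Delta(e_{di}=t)}{M_d} \Big)^{\Delta(a_{dn}=t)} \Big\} \Big] \\
&\quad \cdot \prod_{d=1}^{D} p(\boldsymbol{\theta}_d \mid \boldsymbol{\alpha}) \cdot \prod_{k=1}^{K} p(\boldsymbol{\phi}_k \mid \boldsymbol{\beta}) \cdot \prod_{t=1}^{T} p(\boldsymbol{\psi}_t \mid \boldsymbol{\gamma}) \cdot \prod_{t=1}^{T} p(\boldsymbol{\xi}_t \mid \boldsymbol{\iota}) \\
&= \prod_{d=1}^{D} \Big[ \Big\{ \prod_{i=1}^{M_d} \prod_{t=1}^{T} \psi_{t,m_{di}}^{\Delta(e_{di}=t)} \Big\} \Big\{ \prod_{i=1}^{M_d} \prod_{k=1}^{K} \prod_{t=1}^{T} \phi_{k,t}^{\Delta(z_{di}=k \,\wedge\, e_{di}=t)} \Big\} \Big\{ \prod_{i=1}^{M_d} \prod_{k=1}^{K} \theta_{d,k}^{\Delta(z_{di}=k)} \Big\} \Big\{ \prod_{n=1}^{N_d} \prod_{t=1}^{T} \xi_{t,w_{dn}}^{\Delta(a_{dn}=t)} \Big\} \Big] \\
&\quad \cdot \prod_{d=1}^{D} \Big[ \prod_{n=1}^{N_d} \prod_{t=1}^{T} \Big( \frac{\sum_{i=1}^{M_d} \Delta(e_{di}=t)}{M_d} \Big)^{\Delta(a_{dn}=t)} \Big] \\
&\quad \cdot \prod_{d=1}^{D} p(\boldsymbol{\theta}_d \mid \boldsymbol{\alpha}) \cdot \prod_{k=1}^{K} p(\boldsymbol{\phi}_k \mid \boldsymbol{\beta}) \cdot \prod_{t=1}^{T} p(\boldsymbol{\psi}_t \mid \boldsymbol{\gamma}) \cdot \prod_{t=1}^{T} p(\boldsymbol{\xi}_t \mid \boldsymbol{\iota}) \\
&= \prod_{u=1}^{U} \prod_{t=1}^{T} \psi_{t,u}^{C_{t,u}} \cdot \prod_{k=1}^{K} \prod_{t=1}^{T} \phi_{k,t}^{C_{k,t}} \cdot \prod_{d=1}^{D} \prod_{k=1}^{K} \theta_{d,k}^{C_{d,k}} \cdot \prod_{t=1}^{T} \prod_{v=1}^{V} \xi_{t,v}^{C_{t,v}} \cdot \prod_{d=1}^{D} \prod_{t=1}^{T} \Big( \frac{M_{d,t}}{M_d} \Big)^{N_{d,t}} \\
&\quad \cdot \prod_{d=1}^{D} p(\boldsymbol{\theta}_d \mid \boldsymbol{\alpha}) \cdot \prod_{k=1}^{K} p(\boldsymbol{\phi}_k \mid \boldsymbol{\beta}) \cdot \prod_{t=1}^{T} p(\boldsymbol{\psi}_t \mid \boldsymbol{\gamma}) \cdot \prod_{t=1}^{T} p(\boldsymbol{\xi}_t \mid \boldsymbol{\iota}),
\end{aligned}
\tag{1}
$$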

where ∆(·) is 1 if the proposition in the parentheses is true and is 0 otherwise.

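$N_{d,t}$ and $M_{d,t}$ are defined as follows: $N_{d,t} \equiv \sum_{n=1}^{N_d} \Delta(a_{dn} = t)$; $M_{d,t} \equiv \sum_{i=1}^{M_d} \Delta(e_{di} = t)$.

The $C$s are defined as follows: $C_{t,u} \equiv \sum_{d=1}^{D} \sum_{i=1}^{M_d} \Delta(e_{di} = t \wedge m_{di} = u)$; $C_{k,t} \equiv \sum_{d=1}^{D} \sum_{i=1}^{M_d} \Delta(z_{di} = k \wedge e_{di} = t)$; $C_{d,k} \equiv \sum_{i=1}^{M_d} \Delta(z_{di} = k)$; $C_{t,v} \equiv \sum_{d=1}^{D} \sum_{n=1}^{N_d} \Delta(a_{dn} = t \wedge w_{dn} = v)$.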
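As an illustration of these definitions, the following is a minimal sketch that accumulates all six count tables from per-document assignment lists. The function name `count_statistics`, the list-of-lists data layout, and the toy data are assumptions made for this example only; they do not come from either paper.

```python
from collections import Counter

def count_statistics(mentions, entities, topics, words, word_entities):
    """Accumulate the counts defined above (hypothetical data layout).

    mentions[d][i] = m_di, entities[d][i] = e_di, topics[d][i] = z_di,
    words[d][n] = w_dn, word_entities[d][n] = a_dn.
    """
    C_tu, C_kt, C_dk, C_tv = Counter(), Counter(), Counter(), Counter()
    M_dt, N_dt = Counter(), Counter()
    for d in range(len(mentions)):
        for m, e, z in zip(mentions[d], entities[d], topics[d]):
            C_tu[e, m] += 1   # entity t emits mention type u
            C_kt[z, e] += 1   # topic k emits entity t
            C_dk[d, z] += 1   # document d uses topic k
            M_dt[d, e] += 1   # mentions of document d assigned to entity t
        for w, a in zip(words[d], word_entities[d]):
            C_tv[a, w] += 1   # entity t emits word type v
            N_dt[d, a] += 1   # words of document d assigned to entity t
    return C_tu, C_kt, C_dk, C_tv, M_dt, N_dt

# Toy corpus with two documents (made-up data).
mentions = [[0, 1], [1]]
entities = [[0, 1], [1]]
topics = [[0, 0], [1]]
words = [[2, 3, 2], [3]]
word_entities = [[0, 0, 1], [1]]
C_tu, C_kt, C_dk, C_tv, M_dt, N_dt = count_statistics(
    mentions, entities, topics, words, word_entities
)
```

Keeping the tables as sparse `Counter` objects keyed by index pairs is only one possible choice; dense arrays would serve equally well for small $T$, $K$, $U$, and $V$.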


We marginalize the multinomial parameters out.

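$$
\begin{aligned}
& p(\mathbf{m}, \mathbf{w}, \mathbf{z}, \mathbf{e}, \mathbf{a} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\iota}) = \int p(\mathbf{m}, \mathbf{w}, \mathbf{z}, \mathbf{e}, \mathbf{a}, \boldsymbol{\theta}, \boldsymbol{\phi}, \boldsymbol{\psi}, \boldsymbol{\xi} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\iota})\, d\boldsymbol{\theta}\, d\boldsymbol{\phi}\, d\boldsymbol{\psi}\, d\boldsymbol{\xi} \\
&= \prod_{t=1}^{T} \frac{\prod_{u} \Gamma(C_{t,u} + \gamma_u)}{\Gamma(C_t + \sum_{u} \gamma_u)} \frac{\Gamma(\sum_{u} \gamma_u)}{\prod_{u} \Gamma(\gamma_u)}
\cdot \prod_{k=1}^{K} \frac{\prod_{t} \Gamma(C_{k,t} + \beta_t)}{\Gamma(C_k + \sum_{t} \beta_t)} \frac{\Gamma(\sum_{t} \beta_t)}{\prod_{t} \Gamma(\beta_t)} \\
&\quad \cdot \prod_{d=1}^{D} \frac{\prod_{k} \Gamma(C_{d,k} + \alpha_k)}{\Gamma(M_d + \sum_{k} \alpha_k)} \frac{\Gamma(\sum_{k} \alpha_k)}{\prod_{k} \Gamma(\alpha_k)}
\cdot \prod_{t=1}^{T} \frac{\prod_{v} \Gamma(C_{t,v} + \iota_v)}{\Gamma(C_t + \sum_{v} \iota_v)} \frac{\Gamma(\sum_{v} \iota_v)}{\prod_{v} \Gamma(\iota_v)} \\
&\quad \cdot \prod_{d=1}^{D} \prod_{t=1}^{T} \Big( \frac{M_{d,t}}{M_d} \Big)^{N_{d,t}}
\end{aligned}
\tag{2}
$$

Here $C_k \equiv \sum_{t} C_{k,t}$. The symbol $C_t$ stands for $\sum_{u} C_{t,u}$ in the first factor (mentions assigned to entity $t$) and for $\sum_{v} C_{t,v}$ in the fourth factor (word tokens assigned to entity $t$). Each Gamma-function ratio is the result of integrating one Dirichlet prior against the matching product of multinomial parameters in Eq. (1).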
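Each of the four Gamma-function factors in Eq. (2) has the same Dirichlet-multinomial shape, so one helper suffices to evaluate any of them in log space. The sketch below uses only `math.lgamma` from the standard library; the function name `log_dirichlet_multinomial` and the example numbers are illustrative assumptions.

```python
from math import lgamma, exp

def log_dirichlet_multinomial(counts, prior):
    """Log of Gamma(sum prior) / prod Gamma(prior)
    times prod Gamma(count + prior) / Gamma(sum count + sum prior).

    `counts` is one row of a count table (for example C_{t,1..U}) and
    `prior` holds the matching Dirichlet hyperparameters.
    """
    value = lgamma(sum(prior)) - lgamma(sum(counts) + sum(prior))
    for c, p in zip(counts, prior):
        value += lgamma(c + p) - lgamma(p)
    return value

# One observation under a symmetric prior of 1 over two outcomes has
# marginal probability 1/2; an empty count row contributes a factor of 1.
one_observation = exp(log_dirichlet_multinomial([1, 0], [1.0, 1.0]))
no_observation = exp(log_dirichlet_multinomial([0, 0], [1.0, 1.0]))
```

Summing this helper over the rows of each count table, and adding $\sum_{d}\sum_{t} N_{d,t} \log(M_{d,t}/M_d)$, would give the logarithm of Eq. (2).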

We remove the ith mention in the dth document.

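$$
\begin{aligned}
& p(\mathbf{m}^{-di}, \mathbf{w}, \mathbf{z}^{-di}, \mathbf{e}^{-di}, \mathbf{a} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\iota}) \\
&= \prod_{t=1}^{T} \frac{\prod_{u} \Gamma(C^{-di}_{t,u} + \gamma_u)}{\Gamma(C^{-di}_t + \sum_{u} \gamma_u)} \frac{\Gamma(\sum_{u} \gamma_u)}{\prod_{u} \Gamma(\gamma_u)}
\cdot \prod_{k=1}^{K} \frac{\prod_{t} \Gamma(C^{-di}_{k,t} + \beta_t)}{\Gamma(C^{-di}_k + \sum_{t} \beta_t)} \frac{\Gamma(\sum_{t} \beta_t)}{\prod_{t} \Gamma(\beta_t)} \\
&\quad \cdot \prod_{d=1}^{D} \frac{\prod_{k} \Gamma(C^{-di}_{d,k} + \alpha_k)}{\Gamma(M_d - 1 + \sum_{k} \alpha_k)} \frac{\Gamma(\sum_{k} \alpha_k)}{\prod_{k} \Gamma(\alpha_k)}
\cdot \prod_{t=1}^{T} \frac{\prod_{v} \Gamma(C_{t,v} + \iota_v)}{\Gamma(C_t + \sum_{v} \iota_v)} \frac{\Gamma(\sum_{v} \iota_v)}{\prod_{v} \Gamma(\iota_v)} \\
&\quad \cdot \prod_{d=1}^{D} \prod_{t=1}^{T} \Big( \frac{M^{-di}_{d,t}}{M_d - 1} \Big)^{N_{d,t}}
\end{aligned}
\tag{3}
$$

The superscript $-di$ marks a count computed with the $i$th mention of the $d$th document excluded. The replacement of $M_d$ by $M_d - 1$ and of $M_{d,t}$ by $M^{-di}_{d,t}$ applies only to the document that contains the removed mention; the factors of all other documents are the same as in Eq. (2).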

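We then add the mention back, keeping its mention type $m_{di}$ but setting the latent variables to $z_{di} = k$ and $e_{di} = t$.

$$
\begin{aligned}
& p(m_{di}, z_{di} = k, e_{di} = t \mid \mathbf{m}^{-di}, \mathbf{w}, \mathbf{z}^{-di}, \mathbf{e}^{-di}, \mathbf{a}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\iota}) \\
&= \frac{p(m_{di}, z_{di} = k, e_{di} = t, \mathbf{m}^{-di}, \mathbf{w}, \mathbf{z}^{-di}, \mathbf{e}^{-di}, \mathbf{a} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\iota})}{p(\mathbf{m}^{-di}, \mathbf{w}, \mathbf{z}^{-di}, \mathbf{e}^{-di}, \mathbf{a} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\iota})} \\
&= \frac{\Gamma(C^{-di}_{t,m_{di}} + 1 + \gamma_{m_{di}})}{\Gamma(C^{-di}_t + 1 + \sum_{u} \gamma_u)} \frac{\Gamma(C^{-di}_t + \sum_{u} \gamma_u)}{\Gamma(C^{-di}_{t,m_{di}} + \gamma_{m_{di}})}
\cdot \frac{\Gamma(C^{-di}_{k,t} + 1 + \beta_t)}{\Gamma(C^{-di}_k + 1 + \sum_{t} \beta_t)} \frac{\Gamma(C^{-di}_k + \sum_{t} \beta_t)}{\Gamma(C^{-di}_{k,t} + \beta_t)} \\
&\quad \cdot \frac{\Gamma(C^{-di}_{d,k} + 1 + \alpha_k)}{\Gamma(M_d + \sum_{k} \alpha_k)} \frac{\Gamma(M_d - 1 + \sum_{k} \alpha_k)}{\Gamma(C^{-di}_{d,k} + \alpha_k)}
\cdot \Big( \frac{M^{-di}_{d,t} + 1}{M^{-di}_{d,t}} \Big)^{N_{d,t}} \Big( \frac{M_d - 1}{M_d} \Big)^{N_d} \\
&= \frac{C^{-di}_{t,m_{di}} + \gamma_{m_{di}}}{C^{-di}_t + \sum_{u} \gamma_u}
\cdot \frac{C^{-di}_{k,t} + \beta_t}{C^{-di}_k + \sum_{t} \beta_t}
\cdot \frac{C^{-di}_{d,k} + \alpha_k}{M_d - 1 + \sum_{k} \alpha_k}
\cdot \Big( \frac{M^{-di}_{d,t} + 1}{M^{-di}_{d,t}} \Big)^{N_{d,t}} \Big( \frac{M_d - 1}{M_d} \Big)^{N_d}
\end{aligned}
\tag{4}
$$

The last two factors come from the ratio of $\prod_{t'} (M_{d,t'} / M_d)^{N_{d,t'}}$ to $\prod_{t'} (M^{-di}_{d,t'} / (M_d - 1))^{N_{d,t'}}$. Only the entity $t$ receiving the mention changes its numerator, whereas every entity changes its denominator from $M_d - 1$ to $M_d$, which gives the exponent $N_d = \sum_{t'} N_{d,t'}$. The factor $((M_d - 1)/M_d)^{N_d}$ therefore depends on neither $k$ nor $t$.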
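The step from the Gamma-function ratios to the simple fractions in Eq. (4) relies on the recurrence $\Gamma(x + 1) = x\,\Gamma(x)$. Written out for the mention factor and for the document-topic factor as a worked step (the remaining factor follows the same pattern):

```latex
\begin{align*}
\frac{\Gamma(C^{-di}_{t,m_{di}} + 1 + \gamma_{m_{di}})}{\Gamma(C^{-di}_{t,m_{di}} + \gamma_{m_{di}})}
  &= C^{-di}_{t,m_{di}} + \gamma_{m_{di}}, &
\frac{\Gamma(C^{-di}_t + \sum_u \gamma_u)}{\Gamma(C^{-di}_t + 1 + \sum_u \gamma_u)}
  &= \frac{1}{C^{-di}_t + \sum_u \gamma_u}, \\
\frac{\Gamma(C^{-di}_{d,k} + 1 + \alpha_k)}{\Gamma(C^{-di}_{d,k} + \alpha_k)}
  &= C^{-di}_{d,k} + \alpha_k, &
\frac{\Gamma(M_d - 1 + \sum_k \alpha_k)}{\Gamma(M_d + \sum_k \alpha_k)}
  &= \frac{1}{M_d - 1 + \sum_k \alpha_k}.
\end{align*}
```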

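Therefore, $z_{di}$ can be updated based on the following probabilities, where $t$ is the current value of $e_{di}$:

$$
\begin{aligned}
& p(z_{di} = k \mid \mathbf{m}, \mathbf{w}, \mathbf{z}^{-di}, \mathbf{e}, \mathbf{a}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\iota}) \\
&= \frac{p(m_{di}, z_{di} = k, e_{di} = t \mid \mathbf{m}^{-di}, \mathbf{w}, \mathbf{z}^{-di}, \mathbf{e}^{-di}, \mathbf{a}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\iota})}{\sum_{k'=1}^{K} p(m_{di}, z_{di} = k', e_{di} = t \mid \mathbf{m}^{-di}, \mathbf{w}, \mathbf{z}^{-di}, \mathbf{e}^{-di}, \mathbf{a}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\iota})} \\
&\propto \frac{C^{-di}_{k,t} + \beta_t}{C^{-di}_k + \sum_{t} \beta_t} \cdot \frac{C^{-di}_{d,k} + \alpha_k}{M_d - 1 + \sum_{k} \alpha_k}
\end{aligned}
\tag{5}
$$

Substituting Eq. (4) into the numerator and the denominator, every factor that does not depend on $k$ cancels, which leaves the proportionality above.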
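A minimal sketch of one draw of $z_{di}$ according to Eq. (5) is given below. It assumes the caller has already decremented the counts for mention $(d, i)$; the function name `topic_probabilities`, the argument layout, and the numbers are hypothetical. The denominator of the document-topic term is the same for every $k$, so the sketch omits it before normalizing.

```python
import random

def topic_probabilities(C_kt, C_k, C_dk, beta_t, beta_sum, alpha):
    """Normalized Eq. (5) probabilities for z_di = 0, ..., K-1.

    All counts must already exclude mention (d, i):
    C_kt[k] = C^{-di}_{k,t} for the current entity t = e_di,
    C_k[k]  = C^{-di}_k, and C_dk[k] = C^{-di}_{d,k}.
    beta_sum is the sum of beta over all entities.
    """
    weights = [
        (C_kt[k] + beta_t) / (C_k[k] + beta_sum) * (C_dk[k] + alpha[k])
        for k in range(len(alpha))
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Made-up counts for K = 2 topics and T = 4 entities with beta_t = 0.5.
probs = topic_probabilities(
    C_kt=[3, 0], C_k=[5, 2], C_dk=[1, 1],
    beta_t=0.5, beta_sum=2.0, alpha=[1.0, 1.0],
)
new_topic = random.choices(range(len(probs)), weights=probs)[0]
```

After the draw, a sampler would increment $C_{k,t}$, $C_k$, and $C_{d,k}$ for the new topic before moving on.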

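Further, $e_{di}$ can be updated based on the following probabilities, where $k$ is the current value of $z_{di}$:

$$
\begin{aligned}
& p(e_{di} = t \mid \mathbf{m}, \mathbf{w}, \mathbf{z}, \mathbf{e}^{-di}, \mathbf{a}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\iota}) \\
&= \frac{p(m_{di}, z_{di} = k, e_{di} = t \mid \mathbf{m}^{-di}, \mathbf{w}, \mathbf{z}^{-di}, \mathbf{e}^{-di}, \mathbf{a}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\iota})}{\sum_{t'=1}^{T} p(m_{di}, z_{di} = k, e_{di} = t' \mid \mathbf{m}^{-di}, \mathbf{w}, \mathbf{z}^{-di}, \mathbf{e}^{-di}, \mathbf{a}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\iota})} \\
&\propto \frac{C^{-di}_{t,m_{di}} + \gamma_{m_{di}}}{C^{-di}_t + \sum_{u} \gamma_u}
\cdot \frac{C^{-di}_{k,t} + \beta_t}{C^{-di}_k + \sum_{t} \beta_t}
\cdot \Big( \frac{M^{-di}_{d,t} + 1}{M^{-di}_{d,t}} \Big)^{N_{d,t}}
\end{aligned}
\tag{6}
$$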
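The ratio $((M^{-di}_{d,t} + 1)/M^{-di}_{d,t})^{N_{d,t}}$ in Eq. (6) is undefined when $M^{-di}_{d,t} = 0$. One way to sidestep this, sketched below as an assumption of this example rather than something stated in the derivation, is to evaluate $\prod_{t'} (M_{d,t'})^{N_{d,t'}}$ with the candidate entity included. Whenever all the $M^{-di}_{d,t'}$ are positive, this product differs from the ratio only by a factor that is constant in $t$. The function name `entity_weights` and the numbers are hypothetical.

```python
def entity_weights(C_tm, C_t, gamma_m, gamma_sum, C_kt, beta, M_dt, N_dt):
    """Unnormalized Eq. (6) weights for e_di = 0, ..., T-1.

    All mention counts exclude mention (d, i):
    C_tm[t] = C^{-di}_{t, m_di}, C_t[t] = C^{-di}_t,
    C_kt[t] = C^{-di}_{k,t} for the current topic k = z_di,
    M_dt[t] = M^{-di}_{d,t}, and N_dt[t] = N_{d,t}.
    """
    T = len(beta)
    weights = []
    for t in range(T):
        mention_term = (C_tm[t] + gamma_m) / (C_t[t] + gamma_sum)
        # C^{-di}_k + sum(beta) is the same for every t, so it is dropped.
        topic_term = C_kt[t] + beta[t]
        # Product form of the word-assignment factor with candidate t added.
        word_term = 1.0
        for s in range(T):
            m = M_dt[s] + (1 if s == t else 0)
            word_term *= float(m) ** N_dt[s]
        weights.append(mention_term * topic_term * word_term)
    return weights

# Made-up counts for T = 2 entities and U = 2 mention types.
weights = entity_weights(
    C_tm=[2, 0], C_t=[4, 1], gamma_m=0.5, gamma_sum=1.0,
    C_kt=[1, 1], beta=[0.5, 0.5], M_dt=[1, 1], N_dt=[2, 0],
)
```

For long documents the product can overflow or underflow, so a practical sampler would likely accumulate these terms in log space instead.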


We remove the nth word token in the dth document.

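$$
\begin{aligned}
& p(\mathbf{m}, \mathbf{w}^{-dn}, \mathbf{z}, \mathbf{e}, \mathbf{a}^{-dn} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\iota}) \\
&= \prod_{t=1}^{T} \frac{\prod_{u} \Gamma(C_{t,u} + \gamma_u)}{\Gamma(C_t + \sum_{u} \gamma_u)} \frac{\Gamma(\sum_{u} \gamma_u)}{\prod_{u} \Gamma(\gamma_u)}
\cdot \prod_{k=1}^{K} \frac{\prod_{t} \Gamma(C_{k,t} + \beta_t)}{\Gamma(C_k + \sum_{t} \beta_t)} \frac{\Gamma(\sum_{t} \beta_t)}{\prod_{t} \Gamma(\beta_t)} \\
&\quad \cdot \prod_{d=1}^{D} \frac{\prod_{k} \Gamma(C_{d,k} + \alpha_k)}{\Gamma(M_d + \sum_{k} \alpha_k)} \frac{\Gamma(\sum_{k} \alpha_k)}{\prod_{k} \Gamma(\alpha_k)}
\cdot \prod_{t=1}^{T} \frac{\prod_{v} \Gamma(C^{-dn}_{t,v} + \iota_v)}{\Gamma(C^{-dn}_t + \sum_{v} \iota_v)} \frac{\Gamma(\sum_{v} \iota_v)}{\prod_{v} \Gamma(\iota_v)} \\
&\quad \cdot \prod_{d=1}^{D} \prod_{t=1}^{T} \Big( \frac{M_{d,t}}{M_d} \Big)^{N^{-dn}_{d,t}}
\end{aligned}
\tag{7}
$$

The superscript $-dn$ marks a count computed with the $n$th word token of the $d$th document excluded, and $C^{-dn}_t \equiv \sum_{v} C^{-dn}_{t,v}$.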

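We then add the word token back, keeping its word type $w_{dn}$ but setting the latent variable to $a_{dn} = t$.

$$
\begin{aligned}
& p(w_{dn}, a_{dn} = t \mid \mathbf{m}, \mathbf{w}^{-dn}, \mathbf{z}, \mathbf{e}, \mathbf{a}^{-dn}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\iota}) \\
&= \frac{p(w_{dn}, a_{dn} = t, \mathbf{m}, \mathbf{w}^{-dn}, \mathbf{z}, \mathbf{e}, \mathbf{a}^{-dn} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\iota})}{p(\mathbf{m}, \mathbf{w}^{-dn}, \mathbf{z}, \mathbf{e}, \mathbf{a}^{-dn} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\iota})} \\
&= \frac{\Gamma(C^{-dn}_{t,w_{dn}} + 1 + \iota_{w_{dn}})}{\Gamma(C^{-dn}_t + 1 + \sum_{v} \iota_v)} \frac{\Gamma(C^{-dn}_t + \sum_{v} \iota_v)}{\Gamma(C^{-dn}_{t,w_{dn}} + \iota_{w_{dn}})}
\cdot \Big( \frac{M_{d,t}}{M_d} \Big)^{N^{-dn}_{d,t} + 1} \Big( \frac{M_d}{M_{d,t}} \Big)^{N^{-dn}_{d,t}} \\
&= \frac{C^{-dn}_{t,w_{dn}} + \iota_{w_{dn}}}{C^{-dn}_t + \sum_{v} \iota_v} \cdot \frac{M_{d,t}}{M_d}
\end{aligned}
\tag{8}
$$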
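The simplification in Eq. (8) can be checked numerically: adding one token of word type $v$ to the counts of entity $t$ should change the Dirichlet-multinomial factor of that entity by exactly $(C^{-dn}_{t,v} + \iota_v)/(C^{-dn}_t + \sum_{v} \iota_v)$. The sketch below does this for one made-up count row; the helper name `log_dm` and all numbers are assumptions of the example.

```python
from math import lgamma, exp

def log_dm(counts, prior):
    """Log Dirichlet-multinomial factor for one count row."""
    value = lgamma(sum(prior)) - lgamma(sum(counts) + sum(prior))
    for c, p in zip(counts, prior):
        value += lgamma(c + p) - lgamma(p)
    return value

iota = [0.5, 0.5, 0.5]        # hypothetical hyperparameters, V = 3
counts_without = [2, 1, 0]    # C^{-dn}_{t,v} for one entity t
v = 1                         # word type of the token being added
counts_with = list(counts_without)
counts_with[v] += 1

# Ratio of the Gamma-function factors with and without the token.
gamma_ratio = exp(log_dm(counts_with, iota) - log_dm(counts_without, iota))
# Closed form from the last line of Eq. (8), without the M_{d,t}/M_d term.
closed_form = (counts_without[v] + iota[v]) / (sum(counts_without) + sum(iota))
```

Both quantities equal $1.5 / 4.5$ here, up to floating-point rounding.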

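Therefore, $a_{dn}$ can be updated based on the following probabilities:

$$
\begin{aligned}
& p(a_{dn} = t \mid \mathbf{m}, \mathbf{w}, \mathbf{z}, \mathbf{e}, \mathbf{a}^{-dn}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\iota}) \\
&= \frac{p(w_{dn}, a_{dn} = t \mid \mathbf{m}, \mathbf{w}^{-dn}, \mathbf{z}, \mathbf{e}, \mathbf{a}^{-dn}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\iota})}{\sum_{t'=1}^{T} p(w_{dn}, a_{dn} = t' \mid \mathbf{m}, \mathbf{w}^{-dn}, \mathbf{z}, \mathbf{e}, \mathbf{a}^{-dn}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\iota})} \\
&\propto \frac{C^{-dn}_{t,w_{dn}} + \iota_{w_{dn}}}{C^{-dn}_t + \sum_{v} \iota_v} \cdot \frac{M_{d,t}}{M_d}
\end{aligned}
\tag{9}
$$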
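A minimal sketch of one draw of $a_{dn}$ according to Eq. (9) follows. It assumes the word counts have already been decremented for token $(d, n)$; the function name `word_entity_probabilities` and the numbers are hypothetical. Since $M_d$ is the same for every $t$, the sketch multiplies by $M_{d,t}$ only, which means that entities with no mention in the document receive probability zero.

```python
import random

def word_entity_probabilities(C_tw, C_t, iota_w, iota_sum, M_dt):
    """Normalized Eq. (9) probabilities for a_dn = 0, ..., T-1.

    C_tw[t] = C^{-dn}_{t, w_dn} and C_t[t] = C^{-dn}_t exclude token (d, n);
    M_dt[t] = M_{d,t} is the number of mentions of document d assigned to t.
    iota_sum is the sum of iota over all word types.
    """
    weights = [
        (C_tw[t] + iota_w) / (C_t[t] + iota_sum) * M_dt[t]
        for t in range(len(M_dt))
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Made-up counts for T = 2 entities and V = 2 word types with iota_v = 1.
probs = word_entity_probabilities(
    C_tw=[1, 0], C_t=[2, 2], iota_w=1.0, iota_sum=2.0, M_dt=[1, 3],
)
new_entity = random.choices(range(len(probs)), weights=probs)[0]
```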
