A Note on PCVB0 for HDP-LDA

Tomonari MASADA @ Nagasaki University

August 22, 2014

This note gives a derivation of the variational posterior updates presented in the following paper:

Sato, Issei and Kurihara, Kenichi and Nakagawa, Hiroshi, Practical Collapsed Variational Bayes Inference for Hierarchical Dirichlet Process, in Proc. of KDD '12.

A lower bound of the log of the evidence p(w) is obtained as follows:

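\[
\begin{aligned}
\ln p(\mathbf{w} \mid \alpha_0, \beta_0, \gamma_0, \boldsymbol{\tau})
&= \ln \int \sum_{\mathbf{z}} p(\mathbf{z}, \mathbf{w} \mid \alpha_0, \beta_0, \boldsymbol{\pi}, \boldsymbol{\tau})\, p(\boldsymbol{\pi} \mid \gamma_0)\, d\boldsymbol{\pi} \\
&= \ln \int \sum_{\mathbf{z}} q(\mathbf{z}) q(\boldsymbol{\pi})\, \frac{p(\mathbf{z}, \mathbf{w} \mid \alpha_0, \beta_0, \boldsymbol{\pi}, \boldsymbol{\tau})\, p(\boldsymbol{\pi} \mid \gamma_0)}{q(\mathbf{z}) q(\boldsymbol{\pi})}\, d\boldsymbol{\pi} \\
&\geq \int \sum_{\mathbf{z}} q(\mathbf{z}) q(\boldsymbol{\pi}) \ln \frac{p(\mathbf{z}, \mathbf{w} \mid \alpha_0, \beta_0, \boldsymbol{\pi}, \boldsymbol{\tau})\, p(\boldsymbol{\pi} \mid \gamma_0)}{q(\mathbf{z}) q(\boldsymbol{\pi})}\, d\boldsymbol{\pi} \quad \text{(Jensen's inequality)} \\
&= \int \sum_{\mathbf{z}} q(\mathbf{z}) q(\boldsymbol{\pi}) \ln p(\mathbf{z}, \mathbf{w} \mid \alpha_0, \beta_0, \boldsymbol{\pi}, \boldsymbol{\tau})\, d\boldsymbol{\pi}
 - \sum_{\mathbf{z}} q(\mathbf{z}) \ln q(\mathbf{z})
 + \int q(\boldsymbol{\pi}) \ln p(\boldsymbol{\pi} \mid \gamma_0)\, d\boldsymbol{\pi}
 - \int q(\boldsymbol{\pi}) \ln q(\boldsymbol{\pi})\, d\boldsymbol{\pi} .
\end{aligned}
\tag{1}
\]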
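The Jensen step in Eq. (1) can be checked numerically on a toy problem. The sketch below is not part of the HDP-LDA model: it assumes a hypothetical joint over three latent states for one fixed observation, and the names `elbo`, `joint`, `q_uniform` and `q_posterior` are chosen here for illustration. It shows that the bound holds for an arbitrary q and becomes tight when q equals the exact posterior.

```python
import math

def elbo(q, joint):
    """Sum_z q(z) * ln(joint(z) / q(z)), skipping states with q(z) = 0."""
    return sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint) if qz > 0)

# Hypothetical toy joint p(z, w) for one fixed observation w and three latent states.
joint = [0.10, 0.05, 0.25]
log_evidence = math.log(sum(joint))

q_uniform = [1 / 3, 1 / 3, 1 / 3]               # an arbitrary variational choice
q_posterior = [p / sum(joint) for p in joint]   # the exact posterior p(z | w)

bound_uniform = elbo(q_uniform, joint)      # strictly below log_evidence
bound_posterior = elbo(q_posterior, joint)  # matches log_evidence up to rounding
```

The gap between `log_evidence` and `bound_uniform` is the KL divergence from the chosen q to the posterior, which is why maximizing the bound over q is a sensible objective.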

Based on the proposed approximation, we can approximate the joint probability of z and w by

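\[
p(\mathbf{z}, \mathbf{w} \mid \alpha_0, \beta_0, \boldsymbol{\pi}, \boldsymbol{\tau})
= \left[ \prod_{d=1}^{N} \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_0 + n_d)} \prod_{k=1}^{T} \bigl[ \Gamma(n_{d,k})\, \alpha_0 \pi_k \bigr]^{I(n_{d,k} > 0)} \right]
\left[ \prod_{k=1}^{T} \frac{\Gamma(\beta_0)}{\Gamma(\beta_0 + n_{k,\cdot})} \prod_{v=1}^{V} \bigl[ \Gamma(n_{k,v})\, \beta_0 \tau_v \bigr]^{I(n_{k,v} > 0)} \right]
\tag{2}
\]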

as in Eq. (21) of the paper.
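To make Eq. (2) concrete, the following Python sketch evaluates its logarithm for given integer counts. The function name `log_joint` and the list-of-lists layout of the counts are assumptions made for illustration; they do not come from the paper.

```python
import math

def log_joint(n_dk, n_kv, alpha0, beta0, pi, tau):
    """Logarithm of the right-hand side of Eq. (2).

    n_dk[d][k]: number of tokens in document d assigned to topic k.
    n_kv[k][v]: number of tokens of word v assigned to topic k.
    """
    total = 0.0
    for row in n_dk:  # document-side factor
        total += math.lgamma(alpha0) - math.lgamma(alpha0 + sum(row))
        for k, n in enumerate(row):
            if n > 0:  # indicator I(n_{d,k} > 0)
                total += math.lgamma(n) + math.log(alpha0 * pi[k])
    for row in n_kv:  # topic-side factor
        total += math.lgamma(beta0) - math.lgamma(beta0 + sum(row))
        for v, n in enumerate(row):
            if n > 0:  # indicator I(n_{k,v} > 0)
                total += math.lgamma(n) + math.log(beta0 * tau[v])
    return total

# One document with two tokens, both in topic 0, using two different words.
value = log_joint([[2, 0]], [[1, 1], [0, 0]], 1.0, 1.0, [0.5, 0.5], [0.5, 0.5])
```

Working in log space with `math.lgamma` avoids overflow of the Gamma functions for realistic document lengths.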

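The first term of the lower bound in Eq. (1) can be rewritten as follows:

\[
\begin{aligned}
&\sum_{\mathbf{z}} \int q(\mathbf{z}) q(\boldsymbol{\pi}) \ln p(\mathbf{z}, \mathbf{w} \mid \alpha_0, \beta_0, \boldsymbol{\pi}, \boldsymbol{\tau})\, d\boldsymbol{\pi} \\
&= \sum_{\mathbf{z}} \int \Bigl\{ \prod_{k=1}^{T} q(\tilde{\pi}_k) \Bigr\} q(\mathbf{z}) \ln \left[ \prod_{d=1}^{N} \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_0 + n_d)} \prod_{k=1}^{T} \bigl[ \Gamma(n_{d,k})\, \alpha_0 \pi_k \bigr]^{I(n_{d,k} > 0)} \right] d\tilde{\pi}_1 \cdots d\tilde{\pi}_T \\
&\quad + \sum_{\mathbf{z}} q(\mathbf{z}) \ln \left[ \prod_{k=1}^{T} \frac{\Gamma(\beta_0)}{\Gamma(\beta_0 + n_{k,\cdot})} \prod_{v=1}^{V} \bigl[ \Gamma(n_{k,v})\, \beta_0 \tau_v \bigr]^{I(n_{k,v} > 0)} \right] .
\end{aligned}
\tag{3}
\]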

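The first term of the right hand side in Eq. (3) can be rewritten as follows:

\[
\begin{aligned}
&\sum_{\mathbf{z}} \int \Bigl\{ \prod_{k=1}^{T} q(\tilde{\pi}_k) \Bigr\} q(\mathbf{z}) \ln \left[ \prod_{d=1}^{N} \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_0 + n_d)} \prod_{k=1}^{T} \bigl[ \Gamma(n_{d,k})\, \alpha_0 \pi_k \bigr]^{I(n_{d,k} > 0)} \right] d\tilde{\pi}_1 \cdots d\tilde{\pi}_T \\
&= N \ln \Gamma(\alpha_0) - \sum_{d=1}^{N} \ln \Gamma(\alpha_0 + n_d) + \sum_{\mathbf{z}} q(\mathbf{z}) \ln \prod_{d=1}^{N} \prod_{k=1}^{T} \bigl[ \Gamma(n_{d,k}) \bigr]^{I(n_{d,k} > 0)} \\
&\quad + \sum_{\mathbf{z}} \int \Bigl\{ \prod_{k=1}^{T} q(\tilde{\pi}_k) \Bigr\} q(\mathbf{z}) \ln \left[ \prod_{d=1}^{N} \prod_{k=1}^{T} \bigl[ \alpha_0 \pi_k \bigr]^{I(n_{d,k} > 0)} \right] d\tilde{\pi}_1 \cdots d\tilde{\pi}_T .
\end{aligned}
\tag{4}
\]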


The last term of the right hand side in Eq. (4) can be rewritten as follows:

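\[
\begin{aligned}
&\sum_{\mathbf{z}} \int \Bigl\{ \prod_{k=1}^{T} q(\tilde{\pi}_k) \Bigr\} q(\mathbf{z}) \ln \left[ \prod_{d=1}^{N} \prod_{k=1}^{T} \bigl[ \alpha_0 \pi_k \bigr]^{I(n_{d,k} > 0)} \right] d\tilde{\pi}_1 \cdots d\tilde{\pi}_T \\
&= \sum_{\mathbf{z}} \int \Bigl\{ \prod_{k=1}^{T} q(\tilde{\pi}_k) \Bigr\} q(\mathbf{z}) \left[ \sum_{d=1}^{N} \sum_{k=1}^{T} I(n_{d,k} > 0) \bigl[ \ln \alpha_0 + \ln \pi_k \bigr] \right] d\tilde{\pi}_1 \cdots d\tilde{\pi}_T \\
&= \ln \alpha_0 \sum_{\mathbf{z}} q(\mathbf{z}) \left[ \sum_{d=1}^{N} \sum_{k=1}^{T} I(n_{d,k} > 0) \right]
 + \sum_{\mathbf{z}} \int \Bigl\{ \prod_{k=1}^{T} q(\tilde{\pi}_k) \Bigr\} q(\mathbf{z}) \left[ \sum_{k=1}^{T} \Bigl\{ \sum_{d=1}^{N} I(n_{d,k} > 0) \Bigr\} \ln \pi_k \right] d\tilde{\pi}_1 \cdots d\tilde{\pi}_T \\
&= \ln \alpha_0 \sum_{d} \sum_{k} E[I(n_{d,k} > 0)]
 + \int \Bigl\{ \prod_{k=1}^{T} q(\tilde{\pi}_k) \Bigr\} \Bigl\{ \sum_{k} \sum_{d} E[I(n_{d,k} > 0)] \ln \pi_k \Bigr\} d\tilde{\pi}_1 \cdots d\tilde{\pi}_T ,
\end{aligned}
\tag{5}
\]

where $E[I(n_{d,k} > 0)] = 1 - \prod_{i=1}^{n_d} \{ 1 - q(z_{d,i} = k) \}$. The second term in Eq. (5) can be rewritten as follows: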

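\[
\begin{aligned}
&\int \Bigl\{ \prod_{k=1}^{T} q(\tilde{\pi}_k) \Bigr\} \Bigl\{ \sum_{k} \sum_{d} E[I(n_{d,k} > 0)] \ln \pi_k \Bigr\} d\tilde{\pi}_1 \cdots d\tilde{\pi}_T \\
&= \int \Bigl\{ \prod_{k=1}^{T} q(\tilde{\pi}_k) \Bigr\} \left[ \sum_{k} \sum_{d} E[I(n_{d,k} > 0)] \ln \Bigl\{ \tilde{\pi}_k \prod_{l=1}^{k-1} (1 - \tilde{\pi}_l) \Bigr\} \right] d\tilde{\pi}_1 \cdots d\tilde{\pi}_T \\
&= \sum_{k=1}^{T-1} \int q(\tilde{\pi}_k) \left[ \Bigl\{ \sum_{d} E[I(n_{d,k} > 0)] \Bigr\} \ln \tilde{\pi}_k + \Bigl\{ \sum_{l=k+1}^{T} \sum_{d} E[I(n_{d,l} > 0)] \Bigr\} \ln (1 - \tilde{\pi}_k) \right] d\tilde{\pi}_k \\
&= \sum_{k=1}^{T-1} \int q(\tilde{\pi}_k) \Bigl\{ \sum_{d} E[I(n_{d,k} > 0)] \Bigr\} \ln \tilde{\pi}_k \, d\tilde{\pi}_k
 + \sum_{k=1}^{T-1} \int q(\tilde{\pi}_k) \Bigl\{ \sum_{l=k+1}^{T} \sum_{d} E[I(n_{d,l} > 0)] \Bigr\} \ln (1 - \tilde{\pi}_k) \, d\tilde{\pi}_k ,
\end{aligned}
\tag{6}
\]

where it should be noted that $\tilde{\pi}_T \equiv 1$.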
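Two ingredients of Eqs. (5) and (6) are easy to compute directly: the expectation $E[I(n_{d,k} > 0)]$ and the stick-breaking map from $\tilde{\pi}$ to $\pi$. The Python sketch below illustrates both. The function names and the nested-list layout are assumptions made here, and `stick_breaking` overrides the last stick with 1 to reflect the truncation $\tilde{\pi}_T \equiv 1$.

```python
def expected_presence(q_d):
    """E[I(n_{d,k} > 0)] for one document.

    q_d[i][k] holds q(z_{d,i} = k); the result is a list over topics k.
    """
    num_topics = len(q_d[0])
    result = []
    for k in range(num_topics):
        prob_absent = 1.0  # probability that no token of the document takes topic k
        for q_token in q_d:
            prob_absent *= 1.0 - q_token[k]
        result.append(1.0 - prob_absent)
    return result

def stick_breaking(pi_tilde):
    """pi_k = pi_tilde_k * prod_{l < k} (1 - pi_tilde_l), with the last stick set to 1."""
    sticks = list(pi_tilde[:-1]) + [1.0]
    pi, remaining = [], 1.0
    for s in sticks:
        pi.append(s * remaining)
        remaining *= 1.0 - s
    return pi

presence = expected_presence([[0.5, 0.5, 0.0], [0.5, 0.25, 0.25]])
weights = stick_breaking([0.5, 0.5, 0.9])
```

Because the last stick is fixed to 1, the resulting weights always sum to one, which is what allows the sum over $k$ in Eq. (6) to stop at $T - 1$.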

The second term of the right hand side in Eq. (3) can be rewritten as follows:

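\[
\begin{aligned}
&\sum_{\mathbf{z}} q(\mathbf{z}) \ln \left[ \prod_{k=1}^{T} \frac{\Gamma(\beta_0)}{\Gamma(\beta_0 + n_{k,\cdot})} \prod_{v=1}^{V} \bigl[ \Gamma(n_{k,v})\, \beta_0 \tau_v \bigr]^{I(n_{k,v} > 0)} \right] \\
&= T \ln \Gamma(\beta_0) - \sum_{\mathbf{z}} q(\mathbf{z}) \Bigl\{ \sum_{k=1}^{T} \ln \Gamma(\beta_0 + n_{k,\cdot}) \Bigr\}
 + \sum_{\mathbf{z}} q(\mathbf{z}) \left[ \sum_{k=1}^{T} \sum_{v=1}^{V} I(n_{k,v} > 0) \ln \bigl\{ \Gamma(n_{k,v})\, \beta_0 \tau_v \bigr\} \right] .
\end{aligned}
\tag{7}
\]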

The last term of the right hand side in Eq. (7) can be rewritten as follows:

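\[
\begin{aligned}
&\sum_{\mathbf{z}} q(\mathbf{z}) \left[ \sum_{k=1}^{T} \sum_{v=1}^{V} I(n_{k,v} > 0) \ln \bigl\{ \Gamma(n_{k,v})\, \beta_0 \tau_v \bigr\} \right] \\
&= \ln \beta_0 \sum_{k=1}^{T} \sum_{v=1}^{V} E[I(n_{k,v} > 0)]
 + \sum_{v=1}^{V} \ln \tau_v \sum_{k=1}^{T} E[I(n_{k,v} > 0)]
 + \sum_{\mathbf{z}} q(\mathbf{z}) \left[ \sum_{k=1}^{T} \sum_{v=1}^{V} I(n_{k,v} > 0) \ln \Gamma(n_{k,v}) \right] .
\end{aligned}
\tag{8}
\]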


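The third term of the lower bound in Eq. (1) can be rewritten as follows:

\[
\begin{aligned}
\int q(\boldsymbol{\pi}) \ln p(\boldsymbol{\pi} \mid \gamma_0)\, d\boldsymbol{\pi}
&= \sum_{k=1}^{T-1} \int q(\tilde{\pi}_k) \ln \bigl\{ \gamma_0 (1 - \tilde{\pi}_k)^{\gamma_0 - 1} \bigr\}\, d\tilde{\pi}_k \\
&= (T - 1) \ln \gamma_0 + (\gamma_0 - 1) \sum_{k=1}^{T-1} E[\ln (1 - \tilde{\pi}_k)] \\
&= (T - 1) \ln \gamma_0 + (\gamma_0 - 1) \sum_{k=1}^{T-1} \bigl\{ \psi(b_k) - \psi(a_k + b_k) \bigr\} ,
\end{aligned}
\tag{9}
\]

and the last term of the lower bound in Eq. (1) can be rewritten as follows:

\[
- \int q(\boldsymbol{\pi}) \ln q(\boldsymbol{\pi})\, d\boldsymbol{\pi} = - \sum_{k=1}^{T-1} \int q(\tilde{\pi}_k) \ln q(\tilde{\pi}_k)\, d\tilde{\pi}_k .
\tag{10}
\]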
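The identity $E[\ln(1 - \tilde{\pi}_k)] = \psi(b_k) - \psi(a_k + b_k)$ used in the last step of Eq. (9) can be verified numerically. The sketch below is a sanity check under two assumptions made here: the parameters are positive integers, so that the digamma difference reduces to a finite harmonic sum via $\psi(n + 1) = \psi(n) + 1/n$, and a plain midpoint rule is accurate enough for the integral. Both function names are illustrative.

```python
import math

def expected_log_one_minus(a, b, steps=20000):
    """Midpoint-rule estimate of E[ln(1 - x)] for x ~ Beta(a, b)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = 1.0 / steps
    total = 0.0
    for j in range(steps):
        x = (j + 0.5) * h
        log_pdf = log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1.0 - x)
        total += math.exp(log_pdf) * math.log(1.0 - x) * h
    return total

def digamma_difference(a, b):
    """psi(b) - psi(a + b) for positive integers a and b."""
    return -sum(1.0 / j for j in range(b, a + b))

numeric = expected_log_one_minus(2, 3)
closed_form = digamma_difference(2, 3)  # equals -(1/3 + 1/4)
```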

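We would like to know the distribution of $\tilde{\pi}_k$. For simplicity, we denote $\sum_{d} E[I(n_{d,k} > 0)]$ and $\sum_{l=k+1}^{T} \sum_{d} E[I(n_{d,l} > 0)]$ by $R$ and $S$, respectively. We obtain the functional derivative of the $k$-th summand of the sum of Eq. (6) and Eq. (10) as follows:

\[
\begin{aligned}
&\frac{\delta}{\delta q(\tilde{\pi}'_k)} \int q(\tilde{\pi}_k) \bigl\{ R \ln \tilde{\pi}_k + S \ln (1 - \tilde{\pi}_k) - \ln q(\tilde{\pi}_k) \bigr\}\, d\tilde{\pi}_k \\
&= \lim_{\epsilon \to 0} \frac{1}{\epsilon} \int \Bigl( \bigl\{ q(\tilde{\pi}_k) + \epsilon \delta(\tilde{\pi}_k - \tilde{\pi}'_k) \bigr\} \bigl[ R \ln \tilde{\pi}_k + S \ln (1 - \tilde{\pi}_k) - \ln \bigl\{ q(\tilde{\pi}_k) + \epsilon \delta(\tilde{\pi}_k - \tilde{\pi}'_k) \bigr\} \bigr] \\
&\qquad\qquad\quad - q(\tilde{\pi}_k) \bigl[ R \ln \tilde{\pi}_k + S \ln (1 - \tilde{\pi}_k) - \ln q(\tilde{\pi}_k) \bigr] \Bigr)\, d\tilde{\pi}_k \\
&= \lim_{\epsilon \to 0} \frac{1}{\epsilon} \left[ \epsilon \bigl\{ R \ln \tilde{\pi}'_k + S \ln (1 - \tilde{\pi}'_k) - \ln q(\tilde{\pi}'_k) \bigr\} - \int q(\tilde{\pi}_k) \Bigl\{ \ln \bigl\{ q(\tilde{\pi}_k) + \epsilon \delta(\tilde{\pi}_k - \tilde{\pi}'_k) \bigr\} - \ln q(\tilde{\pi}_k) \Bigr\}\, d\tilde{\pi}_k + O(\epsilon^2) \right] \\
&= \lim_{\epsilon \to 0} \frac{1}{\epsilon} \left[ \epsilon \bigl\{ R \ln \tilde{\pi}'_k + S \ln (1 - \tilde{\pi}'_k) - \ln q(\tilde{\pi}'_k) \bigr\} - \int q(\tilde{\pi}_k) \ln \frac{q(\tilde{\pi}_k) + \epsilon \delta(\tilde{\pi}_k - \tilde{\pi}'_k)}{q(\tilde{\pi}_k)}\, d\tilde{\pi}_k + O(\epsilon^2) \right] \\
&= \lim_{\epsilon \to 0} \frac{1}{\epsilon} \left[ \epsilon \bigl\{ R \ln \tilde{\pi}'_k + S \ln (1 - \tilde{\pi}'_k) - \ln q(\tilde{\pi}'_k) \bigr\} - \int q(\tilde{\pi}_k) \Bigl\{ \frac{\epsilon \delta(\tilde{\pi}_k - \tilde{\pi}'_k)}{q(\tilde{\pi}_k)} + O(\epsilon^2) \Bigr\}\, d\tilde{\pi}_k + O(\epsilon^2) \right] \\
&= R \ln \tilde{\pi}'_k + S \ln (1 - \tilde{\pi}'_k) - \ln q(\tilde{\pi}'_k) - 1 .
\end{aligned}
\tag{11}
\]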

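Setting Eq. (11) to zero, with the constant $-1$ absorbed by normalization, gives $q(\tilde{\pi}_k) \propto \tilde{\pi}_k^{R} (1 - \tilde{\pi}_k)^{S}$. This implies that $q(\tilde{\pi}_k)$ is a density function of the Beta distribution.

We parametrize it as $\tilde{\pi}_k \sim \mathrm{Beta}(a_k, b_k)$ and obtain the following result:

\[
a_k = 1 + \sum_{d=1}^{N} E[I(n_{d,k} > 0)] , \qquad
b_k = 1 + \sum_{l=k+1}^{T} \sum_{d=1}^{N} E[I(n_{d,l} > 0)] .
\tag{12}
\]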
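Eq. (12) translates into a few lines of Python. The sketch below assumes the expectations $E[I(n_{d,k} > 0)]$ are already available as a nested list `presence[d][k]`; that layout and both function names are illustrative choices. The second function computes $E[\pi_k]$, which factorizes over the sticks because the $q(\tilde{\pi}_k)$ are independent.

```python
def beta_parameters(presence):
    """Return (a, b) for k = 1, ..., T-1 following Eq. (12).

    presence[d][k] holds E[I(n_{d,k} > 0)].
    """
    num_topics = len(presence[0])
    column_sums = [sum(doc[k] for doc in presence) for k in range(num_topics)]
    a, b = [], []
    for k in range(num_topics - 1):
        a.append(1.0 + column_sums[k])            # 1 + R
        b.append(1.0 + sum(column_sums[k + 1:]))  # 1 + S
    return a, b

def expected_pi(a, b):
    """E[pi_k] under independent Beta(a_k, b_k) sticks, with the last stick fixed to 1."""
    means = [ak / (ak + bk) for ak, bk in zip(a, b)] + [1.0]
    pi, remaining = [], 1.0
    for m in means:
        pi.append(m * remaining)
        remaining *= 1.0 - m
    return pi

a, b = beta_parameters([[1.0, 0.5, 0.0], [0.5, 0.5, 0.25]])
e_pi = expected_pi(a, b)
```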

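By the way, based on Eq. (2), we obtain the following conditional distribution:

\[
p(z_{d,i} = k \mid \mathbf{w}, \mathbf{z}^{-d,i}, \alpha_0, \beta_0, \boldsymbol{\pi}, \boldsymbol{\tau})
\propto \Bigl\{ I(n_{d,k}^{-d,i} > 0)\, n_{d,k}^{-d,i} + I(n_{d,k}^{-d,i} = 0)\, \alpha_0 \pi_k \Bigr\}
\frac{I(n_{k,w_{d,i}}^{-d,i} > 0)\, n_{k,w_{d,i}}^{-d,i} + I(n_{k,w_{d,i}}^{-d,i} = 0)\, \beta_0 \tau_{w_{d,i}}}{n_{k}^{-d,i} + \beta_0} .
\tag{13}
\]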

However, this does not lead to the variational posterior $q(z_{d,i} = k)$ given in the original paper (cf. Eq. (34) and Eq. (35)). When we adopt the conditional

\[
p(z_{d,i} = k \mid \mathbf{w}, \mathbf{z}^{-d,i}, \alpha_0, \beta_0, \boldsymbol{\pi}, \boldsymbol{\tau})
\propto \bigl( n_{d,k}^{-d,i} + \alpha_0 \pi_k \bigr)
\frac{n_{k,w_{d,i}}^{-d,i} + \beta_0 \tau_{w_{d,i}}}{n_{k}^{-d,i} + \beta_0}
\tag{14}
\]

as usual and use $E[\pi_k]$ for approximating the posterior $q(z_{d,i} = k)$, we obtain the result given in the paper.
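The substitution described above can be sketched for a single token as follows. This is an illustration of Eq. (14) with $\pi_k$ replaced by $E[\pi_k]$, not a transcription of the paper's Eq. (34) and Eq. (35). The function name `update_q` and its argument layout are assumptions, and the counts are taken to be expected counts with the current token already removed, in the CVB0 style.

```python
def update_q(n_dk, n_kw, n_k, alpha0, beta0, e_pi, tau_w):
    """Normalized q(z_{d,i} = k) from Eq. (14), with pi_k replaced by E[pi_k].

    n_dk[k]: expected count of topic k in document d, excluding token (d, i).
    n_kw[k]: expected count of this token's word in topic k, excluding token (d, i).
    n_k[k]:  expected total count of topic k, excluding token (d, i).
    tau_w:   tau for the word of this token.
    """
    weights = [
        (n_dk[k] + alpha0 * e_pi[k]) * (n_kw[k] + beta0 * tau_w) / (n_k[k] + beta0)
        for k in range(len(e_pi))
    ]
    total = sum(weights)
    return [w / total for w in weights]

q_token = update_q([2.0, 0.0], [1.0, 0.0], [4.0, 4.0], 1.0, 1.0, [0.5, 0.5], 0.5)
```

In a full pass, the caller would subtract the old `q_token` from the three count arrays before this call and add the new one back afterwards.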


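Let the lower bound in Eq. (1) be denoted as $L$. We differentiate $L$ with respect to $\alpha_0$:

\[
\frac{\partial L}{\partial \alpha_0} = N \psi(\alpha_0) - \sum_{d=1}^{N} \psi(\alpha_0 + n_d) + \frac{\sum_{d} \sum_{k} E[I(n_{d,k} > 0)]}{\alpha_0} .
\tag{15}
\]

Setting $\frac{\partial L}{\partial \alpha_0} = 0$ gives the following update:

\[
\alpha_0 \leftarrow \frac{\sum_{d} \sum_{k} E[I(n_{d,k} > 0)]}{\sum_{d} \psi(\alpha_0 + n_d) - N \psi(\alpha_0)} .
\tag{16}
\]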

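In a similar manner, we obtain the following update:

\[
\beta_0 \leftarrow \frac{\sum_{k} \sum_{v} E[I(n_{k,v} > 0)]}{\sum_{k} \psi(\beta_0 + n_{k,\cdot}) - T \psi(\beta_0)} ,
\tag{17}
\]

where $E[I(n_{k,v} > 0)] = 1 - \prod_{d} \prod_{i} q(z_{d,i} \neq k)^{I(w_{d,i} = v)}$, i.e. the product runs only over the tokens whose word is $v$.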
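The updates in Eq. (16) and Eq. (17) share one form, so a single helper can serve both. The sketch below is a minimal illustration: `concentration_step` is a name chosen here, and since the standard library has no digamma function, `digamma` is a hand-written approximation (recurrence followed by an asymptotic series), not a library call. The totals passed in must be positive so that the denominator is positive.

```python
import math

def digamma(x):
    """Approximate psi(x) for x > 0: shift x up to at least 6, then use an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def concentration_step(conc, expected_nonzero, totals):
    """One fixed-point step of the form shared by Eq. (16) and Eq. (17).

    conc:             current alpha_0 (or beta_0).
    expected_nonzero: sum of E[I(n > 0)] over all (d, k) pairs (or (k, v) pairs).
    totals:           n_d for every document (or n_{k,.} for every topic).
    """
    denominator = sum(digamma(conc + n) for n in totals) - len(totals) * digamma(conc)
    return expected_nonzero / denominator

# Two documents of lengths 10 and 20, starting from alpha_0 = 1.
alpha0_new = concentration_step(1.0, 3.0, [10, 20])
```

Because the right-hand side depends on the current value, the step is typically repeated a few times, or interleaved with the other updates.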

For $\tau_v$, we assume it is a multinomial parameter and obtain the following update:

\[
\tau_v = \frac{\sum_{k} E[I(n_{k,v} > 0)]}{\sum_{v'} \sum_{k} E[I(n_{k,v'} > 0)]} .
\tag{18}
\]
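The topic-side expectations and the update in Eq. (18) can be sketched together. The names `topic_word_presence` and `update_tau`, and the representation of the corpus as lists of word ids with matching per-token responsibilities, are assumptions made for this illustration.

```python
def topic_word_presence(docs, q, num_topics, vocab_size):
    """E[I(n_{k,v} > 0)] as a num_topics x vocab_size nested list.

    docs[d][i] is the word id of token (d, i); q[d][i][k] is q(z_{d,i} = k).
    """
    absent = [[1.0] * vocab_size for _ in range(num_topics)]
    for words, q_d in zip(docs, q):
        for v, q_token in zip(words, q_d):
            for k in range(num_topics):
                # Only tokens whose word is v contribute to entry (k, v).
                absent[k][v] *= 1.0 - q_token[k]
    return [[1.0 - p for p in row] for row in absent]

def update_tau(presence_kv):
    """tau_v from Eq. (18): column sums of E[I(n_{k,v} > 0)], normalized over v."""
    vocab_size = len(presence_kv[0])
    column = [sum(row[v] for row in presence_kv) for v in range(vocab_size)]
    total = sum(column)
    return [c / total for c in column]

docs = [[0, 1], [0]]
q = [[[0.5, 0.5], [1.0, 0.0]], [[0.5, 0.5]]]
presence_kv = topic_word_presence(docs, q, 2, 2)
tau = update_tau(presence_kv)
```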

By differentiating Eq. (9) with respect to $\gamma_0$, we obtain the following update:

\[
\gamma_0 = \frac{T - 1}{\sum_{k=1}^{T-1} \bigl\{ \psi(a_k + b_k) - \psi(b_k) \bigr\}} .
\tag{19}
\]
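Eq. (19) is a closed-form update once the Beta parameters are known. The following sketch assumes `a` and `b` are lists of length $T - 1$; `update_gamma0` is an illustrative name, and `digamma` is a hand-written approximation (recurrence followed by an asymptotic series) because the standard library does not provide one.

```python
import math

def digamma(x):
    """Approximate psi(x) for x > 0: shift x up to at least 6, then use an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def update_gamma0(a, b):
    """gamma_0 = (T - 1) / sum_k {psi(a_k + b_k) - psi(b_k)}, where len(a) = T - 1."""
    denominator = sum(digamma(ak + bk) - digamma(bk) for ak, bk in zip(a, b))
    return len(a) / denominator

# psi(2) - psi(1) = 1 and psi(5) - psi(3) = 1/3 + 1/4, so the result is 2 / (19/12).
gamma0 = update_gamma0([1.0, 2.0], [1.0, 3.0])
```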
