LEARNING AND INFERENCE IN GRAPHICAL MODELS
Chapter 06: Variational Bayesian Inference
Dr. Martin Lauer
University of Freiburg, Machine Learning Lab
Karlsruhe Institute of Technology, Institute of Measurement and Control Systems
Learning and Inference in Graphical Models. Chapter 06 – p. 1/38
References for this chapter
◮ Christopher M. Bishop, Pattern Recognition and Machine Learning, ch. 10, Springer, 2006
◮ Charles Fox and Stephen Roberts, A Tutorial on Variational Bayesian Inference, In: Artificial Intelligence Review, vol. 38, no. 2, pp. 85–95, 2012
◮ John Winn and Christopher M. Bishop, Variational Message Passing, In: Journal of Machine Learning Research, vol. 6, pp. 661–694, 2005, http://machinelearning.wustl.edu/mlpapers/paper_files/WinnB05.pdf
Approximative solutions
Observations:
◮ inference on Bayesian networks can be done analytically for polytrees combined with special distributions (categorical, Gauss-linear)
◮ inference is not analytically tractable in the general case
◮ the general case therefore requires numerical or approximative solutions
Approximative inference
The joint probability distribution of many Bayesian networks is pretty complicated and hard to treat analytically.
Goal: find a simpler joint probability distribution that approximates the original one and that can be treated analytically.
Example:

[Figure: two graphical models over µ and σ² with a plate over X_1, …, X_n — left: conjugate prior for Gaussians; right: desirable, simpler prior for Gaussians]
Approximative inference
How can we measure whether two distributions are similar?
Definition: the Kullback-Leibler divergence is an asymmetric measure for the dissimilarity of two probability distributions. It is defined by

\[ KL(p\,\|\,q) = \int_{-\infty}^{\infty} p(x) \log\frac{p(x)}{q(x)}\, dx \]

Properties:
◮ KL(p||p) = 0
◮ KL(p||q) ≥ 0, with equality only if p = q almost everywhere
◮ KL(p||q) ≠ KL(q||p), i.e. it is not symmetric
◮ KL(p||q) + KL(q||r) ≱ KL(p||r), i.e. the triangle inequality does not hold
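For univariate Gaussians the KL divergence has a closed form, which makes these properties easy to check numerically. The following sketch (plain Python with NumPy; the helper names are ours) verifies KL(p||p) = 0 and the asymmetry, and cross-checks the closed form against direct numerical integration:

```python
import numpy as np

def kl_gauss(mu1, var1, mu2, var2):
    """Closed-form KL(N(mu1,var1) || N(mu2,var2)) for univariate Gaussians."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def kl_numeric(mu1, var1, mu2, var2):
    """KL(p||q) via direct numerical integration of p(x) log(p(x)/q(x))."""
    x = np.linspace(-20.0, 20.0, 100001)
    p = np.exp(-0.5 * (x - mu1) ** 2 / var1) / np.sqrt(2 * np.pi * var1)
    q = np.exp(-0.5 * (x - mu2) ** 2 / var2) / np.sqrt(2 * np.pi * var2)
    return np.sum(p * np.log(p / q)) * (x[1] - x[0])

kl_pq = kl_gauss(0.0, 1.0, 1.0, 4.0)   # KL(p||q)
kl_qp = kl_gauss(1.0, 4.0, 0.0, 1.0)   # KL(q||p): a different value
```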
Kullback-Leibler-divergence
[Figure: a density p(x) together with the Gaussians argmin_q KL(p||q) and argmin_q KL(q||p)]

Question: which Gaussian q minimizes the Kullback-Leibler divergence KL(p||q) and KL(q||p), respectively?
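The answer can be illustrated numerically. In the sketch below (our own illustration, not from the slides), p is a bimodal Gaussian mixture and we grid-search the Gaussian q minimizing each direction of the divergence: minimizing KL(p||q) yields a broad Gaussian covering both modes, while minimizing KL(q||p) locks onto a single mode.

```python
import numpy as np

xs = np.linspace(-12.0, 12.0, 4001)
dx = xs[1] - xs[0]

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# bimodal target p: an equal mixture of N(-3, 1) and N(3, 1)
p = 0.5 * gauss(xs, -3.0, 1.0) + 0.5 * gauss(xs, 3.0, 1.0)

def kl(f, g):
    """Discretized KL(f||g) on the grid xs."""
    return np.sum(f * np.log(f / g)) * dx

best_fwd = None  # (value, mu, sigma) minimizing KL(p||q)
best_rev = None  # (value, mu, sigma) minimizing KL(q||p)
for mu in np.linspace(-4.0, 4.0, 17):
    for sig in np.linspace(0.5, 5.0, 19):
        q = gauss(xs, mu, sig ** 2)
        fwd, rev = kl(p, q), kl(q, p)
        if best_fwd is None or fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sig)
        if best_rev is None or rev < best_rev[0]:
            best_rev = (rev, mu, sig)
```

The forward minimizer ends up centered between the modes with a large variance ("zero-avoiding"), the reverse minimizer sits on one mode with a small variance ("mode-seeking").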
Variational Bayesian inference
◮ assume we are given a complex distribution p(U|O), where U are the unobserved variables and O the observed variables
◮ we want to approximate it with a parameterized distribution q(U|θ), with θ the set of parameters
◮ we assume that q can be factorized as q(U|θ) = ∏_i q_i(U_i|θ_i), with each q_i a conditional distribution
◮ how should we choose θ to obtain the best approximation?

\[ \min_\theta \; KL\big(q(U|\theta)\,\|\,p(U|O)\big) \]
Variational Bayesian inference
\begin{align*}
KL\big(q(U|\theta)\,\|\,p(U|O)\big) &= \int q(u|\theta) \log\frac{q(u|\theta)}{p(u|O)}\, du \\
&= -\int q(u|\theta) \log\frac{p(u|O)}{q(u|\theta)}\, du \\
&= -\int q(u|\theta) \log\frac{p(u,O)}{p(O)\cdot q(u|\theta)}\, du \\
&= -\int \Big( q(u|\theta) \log\frac{p(u,O)}{q(u|\theta)} - q(u|\theta)\log p(O) \Big)\, du \\
&= -\underbrace{\int q(u|\theta) \log\frac{p(u,O)}{q(u|\theta)}\, du}_{=:\,L(\theta)} + \log p(O)
\end{align*}

Observe that p(O) does not depend on θ.

Hence, to minimize the KL divergence we need to maximize L(θ).
Variational Bayesian inference
Use the factorization q(U|θ) = ∏_i q_i(U_i|θ_i):

\begin{align*}
L(\theta) &= \int \Big(\prod_{i=1}^n q_i(u_i|\theta_i)\Big) \log p(u,O)\, d(u_1,\dots,u_n) - \int \Big(\prod_{i=1}^n q_i(u_i|\theta_i)\Big)\Big(\sum_{j=1}^n \log q_j(u_j|\theta_j)\Big)\, d(u_1,\dots,u_n) \\
&= \int \Big(\prod_{i=1}^n q_i(u_i|\theta_i)\Big) \log p(u,O)\, d(u_1,\dots,u_n) - \sum_{j=1}^n \int \Big(\prod_{i=1}^n q_i(u_i|\theta_i)\Big) \log q_j(u_j|\theta_j)\, d(u_1,\dots,u_n) \\
&= \int \Big(\prod_{i=1}^n q_i(u_i|\theta_i)\Big) \log p(u,O)\, d(u_1,\dots,u_n) - \sum_{j=1}^n \Big( \prod_{i\neq j} \Big(\int q_i(u_i|\theta_i)\, du_i\Big) \cdot \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\, du_j \Big)
\end{align*}
Variational Bayesian inference
\begin{align*}
L(\theta) &= \int \Big(\prod_{i=1}^n q_i(u_i|\theta_i)\Big) \log p(u,O)\, d(u_1,\dots,u_n) - \sum_{j=1}^n \Big( \prod_{i\neq j} \underbrace{\Big(\int q_i(u_i|\theta_i)\, du_i\Big)}_{=1} \cdot \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\, du_j \Big) \\
&= \int \Big(\prod_{i=1}^n q_i(u_i|\theta_i)\Big) \log p(u,O)\, d(u_1,\dots,u_n) - \sum_{j=1}^n \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\, du_j
\end{align*}
Variational Bayesian inference
Select one (arbitrary) factor k:

\begin{align*}
L(\theta) &= \int \Big(\prod_{i=1}^n q_i(u_i|\theta_i)\Big) \log p(u,O)\, d(u_1,\dots,u_n) - \sum_{j=1}^n \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\, du_j \\
&= \int q_k(u_k|\theta_k) \Big( \int\!\cdots\!\int \log p(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\, du_1 \dots du_{k-1}\, du_{k+1} \dots du_n \Big)\, du_k - \sum_{j=1}^n \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\, du_j \\
&= \int q_k(u_k|\theta_k) \log\Big( \exp\Big( \int\!\cdots\!\int \log p(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\, du_1 \dots du_{k-1}\, du_{k+1} \dots du_n \Big)\Big)\, du_k - \sum_{j=1}^n \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\, du_j
\end{align*}
Variational Bayesian inference
\begin{align*}
L(\theta) &= \int q_k(u_k|\theta_k) \log\Big( \exp\Big( \int\!\cdots\!\int \log p(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\, du_1 \dots du_{k-1}\, du_{k+1} \dots du_n \Big)\Big)\, du_k - \sum_{j=1}^n \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\, du_j \\
&= \int q_k(u_k|\theta_k) \log q_k^*(u_k)\, du_k + \log Z - \sum_{j=1}^n \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\, du_j
\end{align*}

with

\begin{align*}
Z &= \int \exp\Big( \int\!\cdots\!\int \log p(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\, du_1 \dots du_{k-1}\, du_{k+1} \dots du_n \Big)\, du_k \\
q_k^*(u_k) &= \frac{1}{Z} \exp\Big( \int\!\cdots\!\int \log p(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\, du_1 \dots du_{k-1}\, du_{k+1} \dots du_n \Big)
\end{align*}

Z serves as a normalization constant so that q_k^* becomes the density function of a Gibbs distribution.
Variational Bayesian inference
\begin{align*}
L(\theta) &= \int q_k(u_k|\theta_k) \log q_k^*(u_k)\, du_k + \log Z - \sum_{j=1}^n \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\, du_j \\
&= \int q_k(u_k|\theta_k) \log q_k^*(u_k)\, du_k - \int q_k(u_k|\theta_k) \log q_k(u_k|\theta_k)\, du_k + \log Z - \sum_{j\neq k} \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\, du_j \\
&= -KL\big(q_k(u_k|\theta_k)\,\|\,q_k^*(u_k)\big) + \log Z - \sum_{j\neq k} \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\, du_j
\end{align*}

We want to maximize L. If we keep all θ_i fixed except θ_k, we should choose θ_k so that q_k(u_k|θ_k) = q_k^*(u_k).

If we apply this idea repeatedly, cycling through all possible values of k, we obtain an iterative algorithm that converges to a local maximum of L(θ).
Variational Bayesian inference
Algorithm:
1. start with an arbitrary parameter set θ
2. repeat
3.   for k ← 1, …, n do
4.     select θ_k so that q_k(u_k|θ_k) = q_k^*(u_k)
5.   endfor
6. until convergence of θ
7. return θ
Variational Bayesian inference
A closer look at q_k^*(u_k):

\[ q_k^*(u_k) = \frac{1}{Z} \exp\Big( \int\!\cdots\!\int \log p(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\, du_1 \dots du_{k-1}\, du_{k+1} \dots du_n \Big) \]

If p was derived from a Bayesian network, we know that it factors into terms that belong to the Markov blanket of u_k and other terms, i.e. p(u,O) = p'(u,O) · p''(u,O), where the second term does not depend on u_k.

\begin{align*}
q_k^*(u_k) &= \frac{1}{Z} \exp\Big( \int\!\cdots\!\int \log p'(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\, du_1 \dots du_{k-1}\, du_{k+1} \dots du_n \\
&\qquad\qquad + \underbrace{\int\!\cdots\!\int \log p''(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\, du_1 \dots du_{k-1}\, du_{k+1} \dots du_n}_{\text{constant w.r.t. } u_k} \Big) \\
&= \frac{1}{Z'} \exp\Big( \int\!\cdots\!\int \log p'(u,O) \prod_{i\in \text{blanket}(k)} q_i(u_i|\theta_i)\, d\{u_i \mid i\in \text{blanket}(k)\} \Big)
\end{align*}
Example: Gaussian
Example: a sample from a Gaussian with unknown parameters

[Graphical model: hyperparameters m_0, r_0, a_0, b_0; nodes µ and s; plate over X_1, …, X_n]

\begin{align*}
\mu &\sim \mathcal N(m_0, r_0) \\
s &\sim \Gamma^{-1}(a_0, b_0) \\
X_i &\sim \mathcal N(\mu, s)
\end{align*}

\[ p(\mu, s, x_1,\dots,x_n) = \underbrace{\frac{1}{\sqrt{2\pi r_0}}\, e^{-\frac12 \frac{(\mu-m_0)^2}{r_0}}}_{=:\,d_\mu(\mu)} \cdot \underbrace{\frac{b_0^{a_0}}{\Gamma(a_0)}\, s^{-a_0-1} e^{-\frac{b_0}{s}}}_{=:\,d_s(s)} \cdot \underbrace{\prod_{i=1}^n \frac{1}{\sqrt{2\pi s}}\, e^{-\frac12 \frac{(x_i-\mu)^2}{s}}}_{=:\,d(\mu,s)} \]
Example: Gaussian
[Graphical model as before: hyperparameters m_0, r_0, a_0, b_0; nodes µ and s; plate over X_1, …, X_n]

Modeling the full posterior by a variational approximation:

\[ q(\mu, s|m, r, a, b) = q_\mu(\mu|m,r) \cdot q_s(s|a,b) \]

\begin{align*}
q_\mu(\mu|m,r) &= \frac{1}{\sqrt{2\pi r}}\, e^{-\frac12 \frac{(\mu-m)^2}{r}} \\
q_s(s|a,b) &= \frac{b^a}{\Gamma(a)}\, s^{-a-1} e^{-\frac{b}{s}}
\end{align*}
Example: Gaussian
\begin{align*}
q_\mu^*(\mu) &\propto \exp\Big( \int_{-\infty}^{\infty} \log p(\mu, s, x_1,\dots,x_n) \cdot q_s(s|a,b)\, ds \Big) \\
q_s^*(s) &\propto \exp\Big( \int_{-\infty}^{\infty} \log p(\mu, s, x_1,\dots,x_n) \cdot q_\mu(\mu|m,r)\, d\mu \Big)
\end{align*}
Example: Gaussian
Side calculation:

\begin{align*}
d(\mu,s) &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi s}}\, e^{-\frac12 \frac{(x_i-\mu)^2}{s}} \\
&= \frac{1}{(2\pi)^{\frac n2} s^{\frac n2}}\, e^{-\frac{1}{2s}\left(\sum x_i^2 - 2\mu \sum x_i + n\mu^2\right)} \\
&= \frac{1}{(2\pi)^{\frac n2} s^{\frac n2}}\, e^{-\frac{n}{2s}\left(\mu^2 - 2\mu \frac{\sum x_i}{n} + \frac{\sum x_i^2}{n}\right)} \\
&= \frac{1}{(2\pi)^{\frac n2} s^{\frac n2}}\, e^{-\frac{n}{2s}\left(\mu^2 - 2\mu \frac{\sum x_i}{n} + \left(\frac{\sum x_i}{n}\right)^2 - \left(\frac{\sum x_i}{n}\right)^2 + \frac{\sum x_i^2}{n}\right)} \\
&= \frac{1}{(2\pi)^{\frac n2} s^{\frac n2}}\, e^{-\frac12 \frac{(\mu-\bar x)^2 + V_x}{s/n}}
\end{align*}

with x̄ the mean and V_x the variance of the sample:

\[ \bar x = \frac1n \sum x_i \qquad V_x = \frac1n \sum x_i^2 - \Big(\frac1n \sum x_i\Big)^2 \]
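The completing-the-square step above rests on the identity ∑_i (x_i − µ)² = n((µ − x̄)² + V_x), which can be spot-checked numerically (a sanity check of ours, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(10.0, 3.0, size=20)  # arbitrary sample
mu = 7.3                            # arbitrary value of the parameter mu
n = len(x)
xbar = x.mean()                      # sample mean
Vx = (x ** 2).mean() - xbar ** 2     # sample variance (biased form, as above)
lhs = np.sum((x - mu) ** 2)          # sum of squared deviations from mu
rhs = n * ((mu - xbar) ** 2 + Vx)    # completed-square form
```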
Example: Gaussian
\begin{align*}
q_\mu^*(\mu) &\propto \exp\Big( \int_{-\infty}^{\infty} \log p(\mu, s, x_1,\dots,x_n) \cdot q_s(s|a,b)\, ds \Big) \\
&= \exp\Big( \int_{-\infty}^{\infty} \log\big( d_\mu(\mu) \cdot d_s(s) \cdot d(\mu,s) \big) \cdot q_s(s|a,b)\, ds \Big)
\end{align*}

\begin{align*}
\log q_\mu^*(\mu) &= \text{const}(\mu) + \int_{-\infty}^{\infty} \log d_\mu(\mu)\, q_s(s|a,b)\, ds + \underbrace{\int_{-\infty}^{\infty} \log d_s(s) \cdot q_s(s|a,b)\, ds}_{=\text{const}(\mu)} + \int_{-\infty}^{\infty} \log d(\mu,s) \cdot q_s(s|a,b)\, ds \\
&= \text{const}(\mu) + \log d_\mu(\mu) \underbrace{\int_{-\infty}^{\infty} q_s(s|a,b)\, ds}_{=1} + \int_{-\infty}^{\infty} \log d(\mu,s) \cdot q_s(s|a,b)\, ds
\end{align*}
Example: Gaussian
\begin{align*}
&\int_{-\infty}^{\infty} \log d(\mu,s) \cdot q_s(s|a,b)\, ds \\
&= \int_{-\infty}^{\infty} \log\Big( \frac{1}{(2\pi)^{\frac n2} s^{\frac n2}}\, e^{-\frac12 \frac{(\mu-\bar x)^2 + V_x}{s/n}} \Big) \cdot \frac{b^a}{\Gamma(a)}\, s^{-a-1} e^{-\frac bs}\, ds \\
&= \underbrace{\int_{-\infty}^{\infty} \log \frac{1}{(2\pi)^{\frac n2} s^{\frac n2}} \cdot \frac{b^a}{\Gamma(a)}\, s^{-a-1} e^{-\frac bs}\, ds}_{=\text{const}(\mu)} + \int_{-\infty}^{\infty} \log e^{-\frac12 \frac{(\mu-\bar x)^2 + V_x}{s/n}} \cdot \frac{b^a}{\Gamma(a)}\, s^{-a-1} e^{-\frac bs}\, ds \\
&= \text{const}(\mu) + \int_{-\infty}^{\infty} \Big( -\frac12\, \frac{(\mu-\bar x)^2 + V_x}{s/n} \Big) \cdot \frac{b^a}{\Gamma(a)}\, s^{-a-1} e^{-\frac bs}\, ds
\end{align*}
Example: Gaussian
\begin{align*}
&= \text{const}(\mu) + \int_{-\infty}^{\infty} \Big( -\frac12\, \frac{(\mu-\bar x)^2 + V_x}{s/n} \Big) \cdot \frac{b^a}{\Gamma(a)}\, s^{-a-1} e^{-\frac bs}\, ds \\
&= \text{const}(\mu) + \int_{-\infty}^{\infty} \Big( -\frac12\, \frac{(\mu-\bar x)^2 + V_x}{1/n} \Big) \cdot \frac{\Gamma(a+1)}{\Gamma(a) \cdot b} \cdot \frac{b^{a+1}}{\Gamma(a+1)}\, s^{-(a+1)-1} e^{-\frac bs}\, ds \\
&= \text{const}(\mu) + \Big( -\frac12\, \frac{(\mu-\bar x)^2 + V_x}{1/n} \Big) \cdot \frac ab \cdot \underbrace{\int_{-\infty}^{\infty} \frac{b^{a+1}}{\Gamma(a+1)}\, s^{-(a+1)-1} e^{-\frac bs}\, ds}_{=1} \\
&= \text{const}(\mu) + \frac ab \cdot \Big( -\frac12\, \frac{(\mu-\bar x)^2}{1/n} \Big) + \underbrace{\frac ab \cdot \Big( -\frac12\, \frac{V_x}{1/n} \Big)}_{=\text{const}(\mu)} \\
&= \text{const}(\mu) + \frac ab \cdot \Big( -\frac12\, \frac{(\mu-\bar x)^2}{1/n} \Big)
\end{align*}
Example: Gaussian
Assembling all pieces:

\begin{align*}
q_\mu^*(\mu) &\propto d_\mu(\mu) \cdot \exp\Big( \frac ab \cdot \Big( -\frac12\, \frac{(\mu-\bar x)^2}{1/n} \Big) \Big) \\
&\propto \exp\Big( -\frac12 \Big( \frac{(\mu-m_0)^2}{r_0} + \frac{(\mu-\bar x)^2}{\frac{b}{an}} \Big) \Big) \\
&= \exp\Big( -\frac12\, \frac{\frac{b}{an}(\mu^2 - 2\mu m_0 + m_0^2) + r_0(\mu^2 - 2\mu\bar x + \bar x^2)}{r_0 \frac{b}{an}} \Big) \\
&= \exp\Big( -\frac12\, \frac{\big(r_0 + \frac{b}{an}\big)\mu^2 - 2\mu\big(\frac{b m_0}{an} + r_0 \bar x\big) + \frac{b m_0^2}{an} + r_0 \bar x^2}{r_0 \frac{b}{an}} \Big) \\
&= \exp\Bigg( -\frac12\, \frac{\Big(\mu - \frac{\frac{b m_0}{an} + r_0 \bar x}{r_0 + \frac{b}{an}}\Big)^2 - \Big(\frac{\frac{b m_0}{an} + r_0 \bar x}{r_0 + \frac{b}{an}}\Big)^2 + \frac{\frac{b m_0^2}{an} + r_0 \bar x^2}{r_0 + \frac{b}{an}}}{\frac{r_0 \frac{b}{an}}{r_0 + \frac{b}{an}}} \Bigg)
\end{align*}
Example: Gaussian
\begin{align*}
q_\mu^*(\mu) &\propto \exp\Bigg( -\frac12\, \frac{\Big(\mu - \frac{\frac{b m_0}{an} + r_0 \bar x}{r_0 + \frac{b}{an}}\Big)^2 - \Big(\frac{\frac{b m_0}{an} + r_0 \bar x}{r_0 + \frac{b}{an}}\Big)^2 + \frac{\frac{b m_0^2}{an} + r_0 \bar x^2}{r_0 + \frac{b}{an}}}{\frac{r_0 \frac{b}{an}}{r_0 + \frac{b}{an}}} \Bigg) \\
&\propto \exp\Bigg( -\frac12\, \frac{\Big(\mu - \frac{b m_0 + r_0 a n \bar x}{r_0 a n + b}\Big)^2}{\frac{r_0 b}{r_0 a n + b}} \Bigg)
\end{align*}

Comparing q_\mu^*(\mu) with the parameterized form q_\mu(\mu|m,r) = \frac{1}{\sqrt{2\pi r}}\, e^{-\frac12 \frac{(\mu-m)^2}{r}} yields

\[ m \leftarrow \frac{b m_0 + r_0 a n \bar x}{r_0 a n + b} \qquad r \leftarrow \frac{r_0 b}{r_0 a n + b} \]
Example: Gaussian
With a similar calculation, we obtain

\[ q_s^*(s) \propto s^{-(a_0 + \frac n2) - 1}\, e^{-\frac{b_0 + \frac n2 \left(V_x + (m - \bar x)^2 + r\right)}{s}} \]

Comparing q_s^*(s) with the parameterized form q_s(s|a,b) = \frac{b^a}{\Gamma(a)}\, s^{-a-1} e^{-\frac bs} yields

\[ a \leftarrow a_0 + \frac n2 \qquad b \leftarrow b_0 + \frac n2 \left(V_x + (m - \bar x)^2 + r\right) \]
Example: Gaussian
Algorithm:
1. start with arbitrary values of m, r, a, b
2. repeat
3.   set m ← (b m_0 + r_0 a n x̄)/(r_0 a n + b)
4.   set r ← r_0 b/(r_0 a n + b)
5.   set a ← a_0 + n/2
6.   set b ← b_0 + (n/2)(V_x + (m − x̄)² + r)
7. until convergence of (m, r, a, b)
8. return (m, r, a, b)
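The algorithm above can be transcribed almost literally into code. The sketch below (Python with NumPy; the function name, starting values, and the fixed iteration count in place of a convergence test are our choices) implements the four update rules:

```python
import numpy as np

def vb_gaussian(x, m0, r0, a0, b0, iters=50):
    """Variational Bayes for X_i ~ N(mu, s) with priors
    mu ~ N(m0, r0) and s ~ InvGamma(a0, b0); returns (m, r, a, b)."""
    n = len(x)
    xbar = np.mean(x)                   # sample mean
    Vx = np.mean(x ** 2) - xbar ** 2    # sample variance (biased form)
    # step 1: arbitrary starting values
    m, r, a, b = m0, r0, a0 + n / 2, b0 + n / 2 * Vx
    for _ in range(iters):  # fixed iteration count instead of a convergence test
        m = (b * m0 + r0 * a * n * xbar) / (r0 * a * n + b)
        r = r0 * b / (r0 * a * n + b)
        a = a0 + n / 2
        b = b0 + n / 2 * (Vx + (m - xbar) ** 2 + r)
    return m, r, a, b
```

With priors close to non-informativity, m approaches the sample mean, and the posterior mean of s under q_s, namely b/(a − 1), approaches the sample variance.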
Example: Gaussian
Experiment:
◮ generate a sample (n = 20) from N(10, 9)
◮ use priors for µ and s close to non-informativity
◮ apply 10 iterations of variational Bayes

[Figure — blue: original sample distribution; black: sample points; green: ML estimator; red: MAP estimator after variational Bayes]
Example: Gaussian
Comparison: full posterior vs. variational posterior

[Figure: contour plots over µ (horizontal, 9 to 15) and √s (vertical, 2 to 7) — left: full posterior; right: variational approximation]
Example: Gaussian
◮ bad experience: the calculations are lengthy and error-prone
◮ good news: the equations can be solved in closed form whenever the distributions involved are from the exponential family, e.g. Gaussian, Gamma, Inverse Gamma, Wishart, Inverse Wishart, Dirichlet, Beta, Categorical
◮ semi-good news: there is a message passing algorithm (variational message passing) that implements variational inference for exponential family distributions. But it is hard to understand, and sometimes it is easier to do all the calculations manually (Winn, 2005)
Mixture distributions
How can we model distributions that are different from and more complex than the standard distribution families?

◮ search the literature for other distributions
◮ combine distributions
  • combine distributions of different kinds (unusual)
  • combine distributions of the same kind
  → mixture distributions, e.g. mixture of Gaussians, mixture of Dirichlets, ...
Mixture distributions
How do we combine distributions into a mixture?

Requirements for a density function:
◮ f(x) ≥ 0 for all x ∈ ℝ
◮ ∫_{−∞}^{∞} f(x) dx = 1

Observation: if f and g are pdfs and 0 < w < 1, then x ↦ w·f(x) + (1−w)·g(x) is also a pdf.

More generally, if f_1, …, f_k are pdfs and w_1, …, w_k are nonnegative numbers with ∑_{j=1}^k w_j = 1, then x ↦ ∑_{j=1}^k w_j f_j(x) is a pdf. Such a distribution is called a mixture distribution.

◮ f_j is the j-th component of the mixture
◮ w_j serves as mixing weight and models the amount of contribution of f_j to the mixture
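Both requirements are easy to verify for a concrete mixture. A minimal sketch (the helper names are our own):

```python
import numpy as np

def gauss_pdf(mu, var):
    """Return a univariate Gaussian density function."""
    return lambda x: np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def mixture_pdf(x, weights, components):
    """Evaluate sum_j w_j f_j(x); weights must be nonnegative and sum to 1."""
    return sum(w * f(x) for w, f in zip(weights, components))

# a two-component Gaussian mixture with weights 0.3 and 0.7
xs = np.linspace(-30.0, 30.0, 60001)
vals = mixture_pdf(xs, [0.3, 0.7], [gauss_pdf(-2.0, 1.0), gauss_pdf(3.0, 4.0)])
```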
Mixture distributions
Interpretation of mixture distributions as structured distributions:
◮ each component models one class (category), which is described by f_j
◮ class j contributes with a ratio of w_j to the whole
◮ each sample element x_i of a mixture belongs to one component, but we do not know to which one

Introduce latent variables z_i that model to which component x_i belongs.

[Graphical model: plate over i = 1, …, n with nodes Z_i and X_i, plate over j = 1, …, k with node f_j, and node w⃗]

\begin{align*}
Z_i &\sim \mathcal C(\vec w) \\
X_i \mid Z_i &\sim f_{Z_i}
\end{align*}
Example: Gaussian mixture
[Graphical model: plate over i = 1, …, n with nodes Z_i and X_i, plate over j = 1, …, k with nodes µ_j and s_j, and node w⃗]

\begin{align*}
Z_i &\sim \mathcal C(\vec w) \\
X_i \mid Z_i &\sim \mathcal N(\mu_{Z_i}, s_{Z_i})
\end{align*}

How can we apply variational Bayesian inference to Gaussian mixtures?
Variational Bayes for Gaussian mixture
[Graphical model: hyperparameters m_0, r_0, a_0, b_0, β⃗; plate over j = 1, …, k with nodes µ_j and s_j; node w⃗; plate over i = 1, …, n with nodes Z_i and X_i]

\begin{align*}
\mu_j &\sim \mathcal N(m_0, r_0) \\
s_j &\sim \Gamma^{-1}(a_0, b_0) \\
\vec w &\sim \mathcal D(\vec\beta) \\
Z_i \mid \vec w &\sim \mathcal C(\vec w) \\
X_i \mid Z_i, \mu_{Z_i}, s_{Z_i} &\sim \mathcal N(\mu_{Z_i}, s_{Z_i})
\end{align*}

\begin{align*}
&p(\mu_1,\dots,\mu_k, s_1,\dots,s_k, w_1,\dots,w_k, x_1,\dots,x_n, z_1,\dots,z_n) \\
&= \prod_{j=1}^k \Big( \frac{1}{\sqrt{2\pi r_0}}\, e^{-\frac12 \frac{(\mu_j-m_0)^2}{r_0}} \Big) \cdot \prod_{j=1}^k \Big( \frac{b_0^{a_0}}{\Gamma(a_0)}\, s_j^{-a_0-1} e^{-\frac{b_0}{s_j}} \Big) \cdot \frac{\Gamma\big(\sum_{j=1}^k \beta_j\big)}{\prod_{j=1}^k \Gamma(\beta_j)} \prod_{j=1}^k w_j^{\beta_j-1} \cdot \prod_{i=1}^n w_{z_i} \cdot \prod_{i=1}^n \Big( \frac{1}{\sqrt{2\pi s_{z_i}}}\, e^{-\frac12 \frac{(x_i-\mu_{z_i})^2}{s_{z_i}}} \Big)
\end{align*}
Variational Bayes for Gaussian mixture
Modeling the full posterior by a variational approximation:

\begin{align*}
&q(\mu_1,\dots,\mu_k, s_1,\dots,s_k, w_1,\dots,w_k, z_1,\dots,z_n \mid m_1,\dots,m_k, r_1,\dots,r_k, a_1,\dots,a_k, b_1,\dots,b_k, \alpha_1,\dots,\alpha_k, h_{1,1},\dots,h_{n,k}) \\
&= \prod_{j=1}^k q_\mu(\mu_j|m_j,r_j) \cdot \prod_{j=1}^k q_s(s_j|a_j,b_j) \cdot q_{\vec w}(\vec w|\vec\alpha) \cdot \prod_{i=1}^n q_z(z_i|h_{i,1},\dots,h_{i,k})
\end{align*}

with

\begin{align*}
q_\mu(\mu_j|m_j,r_j) &= \frac{1}{\sqrt{2\pi r_j}}\, e^{-\frac12 \frac{(\mu_j-m_j)^2}{r_j}} \\
q_s(s_j|a_j,b_j) &= \frac{b_j^{a_j}}{\Gamma(a_j)}\, s_j^{-a_j-1} e^{-\frac{b_j}{s_j}} \\
q_{\vec w}(\vec w|\vec\alpha) &= \frac{\Gamma\big(\sum_{j=1}^k \alpha_j\big)}{\prod_{j=1}^k \Gamma(\alpha_j)} \prod_{j=1}^k w_j^{\alpha_j-1} \\
q_z(z_i|h_{i,1},\dots,h_{i,k}) &= h_{i,z_i}
\end{align*}
Variational Bayes for Gaussian mixture
After some (more or less complicated) calculations, we obtain the update rules

\begin{align*}
m_j &\leftarrow \frac{b_j m_0 + r_0 a_j n_j \bar x_j}{b_j + r_0 a_j n_j} \\
r_j &\leftarrow \frac{r_0 b_j}{b_j + r_0 a_j n_j} \\
a_j &\leftarrow a_0 + \frac{n_j}{2} \\
b_j &\leftarrow b_0 + \frac{n_j}{2}\big( V_{x,j} + (m_j - \bar x_j)^2 + r_j \big) \\
\alpha_j &\leftarrow \beta_j + n_j \\
h_{i,j} &\leftarrow c \cdot e^{\psi(\alpha_j) + \frac12 \psi(a_j) - \frac12 \log(b_j) - \frac12 \frac{a_j}{b_j}\left((x_i - m_j)^2 + r_j\right)} \qquad \text{where } c \text{ normalizes } \sum_{j=1}^k h_{i,j} \text{ to } 1
\end{align*}

with

\[ n_j = \sum_{i=1}^n h_{i,j} \qquad \bar x_j = \frac{1}{n_j} \sum_{i=1}^n h_{i,j}\, x_i \qquad V_{x,j} = \frac{1}{n_j} \sum_{i=1}^n h_{i,j}\, x_i^2 - \bar x_j^2 \]

ψ denotes the digamma function ψ(x) = Γ'(x)/Γ(x).
Variational Bayes for Gaussian mixture
Example → Matlab demo

[Figure: iteration = 300, k = 30, n = 1000] The plot shows the MAP estimate after variational inference. A sample of size 1000 is taken from a uniform distribution. Priors were set close to non-informativity.
Summary
◮ Kullback-Leibler-divergence
◮ principle of variational Bayes and theoretical derivation
◮ Example: variational Bayes for a Gaussian
◮ mixture distributions
◮ Example: variational Bayes for Gaussian mixtures