LEARNING AND INFERENCE IN GRAPHICAL MODELS
Chapter 06: Variational Bayesian Inference
Dr. Martin Lauer
University of Freiburg, Machine Learning Lab
Karlsruhe Institute of Technology, Institute of Measurement and Control Systems
Learning and Inference in Graphical Models. Chapter 06 – p. 1/38
References for this chapter
◮ Christopher M. Bishop, Pattern Recognition and Machine Learning, ch. 10, Springer, 2006
◮ Charles Fox and Stephen Roberts, A Tutorial on Variational Bayesian Inference, In: Artificial Intelligence Review, vol. 38, no. 2, pp. 85–95, 2012
◮ John Winn and Christopher M. Bishop, Variational Message Passing, In: Journal of Machine Learning Research, vol. 6, pp. 661–694, 2005, http://machinelearning.wustl.edu/mlpapers/paper_files/WinnB05.pdf
Approximative solutions
Observations:
◮ inference on Bayesian networks can be done analytically for polytrees combined with special distributions (categorical, Gauss-linear)
◮ inference is not analytically tractable in the general case
◮ the general case therefore requires numerical or approximative solutions
Approximative inference
The joint probability distribution of many Bayesian networks is pretty complicated and hard to treat analytically.
Goal: find a simpler joint probability distribution that approximates the original one and that can be treated analytically.
Example:

[Figure: two graphical models over µ and σ² with a plate over X_1, …, X_n — left: conjugate prior for Gaussians; right: desirable, simpler prior for Gaussians]
Approximative inference
How can we measure whether two distributions are similar?
Definition: the Kullback-Leibler divergence is an asymmetric measure for the dissimilarity of two probability distributions. It is defined by

\[ KL(p\,\|\,q) = \int_{-\infty}^{\infty} p(x) \log\frac{p(x)}{q(x)}\, dx \]

Properties:
◮ KL(p||p) = 0
◮ KL(p||q) ≥ 0, with equality only if p = q almost everywhere
◮ KL(p||q) ≠ KL(q||p), i.e. it is not symmetric
◮ KL(p||q) + KL(q||r) ≱ KL(p||r), i.e. the triangle inequality does not hold
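For univariate Gaussians the KL divergence has a closed form, which makes these properties easy to check numerically. The following sketch (plain Python with NumPy; the helper names are ours) verifies KL(p||p) = 0 and the asymmetry, and cross-checks the closed form against direct numerical integration:

```python
import numpy as np

def kl_gauss(mu1, var1, mu2, var2):
    """Closed-form KL(N(mu1,var1) || N(mu2,var2)) for univariate Gaussians."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def kl_numeric(mu1, var1, mu2, var2):
    """KL(p||q) via direct numerical integration of p(x) log(p(x)/q(x))."""
    x = np.linspace(-20.0, 20.0, 100001)
    p = np.exp(-0.5 * (x - mu1) ** 2 / var1) / np.sqrt(2 * np.pi * var1)
    q = np.exp(-0.5 * (x - mu2) ** 2 / var2) / np.sqrt(2 * np.pi * var2)
    return np.sum(p * np.log(p / q)) * (x[1] - x[0])

kl_pq = kl_gauss(0.0, 1.0, 1.0, 4.0)   # KL(p||q)
kl_qp = kl_gauss(1.0, 4.0, 0.0, 1.0)   # KL(q||p): a different value
```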
Kullback-Leibler-divergence
[Figure: a density p(x) together with the Gaussians argmin_q KL(p||q) and argmin_q KL(q||p)]

Question: which Gaussian q minimizes the Kullback-Leibler divergence KL(p||q) and KL(q||p), respectively?
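The answer can be illustrated numerically. In the sketch below (our own illustration, not from the slides), p is a bimodal Gaussian mixture and we grid-search the Gaussian q minimizing each direction of the divergence: minimizing KL(p||q) yields a broad Gaussian covering both modes, while minimizing KL(q||p) locks onto a single mode.

```python
import numpy as np

xs = np.linspace(-12.0, 12.0, 4001)
dx = xs[1] - xs[0]

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# bimodal target p: an equal mixture of N(-3, 1) and N(3, 1)
p = 0.5 * gauss(xs, -3.0, 1.0) + 0.5 * gauss(xs, 3.0, 1.0)

def kl(f, g):
    """Discretized KL(f||g) on the grid xs."""
    return np.sum(f * np.log(f / g)) * dx

best_fwd = None  # (value, mu, sigma) minimizing KL(p||q)
best_rev = None  # (value, mu, sigma) minimizing KL(q||p)
for mu in np.linspace(-4.0, 4.0, 17):
    for sig in np.linspace(0.5, 5.0, 19):
        q = gauss(xs, mu, sig ** 2)
        fwd, rev = kl(p, q), kl(q, p)
        if best_fwd is None or fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sig)
        if best_rev is None or rev < best_rev[0]:
            best_rev = (rev, mu, sig)
```

The forward minimizer ends up centered between the modes with a large variance ("zero-avoiding"), the reverse minimizer sits on one mode with a small variance ("mode-seeking").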
Variational Bayesian inference
◮ assume we are given a complex distribution p(U|O), where U are the unobserved variables and O the observed variables
◮ we want to approximate it with a parameterized distribution q(U|θ), with θ the set of parameters
◮ we assume that q can be factorized as q(U|θ) = ∏_i q_i(U_i|θ_i), with each q_i a conditional distribution
◮ how should we choose θ to obtain the best approximation?

\[ \min_\theta \; KL\big(q(U|\theta)\,\|\,p(U|O)\big) \]
Variational Bayesian inference
\begin{align*}
KL\big(q(U|\theta)\,\|\,p(U|O)\big) &= \int q(u|\theta) \log\frac{q(u|\theta)}{p(u|O)}\, du \\
&= -\int q(u|\theta) \log\frac{p(u|O)}{q(u|\theta)}\, du \\
&= -\int q(u|\theta) \log\frac{p(u,O)}{p(O)\cdot q(u|\theta)}\, du \\
&= -\int \Big( q(u|\theta) \log\frac{p(u,O)}{q(u|\theta)} - q(u|\theta)\log p(O) \Big)\, du \\
&= -\underbrace{\int q(u|\theta) \log\frac{p(u,O)}{q(u|\theta)}\, du}_{=:\,L(\theta)} + \log p(O)
\end{align*}

Observe that p(O) does not depend on θ.

Hence, to minimize the KL divergence we need to maximize L(θ).
Variational Bayesian inference
Use the factorization q(U|θ) = ∏_i q_i(U_i|θ_i):

\begin{align*}
L(\theta) &= \int \Big(\prod_{i=1}^n q_i(u_i|\theta_i)\Big) \log p(u,O)\, d(u_1,\dots,u_n) - \int \Big(\prod_{i=1}^n q_i(u_i|\theta_i)\Big)\Big(\sum_{j=1}^n \log q_j(u_j|\theta_j)\Big)\, d(u_1,\dots,u_n) \\
&= \int \Big(\prod_{i=1}^n q_i(u_i|\theta_i)\Big) \log p(u,O)\, d(u_1,\dots,u_n) - \sum_{j=1}^n \int \Big(\prod_{i=1}^n q_i(u_i|\theta_i)\Big) \log q_j(u_j|\theta_j)\, d(u_1,\dots,u_n) \\
&= \int \Big(\prod_{i=1}^n q_i(u_i|\theta_i)\Big) \log p(u,O)\, d(u_1,\dots,u_n) - \sum_{j=1}^n \Big( \prod_{i\neq j} \Big(\int q_i(u_i|\theta_i)\, du_i\Big) \cdot \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\, du_j \Big)
\end{align*}
Variational Bayesian inference
\begin{align*}
L(\theta) &= \int \Big(\prod_{i=1}^n q_i(u_i|\theta_i)\Big) \log p(u,O)\, d(u_1,\dots,u_n) - \sum_{j=1}^n \Big( \prod_{i\neq j} \underbrace{\Big(\int q_i(u_i|\theta_i)\, du_i\Big)}_{=1} \cdot \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\, du_j \Big) \\
&= \int \Big(\prod_{i=1}^n q_i(u_i|\theta_i)\Big) \log p(u,O)\, d(u_1,\dots,u_n) - \sum_{j=1}^n \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\, du_j
\end{align*}
Variational Bayesian inference
Select one (arbitrary) factor k:

\begin{align*}
L(\theta) &= \int \Big(\prod_{i=1}^n q_i(u_i|\theta_i)\Big) \log p(u,O)\, d(u_1,\dots,u_n) - \sum_{j=1}^n \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\, du_j \\
&= \int q_k(u_k|\theta_k) \Big( \int\!\cdots\!\int \log p(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\, du_1 \dots du_{k-1}\, du_{k+1} \dots du_n \Big)\, du_k - \sum_{j=1}^n \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\, du_j \\
&= \int q_k(u_k|\theta_k) \log\Big( \exp\Big( \int\!\cdots\!\int \log p(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\, du_1 \dots du_{k-1}\, du_{k+1} \dots du_n \Big)\Big)\, du_k - \sum_{j=1}^n \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\, du_j
\end{align*}
Variational Bayesian inference
\begin{align*}
L(\theta) &= \int q_k(u_k|\theta_k) \log\Big( \exp\Big( \int\!\cdots\!\int \log p(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\, du_1 \dots du_{k-1}\, du_{k+1} \dots du_n \Big)\Big)\, du_k - \sum_{j=1}^n \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\, du_j \\
&= \int q_k(u_k|\theta_k) \log q_k^*(u_k)\, du_k + \log Z - \sum_{j=1}^n \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\, du_j
\end{align*}

with

\begin{align*}
Z &= \int \exp\Big( \int\!\cdots\!\int \log p(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\, du_1 \dots du_{k-1}\, du_{k+1} \dots du_n \Big)\, du_k \\
q_k^*(u_k) &= \frac{1}{Z} \exp\Big( \int\!\cdots\!\int \log p(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\, du_1 \dots du_{k-1}\, du_{k+1} \dots du_n \Big)
\end{align*}

Z serves as a normalization constant so that q_k^* becomes the density function of a Gibbs distribution.
Variational Bayesian inference
\begin{align*}
L(\theta) &= \int q_k(u_k|\theta_k) \log q_k^*(u_k)\, du_k + \log Z - \sum_{j=1}^n \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\, du_j \\
&= \int q_k(u_k|\theta_k) \log q_k^*(u_k)\, du_k - \int q_k(u_k|\theta_k) \log q_k(u_k|\theta_k)\, du_k + \log Z - \sum_{j\neq k} \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\, du_j \\
&= -KL\big(q_k(u_k|\theta_k)\,\|\,q_k^*(u_k)\big) + \log Z - \sum_{j\neq k} \int q_j(u_j|\theta_j) \log q_j(u_j|\theta_j)\, du_j
\end{align*}

We want to maximize L. If we keep all θ_i fixed except θ_k, we should choose θ_k so that q_k(u_k|θ_k) = q_k^*(u_k).

If we apply this idea repeatedly, cycling through all possible values of k, we obtain an iterative algorithm that converges to a local maximum of L(θ).
Variational Bayesian inference
Algorithm:
1. start with an arbitrary parameter set θ
2. repeat
3.   for k ← 1, …, n do
4.     select θ_k so that q_k(u_k|θ_k) = q_k^*(u_k)
5.   endfor
6. until convergence of θ
7. return θ
Variational Bayesian inference
A closer look at q_k^*(u_k):

\[ q_k^*(u_k) = \frac{1}{Z} \exp\Big( \int\!\cdots\!\int \log p(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\, du_1 \dots du_{k-1}\, du_{k+1} \dots du_n \Big) \]

If p was derived from a Bayesian network, we know that it factors into terms that belong to the Markov blanket of u_k and other terms, i.e. p(u,O) = p'(u,O) · p''(u,O), where the second term does not depend on u_k.

\begin{align*}
q_k^*(u_k) &= \frac{1}{Z} \exp\Big( \int\!\cdots\!\int \log p'(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\, du_1 \dots du_{k-1}\, du_{k+1} \dots du_n \\
&\qquad\qquad + \underbrace{\int\!\cdots\!\int \log p''(u,O) \prod_{i\neq k} q_i(u_i|\theta_i)\, du_1 \dots du_{k-1}\, du_{k+1} \dots du_n}_{\text{constant w.r.t. } u_k} \Big) \\
&= \frac{1}{Z'} \exp\Big( \int\!\cdots\!\int \log p'(u,O) \prod_{i\in \text{blanket}(k)} q_i(u_i|\theta_i)\, d\{u_i \mid i\in \text{blanket}(k)\} \Big)
\end{align*}
Example: Gaussian
Example: a sample from a Gaussian with unknown parameters

[Graphical model: hyperparameters m_0, r_0, a_0, b_0; nodes µ and s; plate over X_1, …, X_n]

\begin{align*}
\mu &\sim \mathcal N(m_0, r_0) \\
s &\sim \Gamma^{-1}(a_0, b_0) \\
X_i &\sim \mathcal N(\mu, s)
\end{align*}

\[ p(\mu, s, x_1,\dots,x_n) = \underbrace{\frac{1}{\sqrt{2\pi r_0}}\, e^{-\frac12 \frac{(\mu-m_0)^2}{r_0}}}_{=:\,d_\mu(\mu)} \cdot \underbrace{\frac{b_0^{a_0}}{\Gamma(a_0)}\, s^{-a_0-1} e^{-\frac{b_0}{s}}}_{=:\,d_s(s)} \cdot \underbrace{\prod_{i=1}^n \frac{1}{\sqrt{2\pi s}}\, e^{-\frac12 \frac{(x_i-\mu)^2}{s}}}_{=:\,d(\mu,s)} \]
Example: Gaussian
[Graphical model as before: hyperparameters m_0, r_0, a_0, b_0; nodes µ and s; plate over X_1, …, X_n]

Modeling the full posterior by a variational approximation:

\[ q(\mu, s|m, r, a, b) = q_\mu(\mu|m,r) \cdot q_s(s|a,b) \]

\begin{align*}
q_\mu(\mu|m,r) &= \frac{1}{\sqrt{2\pi r}}\, e^{-\frac12 \frac{(\mu-m)^2}{r}} \\
q_s(s|a,b) &= \frac{b^a}{\Gamma(a)}\, s^{-a-1} e^{-\frac{b}{s}}
\end{align*}
Example: Gaussian
\begin{align*}
q_\mu^*(\mu) &\propto \exp\Big( \int_{-\infty}^{\infty} \log p(\mu, s, x_1,\dots,x_n) \cdot q_s(s|a,b)\, ds \Big) \\
q_s^*(s) &\propto \exp\Big( \int_{-\infty}^{\infty} \log p(\mu, s, x_1,\dots,x_n) \cdot q_\mu(\mu|m,r)\, d\mu \Big)
\end{align*}
Example: Gaussian
Side calculation:

\begin{align*}
d(\mu,s) &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi s}}\, e^{-\frac12 \frac{(x_i-\mu)^2}{s}} \\
&= \frac{1}{(2\pi)^{\frac n2} s^{\frac n2}}\, e^{-\frac{1}{2s}\left(\sum x_i^2 - 2\mu \sum x_i + n\mu^2\right)} \\
&= \frac{1}{(2\pi)^{\frac n2} s^{\frac n2}}\, e^{-\frac{n}{2s}\left(\mu^2 - 2\mu \frac{\sum x_i}{n} + \frac{\sum x_i^2}{n}\right)} \\
&= \frac{1}{(2\pi)^{\frac n2} s^{\frac n2}}\, e^{-\frac{n}{2s}\left(\mu^2 - 2\mu \frac{\sum x_i}{n} + \left(\frac{\sum x_i}{n}\right)^2 - \left(\frac{\sum x_i}{n}\right)^2 + \frac{\sum x_i^2}{n}\right)} \\
&= \frac{1}{(2\pi)^{\frac n2} s^{\frac n2}}\, e^{-\frac12 \frac{(\mu-\bar x)^2 + V_x}{s/n}}
\end{align*}

with x̄ the mean and V_x the variance of the sample:

\[ \bar x = \frac1n \sum x_i \qquad V_x = \frac1n \sum x_i^2 - \Big(\frac1n \sum x_i\Big)^2 \]
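The completing-the-square step above rests on the identity ∑_i (x_i − µ)² = n((µ − x̄)² + V_x), which can be spot-checked numerically (a sanity check of ours, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(10.0, 3.0, size=20)  # arbitrary sample
mu = 7.3                            # arbitrary value of the parameter mu
n = len(x)
xbar = x.mean()                      # sample mean
Vx = (x ** 2).mean() - xbar ** 2     # sample variance (biased form, as above)
lhs = np.sum((x - mu) ** 2)          # sum of squared deviations from mu
rhs = n * ((mu - xbar) ** 2 + Vx)    # completed-square form
```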
Example: Gaussian
\begin{align*}
q_\mu^*(\mu) &\propto \exp\Big( \int_{-\infty}^{\infty} \log p(\mu, s, x_1,\dots,x_n) \cdot q_s(s|a,b)\, ds \Big) \\
&= \exp\Big( \int_{-\infty}^{\infty} \log\big( d_\mu(\mu) \cdot d_s(s) \cdot d(\mu,s) \big) \cdot q_s(s|a,b)\, ds \Big)
\end{align*}

\begin{align*}
\log q_\mu^*(\mu) &= \text{const}(\mu) + \int_{-\infty}^{\infty} \log d_\mu(\mu)\, q_s(s|a,b)\, ds + \underbrace{\int_{-\infty}^{\infty} \log d_s(s) \cdot q_s(s|a,b)\, ds}_{=\text{const}(\mu)} + \int_{-\infty}^{\infty} \log d(\mu,s) \cdot q_s(s|a,b)\, ds \\
&= \text{const}(\mu) + \log d_\mu(\mu) \underbrace{\int_{-\infty}^{\infty} q_s(s|a,b)\, ds}_{=1} + \int_{-\infty}^{\infty} \log d(\mu,s) \cdot q_s(s|a,b)\, ds
\end{align*}
Example: Gaussian
\begin{align*}
&\int_{-\infty}^{\infty} \log d(\mu,s) \cdot q_s(s|a,b)\, ds \\
&= \int_{-\infty}^{\infty} \log\Big( \frac{1}{(2\pi)^{\frac n2} s^{\frac n2}}\, e^{-\frac12 \frac{(\mu-\bar x)^2 + V_x}{s/n}} \Big) \cdot \frac{b^a}{\Gamma(a)}\, s^{-a-1} e^{-\frac bs}\, ds \\
&= \underbrace{\int_{-\infty}^{\infty} \log \frac{1}{(2\pi)^{\frac n2} s^{\frac n2}} \cdot \frac{b^a}{\Gamma(a)}\, s^{-a-1} e^{-\frac bs}\, ds}_{=\text{const}(\mu)} + \int_{-\infty}^{\infty} \log e^{-\frac12 \frac{(\mu-\bar x)^2 + V_x}{s/n}} \cdot \frac{b^a}{\Gamma(a)}\, s^{-a-1} e^{-\frac bs}\, ds \\
&= \text{const}(\mu) + \int_{-\infty}^{\infty} \Big( -\frac12\, \frac{(\mu-\bar x)^2 + V_x}{s/n} \Big) \cdot \frac{b^a}{\Gamma(a)}\, s^{-a-1} e^{-\frac bs}\, ds
\end{align*}
Example: Gaussian
\begin{align*}
&= \text{const}(\mu) + \int_{-\infty}^{\infty} \Big( -\frac12\, \frac{(\mu-\bar x)^2 + V_x}{s/n} \Big) \cdot \frac{b^a}{\Gamma(a)}\, s^{-a-1} e^{-\frac bs}\, ds \\
&= \text{const}(\mu) + \int_{-\infty}^{\infty} \Big( -\frac12\, \frac{(\mu-\bar x)^2 + V_x}{1/n} \Big) \cdot \frac{\Gamma(a+1)}{\Gamma(a) \cdot b} \cdot \frac{b^{a+1}}{\Gamma(a+1)}\, s^{-(a+1)-1} e^{-\frac bs}\, ds \\
&= \text{const}(\mu) + \Big( -\frac12\, \frac{(\mu-\bar x)^2 + V_x}{1/n} \Big) \cdot \frac ab \cdot \underbrace{\int_{-\infty}^{\infty} \frac{b^{a+1}}{\Gamma(a+1)}\, s^{-(a+1)-1} e^{-\frac bs}\, ds}_{=1} \\
&= \text{const}(\mu) + \frac ab \cdot \Big( -\frac12\, \frac{(\mu-\bar x)^2}{1/n} \Big) + \underbrace{\frac ab \cdot \Big( -\frac12\, \frac{V_x}{1/n} \Big)}_{=\text{const}(\mu)} \\
&= \text{const}(\mu) + \frac ab \cdot \Big( -\frac12\, \frac{(\mu-\bar x)^2}{1/n} \Big)
\end{align*}
Example: Gaussian
Assembling all pieces:

\begin{align*}
q_\mu^*(\mu) &\propto d_\mu(\mu) \cdot \exp\Big( \frac ab \cdot \Big( -\frac12\, \frac{(\mu-\bar x)^2}{1/n} \Big) \Big) \\
&\propto \exp\Big( -\frac12 \Big( \frac{(\mu-m_0)^2}{r_0} + \frac{(\mu-\bar x)^2}{\frac{b}{an}} \Big) \Big) \\
&= \exp\Big( -\frac12\, \frac{\frac{b}{an}(\mu^2 - 2\mu m_0 + m_0^2) + r_0(\mu^2 - 2\mu\bar x + \bar x^2)}{r_0 \frac{b}{an}} \Big) \\
&= \exp\Big( -\frac12\, \frac{\big(r_0 + \frac{b}{an}\big)\mu^2 - 2\mu\big(\frac{b m_0}{an} + r_0 \bar x\big) + \frac{b m_0^2}{an} + r_0 \bar x^2}{r_0 \frac{b}{an}} \Big) \\
&= \exp\Bigg( -\frac12\, \frac{\Big(\mu - \frac{\frac{b m_0}{an} + r_0 \bar x}{r_0 + \frac{b}{an}}\Big)^2 - \Big(\frac{\frac{b m_0}{an} + r_0 \bar x}{r_0 + \frac{b}{an}}\Big)^2 + \frac{\frac{b m_0^2}{an} + r_0 \bar x^2}{r_0 + \frac{b}{an}}}{\frac{r_0 \frac{b}{an}}{r_0 + \frac{b}{an}}} \Bigg)
\end{align*}
Example: Gaussian
\begin{align*}
q_\mu^*(\mu) &\propto \exp\Bigg( -\frac12\, \frac{\Big(\mu - \frac{\frac{b m_0}{an} + r_0 \bar x}{r_0 + \frac{b}{an}}\Big)^2 - \Big(\frac{\frac{b m_0}{an} + r_0 \bar x}{r_0 + \frac{b}{an}}\Big)^2 + \frac{\frac{b m_0^2}{an} + r_0 \bar x^2}{r_0 + \frac{b}{an}}}{\frac{r_0 \frac{b}{an}}{r_0 + \frac{b}{an}}} \Bigg) \\
&\propto \exp\Bigg( -\frac12\, \frac{\Big(\mu - \frac{b m_0 + r_0 a n \bar x}{r_0 a n + b}\Big)^2}{\frac{r_0 b}{r_0 a n + b}} \Bigg)
\end{align*}

Comparing q_\mu^*(\mu) with the parameterized form q_\mu(\mu|m,r) = \frac{1}{\sqrt{2\pi r}}\, e^{-\frac12 \frac{(\mu-m)^2}{r}} yields

\[ m \leftarrow \frac{b m_0 + r_0 a n \bar x}{r_0 a n + b} \qquad r \leftarrow \frac{r_0 b}{r_0 a n + b} \]
Example: Gaussian
With a similar calculation, we obtain

\[ q_s^*(s) \propto s^{-(a_0 + \frac n2) - 1}\, e^{-\frac{b_0 + \frac n2 \left(V_x + (m - \bar x)^2 + r\right)}{s}} \]

Comparing q_s^*(s) with the parameterized form q_s(s|a,b) = \frac{b^a}{\Gamma(a)}\, s^{-a-1} e^{-\frac bs} yields

\[ a \leftarrow a_0 + \frac n2 \qquad b \leftarrow b_0 + \frac n2 \left(V_x + (m - \bar x)^2 + r\right) \]
Example: Gaussian
Algorithm:
1. start with arbitrary values of m, r, a, b
2. repeat
3.   set m ← (b m_0 + r_0 a n x̄)/(r_0 a n + b)
4.   set r ← r_0 b/(r_0 a n + b)
5.   set a ← a_0 + n/2
6.   set b ← b_0 + (n/2)(V_x + (m − x̄)² + r)
7. until convergence of (m, r, a, b)
8. return (m, r, a, b)
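The algorithm above can be transcribed almost literally into code. The sketch below (Python with NumPy; the function name, starting values, and the fixed iteration count in place of a convergence test are our choices) implements the four update rules:

```python
import numpy as np

def vb_gaussian(x, m0, r0, a0, b0, iters=50):
    """Variational Bayes for X_i ~ N(mu, s) with priors
    mu ~ N(m0, r0) and s ~ InvGamma(a0, b0); returns (m, r, a, b)."""
    n = len(x)
    xbar = np.mean(x)                   # sample mean
    Vx = np.mean(x ** 2) - xbar ** 2    # sample variance (biased form)
    # step 1: arbitrary starting values
    m, r, a, b = m0, r0, a0 + n / 2, b0 + n / 2 * Vx
    for _ in range(iters):  # fixed iteration count instead of a convergence test
        m = (b * m0 + r0 * a * n * xbar) / (r0 * a * n + b)
        r = r0 * b / (r0 * a * n + b)
        a = a0 + n / 2
        b = b0 + n / 2 * (Vx + (m - xbar) ** 2 + r)
    return m, r, a, b
```

With priors close to non-informativity, m approaches the sample mean, and the posterior mean of s under q_s, namely b/(a − 1), approaches the sample variance.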
Example: Gaussian
Experiment:
◮ generate a sample (n = 20) from N(10, 9)
◮ use priors for µ and s close to non-informativity
◮ apply 10 iterations of variational Bayes

[Figure — blue: original sample distribution; black: sample points; green: ML estimator; red: MAP estimator after variational Bayes]
Example: Gaussian
Comparison: full posterior vs. variational posterior

[Figure: contour plots over µ (horizontal, 9 to 15) and √s (vertical, 2 to 7) — left: full posterior; right: variational approximation]
Example: Gaussian
◮ bad experience: the calculations are lengthy and error-prone
◮ good news: the equations can be solved in closed form whenever the distributions involved are from the exponential family, e.g. Gaussian, Gamma, Inverse Gamma, Wishart, Inverse Wishart, Dirichlet, Beta, Categorical
◮ semi-good news: there is a message passing algorithm (variational message passing) that implements variational inference for exponential family distributions. But it is hard to understand, and sometimes it is easier to do all the calculations manually (Winn, 2005)
Mixture distributions
How can we model distributions that are different from and more complex than the standard distribution families?

◮ search the literature for other distributions
◮ combine distributions
  • combine distributions of different kinds (unusual)
  • combine distributions of the same kind
  → mixture distributions, e.g. mixture of Gaussians, mixture of Dirichlets, ...
Mixture distributions
How do we combine distributions into a mixture?

Requirements for a density function:
◮ f(x) ≥ 0 for all x ∈ ℝ
◮ ∫_{−∞}^{∞} f(x) dx = 1

Observation: if f and g are pdfs and 0 < w < 1, then x ↦ w·f(x) + (1−w)·g(x) is also a pdf.

More generally, if f_1, …, f_k are pdfs and w_1, …, w_k are nonnegative numbers with ∑_{j=1}^k w_j = 1, then x ↦ ∑_{j=1}^k w_j f_j(x) is a pdf. Such a distribution is called a mixture distribution.

◮ f_j is the j-th component of the mixture
◮ w_j serves as mixing weight and models the amount of contribution of f_j to the mixture
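Both requirements are easy to verify for a concrete mixture. A minimal sketch (the helper names are our own):

```python
import numpy as np

def gauss_pdf(mu, var):
    """Return a univariate Gaussian density function."""
    return lambda x: np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def mixture_pdf(x, weights, components):
    """Evaluate sum_j w_j f_j(x); weights must be nonnegative and sum to 1."""
    return sum(w * f(x) for w, f in zip(weights, components))

# a two-component Gaussian mixture with weights 0.3 and 0.7
xs = np.linspace(-30.0, 30.0, 60001)
vals = mixture_pdf(xs, [0.3, 0.7], [gauss_pdf(-2.0, 1.0), gauss_pdf(3.0, 4.0)])
```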
Mixture distributions
Interpretation of mixture distributions as structured distributions:
◮ each component models one class (category), which is described by f_j
◮ class j contributes with a ratio of w_j to the whole
◮ each sample element x_i of a mixture belongs to one component, but we do not know to which one

Introduce latent variables z_i that model to which component x_i belongs.

[Graphical model: plate over i = 1, …, n with nodes Z_i and X_i, plate over j = 1, …, k with node f_j, and node w⃗]

\begin{align*}
Z_i &\sim \mathcal C(\vec w) \\
X_i \mid Z_i &\sim f_{Z_i}
\end{align*}
Example: Gaussian mixture
[Graphical model: plate over i = 1, …, n with nodes Z_i and X_i, plate over j = 1, …, k with nodes µ_j and s_j, and node w⃗]

\begin{align*}
Z_i &\sim \mathcal C(\vec w) \\
X_i \mid Z_i &\sim \mathcal N(\mu_{Z_i}, s_{Z_i})
\end{align*}

How can we apply variational Bayesian inference to Gaussian mixtures?
Variational Bayes for Gaussian mixture
[Graphical model: hyperparameters m_0, r_0, a_0, b_0, β⃗; plate over j = 1, …, k with nodes µ_j and s_j; node w⃗; plate over i = 1, …, n with nodes Z_i and X_i]

\begin{align*}
\mu_j &\sim \mathcal N(m_0, r_0) \\
s_j &\sim \Gamma^{-1}(a_0, b_0) \\
\vec w &\sim \mathcal D(\vec\beta) \\
Z_i \mid \vec w &\sim \mathcal C(\vec w) \\
X_i \mid Z_i, \mu_{Z_i}, s_{Z_i} &\sim \mathcal N(\mu_{Z_i}, s_{Z_i})
\end{align*}

\begin{align*}
&p(\mu_1,\dots,\mu_k, s_1,\dots,s_k, w_1,\dots,w_k, x_1,\dots,x_n, z_1,\dots,z_n) \\
&= \prod_{j=1}^k \Big( \frac{1}{\sqrt{2\pi r_0}}\, e^{-\frac12 \frac{(\mu_j-m_0)^2}{r_0}} \Big) \cdot \prod_{j=1}^k \Big( \frac{b_0^{a_0}}{\Gamma(a_0)}\, s_j^{-a_0-1} e^{-\frac{b_0}{s_j}} \Big) \cdot \frac{\Gamma\big(\sum_{j=1}^k \beta_j\big)}{\prod_{j=1}^k \Gamma(\beta_j)} \prod_{j=1}^k w_j^{\beta_j-1} \cdot \prod_{i=1}^n w_{z_i} \cdot \prod_{i=1}^n \Big( \frac{1}{\sqrt{2\pi s_{z_i}}}\, e^{-\frac12 \frac{(x_i-\mu_{z_i})^2}{s_{z_i}}} \Big)
\end{align*}
Variational Bayes for Gaussian mixture
Modeling the full posterior by a variational approximation:

\begin{align*}
&q(\mu_1,\dots,\mu_k, s_1,\dots,s_k, w_1,\dots,w_k, z_1,\dots,z_n \mid m_1,\dots,m_k, r_1,\dots,r_k, a_1,\dots,a_k, b_1,\dots,b_k, \alpha_1,\dots,\alpha_k, h_{1,1},\dots,h_{n,k}) \\
&= \prod_{j=1}^k q_\mu(\mu_j|m_j,r_j) \cdot \prod_{j=1}^k q_s(s_j|a_j,b_j) \cdot q_{\vec w}(\vec w|\vec\alpha) \cdot \prod_{i=1}^n q_z(z_i|h_{i,1},\dots,h_{i,k})
\end{align*}

with

\begin{align*}
q_\mu(\mu_j|m_j,r_j) &= \frac{1}{\sqrt{2\pi r_j}}\, e^{-\frac12 \frac{(\mu_j-m_j)^2}{r_j}} \\
q_s(s_j|a_j,b_j) &= \frac{b_j^{a_j}}{\Gamma(a_j)}\, s_j^{-a_j-1} e^{-\frac{b_j}{s_j}} \\
q_{\vec w}(\vec w|\vec\alpha) &= \frac{\Gamma\big(\sum_{j=1}^k \alpha_j\big)}{\prod_{j=1}^k \Gamma(\alpha_j)} \prod_{j=1}^k w_j^{\alpha_j-1} \\
q_z(z_i|h_{i,1},\dots,h_{i,k}) &= h_{i,z_i}
\end{align*}
Variational Bayes for Gaussian mixture
After some (more or less complicated) calculations, we obtain the update rules

\begin{align*}
m_j &\leftarrow \frac{b_j m_0 + r_0 a_j n_j \bar x_j}{b_j + r_0 a_j n_j} \\
r_j &\leftarrow \frac{r_0 b_j}{b_j + r_0 a_j n_j} \\
a_j &\leftarrow a_0 + \frac{n_j}{2} \\
b_j &\leftarrow b_0 + \frac{n_j}{2}\big( V_{x,j} + (m_j - \bar x_j)^2 + r_j \big) \\
\alpha_j &\leftarrow \beta_j + n_j \\
h_{i,j} &\leftarrow c \cdot e^{\psi(\alpha_j) + \frac12 \psi(a_j) - \frac12 \log(b_j) - \frac12 \frac{a_j}{b_j}\left((x_i - m_j)^2 + r_j\right)} \qquad \text{where } c \text{ normalizes } \sum_{j=1}^k h_{i,j} \text{ to } 1
\end{align*}

with

\[ n_j = \sum_{i=1}^n h_{i,j} \qquad \bar x_j = \frac{1}{n_j} \sum_{i=1}^n h_{i,j}\, x_i \qquad V_{x,j} = \frac{1}{n_j} \sum_{i=1}^n h_{i,j}\, x_i^2 - \bar x_j^2 \]

ψ denotes the digamma function ψ(x) = Γ'(x)/Γ(x).
Variational Bayes for Gaussian mixture
Example → Matlab demo

[Figure: iteration = 300, k = 30, n = 1000] The plot shows the MAP estimate after variational inference. A sample of size 1000 is taken from a uniform distribution. Priors were set close to non-informativity.
Summary
◮ Kullback-Leibler-divergence
◮ principle of variational Bayes and theoretical derivation
◮ Example: variational Bayes for a Gaussian
◮ mixture distributions
◮ Example: variational Bayes for Gaussian mixtures