LEARNING AND INFERENCE IN GRAPHICAL MODELS
Chapter 07: Monte Carlo Methods
Dr. Martin Lauer
Machine Learning Lab, University of Freiburg
Institute of Measurement and Control Systems, Karlsruhe Institute of Technology
Learning and Inference in Graphical Models. Chapter 07 – p. 1/44
References for this chapter
◮ Christopher M. Bishop, Pattern Recognition and Machine Learning, ch. 11, Springer, 2006
◮ Christophe Andrieu, Nando de Freitas, Arnaud Doucet, and Michael I. Jordan, An Introduction to MCMC for Machine Learning, In: Machine Learning, vol. 50, no. 1–2, pp. 5–43, 2003
◮ Christian P. Robert and George Casella, Monte Carlo Statistical Methods, Springer, 1999
◮ Radford M. Neal, Slice Sampling, In: Annals of Statistics, vol. 31, no. 3, pp. 705–767, 2003
◮ Darrall Henderson, Sheldon H. Jacobson, and Alan W. Johnson, The Theory and Practice of Simulated Annealing, In: Fred Glover and Gary A. Kochenberger (eds.), Handbook of Metaheuristics, Springer, 2003
◮ Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller, Equation of State Calculations by Fast Computing Machines, In: The Journal of Chemical Physics, vol. 21, pp. 1087–1092, 1953
◮ W. Keith Hastings, Monte Carlo Sampling Methods Using Markov Chains and Their Applications, In: Biometrika, vol. 57, pp. 97–109, 1970
◮ Stuart Geman and Donald Geman, Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images, In: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6, pp. 721–741, 1984
◮ Donald E. Knuth, The Art of Computer Programming, Volume 2: Seminumerical Algorithms, Addison-Wesley, 1997
◮ William Feller, An Introduction to Probability Theory and Its Applications, vol. 1, Wiley, 1968
Monte Carlo inference
◮ many tasks in probability theory deal with terms of the form ∫_R f(x) p(x) dx
• e.g. expectation value: ∫ x p(x) dx
• e.g. variance: ∫ x² p(x) dx − (∫ x p(x) dx)²
• e.g. expected risk: ∫ risk(x) p(x) dx
• e.g. expected gain: ∫ gain(x) p(x) dx
◮ but:
• integral often not tractable analytically
• p(·) often not known explicitly
◮ hence: replace the analytical calculation by a numerical approach → Monte Carlo approach
◮ basic idea: calculate with random samples instead of pdfs

  p(·)  −sample→  {x1, …, xN} ∼ p
  ∫_R f(x) p(x) dx  ←approximates−  (1/N) ∑_{i=1}^N f(xi)

◮ as long as N is large enough, (1/N) ∑_{i=1}^N f(xi) is a good approximation for ∫_R f(x) p(x) dx
◮ but: you need a random number generator for p
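The approximation above can be sketched in a few lines. Here the target p = N(0, 1) and f(x) = x² are assumed for illustration, so the true value of ∫ f(x) p(x) dx is the variance 1:

```python
# Monte Carlo approximation of E[f(X)] = ∫ f(x) p(x) dx, a minimal sketch.
# Assumed example: p = N(0, 1), f(x) = x^2, true value = 1.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
sample = rng.standard_normal(N)    # {x_1, ..., x_N} ~ p
estimate = np.mean(sample**2)      # (1/N) * sum_i f(x_i)
```

The estimate converges at rate O(1/√N) regardless of the dimension of x, which is the main appeal of the approach.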
Random number generators
◮ random number generators for the uniform distribution U(0, 1): many different algorithms exist, cf. the book of Knuth
◮ quantile trick:
• assume F(x) = ∫_{−∞}^x p(t) dt is known ("cumulative distribution function", cdf)
• if u is a random sample element from U(0, 1), then F⁻¹(u) is a random sample element of p
• F⁻¹ is called the "quantile function"
◮ distribution-specific transformation tricks:
• e.g. sampling from a Gaussian: assume u1, u2 are independent random samples from U(0, 1). Then v1 = √(−2 log u1) · sin(2πu2) and v2 = √(−2 log u1) · cos(2πu2) are independent random variables from N(0, 1)
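Both tricks are easy to sketch. The quantile trick is shown for the exponential distribution Exp(1) (an assumed example, where F⁻¹(u) = −log(1 − u)); the Gaussian transformation is exactly the one stated above:

```python
import numpy as np

rng = np.random.default_rng(0)

# quantile trick, assumed example Exp(1): F(x) = 1 - e^{-x}, F^{-1}(u) = -log(1-u)
u = rng.uniform(size=100_000)
exp_sample = -np.log(1.0 - u)          # sample from Exp(1)

# Gaussian transformation trick: two N(0,1) variables from two U(0,1) variables
u1 = rng.uniform(size=50_000)
u2 = rng.uniform(size=50_000)
v1 = np.sqrt(-2.0 * np.log(u1)) * np.sin(2.0 * np.pi * u2)
v2 = np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)
```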
Random number generators
◮ what can we do if we do not find any trick? → accept-reject sampling
◮ assume:
• we want to sample from a distribution with pdf p
• we own a random number generator for a distribution with pdf q
• we know a constant M such that M · q(x) ≥ p(x) for all x
• how can we use the random number generator for q to sample from p?

(figure: densities p and q with sample points x1, x2 and values p(x1), q(x1), p(x2), q(x2))
Accept-reject sampling
(figure: densities p and M · q with a sample point x and the values p(x), M · q(x))

◮ a sample from q yields x
◮ accept x with probability p(x) / (M · q(x))
◮ otherwise, reject x

the set of all accepted sample elements yields a sample from p since

  q(x) · p(x) / (M · q(x)) = (1/M) · p(x) ∝ p(x)

on average the algorithm accepts only a fraction 1/M of all sample elements
→ choose an appropriate q so that M remains small
Extension: accept-reject sampling works even if p is only known up to a constant factor
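A minimal sketch of the scheme, with an assumed target p = Beta(2, 2) (i.e. p(x) = 6x(1−x) on [0, 1], maximal value 1.5) and a uniform proposal q = U(0, 1), so M = 1.5 suffices:

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):                         # assumed target density: Beta(2,2), max 1.5 at x = 0.5
    return 6.0 * x * (1.0 - x)

M = 1.5                           # M * q(x) >= p(x) with q = U(0,1), i.e. q(x) = 1
accepted = []
n_total = 0
while len(accepted) < 10_000:
    x = rng.uniform()             # sample from q
    n_total += 1
    if rng.uniform() < p(x) / (M * 1.0):   # accept with probability p(x) / (M q(x))
        accepted.append(x)
accepted = np.array(accepted)
```

As stated above, the acceptance ratio comes out near 1/M = 2/3 here; a poorly chosen q with large M wastes most proposals.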
Example: robot localization
◮ robot is located within an area of size 1m × 1m
◮ it has sensors to measure the distance to the corners of the field

(figure: robot at position ~x with distances d1, …, d4 to the corners ~e1, …, ~e4)

◮ Bayesian network: σ and ~x are parents of the observed distances d1, d2, d3, d4
◮ distributions:

  ~x ∼ U([0, 1] × [0, 1])
  di | ~x ∼ N(‖~x − ~ei‖, σ²)
  ~x | d1, d2, d3, d4 ∼ ?
Example: robot localization
◮ calculating the posterior

  p(~x | d1, d2, d3, d4) ∝ p(d1|~x) p(d2|~x) p(d3|~x) p(d4|~x) · p(~x)
                         ∝ exp( −(1/(2σ²)) ∑_{i=1}^4 (‖~x − ~ei‖ − di)² ) · I[0 ≤ x1, x2 ≤ 1]
                         ≤ 1

◮ apply accept-reject sampling with M = 1, q(~x) = I[0 ≤ x1, x2 ≤ 1]
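A sketch of this sampler; σ and the measured distances d_i are assumed values for the demo (noise-free measurements from a true position), not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
corners = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)   # ~e_1 .. ~e_4
true_x = np.array([0.5, 0.5])                                       # assumed true position
sigma = 0.05                                                        # assumed sensor noise
d = np.linalg.norm(true_x - corners, axis=1)                        # noise-free measurements

def p_tilde(x):
    """Unnormalized posterior, bounded above by 1, so M = 1 with q = U on the unit square."""
    r = np.linalg.norm(x - corners, axis=1) - d
    return np.exp(-np.sum(r**2) / (2.0 * sigma**2))

accepted = []
while len(accepted) < 100:
    x = rng.uniform(size=2)              # proposal from q
    if rng.uniform() < p_tilde(x):       # accept with probability p(x) / (M q(x))
        accepted.append(x)
accepted = np.array(accepted)
```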
Example: robot localization
◮ results

(plots: 100 accepted sample points in the unit square for the true positions
 ~x = (0.5, 0.5) with 3627 rejected, ~x = (0.3, 0.2) with 3087 rejected,
 and ~x = (0, 0) with 5792 rejected candidates)
◮ question: how can we create more efficient sampling schemes?
Side topic: Markov chains and their properties
Markov chains
Definitions:
◮ A Markov model or Markov chain is a Bayesian network which is organized as a chain of random variables Xi where Xi+1 solely depends on Xi.

  X1 → X2 → · · · → Xi → · · · → Xn

◮ The set of values that Xi can take is called the state set S.
◮ The transition between subsequent states is given by a transition kernel T(Xi+1 | Xi)
• T(Xi+1 | Xi) is a conditional probability if the state set is discrete
• T(Xi+1 | Xi) is a conditional density if the state set is continuous
For the moment, we focus on Markov chains with a finite, discrete state set.
◮ A Markov chain is homogeneous if the transition kernel T is invariant w.r.t. time
Markov chains
◮ A transition diagram for a homogeneous Markov chain is a directed graph with one vertex for each state and an edge between vertex u and v if T(v|u) > 0

(transition diagrams of three example chains A, B, and C over the states s1, …, s4)
Markov chains
◮ A homogeneous Markov chain is irreducible if for all states u, v there exists a sequence of states s1, …, sn with u = s1 and sn = v such that T(si+1|si) > 0.
◮ The period of a state s is given by

  gcd{ N ∈ ℕ | there exist s1, …, sN−1 with T(s1|s) > 0, T(s|sN−1) > 0,
       and T(si+1|si) > 0 for all i ∈ {1, …, N−2} }

◮ A homogeneous Markov chain is aperiodic if the period of all states is 1.

(transition diagrams of the example chains A, B, and C)
Ergodic Markov chains
◮ Given a homogeneous Markov chain with discrete, finite state set S we can arrange all transition probabilities in a transition matrix M with Mi,j = T(sj|si). Hence, each row of M is the probability vector of a categorical distribution over S.
◮ Given a transition matrix M and a categorical distribution over the state set with probability vector (row vector) ~w, we obtain the distribution of successor states as ~w · M.
◮ A categorical distribution with probability vector ~w is a stationary distribution of a Markov chain with transition matrix M if ~w · M = ~w.
◮ A homogeneous Markov chain with discrete, finite state set S is ergodic if lim_{k→∞} M^k exists, all rows in lim_{k→∞} M^k are identical, and lim_{k→∞} M^k does not contain zeros. Then each row of lim_{k→∞} M^k is the probability vector of a stationary categorical distribution over S.
Ergodic Markov chains
(transition diagrams of the example chains A, B, and C)

  MA =
    3/10  7/10  0     0
    0     0     9/10  1/10
    1/2   0     0     1/2
    0     0     1     0

  MB =
    0     1     0     0
    0     0     0     1
    1/2   0     0     1/2
    0     0     1     0

  MC =
    1     0     0     0
    0     0     1/3   2/3
    1/2   0     0     1/2
    0     0     1     0
Which of these Markov chains are ergodic?
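The definition can be checked numerically by raising the transition matrix to a high power. The sketch below tests MA; the same check suggests that only chain A is ergodic (B is periodic, C has an absorbing state, so their limits fail the "identical rows without zeros" condition):

```python
import numpy as np

# transition matrix M_A from the slides
MA = np.array([[0.3, 0.7, 0.0, 0.0],
               [0.0, 0.0, 0.9, 0.1],
               [0.5, 0.0, 0.0, 0.5],
               [0.0, 0.0, 1.0, 0.0]])

# for an ergodic chain, all rows of M^k converge to the stationary probability vector
Mk = np.linalg.matrix_power(MA, 1000)
w = Mk[0]                      # any row approximates the stationary distribution
```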
Ergodic Markov chains
◮ Theorem: if a homogeneous Markov chain with discrete, finite state set is irreducible and aperiodic, it is also ergodic.
  Proof: see literature, e.g. Feller 1968
◮ What happens if we sample a very long sequence from an ergodic Markov chain?
• the first part of the sample will depend on the initial state (burn-in phase)
• after burn-in, the sample is drawn from the stationary distribution of the Markov chain
• the sample elements are dependent on each other
Ergodic Markov chains
Good and bad mixing behavior of a Markov chain
  MA = ( 1/2  1/2 )        MB = ( 99/100   1/100 )
       ( 1/2  1/2 )             (  1/100  99/100 )

(transition diagram over the two states s1, s2)

Both Markov chains share the same stationary distribution; however, the mixing is very different. E.g. a random sample from chain A:

  s1, s1, s2, s1, s2, s2, s2, s1, s1, s2, s1, s1, s2, s2, s1, s2, …

and a sample from chain B:

  s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s2, s2, s2, … (a long run of s2) …, s1, s1, s1, s1, …
Markov chains with continuous state space
If S ⊆ ℝ^d we have to replace the transition matrix by a transition kernel T(Xt+1 | Xt), i.e. a conditional probability density.

E.g. T(v|u) = (1/√(2π)) · e^(−(1/2)(u + 1 − v)²)

Now, a stationary distribution is a pdf p(·) with ∫ T(v|u) · p(u) du = p(v)

Most results (especially those about ergodic chains) can be transferred from discrete state spaces to continuous state spaces. For details, cf. the book of Christian P. Robert & George Casella, Monte Carlo Statistical Methods, Springer, 1999.
Designing Markov chains
We want to create a Markov chain with a specific stationary distribution p(·). How can we design the transition kernel?

Theorem: If a transition kernel T meets the detailed balance equation

  T(v|u) · p(u) = T(u|v) · p(v)

for all states u, v ∈ S, then p is a stationary distribution of T. In this case T is called reversible.

Proof:
  ∫ T(v|u) p(u) du = ∫ T(u|v) p(v) du = (∫ T(u|v) du) · p(v) = p(v)
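The theorem can be checked numerically on a small discrete chain. The three-state target p below is an assumed example; the kernel is built Metropolis-style so that detailed balance holds by construction, and stationarity p · T = p then follows:

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])      # assumed target distribution over three states
n = len(p)
q = 1.0 / n                        # uniform proposal over all states

T = np.zeros((n, n))
for u in range(n):
    for v in range(n):
        if u != v:
            # accept a proposed move u -> v with probability min(1, p(v)/p(u))
            T[u, v] = q * min(1.0, p[v] / p[u])
    T[u, u] = 1.0 - T[u].sum()     # rejected moves stay at u
```

Off the diagonal, T(v|u)·p(u) = q·min(p(u), p(v)) is symmetric in u and v, which is exactly the detailed balance equation.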
Designing Markov chains
Theorem: If T1 and T2 are transition kernels with stationary distribution p, then T = T2 ∘ T1 is a transition kernel with stationary distribution p. T is defined as

  T(v|u) = ∫ T2(v|w) · T1(w|u) dw

(diagram: alternating transitions X1 −T1→ X′1 −T2→ X2 −T1→ X′2 −T2→ X3 → · · ·)

Proof:
  ∫ T(v|u) p(u) du = ∫∫ T2(v|w) T1(w|u) dw p(u) du
                   = ∫ T2(v|w) (∫ T1(w|u) p(u) du) dw
                   = ∫ T2(v|w) p(w) dw = p(v)
Designing Markov chains
Theorem: If T1 and T2 are transition kernels with stationary distribution p and 0 < q < 1, then T = q · T1 + (1 − q) · T2 is a transition kernel with stationary distribution p. T is defined as T(v|u) = q · T1(v|u) + (1 − q) · T2(v|u)

(diagram: at each step the chain follows T1 with probability q and T2 with probability 1 − q)

Proof:
  ∫ T(v|u) p(u) du = ∫ (q · T1(v|u) + (1 − q) · T2(v|u)) p(u) du
                   = q · ∫ T1(v|u) p(u) du + (1 − q) · ∫ T2(v|u) p(u) du = p(v)
Markov Chain Monte Carlo Sampling
Markov chain Monte Carlo
◮ task
• we want to sample from a distribution p
• standard sampling tricks are not applicable
◮ basic idea:
• design a Markov chain with stationary distribution p
• sample from the Markov chain; reject initial sample elements
• obtain a dependent sample from the target distribution p

(figure: chain trajectory starting at x0; during burn-in the distribution depends on the initial state, afterwards it is almost the stationary target distribution)

◮ approach is known as Markov chain Monte Carlo sampling (MCMC)
• Metropolis-Hastings algorithm (Metropolis, 1953), (Hastings, 1970)
• Gibbs sampling (Geman and Geman, 1984)
• Slice sampling (Neal, 2003)
Metropolis-Hastings algorithm
◮ basic idea:
• sample candidates for the successor state using a distribution q
• apply the detailed balance equation to calculate an acceptance probability
◮ principle:

(figure: target density p with current state xt, proposal densities q(·|xt) and q(·|zt))

  transition:
  • sample zt ∼ q(·|xt)
  • set xt+1 = zt with probability min{1, (p(zt) · q(xt|zt)) / (p(xt) · q(zt|xt))}
  • otherwise set xt+1 = xt

◮ the acceptance probability simplifies to min{1, p(z)/p(x)} if q is symmetric (Metropolis algorithm)
Metropolis-Hastings algorithm
The transition kernel of the Metropolis-Hastings algorithm is

  T(v|u) = q(v|u) · A(v|u) + δ(v − u) · ∫ q(w|u) · (1 − A(w|u)) dw

with A(v|u) = min{1, (p(v) · q(u|v)) / (p(u) · q(v|u))}

Lemma: The Metropolis-Hastings transition kernel meets the detailed balance equation.

Proof: → blackboard

Remark: Metropolis-Hastings also works if the target probability is only known up to a normalization constant
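A minimal sketch of the algorithm with a symmetric Gaussian random-walk proposal (so the Metropolis acceptance rule applies); the unnormalized target p̃(x) = e^(−x²/2), i.e. N(0, 1), is an assumed example:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):                   # assumed unnormalized target: N(0, 1)
    return np.exp(-0.5 * x * x)

x = 0.0
chain = []
for step in range(60_000):
    z = x + rng.normal(scale=1.0)                        # symmetric proposal q(.|x)
    if rng.uniform() < min(1.0, p_tilde(z) / p_tilde(x)):
        x = z                                            # accept the candidate
    chain.append(x)                                      # on rejection, x is repeated

chain = np.array(chain[5_000:])                          # discard the burn-in phase
```

Note that only the ratio p̃(z)/p̃(x) enters, which is why the normalization constant of p is never needed.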
Example: robot localization revisited
◮ robot localization example solved with the Metropolis-Hastings algorithm
  sample distribution q: ~z ∼ ~x + U([−0.1, 0.1] × [−0.1, 0.1])
◮ created samples (each 200 elements):

(plots: sampled points in the unit square for the true positions ~x = (0.5, 0.5) with 93 candidates rejected, ~x = (0.3, 0.2) with 82 candidates rejected, and ~x = (0, 0) with 106 candidates rejected)
Gibbs sampling
◮ sampling from a bivariate distribution ~x = (x1, x2)
◮ Metropolis-Hastings with q(z1, z2 | x1, x2) = I[x1=z1] · p(z2|x1)

  (p(z1, z2) · q(x1, x2 | z1, z2)) / (p(x1, x2) · q(z1, z2 | x1, x2))
    = (p(z1, z2) · I[x1=z1] · p(x2|z1)) / (p(x1, x2) · I[x1=z1] · p(z2|x1))
    = (p(x1, z2) · p(x2|x1)) / (p(x1, x2) · p(z2|x1))
    = (p(z2|x1) · p(x1) · p(x2|x1)) / (p(x2|x1) · p(x1) · p(z2|x1)) = 1

  i.e. x1 is clamped while x2 ∼ p(x2|x1) is sampled; since the acceptance probability is 1, every candidate is accepted
◮ analogously: clamp x2 and sample x1 ∼ p(x1|x2)
◮ Gibbs sampling: concatenate both steps
Example: uniform distribution over a parabola
◮ sample from a uniform distribution over a frustum of a parabola
◮ p(x1, x2) ∝ I[−2 ≤ x1 ≤ 2] · I[x1² ≤ x2 ≤ 4]

  x2 | x1 ∼ U(x1², 4)
  x1 | x2 ∼ U(−√x2, √x2)

(figure: the region between the parabola x2 = x1² and the line x2 = 4, with a Gibbs trajectory xt → x̃(t) → xt+1 → x̃(t+1) → xt+2 of alternating horizontal and vertical moves)
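The two conditionals give a complete Gibbs sampler for this region; a minimal sketch (iteration counts and seed are assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Gibbs sampling for the uniform distribution on {(x1,x2): -2 <= x1 <= 2, x1^2 <= x2 <= 4}
x1, x2 = 0.0, 1.0
samples = []
for step in range(50_000):
    x2 = rng.uniform(x1 * x1, 4.0)                  # x2 | x1 ~ U(x1^2, 4)
    x1 = rng.uniform(-np.sqrt(x2), np.sqrt(x2))     # x1 | x2 ~ U(-sqrt(x2), sqrt(x2))
    samples.append((x1, x2))
samples = np.array(samples[1_000:])                 # discard burn-in
```

For this region the exact marginal means are E[x1] = 0 (by symmetry) and E[x2] = 2.4, which the sample averages reproduce.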
Gibbs sampling
generalization to multivariate distributions p(x1, …, xd):

◮ sample x1^(t+1) ∼ p(x1 | x2^(t), …, xd^(t))
◮ sample x2^(t+1) ∼ p(x2 | x1^(t+1), x3^(t), …, xd^(t))
  ⋮
◮ sample xd^(t+1) ∼ p(xd | x1^(t+1), …, xd−1^(t+1))
Example: bearing-only tracking
◮ observing a moving object from a fixed position
◮ object moves with constant velocity
◮ for every point in time, the observer senses the angle of observation, but only sometimes the distance to the object
◮ distributions:

  ~x0 ∼ N(~a, R)
  ~v ∼ N(~b, S)
  ~yi | ~x0, ~v ∼ N(~x0 + ti~v, σ²I)
  ri = ‖~yi‖
  ~wi = ~yi / ‖~yi‖

(figure: object movement with unknown distance and observed angle of observation; plate diagram with unknown ~x0 and ~v, observed ti and ~wi, sometimes observed ri, noise level σ, and latent ~yi, repeated n times)
Example: bearing-only tracking
◮ conditional distributions for ~x0, ~v:

  ~x0 | ~v, (~yi), (ti) ∼ N( ((n/σ²)I + R⁻¹)⁻¹ · ((1/σ²) ∑(~yi − ti~v) + R⁻¹~a), ((n/σ²)I + R⁻¹)⁻¹ )

  ~v | ~x0, (~yi), (ti) ∼ N( ((1/σ²) ∑ti² I + S⁻¹)⁻¹ · ((1/σ²) ∑ti(~yi − ~x0) + S⁻¹~b), ((1/σ²) ∑ti² I + S⁻¹)⁻¹ )

◮ conditional distribution for ri:

  p(~yi | ~x0, ~v, ti) ∝ exp{ −(1/(2σ²)) ‖~x0 + ti~v − ~yi‖² }
  p(~yi | ~x0, ~v, ti, ~wi) ∝ exp{ −(1/(2σ²)) ‖~x0 + ti~v − ~yi‖² } · I[~yi ∥ ~wi]
  p(ri | ~x0, ~v, ti, ~wi) ∝ exp{ −(1/(2σ²)) ‖~x0 + ti~v − ri~wi‖² }
                           = exp{ −(1/(2σ²)) (‖~x0 + ti~v‖² − 2ri · ~wiᵀ(~x0 + ti~v) + ri²) }
                           ∝ exp{ −(1/(2σ²)) (ri − ~wiᵀ(~x0 + ti~v))² }

  ⇒ ri | ~x0, ~v, ti, ~wi ∼ N(~wiᵀ(~x0 + ti~v), σ²)
Example: bearing-only tracking
◮ Gibbs sampling with non-informative priors (R⁻¹ = S⁻¹ = 0):

  ~x0 ∼ N( (1/n) ∑(~yi − ti~v), (σ²/n) I )
  ~v ∼ N( ∑ti(~yi − ~x0) / ∑ti², (σ²/∑ti²) I )
  ri ∼ N( ~wiᵀ(~x0 + ti~v), σ² )

◮ results and Matlab demo

(plot: estimated trajectory after 1000 iterations)
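The non-informative-prior updates can be sketched directly. All ground-truth values, the set of time steps, and the choice of which distances are observed are assumptions for this demo; the unobserved distances ri are resampled from their conditional, the observed ones stay clamped:

```python
import numpy as np

rng = np.random.default_rng(0)

# assumed ground truth for the demo
x0_true = np.array([-5.0, 8.0])
v_true = np.array([1.5, -0.2])
sigma = 0.1
t = np.arange(20, dtype=float)
y_true = x0_true + t[:, None] * v_true + sigma * rng.normal(size=(20, 2))
w = y_true / np.linalg.norm(y_true, axis=1, keepdims=True)  # observed directions ~w_i
obs = np.zeros(len(t), dtype=bool)
obs[[0, -1]] = True                                         # distance observed only sometimes
r = np.where(obs, np.linalg.norm(y_true, axis=1), 1.0)      # unknown r_i initialized to 1

x0, v = np.zeros(2), np.zeros(2)
x0_draws, v_draws = [], []
for it in range(3_000):
    yi = r[:, None] * w                                     # current reconstruction of ~y_i
    # ~x0 ~ N( (1/n) sum(y_i - t_i v), (sigma^2/n) I )
    x0 = (yi - t[:, None] * v).mean(axis=0) + sigma / np.sqrt(len(t)) * rng.normal(size=2)
    # ~v ~ N( sum t_i (y_i - x0) / sum t_i^2, (sigma^2 / sum t_i^2) I )
    st2 = np.sum(t**2)
    v = np.sum(t[:, None] * (yi - x0), axis=0) / st2 + sigma / np.sqrt(st2) * rng.normal(size=2)
    # r_i ~ N( w_i^T (x0 + t_i v), sigma^2 ), only for the unobserved distances
    mu_r = np.sum(w * (x0 + t[:, None] * v), axis=1)
    r = np.where(obs, r, mu_r + sigma * rng.normal(size=len(t)))
    if it >= 1_000:
        x0_draws.append(x0)
        v_draws.append(v)

x0_mean = np.mean(x0_draws, axis=0)
v_mean = np.mean(v_draws, axis=0)
```

With purely angular observations and fully non-informative priors the overall scale would be unidentified, which is why this sketch anchors two distances.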
Gibbs sampling for Gaussian mixture
(plate diagram: hyperparameters m0, r0, a0, b0 and ~β; component parameters µj, sj for j = 1, …, k; mixture weights ~w; latent assignments Zi and observations Xi for i = 1, …, n)

  µj ∼ N(m0, r0)
  sj ∼ Γ⁻¹(a0, b0)
  ~w ∼ D(~β)
  Zi | ~w ∼ C(~w)
  Xi | Zi, µ_Zi, s_Zi ∼ N(µ_Zi, s_Zi)
Gibbs sampling for Gaussian mixture
calculate the conditionals for Gibbs sampling using the results about conjugate distributions (chapter 2)

  ~w | z1, …, zn, ~β ∼ D(β1 + n1, …, βk + nk)  with nj = |{i | zi = j}|

  µj | x1, …, xn, z1, …, zn, sj, m0, r0 ∼ N( (sj m0 + r0 ∑_{i|zi=j} xi) / (sj + nj r0), (r0 sj) / (sj + nj r0) )

  sj | x1, …, xn, z1, …, zn, µj, a0, b0 ∼ Γ⁻¹( a0 + nj/2, b0 + (1/2) ∑_{i|zi=j} (xi − µj)² )

  zi | ~w, xi, µ1, …, µk, s1, …, sk ∼ C(hi,1, …, hi,k)
  with hi,j ∝ (wj / √(2π sj)) · e^( −(1/2)(xi − µj)²/sj )
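The four conditionals translate almost line by line into a sampler. The sketch below uses k = 2, synthetic 1-D data, assumed hyperparameter values, and a thresholding initialization of the assignments; inverse-gamma draws are obtained as reciprocals of gamma draws:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data from two well-separated components (assumed demo setup)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 0.8, 200)])
n, k = len(x), 2
m0, r0, a0, b0 = 0.0, 100.0, 2.0, 1.0      # assumed, fairly vague priors
beta = np.ones(k)

z = (x > 0.5).astype(int)                   # assumed initial assignments
mu = np.zeros(k)
s = np.ones(k)                              # component variances s_j
for it in range(200):
    nj = np.array([(z == j).sum() for j in range(k)])
    w = rng.dirichlet(beta + nj)                           # ~w | z, beta
    for j in range(k):
        xj = x[z == j]
        mean = (s[j] * m0 + r0 * xj.sum()) / (s[j] + nj[j] * r0)
        var = r0 * s[j] / (s[j] + nj[j] * r0)
        mu[j] = rng.normal(mean, np.sqrt(var))             # mu_j | x, z, s_j
        b = b0 + 0.5 * np.sum((xj - mu[j]) ** 2)
        s[j] = 1.0 / rng.gamma(a0 + nj[j] / 2, 1.0 / b)    # inverse-gamma draw for s_j
    # z_i | ~w, x_i, mu, s  ~  C(h_{i,1}, ..., h_{i,k})
    h = w / np.sqrt(2 * np.pi * s) * np.exp(-0.5 * (x[:, None] - mu) ** 2 / s)
    h /= h.sum(axis=1, keepdims=True)
    z = (rng.uniform(size=(n, 1)) > np.cumsum(h, axis=1)).sum(axis=1)
```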
Gibbs sampling for Gaussian mixture
Example → Matlab demo

(plot, labeled "iteration = 116, k = 3, n = 1000": the sampled mixture after 1000 iterations of Gibbs sampling. The sample of size 1000 is taken from a uniform distribution. Priors were set close to non-informativity.)
Slice sampling
We want to sample from a distribution with density p(x).

Extend the distribution by a second variable u:

  p′(x, u) = 1 if 0 ≤ u ≤ p(x), and 0 otherwise

p′ is a pdf since it is nonnegative and ∫∫ p′(x, u) du dx = 1

(figure: density p(x) and the area 0 ≤ u ≤ p(x) on which p′(x, u) is uniform)

Apply Gibbs sampling to p′:

  u | x ∼ U(0, p(x))
  x | u ∼ U{x′ | p(x′) ≥ u}

We obtain a sample of p′. Since p is the marginal of p′, we obtain a sample of p by sampling from p′ and forgetting about the ui.
Slice sampling
Executing slice sampling on the example:

(figure: alternating horizontal and vertical Gibbs moves (x1, u1) → (x2, u1) → (x2, u2) → (x3, u2) → (x3, u3) → (x4, u3) → (x4, u4))

The crucial point in slice sampling is whether it is possible to determine the set {x′ | p(x′) ≥ u} efficiently.

Slice sampling can also be used if p is only known up to a normalization factor.
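A sketch for a case where the slice is available in closed form: with the assumed unnormalized target p̃(x) = e^(−x²/2), the set {x′ : p̃(x′) ≥ u} is the interval [−√(−2 log u), √(−2 log u)]:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):                     # assumed unnormalized target: N(0, 1)
    return np.exp(-0.5 * x * x)

x = 0.0
samples = []
for step in range(50_000):
    u = rng.uniform(0.0, p_tilde(x))        # u | x ~ U(0, p(x))
    half = np.sqrt(-2.0 * np.log(u))        # {x': p(x') >= u} = [-half, half]
    x = rng.uniform(-half, half)            # x | u ~ uniform over the slice
    samples.append(x)
samples = np.array(samples[1_000:])         # discard burn-in, keep only the x_i
```

For targets without an analytic slice, the interval has to be found numerically (e.g. by Neal's stepping-out and shrinkage procedures).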
Simulated annealing
Can we use MCMC if we want to calculate the MAP estimator of a distribution p?

Observation: consider the sequence of densities

  p(x), (1/Z2)(p(x))², (1/Z3)(p(x))³, …

What is the limit lim_{ν→∞} (1/Zν)(p(x))^ν ?

Example:

  p(x) = 2x if 0 ≤ x ≤ 1, and 0 otherwise
  (p(x))^ν = (2x)^ν if 0 ≤ x ≤ 1, and 0 otherwise
  Zν = ∫₀¹ (2x)^ν dx = 2^ν / (ν + 1)

  lim_{ν→∞} (1/Zν)(p(x))^ν = lim_{ν→∞} { (ν + 1)x^ν if 0 ≤ x ≤ 1, 0 otherwise }
                            = δ(x − 1), i.e. ∞ at x = 1 and 0 otherwise
Simulated annealing
In general, if a density p(x) has a single global maximum at x = xmax, then the sequence (1/Zν)(p(x))^ν converges pointwise to δ(x − xmax).

Hence, the larger ν, the more probably will an MCMC sampler focus on a small surrounding of xmax.

Let us build a Metropolis-Hastings sampler with symmetric proposal distribution q(z|x). The acceptance probability is min{1, (p(z)/p(x))^ν}.

To be consistent with the literature, let us define t = 1/ν. Hence, the acceptance probability is min{1, (p(z)/p(x))^(1/t)}. t is called the temperature.

Idea: while applying the Metropolis algorithm, decrease the temperature t slowly over time → simulated annealing
Simulated annealing
Goal: find the MAP of a probability distribution with density function p

1. initialize x arbitrarily
2. initialize temperature t = 1
3. repeat
4.   sample a candidate z ∼ q(·|x)
5.   calculate the acceptance probability A = min{1, (p(z)/p(x))^(1/t)}
6.   with probability A
7.     set x ← z
8.   endif
9.   decrease the temperature t slightly
10. until convergence
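The steps above can be sketched directly, using the slides' example density p(x) = 2x on [0, 1], whose MAP is x = 1; the proposal scale and the geometric cooling rate are assumed values:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):                 # density from the example: p(x) = 2x on [0, 1]
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

x = 0.1                         # step 1: arbitrary initialization
t_temp = 1.0                    # step 2: initial temperature
for step in range(5_000):
    z = x + rng.normal(scale=0.1)                 # step 4: symmetric proposal q(.|x)
    if p_tilde(z) <= 0.0:
        A = 0.0                                   # candidates outside [0,1] are rejected
    else:
        A = min(1.0, (p_tilde(z) / p_tilde(x)) ** (1.0 / t_temp))   # step 5
    if rng.uniform() < A:                         # steps 6-8
        x = z
    t_temp *= 0.999                               # step 9: slow geometric cooling
```

As the temperature falls, downhill moves become ever less likely and the state x settles near the maximizer x = 1.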
Simulated annealing
Simulated annealing is guaranteed to find the MAP estimate with probability 1 if
◮ the Markov chain generated by the proposal distribution q is ergodic for any choice of t > 0
◮ the cooling scheme is sufficiently slow

Proof idea:
◮ since q generates an ergodic Markov chain, it will sample from the full distribution (1/Z_{1/t})(p(x))^(1/t) after a certain burn-in period if we keep t constant
◮ the smaller t is, the more time the Markov chain will stay in the close surrounding of xmax
◮ since (1/Z_{1/t})(p(x))^(1/t) → δ(x − xmax), the Markov chain will converge to xmax

Background remark: simulated annealing was motivated by the physical annealing of solids.

→ Matlab demo: robot localization, Metropolis algorithm vs. simulated annealing
Summary
◮ Monte Carlo approximation
◮ accept-reject sampling
• example: robot localization
◮ Markov chains
• ergodic Markov chains
• design of transition kernels
◮ Metropolis-Hastings algorithm
• example: robot localization
◮ Gibbs sampling
• example: bearing-only tracking
• example: Gaussian mixture
◮ slice sampling
◮ simulated annealing