Approximate inference:
Sampling methods
Probabilistic Graphical Models
Sharif University of Technology
Spring 2017
Soleymani
Approximate inference
Approximate inference techniques
Deterministic approximation
Variational algorithms
Stochastic simulation / sampling methods
Sampling-based estimation
Assume that $\mathcal{D} = \{x^{(1)}, \dots, x^{(N)}\}$ is a set of i.i.d. samples drawn from the desired distribution $p$.
For any distribution $p$ and function $f$, we can estimate $E_p[f]$ by the empirical expectation:
$E_p[f] \approx \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)})$
Expectations reveal interesting properties of the distribution $p$:
mean and variance of $p$
probability of events
E.g., we can find $P(x = a)$ by estimating $E_p[f]$ where $f(x) = I(x = a)$
Thus, a stochastic set of samples can represent a complex distribution.
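The empirical expectation above is easy to state in code. A minimal sketch (the standard normal target and the event $x \ge 1$ are illustrative choices, not from the slides):

```python
import random

def mc_expectation(sample, f, n):
    """Estimate E_p[f] as the empirical average of f over n i.i.d. draws,
    where sample() returns one draw from p."""
    return sum(f(sample()) for _ in range(n)) / n

# Indicator trick: P(X >= 1) for X ~ N(0,1) equals E_p[I(x >= 1)] ~= 0.1587.
random.seed(0)
est = mc_expectation(lambda: random.gauss(0.0, 1.0),
                     lambda x: 1.0 if x >= 1.0 else 0.0,
                     100_000)
```

The same `mc_expectation` covers means, variances, and event probabilities, by plugging in the appropriate $f$.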
Bounds on error
Let $\hat{p} = \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)})$ where $x^{(i)} \sim P(x)$, and $p = E_P[f(x)]$.
Hoeffding bound (additive bound on the error):
$P_{\mathcal{D}}\left(\hat{p} \notin [p - \epsilon, p + \epsilon]\right) \le 2 e^{-2 N \epsilon^2}$
Chernoff bound (multiplicative bound on the error):
$P_{\mathcal{D}}\left(\hat{p} \notin [p(1 - \epsilon), p(1 + \epsilon)]\right) \le 2 e^{-\frac{N p \epsilon^2}{3}}$
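These bounds can be inverted to choose the number of samples: set the right-hand side to a failure probability $\delta$ and solve for $N$. A small sketch (the function names and the example $\epsilon$, $\delta$ values are our own):

```python
import math

def hoeffding_n(eps, delta):
    """Smallest N with 2*exp(-2*N*eps^2) <= delta (additive error eps)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

def chernoff_n(eps, delta, p):
    """Smallest N with 2*exp(-N*p*eps^2/3) <= delta (relative error eps);
    note the dependence on p itself: rare events need many more samples."""
    return math.ceil(3.0 * math.log(2.0 / delta) / (p * eps ** 2))
```

For instance, estimating a probability to within $\pm 0.01$ with 95% confidence needs `hoeffding_n(0.01, 0.05)` samples, and the Chernoff requirement for a fixed relative error grows as the event probability $p$ shrinks.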
The mean and variance of the estimator
For samples drawn independently from the distribution $p$, consider the estimator
$\hat{f} = \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)})$
Its mean and variance are:
$E[\hat{f}] = E[f]$ (the estimator is unbiased)
$\mathrm{var}[\hat{f}] = \frac{1}{N} E\left[(f - E[f])^2\right]$
Monte Carlo methods
Using a set of samples to answer an inference query:
expectations can be approximated using sample-based averages
Asymptotically exact and easy to apply to arbitrary
problems
Challenges:
Drawing samples from many distributions is not trivial
Are the gathered samples enough?
Are all samples useful, or equally useful?
Generating samples from a distribution
Assume that we have an algorithm that generates (pseudo-)
random numbers distributed uniformly over (0,1).
How do we generate samples from other distributions using it?
First, we consider simple cases:
Bernoulli
Multinomial
Other standard distributions
Transformation technique
We intend to generate samples from standard distributions by mapping the values produced by a uniform random number generator such that the resulting mapped samples have the desired distribution.
Choose a function $f(\cdot)$ such that the resulting values $y = f(x)$ have some specific desired distribution $p(y)$:
$p(y) = p(x) \left| \frac{dx}{dy} \right|$
Since $p(x) = 1$, we have:
$x = \int_{-\infty}^{y} p(y')\, dy'$
If we define $h(y) \equiv \int_{-\infty}^{y} p(y')\, dy'$, then $y = h^{-1}(x)$.
Transformation technique
Inverse CDF sampling:
If $x \sim U(0,1)$ and $h(\cdot)$ is the CDF of $p$, then $h^{-1}(x) \sim p$.
Since we need to calculate and then invert the indefinite integral of
$p$, this is feasible only for a limited number of simple distributions.
Thus, we next look at rejection sampling and importance
sampling, which can also be used as important
components of more general sampling techniques.
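One of the simple distributions where this works in closed form is the exponential: its CDF $h(y) = 1 - e^{-\lambda y}$ inverts analytically. A minimal sketch:

```python
import math
import random

def sample_exponential(lam, u=None):
    """Inverse-CDF sampling for Exp(lam): h(y) = 1 - exp(-lam*y),
    so y = h^{-1}(u) = -ln(1 - u)/lam with u ~ Uniform(0, 1)."""
    if u is None:
        u = random.random()
    return -math.log(1.0 - u) / lam
```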
Rejection sampling
Suppose we wish to sample from $p(z) = \tilde{p}(z)/Z_p$:
$p(z)$ is difficult to sample from, but $\tilde{p}(z)$ is easy to evaluate.
We choose a simpler (proposal) distribution $q(z)$ that we can
sample from more easily,
where $\forall z,\; k q(z) \ge \tilde{p}(z)$.
Sample from $q(z)$: $z^* \sim q(z)$
Accept $z^*$ with probability $\frac{\tilde{p}(z^*)}{k q(z^*)}$
[Figure: the scaled proposal $k q(z)$ envelopes $\tilde{p}(z)$; a sample $z^*$ is drawn from $q$.]
Rejection sampling
Correctness: the density of the accepted samples is
$\frac{q(z)\, \tilde{p}(z)/(k q(z))}{\int q(z)\, \tilde{p}(z)/(k q(z))\, dz} = \frac{\tilde{p}(z)}{\int \tilde{p}(z)\, dz} = p(z)$
Probability of acceptance:
$p(\text{accept}) = \int \frac{\tilde{p}(z)}{k q(z)}\, q(z)\, dz = \frac{1}{k} \int \tilde{p}(z)\, dz = \frac{Z_p}{k}$
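A minimal sketch of the accept/reject loop (the target $z(1-z)$ on $[0,1]$ and the uniform proposal are illustrative choices, not from the slides):

```python
import random

def rejection_sample(p_tilde, q_sample, q_pdf, k):
    """Draw one sample from p(z) = p_tilde(z)/Z_p, given an envelope
    k*q(z) >= p_tilde(z) for all z."""
    while True:
        z = q_sample()                         # z* ~ q(z)
        if random.random() < p_tilde(z) / (k * q_pdf(z)):
            return z                           # accept with prob p_tilde/(k q)

# Target p(z) proportional to z(1-z) on [0,1] (a Beta(2,2) density);
# proposal q = Uniform(0,1); the max of z(1-z) is 1/4, so k = 0.25 is tight.
random.seed(0)
draws = [rejection_sample(lambda z: z * (1 - z), random.random,
                          lambda z: 1.0, 0.25) for _ in range(20000)]
```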
Adaptive rejection sampling
It is difficult to determine a suitable analytic form for $q$ in general.
We can use envelope functions to define $q$ when $p(x)$ is log-concave:
intersections of the tangent lines of $\ln p(x)$ are used to construct the envelope.
Initially, gradients are evaluated at some initial set of grid points, and tangent
lines are found accordingly.
In each iteration, a sample is drawn from the envelope distribution:
the envelope distribution is a piecewise exponential distribution, and
drawing a sample from it is straightforward.
If the sample is rejected, it is incorporated into the set of grid points, a
new tangent line is computed, and the envelope is thereby refined.
[Figure: tangent lines to $\ln p(x)$ at grid points $x_1, x_2, x_3$ form the envelope.]
High dimensional rejection sampling
Problem: rejection sampling has a low acceptance rate in high-dimensional
spaces:
the acceptance rate decreases exponentially with dimensionality.
Example:
Using $q = \mathcal{N}(\mu, \sigma_q^2 I)$ to sample from $p = \mathcal{N}(\mu, \sigma_p^2 I)$:
if $\sigma_q$ exceeds $\sigma_p$ by 1% and $D = 1000$, then
$\left(\frac{\sigma_q}{\sigma_p}\right)^D \approx 20{,}000$, so the optimal acceptance
rate is $1/20{,}000$, which is too small.
Importance sampling
Suppose sampling from $p$ is hard, and a simpler proposal distribution $q$ is used instead.
If $q$ dominates $p$ (i.e., $q(z) > 0$ whenever $p(z) > 0$), we can
sample from $q$ and reweight the obtained samples:
$E_p[f(z)] = \int f(z)\, p(z)\, dz = \int f(z)\, \frac{p(z)}{q(z)}\, q(z)\, dz$
$E_p[f(z)] \approx \frac{1}{N} \sum_{i=1}^{N} f(z^{(i)})\, \frac{p(z^{(i)})}{q(z^{(i)})}, \quad z^{(i)} \sim q(z)$
$E_p[f(z)] \approx \frac{1}{N} \sum_{i=1}^{N} f(z^{(i)})\, w^{(i)}, \quad w^{(i)} = \frac{p(z^{(i)})}{q(z^{(i)})}$
Normalized importance sampling
Suppose that we can only evaluate $\tilde{p}(z)$, where $p(z) = \tilde{p}(z)/Z_p$:
$E_p[f(z)] = \int f(z)\, p(z)\, dz = \frac{1}{Z_p} \int f(z)\, \frac{\tilde{p}(z)}{q(z)}\, q(z)\, dz$
The normalizer can be estimated from the same samples, since
$Z_p = \int \tilde{p}(z)\, dz = \int \frac{\tilde{p}(z)}{q(z)}\, q(z)\, dz$
Therefore:
$E_p[f(z)] = \frac{\int f(z)\, \frac{\tilde{p}(z)}{q(z)}\, q(z)\, dz}{\int \frac{\tilde{p}(z)}{q(z)}\, q(z)\, dz} \approx \frac{\frac{1}{L} \sum_{i=1}^{L} f(z^{(i)})\, r^{(i)}}{\frac{1}{L} \sum_{j=1}^{L} r^{(j)}}, \quad r^{(i)} = \frac{\tilde{p}(z^{(i)})}{q(z^{(i)})}, \quad z^{(i)} \sim q(z)$
$E_p[f(z)] \approx \sum_{i=1}^{L} f(z^{(i)})\, w^{(i)}, \quad w^{(i)} = \frac{r^{(i)}}{\sum_{j=1}^{L} r^{(j)}}$
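A sketch of this self-normalized estimator (the unnormalized standard normal target and the wider Gaussian proposal are our own illustrative choices):

```python
import math
import random

def normalized_is(f, p_tilde, q_sample, q_pdf, n):
    """Self-normalized importance sampling: estimate E_p[f] when only
    p_tilde = Z_p * p can be evaluated."""
    zs = [q_sample() for _ in range(n)]
    r = [p_tilde(z) / q_pdf(z) for z in zs]          # unnormalized weights
    return sum(ri * f(zi) for ri, zi in zip(r, zs)) / sum(r)

# Target: p(z) proportional to exp(-z^2/2) (standard normal, with Z_p
# unknown to the estimator); proposal q = N(0, 2^2); true E_p[z^2] = 1.
def q_pdf(z):
    return math.exp(-z * z / 8.0) / (2.0 * math.sqrt(2.0 * math.pi))

random.seed(0)
est = normalized_is(lambda z: z * z,
                    lambda z: math.exp(-0.5 * z * z),
                    lambda: random.gauss(0.0, 2.0),
                    q_pdf, 50000)
```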
Importance sampling: problem
Importance sampling depends on how well $q$ matches $p$.
For mismatched distributions, the weights may be dominated by a few samples having
large weights, with the remaining weights being relatively insignificant.
It is common that $p(z) f(z)$ is strongly varying and has a significant
proportion of its mass concentrated in small regions.
The problem is severe if none of the samples falls in the regions where
$p(z) f(z)$ is large:
the estimate of the expectation may be severely wrong, while the empirical variance of the weights can
be small, so the failure may go undetected.
A key requirement for $q(z)$ is that it should
not be small or zero in regions where $p(z)$ may be significant.
[Figure: $p(z)$, $f(z)$, and $q(z)$. From the Bishop book.]
Sampling Importance Resampling (SIR)
SIR sampling:
First, $L$ samples $z^{(1)}, \dots, z^{(L)}$ are drawn from $q(z)$.
Then, the weights $w^{(1)}, \dots, w^{(L)}$ are found.
Finally, a second set of $L$ samples is drawn from the discrete distribution
$\{z^{(1)}, \dots, z^{(L)}\}$ with probabilities given by the weights $w^{(1)}, \dots, w^{(L)}$.
The resulting $L$ samples are only approximately
distributed according to $p(z)$, but the distribution
becomes correct in the limit $L \to \infty$.
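A sketch of the weight-then-resample procedure (the unnormalized normal target and uniform proposal are illustrative):

```python
import math
import random

def sir(p_tilde, q_sample, q_pdf, n):
    """Sampling-Importance-Resampling: draw n proposals, weight them,
    then resample n values with replacement proportionally to the weights."""
    zs = [q_sample() for _ in range(n)]
    w = [p_tilde(z) / q_pdf(z) for z in zs]
    return random.choices(zs, weights=w, k=n)   # weights need not sum to 1

# Target p(z) proportional to exp(-z^2/2); proposal Uniform(-5, 5).
random.seed(0)
resampled = sir(lambda z: math.exp(-0.5 * z * z),
                lambda: random.uniform(-5.0, 5.0),
                lambda z: 0.1, 50000)
```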
Sampling methods for graphical models
DGMs:
Forward (or ancestral) sampling
Likelihood weighted sampling
For UGMs, there is no one-pass sampling strategy that will
sample even from the prior distribution with no observed
variables.
Instead, computationally more expensive techniques such as Gibbs sampling exist
that will be introduced in the next slides
Sampling the joint distribution represented
by a BN
Sample the joint distribution by ancestral sampling.
Example:
Sample from $P(D)$ → $D = d^1$
Sample from $P(I)$ → $I = i^0$
Sample from $P(G \mid i^0, d^1)$ → $G = g^3$
Sample from $P(S \mid i^0)$ → $S = s^0$
Sample from $P(L \mid g^3)$ → $L = l^0$
One sample $(d^1, i^0, g^3, s^0, l^0)$ was
generated.
Forward sampling in a BN
Given a BN and a number of samples $N$:
Choose a topological ordering on the variables, e.g., $X_1, \dots, X_M$
For $j = 1$ to $N$
  For $i = 1$ to $M$
    Sample $x_i^{(j)}$ from the distribution $P(X_i \mid x_{pa_i}^{(j)})$
  Add $\{x_1^{(j)}, \dots, x_M^{(j)}\}$ to the sample set
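A minimal sketch of forward sampling on a toy binary network $D \to G \leftarrow I$ (the CPD numbers are made up for illustration; they are not the CPDs of the example network in the slides):

```python
import random

# P(D=1), P(I=1), and P(G=1 | D, I) for a toy network D -> G <- I.
P_D1 = 0.4
P_I1 = 0.3
P_G1 = {(0, 0): 0.3, (0, 1): 0.9, (1, 0): 0.05, (1, 1): 0.5}

def bern(p1):
    return 1 if random.random() < p1 else 0

def forward_sample():
    """Sample in topological order: every parent before its children."""
    d = bern(P_D1)
    i = bern(P_I1)
    g = bern(P_G1[(d, i)])          # G's CPD row is selected by (d, i)
    return d, i, g
```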
Sampling for conditional probability query
$P(i^1 \mid l^0, s^0) = ?$
Looking at the samples, we can count:
$N$: number of samples
$N_e$: number of samples in which the evidence holds ($L = l^0$, $S = s^0$)
$N_I$: number of samples where the joint event is true ($L = l^0$, $S = s^0$, $I = i^1$)
For a large enough $N$:
$N_e / N \approx P(l^0, s^0)$ and $N_I / N \approx P(i^1, l^0, s^0)$
And so we can set:
$P(i^1 \mid l^0, s^0) = \frac{P(i^1, l^0, s^0)}{P(l^0, s^0)} \approx N_I / N_e$
Using rejection sampling to compute $P(X \mid e)$
Given a BN, a query $P(X \mid e)$, and a number of samples $N$:
Choose a topological ordering on the variables, e.g., $X_1, \dots, X_M$
$j = 1$
While $j < N$
  For $i = 1$ to $M$
    Sample $x_i^{(j)}$ from the distribution $P(X_i \mid x_{pa_i}^{(j)})$
  If $\{x_1^{(j)}, \dots, x_M^{(j)}\}$ is consistent with the evidence $e$, add it to the sample set and
  set $j = j + 1$
Use the samples to compute $P(X \mid e)$
Forward sampling: problem
Problem: When the evidence rarely happens, we would need
lots of samples, and most would be wasted
overall probability of accepting a sample rapidly decreases as the number
of observed variables increases and as the number of states that those
variables can take increases
This approach is very slow and rarely used in practice.
Likelihood weighting
Let $X_1, \dots, X_M$ be a topological ordering of the variables of $G$.
$w = 1$
for $i = 1, \dots, M$
  if $X_i \notin E$
    draw $x_i$ from $P(X_i \mid x_{pa_i})$
  else
    $x_i \leftarrow$ the value of $X_i$ in the evidence $e$
    $w \leftarrow w \cdot P(x_i \mid x_{pa_i})$
The above procedure generates a sample $\xi = (x_1, \dots, x_M)$ and assigns it a
weight $w$, which is exactly the importance weight $P(\xi)/Q(\xi)$:
$w(\xi) = \prod_{X_i \notin E} \frac{P(x_i \mid x_{pa_i})}{P(x_i \mid x_{pa_i})} \cdot \prod_{X_i \in E} \frac{P(x_i \mid x_{pa_i})}{1} = \prod_{X_i \in E} P(x_i \mid x_{pa_i})$
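A sketch of likelihood weighting on a toy network $D \to G \leftarrow I$ with made-up CPDs (not the slides' network), clamping the evidence $G = g$ and weighting each sample by $P(g \mid d, i)$:

```python
import random

P_D1 = 0.4
P_I1 = 0.3
P_G1 = {(0, 0): 0.3, (0, 1): 0.9, (1, 0): 0.05, (1, 1): 0.5}

def lw_sample(g_evidence):
    """One weighted sample: non-evidence nodes are sampled from their CPDs;
    the evidence node G is clamped, contributing P(g | parents) to w."""
    d = 1 if random.random() < P_D1 else 0
    i = 1 if random.random() < P_I1 else 0
    p_g1 = P_G1[(d, i)]
    w = p_g1 if g_evidence == 1 else 1.0 - p_g1
    return (d, i), w

def p_i1_given_g(g_evidence, n):
    """Estimate P(I=1 | G=g) as a weighted relative frequency."""
    num = den = 0.0
    for _ in range(n):
        (_, i), w = lw_sample(g_evidence)
        num += w * i
        den += w
    return num / den
```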
Efficiency of likelihood weighting
The efficiency of importance sampling depends on how
close the proposal $Q$ is to the target $P$.
Two extreme cases:
When all the evidence is at the roots, $Q(X) = P(X \mid e)$, and all
samples have the same weight.
When all the evidence is at the leaves, $Q(X) = P(X)$ (the prior):
this works reasonably well only if $P(X)$ is similar to $P(X \mid e)$;
otherwise, most of our samples will be irrelevant, i.e., their weights
will be very low.
Limitations of Monte Carlo
Direct sampling: applicable only when we can sample from $p(x)$ directly; can be wasteful for rare events.
Rejection and importance sampling use a proposal distribution
$q(x)$ and can also be used when we cannot sample from $p(x)$ directly:
In rejection sampling, when the proposal $q(x)$ is very different from
$p(x)$, most samples are rejected.
In importance sampling, when the proposal $q(x)$ is very different from
$p(x)$, most samples have very low weights.
Problem: finding a good proposal $q(x)$ that is similar to $p(x)$ usually
requires knowledge of the analytic form of $p(x)$, which is not available.
Markov chain Monte Carlo (MCMC)
Instead of using a fixed proposal $q(x)$, we can use an adaptive
proposal $q(x \mid x^{(t)})$ that depends on the last sample
$x^{(t)}$:
the proposal distribution is adapted as a function of the last accepted
sample.
MCMC methods:
Metropolis-Hastings
Gibbs
MCMC
MCMC algorithms feature adaptive proposals:
instead of $q(x')$, they use $q(x' \mid x)$, where $x'$ is the new state being
sampled and $x$ is the previous sample.
As $x$ changes, $q(x' \mid x)$ can also change (as a function of $x'$).
[This slide has been adapted from Xingβs slides]
Metropolis-Hastings
Draw a sample $x'$ from $q(x' \mid x)$, where $x$ is the previous sample.
The new sample $x'$ is accepted with probability $A(x' \mid x)$:
$A(x' \mid x) = \min\left(1, \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)}\right)$
$A(x' \mid x)$ is like a ratio of importance sampling weights:
$\frac{p(x')}{q(x' \mid x)}$ is the importance weight for $x'$, and $\frac{p(x)}{q(x \mid x')}$ is the importance weight for $x$.
We use $A(x' \mid x)$ to ensure that, after sufficiently many draws, our samples come from the true distribution $p(x)$; we will show this in the next slides.
Note that we only need to compute the ratio $\frac{p(x')}{p(x)}$ rather
than $p(x')$ or $p(x)$ separately, so an unnormalized $\tilde{p}$ suffices.
Metropolis-Hastings algorithm
Initialize the starting state $x^{(0)}$, set $t = 0$
Burn-in: while samples have "not converged"
• $x = x^{(t)}$
• sample $x^* \sim q(x^* \mid x)$
• $A(x^* \mid x) = \min\left(1, \frac{p(x^*)\, q(x \mid x^*)}{p(x)\, q(x^* \mid x)}\right)$
• sample $u \sim \mathrm{Uniform}(0,1)$
• if $u < A(x^* \mid x)$
  • $x^{(t+1)} = x^*$
• else
  • $x^{(t+1)} = x^{(t)}$
• $t = t + 1$
After burn-in, for $n = 1, \dots, N$:
draw a sample from $q(x \mid x^{(n-1)})$ and accept it with probability $A(x \mid x^{(n-1)})$
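A minimal sketch of random-walk MH (the bimodal mixture target and the proposal width are illustrative choices, not from the slides):

```python
import math
import random

def metropolis_hastings(p_tilde, sigma, n, burn_in, x0=0.0):
    """Random-walk MH with symmetric proposal q(x'|x) = N(x'|x, sigma^2);
    the q terms then cancel in the acceptance ratio."""
    x, chain = x0, []
    for t in range(n + burn_in):
        x_new = random.gauss(x, sigma)
        if random.random() < min(1.0, p_tilde(x_new) / p_tilde(x)):
            x = x_new                 # accept; otherwise keep the old state
        if t >= burn_in:
            chain.append(x)
    return chain

# Bimodal target, known only up to a constant: mixture of N(-3,1) and N(3,1).
def p_tilde(x):
    return math.exp(-0.5 * (x + 3.0) ** 2) + math.exp(-0.5 * (x - 3.0) ** 2)

random.seed(0)
chain = metropolis_hastings(p_tilde, 4.0, 50000, 2000)
```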
Metropolis-Hastings algorithm: Example
Let $q(x' \mid x)$ be a Gaussian centered on the current point $x$.
We're trying to sample from a bimodal distribution $p(x)$.
[Figure: the initial proposal $q(x' \mid x^{(0)})$, centered on the start state $x^{(0)}$.]
[Figure: the proposal $q(x' \mid x^{(1)})$ is recentered on the latest sample $x^{(1)}$.]
We reject the proposed $x'$ when the ratio of importance weights is unfavorable: here $\frac{p(x')}{q(x' \mid x^{(2)})} < 1$ while
$\frac{p(x^{(2)})}{q(x^{(2)} \mid x')} > 1$, hence $A(x' \mid x^{(2)})$ is small.
[Figure: the proposal $q(x' \mid x^{(2)})$ and a rejected sample.]
[Figure: the proposal $q(x' \mid x^{(3)})$ is recentered on the current sample $x^{(3)}$.]
The adaptive proposal $q(x^* \mid x)$ allows
us to sample both modes of $p(x)$!
[Figure: the proposal $q(x' \mid x^{(4)})$ after several accepted moves.]
Metropolis algorithm: example
Let $q(x' \mid x)$ be a Gaussian centered on $x$:
$q(x' \mid x) = \mathcal{N}(x' \mid x, \rho^2 I)$
Since this proposal is symmetric, the acceptance probability simplifies to
$A(x' \mid x) = \min\left(1, \frac{p(x')}{p(x)}\right)$
The result is a biased random walk that explores
the target distribution $p(x)$.
[Figure from the Bishop book.]
Metropolis algorithm: example
$q(x' \mid x) = \mathcal{N}(x' \mid x, \rho^2 I)$
large $\rho^2$ ⇒ many rejections
small $\rho^2$ ⇒ slow exploration
On the order of $(L/\rho)^2$ iterations are required to reach roughly independent states, where $L$ is the length scale of the explored distribution.
In general, finding a good proposal
distribution is not always easy.
Proposal distribution
Low-variance proposals:
high probability of acceptance;
many iterations are required to explore $p(x)$;
result in more correlated samples.
High-variance proposals:
low probability of acceptance;
have the potential to explore much of $p(x)$;
result in less correlated samples.
How to use Markov chains for sampling from $p(x)$?
Our goal is to use Markov chains to sample from a given
distribution.
We can achieve this if we set up a Markov chain whose unique
stationary distribution is $p$:
we design the transition distribution $T(x' \mid x)$ so that the chain
has a unique stationary distribution $p(x)$, independent of the initial state.
The ergodic condition is a sufficient condition for this.
Sample $x^{(0)}$ randomly
For $t = 0, 1, 2, \dots$
  Sample $x^{(t+1)}$ from $T(x' \mid x^{(t)})$
Stationary distribution & detailed balance
$\pi_t(x)$: probability distribution over the state $x$ at time $t$. The transition probability $T(x' \mid x)$ redistributes the mass in state $x$ to the other
states $x'$:
$\pi_{t+1}(x') = \sum_{x} \pi_t(x)\, T(x' \mid x)$
$\pi^*(x)$ is invariant (or stationary) if it does not change under the transitions:
$\pi^*(x') = \sum_{x} \pi^*(x)\, T(x' \mid x) \quad \forall x'$
A sufficient condition for ensuring that $\pi^*(x)$ is a stationary distribution of an MC is the detailed balance condition:
$\pi^*(x)\, T(x' \mid x) = \pi^*(x')\, T(x \mid x')$
Why does Metropolis-Hastings work?
MH is a general construction that allows us to build
a reversible Markov chain with a particular stationary
distribution $p(x)$.
If we draw a sample $x'$ according to $q(x' \mid x)$ and then
accept/reject it according to $A(x' \mid x)$, we have the transition
kernel:
$T(x' \mid x) = A(x' \mid x)\, q(x' \mid x) \quad \text{if } x' \ne x$
$T(x \mid x) = q(x \mid x) + \sum_{x' \ne x} q(x' \mid x)\left(1 - A(x' \mid x)\right)$
MH satisfies detailed balance
Theorem: MH satisfies detailed balance.
Proof:
$A(x' \mid x) = \min\left(1, \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)}\right)$
If $\frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)} \le 1$, then the reciprocal ratio is $\ge 1$, and therefore $A(x \mid x') = 1$.
So suppose $A(x' \mid x) < 1$ (and hence $A(x \mid x') = 1$):
$A(x' \mid x) = \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)} \;\Rightarrow\; p(x)\, q(x' \mid x)\, A(x' \mid x) = p(x')\, q(x \mid x') = p(x')\, q(x \mid x')\, A(x \mid x')$
$\Rightarrow\; p(x)\, T(x' \mid x) = p(x')\, T(x \mid x')$
using $T(x' \mid x) = A(x' \mid x)\, q(x' \mid x)$.
MH properties
The MH algorithm eventually converges to the stationary distribution
$p(x)$, which is the true distribution.
However, we have no guarantees as to when this will occur:
the burn-in period is a way to ignore the un-converged part of the
Markov chain,
but deciding when to halt burn-in is an art that needs experimentation.
$q$ must be chosen to fulfill the technical requirements (e.g., ergodicity of the resulting chain).
Gibbs sampling algorithm
Suppose the graphical model contains variables $x_1, \dots, x_M$.
Initialize starting values for $x_1, \dots, x_M$
Do until convergence:
  Pick an ordering of the $M$ variables (can be fixed or random)
  For each variable $x_i$ in order:
    Sample $x$ from $P(x_i \mid x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_M)$, i.e., conditioned on
    the current values of all other variables
    Update $x_i \leftarrow x$
When we update $x_i$, we immediately use its new value for sampling the other variables $x_j$.
Gibbs sampling algorithm
Suppose the graphical model contains variables $x_1, \dots, x_M$.
If the current sample is $x = (x_1, \dots, x_M)$, the next sample $x' = (x_1', \dots, x_M')$
is drawn as:
Sample $x_1'$ from $P(x_1 \mid x_2, \dots, x_M)$
Sample $x_2'$ from $P(x_2 \mid x_1', x_3, \dots, x_M)$
…
Sample $x_i'$ from $P(x_i \mid x_1', \dots, x_{i-1}', x_{i+1}, \dots, x_M)$
…
Sample $x_M'$ from $P(x_M \mid x_1', \dots, x_{M-1}')$
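A sketch of such a sweep for a case where the full conditionals are known in closed form: a bivariate Gaussian with unit variances and correlation $\rho$ (our illustrative example, not from the slides), for which $x_1 \mid x_2 \sim \mathcal{N}(\rho x_2, 1 - \rho^2)$:

```python
import math
import random

def gibbs_bivariate_gaussian(rho, n, burn_in=1000):
    """Gibbs sampling for (x1, x2) ~ N(0, [[1, rho], [rho, 1]]);
    each full conditional is N(rho * other, 1 - rho^2)."""
    sd = math.sqrt(1.0 - rho * rho)
    x1 = x2 = 0.0
    chain = []
    for t in range(n + burn_in):
        x1 = random.gauss(rho * x2, sd)   # new x1 is used immediately below
        x2 = random.gauss(rho * x1, sd)
        if t >= burn_in:
            chain.append((x1, x2))
    return chain
```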
Gibbs Sampling
Gibbs sampling is an MCMC algorithm that samples one
random variable of a graphical model at a time.
We will see that GS is a special case of MH.
GS needs reasonable time and memory:
complete conditionals are fairly easy to derive for many graphical
models (e.g., mixture models, LDA);
for conjugate-exponential models, we can use standard sampling techniques;
if the conditionals are log-concave, we can use adaptive rejection
sampling.
Blocked Gibbs sampling samples a subset of variables at a time.
Gibbs sampling is a special case of MH
($x_{-i}$ denotes all variables except $x_i$.)
Gibbs sampling is an MH algorithm with the proposal distribution:
$q(x_i', x_{-i} \mid x_i, x_{-i}) = P(x_i' \mid x_{-i})$
Using this proposal distribution in the MH algorithm, we find that samples are always accepted:
$A(x_i', x_{-i} \mid x_i, x_{-i}) = \min\left(1, \frac{P(x_i', x_{-i})\, q(x_i, x_{-i} \mid x_i', x_{-i})}{P(x_i, x_{-i})\, q(x_i', x_{-i} \mid x_i, x_{-i})}\right)$
$= \min\left(1, \frac{P(x_i', x_{-i})\, P(x_i \mid x_{-i})}{P(x_i, x_{-i})\, P(x_i' \mid x_{-i})}\right)$
$= \min\left(1, \frac{P(x_i' \mid x_{-i})\, P(x_{-i})\, P(x_i \mid x_{-i})}{P(x_i \mid x_{-i})\, P(x_{-i})\, P(x_i' \mid x_{-i})}\right) = 1$
GS is simply MH with a proposal that is always accepted!
Gibbs sampling: example
Target: $P(D, I, G \mid s^1, l^0)$
Initialization: $(d^0, i^0, g^1)$
$d' \sim P(D \mid i^0, g^1, s^1, l^0) \;\to\; (d^0, i^0, g^1)$
$i' \sim P(I \mid d^0, g^1, s^1, l^0) \;\to\; (d^0, i^1, g^1)$
$g' \sim P(G \mid d^0, i^1, s^1, l^0) \;\to\; (d^0, i^1, g^1)$
$x^{(t)} = (d^0, i^0, g^1) \;\Rightarrow\; x^{(t+1)} = (d^0, i^1, g^1)$
Gibbs sampling:
complete conditional distributions
$P(x_i \mid x_{-i})$ can often be evaluated quickly, because it
depends only on the neighboring nodes.
Let $MB(X_i)$ be the Markov blanket of $X_i$:
For a BN, the Markov blanket of a variable is the set of variables
comprising its parents, children, and co-parents.
For an MRF, the Markov blanket of a variable is its immediate neighbors.
$P(X_i \mid x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_M) = P(X_i \mid MB(X_i))$
A sufficient condition for ergodicity of Gibbs sampling is that
none of the conditional distributions is anywhere zero.
Gibbs sampling implementation
Conditionals with a few discrete settings can be explicitly normalized:
$P(x_i \mid x_{-i}) = \frac{P(x_i, x_{-i})}{\sum_{x_i'} P(x_i', x_{-i})} = \frac{\prod_{j:\, X_i \in \mathrm{Scope}(C_j)} \phi_j(x_i, c_j \setminus x_i)}{\sum_{x_i'} \prod_{j:\, X_i \in \mathrm{Scope}(C_j)} \phi_j(x_i', c_j \setminus x_i)}$
This ratio depends only on the instantiation
of the variables in $X_i$'s Markov blanket.
Continuous univariate conditionals can be sampled using standard sampling methods,
e.g., inverse CDF sampling, rejection sampling, etc.
Gibbs sampling: summary
Converts the hard problem of inference into a sequence of
sampling steps from (univariate) conditional distributions.
Pros:
Perhaps the simplest Markov chain for graphical models
Computationally efficient: the conditionals can often be evaluated quickly
Cons:
Only applies if we can sample from the product of the relevant factors
For overly complex cases, Metropolis-Hastings can provide an effective
alternative
Often slow to mix, especially when the probabilities are peaked
The acceptance rate is always 1, and high acceptance usually entails slow
exploration
MH & MC: summary
To sample from the true distribution $p(x)$, MH methods use
adaptive proposals $q(x \mid x^{(t)})$ rather than a fixed $q(x)$.
MH allows specifying any proposal $q(x' \mid x)$, but choosing a good $q(x' \mid x)$ requires technical care.
Gibbs sampling sets the proposal $q(x' \mid x)$ to the conditional
distribution $P(x_i' \mid x_{-i})$, and the acceptance rate is always one in this
case.
Practical aspects of MCMC
Using the samples:
Only once the chain mixes are all samples $x^{(t)}$ from the stationary distribution $p(x)$, so we can use all $x^{(t)}$ for $t > t_{mix}$.
However, nearby samples are correlated,
so we should not overestimate the quality of our estimate by simply counting samples.
Selecting the proposal $q(x' \mid x)$: a good $q$ proposes distant samples with a sufficiently high acceptance rate.
Monitor the acceptance rate.
Plot the autocorrelation function.
Stop burn-in
Stop burn-in when the state distribution is reasonably close to
$p(x)$:
start collecting samples after the chain has run long enough to mix.
How do you know whether a chain has mixed?
In general, you can never prove that a chain has mixed:
no general-purpose theoretical analysis exists for the mixing time of
graphical models.
But in many cases you can show that it has not mixed:
there are approaches that tell us if a chain has not converged, e.g.,
comparing chain statistics in different windows within a single run of
the chain, or across different chains corresponding to runs initialized
differently.
Poor mixing
Poor mixing (slow convergence):
the MC stays in small regions of the parameter space for long periods of time;
this occurs when the state space consists of several regions that are connected only
by low-probability transitions.
The most straightforward diagnostic is to use multiple highly
dispersed initial values to start several different chains.
[Figure: a multimodal target distribution in
which our choice of starting values
traps us near one of the modes.]
Mixing?
Each dot is a statistic (e.g., $P(x \le a)$):
its x-position is its estimated value from chain 1;
its y-position is its estimated value from chain 2.
If both chains have mixed, the dots should lie close to the diagonal.
[Koller & Friedman book]
Autocorrelation coefficient
Autocorrelation function of a random variable $x$:
$\rho_x(k) = \frac{\mathrm{cov}\left[x^{(t)}, x^{(t+k)}\right]}{\mathrm{var}\left[x^{(t)}\right]}$
estimated from the chain as
$\hat{\rho}_x(k) = \frac{\sum_{t=1}^{N-k} \left(x^{(t)} - \bar{x}\right)\left(x^{(t+k)} - \bar{x}\right)}{\sum_{t=1}^{N} \left(x^{(t)} - \bar{x}\right)^2}, \qquad \bar{x} = \frac{1}{N} \sum_{t=1}^{N} x^{(t)}$
One way to diagnose a poorly mixing chain is to observe high autocorrelation at distant lags.
High autocorrelation leads to a smaller effective sample size!
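The estimator above in code (the AR(1) sequence used to exercise it is an illustrative stand-in for MCMC output):

```python
import random

def autocorrelation(xs, k):
    """Sample autocorrelation of the sequence xs at lag k."""
    n = len(xs)
    mean = sum(xs) / n
    denom = sum((x - mean) ** 2 for x in xs)
    num = sum((xs[t] - mean) * (xs[t + k] - mean) for t in range(n - k))
    return num / denom

# An AR(1) sequence x_{t+1} = 0.9 x_t + noise has lag-k autocorrelation
# close to 0.9^k, mimicking a slowly mixing chain.
random.seed(0)
xs = [0.0]
for _ in range(20000):
    xs.append(0.9 * xs[-1] + random.gauss(0.0, 1.0))
```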
Thinning
One strategy for reducing autocorrelation is thinning the output:
storing only every $m$th sample after the burn-in period;
the autocorrelation can be reduced by increasing the thinning ratio;
approximately independent samples can thus be obtained.
We may want proposals $q(x' \mid x)$ with low autocorrelation:
if the variance of $q$ is too small, moves are accepted with high probability, but they are also small, generating high autocorrelations and poor mixing.
[Figure: all samples after burn-in vs. the thinned chain.]
Thinning or not?
Using all of the samples produces a provably better estimator than
using just those that remain after thinning.
Thinning is useful primarily in settings where there is a
significant cost associated with using each sample (for example,
when the evaluation of $f$ is costly),
so that we might want to reduce the overall number of particles used.
One long chain or many smaller chains
If long burn-in periods are required, or if the chains have very
high autocorrelations,
using a large number of smaller chains may result in each chain not being
long enough to be of any value.
A good strategy: run a small number of chains in parallel for a
reasonably long time,
using their behavior to evaluate mixing.
MCMC implementation details
Tunable parameters or design choices in MCMC:
the proposal distribution
the metrics for evaluating mixing
the number of chains to run
techniques for determining the thinning ratio
MCMC sampling: summary
Theoretical guarantees exist for convergence to the exact values.
However, MCMC can be quite slow to converge (we may need ridiculously many samples),
and it is only useful if the chain we are using mixes reasonably quickly.
It is also difficult to tell whether the chain has converged.
Many options need to be specified; the application of Markov chains is more of an art than a science.
Unlike forward sampling methods, MCMC methods:
do not degrade when the probability of the evidence is low, or when the
posterior is very different from the prior;
apply to undirected models as well as to directed models.
Resources
D. Koller and N. Friedman, "Probabilistic Graphical Models:
Principles and Techniques", MIT Press, Chapter 12.
C.M. Bishop, "Pattern Recognition and Machine Learning",
Springer, Chapter 11.
Theoretical foundation of MH
Why do the samples generated by the MH method eventually
come from $p(x)$?
Why does the MH algorithm have a "burn-in" period?
Markov chains
A Markov chain is a sequence of random variables $x^{(1)}, \dots, x^{(t)}$
with the Markov property:
$P(x^{(t)} \mid x^{(1)}, \dots, x^{(t-1)}) = P(x^{(t)} \mid x^{(t-1)})$
We focus on homogeneous Markov chains, in which
$P(x^{(t)} \mid x^{(t-1)})$ is fixed over time:
letting $x$ be the previous state and $x'$ the next state, we write
$P(x^{(t)} = x' \mid x^{(t-1)} = x)$ as $T(x' \mid x)$.
At each time point, $x^{(t)}$ is a state describing the
configuration of all the variables in the model.
Markov chains: invariant or stationary dist.
$\pi_t(x)$: probability distribution over the state $x$ at time $t$. The transition probability $T(x' \mid x) = P(x^{(t)} = x' \mid x^{(t-1)} = x)$ redistributes the mass in state $x$ to the other
states $x'$:
$\pi_{t+1}(x') = \sum_{x} \pi_t(x)\, T(x' \mid x)$
$\pi(x)$ is invariant (or stationary) if it does not change under
the transitions:
$\pi(x') = \sum_{x} \pi(x)\, T(x' \mid x) \quad \forall x'$
Invariant distributions are of great importance in MCMC methods.
More specific than stationary distribution
There is no guarantee, in general, that the stationary distribution is
unique:
In some chains, the stationary distribution reached depends on
our starting distribution π0(π).
We want to restrict our attention to MCs that have a unique
stationary distribution, which is reached from any starting
distribution π0(π).
There are various conditions that suffice to guarantee this
property.
The most commonly used condition is ergodicity.
Equilibrium distribution
A stationary distribution $\pi$ of the transition kernel $T$ is called
the equilibrium distribution when $\lim_{t \to \infty} \pi_t = \pi$, independent
of the initial distribution $\pi_0$.
Ergodicity implies that the chain reaches the stationary distribution $\pi(x)$ no
matter what the initial distribution $\pi_0(x)$ is.
Clearly, an ergodic Markov chain can have only one equilibrium
distribution.
Ergodic Markov chain
A homogeneous Markov chain is ergodic subject to
the following conditions:
Irreducible: an MC is irreducible if we can get from any state $x$ to any
other state $x'$ with probability $> 0$ in a finite number of steps.
Aperiodic: an MC is aperiodic if returns to any state $x$ are possible at any
time (periodic MCs have states that can only be revisited in cycles of length $\ge 2$).
Ergodic: an MC is ergodic if it is irreducible and aperiodic.
An ergodic MC converges to a unique stationary distribution regardless
of the start state.
Finite state space MC
A finite-state Markov chain is ergodic (regular) if there exists a $k$ such that, for every pair of states $x, x'$, the probability of getting from $x$ to $x'$ in exactly $k$ steps is $> 0$.
A sufficient set of conditions for regularity:
every two states are connected, and
for every state, there is a self-transition.
How to use Markov chains for sampling from $p(x)$?
Our goal is to use Markov chains to sample from a given
distribution.
We can achieve this if we set up a Markov chain whose unique
stationary distribution is $p$:
we design the transition distribution $T(x' \mid x)$ so that the chain
has a unique stationary distribution $p(x)$.
Sample $x^{(0)}$ randomly
For $t = 0, 1, 2, \dots$
  Sample $x^{(t+1)}$ from $T(x' \mid x^{(t)})$
Detailed balance
A sufficient (but not necessary) condition for ensuring that
$\pi(x)$ is a stationary distribution of an MC is the detailed
balance condition:
$\pi(x)\, T(x' \mid x) = \pi(x')\, T(x \mid x')$
Reversible chain: an MC is reversible if there exists a distribution
$\pi(x)$ such that the detailed balance condition is satisfied.
Enforcing detailed balance is easy:
detailed balance is local (it only involves isolated pairs of states).
Detailed balance means the sequences $x', x$ and $x, x'$ are
equally probable (although the probabilities of the transitions $x' \to x$ and $x \to x'$ can be different).
Reversible Chains
Theorem: detailed balance implies stationarity of $\pi$.
Proof:
$\pi(x)\, T(x' \mid x) = \pi(x')\, T(x \mid x')$
$\Rightarrow\; \sum_{x} \pi(x)\, T(x' \mid x) = \sum_{x} \pi(x')\, T(x \mid x')$
$\Rightarrow\; \sum_{x} \pi(x)\, T(x' \mid x) = \pi(x') \sum_{x} T(x \mid x') = \pi(x')$
Theorem: if detailed balance holds and the chain is ergodic, then
it has a unique stationary distribution $\pi$.