Post on 16-Jan-2016
Inference V: MCMC Methods
Stochastic Sampling
In the previous class, we examined methods that use independent samples to estimate P(X = x | e)
Problem: It is difficult to sample from P(X1, …, Xn | e)
We had to use likelihood weighting to reweigh our samples
This introduced bias into the estimation
In some cases, such as when the evidence is on the leaves, these methods are inefficient
MCMC Methods
We are going to discuss sampling methods that are based on Markov chains:
Markov Chain Monte Carlo (MCMC) methods
Key ideas:
The sampling process is a Markov chain
The next sample depends on the previous one
These methods can approximate any posterior distribution
We start by reviewing key ideas from the theory of Markov chains
Markov Chains
Suppose X1, X2, … take values in some set; w.l.o.g. these values are 1, 2, …
A Markov chain is a process that corresponds to the network:
[Figure: chain-structured network X1 → X2 → X3 → … → Xn]
To quantify the chain, we need to specify
Initial probability: P(X1)
Transition probability: P(Xt+1|Xt)
A Markov chain has stationary transition probabilities:
P(Xt+1|Xt) is the same for all times t
Irreducible Chains
A state j is accessible from state i if there is an n such that P(Xn = j | X1 = i) > 0
There is a positive probability of reaching j from i after some number of steps
A chain is irreducible if every state is accessible from every state
Ergodic Chains
A state i is positively recurrent if the expected time to return to state i after being in state i is finite
If X has a finite number of states, then it suffices that i is accessible from itself
A chain is ergodic if it is irreducible and every state is positively recurrent
(A)periodic Chains
A state i is periodic if there is an integer d > 1 such that P(Xn = i | X1 = i) = 0 whenever n is not divisible by d
A chain is aperiodic if it contains no periodic state
Stationary Probabilities
Thm: If a chain is ergodic and aperiodic, then the limit
lim_{n→∞} P(Xn = j | X1 = i)
exists, and does not depend on i
Moreover, let
P*(X = j) = lim_{n→∞} P(Xn = j | X1 = i)
then P*(X) is the unique probability satisfying
P*(X = j) = Σ_i P(Xt+1 = j | Xt = i) P*(X = i)
Stationary Probabilities
Stationary Probabilities
The probability P*(X) is the stationary probability of the process
Regardless of the starting point, the process will converge to this probability
The rate of convergence depends on properties of the transition probability
Sampling from the stationary probability
This theory suggests how to sample from the stationary probability:
Set X1 = i, for some random/arbitrary i
For t = 1, 2, …, n:
Sample a value xt+1 for Xt+1 from P(Xt+1 | Xt = xt)
Return xn
If n is large enough, then this is a sample from P*(X)
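This procedure can be sketched in Python. The two-state chain and its transition numbers below are illustrative assumptions, not from the lecture; for this chain the stationary distribution works out to P*(X = 0) = 2/3.

```python
import random

# Hypothetical two-state chain; all probabilities are made up for illustration.
P_INIT = [0.5, 0.5]                   # P(X1)
P_TRANS = [[0.9, 0.1],                # P(X_{t+1} | X_t = 0)
           [0.2, 0.8]]                # P(X_{t+1} | X_t = 1)

def sample_chain(n, rng=random):
    """Run the chain for n steps and return the final state
    (approximately a draw from P* when n is large)."""
    x = rng.choices([0, 1], weights=P_INIT)[0]   # set X1 arbitrarily
    for _ in range(n):                           # simulate n transitions
        x = rng.choices([0, 1], weights=P_TRANS[x])[0]
    return x

# Solving p* = p* T for this chain gives p*(0) = 2/3.
draws = [sample_chain(30) for _ in range(5000)]
print(draws.count(0) / len(draws))   # close to 2/3
```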
Designing Markov Chains
How do we construct the right chain to sample from?
Ensuring aperiodicity and irreducibility is usually easy
The problem is ensuring the desired stationary probability
Designing Markov Chains
Key tool: If the transition probability satisfies

P(Xt+1 = j | Xt = i) / P(Xt+1 = i | Xt = j) = Q(X = j) / Q(X = i)  whenever P(Xt+1 = j | Xt = i) > 0

then P*(X) = Q(X)
This gives a local criterion for checking that the chain will have the right stationary distribution
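The ratio criterion (detailed balance) can be checked numerically: Q(i)·P(j|i) must equal Q(j)·P(i|j) for every pair of states. A minimal sketch, using an assumed 3-state transition matrix and a candidate distribution Q (both invented for illustration):

```python
import numpy as np

# Illustrative 3-state chain (numbers are assumptions, not from the text).
T = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])   # T[i, j] = P(X_{t+1} = j | X_t = i)
Q = np.array([1/3, 1/3, 1/3])        # candidate stationary distribution

# Detailed balance: Q[i] * T[i, j] == Q[j] * T[j, i] for all i, j,
# i.e. the matrix M[i, j] = Q[i] * T[i, j] must be symmetric.
M = Q[:, None] * T
balanced = np.allclose(M, M.T)
print(balanced)   # True -> Q is the stationary distribution P*
```

If the check passes, the theorem above guarantees Q is P* without computing any limits.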
MCMC Methods
We can use these results to sample from P(X1,…,Xn|e)
Idea: Construct an ergodic & aperiodic Markov Chain
such that P*(X1,…,Xn) = P(X1,…,Xn|e)
Simulate the chain for n steps to get a sample
MCMC Methods
Notes: The Markov chain variable Y takes as values assignments to all variables that are consistent with the evidence
For simplicity, we will denote such a state using the vector of variables:
V(Y) = { (x1, …, xn) ∈ V(X1) × … × V(Xn) | x1, …, xn satisfy e }
Gibbs Sampler
One of the simplest MCMC methods
At each transition, change the state of just one Xi
We can describe the transition probability as a stochastic procedure:
Input: a state x1, …, xn
Choose i at random (using uniform probability)
Sample x'i from P(Xi | x1, …, xi-1, xi+1, …, xn, e)
Let x'j = xj for all j ≠ i
Return x'1, …, x'n
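The transition procedure can be sketched directly. Here `full_conditional(i, state)` is a hypothetical helper, assumed to return the distribution P(Xi | all other variables, e) as a value → probability dict:

```python
import random

def gibbs_step(state, full_conditional, rng=random):
    """One Gibbs transition: resample a single, randomly chosen coordinate.

    `full_conditional(i, state)` is an assumed callback returning
    P(X_i | x_1, ..., x_{i-1}, x_{i+1}, ..., x_n, e) as {value: probability}.
    """
    i = rng.randrange(len(state))             # choose i uniformly at random
    dist = full_conditional(i, state)
    values, probs = zip(*dist.items())
    new_value = rng.choices(values, weights=probs)[0]   # sample x'_i
    new_state = list(state)
    new_state[i] = new_value                  # x'_j = x_j for all j != i
    return new_state
```

Note that exactly one coordinate can change per transition, which is what makes each step cheap.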
Correctness of Gibbs Sampler
By the chain rule,
P(x1, …, xi-1, xi, xi+1, …, xn | e) = P(x1, …, xi-1, xi+1, …, xn | e) P(xi | x1, …, xi-1, xi+1, …, xn, e)
Thus, we get

P(x1, …, xi-1, xi, xi+1, …, xn | e) / P(x1, …, xi-1, x'i, xi+1, …, xn | e) = P(xi | x1, …, xi-1, xi+1, …, xn, e) / P(x'i | x1, …, xi-1, xi+1, …, xn, e)

Since we choose i from the same distribution at each stage, this procedure satisfies the ratio criterion
Gibbs Sampling for Bayesian Networks
Why is the Gibbs sampler “easy” in BNs?
Recall that the Markov blanket of a variable separates it from the other variables in the network:
P(Xi | X1, …, Xi-1, Xi+1, …, Xn) = P(Xi | Mbi)
This property allows us to use local computations to perform sampling in each transition
Gibbs Sampling in Bayesian Networks
How do we evaluate P(Xi | x1,…,xi-1,xi+1,…,xn) ?
Let Y1, …, Yk be the children of Xi
By the definition of Mbi, the parents of each Yj are in Mbi ∪ {Xi}
It is easy to show that

P(xi | Mbi) = P(xi | Pai) Π_j P(yj | pa_yj) / Σ_{x'i} P(x'i | Pai) Π_j P(yj | pa_yj)

where pa_yj in the denominator is the assignment to Yj's parents with Xi set to x'i
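This formula means a Gibbs transition only touches Xi's CPT and its children's CPTs. A sketch on an assumed toy network A → B → C (all binary, with invented CPT numbers), sampling B given its Markov blanket {A, C}:

```python
import random

# Toy network A -> B -> C; the CPT numbers are illustrative assumptions.
P_B_given_A = {0: [0.8, 0.2], 1: [0.3, 0.7]}   # P(B | A=a) as [P(B=0), P(B=1)]
P_C_given_B = {0: [0.9, 0.1], 1: [0.4, 0.6]}   # P(C | B=b)

def sample_B(a, c, rng=random):
    """Sample B from P(B | A=a, C=c) using only B's Markov blanket {A, C}."""
    # Unnormalized weights: P(B=b | A=a) * P(C=c | B=b);
    # random.choices normalizes over the sum, matching the formula above.
    weights = [P_B_given_A[a][b] * P_C_given_B[b][c] for b in (0, 1)]
    return rng.choices([0, 1], weights=weights)[0]

# With A=1 and evidence C=1:
# P(B=1 | A=1, C=1) = 0.7*0.6 / (0.3*0.1 + 0.7*0.6) = 0.42/0.45
draws = [sample_B(1, 1) for _ in range(10000)]
print(draws.count(1) / len(draws))   # close to 0.42/0.45 ≈ 0.933
```

Only two CPT rows per value of B are read, no matter how large the rest of the network is.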
Sampling Strategy
How do we collect the samples?
Strategy I: Run the chain M times, each run for N steps
Each run starts from a different starting point
Return the last state in each run
[Figure: M independent chains]
Sampling Strategy
Strategy II: Run one chain for a long time
After some “burn in” period, sample points every fixed number of steps
[Figure: M samples from one chain after the “burn in” period]
Comparing Strategies
Strategy I: Better chance of “covering” the space of points, especially if the chain is slow to reach stationarity
Have to perform “burn in” steps for each chain
Strategy II: Perform “burn in” only once
Samples might be correlated (although only weakly)
Hybrid strategy: run several chains, and take a few samples from each
Combines the benefits of both strategies
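The hybrid strategy can be sketched as follows. Here `step` and `init` are placeholders for the chain's transition and an arbitrary initialization, and all run lengths (number of chains, burn-in, thinning interval) are illustrative choices, not values from the lecture:

```python
import random

def run_hybrid(step, init, n_chains=4, burn_in=500, thin=50, per_chain=10,
               rng=random):
    """Hybrid strategy: several chains, one burn-in each,
    then a few thinned samples per chain."""
    samples = []
    for _ in range(n_chains):
        x = init()                      # arbitrary starting point per chain
        for _ in range(burn_in):        # "burn in" once per chain
            x = step(x)
        for _ in range(per_chain):      # collect a sample every `thin` steps
            for _ in range(thin):
                x = step(x)
            samples.append(x)
    return samples

# Usage with an assumed two-state chain (transition numbers are illustrative):
def flip(x, rng=random):
    return rng.choices([0, 1], weights=[0.9, 0.1] if x == 0 else [0.2, 0.8])[0]

samples = run_hybrid(flip, lambda: 0)
print(len(samples))   # n_chains * per_chain = 40 samples
```

Thinning reduces (but does not eliminate) the correlation between consecutive samples from the same chain, while the multiple chains improve coverage of the state space.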