Statistical modeling with stochastic processes

Alexandre Bouchard-Côté, Lecture 3, Monday March 7
Tuesday, March 8, 2011
Plan for today
Exact inference review
Approximate inference, part I: MCMC
  Gibbs
  Metropolis-Hastings
  Overview of available theoretical results
  Tricks of the trade
Review
Why do we know the marginals? By definition!
[Figure: sample paths of the process plotted over time, with two index points s1, s2 and an event A marked]

What are the bare minimum conditions for λ to be the marginals of Ys? I.e., we want λs(A) = P(Ys ∈ A), etc.
Why do we know the marginals? By definition! (cont.)

λ_{s1}(A) = λ_{s1,s2}(A, ℝ)             [marginalization]
λ_{s1,s2}(A1, A2) = λ_{s2,s1}(A2, A1)   [permutation]
Why do we know the marginals? By definition! (cont.)

What are the bare minimum conditions for λ to be the marginals of Ys?

λ_{s1}(A) = λ_{s1,s2}(A, ℝ)             [marginalization]
λ_{s1,s2}(A1, A2) = λ_{s2,s1}(A2, A1)   [permutation]

Kolmogorov extension theorem: if these consistency conditions hold for any finite number of variables (not just a pair), then there is a joint stochastic process with these marginals.
The Bayesian choice
Task: given an observed random variable Y, what value should we guess for a related random variable X which is unobserved?
Criterion: if we make a guess x and the real value is x*, we pay a cost of L(x,x*) --- this is called a loss function.
In the Bayesian framework: you should answer
argmin_x E[ L(x, X) | Y ]
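The Bayes estimator above can be sketched in a few lines. This is my own minimal illustration (not from the slides), for a discrete posterior represented as a dictionary; `bayes_estimator` and `zero_one` are hypothetical names:

```python
# Sketch: the Bayes estimator argmin_x E[L(x, X) | Y] for a discrete posterior.
def bayes_estimator(posterior, loss):
    """posterior: dict value -> P(X = value | Y); loss: L(guess, truth)."""
    def expected_loss(guess):
        return sum(p * loss(guess, x) for x, p in posterior.items())
    return min(posterior, key=expected_loss)

# Under 0-1 loss, the Bayes estimator is the posterior mode (the MAP):
posterior = {"a": 0.2, "b": 0.5, "c": 0.3}
zero_one = lambda guess, truth: 0.0 if guess == truth else 1.0
print(bayes_estimator(posterior, zero_one))  # "b"
```

With a different loss (e.g. squared error on numeric values) the same routine returns a different estimator, which is the point of the criterion.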
Directed Graphical Models
Example:
Interpretation: the collection of all distributions that can be factorized as
p(x,y,z) = p1(x) p2(y|x) p3(z|y)
for some non-negative p_i's such that, for each w,

∫ p_i(v|w) m(dv) = 1

[Graph: directed chain X → Y → Z]
Undirected Graphical Models
Example:
Interpretation: the collection of all distributions whose density can be factorized as

p(x,y,z) = f1(x,y) f2(y,z)

for some non-negative f_i

[Graph: undirected chain X - Y - Z]
Exact inference and dynamic programming
[Figure: HMM graphical model with hidden states x1, x2, x3, observations y1, y2, y3, and parameters π, θ(c), T(c) for c = 1, 2]
Suppose: parameters are known, so we condition on them
Exact inference and dynamic programming
Next step: turning the directed model into an undirected one
[Figure: 'moralization' turns directed X → V ← Y into an undirected graph, connecting the parents X and Y of the common child V]
Exact inference and dynamic programming
Simplifying undirected models: conditioning on an observed node Y0 = y0 turns a factor f(x, y) into a smaller factor over x alone:

f'(x) = f(x, y0)
Exact inference and dynamic programming

Consequence of simplification: renormalization needed

Example:
f1(x) = p1(x)
f2(y|x) = p2(y|x)
f'2(x) = p2(y0|x)

P(X = x | Y = y0) = f1(x) f'2(x) / Σ_{x'} f1(x') f'2(x') = f1(x) f'2(x) / Z

Bayes rule: can interpret Z as P(Y = y0)
Further simplifications
Pointwise multiplication:  f'2(x,y) = f1(x) f2(x,y)

Marginalization:  f'(x) = Σ_y f(x, y)

[Figure: factor-graph fragments before and after each operation]
Efficient inference: elimination algorithm
Consequence: for chains, and more generally for tree-shaped undirected graphical models, efficient computation of Z and of one- or two-node marginals.

Far fewer operations than naive enumeration! In general, if a chain has length T and N states, computing Z takes on the order of T·N² operations instead of N^T.

For tree-shaped models: same story. For non-tree models: we need to figure out something else...
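The T·N² versus N^T claim is easy to check on a small chain. A sketch (mine, not from the slides) of the elimination recursion for the partition function of a chain with pairwise potentials, verified against brute-force enumeration:

```python
import itertools

def chain_Z_eliminate(length, n_states, f):
    """Z = sum over all state sequences of prod_t f(t, x_{t-1}, x_t)."""
    msg = [1.0] * n_states  # messages after eliminating x_1 .. x_{t-1}
    for t in range(1, length):
        msg = [sum(msg[a] * f(t, a, b) for a in range(n_states))
               for b in range(n_states)]   # O(N^2) per step
    return sum(msg)

def chain_Z_naive(length, n_states, f):
    total = 0.0
    for xs in itertools.product(range(n_states), repeat=length):  # N^T terms
        p = 1.0
        for t in range(1, length):
            p *= f(t, xs[t - 1], xs[t])
        total += p
    return total

# Arbitrary positive pairwise potential favoring equal neighbors:
f = lambda t, a, b: 2.0 if a == b else 1.0
z1 = chain_Z_eliminate(5, 3, f)
z2 = chain_Z_naive(5, 3, f)
assert abs(z1 - z2) < 1e-9
```

For this symmetric potential each elimination step multiplies every message entry by 4, so Z = 3·4⁴ = 768; the naive sum over 3⁵ = 243 sequences agrees.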
MCMC
MCMC methods
What it does: Same as the elimination algorithm (normalization and posterior), but not limited to trees.
Output: a list of samples, i.e. the model with values for the hidden nodes filled in (imputed)
[Figure: a sequence of sampled 3×3 grids, each cell labeled B or P, one grid per sample]
A bit of history
MC: Usually credited to Stanislaw Ulam, during the Manhattan project.
MCMC: Metropolis, N.; Rosenbluth, A.W.; Rosenbluth, M.N.; Teller, A.H.; Teller, E. They ran their chain for 48 iterations on a computer called MANIAC (it still took five hours).
MCMC methods: how does it work?
Things to discuss:
How to compute posterior expectations from these samples (e.g. Bayes estimator)
How to create the samples so that they are approximately distributed according to the posterior?
How to compute Z from these samples
[Figure: a sequence of sampled 3×3 B/P grids]
Computing the posterior

Samples: X^(1), X^(2), X^(3), ... (each sample is a full filled-in grid; X^(1)_{1,1}, X^(3)_{1,3}, etc. denote individual variables within a sample)

Monte Carlo estimator: for S samples, compute

E f(X) ≈ (1/S) Σ_{i=1}^S f(X^(i))

In discrete models, f is generally a vector of indicator functions on variables and values, e.g. f_{1,3;B}(X^(3)) = 1 when variable (1,3) of sample 3 equals B
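With indicator functions, the Monte Carlo estimator above is just a frequency count over the samples. A small sketch (names like `marginal_estimate` and `x13` are my own, for illustration):

```python
# Estimate a marginal P(variable = value | obs) from MCMC samples using
# E f(X) ≈ (1/S) Σ f(X^(i)) with f an indicator function.
def marginal_estimate(samples, var, value):
    """samples: list of dicts mapping variable name -> state."""
    return sum(1.0 for s in samples if s[var] == value) / len(samples)

samples = [{"x13": "B"}, {"x13": "P"}, {"x13": "B"}, {"x13": "B"}]
print(marginal_estimate(samples, "x13", "B"))  # 0.75
```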
MCMC methods: how does it work?
Things to discuss (note: assume for now that the state is discrete):
How to compute posterior expectations from these samples (e.g. Bayes estimator)
How to create the samples so that they are approximately distributed according to the posterior?
How to compute Z from these samples
Let’s start with an easy special case: ‘Naive’ Gibbs sampling
Initialization: guess arbitrary values for the hidden nodes
[Figure: a 3×3 grid with an arbitrary initial B/P assignment to the hidden nodes]

Idea: at each iteration, maintain a guess for all the hidden nodes
Let’s start with an easy special case: ‘Naive’ Gibbs sampling

Loop: pick one node (i,j) at random, erase the guessed value at (i,j), and freeze the values of the other nodes

Then: resample a value for the node (i,j) conditioning on all the others, and write it to the current state at (i,j). Easy!

[Figure: the grid before, during, and after resampling a single node]
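The loop above can be sketched concretely. This is my own toy single-site Gibbs sampler on a small Ising-style grid with states "B"/"P"; the coupling constant J, the free boundary, and all names are assumptions for illustration, not from the slides:

```python
import math, random

random.seed(0)
ROWS, COLS, J = 3, 3, 0.5          # J > 0 favors equal neighbors
STATES = ("B", "P")

def neighbors(i, j):
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= i + di < ROWS and 0 <= j + dj < COLS:
            yield i + di, j + dj

def gibbs_update(state, i, j):
    """Resample node (i, j) conditioned on all the other (frozen) nodes."""
    weights = []
    for s in STATES:
        agree = sum(1 for n in neighbors(i, j) if state[n] == s)
        weights.append(math.exp(J * agree))
    r = random.random() * sum(weights)
    state[i, j] = STATES[0] if r < weights[0] else STATES[1]

# Initialization: guess arbitrary values, then loop over random single nodes.
state = {(i, j): random.choice(STATES) for i in range(ROWS) for j in range(COLS)}
for _ in range(100):
    i, j = random.randrange(ROWS), random.randrange(COLS)
    gibbs_update(state, i, j)
```

Each pass resamples one node from its exact conditional given the rest, which is all the naive scheme requires.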
Better Gibbs samplers

Loop: pick a subset of nodes N at random, erase the guessed values in N, and freeze the values of the nodes not in N

Then: resample values for the nodes in N conditioning on all the others, and write them to the current state at N. Easy?

[Figure: the grid with a whole block of nodes erased and resampled jointly]
Next: Metropolis Hastings
Theoretical framework: The goal is to approximate
Method: build a giant Markov chain T converging to target(x)
This construction is called a Metropolis-Hastings chain and Gibbs sampling is a special case of it.
target(x) = P(X = x|obs, params)
Why does it work?
Next: Metropolis Hastings

Markov chain / transition matrix:  T_{s,s'} = P(X_{t+1} = s' | X_t = s)

Asymptotic/stationary distribution:  statio(x) = lim_{t→∞} P(X_t = x | X_0 = x_0)

Each state of this chain is a full copy of the latent variables!

[Figure: a chain of full 3×3 B/P grid configurations linked by transitions]
Metropolis Hastings

Why it’s huge: for a 3×3 grid with 2 states per node, T is a 2⁹ × 2⁹ matrix

[Figure: the transition matrix T with rows and columns indexed by full grid configurations; entries like 0.1, 0.01, ...]

Way too large to represent in memory, but we will compute entries on the fly
Question

How to build T such that statio(x) = target(x), where

target(x) = P(X = x | obs, params)
statio(x) = lim_{t→∞} P(X_t = x | X_0 = x_0)

First step: finding a better expression for statio(x)
P(Xt+2 = s!|Xt = s) =
!"
s!!
Ts,s!!Ts!!,s!
#
s,s!
=!T 2
"s,s!
Tn
Finding a better expression for statio(x)
Ts,s’ = P(Xt +1 = s’ | Xt = s)One step transition:
Two steps transition:
n-steps transition:
Note: this is a special case of an important principle: Chapman–Kolmogorov equation
s s’’ s’
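The two-step identity can be checked mechanically: the sum over the intermediate state s'' is exactly a matrix product. A tiny sketch (my own 2×2 example):

```python
# Check that two-step transition probabilities are the entries of T^2,
# per the Chapman–Kolmogorov equation.
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

T = [[0.9, 0.1],
     [0.4, 0.6]]
T2 = mat_mul(T, T)

# Entry (0, 1) by direct summation over the intermediate state s'':
p_two_step = sum(T[0][k] * T[k][1] for k in range(2))
assert abs(T2[0][1] - p_two_step) < 1e-12
```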
Finding a better expression for statio(x)

Definition (‘infinite steps’ transition):  T^∞ = lim_{n→∞} Tⁿ

What (matrix-valued) equation should the infinite transition satisfy?
Finding a better expression for statio(x) (cont.)

Answer: the infinite-steps transition should satisfy  T^∞ = T^∞ T
Finding a better expression for statio(x)

Definition (‘infinite steps’ transition):  T^∞ = lim_{n→∞} Tⁿ

Hope: T^∞ has all rows equal to some distribution π

That would mean that no matter what state we use to initialize the sampler, the distribution over the n-th state converges to a distribution called the stationary distribution: π(x) = statio(x) = target(x)
Finding a better expression for statio(x)

Definition (‘infinite steps’ transition):  T^∞ = lim_{n→∞} Tⁿ

Hope: T^∞ has all rows equal to some distribution π

When this is the case (we will see the conditions later):

π = πT,  or equivalently  π(x) = Σ_y π(y) T_{y,x}
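The fixed-point equation π = πT can be found numerically by repeated multiplication (power iteration). A sketch, mine rather than the lecture's, for a small 2-state chain:

```python
# Power iteration: start from any distribution and apply pi <- pi T.
def step(pi, T):
    n = len(T)
    return [sum(pi[y] * T[y][x] for y in range(n)) for x in range(n)]

T = [[0.9, 0.1],
     [0.4, 0.6]]
pi = [1.0, 0.0]              # arbitrary initial distribution
for _ in range(200):
    pi = step(pi, T)

# For this chain the stationary distribution is (0.8, 0.2),
# since 0.8 * 0.1 = 0.2 * 0.4 balances the cross flows.
assert abs(pi[0] - 0.8) < 1e-9 and abs(pi[1] - 0.2) < 1e-9
```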
Building T such that statio(x) = target(x)

From the previous result, we want T such that:

target(x) = Σ_y target(y) T_{y,x}

Next: let’s see if Gibbs satisfies this equation!

Definition: let R denote the set of states reachable by the current Gibbs move. E.g., in the previous Ising example, R has two elements.

[Figure: the two grid configurations in R, differing only at the resampled node]
Building T such that statio(x) = target(x)

Goal: let’s see if Gibbs satisfies this equation:

target(x) = Σ_y target(y) T_{y,x}   (1)

First: let’s find what T_{y,x} is
Building T such that statio(x) = target(x) (cont.)

The Gibbs kernel renormalizes the target over the reachable set R:

T_{y,x} = 1[x, y ∈ R] target(x) / Σ_{x'} 1[x', y ∈ R] target(x')   (2)
Building T such that statio(x) = target(x) (cont.)

target(x) = Σ_y target(y) T_{y,x}   (1)

T_{y,x} = 1[x, y ∈ R] target(x) / Σ_{x'} 1[x', y ∈ R] target(x')   (2)

Finally: plug (2) into (1) and check that it works
Gibbs is not always applicable
Example: a non-conjugate prior, in which case even a single node has no analytic posterior expression

Generalization: instead of requiring that T be proportional to the target distribution, use an arbitrary proposal q and correct the discrepancy between q and the target distribution with an accept/reject step
Terminology: Metropolis-Hastings
Metropolis-Hastings meta-algorithm

Metropolis-Hastings( target(x), q(xnext|xcur), f(x) ):

  Initialize x0 arbitrarily
  F = 0; N = 0
  For t = 1 ... S:
    1. Propose a new state xprop according to q( · | x_{t-1})
    2. Compute:
       A(x_{t-1} → xprop) = min{ 1, [target(xprop) q(x_{t-1}|xprop)] / [target(x_{t-1}) q(xprop|x_{t-1})] }
    3. Set x_t to xprop with probability A(x_{t-1} → xprop), otherwise set x_t to x_{t-1}
    4. F = F + f(x_t); N = N + 1
  Return F/N   ≈ E[f(X)] for X ~ target
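The meta-algorithm above is short enough to run directly. A sketch with assumptions of my own: the target is a discrete distribution on {0, ..., 3} known only up to normalization, and q is a symmetric random walk on that set, so the q terms in A cancel:

```python
import random

random.seed(1)
K = 4
unnorm = [1.0, 2.0, 3.0, 4.0]          # target(x) proportional to unnorm[x]

def propose(x):
    return (x + random.choice((-1, 1))) % K   # symmetric proposal

def metropolis_hastings(f, S=50000):
    x, F = 0, 0.0
    for _ in range(S):
        xprop = propose(x)
        A = min(1.0, unnorm[xprop] / unnorm[x])   # symmetric q cancels
        if random.random() < A:
            x = xprop
        F += f(x)
    return F / S

est = metropolis_hastings(lambda x: x)
exact = sum(x * w for x, w in enumerate(unnorm)) / sum(unnorm)  # 2.0
```

Note that only the unnormalized weights appear in A, which is exactly why MCMC does not need Z.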
Why Metropolis-Hastings works

From the previous result, want T such that:

target(x) = Σ_y target(y) T_{y,x}

Sufficient condition (sum over y on both sides to recover the equation above):

target(x) T_{x,y} = target(y) T_{y,x}

This is called the detailed balance or reversibility condition
Why Metropolis-Hastings works
target(x) T_{x,y} = target(y) T_{y,x}

Goal: checking detailed balance for the MH kernel T

First: what is T_{x,y}? When x = y, the result trivially holds, so let’s assume that x ≠ y

When x ≠ y, T_{x,y} is equal to the probability (1) that y is proposed by q(·|x), times (2) the probability that it is accepted:

T_{x,y} = q(y|x) A(x → y)
Why Metropolis-Hastings works
Final step: using the form of T_{x,y} for x ≠ y to check detailed balance for the MH kernel T

Goal:  target(x) T_{x,y} = target(y) T_{y,x}

Known:  T_{x,y} = q(y|x) A(x → y),  with
A(x_{t-1} → xprop) = min{ 1, [target(xprop) q(x_{t-1}|xprop)] / [target(x_{t-1}) q(xprop|x_{t-1})] }
Notes on Metropolis-Hastings
Critical: the target and proposal densities always appear as ratios, so if they are only known up to a normalization Z, the normalizations cancel out
A(x_{t-1} → xprop) = min{ 1, [target(xprop) q(x_{t-1}|xprop)] / [target(x_{t-1}) q(xprop|x_{t-1})] }
Practical note: should be computed in log space and exponentiated only after taking ratio (difference of logs)
Special cases: when q is symmetric (e.g. isotropic normal), the q’s cancel out as well. When q(-|xcur) is independent of xcur, it’s called an independence chain (still has dependence because of A)
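The log-space practical note can be sketched as follows (my own helper names; the inputs are log-densities, which may come from any model):

```python
import math

def log_accept_ratio(log_target_prop, log_target_cur,
                     log_q_cur_given_prop, log_q_prop_given_cur):
    """Compute log A = min(0, log of the MH ratio), never exponentiating
    the individual densities."""
    log_ratio = (log_target_prop + log_q_cur_given_prop
                 - log_target_cur - log_q_prop_given_cur)
    return min(0.0, log_ratio)

# Densities like exp(-2000) underflow to 0.0 if exponentiated first,
# but the log-space ratio is perfectly well behaved:
logA = log_accept_ratio(-2000.0, -1999.0, -3.0, -3.0)
assert abs(math.exp(logA) - math.exp(-1.0)) < 1e-12
```

One then accepts when `math.log(random.random()) < logA`, avoiding the exponentiation entirely when the ratio is extreme.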
Useful theoretical results

Definition (‘infinite steps’ transition):  T^∞ = lim_{n→∞} Tⁿ

Hope: T^∞ has all rows equal to some distribution π

When does this limit make sense (when does it exist, and is it stochastic)?

When does the limit have this form?
Counter-example 1

T = [ 0  1
      1  0 ]

T^∞ is not even defined!

Problem: a waltz between states

Definition: a state s (or chain) has period k if any return to state s must occur in multiples of k steps. The chain is aperiodic if one (equivalently, all) states have period 1.

Easy to avoid: add epsilon self-transitions
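Counter-example 1 is easy to reproduce: the powers of the waltz matrix oscillate forever, while adding epsilon self-transitions restores convergence. A sketch (the specific epsilon is my choice):

```python
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(T, n):
    R = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(n):
        R = mat_mul(R, T)
    return R

T = [[0.0, 1.0], [1.0, 0.0]]
assert mat_pow(T, 100) == [[1.0, 0.0], [0.0, 1.0]]   # even powers: identity
assert mat_pow(T, 101) == [[0.0, 1.0], [1.0, 0.0]]   # odd powers: the waltz

eps = 0.1                                   # epsilon self-transitions
T_fixed = [[eps, 1 - eps], [1 - eps, eps]]
P = mat_pow(T_fixed, 200)
# Now the powers converge (here to the uniform distribution in each row):
assert all(abs(p - 0.5) < 1e-9 for row in P for p in row)
```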
Counter-example 2

T = [ 1/2  1/2  0
      1/2  1/2  0
       0    0   1 ]

Exercise: the asymptotic distribution depends on the starting state. In fact, T^∞ = T.

Problem: some pairs of states cannot reach each other

Definition: an irreducible chain is a chain where there is a path between each pair of states (for each x, y there is an integer n such that (Tⁿ)_{x,y} > 0)
Are the MH and Gibbs kernels we have introduced earlier irreducible?

Example: is this irreducible?

[Figure: two grid configurations and the reachable set R of a single-node Gibbs move]
Are the MH and Gibbs kernels we have introduced earlier irreducible?

Example: is this irreducible?

T_{y,x} = 1[x, y ∈ R] target(x) / Σ_{x'} 1[x', y ∈ R] target(x')

[Figure: the two grid configurations in the reachable set R; a kernel that only ever resamples one fixed node cannot reach every configuration on its own]
Are the MH and Gibbs kernels we have introduced earlier irreducible?

Solution 1: mixing kernels. Suppose we have one Gibbs kernel for each variable, T(1), ..., T(9). Then a mixture of them is also reversible (by linearity):

T = Σ_{k=1}^{9} α_k T(k)
Are the MH and Gibbs kernels we have introduced earlier irreducible?

Solution 1: mixing kernels,  T = Σ_{k=1}^{9} α_k T(k)

Solution 2: alternating kernels deterministically (i.e., apply the first, then the second, etc.):

T_{x,y} = Σ_{x_1} ··· Σ_{x_8} T(1)_{x,x_1} T(2)_{x_1,x_2} ··· T(9)_{x_8,y}

Often works better: shuffle, then alternate
Existence of π such that π = πT
Suppose (still assuming a discrete state space):
1. T is irreducible
2. T is aperiodic
Consequence: There is a unique probability distribution π such that π = πT
Proof: consequence of the Perron–Frobenius theorem (Tⁿ is positive for n large enough, and π is then the eigenvector corresponding to the unique eigenvalue of highest modulus). Note: this can be used to debug samplers.

More general arguments exist.
Convergence theorem 1

Suppose (still assuming a discrete state space):
1. T is irreducible
2. T is aperiodic

Consequence: there is a unique probability distribution π such that π = πT; moreover, for all x,

lim_{n→∞} (Tⁿ)_{x,y} = π(y)

i.e. T^∞ has all rows equal to π
Proof: coupling argument

Idea: simulate a pair of chains (X_t, Y_t) such that the marginal transitions are given by T:

P(X_t = x' | X_{t-1} = x, Y_{t-1} = y) = T_{x,x'}

Joint distribution: simulate independent transitions if x ≠ y, and identical transitions if x = y

[Figure: trajectories of X_t and Y_t over t = 1, ..., 5; each point is a possible configuration of the latent variables]
Proof: coupling argument

Initial distributions: X_0 ~ π and Y_0 ~ an arbitrary distribution

Note: X_t ~ π for all t since π = πT

Goal: showing that

lim_{n→∞} Σ_y | P(X_n = y) - P(Y_n = y) | = 0

Notation: the hitting time H is the first time at which X_t = Y_t (e.g., H = 3 in the figure)
This is not exactly what we need though...

Recall: the goal is to compute an expectation (with respect to the posterior distribution), not to sample!

Connection: the law of large numbers

Vanilla version: if the X_t are iid π and f has finite expectation, then

lim_{S→∞} (1/S) Σ_{t=1}^S f(X_t) = Σ_x f(x) π(x)

Misconception: that to have the same conclusion hold for MCMC, we need to burn in and/or thin the chain
Burn-in and thinning

Burn-in (discard the first b samples):  (1/(S - b)) Σ_{t=b}^{S} f(X_t)

Thinning (keep every n-th sample):  (1/(S/n)) Σ_{t ∈ {n, 2n, ..., S}} f(X_t)
Burn-in and thinning are unnecessary
The law of large numbers for Markov chains: if X_t is an irreducible Markov chain with stationary distribution π and f has finite expectation, then

lim_{S→∞} (1/S) Σ_{t=1}^S f(X_t) = Σ_x f(x) π(x)

Note 1: aperiodicity is not needed for this result
Note 2: for small S, burning in might improve the estimator, but one might as well maximize during burn-in
Note 3: thinning to reduce auto-correlation is not a good idea and can be harmful (the only reasons to do it are to save memory or memory writes; most of the time only finite-dimensional sufficient statistics need to be stored)
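The Markov-chain law of large numbers can be illustrated on the 2-state chain used earlier: the plain ergodic average over all samples, with no burn-in and no thinning, converges to the stationary expectation. A sketch of mine (seed, S, and the chain are illustrative choices):

```python
import random

random.seed(2)
T = [[0.9, 0.1],
     [0.4, 0.6]]                 # stationary distribution: (0.8, 0.2)

def sample_next(x):
    return 0 if random.random() < T[x][0] else 1

S, x, total = 200000, 1, 0.0     # deliberately start in the less likely state
for _ in range(S):
    x = sample_next(x)
    total += x                   # f(x) = x, so E_pi f = 0.2
print(total / S)
```

The printed average is close to 0.2 even though every early, unrepresentative sample is kept in the sum.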