MCMC Methods: Gibbs Sampling and the Metropolis-Hastings Algorithm
Patrick Lam
Outline
Introduction to Markov Chain Monte Carlo
Gibbs Sampling
The Metropolis-Hastings Algorithm
What is Markov Chain Monte Carlo (MCMC)?
Markov Chain: a stochastic process in which future states are independent of past states given the present state
Monte Carlo: simulation
Up until now, we've done a lot of Monte Carlo simulation to find integrals rather than computing them analytically, a process called Monte Carlo Integration.
Basically a fancy way of saying we can take quantities of interest of a distribution from simulated draws from the distribution.
Monte Carlo Integration
Suppose we have a distribution p(θ) (perhaps a posterior) that we want to take quantities of interest from.
To derive them analytically, we need to take integrals:

$$I = \int_\Theta g(\theta)\, p(\theta)\, d\theta$$

where g(θ) is some function of θ (g(θ) = θ for the mean and g(θ) = (θ − E(θ))² for the variance).
We can approximate the integrals via Monte Carlo Integration by simulating M values from p(θ) and calculating

$$I_M = \frac{1}{M} \sum_{i=1}^{M} g\!\left(\theta^{(i)}\right)$$
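The averaging above can be sketched in a few lines. The slides use R, but here is an illustrative Python version, where `monte_carlo_integral`, `g`, and `draw` are made-up names standing in for the quantity of interest and the sampler:

```python
import random

def monte_carlo_integral(g, draw, M=100_000, seed=0):
    """Approximate I = integral of g(theta) p(theta) dtheta by
    averaging g over M simulated draws theta^(i) from p."""
    rng = random.Random(seed)
    return sum(g(draw(rng)) for _ in range(M)) / M

# Example: mean of Uniform(0, 1), i.e. g(theta) = theta; true value 0.5.
approx = monte_carlo_integral(lambda theta: theta, lambda rng: rng.random())
```

With M = 100,000 draws the estimate lands within about a hundredth of the true value.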
For example, we can compute the expected value of the Beta(3,3) distribution analytically:

$$E(\theta) = \int_\Theta \theta\, p(\theta)\, d\theta = \int_\Theta \theta\, \frac{\Gamma(6)}{\Gamma(3)\Gamma(3)}\, \theta^2 (1-\theta)^2\, d\theta = \frac{1}{2}$$

or via Monte Carlo methods:

```r
> M <- 10000
> beta.sims <- rbeta(M, 3, 3)
> sum(beta.sims)/M
[1] 0.5013
```

Our Monte Carlo approximation $I_M$ is a simulation-consistent estimator of the true value $I$: $I_M \to I$ as $M \to \infty$.
We know this to be true from the Strong Law of Large Numbers.
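The same machinery handles a different g(θ). As a sketch (not from the slides), this Python snippet estimates both the mean and the variance of Beta(3,3); the true variance is ab/((a+b)²(a+b+1)) = 9/252 = 1/28 ≈ 0.036:

```python
import random

# Monte Carlo estimates of the mean and variance of Beta(3, 3).
# True mean: 1/2. True variance: 1/28.
rng = random.Random(42)
M = 100_000
sims = [rng.betavariate(3, 3) for _ in range(M)]

mean_hat = sum(sims) / M                              # g(theta) = theta
var_hat = sum((t - mean_hat) ** 2 for t in sims) / M  # g(theta) = (theta - E(theta))^2
```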
Strong Law of Large Numbers (SLLN)
Let $X_1, X_2, \ldots$ be a sequence of independent and identically distributed random variables, each having a finite mean $\mu = E(X_i)$.
Then with probability 1,

$$\frac{X_1 + X_2 + \cdots + X_M}{M} \to \mu \quad \text{as } M \to \infty$$

In our previous example, each simulation draw was independent and distributed from the same Beta(3,3) distribution.
This also works with variances and other quantities of interest, since functions of i.i.d. random variables are also i.i.d. random variables.
But what if we can't generate draws that are independent?
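The SLLN is easy to check empirically. This illustrative Python sketch (not from the slides) tracks the running mean of i.i.d. Uniform(0,1) draws, which should settle near μ = 0.5 as M grows:

```python
import random

# Running mean of i.i.d. Uniform(0, 1) draws; by the SLLN it converges
# to mu = E(X_i) = 0.5 with probability 1 as M grows.
rng = random.Random(7)
total = 0.0
running_mean = None
for m in range(1, 200_001):
    total += rng.random()
    running_mean = total / m
```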
Suppose we want to draw from our posterior distribution p(θ|y), but we cannot sample independent draws from it.
For example, we often do not know the normalizing constant.
However, we may be able to sample draws from p(θ|y) that are slightly dependent.
If we can sample slightly dependent draws using a Markov chain, then we can still find quantities of interest from those draws.
What is a Markov Chain?
Definition: a stochastic process in which future states are independent of past states given the present state
Stochastic process: a consecutive set of random (not deterministic) quantities defined on some known state space Θ.
- think of Θ as our parameter space.
- consecutive implies a time component, indexed by t.

Consider a draw of θ^(t) to be a state at iteration t. The next draw θ^(t+1) is dependent only on the current draw θ^(t), and not on any past draws.
This satisfies the Markov property:

$$p(\theta^{(t+1)} \mid \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(t)}) = p(\theta^{(t+1)} \mid \theta^{(t)})$$
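As an illustration of the Markov property, a small Python simulation (with a made-up three-state transition matrix) draws each new state using only the current one:

```python
import random

# Each next state is sampled from a distribution that depends only on
# the current state -- the Markov property in action. P is made up.
P = {
    "A": [("A", 0.5), ("B", 0.3), ("C", 0.2)],
    "B": [("A", 0.2), ("B", 0.6), ("C", 0.2)],
    "C": [("A", 0.3), ("B", 0.3), ("C", 0.4)],
}

def step(state, rng):
    """Draw the next state from the row of P indexed by `state`."""
    u, cum = rng.random(), 0.0
    for nxt, p in P[state]:
        cum += p
        if u < cum:
            return nxt
    return P[state][-1][0]  # guard against floating-point round-off

rng = random.Random(0)
chain = ["A"]
for _ in range(10_000):
    chain.append(step(chain[-1], rng))
```

Note that `step` never looks at `chain[:-1]`: the history beyond the current state is irrelevant.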
So our Markov chain is a bunch of draws of θ that are each slightly dependent on the previous one. The chain wanders around the parameter space, remembering only where it has been in the last period.
What are the rules governing how the chain jumps from one state to another at each period?
The jumping rules are governed by a transition kernel, which is a mechanism that describes the probability of moving to some other state based on the current state.
Transition Kernel
For a discrete state space (k possible states): a k × k matrix of transition probabilities.
Example: Suppose k = 3. The 3 × 3 transition matrix P would be

$$P = \begin{pmatrix}
p(\theta_A^{(t+1)} \mid \theta_A^{(t)}) & p(\theta_B^{(t+1)} \mid \theta_A^{(t)}) & p(\theta_C^{(t+1)} \mid \theta_A^{(t)}) \\
p(\theta_A^{(t+1)} \mid \theta_B^{(t)}) & p(\theta_B^{(t+1)} \mid \theta_B^{(t)}) & p(\theta_C^{(t+1)} \mid \theta_B^{(t)}) \\
p(\theta_A^{(t+1)} \mid \theta_C^{(t)}) & p(\theta_B^{(t+1)} \mid \theta_C^{(t)}) & p(\theta_C^{(t+1)} \mid \theta_C^{(t)})
\end{pmatrix}$$

where the subscripts index the 3 possible values that θ can take.
The rows sum to one and define a conditional PMF, conditional on the current state. The columns are the marginal probabilities of being in a certain state in the next period.
For a continuous state space (infinitely many possible states), the transition kernel is a set of conditional PDFs: $f(\theta_j^{(t+1)} \mid \theta_i^{(t)})$.
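A concrete numeric instance (values assumed purely for illustration) makes the row constraint easy to check:

```python
# Entry P[i][j] is the probability of moving from state i to state j.
# Each row is a conditional PMF and must sum to one; columns need not.
P = [
    [0.5, 0.3, 0.2],  # transitions out of state A
    [0.2, 0.6, 0.2],  # transitions out of state B
    [0.3, 0.3, 0.4],  # transitions out of state C
]
row_sums = [sum(row) for row in P]
col_sums = [sum(P[i][j] for i in range(3)) for j in range(3)]
```

Here every row sum is exactly 1, while the column sums (1.0, 1.2, 0.8) are unconstrained.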
How Does a Markov Chain Work? (Discrete Example)

1. Define a starting distribution $\pi^{(0)}$ (a $1 \times k$ vector of probabilities that sum to one).
2. At iteration 1, our distribution $\pi^{(1)}$ (from which $\theta^{(1)}$ is drawn) is
   $$\pi^{(1)} = \pi^{(0)} \times P$$
   with dimensions $(1 \times k) = (1 \times k) \times (k \times k)$.
3. At iteration 2, our distribution $\pi^{(2)}$ (from which $\theta^{(2)}$ is drawn) is
   $$\pi^{(2)} = \pi^{(1)} \times P$$
4. At iteration t, our distribution $\pi^{(t)}$ (from which $\theta^{(t)}$ is drawn) is
   $$\pi^{(t)} = \pi^{(t-1)} \times P = \pi^{(0)} \times P^t$$
How Does a Markov Chain Work? (Discrete Example)
1. Define a starting distribution∏(0) (a 1× k vector of
probabilities that sum to one).
2. At iteration 1, our distribution∏(1) (from which θ(1) is
drawn) is ∏(1) =∏(0) × P
(1× k) (1× k) × (k × k)
3. At iteration 2, our distribution∏(2) (from which θ(2) is
drawn) is ∏(2) =∏(1) × P
(1× k) (1× k) × (k × k)
4. At iteration t, our distribution∏(t) (from which θ(t) is
drawn) is∏(t) =
∏(t−1)× P =∏(0)× Pt
![Page 51: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/51.jpg)
How Does a Markov Chain Work? (Discrete Example)
1. Define a starting distribution∏(0) (a 1× k vector of
probabilities that sum to one).
2. At iteration 1, our distribution∏(1) (from which θ(1) is
drawn) is ∏(1) =∏(0) × P
(1× k) (1× k) × (k × k)
3. At iteration 2, our distribution∏(2) (from which θ(2) is
drawn) is ∏(2) =∏(1) × P
(1× k) (1× k) × (k × k)
4. At iteration t, our distribution∏(t) (from which θ(t) is
drawn) is∏(t) =
∏(t−1)× P =∏(0)× Pt
![Page 52: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/52.jpg)
How Does a Markov Chain Work? (Discrete Example)
1. Define a starting distribution∏(0) (a 1× k vector of
probabilities that sum to one).
2. At iteration 1, our distribution∏(1) (from which θ(1) is
drawn) is ∏(1) =∏(0) × P
(1× k) (1× k) × (k × k)
3. At iteration 2, our distribution∏(2) (from which θ(2) is
drawn) is ∏(2) =∏(1) × P
(1× k) (1× k) × (k × k)
4. At iteration t, our distribution∏(t) (from which θ(t) is
drawn) is∏(t) =
∏(t−1)× P =∏(0)× Pt
![Page 53: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/53.jpg)
How Does a Markov Chain Work? (Discrete Example)
1. Define a starting distribution∏(0) (a 1× k vector of
probabilities that sum to one).
2. At iteration 1, our distribution∏(1) (from which θ(1) is
drawn) is ∏(1) =∏(0) × P
(1× k) (1× k) × (k × k)
3. At iteration 2, our distribution∏(2) (from which θ(2) is
drawn) is ∏(2) =∏(1) × P
(1× k) (1× k) × (k × k)
4. At iteration t, our distribution∏(t) (from which θ(t) is
drawn) is∏(t) =
∏(t−1)× P =∏(0)× Pt
![Page 54: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/54.jpg)
Stationary (Limiting) Distribution

Define a stationary distribution π to be some distribution such that π = πP.

For all the MCMC algorithms we use in Bayesian statistics, the Markov chain will typically converge to π regardless of our starting point.

So if we can devise a Markov chain whose stationary distribution π is our desired posterior distribution p(θ|y), then we can run this chain to get draws that are approximately from p(θ|y) once the chain has converged.
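The defining equation π = πP says π is a left eigenvector of P with eigenvalue 1. A quick numerical check, using the same kind of made-up 3-state matrix (the values are illustrative):

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# The stationary distribution is the left eigenvector of P for eigenvalue 1,
# normalized so its entries sum to one.
eigvals, eigvecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigvals - 1.0))
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()

print(pi)
print(pi @ P)  # pi = pi P: taking one more step of the chain changes nothing
```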
Burn-in

Since convergence usually occurs regardless of our starting point, we can usually pick any feasible starting point (for example, starting draws that lie in the parameter space).

However, the time it takes for the chain to converge varies depending on the starting point.

As a matter of practice, most people throw out a certain number of the first draws, known as the burn-in. This makes our draws closer to the stationary distribution and less dependent on the starting point.

However, it is unclear how much we should burn in, since our draws are all slightly dependent and we don't know exactly when convergence occurs.
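In code, discarding the burn-in is just array slicing on the stored draws; the chain and burn-in length below are placeholders, not a recommendation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder "chain": 10,000 stored draws from some MCMC run
chain = rng.normal(size=10_000)

burn_in = 1_000          # an arbitrary choice; there is no universal rule
kept = chain[burn_in:]   # discard the first burn_in draws

print(len(kept))  # 9000
```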
Monte Carlo Integration on the Markov Chain

Once we have a Markov chain that has converged to the stationary distribution, the draws in our chain appear to be like draws from p(θ|y), so it seems like we should be able to use Monte Carlo Integration methods to find quantities of interest.

One problem: our draws are not independent, which we required for Monte Carlo Integration to work (remember the SLLN).

Luckily, we have the Ergodic Theorem.
Ergodic Theorem

Let θ(1), θ(2), . . . , θ(M) be M values from a Markov chain that is aperiodic, irreducible, and positive recurrent (then the chain is ergodic), and suppose E[g(θ)] < ∞.

Then with probability 1,

(1/M) ∑_{i=1}^{M} g(θ(i)) → ∫_Θ g(θ) π(θ) dθ

as M → ∞, where π is the stationary distribution.

This is the Markov chain analog to the SLLN, and it allows us to ignore the dependence between draws of the Markov chain when we calculate quantities of interest from the draws.

But what does it mean for a chain to be aperiodic, irreducible, and positive recurrent, and therefore ergodic?
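A sketch of the theorem in action on a small made-up 2-state chain (states 0 and 1, transition probabilities chosen for illustration): the draws are strongly dependent, yet the ergodic average of g(θ) = θ still approaches the stationary expectation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up 2-state chain on {0, 1}
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
# Its stationary distribution solves pi = pi P; here pi = (2/3, 1/3)
pi = np.array([2 / 3, 1 / 3])

# Simulate highly dependent draws from the chain
state = 0
draws = np.empty(200_000, dtype=int)
for t in range(len(draws)):
    state = rng.choice(2, p=P[state])
    draws[t] = state

# Ergodic average of g(theta) = theta vs. the stationary expectation pi[1]
print(draws.mean(), pi[1])  # the two numbers should be close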
Aperiodicity

A Markov chain is aperiodic if the only length of time for which the chain repeats some cycle of values is the trivial case with cycle length equal to one.

Let A, B, and C denote the states (analogous to the possible values of θ) in a 3-state Markov chain. The following chain is periodic with period 3, where the period is the number of steps that it takes to return to a certain state:

A → B → C → A, where each transition occurs with probability 1 (and each state stays put with probability 0).

As long as the chain is not repeating an identical cycle, the chain is aperiodic.
Irreducibility

A Markov chain is irreducible if it is possible to go from any state to any other state (not necessarily in one step).

The following chain is reducible, or not irreducible:

A stays at A with probability 0.5 and moves to B with probability 0.5; B stays at B with probability 0.7 and moves to C with probability 0.3; C stays at C with probability 0.4 and moves back to B with probability 0.6.

The chain is not irreducible because we cannot get to A from B or C, regardless of the number of steps we take.
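Reducibility is easy to check numerically: writing the chain above as a transition matrix (rows ordered A, B, C), the probability of reaching A from B or C stays exactly zero no matter how many steps we take.

```python
import numpy as np

# Transition matrix for the reducible chain above; rows/columns are A, B, C
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.6, 0.4]])

P_n = np.linalg.matrix_power(P, 50)
# Entry [i, j] of P^n is the probability of going from state i to state j in n steps
print(P_n[1, 0], P_n[2, 0])  # B -> A and C -> A in 50 steps: both 0.0
```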
Positive Recurrence

A Markov chain is recurrent if, for any given state i, whenever the chain starts at i it will eventually return to i with probability 1.

A Markov chain is positive recurrent if the expected return time to state i is finite; otherwise it is null recurrent.

So if our Markov chain is aperiodic, irreducible, and positive recurrent (all the ones we use in Bayesian statistics usually are), then it is ergodic, and the ergodic theorem allows us to do Monte Carlo Integration by calculating quantities of interest from our draws, ignoring the dependence between draws.
Thinning the Chain

In order to break the dependence between draws in the Markov chain, some have suggested only keeping every dth draw of the chain. This is known as thinning.

Pros:

- Perhaps gets you a little closer to i.i.d. draws.
- Saves memory since you only store a fraction of the draws.

Cons:

- Unnecessary with the ergodic theorem.
- Shown to increase the variance of your Monte Carlo estimates.
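Like burn-in, thinning is just array slicing; the chain, burn-in length, and d below are arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(2)
chain = rng.normal(size=11_000)  # placeholder for stored MCMC draws

burn_in, d = 1_000, 10
thinned = chain[burn_in::d]  # drop the burn-in, then keep every d-th draw
print(len(thinned))  # 1000
```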
So Really, What is MCMC?

MCMC is a class of methods in which we can simulate draws that are slightly dependent and are approximately from a (posterior) distribution.

We then take those draws and calculate quantities of interest for the (posterior) distribution.

In Bayesian statistics, there are generally two MCMC algorithms that we use: the Gibbs Sampler and the Metropolis-Hastings algorithm.
Outline
Introduction to Markov Chain Monte Carlo
Gibbs Sampling
The Metropolis-Hastings Algorithm
Gibbs Sampling

Suppose we have a joint distribution p(θ1, . . . , θk) that we want to sample from (for example, a posterior distribution).

We can use the Gibbs sampler to sample from the joint distribution if we know the full conditional distributions for each parameter.

For each parameter, the full conditional distribution is the distribution of the parameter conditional on the known information and all the other parameters: p(θj |θ−j , y).

How can we know the joint distribution simply by knowing the full conditional distributions?
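As a concrete sketch (an example not on the slides): for a bivariate normal with zero means, unit variances, and correlation ρ, both full conditionals are univariate normals, so a Gibbs sampler simply alternates between them. The value of ρ here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Target: bivariate normal, means 0, variances 1, correlation rho.
# Full conditionals: theta1 | theta2 ~ N(rho * theta2, 1 - rho^2), and symmetrically.
rho = 0.6
M = 50_000
draws = np.empty((M, 2))
theta1, theta2 = 0.0, 0.0  # arbitrary starting point

for m in range(M):
    theta1 = rng.normal(rho * theta2, np.sqrt(1 - rho**2))  # draw from p(theta1 | theta2)
    theta2 = rng.normal(rho * theta1, np.sqrt(1 - rho**2))  # draw from p(theta2 | theta1)
    draws[m] = theta1, theta2

kept = draws[1_000:]  # discard burn-in
print(np.corrcoef(kept.T)[0, 1])  # sample correlation should be near rho = 0.6
```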
Gibbs Sampling
Suppose we have a joint distribution p(θ1, . . . , θk) that we want tosample from (for example, a posterior distribution).
We can use the Gibbs sampler to sample from the joint distributionif we knew the full conditional distributions for each parameter.
For each parameter, the full conditional distribution is thedistribution of the parameter conditional on the known informationand all the other parameters: p(θj |θ−j , y)
How can we know the joint distribution simply by knowing the fullconditional distributions?
![Page 110: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/110.jpg)
The Hammersley-Clifford Theorem (for two blocks)
Suppose we have a joint density f(x, y). The theorem proves that we can write out the joint density in terms of the conditional densities f(x|y) and f(y|x):

$$f(x, y) = \frac{f(y \mid x)}{\int \frac{f(y \mid x)}{f(x \mid y)} \, dy}$$

We can write the denominator as

$$\int \frac{f(y \mid x)}{f(x \mid y)} \, dy = \int \frac{f(x, y)/f(x)}{f(x, y)/f(y)} \, dy = \int \frac{f(y)}{f(x)} \, dy = \frac{1}{f(x)}$$
![Page 117: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/117.jpg)
Thus, our right-hand side is

$$\frac{f(y \mid x)}{1/f(x)} = f(y \mid x)\, f(x) = f(x, y)$$

The theorem shows that knowledge of the conditional densities allows us to get the joint density.
This works for more than two blocks of parameters.
But how do we figure out the full conditionals?
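The two-block identity can be checked numerically on a discrete joint, where the sum over y plays the role of the integral. This is a minimal sketch on a made-up 2×2 table, not an example from the slides:

```r
# Numerical check of the two-block Hammersley-Clifford identity on a
# hypothetical discrete joint (made-up probabilities; sums replace integrals).
joint <- matrix(c(0.1, 0.2, 0.3, 0.4), nrow = 2)  # rows index x, cols index y
fx <- rowSums(joint)                      # marginal f(x)
fy <- colSums(joint)                      # marginal f(y)
f_y_given_x <- joint / fx                 # f(y|x): each row sums to 1
f_x_given_y <- sweep(joint, 2, fy, "/")   # f(x|y): each column sums to 1

# Reconstruct f(x, y) = f(y|x) / sum_y [ f(y|x) / f(x|y) ]
recon <- joint * 0
for (x in 1:2) {
  denom <- sum(f_y_given_x[x, ] / f_x_given_y[x, ])  # equals 1 / f(x)
  recon[x, ] <- f_y_given_x[x, ] / denom
}
all.equal(recon, joint)  # TRUE: the conditionals determine the joint
```

The loop is just the denominator calculation from the slide: the ratio f(y|x)/f(x|y) collapses to f(y)/f(x), so summing it over y gives 1/f(x).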
![Page 118: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/118.jpg)
Thus, our right-hand side is
f (y |x)1
f (x)
= f (y |x)f (x)
= f (x , y)
The theorem shows that knowledge of the conditional densitiesallows us to get the joint density.
This works for more than two blocks of parameters.
But how do we figure out the full conditionals?
![Page 119: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/119.jpg)
Thus, our right-hand side is
f (y |x)1
f (x)
= f (y |x)f (x)
= f (x , y)
The theorem shows that knowledge of the conditional densitiesallows us to get the joint density.
This works for more than two blocks of parameters.
But how do we figure out the full conditionals?
![Page 120: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/120.jpg)
Thus, our right-hand side is
f (y |x)1
f (x)
= f (y |x)f (x)
= f (x , y)
The theorem shows that knowledge of the conditional densitiesallows us to get the joint density.
This works for more than two blocks of parameters.
But how do we figure out the full conditionals?
![Page 121: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/121.jpg)
Thus, our right-hand side is
f (y |x)1
f (x)
= f (y |x)f (x)
= f (x , y)
The theorem shows that knowledge of the conditional densitiesallows us to get the joint density.
This works for more than two blocks of parameters.
But how do we figure out the full conditionals?
![Page 122: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/122.jpg)
Steps to Calculating Full Conditional Distributions
Suppose we have a posterior p(θ|y). To calculate the full conditionals for each θ, do the following:
1. Write out the full posterior, ignoring constants of proportionality.
2. Pick a block of parameters (for example, θ1) and drop everything that doesn't depend on θ1.
3. Use your knowledge of distributions to figure out what the normalizing constant is (and thus what the full conditional distribution p(θ1|θ−1, y) is).
4. Repeat steps 2 and 3 for all parameter blocks.
![Page 129: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/129.jpg)
Gibbs Sampler Steps
Let's suppose that we are interested in sampling from the posterior p(θ|y), where θ is a vector of three parameters, θ1, θ2, θ3.
The steps to a Gibbs sampler (and the analogous steps in the MCMC process) are:
1. Pick a vector of starting values θ^(0). (Defining a starting distribution Q^(0) and drawing θ^(0) from it.)
2. Start with any θ (order does not matter, but I'll start with θ1 for convenience). Draw a value θ1^(1) from the full conditional p(θ1 | θ2^(0), θ3^(0), y).
![Page 135: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/135.jpg)
3. Draw a value θ2^(1) (again order does not matter) from the full conditional p(θ2 | θ1^(1), θ3^(0), y). Note that we must use the updated value θ1^(1).
4. Draw a value θ3^(1) from the full conditional p(θ3 | θ1^(1), θ2^(1), y) using both updated values. (Steps 2-4 are analogous to multiplying Q^(0) and P to get Q^(1) and then drawing θ^(1) from Q^(1).)
5. Draw θ^(2) using θ^(1), continually using the most updated values.
6. Repeat until we get M draws, with each draw being a vector θ^(t).
7. Optional burn-in and/or thinning.
Our result is a Markov chain with a bunch of draws of θ that are approximately from our posterior. We can do Monte Carlo Integration on those draws to get quantities of interest.
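The steps above can be sketched as a simple loop. This is a minimal two-block illustration, not from the slides: the target is the standard bivariate normal with correlation ρ = 0.5, a textbook toy case where both full conditionals are known to be normal, p(θ1|θ2) = N(ρθ2, 1 − ρ²) and symmetrically for θ2:

```r
set.seed(42)
rho <- 0.5           # correlation of the toy bivariate normal target
M <- 5000            # number of Gibbs draws
theta <- matrix(NA_real_, nrow = M, ncol = 2)
theta_curr <- c(0, 0)               # step 1: starting values theta^(0)

for (t in 1:M) {
  # step 2: draw theta1 from its full conditional N(rho*theta2, 1 - rho^2)
  theta_curr[1] <- rnorm(1, rho * theta_curr[2], sqrt(1 - rho^2))
  # step 3: draw theta2 using the *updated* value of theta1
  theta_curr[2] <- rnorm(1, rho * theta_curr[1], sqrt(1 - rho^2))
  theta[t, ] <- theta_curr          # steps 5-6: store the vector theta^(t)
}

theta <- theta[-(1:500), ]  # step 7 (optional): discard a burn-in
colMeans(theta)             # Monte Carlo estimates of E[theta] (near 0)
cor(theta)[1, 2]            # near rho
```

The last two lines are the Monte Carlo Integration step: once the chain is (approximately) sampling the target, posterior quantities are just summaries of the stored draws.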
![Page 144: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/144.jpg)
An Example (Robert and Casella, 10.17)¹

Suppose we have data on the number of failures (yi) for each of 10 pumps in a nuclear plant.
We also have the times (ti) at which each pump was observed.

> y <- c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22)
> t <- c(94, 16, 63, 126, 5, 31, 1, 1, 2, 10)
> rbind(y, t)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
y    5    1    5   14    3   19    1    1    4    22
t   94   16   63  126    5   31    1    1    2    10

We want to model the number of failures with a Poisson likelihood, where the expected number of failures λi differs for each pump. Since the time at which we observed each pump is different, we need to scale each λi by its observed time ti.

Our likelihood is $\prod_{i=1}^{10} \text{Poisson}(\lambda_i t_i)$.

¹Robert, Christian P. and George Casella. 2004. Monte Carlo Statistical Methods, 2nd edition. Springer.
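A quick look at the raw per-pump failure rates y/t (not in the slides) is a useful sanity check: these are the natural point estimates that the λi posteriors will be pulled toward, moderated by the shared prior introduced next.

```r
# Raw failure rates y/t per pump -- the data vary over a wide range,
# which is why each pump gets its own lambda_i.
y <- c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22)
t <- c(94, 16, 63, 126, 5, 31, 1, 1, 2, 10)
round(y / t, 2)
```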
![Page 151: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/151.jpg)
Let's put a Gamma(α, β) prior on λi with α = 1.8, so the λi's are drawn from the same distribution.
Also, let's put a Gamma(γ, δ) prior on β, with γ = 0.01 and δ = 1.
So our model has 11 unknown parameters (10 λi's and β).
Our posterior is

$$p(\lambda, \beta \mid y, t) \propto \left( \prod_{i=1}^{10} \text{Poisson}(\lambda_i t_i) \times \text{Gamma}(\alpha, \beta) \right) \times \text{Gamma}(\gamma, \delta)$$

$$= \left( \prod_{i=1}^{10} \frac{e^{-\lambda_i t_i} (\lambda_i t_i)^{y_i}}{y_i!} \times \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda_i^{\alpha-1} e^{-\beta \lambda_i} \right) \times \frac{\delta^{\gamma}}{\Gamma(\gamma)} \beta^{\gamma-1} e^{-\delta \beta}$$
![Page 156: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/156.jpg)
p(λ, β | y, t) ∝ [ ∏_{i=1}^{10} e^{−λ_i t_i} (λ_i t_i)^{y_i} × β^α λ_i^{α−1} e^{−β λ_i} ] × β^{γ−1} e^{−δβ}

= [ ∏_{i=1}^{10} λ_i^{y_i+α−1} e^{−(t_i+β) λ_i} ] β^{10α+γ−1} e^{−δβ}

Finding the full conditionals:

p(λ_i | λ_{−i}, β, y, t) ∝ λ_i^{y_i+α−1} e^{−(t_i+β) λ_i}

p(β | λ, y, t) ∝ β^{10α+γ−1} e^{−β(δ + ∑_{i=1}^{10} λ_i)}

p(λ_i | λ_{−i}, β, y, t) is a Gamma(y_i + α, t_i + β) distribution.

p(β | λ, y, t) is a Gamma(10α + γ, δ + ∑_{i=1}^{10} λ_i) distribution.
![Page 163: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/163.jpg)
Coding the Gibbs Sampler

1. Define a starting value for β (we only need a starting value for β here because we will draw λ first, and λ depends only on β and other given values).

> beta.cur <- 1

2. Draw λ^(1) from its full conditional (we can draw all the λ_i's as a block because each depends only on β, not on the other λ_i's).

> lambda.update <- function(alpha, beta, y, t) {
+     rgamma(length(y), y + alpha, t + beta)
+ }

3. Draw β^(1) from its full conditional, using λ^(1).

> beta.update <- function(alpha, gamma, delta, lambda, y) {
+     rgamma(1, length(y) * alpha + gamma, delta + sum(lambda))
+ }
![Page 172: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/172.jpg)
4. Repeat using most updated values until we get M draws.
5. Optional burn-in and thinning.
6. Make it into a function.
![Page 175: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/175.jpg)
> gibbs <- function(n.sims, beta.start, alpha, gamma, delta,
+     y, t, burnin = 0, thin = 1) {
+     n.kept <- (n.sims - burnin)/thin
+     beta.draws <- rep(NA, n.kept)
+     lambda.draws <- matrix(NA, nrow = n.kept, ncol = length(y))
+     beta.cur <- beta.start
+     lambda.update <- function(alpha, beta, y, t) {
+         rgamma(length(y), y + alpha, t + beta)
+     }
+     beta.update <- function(alpha, gamma, delta, lambda, y) {
+         rgamma(1, length(y) * alpha + gamma, delta + sum(lambda))
+     }
+     for (i in 1:n.sims) {
+         lambda.cur <- lambda.update(alpha = alpha, beta = beta.cur,
+             y = y, t = t)
+         beta.cur <- beta.update(alpha = alpha, gamma = gamma,
+             delta = delta, lambda = lambda.cur, y = y)
+         if (i > burnin && (i - burnin) %% thin == 0) {
+             lambda.draws[(i - burnin)/thin, ] <- lambda.cur
+             beta.draws[(i - burnin)/thin] <- beta.cur
+         }
+     }
+     return(list(lambda.draws = lambda.draws, beta.draws = beta.draws))
+ }
![Page 176: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/176.jpg)
7. Do Monte Carlo integration on the resulting Markov chain, whose draws are approximately samples from the posterior.
> posterior <- gibbs(n.sims = 10000, beta.start = 1, alpha = 1.8,
+ gamma = 0.01, delta = 1, y = y, t = t)
> colMeans(posterior$lambda.draws)
[1] 0.07113 0.15098 0.10447 0.12321 0.65680 0.62212 0.86522 0.85465
[9] 1.35524 1.92694
> mean(posterior$beta.draws)
[1] 2.389
> apply(posterior$lambda.draws, 2, sd)
[1] 0.02759 0.08974 0.04012 0.03071 0.30899 0.13676 0.55689 0.54814
[9] 0.60854 0.40812
> sd(posterior$beta.draws)
[1] 0.6986
![Page 178: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/178.jpg)
Outline
Introduction to Markov Chain Monte Carlo
Gibbs Sampling
The Metropolis-Hastings Algorithm
![Page 179: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/179.jpg)
Suppose we have a posterior p(θ|y) that we want to sample from, but

- the posterior doesn't look like any distribution we know (no conjugacy)
- the posterior consists of more than 2 parameters (grid approximations are intractable)
- some (or all) of the full conditionals do not look like any distributions we know (no Gibbs sampling for those whose full conditionals we don't know)

If all else fails, we can use the Metropolis-Hastings algorithm, which will always work.
![Page 184: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/184.jpg)
Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm follows these steps:

1. Choose a starting value θ^(0).

2. At iteration t, draw a candidate θ* from a jumping distribution J_t(θ* | θ^(t−1)).

3. Compute an acceptance ratio (probability):

r = [ p(θ* | y) / J_t(θ* | θ^(t−1)) ] / [ p(θ^(t−1) | y) / J_t(θ^(t−1) | θ*) ]

4. Accept θ* as θ^(t) with probability min(r, 1). If θ* is not accepted, then θ^(t) = θ^(t−1).

5. Repeat steps 2-4 M times to get M draws from p(θ|y), with optional burn-in and/or thinning.
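The steps above can be sketched in code. The following Python illustration (not from the slides) runs a random-walk Metropolis sampler with a normal jumping distribution; the standard-normal target, starting value, and jump scale are all illustrative assumptions:

```python
import math
import random

def metropolis(log_target, theta0, n_sims, scale=1.0):
    """Random-walk Metropolis: the normal jump is symmetric, so the
    jumping densities cancel in the acceptance ratio (step 3)."""
    draws = []
    theta = theta0
    for _ in range(n_sims):
        cand = random.gauss(theta, scale)              # step 2: candidate draw
        log_r = log_target(cand) - log_target(theta)   # step 3, in logs
        if log_r >= 0 or random.random() < math.exp(log_r):  # step 4: min(r, 1)
            theta = cand
        draws.append(theta)  # on rejection, theta^(t) = theta^(t-1)
    return draws

# Target known only up to a constant: log p(theta|y) = -theta^2/2 + const,
# i.e. a standard normal posterior.
random.seed(1)
draws = metropolis(lambda th: -th * th / 2, theta0=0.0, n_sims=20000)
kept = draws[5000:]  # optional burn-in (step 5)
print(round(sum(kept) / len(kept), 2))
```

Because the ratio is computed in logs, the sampler only ever needs the unnormalized log posterior, which is all we usually have.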
![Page 191: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/191.jpg)
Step 1: Choose a starting value θ^(0).

This is equivalent to drawing from our initial stationary distribution.

The important thing to remember is that θ^(0) must have positive probability:

p(θ^(0) | y) > 0

Otherwise, we are starting with a value that cannot be drawn.
![Page 195: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/195.jpg)
Step 2: Draw θ* from J_t(θ* | θ^(t−1)).

The jumping distribution J_t(θ* | θ^(t−1)) determines where we move to in the next iteration of the Markov chain (it is analogous to the transition kernel). The support of the jumping distribution must contain the support of the posterior.

The original Metropolis algorithm required that J_t(θ* | θ^(t−1)) be a symmetric distribution (such as the normal distribution), that is,

J_t(θ* | θ^(t−1)) = J_t(θ^(t−1) | θ*)

We now know, from the Metropolis-Hastings algorithm, that symmetry is unnecessary.

If we have a symmetric jumping distribution that depends on θ^(t−1), then we have what is known as random walk Metropolis sampling.
![Page 202: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/202.jpg)
If our jumping distribution does not depend on θ^(t−1),

J_t(θ* | θ^(t−1)) = J_t(θ*)

then we have what is known as independent Metropolis-Hastings sampling.

Basically, all our candidate draws θ* are drawn from the same distribution, regardless of where the previous draw was.

This can be extremely efficient or extremely inefficient, depending on how close the jumping distribution is to the posterior.

Generally speaking, the chain will behave well only if the jumping distribution has heavier tails than the posterior.
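A minimal Python sketch of the independence sampler (illustrative, not from the slides): because the proposal is not symmetric in (θ*, θ^(t−1)), the jumping densities no longer cancel and must stay in the acceptance ratio. The standard-normal target and the heavier-tailed normal proposal with sd 3 are assumptions for the example:

```python
import math
import random

def independent_mh(log_target, log_proposal, draw_proposal, theta0, n_sims):
    """Independent M-H: candidates come from one fixed proposal J(theta*),
    so r = [p(cand)/J(cand)] / [p(cur)/J(cur)]."""
    draws = []
    theta = theta0
    for _ in range(n_sims):
        cand = draw_proposal()
        log_r = (log_target(cand) - log_proposal(cand)) \
              - (log_target(theta) - log_proposal(theta))
        if log_r >= 0 or random.random() < math.exp(log_r):
            theta = cand
        draws.append(theta)
    return draws

# Heavy-tailed proposal (normal, sd 3) for a standard-normal target.
# Normalizing constants are omitted: they cancel inside log_r.
random.seed(2)
sd = 3.0
draws = independent_mh(
    log_target=lambda th: -th * th / 2,
    log_proposal=lambda th: -th * th / (2 * sd * sd),
    draw_proposal=lambda: random.gauss(0.0, sd),
    theta0=0.0, n_sims=20000)
print(round(sum(draws) / len(draws), 2))
```

Shrinking the proposal sd below the target's spread is the inefficient case the slide warns about: candidates in the tails are rarely proposed, and the chain can get stuck there for long stretches.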
![Page 208: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/208.jpg)
Step 3: Compute the acceptance ratio r.
r = [p(θ∗ | y) / Jt(θ∗ | θ(t−1))] / [p(θ(t−1) | y) / Jt(θ(t−1) | θ∗)]
In the case where our jumping distribution is symmetric,
r = p(θ∗ | y) / p(θ(t−1) | y)
If our candidate draw has higher posterior density than our current draw, then our candidate is better, so we definitely accept it. Otherwise, our candidate is accepted according to the ratio of the densities of the candidate and current draws.
Note that since r is a ratio, we only need p(θ | y) up to a constant of proportionality, since p(y) cancels out in both the numerator and denominator.
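That cancellation is easy to check numerically (a Python illustration, not from the slides; the densities are a stand-in standard Normal):

```python
import math

# A multiplicative normalizing constant cancels in the acceptance ratio,
# so the posterior is only needed up to proportionality.

def unnorm(theta):                 # stand-in for p(theta | y) up to a constant
    return math.exp(-0.5 * theta ** 2)

def normed(theta):                 # the fully normalized version of the same density
    return unnorm(theta) / math.sqrt(2 * math.pi)

theta_cand, theta_cur = 0.5, 1.2
r_from_unnorm = unnorm(theta_cand) / unnorm(theta_cur)
r_from_normed = normed(theta_cand) / normed(theta_cur)
# the two ratios agree, up to floating point
```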
![Page 214: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/214.jpg)
In the case where our jumping distribution is not symmetric,
r = [p(θ∗ | y) / Jt(θ∗ | θ(t−1))] / [p(θ(t−1) | y) / Jt(θ(t−1) | θ∗)]
We need to weight our evaluations of the draws at the posterior densities by how likely we are to propose each draw.
For example, if we are very likely to jump to some θ∗, then Jt(θ∗ | θ(t−1)) is likely to be high, so we should accept fewer of those draws than draws of some other θ∗ that we are less likely to jump to.
In the case of independent Metropolis-Hastings sampling,
r = [p(θ∗ | y) / Jt(θ∗)] / [p(θ(t−1) | y) / Jt(θ(t−1))]
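To make the independent-sampler ratio concrete, here is a small Python sketch (all names and densities invented for illustration): the target is an unnormalized standard Normal and the proposal Jt is a Normal(0, 3), whose heavier tails keep the chain well behaved.

```python
import random
import math

def p_unnorm(theta):                        # target density, up to a constant
    return math.exp(-0.5 * theta ** 2)

def j(theta):                               # proposal density Jt(theta)
    return math.exp(-0.5 * (theta / 3.0) ** 2) / (3.0 * math.sqrt(2 * math.pi))

def independence_mh(n_sims, seed=1):
    random.seed(seed)
    theta = 0.0
    draws = []
    for _ in range(n_sims):
        cand = random.gauss(0.0, 3.0)       # does not depend on theta
        # r weights each posterior evaluation by how likely that draw was
        r = (p_unnorm(cand) / j(cand)) / (p_unnorm(theta) / j(theta))
        if random.random() <= r:
            theta = cand
        draws.append(theta)
    return draws

draws = independence_mh(20000)
# the draws should look approximately standard Normal
```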
![Page 218: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/218.jpg)
Step 4: Decide whether to accept θ∗.
Accept θ∗ as θ(t) with probability min(r, 1). If θ∗ is not accepted, then θ(t) = θ(t−1).
1. For each θ∗, draw a value u from the Uniform(0, 1) distribution.
2. If u ≤ r, accept θ∗ as θ(t). Otherwise, use θ(t−1) as θ(t).
Candidate draws with higher density than the current draw are always accepted.
Unlike in rejection sampling, each iteration always produces a draw, either θ∗ or θ(t−1).
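The accept/reject step above can be sketched as a small helper (a Python illustration with hypothetical names, not the slides' code):

```python
import random

# Accept the candidate with probability min(r, 1) by comparing a
# Uniform(0, 1) draw against r; a rejection repeats the current value,
# so every iteration still yields a draw.

def accept_step(theta_cand, theta_cur, r, u=None):
    if u is None:
        u = random.random()
    return theta_cand if u <= r else theta_cur

# r >= 1: accepted no matter what u is, since u <= 1 <= r
assert accept_step(2.0, 1.0, r=1.7, u=0.999) == 2.0
# r < 1: accepted only when u falls below r
assert accept_step(2.0, 1.0, r=0.3, u=0.5) == 1.0
```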
![Page 224: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/224.jpg)
Acceptance Rates
It is important to monitor the acceptance rate (the fraction of candidate draws that are accepted) of your Metropolis-Hastings algorithm.
If your acceptance rate is too high, the chain is probably not mixing well (not moving around the parameter space quickly enough).
If your acceptance rate is too low, your algorithm is too inefficient (rejecting too many candidate draws).
What counts as too high or too low depends on your specific algorithm, but generally:
- random walk: somewhere between 0.25 and 0.50 is recommended
- independent: something close to 1 is preferred
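One simple way to estimate the acceptance rate from the stored draws (a Python sketch with an invented helper name; this relies on the parameter being continuous, so a rejection repeats the previous value exactly):

```python
# Estimate the acceptance rate as the fraction of transitions where the
# chain actually moved; for a continuous parameter, staying in place
# almost surely means the candidate was rejected.

def acceptance_rate(draws):
    moves = sum(1 for a, b in zip(draws, draws[1:]) if a != b)
    return moves / (len(draws) - 1)

# 2 of the 4 transitions move, so the estimated acceptance rate is 0.5
rate = acceptance_rate([1.0, 1.0, 2.0, 2.0, 3.0])
```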
![Page 231: mcmc](https://reader034.fdocuments.us/reader034/viewer/2022051608/5453fa7eb1af9ff55a8b45fb/html5/thumbnails/231.jpg)
A Simple Example
Using a random walk Metropolis algorithm to sample from a Gamma(1.7, 4.4) distribution, with a Normal jumping distribution with standard deviation of 2.

```r
mh.gamma <- function(n.sims, start, burnin, cand.sd, shape, rate) {
  theta.cur <- start
  draws <- c()
  theta.update <- function(theta.cur, shape, rate) {
    # propose a candidate from a Normal centered at the current draw
    theta.can <- rnorm(1, mean = theta.cur, sd = cand.sd)
    # symmetric jumping distribution, so r is just the ratio of target densities
    accept.prob <- dgamma(theta.can, shape = shape, rate = rate) /
      dgamma(theta.cur, shape = shape, rate = rate)
    if (runif(1) <= accept.prob) theta.can else theta.cur
  }
  for (i in 1:n.sims) {
    draws[i] <- theta.cur <- theta.update(theta.cur, shape = shape, rate = rate)
  }
  return(draws[(burnin + 1):n.sims])
}

mh.draws <- mh.gamma(10000, start = 1, burnin = 1000, cand.sd = 2,
                     shape = 1.7, rate = 4.4)
```
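As a rough check of the same logic, here is a Python re-sketch of the sampler above (a hypothetical port, not the slides' code): the mean of the retained draws should land near the Gamma mean, shape/rate = 1.7/4.4 ≈ 0.386.

```python
import random
import math

def gamma_unnorm(x, shape=1.7, rate=4.4):
    # Gamma density up to a constant; zero outside the support
    return x ** (shape - 1) * math.exp(-rate * x) if x > 0 else 0.0

def mh_gamma(n_sims, start=1.0, burnin=1000, cand_sd=2.0, seed=1):
    random.seed(seed)
    theta = start
    draws = []
    for _ in range(n_sims):
        cand = random.gauss(theta, cand_sd)   # symmetric random-walk proposal
        cand_dens = gamma_unnorm(cand)
        # symmetric proposal: r reduces to the ratio of target densities
        if cand_dens > 0 and random.random() <= cand_dens / gamma_unnorm(theta):
            theta = cand
        draws.append(theta)
    return draws[burnin:]

draws = mh_gamma(50000)
mean_est = sum(draws) / len(draws)   # should be near 1.7 / 4.4
```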