Markov Chain Monte Carlo - Harvard University
Transcript of Markov Chain Monte Carlo - Harvard University
Markov Chain Monte Carlo
David A. van Dyk
Statistics Section, Imperial College London
Smithsonian Astrophysical Observatory, March 2014
Outline
1. Background: Bayesian Statistics; Monte Carlo Integration; Markov Chains
2. Basic MCMC Jumping Rules: Metropolis Sampler; Metropolis-Hastings Sampler
3. Practical Challenges and Advice: Diagnosing Convergence; Choosing a Jumping Rule; Transformations and Multiple Modes
4. Overview of Recommended Strategy
Bayesian Statistical Analyses: Likelihood
Likelihood Function: the distribution of the data given the model parameters. E.g., Y ∼ Poisson(λS):

likelihood(λS) = e^(−λS) λS^Y / Y!
Maximum Likelihood Estimation: Suppose Y = 3
[Figure: the likelihood as a function of lambda, with its normal approximation.]
We can estimate λS and its error bars.
Bayesian Analyses: Prior and Posterior Dist’ns
Prior Distribution: Knowledge obtained prior to current data.
Bayes Theorem and Posterior Distribution:
posterior(λ) ∝ likelihood(λ) × prior(λ)
Combine past and current information:
[Figure: the prior density and the posterior density, each as a function of lambda.]
Bayesian analyses rely on probability theory.
Why be Bayesian?
Avoid Gaussian assumptions: methods like χ² fitting implicitly assume a Gaussian model, and many other methods rely on asymptotic Gaussian properties (e.g., stemming from the central limit theorem).
Bayesian methods rely directly on probability calculus. They are designed to combine multiple and/or external sources of information. Modern computational methods allow us to work with specially tailored models and methods:
selection effects, contaminated data, observational biases, complex physics-based models, data distortion, calibration uncertainty, measurement errors, etc.
Simulating from the Posterior Distribution
We can simulate or sample from a distribution to learn about its contours; with the sample alone, we can learn about the posterior.
Here, Y ∼ Poisson(λS + λB) and YB ∼ Poisson(c λB).
[Figure: the prior density of lamS, the marginal posterior density of lamS, and a scatter plot of draws from the joint posterior of (lamS, lamB) under a flat prior.]
Model Fitting: Complex Posterior Distributions
Highly non-linear relationship among stellar parameters.
Model Fitting: Complex Posterior Distributions (Multiple Modes)
The classification of certain stars as field or cluster stars can cause multiple modes in the distributions of other parameters.
Complex Posterior Distributions
[Figure: posterior density as a function of energy (keV), showing multiple modes.]
[Figure: the joint posterior of the energies of photon 1 and photon 2 (keV).]
Using Simulation to Evaluate Integrals
Suppose we want to compute

I = ∫ g(θ) f(θ) dθ,

where f(θ) is a probability density function. If we have a sample

θ(1), . . . , θ(n) ∼ f(θ),

we can estimate I with

Î_n = (1/n) Σ_{t=1}^{n} g(θ(t)).
In this way we can compute means, variances, and the probabilities of intervals.
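The estimator above can be sketched in a few lines. This is a minimal illustration (assuming Python with NumPy; the choice f = N(0, 1) and g(θ) = θ² is mine, not from the slides), for which the true value of I is E[θ²] = 1.

```python
import numpy as np

# Monte Carlo estimate of I = ∫ g(theta) f(theta) dtheta.
# Illustrative choice: f = N(0, 1) and g(theta) = theta^2, so I = 1.
rng = np.random.default_rng(0)

def mc_integrate(g, sampler, n):
    """Estimate E_f[g(theta)] by averaging g over n draws theta ~ f."""
    theta = sampler(n)
    return float(np.mean(g(theta)))

estimate = mc_integrate(lambda t: t ** 2, rng.standard_normal, 100_000)
```

With 10^5 draws the Monte Carlo error is of order 1/sqrt(n), so the estimate lands close to 1.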
We Need to Obtain a Sample
Our primary goal: develop methods to obtain a sample from a distribution.
The sample may be independent or dependent. Markov chains can be used to obtain a dependent sample. In a Bayesian context, we typically aim to sample the posterior distribution.
We first discuss an independent method: rejection sampling.
Rejection Sampling
Suppose we cannot sample f(θ) directly, but can find g(θ) with

f(θ) ≤ M g(θ)

for some M.

1. Sample θ ∼ g(θ).
2. Sample u ∼ Unif(0, 1).
3. If u ≤ f(θ) / (M g(θ)), i.e., if u M g(θ) ≤ f(θ), accept θ: set θ(t) = θ. Otherwise reject θ and return to step 1.
How do we compute M?
Consider the distribution:
[Figure: a density f(theta) on 0 ≤ theta ≤ 7.]
We must bound f (θ) with some unnormalized density, Mg(θ).
[Figure: f(theta) bounded above by the rectangle Mg(x).]
Imagine that we sample uniformly in the red rectangle:
θ ∼ g(θ) and y = u M g(θ)
Accept samples that fall below the dashed density function.
How can we reduce the wait for acceptance?
[Figure: a tighter envelope that better approximates f(theta).]
Improve g(θ) as an approximation to f(θ)!
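The three-step algorithm above can be sketched directly. This is an illustrative example (assuming Python with NumPy): the target f is the Beta(2, 5) density, the envelope g is Unif(0, 1), and M = 2.5 bounds f since its maximum is f(0.2) ≈ 2.458; none of these choices come from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(theta):
    # Target density: Beta(2, 5) on (0, 1).
    return 30.0 * theta * (1.0 - theta) ** 4

M = 2.5  # upper bound on f, since max f = f(0.2) ≈ 2.458 and g = Unif(0,1)

def rejection_sample(n):
    draws = []
    while len(draws) < n:
        theta = rng.uniform()      # step 1: theta ~ g
        u = rng.uniform()          # step 2: u ~ Unif(0, 1)
        if u * M <= f(theta):      # step 3: accept if u M g(theta) <= f(theta)
            draws.append(theta)
    return np.array(draws)

sample = rejection_sample(5_000)
```

The acceptance rate here is 1/M = 0.4; a tighter envelope g raises it, which is exactly the point made on the slide.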
What is a Markov Chain?
A Markov chain is a sequence of random variables,

θ(0), θ(1), θ(2), . . .

such that

p(θ(t) | θ(t−1), θ(t−2), . . . , θ(0)) = p(θ(t) | θ(t−1)).

A Markov chain is generally constructed via

θ(t) = ϕ(θ(t−1), U(t−1)),

with U(1), U(2), . . . independent.
What is a Stationary Distribution?
A stationary distribution is any distribution f such that

f(θ(t)) = ∫ p(θ(t) | θ(t−1)) f(θ(t−1)) dθ(t−1).

If we have a sample from the stationary dist'n and update the Markov chain, the next iterate also follows the stationary dist'n.

What does a Markov Chain at Stationarity Deliver? Under regularity conditions, the density at iteration t satisfies

f(t)(θ | θ(0)) → f(θ) and (1/n) Σ_{t=1}^{n} h(θ(t)) → E_f[h(θ)].

We can treat {θ(t), t = N0, . . . , N} as an approximate correlated sample from the stationary distribution.

GOAL: a Markov chain with stationary dist'n = target dist'n.
The Metropolis Sampler
Draw θ(0) from some starting distribution.

For t = 1, 2, 3, . . .
Sample: θ∗ from Jt(θ∗ | θ(t−1))
Compute: r = p(θ∗ | y) / p(θ(t−1) | y)
Set: θ(t) = θ∗ with probability min(r, 1), and θ(t) = θ(t−1) otherwise.

Note: Jt must be symmetric: Jt(θ∗ | θ(t−1)) = Jt(θ(t−1) | θ∗). If p(θ∗ | y) > p(θ(t−1) | y), jump!
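The steps above can be sketched as a random-walk Metropolis sampler. This is an illustrative implementation (assuming Python with NumPy): the target N(0, 1) and the step size are my choices, not the slides'; the symmetric Gaussian jump makes the proposal terms cancel from r.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_target(theta):
    # Log of the N(0, 1) density, up to an additive constant.
    return -0.5 * theta ** 2

def metropolis(n, step=1.0, theta0=0.0):
    theta = theta0
    chain = np.empty(n)
    for t in range(n):
        proposal = theta + step * rng.standard_normal()   # symmetric jump
        log_r = log_target(proposal) - log_target(theta)  # log of r
        if np.log(rng.uniform()) < log_r:                 # accept w.p. min(r, 1)
            theta = proposal
        chain[t] = theta  # on rejection the chain "sticks" at the old draw
    return chain

chain = metropolis(20_000, step=2.4)  # step ~2.4 is a common 1-D choice
```

Working with log densities, as here, avoids numerical underflow when p(θ | y) is tiny.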
The Random Walk Jumping Rule
Typical choices of Jt(θ∗ | θ(t−1)) include

Unif(θ(t−1) − k, θ(t−1) + k)
Normal(θ(t−1), kI)
t_df(θ(t−1), kI)

Jt may change, but may not depend on the history of the chain.
[Figure: f(theta) with a random walk jumping distribution centered at the current value.]
How should we choose k? Replace I with M? How?
An Example
A simplified model for high-energy spectral analysis.
Model: consider a perfect detector:
1. 1000 energy bins, equally spaced from 0.3 keV to 7.0 keV,
2. Yi ∼ Poisson(α Ei^(−β)), with θ = (α, β),
3. Ei is the energy, and
4. (α, β) independently ∼ Unif(0, 100).

The Sampler: we use a Gaussian jumping rule, centered at the current sample θ(t), with standard deviations equal to 0.08 and correlation zero.
Simulated Data
2288 counts were simulated with α = 5.0 and β = 1.69.
[Figure: simulated counts versus energy (keV); the red curve shows the expected counts.]
Markov Chain Trace Plots: Time Series Plot for Metropolis Draws
[Figure: trace plots of alpha and beta over 500 iterations.]
Chains “stick” at a particular draw when proposals are rejected.
The Joint Posterior Distribution
[Figure: scatter plot of the joint posterior distribution of (alpha, beta).]
Marginal Posterior Dist’n of the Normalization
[Figure: autocorrelation function for alpha, and a histogram of 500 draws (excluding burn-in) with the posterior density overlaid.]
E(α|Y ) ≈ 5.13, SD(α|Y ) ≈ 0.11, and a 95% CI is (4.92,5.41)
Marginal Posterior Dist’n of Power Law Param
[Figure: autocorrelation function for beta, and a histogram of 500 draws (excluding burn-in) with the posterior density overlaid.]
E(β|Y ) ≈ 1.71, SD(β|Y ) ≈ 0.03, and a 95% CI is (1.65,1.76)
The Metropolis-Hastings Sampler
A more general jumping rule:

Draw θ(0) from some starting distribution.

For t = 1, 2, 3, . . .
Sample: θ∗ from Jt(θ∗ | θ(t−1))
Compute: r = [p(θ∗ | y) / Jt(θ∗ | θ(t−1))] / [p(θ(t−1) | y) / Jt(θ(t−1) | θ∗)]
Set: θ(t) = θ∗ with probability min(r, 1), and θ(t) = θ(t−1) otherwise.

Note: Jt may be any jumping rule; it needn't be symmetric. The updated r corrects for bias in the jumping rule.
The Independence Sampler
Use an approximation to the posterior as the jumping rule:

Jt = Normal_d(MAP estimate, curvature-based variance matrix),

with

MAP estimate = argmax_θ p(θ | y)

Variance ≈ [ −∂² log p(θ | Y) / ∂θ ∂θᵀ ]^(−1)

Note: Jt(θ∗ | θ(t−1)) does not depend on θ(t−1).
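An independence sampler can be sketched as below. This is an illustrative example (assuming Python with NumPy; the Gamma(3, 1) target and the fixed normal jumping rule at its mode are my choices, not from the slides). The curvature-based sd at the mode θ = 2 is sqrt(2); following the slides' advice for skewed targets, the sd is inflated here to 2.0.

```python
import numpy as np

rng = np.random.default_rng(3)

def log_post(theta):
    # Log of the Gamma(3, 1) density, up to a constant, for theta > 0.
    if theta <= 0:
        return -np.inf
    return 2.0 * np.log(theta) - theta

mode, sd = 2.0, 2.0  # mode of Gamma(3,1); sd inflated from sqrt(2)

def log_jump(theta):
    # Log of the fixed N(mode, sd^2) jumping rule, up to a constant.
    return -0.5 * ((theta - mode) / sd) ** 2

def independence_sampler(n, theta0=2.0):
    theta = theta0
    chain = np.empty(n)
    for t in range(n):
        prop = mode + sd * rng.standard_normal()  # ignores current state
        # MH ratio with the proposal-density correction:
        log_r = (log_post(prop) - log_jump(prop)) \
                - (log_post(theta) - log_jump(theta))
        if np.log(rng.uniform()) < log_r:
            theta = prop
        chain[t] = theta
    return chain

chain = independence_sampler(20_000)
```

Because the proposal ignores the current state, a good approximation yields nearly independent draws; a poor one makes the chain stick for long stretches in regions the proposal rarely visits.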
The Independence Sampler
The Normal Approximation may not be adequate.
[Figure: a skewed f(theta) compared with its normal approximation.]
We can inflate the variance, or use a heavy-tailed distribution, e.g., a Lorentzian or a t.
Example of Independence Sampler
A simplified model for high-energy spectral analysis.
We can fit (α, β) with a general mode finder (e.g., Levenberg-Marquardt). This requires coding the likelihood (e.g., the Cash statistic), specifying starting values, etc. Base the choice of parameterization on the quality of the normal approximation.
The MLE is invariant to transformations; the variance matrix of a transform is computed via the delta method.
We can then use the jumping rule: Jt = Normal_2(MAP estimate, curvature-based variance matrix).
Markov Chain Trace Plots: Time Series Plot for Metropolis-Hastings Draws
[Figure: trace plots of alpha and beta over 500 iterations.]
Very little “sticking” here: the acceptance rate is 98.8%.
Marginal Posterior Dist’n of the Normalization
[Figure: autocorrelation function for alpha, and a histogram of 500 draws (excluding burn-in) with the posterior density overlaid.]
Autocorrelation is essentially zero: nearly independent sample!!
Marginal Posterior Dist’n of Power Law Param
[Figure: autocorrelation function for beta, and a histogram of 500 draws (excluding burn-in) with the posterior density overlaid.]
This result depends critically on access to a very good approximation to the posterior distribution.
Has this Chain Converged?
Image credit: Gelman (1995) In “MCMC in Practice” (Editors: Gilks, Richardson, and Spiegelhalter).
Comparing multiple chains can be informative!
Using Multiple Chains
[Figure: trace plots of logT for chains 1, 2, and 3 over 20000 iterations.]
Compare results of multiple chains to check convergence. Start the chains from distant points in parameter space. Run until they appear to give similar results... or until they find different solutions (multiple modes).
The Gelman and Rubin “R hat” Statistic
Consider M chains of length N: {ψ_nm, n = 1, . . . , N, m = 1, . . . , M}.

B = N/(M − 1) Σ_{m=1}^{M} (ψ̄_·m − ψ̄_··)²

W = (1/M) Σ_{m=1}^{M} s²_m, where s²_m = 1/(N − 1) Σ_{n=1}^{N} (ψ_nm − ψ̄_·m)²

Two estimates of Var(ψ | Y):
1. W: an underestimate of Var(ψ | Y) for any finite N.
2. var⁺(ψ | Y) = ((N − 1)/N) W + (1/N) B: an overestimate of Var(ψ | Y).

R̂ = sqrt( var⁺(ψ | Y) / W ) ↓ 1 as the chains converge.
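The statistic can be computed directly from the formulas above. A minimal sketch (assuming Python with NumPy; the synthetic "mixed" and "unmixed" chains are illustrative, not from the slides):

```python
import numpy as np

def r_hat(chains):
    """Gelman-Rubin statistic for an (M, N) array: M chains of length N."""
    chains = np.asarray(chains, dtype=float)
    M, N = chains.shape
    chain_means = chains.mean(axis=1)
    grand_mean = chain_means.mean()
    B = N / (M - 1) * np.sum((chain_means - grand_mean) ** 2)  # between
    W = chains.var(axis=1, ddof=1).mean()                      # within
    var_plus = (N - 1) / N * W + B / N   # overestimate of Var(psi | Y)
    return float(np.sqrt(var_plus / W))

rng = np.random.default_rng(4)
# Well-mixed: all four chains target the same distribution -> R hat near 1.
mixed = rng.standard_normal((4, 1000))
# Unmixed: chains centered at different locations -> R hat well above 1.
unmixed = mixed + np.array([[0.0], [3.0], [6.0], [9.0]])
```

Values of R̂ near 1 (e.g., below about 1.1) are commonly taken as consistent with convergence; large values flag chains that have not yet mixed.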
Choice of Jumping Rule with Random Walk Metropolis
Spectral Analysis: the effect of the jumping rule on burn-in of the power law parameter, for sigma = 0.005, 0.08, 0.4.

[Figure: three trace plots of the power law parameter over 1500 iterations. sigma = 0.005: acceptance rate 87.5%, lag-one autocorrelation 0.98. sigma = 0.08: acceptance rate 31.6%, lag-one autocorrelation 0.66. sigma = 0.4: acceptance rate 3.1%, lag-one autocorrelation 0.96.]
Higher Acceptance Rate is not Always Better!
[Figure: post burn-in trace plots for sigma = 0.005, 0.08, 0.4; the sigma = 0.4 chain has acceptance rate 3.1% and lag-one autocorrelation 0.96.]
Aim for a 20% (vectors) to 40% (scalars) acceptance rate.
Statistical Inference and Effective Sample Size
Point Estimate: h̄_n = (1/n) Σ_{t=1}^{n} h(θ(t)) (an estimate of E(h(θ) | x)!!)

Variance Estimate: Var(h̄_n) ≈ (σ²/n) (1 + ρ)/(1 − ρ), with (not var(θ)!!)

σ² = Var(h(θ)), estimated by σ̂² = 1/(n − 1) Σ_{t=1}^{n} [h(θ(t)) − h̄_n]²,

ρ = corr[h(θ(t)), h(θ(t−1))], estimated by

ρ̂ = Σ_{t=2}^{n} [h(θ(t)) − h̄_n][h(θ(t−1)) − h̄_n] / sqrt( Σ_{t=1}^{n−1} [h(θ(t)) − h̄_n]² · Σ_{t=2}^{n} [h(θ(t)) − h̄_n]² )

Interval Estimate: h̄_n ± t_d sqrt(Var(h̄_n)), with d = n (1 − ρ)/(1 + ρ) − 1.

The effective sample size is n (1 − ρ)/(1 + ρ).
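The lag-one approximation above can be computed in a few lines. A sketch (assuming Python with NumPy; the i.i.d. and AR(1) test chains are illustrative choices, not from the slides). Note this uses only the lag-one autocorrelation, as on the slide; more refined estimators sum autocorrelations over many lags.

```python
import numpy as np

def effective_sample_size(chain):
    """ESS ≈ n (1 - rho) / (1 + rho), rho = lag-one autocorrelation."""
    chain = np.asarray(chain, dtype=float)
    n = chain.size
    centered = chain - chain.mean()
    num = np.sum(centered[1:] * centered[:-1])
    den = np.sqrt(np.sum(centered[:-1] ** 2) * np.sum(centered[1:] ** 2))
    rho = num / den
    return float(n * (1 - rho) / (1 + rho))

rng = np.random.default_rng(5)
iid = rng.standard_normal(5000)   # independent draws: ESS close to n
ar = np.empty(5000)               # AR(1) with coefficient 0.9:
ar[0] = 0.0                       # rho ≈ 0.9, so ESS ≈ n * 0.1 / 1.9
for t in range(1, 5000):
    ar[t] = 0.9 * ar[t - 1] + rng.standard_normal()
```

For the AR(1) chain the 5000 correlated draws carry roughly the information of a few hundred independent ones.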
Illustration of the Effective Sample Size
Sample from N(0, 1) with random walk Metropolis with Jt = N(θ(t), σ).

What is the effective sample size here? And σ?

[Figure: trace plot of the Markov chain over 1000 iterations.]
Lag One Autocorrelation
Small Jumps versus Low Acceptance Rates
[Figure: lag-one autocorrelation as a function of log(sigma).]
Effective Sample Size
Balancing the Trade-Off
[Figure: effective sample size as a function of log(sigma).]
Acceptance Rate
Bigger is not always Better!!
[Figure: acceptance rate as a function of log(sigma).]
High acceptance rates only come with small steps!!
Finding the Optimal Acceptance Rate
[Figure: effective sample size and acceptance rate, each plotted against log(sigma).]
Random Walk Metropolis with High Correlation
A whole new set of issues arise in higher dimensions...
The tradeoff between high autocorrelation and a high rejection rate is more acute with high posterior correlations, and more acute with a high dimensional parameter.
[Figure: scatter plot of a highly correlated bivariate distribution.]
In principle we can use a correlated jumping rule, but the desired correlation may vary, and it is often difficult to compute in advance.
[Figure: the correlated distribution again, with a correlated jumping rule.]
What random walk jumping rule would you use here?
[Figure: scatter plot of a distribution with strong non-linear structure.]
Remember: you don’t get to see the distribution in advance!
Parameters on Different Scales
Random Walk Metropolis for Spectral Analysis:
[Figure: scatter plot of the (alpha, beta) posterior, plus the autocorrelation function for alpha and a histogram of 500 draws (excluding burn-in).]
Why is the mixing so poor?
Parameters on Different Scales
Consider the Scales of α and β:
[Figure: scatter plots of the (alpha, beta) posterior on two different axis scalings.]
A new jumping rule: std dev for α = 0.110, for β = 0.026, and corr = −0.216.
Improved Convergence
Original Jumping Rule:
[Figure: autocorrelation for alpha and a histogram of 500 draws under the original jumping rule.]
Improved Convergence
Improved Jumping Rule:
[Figure: autocorrelation for alpha and a histogram of 500 draws under the improved jumping rule.]
Original effective sample size = 19; improved effective sample size = 75, with n = 500.
uci
BackgroundBasic MCMC Jumping Rules
Practical Challenges and AdviceOverview of Recommended Strategy
Diagnosing ConvergenceChoosing a Jumping RuleTransformations and Multiple Modes
Parameters on Different Scales
Strategy: when using Normal(θ(t−1), kM), or better yet t_df(θ(t−1), kM), try using the variance-covariance matrix from a standard fitted model for M... at least when standard mode-based model-fitting software is available.
Transforming to Normality
Parameter transformations can greatly improve MCMC.
Recall the Independence Sampler:
[Figure: the skewed f(theta) and its normal approximation.]
The normal approximation is not as good as we might hope...
But if we use the square root of θ:
[Figure: the density of theta alongside the density of sqrt(theta).]
Transforming to Normality
And...
[Figure: f(theta) and the density of sqrt(theta), each compared with a normal approximation.]
The normal approximation is much improved!
Transforming to Normality
Working with Gaussian or symmetric distributions leads to more efficient Metropolis and Metropolis-Hastings samplers.
General Strategy:
- Transform to the real line.
- Take the log of positive parameters.
- If the log is "too strong", try the square root.
- Probabilities can be transformed via the logit transform: log(p/(1 − p)).
- More complex transformations exist for other quantities.
- Try out various transformations using an initial MCMC run.
- There are statistical advantages to using normalizing transforms.
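A minimal Python illustration of the log transform for a positive parameter (not from the slides; the Gamma-shaped posterior is hypothetical). Note the Jacobian term that the change of variables φ = log θ introduces:

```python
import numpy as np

def log_post_theta(theta):
    # Hypothetical positive-parameter posterior: Gamma(shape=3, rate=1),
    # so log density = 2*log(theta) - theta up to a constant.
    return 2.0 * np.log(theta) - theta

def log_post_phi(phi):
    # Density of phi = log(theta); the Jacobian d(theta)/d(phi) = e^phi
    # contributes an additive + phi on the log scale.
    return log_post_theta(np.exp(phi)) + phi

def metropolis(log_post, x0, step, n_iter, seed=0):
    """Plain random-walk Metropolis on a scalar parameter."""
    rng = np.random.default_rng(seed)
    x, lp = x0, log_post(x0)
    out = np.empty(n_iter)
    for t in range(n_iter):
        prop = x + step * rng.normal()
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            x, lp = prop, lp_prop
        out[t] = x
    return out

phi_draws = metropolis(log_post_phi, 1.0, 1.0, 5000)
theta_draws = np.exp(phi_draws)     # back-transform to the original scale
```

The chain runs on the whole real line, so a symmetric normal jumping rule fits the transformed target far better than it fits the skewed original.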
Removing Linear Correlations
Linear transformations can remove linear correlations
[Figure: scatterplots of y against x, and of y − 3x against x; the linear correlation is removed.]
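The y − 3x transformation on this slide can be sketched in Python; the simulated data and noise level are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = 3.0 * x + 0.2 * rng.normal(size=5000)      # strong linear correlation

b = np.cov(x, y)[0, 1] / np.var(x, ddof=1)     # least-squares slope (about 3)
z = y - b * x                                   # decorrelated coordinate

print(np.corrcoef(x, y)[0, 1])  # near 1
print(np.corrcoef(x, z)[0, 1])  # near 0
```

Sampling (x, z) instead of (x, y) lets a jumping rule with independent components cover the target efficiently; transforming the draws back afterward is trivial.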
Removing Linear Correlations
... and can help with non-linear correlations.
[Figure: scatterplots of y against x with a nonlinear relationship, and of y − 18x against x.]
Multiple Modes
- Scientific meaning of multiple modes.
- Do not focus only on the major mode!
- "Important" modes.
- Challenging for Bayesian and frequentist methods.
- Consider Metropolis & Metropolis-Hastings.
- Value of excess dispersion.
[Figure: contour plot of a bimodal distribution in (x, y).]
Multiple Modes
1 Use a mode finder to "map out" the posterior distribution.
   1 Design a jumping rule that accounts for all of the modes.
   2 Run separate chains for each mode.
2 Use one of several sophisticated methods tailored for multiple modes.
   1 Adaptive Metropolis-Hastings: the jumping rule adapts when new modes are found (van Dyk & Park, MCMC Handbook 2011).
   2 Parallel tempering.
   3 Many other specialized methods.
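Parallel tempering can be sketched as follows (Python; the bimodal mixture target and temperature ladder are assumptions, not from the slides). Hotter chains flatten the target and cross between modes; occasional state swaps let the cold chain inherit those crossings:

```python
import numpy as np

def log_post(x):
    # Hypothetical bimodal target: equal mixture of N(-3, 1) and N(3, 1)
    return np.logaddexp(-0.5 * (x + 3) ** 2, -0.5 * (x - 3) ** 2)

def parallel_tempering(log_post, betas=(1.0, 0.5, 0.2), n_iter=20000,
                       step=1.0, seed=0):
    """One Metropolis chain per inverse temperature beta; propose swaps
    between neighbours; return draws from the cold (beta = 1) chain."""
    rng = np.random.default_rng(seed)
    x = np.zeros(len(betas))
    lp = np.array([log_post(xi) for xi in x])
    cold = np.empty(n_iter)
    for t in range(n_iter):
        # within-chain updates: chain i targets pi(x)^beta_i
        for i, beta in enumerate(betas):
            prop = x[i] + step / beta * rng.normal()
            lpp = log_post(prop)
            if np.log(rng.uniform()) < beta * (lpp - lp[i]):
                x[i], lp[i] = prop, lpp
        # propose swapping the states of a random neighbouring pair
        i = rng.integers(len(betas) - 1)
        log_r = (betas[i] - betas[i + 1]) * (lp[i + 1] - lp[i])
        if np.log(rng.uniform()) < log_r:
            x[i], x[i + 1] = x[i + 1], x[i]
            lp[i], lp[i + 1] = lp[i + 1], lp[i]
        cold[t] = x[0]
    return cold

draws = parallel_tempering(log_post)
# the cold chain visits both modes, so its mean is near zero
```

A plain random-walk Metropolis chain with the same step size would typically stick in one of the two modes for very long stretches.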
Outline
1 Background: Bayesian Statistics; Monte Carlo Integration; Markov Chains
2 Basic MCMC Jumping Rules: Metropolis Sampler; Metropolis-Hastings Sampler
3 Practical Challenges and Advice: Diagnosing Convergence; Choosing a Jumping Rule; Transformations and Multiple Modes
4 Overview of Recommended Strategy
Overview of Recommended Strategy
(Adapted from Bayesian Data Analysis, Section 11.10, Gelman et al. (2005), Second Edition)
1 Start with a crude approximation to the posterior distribution, perhaps using a mode finder.
2 Simulate directly, avoiding MCMC, if possible.
3 If necessary, use MCMC, updating one parameter at a time or updating parameters in batches.
Two-Step Gibbs Sampler:
Step 1: Sample θ^(t) ~ p(θ | φ^(t−1), Y)
Step 2: Sample φ^(t) ~ p(φ | θ^(t), Y)
4 Use Gibbs draws for closed-form complete conditionals.
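The two-step Gibbs sampler above can be sketched in Python for a simple case (an assumed example, not from the slides): y_i ~ Normal(μ, σ²) with the standard noninformative prior p(μ, σ²) ∝ 1/σ², taking θ = μ and φ = σ². Both complete conditionals are in closed form:

```python
import numpy as np

def gibbs_normal(y, n_iter=5000, seed=0):
    """Two-step Gibbs sampler for y_i ~ Normal(mu, sigma2) with a flat
    prior on mu and p(sigma2) proportional to 1/sigma2."""
    rng = np.random.default_rng(seed)
    n, ybar = len(y), np.mean(y)
    mu, sigma2 = ybar, np.var(y)          # crude starting values
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # Step 1: mu | sigma2, y ~ Normal(ybar, sigma2 / n)
        mu = rng.normal(ybar, np.sqrt(sigma2 / n))
        # Step 2: sigma2 | mu, y ~ Inv-Gamma(n/2, sum((y - mu)^2) / 2)
        ss = np.sum((y - mu) ** 2)
        sigma2 = ss / (2.0 * rng.gamma(n / 2.0))
        draws[t] = mu, sigma2
    return draws

y = np.random.default_rng(42).normal(5.0, 2.0, size=200)
draws = gibbs_normal(y)
```

Each step samples exactly from a complete conditional, so no accept/reject decision is needed and every draw is used.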
Overview of Recommended Strategy, Cont'd
5 Use Metropolis jumps if a complete conditional is not in closed form. Tune the variance of the jumping distribution so that acceptance rates are near 20% (for vector updates) or 40% (for single-parameter updates).
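A quick way to tune is to measure the acceptance rate at a few candidate scales. This Python sketch (standard normal target, an assumption for illustration) shows the pattern:

```python
import numpy as np

def acceptance_rate(log_post, step, n_iter=5000, seed=0):
    """Fraction of accepted random-walk Metropolis proposals at a given step."""
    rng = np.random.default_rng(seed)
    x, lp, accepts = 0.0, log_post(0.0), 0
    for _ in range(n_iter):
        prop = x + step * rng.normal()
        lpp = log_post(prop)
        if np.log(rng.uniform()) < lpp - lp:
            x, lp, accepts = prop, lpp, accepts + 1
    return accepts / n_iter

log_post = lambda x: -0.5 * x * x        # standard normal target
for step in (0.1, 2.4, 20.0):
    print(step, acceptance_rate(log_post, step))
# tiny steps accept almost always but move slowly; huge steps rarely accept;
# a step near 2.4 standard deviations should land near the 40% target
```

In practice one adjusts the scale during a pilot run until the measured rate sits near the targets above, then fixes it for the final run.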
6 To improve convergence, use transformations so that parameters are approximately independent.
7 Check for convergence using multiple chains.
8 Compare inference based on the crude approximation and the MCMC. If they are not similar, check for errors before believing the results of the MCMC.