MCMC Sampling for Dummies


While My MCMC Gently Samples (http://twiecki.github.com/)

Bayesian modeling, Computational Psychiatry, and Python


MCMC sampling for dummies

Nov 10, 2015

When I give talks about probabilistic programming and Bayesian statistics, I usually gloss over the details of how inference is actually performed, treating it as a black box essentially. The beauty of probabilistic programming is that you actually don't have to understand how the inference works in order to build models, but it certainly helps.

When I presented a new Bayesian model to Quantopian's (https://quantopian.com) CEO, Fawce (https://quantopian.com/about), who wasn't trained in Bayesian stats but is eager to understand it, he started to ask about the part I usually gloss over: "Thomas, how does the inference actually work? How do we get these magical samples from the posterior?"

Now I could have said: "Well that's easy, MCMC generates samples from the posterior distribution by constructing a reversible Markov chain that has as its equilibrium distribution the target posterior distribution. Questions?"

That statement is correct, but is it useful? My pet peeve with how math and stats are taught is that no one ever tells you about the intuition behind the concepts (which is usually quite simple) but only hands you some scary math. This is certainly the way I was taught and I had to spend countless hours banging my head against the wall until that eureka moment came about. Usually things weren't as scary or seemingly complex once I deciphered what it meant.

This blog post is an attempt at explaining the intuition behind MCMC sampling (specifically, the Metropolis algorithm (https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm)). Critically, we'll be using code examples rather than formulas or math-speak. Eventually you'll need that but I personally think it's better to start with an example and build the intuition before you move on to the math.

The problem and its unintuitive solution

Let's take a look at Bayes' formula (https://en.wikipedia.org/wiki/Bayes%27_theorem):

$$P(\theta|x) = \frac{P(x|\theta) P(\theta)}{P(x)}$$

We have $P(\theta|x)$, the probability of our model parameters $\theta$ given the data $x$ and thus our quantity of interest. To compute this we multiply the prior $P(\theta)$ (what we think about $\theta$ before we have seen any data) and the likelihood $P(x|\theta)$, i.e. how we think our data is distributed. This numerator is pretty easy to solve for.

However, let's take a closer look at the denominator, $P(x)$, which is also called the evidence (i.e. the evidence that the data $x$ was generated by this model). We can compute this quantity by integrating over all possible parameter values:

$$P(x) = \int P(x, \theta) \, d\theta$$

This is the key difficulty with Bayes' formula -- while the formula looks innocent enough, for even slightly non-trivial models you just can't compute the posterior in a closed-form way.

Now we might say "OK, if we can't solve something, could we try to approximate it? For example, if we could somehow draw samples from that posterior we can Monte Carlo approximate (https://en.wikipedia.org/wiki/Monte_Carlo_method) it." Unfortunately, to directly sample from that distribution you not only have to solve Bayes' formula, but also invert it, so that's even harder.

Then we might say "Well, instead let's construct a Markov chain whose equilibrium distribution matches our posterior distribution". I'm just kidding, most people wouldn't say that as it sounds bat-shit crazy. If you can't compute it, can't sample from it, then constructing that Markov chain with all these properties must be even harder.

The surprising insight though is that this is actually very easy and there exists a general class of algorithms that do this called Markov chain Monte Carlo (https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo) (constructing a Markov chain to do Monte Carlo approximation).

Setting up the problem

First, let's import our modules.

In [1]: %matplotlib inline

import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import norm

sns.set_style('white')
sns.set_context('talk')

np.random.seed(123)

Let's generate some data: 20 points from a normal centered around zero. Our goal will be to estimate the posterior of the mean mu (we'll assume that we know the standard deviation to be 1).

In [2]: data = np.random.randn(20)

In [3]: ax = plt.subplot()
sns.distplot(data, kde=False, ax=ax)
_ = ax.set(title='Histogram of observed data', xlabel='x', ylabel='# observations');

Next, we have to define our model. In this simple case, we will assume that this data is normally distributed, i.e. the likelihood of the model is normal. As you know, a normal distribution has two parameters -- mean $\mu$ and standard deviation $\sigma$. For simplicity, we'll assume we know that $\sigma = 1$ and we'll want to infer the posterior for $\mu$. For each parameter we want to infer, we have to choose a prior. For simplicity, let's also assume a Normal distribution as a prior for $\mu$. Thus, in stats speak our model is:

$$\mu \sim \text{Normal}(0, 1)$$

$$x|\mu \sim \text{Normal}(x; \mu, 1)$$

What is convenient is that for this model, we actually can compute the posterior analytically. That's because for a normal likelihood with known standard deviation, the normal prior for mu is conjugate (https://en.wikipedia.org/wiki/Conjugate_prior) (conjugate here means that our posterior will follow the same distribution as the prior), so we know that our posterior for $\mu$ is also normal. We can easily look up on Wikipedia how we can compute the parameters of the posterior. For a mathematical derivation of this, see here (http://www.bcs.rochester.edu/people/robbie/jacobslab/cheat_sheet/bayes_normal_normal.pdf).

In [4]: def calc_posterior_analytical(data, x, mu_0, sigma_0):
    sigma = 1.
    n = len(data)
    # Posterior mean: precision-weighted combination of prior mean and data
    mu_post = (mu_0 / sigma_0**2 + data.sum() / sigma**2) / (1. / sigma_0**2 + n / sigma**2)
    # Posterior variance: inverse of the summed precisions
    sigma_post = (1. / sigma_0**2 + n / sigma**2)**-1
    return norm(mu_post, np.sqrt(sigma_post)).pdf(x)

ax = plt.subplot()
x = np.linspace(-1, 1, 500)
posterior_analytical = calc_posterior_analytical(data, x, 0., 1.)
ax.plot(x, posterior_analytical)
ax.set(xlabel='mu', ylabel='belief', title='Analytical posterior');
sns.despine()

This shows our quantity of interest, the probability of $\mu$'s values after having seen the data, taking our prior information into account. Let's assume, however, that our prior wasn't conjugate and we couldn't solve this by hand, which is usually the case.
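As a sanity check on the analytical result, here is a minimal sketch that normalizes prior times likelihood by brute force on a grid and compares it to the conjugate formula. This only works because the model has a single parameter; the names `data_demo` and `mu_grid` are made up for this illustration, and `data_demo` is a fresh stand-in draw rather than the exact data from the notebook.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(123)
data_demo = rng.standard_normal(20)  # stand-in for the 20 observations (assumption)

mu_grid = np.linspace(-3, 3, 6001)
d_mu = mu_grid[1] - mu_grid[0]

# Numerator of Bayes' formula at every grid point: prior * likelihood
prior = norm(0, 1).pdf(mu_grid)
likelihood = norm.pdf(data_demo[:, None], loc=mu_grid[None, :], scale=1).prod(axis=0)
unnormalized = prior * likelihood

# Evidence P(x): integrate the numerator over mu with a simple Riemann sum
evidence = unnormalized.sum() * d_mu
posterior_grid = unnormalized / evidence

# Conjugate result for prior Normal(0, 1) and known sigma = 1
n = len(data_demo)
var_post = 1.0 / (1.0 + n)
mu_post = data_demo.sum() * var_post
posterior_exact = norm(mu_post, np.sqrt(var_post)).pdf(mu_grid)

max_abs_diff = np.abs(posterior_grid - posterior_exact).max()
print("largest difference between grid and analytical posterior:", max_abs_diff)
```

The grid trick stops being practical quickly: with d parameters and 6001 points per axis you would need 6001**d evaluations, which is why we want a sampling approach.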

Explaining MCMC sampling with code

Now on to the sampling logic. First, you pick a starting parameter position (it can be randomly chosen). Let's fix it arbitrarily to:

mu_current = 1.

Then, you propose to move (jump) from that position somewhere else (that's the Markov part). You can be very dumb or very sophisticated about how you come up with that proposal. The Metropolis sampler is very dumb and just takes a sample from a normal distribution (no relationship to the normal we assume for the model) centered around your current mu value (i.e. mu_current) with a certain standard deviation (proposal_width) that will determine how far you propose jumps (here we use scipy.stats.norm):

mu_proposal = norm(mu_current, proposal_width).rvs()

Next, you evaluate whether that's a good place to jump to or not. If the resulting normal distribution with that proposed mu explains the data better than your old mu, you'll definitely want to go there. What does "explains the data better" mean? We quantify fit by computing the probability of the data, given the likelihood (normal) with the proposed parameter values (proposed mu and a fixed sigma = 1). This can easily be computed by calculating the probability for each data point using scipy.stats.norm(mu, sigma).pdf(data) and then multiplying the individual probabilities, i.e. compute the likelihood (usually you would use log probabilities but we omit this here):

likelihood_current = norm(mu_current, 1).pdf(data).prod()

likelihood_proposal = norm(mu_proposal, 1).pdf(data).prod()

# Compute prior probability of current and proposed mu

prior_current = norm(mu_prior_mu, mu_prior_sd).pdf(mu_current)

prior_proposal = norm(mu_prior_mu, mu_prior_sd).pdf(mu_proposal)

# Numerator of Bayes' formula

p_current = likelihood_current * prior_current

p_proposal = likelihood_proposal * prior_proposal

Up until now, we essentially have a hill-climbing algorithm that would just propose movements into random directions and only accept a jump if the mu_proposal has a higher p_proposal than the p_current of mu_current. Eventually we'll get to mu = 0 (or close to it) from where no more moves will be possible. However, we want to get a posterior so we'll also have to sometimes accept moves into the other direction. The key trick is that by dividing the two probabilities,

p_accept = p_proposal / p_current

we get an acceptance probability. You can already see that if p_proposal is larger, that probability will be > 1 and we'll definitely accept. However, if p_current is larger, say twice as large, there'll be a 50% chance of moving there:

accept = np.random.rand() < p_accept

if accept:
    # Update position
    mu_current = mu_proposal

This simple procedure gives us samples from the posterior.
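The text above mentions that you would usually work with log probabilities. Here is a hedged sketch of what a single Metropolis step could look like in log space, for the same known-sigma model. The function names `log_unnormalized_posterior` and `metropolis_step_log` are invented for this illustration and are not part of the original notebook.

```python
import numpy as np
from scipy.stats import norm

def log_unnormalized_posterior(mu, data, mu_prior_mu=0.0, mu_prior_sd=1.0):
    """log(likelihood * prior) for the model with known sigma = 1."""
    log_likelihood = norm(mu, 1).logpdf(data).sum()   # sum of logs replaces the product
    log_prior = norm(mu_prior_mu, mu_prior_sd).logpdf(mu)
    return log_likelihood + log_prior

def metropolis_step_log(mu_current, data, proposal_width, rng):
    """One Metropolis step; returns the new position and whether we moved."""
    mu_proposal = rng.normal(mu_current, proposal_width)
    log_ratio = (log_unnormalized_posterior(mu_proposal, data)
                 - log_unnormalized_posterior(mu_current, data))
    # u < p_proposal / p_current  is the same test as  log(u) < log_ratio
    accept = np.log(rng.uniform()) < log_ratio
    return (mu_proposal if accept else mu_current), accept

rng = np.random.default_rng(0)
big_data = rng.standard_normal(2000)

# With many data points the plain product underflows, the sum of logs does not
naive_product = norm(0.0, 1).pdf(big_data).prod()
log_version = norm(0.0, 1).logpdf(big_data).sum()
print(naive_product, log_version)

mu_next, accepted = metropolis_step_log(1.0, big_data[:20], 0.5, rng)
```

With only 20 data points the product version in this post works fine; the log version matters once the product of many small densities gets too small for floating point.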

Why does this make sense?

Taking a step back, note that the above acceptance ratio is the reason this whole thing works out and we get around the integration. We can show this by computing the acceptance ratio over the normalized posterior and seeing how it's equivalent to the acceptance ratio of the unnormalized posterior (let's say $\mu_0$ is our current position, and $\mu$ is our proposal):

$$\frac{\frac{P(x|\mu) P(\mu)}{P(x)}}{\frac{P(x|\mu_0) P(\mu_0)}{P(x)}} = \frac{P(x|\mu) P(\mu)}{P(x|\mu_0) P(\mu_0)}$$

In words, when dividing the posterior of the proposed parameter setting by the posterior of the current parameter setting, $P(x)$ -- that nasty quantity we can't compute -- gets canceled out. So you can intuit that we're actually dividing the full posterior at one position by the full posterior at another position (no magic here). That way, we are visiting regions of high posterior probability relatively more often than those of low posterior probability.
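To see the cancellation numerically, here is a small sketch for our conjugate model, where the normalized posterior happens to be known. It compares the ratio of normalized posterior densities with the ratio of prior times likelihood. The data `x_obs` and the two positions `mu_0` and `mu` are arbitrary values picked for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x_obs = rng.standard_normal(20)  # illustrative data, not the notebook's

def unnormalized(mu):
    """Likelihood times prior: the numerator of Bayes' formula."""
    return norm(mu, 1).pdf(x_obs).prod() * norm(0, 1).pdf(mu)

# Normalized posterior from the conjugate formula (prior Normal(0, 1), sigma = 1)
n = len(x_obs)
var_post = 1.0 / (1.0 + n)
mu_post = x_obs.sum() * var_post
posterior = norm(mu_post, np.sqrt(var_post))

mu_0, mu = 0.3, 0.1  # current position and proposal
ratio_normalized = posterior.pdf(mu) / posterior.pdf(mu_0)
ratio_unnormalized = unnormalized(mu) / unnormalized(mu_0)
print(ratio_normalized, ratio_unnormalized)
```

Both numbers agree up to floating point error, even though the second one never touches $P(x)$.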

Putting it all together

In [5]: def sampler(data, samples=4, mu_init=.5, proposal_width=.5, plot=False, mu_prior_mu=0, mu_prior_sd=1.):
    mu_current = mu_init
    posterior = [mu_current]
    for i in range(samples):
        # suggest new position
        mu_proposal = norm(mu_current, proposal_width).rvs()

        # Compute likelihood by multiplying probabilities of each data point
        likelihood_current = norm(mu_current, 1).pdf(data).prod()
        likelihood_proposal = norm(mu_proposal, 1).pdf(data).prod()

        # Compute prior probability of current and proposed mu
        prior_current = norm(mu_prior_mu, mu_prior_sd).pdf(mu_current)
        prior_proposal = norm(mu_prior_mu, mu_prior_sd).pdf(mu_proposal)

        p_current = likelihood_current * prior_current
        p_proposal = likelihood_proposal * prior_proposal

        # Accept proposal?
        p_accept = p_proposal / p_current

        # p_accept can exceed 1, in which case the proposal is always accepted
        accept = np.random.rand() < p_accept

        if plot:
            plot_proposal(mu_current, mu_proposal, mu_prior_mu, mu_prior_sd, data, accept, posterior, i)

        if accept:
            # Update position
            mu_current = mu_proposal

        posterior.append(mu_current)

    return posterior

# Function to display
def plot_proposal(mu_current, mu_proposal, mu_prior_mu, mu_prior_sd, data, accepted, trace, i):
    from copy import copy
    trace = copy(trace)
    fig, (ax1, ax2, ax3, ax4) = plt.subplots(ncols=4, figsize=(16, 4))
    fig.suptitle('Iteration %i' % (i + 1))
    x = np.linspace(-3, 3, 5000)
    color = 'g' if accepted else 'r'

    # Plot prior
    prior_current = norm(mu_prior_mu, mu_prior_sd).pdf(mu_current)
    prior_proposal = norm(mu_prior_mu, mu_prior_sd).pdf(mu_proposal)
    prior = norm(mu_prior_mu, mu_prior_sd).pdf(x)
    ax1.plot(x, prior)
    ax1.plot([mu_current] * 2, [0, prior_current], marker='o', color='b')
    ax1.plot([mu_proposal] * 2, [0, prior_proposal], marker='o', color=color)
    ax1.annotate("", xy=(mu_proposal, 0.2), xytext=(mu_current, 0.2),
                 arrowprops=dict(arrowstyle="->", lw=2.))
    ax1.set(ylabel='Probability Density', title='current: prior(mu=%.2f) = %.2f\nproposal: prior(mu=%.2f) = %.2f' % (mu_current, prior_current, mu_proposal, prior_proposal))

    # Likelihood
    likelihood_current = norm(mu_current, 1).pdf(data).prod()
    likelihood_proposal = norm(mu_proposal, 1).pdf(data).prod()
    y = norm(loc=mu_proposal, scale=1).pdf(x)
    sns.distplot(data, kde=False, norm_hist=True, ax=ax2)
    ax2.plot(x, y, color=color)
    ax2.axvline(mu_current, color='b', linestyle='--', label='mu_current')
    ax2.axvline(mu_proposal, color=color, linestyle='--', label='mu_proposal')
    #ax2.title('Proposal {}'.format('accepted' if accepted else 'rejected'))
    ax2.annotate("", xy=(mu_proposal, 0.2), xytext=(mu_current, 0.2),
                 arrowprops=dict(arrowstyle="->", lw=2.))
    ax2.set(title='likelihood(mu=%.2f) = %.2f\nlikelihood(mu=%.2f) = %.2f' % (mu_current, 1e14*likelihood_current, mu_proposal, 1e14*likelihood_proposal))

    # Posterior
    posterior_analytical = calc_posterior_analytical(data, x, mu_prior_mu, mu_prior_sd)
    ax3.plot(x, posterior_analytical)
    posterior_current = calc_posterior_analytical(data, mu_current, mu_prior_mu, mu_prior_sd)
    posterior_proposal = calc_posterior_analytical(data, mu_proposal, mu_prior_mu, mu_prior_sd)
    ax3.plot([mu_current] * 2, [0, posterior_current], marker='o', color='b')
    ax3.plot([mu_proposal] * 2, [0, posterior_proposal], marker='o', color=color)
    ax3.annotate("", xy=(mu_proposal, 0.2), xytext=(mu_current, 0.2),
                 arrowprops=dict(arrowstyle="->", lw=2.))
    #ax3.set(title=r'prior x likelihood $\propto$ posterior')
    ax3.set(title='posterior(mu=%.2f) = %.5f\nposterior(mu=%.2f) = %.5f' % (mu_current, posterior_current, mu_proposal, posterior_proposal))

    # Trace, including the outcome of this iteration
    if accepted:
        trace.append(mu_proposal)
    else:
        trace.append(mu_current)
    ax4.plot(trace)
    ax4.set(xlabel='iteration', ylabel='mu', title='trace')
    plt.tight_layout()
    #plt.legend()

Visualizing MCMC

To visualize the sampling, we'll create plots for some quantities that are computed. Each row below is a single iteration through our Metropolis sampler.

The first column is our prior distribution -- what our belief about $\mu$ is before seeing the data. You can see how the distribution is static and we only plug in our $\mu$ proposals. The vertical lines represent our current $\mu$ in blue and our proposed $\mu$ in either red or green (rejected or accepted, respectively).

The 2nd column is our likelihood and what we are using to evaluate how well our model explains the data. You can see that the likelihood function changes in response to the proposed $\mu$. The blue histogram is our data. The solid line in green or red is the likelihood with the currently proposed mu. Intuitively, the more overlap there is between likelihood and data, the better the model explains the data and the higher the resulting probability will be. The dotted line of the same color is the proposed mu and the dotted blue line is the current mu.

The 3rd column is our posterior distribution. Here I am displaying the normalized posterior but as we found out above, we can just multiply the prior value for the current and proposed $\mu$'s by the likelihood value for the two $\mu$'s to get the unnormalized posterior values (which we use for the actual computation), and divide one by the other to get our acceptance probability.

The 4th column is our trace (i.e. the posterior samples of $\mu$ we're generating) where we store each sample irrespective of whether it was accepted or rejected (in which case the line just stays constant).

Note that we always move to relatively more likely values (in terms of their posterior density), but only sometimes to relatively less likely values, as can be seen in iteration 14 (the iteration number can be found at the top center of each row).

In [6]: np.random.seed(123)
sampler(data, samples=8, mu_init=1., plot=True);

Now the magic of MCMC is that you just have to do that for a long time, and the samples that are generated in this way come from the posterior distribution of your model. There is a rigorous mathematical proof that guarantees this, which I won't go into in detail here.

To get a sense of what this produces, let's draw a lot of samples and plot them.

In [7]: posterior = sampler(data, samples=15000, mu_init=1.)

fig, ax = plt.subplots()
ax.plot(posterior)
_ = ax.set(xlabel='sample', ylabel='mu');

This is usually called the trace. To now get an approximation of the posterior (the reason why we're doing all this), we simply take the histogram of this trace. It's important to keep in mind that although this looks similar to the data we sampled above to fit the model, the two are completely separate. The plot below represents our belief in mu. In this case it just happens to also be normal but for a different model, it could have a completely different shape than the likelihood or prior.

In [8]: ax = plt.subplot()

sns.distplot(posterior[500:], ax=ax, label='estimated posterior')
x = np.linspace(-.5, .5, 500)
post = calc_posterior_analytical(data, x, 0, 1)
ax.plot(x, post, 'g', label='analytic posterior')
_ = ax.set(xlabel='mu', ylabel='belief');
ax.legend();

As you can see, by following the above procedure, we get samples from the same distribution as what we derived analytically.
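Besides eyeballing the histogram, we can also compare summary statistics. The following is a self-contained sketch, not the notebook's `sampler`: it uses a compact log-space variant (called `metropolis_mu` here, a made-up name) and stand-in data, then checks the mean and standard deviation of the trace against the conjugate formulas.

```python
import numpy as np

def metropolis_mu(data, samples, mu_init=1.0, proposal_width=0.5, seed=0):
    """Metropolis sampler for mu with prior Normal(0, 1) and known sigma = 1."""
    rng = np.random.default_rng(seed)

    def log_post(mu):
        # log likelihood + log prior, dropping constants that cancel in the ratio
        return -0.5 * np.sum((data - mu) ** 2) - 0.5 * mu ** 2

    mu_current = mu_init
    lp_current = log_post(mu_current)
    trace = np.empty(samples + 1)
    trace[0] = mu_current
    for i in range(samples):
        mu_proposal = rng.normal(mu_current, proposal_width)
        lp_proposal = log_post(mu_proposal)
        if np.log(rng.uniform()) < lp_proposal - lp_current:
            mu_current, lp_current = mu_proposal, lp_proposal
        trace[i + 1] = mu_current
    return trace

data_demo = np.random.default_rng(123).standard_normal(20)  # stand-in data (assumption)
trace = metropolis_mu(data_demo, samples=15000)
kept = trace[500:]  # drop the first 500 samples, as in the plot above

# Analytical posterior for comparison
n = len(data_demo)
sd_post = np.sqrt(1.0 / (1.0 + n))
mu_post = data_demo.sum() / (1.0 + n)

print("mean: sampled %.3f vs analytical %.3f" % (kept.mean(), mu_post))
print("sd:   sampled %.3f vs analytical %.3f" % (kept.std(), sd_post))
```

The sampled values will not match exactly, since this is a Monte Carlo estimate, but they should get closer as the number of samples grows.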

Proposal width

Above we set the proposal width to 0.5. That turned out to be a pretty good value. In general you don't want the width to be too narrow because your sampling will be inefficient as it takes a long time to explore the whole parameter space and shows the typical random-walk behavior:

In [9]: posterior_small = sampler(data, samples=5000, mu_init=1., proposal_width=.01)
fig, ax = plt.subplots()
ax.plot(posterior_small);
_ = ax.set(xlabel='sample', ylabel='mu');

But you also don't want it to be so large that you never accept a jump:

In [10]: posterior_large = sampler(data, samples=5000, mu_init=1., proposal_width=3.)
fig, ax = plt.subplots()
ax.plot(posterior_large); plt.xlabel('sample'); plt.ylabel('mu');
_ = ax.set(xlabel='sample', ylabel='mu');

Note, however, that we are still sampling from our target posterior distribution here as guaranteed by the mathematical proof, just less efficiently:

In [11]: sns.distplot(posterior_small[1000:], label='Small step size')
sns.distplot(posterior_large[1000:], label='Large step size');
_ = plt.legend();

With more samples this will eventually look like the true posterior. The key is that we want our samples to be independent of each other, which clearly isn't the case here. Thus, one common metric to evaluate the efficiency of our sampler is the autocorrelation -- i.e. how correlated a sample i is to sample i-1, i-2, etc:

In [12]: from pymc3.stats import autocorr
lags = np.arange(1, 100)
fig, ax = plt.subplots()
ax.plot(lags, [autocorr(posterior_large, l) for l in lags], label='large step size')
ax.plot(lags, [autocorr(posterior_small, l) for l in lags], label='small step size')
ax.plot(lags, [autocorr(posterior, l) for l in lags], label='medium step size')
ax.legend(loc=0)
_ = ax.set(xlabel='lag', ylabel='autocorrelation', ylim=(-.1, 1))
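If the `pymc3.stats` import is not available in your installation, a plain NumPy stand-in is easy to sketch. The helper `autocorr_lag` below is a made-up name and not necessarily identical to the PyMC3 function; it simply correlates the trace with a shifted copy of itself. The two synthetic series are only there to show what low and high autocorrelation look like.

```python
import numpy as np

def autocorr_lag(x, lag):
    """Sample autocorrelation of a 1-D trace at a positive integer lag."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

rng = np.random.default_rng(42)
n = 20000

# Independent draws: what an ideal sampler would give us
white = rng.standard_normal(n)

# A "sticky" chain where each value stays close to the previous one (AR(1) with phi = 0.9)
phi = 0.9
noise = rng.standard_normal(n)
sticky = np.empty(n)
sticky[0] = noise[0]
for t in range(1, n):
    sticky[t] = phi * sticky[t - 1] + noise[t]

print("lag-1 autocorrelation, independent draws:", autocorr_lag(white, 1))
print("lag-1 autocorrelation, sticky chain:     ", autocorr_lag(sticky, 1))
```

You could pass the traces from above (e.g. `posterior_small`) to `autocorr_lag` in place of the synthetic series.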

Obviously we want to have a smart way of figuring out the right step width automatically. One common method is to keep adjusting the proposal width so that roughly 50% of proposals are rejected.
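One very simple way to do such tuning can be sketched as follows: run the sampler in short batches, and after each batch widen the proposal if too many jumps were accepted and narrow it otherwise. This is an illustrative rule under my own assumptions (batch size, factor of 1.2, the names `tune_proposal_width` and `acceptance_rate`), not the scheme PyMC3 uses.

```python
import numpy as np

def log_post(mu, data):
    # log likelihood (sigma = 1) + log prior Normal(0, 1), constants dropped
    return -0.5 * np.sum((data - mu) ** 2) - 0.5 * mu ** 2

def run_batch(mu_current, data, width, n_steps, rng):
    """Run n_steps Metropolis steps; return final position and number accepted."""
    lp_current = log_post(mu_current, data)
    accepted = 0
    for _ in range(n_steps):
        mu_proposal = rng.normal(mu_current, width)
        lp_proposal = log_post(mu_proposal, data)
        if np.log(rng.uniform()) < lp_proposal - lp_current:
            mu_current, lp_current = mu_proposal, lp_proposal
            accepted += 1
    return mu_current, accepted

def tune_proposal_width(data, width=0.01, target=0.5, batches=50,
                        batch_size=200, factor=1.2, mu_init=1.0, seed=0):
    rng = np.random.default_rng(seed)
    mu_current = mu_init
    for _ in range(batches):
        mu_current, accepted = run_batch(mu_current, data, width, batch_size, rng)
        rate = accepted / batch_size
        # Accepting too often means the steps are too timid, so widen; else narrow
        width = width * factor if rate > target else width / factor
    return width, mu_current

def acceptance_rate(data, width, mu_init, n_steps=5000, seed=1):
    rng = np.random.default_rng(seed)
    _, accepted = run_batch(mu_init, data, width, n_steps, rng)
    return accepted / n_steps

data_demo = np.random.default_rng(123).standard_normal(20)  # stand-in data (assumption)
tuned_width, mu_start = tune_proposal_width(data_demo)
rate = acceptance_rate(data_demo, tuned_width, mu_start)
print("tuned width: %.3f, acceptance rate at that width: %.2f" % (tuned_width, rate))
```

Note that the width is only adjusted during a warm-up phase and then held fixed; the samples drawn while tuning should be thrown away, because changing the proposal on the fly means the chain is no longer a plain Markov chain with the posterior as its equilibrium distribution.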

Extending to more complex models

Now you can easily imagine that we could also add a sigma parameter for the standard deviation and follow the same procedure for this second parameter. In that case, we would be generating proposals for mu and sigma but the algorithm logic would be nearly identical. Or, we could have data from a very different distribution like a Binomial and still use the same algorithm and get the correct posterior. That's pretty cool and a huge benefit of probabilistic programming: Just define the model you want and let MCMC take care of the inference.
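To make the two-parameter case concrete, here is a hedged sketch that proposes mu and sigma jointly. The HalfNormal(5) prior on sigma, the joint proposal with a shared width, and the name `sampler_mu_sigma` are all choices made for this illustration; the post itself does not specify them.

```python
import numpy as np

def log_posterior(mu, sigma, data):
    """Unnormalized log posterior for mu ~ Normal(0, 1), sigma ~ HalfNormal(5)."""
    if sigma <= 0:
        return -np.inf  # sigma must be positive, so such proposals are always rejected
    log_lik = -len(data) * np.log(sigma) - 0.5 * np.sum((data - mu) ** 2) / sigma ** 2
    log_prior_mu = -0.5 * mu ** 2
    log_prior_sigma = -0.5 * (sigma / 5.0) ** 2
    return log_lik + log_prior_mu + log_prior_sigma

def sampler_mu_sigma(data, samples=20000, init=(0.0, 1.0), proposal_width=0.1, seed=0):
    rng = np.random.default_rng(seed)
    current = np.array(init, dtype=float)
    lp_current = log_posterior(current[0], current[1], data)
    trace = np.empty((samples, 2))
    for i in range(samples):
        # Same logic as before, we just jump in two dimensions at once
        proposal = current + rng.normal(0.0, proposal_width, size=2)
        lp_proposal = log_posterior(proposal[0], proposal[1], data)
        if np.log(rng.uniform()) < lp_proposal - lp_current:
            current, lp_current = proposal, lp_proposal
        trace[i] = current
    return trace

rng = np.random.default_rng(7)
data_demo = rng.normal(0.5, 2.0, size=200)  # synthetic data with mean 0.5, sd 2

trace = sampler_mu_sigma(data_demo)
mu_hat, sigma_hat = trace[2000:].mean(axis=0)
print("posterior mean of mu:    %.2f (sample mean %.2f)" % (mu_hat, data_demo.mean()))
print("posterior mean of sigma: %.2f (sample sd   %.2f)" % (sigma_hat, data_demo.std()))
```

The accept/reject step is unchanged from the one-parameter sampler; only the proposal and the function that scores a position had to grow.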

For example, the model below can be written in PyMC3 quite easily. Below we also use the Metropolis sampler (which automatically tunes the proposal width) and see that we get identical results. Feel free to play around with this and change the distributions. For more information, as well as more complex examples, see the PyMC3 documentation (http://pymc-devs.github.io/pymc3/getting_started/).

In [13]: import pymc3 as pm

with pm.Model():
    mu = pm.Normal('mu', 0, 1)
    sigma = 1.
    returns = pm.Normal('returns', mu=mu, sd=sigma, observed=data)

    step = pm.Metropolis()
    trace = pm.sample(15000, step)

sns.distplot(trace[2000:]['mu'], label='PyMC3 sampler');
sns.distplot(posterior[500:], label='Handwritten sampler');
plt.legend();

[100%] 15000 of 15000 complete in 1.7 sec

Conclusions

We glossed over a lot of detail which is certainly important but there are many other posts that deal with that. Here, we really wanted to communicate the idea of MCMC and the Metropolis sampler. Hopefully you will have gathered some intuition which will equip you to read one of the more technical introductions to this topic.

Other, more fancy, MCMC algorithms like Hamiltonian Monte Carlo actually work very similarly to this, they are just much more clever in proposing where to jump next.

This blog post was written in a Jupyter Notebook, you can find the underlying NB with all its code here (https://github.com/twiecki/WhileMyMCMCGentlySamples/blob/master/content/downloads/notebooks/MCMC-sampling-for-dummies.ipynb).

Posted by Thomas Wiecki Nov 10, 2015 bayesian statistics (http://twiecki.github.com/tag/bayesian-statistics.html)

Comments

21 Comments

Josh L Spinoza

This was extremely helpful. Thank you!

Charles

Very insightful. Great post, thanks!

Rob Hicks

This is a fantastic post and has helped my students in visualizing the sampling mechanism. I've adapted it for one of my lectures and wanted you to know I had to modify the visualization, since the maximum of the likelihood function in the second column of graphs in [6] is not constant. Since p(y|mu) is only maximized at 1 value of mu, the code (where you are plotting the likelihood function, not the value of the likelihood at the proposal) should be something like this:

`y = np.array([norm(loc=i,scale=1).pdf(data).prod() for i in x])`

rather than

`y = norm(loc=mu_proposal, scale=1).pdf(x)`

in the plot_proposal function.

Thomas Wiecki Mod

Thanks, I'm glad it's useful.

What I'm doing there is just plotting the likelihood function that will be used to evaluate the data, not the actual

likelihood at that point. But I can see how that's a bit misleading.

Josh L Spinoza

Hi Thomas, I tried commenting on one of the posts below but it has been closed. I'm really interested in Spike-and-Slab priors as you mentioned below. Is there any way to implement this in PyMC2/3 ? I read a paper that implements a Bernoulli for Spike and a Gaussian for the Slab . Can that be done using PyMC2/3 ?

Thomas Wiecki Mod

Hi Josh, Yes, can certainly be done. There's no example but it shouldn't be too hard for you to code it up. If you do, please consider contributing it to pymc3 as an example.

Josh L Spinoza

Thanks for the response! Is there a tutorial on adding new priors? If not, how does the function need to be set up for use in PyMC3 ? I'm essentially trying to do this function in Python and PyMC3 seems to be the best way! http://rpackages.ianhowson.com... I'm new at PyMC3 but I understand Bayesian Inference. I know Python really well so wanted to give it a shot before I give up and use the R package.

Thomas Wiecki Mod

Here is a description of how to add arbitrary distributions: http://pymc-devs.github.io/pym...

warrenon

Hey Thomas, great intro! Could you provide some links to the more technical sources you mention in the conclusion? Thanks

Warren

Thomas Wiecki Mod

I liked this talk by Ian Murray:



Jun

Excellent post, this is the most intuitive explanation about MCMC I have ever read. I really like how you use code instead of

math to explain the algorithm.

Kevin

Excellent article! Just one quick question Thomas: Looking at the plot where you drew 15k samples of mu from the posterior, why is it random in that range over multiple samples? Shouldn't the sample converge to a single value over samples rather than be consistent within that range? For example, in your multi chart image, you can see the trace converge toward mu=0 even with a sample size of about 8! Shouldn't any large guesses after that NOT be accepted and thus every sample after that point be about 0? In code terms, I'm referring to the sampler function.

Thomas Wiecki Mod

The key is this line: `accept = np.random.rand() < p_accept`

So yes, it's not guaranteed but with some probability the sampler will jump to a value that has less (unnormalized) posterior probability. But certainly, very large guesses will be accepted only very rarely (that's why the sampler with the large step width only accepts jumps so rarely).

It's thus not "random" in the range, the trace plot only makes it look that way. Values closer to 0 are visited more often

(you can see that in the histogram).

Kevin

Thanks for the quick reply! That explains it well and that's what I was originally thinking as well. Another question I've had was, correct me if I'm wrong, in terms of you saying that MCMC is able to approximate well distributions that may not have a trivial posterior. However, in our example the normal distribution was used and thus made the posterior calculation trivial. In many cases, we might not exactly know the distribution that the data is sampled from, would MCMC be able to still work when we don't assume that x | mu comes from a normal distribution or do we have to ALWAYS specify the underlying distribution in the likelihood?

Thomas Wiecki Mod

Yes, MCMC will work no matter what likelihood or priors you use. But you do have to specify them somehow.

antiquechrono

I have a couple of questions that are bugging me if you don't mind answering.

When you say you are calculating the probability of each data point these aren't really probability values, but probability density values with units attached. How are you able to use the density (that doesn't sum to 1) in place of a probability value? Is it because when you divide the current and proposed density the units cancel creating a real probability value?

Second, when most people are introduced to Bayes' Rule it's usually with a cancer example where you plug in probability values and calculate the new "posterior" probability. How are you able to go from plugging probabilities into Bayes' Rule to plugging in a likelihood and a distribution?

Thomas Wiecki Mod

1. Good question. It's probably easiest to think about the probability mass of an infinitesimally small region around the likelihood value. See https://en.wikipedia.org/wiki/... for more info. But the normalization is a separate thing and also present in the discrete case.

2. Every model works with a likelihood function / distribution. In the cancer example you're using a Bernoulli distribution (https://en.wikipedia.org/wiki/...) as the likelihood, they just don't tell you that's what it's called so that it's less scary :). Bayes' formula stays the same in all cases though.

antiquechrono

Thank you for taking the time to reply.

1. Are you referring to the Bayes' normalization here? I was asking more along the lines of when you calculate p_accept = p_proposal / p_current. Say, for the sake of argument, that mu in the example from the post is measured in grams.

So when you calculate something like p_proposal it's not a normal probability value that ranges from 0-1. It's a density which is basically the change in probability per unit which is not a vanilla probability. When you calculate p_accept you get p_proposal g^-1 / p_current g^-1 so when you do the division you are left with a unit-less quantity which is now a real probability value?

2. Well if you look at Bayes' Rule from your post you call the prior p(theta). I'm having a hard time understanding how that goes from being a vanilla probability like 0.5 to being an entire distribution like Normal(0,1).

Charles

I don't think of p_accept as being a formally defined probability. I think of it more as just being a simple comparison that tells whether the new value ( mu_proposal ) is better than the current one. In regards to your second question: when you tack on any value to your mu_current list you are essentially generating data from which to make a histogram out of. You can then use a density estimating function to approximate the density over the range of values you think your parameter will have. It has been a while since your question, so it may not be a question anymore. I hope this helps if it still is.

aloctavodia

Very nice post! I have been teaching structural bioinformatics (a branch of science that also uses MCMC methods) for biologists. The previous courses have been highly conceptual and not as hands-on as I would like. I have completely changed the course; next year I will be teaching them how to code and I have been preparing an MCMC chapter in line with your post. I think I can use the visualization part you have created :-) I also have been thinking of replicating a visualization like this one http://blog.revolutionanalytic...

What about plotting the first 5 movements and then every 100 or so steps, to see the convergence of the chain? I think I will also explore IPython/Jupyter widgets (I have seen those, but I have never done anything "real" with them).

Well, thanks for sharing!

Thomas Wiecki Mod

Thanks!

Good points, definitely feel free to use these graphs and/or code and resubmit any improvements you make.

