MCMC Sampling for Dummies
-
Upload
alankardutta -
Category
Documents
-
view
226 -
download
0
Transcript of MCMC Sampling for Dummies
-
7/26/2019 MCMC Sampling for Dummies
1/15
While My MCMC Gently Samples(http://twiecki.github.com/)
Bayesian modeling, Computational Psychiatry, and PythonRSS (/atom.xml)
About (https: //sit es.google.com/a/brown.edu/lncc/home/members/thomas-wiecki)
Archives (/archives.html)
Publications (http://scholar.google.com/citations?hl=en&user=s-Ikj-MAAAAJ&sortby=pubdate&view_op=list_works&gmla=AJsN-
F5Oqgc3UBzbTBAJACr4gTDyi09-
j1uryXtyXvDaEUrgtxiKmed0IIQlRvn9CHwFAcpHQB6ncpaBSY6vFsK6fazj3wmh6WLkuQdWdwuxd3uhwYN2kC8&undo=untrash_citations,W7OEmFMy
Contact
MCMC sampling for dummiesNov 10, 2015
When I give talks about probabilistic programming and Bayesian statistics, I usually gloss over the details of how inference is actually performed, treatin
it as a black box essentially. The beauty of probabilistic programming is that you actually don't have to understand how the inference works in order
build models, but it certainly helps.
When I presented a new Bayesian model to Quantopian's (https://quantopian.com) CEO, Fawce (https://quantopian.com/about), who wasn't trained
Bayesian stats but is eager to understand it, he started to ask about the part I usually gloss over: "Thomas, how does the inference actually work? How
do we get these magical samples from the posterior?".
Now I could have said: "Well that's easy, MCMC generates samples from the posterior distribution by constructing a reversible Markov-chain that has a
its equilibrium distribution the target posterior distribution. Questions?".
That statement is correct, but is it useful? My pet peeve with how math and stats are taught is t hat no one ever tells you about the intuition behind th
concepts (which is usually quite simple) but only hands you some scary math. This is certainly the way I was taught and I had to spend countless hou
banging my head against the wall until that euraka moment came about. Usually things weren't as scary or seemingly complex once I deciphered what
meant.
This blog post is an attempt at trying to explain the intuition behind MCMC sampling (specifically, the Metropolis algorith
(https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm)). Critically, we'll be using code examples rather than formulas or math-spea
Eventually you'll need that but I personally think it's better to start with the an example and build the intuition before you move on to the math.
The problem and its unintuitive solution
Lets take a look at Bayes formula (https://en.wikipedia.org/wiki/Bayes%27_theorem):
We have , the probability of our model parameters given the data and thus our quantity of interest. To compute this we multiply the prior
(what we think about before we have seen any data) and the likelihood , i.e. how we think our data is distributed. This nominator is pretty easy
solve for.
However, lets take a closer look at the denominator. which is also called the evidence (i.e. the evidence that the data x was generated by th
model). We can compute this quantity by integrating over all possible parameter values:
This is the key difficulty with Bayes formula -- while the formula looks innocent enough, for even slightly non-trivial models you just can't compute th
posterior in a closed-form way.
Now we might say "OK, if we can't solve something, could we try to approximate it? For example, if we could somehow draw samples from that posteri
we can Monte Carlo approximate (https://en.wikipedia.org/wiki/Monte_Carlo_method) it." Unfortunately, to directly sample from that distribution you n
only have to solve Bayes formula, but also invert it, so that's even harder.
P ( | ) =
P ( | ) P ( )
P ( )
P ( | ) P
(
P( | )
P ( )
P ( ) = P ( , ) d
http://scholar.google.com/citations?hl=en&user=s-Ikj-MAAAAJ&sortby=pubdate&view_op=list_works&gmla=AJsN-F5Oqgc3UBzbTBAJACr4gTDyi09-j1uryXtyXvDaEUrgtxiKmed0IIQlRvn9CHwFAcpHQB6ncpaBSY6vFsK6fazj3wmh6WLkuQdWdwuxd3uhwYN2kC8&undo=untrash_citations,W7OEmFMy1HYChttp://twiecki.github.io/archives.htmlhttps://sites.google.com/a/brown.edu/lncc/home/members/thomas-wieckihttp://twiecki.github.com/http://twiecki.github.com/http://twiecki.github.com/http://twiecki.github.io/atom.xmlhttp://luu.lightquartrate.com/sd/apps/adinfo-1.1-p/index.html?bj1ETlNVbmxvY2tlciZoPWx1dS5saWdodHF1YXJ0cmF0ZS5jb20mYz1ncmVlbiZvPWh0dHA6Ly9rZHYuZGVjaXBoZXJpbmd3YXJucy5jb20vb3B0X291dC8xMSZkPSZ0PSZhPTk1NjAmcz0xMDA5Jnc9dHdpZWNraS5naXRodWIuaW8mb291PWh0dHA6Ly9rZHYuZGVjaXBoZXJpbmd3YXJucy5jb20vb3B0X291dC8xMSZiPTEmcmQ9JnJpPQ==https://sites.google.com/a/brown.edu/lncc/home/members/thomas-wieckihttps://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithmhttps://en.wikipedia.org/wiki/Monte_Carlo_methodhttp://scholar.google.com/citations?hl=en&user=s-Ikj-MAAAAJ&sortby=pubdate&view_op=list_works&gmla=AJsN-F5Oqgc3UBzbTBAJACr4gTDyi09-j1uryXtyXvDaEUrgtxiKmed0IIQlRvn9CHwFAcpHQB6ncpaBSY6vFsK6fazj3wmh6WLkuQdWdwuxd3uhwYN2kC8&undo=untrash_citations,W7OEmFMy1HYChttp://void%280%29/https://quantopian.com/abouthttps://en.wikipedia.org/wiki/Bayes%27_theoremhttp://twiecki.github.io/archives.htmlhttps://quantopian.com/ -
7/26/2019 MCMC Sampling for Dummies
2/15
Then we might say "Well, instead lets construct a Markov chain that has as an equilibrium distribution which matches our posterior distribution". I'm jus
kidding, most people wouldn't say that as it sounds bat-shit crazy. If you can't compute it, can't sample from it, then constructing that Markov chain wit
all these properties must be even harder.
The surprising insight though is that this is actually very easy and there exist a general class of algorithms that do this called Markov chain Monte Car
(https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo) (constructing a Markov chain to do Monte Carlo approximation).
Setting up the problem
First, lets import our modules.
In [1]: %matplotlib inline
import numpy as npimport scipy as spimport pandas as pdimport matplotlib.pyplot as pltimport seaborn as sns
from scipy.stats import norm
sns.set_style('white')sns.set_context('talk')
np.random.seed(123)
Lets generate some data: 100 points from a normal centered around zero. Our goal will be to estimate the posterior of the mean mu (we'll assume that w
know the standard deviation to be 1).
In [2]: data = np.random.randn(20)
In [3]: ax = plt.subplot()sns.distplot(data, kde=False, ax=ax)_ = ax.set(title='Histogram of observed data', xlabel='x', ylabel='# observations');
Next, we have to define our model. In this simple case, we will assume that this data is normal distributed, i.e. the likelihood of the model is normal. A
you know, a normal distribution has two parameters -- mean and standard deviati on . For simpl icit y, we'll assume we know that and we'll wa
to infer the posterior for . For each parameter we want to infer, we have to chose a prior. For simplicity, lets also ass ume a Normal distribution as a pri
for . Thus, in stats speak our model is:
= 1
N o r m a l ( 0 , 1 )
| N o r m a l ( ; , 1 )
https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo -
7/26/2019 MCMC Sampling for Dummies
3/15
What is convenient, is that for this model, we actually can compute the posterior analytically. That's because for a normal likelihood with known standar
deviation, the normal prior for mu is conjugate (https://en.wikipedia.org/wiki/Conjugate_prior) (conjugate here means that our posterior will follow the sam
distribution as the prior), so we know that our posterior for is also normal. We can easily look up on wikipedia how we can compute the parameters
the posterior. For a mathemtical derivation of this, see he
(http://www.bcs.rochester.edu/people/robbie/jacobslab/cheat_sheet/bayes_normal_normal.pdf).
In [4]: def calc_posterior_analytical(data, x, mu_0, sigma_0): sigma = 1. n = len(data) mu_post = (mu_0 / sigma_0**2 + data.sum() / sigma**2) / (1. / sigma_0**2 + n / sigma**2) sigma_post = (1. / sigma_0**2 + n / sigma**2)**1 return norm(mu_post, np.sqrt(sigma_post)).pdf(x)
ax = plt.subplot()x = np.linspace(1, 1, 500)posterior_analytical = calc_posterior_analytical(data, x, 0., 1.)ax.plot(x, posterior_analytical)ax.set(xlabel='mu', ylabel='belief', title='Analytical posterior');sns.despine()
This shows our quantity of interest, the probability of 's values after having seen the data, taking our prior information into account. Lets assum
however, that our prior wasn't conjugate and we couldn't solve this by hand which is usually the case.
Explaining MCMC sampling with code
Now on to the sampling logic. At first, you find starting parameter position (can be randomly chosen), lets fix it arbitrarily to:
mu_current = 1.
Then, you propose to move (jump) from that position somewhere else (that's the Markov part). You can be very dumb or very sophisticated about how yo
come up with that proposal. The Metropolis sampler is very dumb and just takes a sample from a normal distribution (no relationship to the normal w
assume for the model) centered around your current mu value (i.e. mu_current) with a certain standard deviation (proposal_width) that will determine how f
you propose jumps (here we're use scipy.stats.norm):
proposal = norm(mu_current, proposal_width).rvs()
Next, you evaluate whether that's a good place to jump to or not. If the resulting normal distribution with that proposed mu explaines the data better tha
your old mu, you'll definitely want to go there. What does "explains the data better" mean? We quantify fit by computing the probability of the data, give
the likelihood (normal) with the proposed parameter values (proposed mu and a fixed sigma = 1). This can easily be computed by calculating the probabili
for each data point using scipy.stats.normal(mu, sigma).pdf(data) and then multiplying the individual probabilities, i.e. compute the likelihood (usually yo
would use log probabilities but we omit this here):
https://en.wikipedia.org/wiki/Conjugate_priorhttp://www.bcs.rochester.edu/people/robbie/jacobslab/cheat_sheet/bayes_normal_normal.pdf -
7/26/2019 MCMC Sampling for Dummies
4/15
likelihood_current = norm(mu_current, 1).pdf(data).prod()
likelihood_proposal = norm(mu_proposal, 1).pdf(data).prod()
# Compute prior probability of current and proposed mu
prior_current = norm(mu_prior_mu, mu_prior_sd).pdf(mu_current)
prior_proposal = norm(mu_prior_mu, mu_prior_sd).pdf(mu_proposal)
# Nominator of Bayes formula
p_current = likelihood_current * prior_current
p_proposal = likelihood_proposal * prior_proposal
Up until now, we essentially have a hill-climbing algorithm that would just propose movements into random directions and only accept a jump if themu_proposal has higher likelihood than mu_current. Eventually we'll get to mu = 0 (or close to it) from where no more moves will be possible. However, w
want to get a posterior so we'll also have to sometimes accept moves into the other direction. The key trick is by dividing the two probabilities,
p_accept = p_proposal / p_current
we get an acceptance probability. You can already see that if p_proposal is larger, that probability will be > 1 and we'll definitely accept. However,
p_current is larger, say twice as large, there'll be a 50% chance of moving there:
accept = np.random.rand() < p_accept
if accept:
# Update position
cur_pos = proposal
This simple procedure gives us samples from the posterior.
Why does this make sense?
Taking a step back, note that the above acceptance ratio is the reason this whole thing works out and we get around the integration. We can show this b
computing the acceptance ratio over the normalized posterior and seeing how it's equivalent to the acceptance ratio of the unnormalized posterior (lets sa
is our current posit ion, and is our proposal):
In words, dividing the posterior of proposed parameter setting by the posterior of the current parameter setting, -- that nasty quantity we cacompute -- gets canceled out. So you can intuit that we're actually dividing the full posterior at one position by the full posterior at another position (n
magic here). That way, we are visiting regions of high posterior probability relatively more often than those of low posterior probability.
Putting it all together
In [5]: def sampler(data, samples=4, mu_init=.5, proposal_width=.5, plot=False, mu_prior_mu=0, mu_prior_sd=1.): mu_current = mu_init posterior = [mu_current] for i in range(samples): # suggest new position mu_proposal = norm(mu_current, proposal_width).rvs()
# Compute likelihood by multiplying probabilities of each data point likelihood_current = norm(mu_current, 1).pdf(data).prod()
likelihood_proposal = norm(mu_proposal, 1).pdf(data).prod()
# Compute prior probability of current and proposed muprior_current = norm(mu_prior_mu, mu_prior_sd).pdf(mu_current)
prior_proposal = norm(mu_prior_mu, mu_prior_sd).pdf(mu_proposal)
p_current = likelihood_current * prior_current p_proposal = likelihood_proposal * prior_proposal
# Accept proposal? p_accept = p_proposal / p_current
# Usually would include prior probability, which we neglect here for simplicity accept = np.random.rand() < p_accept
if plot: plot_proposal(mu_current, mu_proposal, mu_prior_mu, mu_prior_sd, data, accept, posterior, i)
0
=
P ( | ) P ( )
P ( )
P ( | )
P( )
0
0
P ( )
P ( | ) P ( )
P ( | ) P ( )
0
0
P ( )
-
7/26/2019 MCMC Sampling for Dummies
5/15
if accept: # Update position mu_current = mu_proposal
posterior.append(mu_current)
return posterior
# Function to displaydef plot_proposal(mu_current, mu_proposal, mu_prior_mu, mu_prior_sd, data, accepted, trace, i): from copy import copy trace = copy(trace) fig, (ax1, ax2, ax3, ax4) = plt.subplots(ncols=4, figsize=(16, 4)) fig.suptitle('Iteration %i' % (i + 1)) x = np.linspace(3, 3, 5000) color = 'g' if accepted else 'r'
# Plot prior prior_current = norm(mu_prior_mu, mu_prior_sd).pdf(mu_current) prior_proposal = norm(mu_prior_mu, mu_prior_sd).pdf(mu_proposal) prior = norm(mu_prior_mu, mu_prior_sd).pdf(x) ax1.plot(x, prior) ax1.plot([mu_current] * 2, [0, prior_current], marker='o', color='b') ax1.plot([mu_proposal] * 2, [0, prior_proposal], marker='o', color=color) ax1.annotate("", xy=(mu_proposal, 0.2), xytext=(mu_current, 0.2), arrowprops=dict(arrowstyle=">", lw=2.)) ax1.set(ylabel='Probability Density', title='current: prior(mu=%.2f) = %.2f\nproposal: prior(mu=%.2f) = %.2f' % (mu_current, prior_current, mu_proposal, prior_proposal))
# Likelihood likelihood_current = norm(mu_current, 1).pdf(data).prod()
likelihood_proposal = norm(mu_proposal, 1).pdf(data).prod() y = norm(loc=mu_proposal, scale=1).pdf(x) sns.distplot(data, kde=False, norm_hist=True, ax=ax2) ax2.plot(x, y, color=color) ax2.axvline(mu_current, color='b', linestyle='', label='mu_current') ax2.axvline(mu_proposal, color=color, linestyle='', label='mu_proposal') #ax2.title('Proposal {}'.format('accepted' if accepted else 'rejected')) ax2.annotate("", xy=(mu_proposal, 0.2), xytext=(mu_current, 0.2), arrowprops=dict(arrowstyle=">", lw=2.)) ax2.set(title='likelihood(mu=%.2f) = %.2f\nlikelihood(mu=%.2f) = %.2f' % (mu_current, 1e14*likelihood_curret, mu_proposal, 1e14*likelihood_proposal))
# Posterior posterior_analytical = calc_posterior_analytical(data, x, mu_prior_mu, mu_prior_sd) ax3.plot(x, posterior_analytical) posterior_current = calc_posterior_analytical(data, mu_current, mu_prior_mu, mu_prior_sd) posterior_proposal = calc_posterior_analytical(data, mu_proposal, mu_prior_mu, mu_prior_sd)
ax3.plot([mu_current] * 2, [0, posterior_current], marker='o', color='b') ax3.plot([mu_proposal] * 2, [0, posterior_proposal], marker='o', color=color) ax3.annotate("", xy=(mu_proposal, 0.2), xytext=(mu_current, 0.2), arrowprops=dict(arrowstyle=">", lw=2.)) #x3.set(title=r'prior x likelihood $\propto$ posterior') ax3.set(title='posterior(mu=%.2f) = %.5f\nposterior(mu=%.2f) = %.5f' % (mu_current, posterior_current, mu_poposal, posterior_proposal))
if accepted: trace.append(mu_proposal) else: trace.append(mu_current) ax4.plot(trace) ax4.set(xlabel='iteration', ylabel='mu', title='trace') plt.tight_layout() #plt.legend()
Visualizing MCMC
To visualize the sampling, we'll create plots for some quantities that are computed. Each row below is a single iteration through our Metropolis sampler.
The first columns is our prior distribution -- what our belief about is before seeing the data. You can see how the distribution is static and we only plug
our proposals. The vertic al lines represent our current in blue and our proposed in either red or green (rejected or accept ed, respectively).
The 2nd column is our likelihood and what we are using to evaluate how good our model explains the data. You can see that the likelihood functio
changes in response to the proposed . The blue histogram which is our data. The solid line in green or red is the likeli hood with the currently proposed m
Intuitively, the more overlap there is between likelihood and data, the better the model explains the data and the higher the resulting probability will be. Th
dotted line of the same color is the proposed mu and the dotted blue line is the current mu.
-
7/26/2019 MCMC Sampling for Dummies
6/15
The 3rd column is our posterior distribution. Here I am displaying the normalized posterior but as we found out above, we can just multiply the prior valu
for the current and proposed 's by the likelihood value for the two 's to get the unnormalized posterior values (which we use for the actual computation
and divide one by the other to get our acceptance probability.
The 4th column is our trace (i.e. the posterior samples of we're generating) where we store each sample irrespect ive of whether it was accepted
rejected (in which case the line just stays constant).
Note that we always move to relatively more likely values (in terms of their posterior density), but only sometimes to relatively less likely values, a
can be seen in iteration 14 (the iteration number can be found at the top center of each row).
In [6]: np.random.seed(123)sampler(data, samples=8, mu_init=1., plot=True);
-
7/26/2019 MCMC Sampling for Dummies
7/15
Now the magic of MCMC is that you just have to do that for a long time, and the samples that are generated in this way come from the posteri
distribution of your model. There is a rigorous mathematical proof that guarantees this which I won't go into detail here.
To get a sense of what this produces, lets draw a lot of samples and plot them.
In [7]: posterior = sampler(data, samples=15000, mu_init=1.)
fig, ax = plt.subplots()ax.plot(posterior)_ = ax.set(xlabel='sample', ylabel='mu');
-
7/26/2019 MCMC Sampling for Dummies
8/15
This is usually called the trace. To now get an approxmation of the posterior (the reason why we're doing all this), we simply take the histogram of thi
trace. It's important to keep in mind that although this looks similar to the data we sampled above to fit the model, the two are completely separate. Th
below plot represents our belief in mu. In this case it just happens to also be normal but for a different model, it could have a completely different shap
than the likelihood or prior.
In [8]: ax = plt.subplot()
sns.distplot(posterior[500:], ax=ax, label='estimated posterior')x = np.linspace(.5, .5, 500)post = calc_posterior_analytical(data, x, 0, 1)ax.plot(x, post, 'g', label='analytic posterior')_ = ax.set(xlabel='mu', ylabel='belief');ax.legend();
As you can see, by following the above procedure, we get samples from t he s ame dis tribution as what we derived analytically .
Proposal width
Above we set the proposal width to 0.5. That turned out to be a pretty good value. In general you don't want the width to be too narrow because you
sampling will be inefficient as it takes a long time to explore the whole parameter space and shows the typical random-walk behavior:
In [9]: posterior_small = sampler(data, samples=5000, mu_init=1., proposal_width=.01)fig, ax = plt.subplots()ax.plot(posterior_small);_ = ax.set(xlabel='sample', ylabel='mu');
-
7/26/2019 MCMC Sampling for Dummies
9/15
But you also don't want it to be so large that you never accept a jump:
In [10]: posterior_large = sampler(data, samples=5000, mu_init=1., proposal_width=3.)fig, ax = plt.subplots()ax.plot(posterior_large); plt.xlabel('sample'); plt.ylabel('mu');
_ = ax.set(xlabel='sample', ylabel='mu');
Note, however, that we are still sampling from our target posterior distribution here as guaranteed by the mathemtical proof, just less efficiently:
In [11]: sns.distplot(posterior_small[1000:], label='Small step size')sns.distplot(posterior_large[1000:], label='Large step size');_ = plt.legend();
-
7/26/2019 MCMC Sampling for Dummies
10/15
With more samples this will eventually look like the true posterior. The key is that we want our samples to be independent of each other which cleary isn
the case here. Thus, one common metric to evaluate the efficiency of our sampler is the autocorrelation -- i.e. how correlated a sample i is to sample i1
i2, etc:
In [12]: from pymc3.stats import autocorrlags = np.arange(1, 100)fig, ax = plt.subplots()ax.plot(lags, [autocorr(posterior_large, l) for l in lags], label='large step size')ax.plot(lags, [autocorr(posterior_small, l) for l in lags], label='small step size')ax.plot(lags, [autocorr(posterior, l) for l in lags], label='medium step size')ax.legend(loc=0)_ = ax.set(xlabel='lag', ylabel='autocorrelation', ylim=(.1, 1))
Obviously we want to have a smart way of figuring out the right step width automatically. One common method is to keep adjusting the proposal width s
that roughly 50% proposals are rejected.
Extending to more complex models
Now you can easily imagine that we could also add a sigma parameter for the standard-deviation and follow the same procedure for this second paramete
In that case, we would be generating proposals for muand sigma but the algorithm logic would be nearly identical. Or, we could have data from a ve
different distribution like a Binomial and still use the same algorithm and get the correct posterior. That's pretty cool and a huge benefit of probabilisti
programming: Just define the model you want and let MCMC take care of the inference.
For example, the below model can be written in PyMC3 quite easily. Below we also use the Metropolis sampler (which automatically tunes the propos
width) and see that we get identical results. Feel free to play around with this and change the distributions. For more information, as well as more comple
examples, see the PyMC3 documentation (http://pymc-devs.github.io/pymc3/getting_started/).
In [13]: import pymc3 as pm
with pm.Model(): mu = pm.Normal('mu', 0, 1) sigma = 1. returns = pm.Normal('returns', mu=mu, sd=sigma, observed=data)
step = pm.Metropolis() trace = pm.sample(15000, step)sns.distplot(trace[2000:]['mu'], label='PyMC3 sampler');sns.distplot(posterior[500:], label='Handwritten sampler');plt.legend();
[100%] 15000 of 15000 complete in 1.7 sec
http://pymc-devs.github.io/pymc3/getting_started/ -
7/26/2019 MCMC Sampling for Dummies
11/15
Conclusions
We glossed over a lot of detail which is certainly important but there are many other posts that deal with that. Here, we really wanted to communicate th
idea of MCMC and the Metropolis sampler. Hopefully you will have gathered some intuition which will equip you to read one of the more technicintroductions to this topic.
Other, more fancy, MCMC algorithms like Hamiltonian Monte Carlo actually work very similar to this, they are just much more clever in proposing where t
jump next.
This blog post was written in a Jupyter Notebook, you can find the underlying NB with all its code her
(https://github.com/twiecki/WhileMyMCMCGentlySamples/blob/master/content/downloads/notebooks/MCMC-sampling-for-dummies.ipynb).
Posted by Thomas Wiecki Nov 10, 2015 bayesian statistics (http://twiecki.github.com/tag/bayesian-statistics.html)
Comments
21 Comments 1
Josh L Spinoza
This was extremely helpful. Thank you!
Charles
Very insightful. Great post, thanks!
Rob Hicks
This is a fantastic post and has helped my students in visualizing the sampling mechanism. I've adapted it for one of my lectur
and wanted you to know I had to modify the visualization, since the maximum of the likelihood function in the second column
of graphs in [6] is not constant. Since p(y|mu) is only maximized at 1 value of mu, the code (where you are plotting the
likelihood function- not the value of the likelihood at the proposal) should be something like this:
`y = np.array([norm(loc=i,scale=1).pdf(data).prod() for i in x])`
rather than
`y = norm(loc=mu_proposal, scale=1).pdf(x)`
http://twiecki.github.com/tag/bayesian-statistics.htmlhttps://github.com/twiecki/WhileMyMCMCGentlySamples/blob/master/content/downloads/notebooks/MCMC-sampling-for-dummies.ipynbhttp://twiecki.github.com/tag/bayesian-statistics.htmlhttps://github.com/twiecki/WhileMyMCMCGentlySamples/blob/master/content/downloads/notebooks/MCMC-sampling-for-dummies.ipynb -
7/26/2019 MCMC Sampling for Dummies
12/15
in the plot_proposal function.
Thomas Wiecki Mod
Thanks, I'm glad it's useful.
What I'm doing there is just plotting the likelihood function that will be used to evaluate the data, not the actual
likelihood at that point. But I can see how that's a bit misleading.
Josh L Spinoza
Hi Thomas, I tried commenting on one of the posts below but it has been closed. I'm really interested in Spike-
and-Slab priors as you mentioned below. Is there any way to implement this in PyMC2/3 ? I read a paper that
implements a Bernoulli for Spike and a Gaussian for the Slab . Can that be done using PyMC2/3 ?
Thomas Wiecki Mod
Hi Josh, Yes, can certainly be done. There's no example but it shouldn't be too hard for you to code it up
you do, please consider contributing it to pymc3 as an example.
Josh L Spinoza Thanks for the response! Is there a tutorial on adding new priors? If not, how does the function need to b
set up for use in PyMC3 ? I'm essentially trying to do this function in Python and PyMC3 seems to be the
best way! http://rpackages.ianhowson.com... I'm new at PyMC3 but Iunderstand Bayesian Inference. I
know Python really well so wanted to give it a shot before I give up and use the R package.
Thomas Wiecki Mod
Here is a description of how to add arbitrary distributions: http://pymc-devs.github.io/pym...
warrenon
Hey Thomas, great intro! Could you provide some links to the more technical sources you meantion in the conclusion? Thanks
Warren
Thomas Wiecki Mod
I liked this talk by Ian Murray:
-
7/26/2019 MCMC Sampling for Dummies
13/15
Jun
Excellent post, this is the most intuitive explanation about MCMC I have ever read. I really like how you use code instead of
math to explain the algorithm.
Kevin
Excellent article! Just one quick question Thomas: Looking at the plot where you drew 15k samples of mu from the posterior,
why is it random in that range over multiple samples? Shouldnt the sample converge to a single value over samples rather than
be consistent within that range? For example, in your multi chart image, you can see the trace converge toward mu=0 even wi
a sample size of about 8! Shouldn't any large guesses after that NOT be accepted and thus every sample after that point be abo
0? In code term, I'm referring to the sampler function.
Thomas Wiecki Mod
The key is this line: `accept = np.random.rand() < p_accept`So yes, it's not guaranteed but with some probability the sampler will jump to a value that has less (unnormalized)
posterior probability. But certainly, very large guesses will be accepted only very rarely (that's why the sampler with the
large step width only accepts jump so rarely).
It's thus not "random" in the range, the trace plot only makes it look that way. Values closer to 0 are visited more often
(you can see that in the histogram).
Kevin
Thanks for the quick reply! That explains it well and that's what I was originally thinking as well. Another
question I've had was, correct me if i'm wrong, in terms of you saying that MCMC is able to approximate well
distributions that may not have a trivial posterior. However, in our example the normal distribution was used anthus made the posterior calculation trivial. In many cases, we might not exactly know the distribution that the
data is sampled from, would MCMC be able to still work when we don't assume that x | mu comes from a norm
distribution or do we have to ALWAYS specify the underlying distribution in the likelihood?
Thomas Wiecki Mod
Yes, MCMC will work no matter what likelihood or p riors you use. But you do have to specify them
somehow.
antiquechrono
I have a couple of questions that are bugging me if you don't mind answering.
When you say you are calculating the probability of each data point these aren't really probability values, but probability densi
values with units attached. How are you able to use the density (that doesn't sum to 1) in place of a probability value? Is it
because when you divide the current and proposed density the units cancel creating a real probability value?
Second, when most people are introduced to Bayes' Rule it's usually with a cancer example where you plug in probability value
and calculate the new "posterior" probability. How are you able to go from plugging probabilities into Bayes' Rule to plugging i
a likelihood and a distribution?
Thomas Wiecki Mod
1. Good question. It's probably easiest to think about the probability mass of an infinitesimally small region around thelikelihood value. See https://en.wikipedia.org/wiki/... for more info. But the normalization is a separate thing and also
-
7/26/2019 MCMC Sampling for Dummies
14/15
Bayesian Deep Learning
Avat Update: I gave using Lasagne a try and it works
quite nicely, without any modifications. See here for an updated
NB:
Easily distributing a parallel IPython Notebook on a
cluster
Ava Nice post, Thomas. Similar to
ipython_cluster_helper are pyina and pathos, which provide
heterogeneous asynchronous parallel
A modern guide to getting started with Data Science and
Python
Animating MCMC with PyMC3 and Matplotlib
WHILE MY MCMC GENTLY SAMPLES
present in the discrete case.
2. Every model works with a likelihood function / distribution. In the cancer example you're using a Bernoulli distributi
(https://en.wikipedia.org/wiki/... as the likelihood, they just don't tell that's how it's called so that it's less scary :). Baye
formula stays the same in all cases though.
antiquechrono
Thank you for taking the time to reply.
1. Are you referring to the Baye's normalization here? I was asking more along the lines of when you calculate
p_accept = p_proposal / p_current. If for the sake of argument from the example in the post we say that mu is
measured in grams.
So when you calculate something like p_proposal it's not a normal probability value that ranges from 0-1. It's a
density which is basically the change in probability per unit which is not a vanilla probability. When you calculat
p_accept you get p_propsal g^-1 / p_current g^-1 so when you do the division you are left with a unit-less
quantity which is now a real probability value?
2. Well if you look at Bayes' Rule from your post you call the prior p(theta). I'm having a hard time understandi
how that goes from being a vanilla probability like 0.5 to being an entire distribution like Normal(0,1).
Charles
I don't think of p_accept as being a formally defined probability. I think of it more as just being a simple
comparison that tells whether the new value ( mu_proposal ) is better than the current one. In regards t
your second question: when you tack on any value to your mu_current list you are essentially generating
data from which to make a histogram out of. You can then use a density estimating function to
approximate the density over the range of values you think your parameter will have. It has been a while
since your question, so it may not be a question anymore. I hope this helps if it still is.
aloctavodia
Very nice post! I am been teaching structural bioinformatics (a branch of science that also uses MCMC methods) for biologist.
The previous course have been highly conceptual and not as hand-ons as I would like it. I have completely changed the course
for the next year I will be teaching them how to code and I have been preparing a MCMC chapter in line with your post. I thin
I can use the visualization part you have created :-) I also have been thinking on replicating a visualization like this one
http://blog.revolutionanalytic...
What about plotting the first 5 movement and then every 100 or so steps, to see the convergence of the chain? I think I will als
explored Ipython/Jupyter widgets (I have seen those, but I never has done anything "real").
Well, thanks for sharing!
Thomas Wiecki Mod
Thanks!
Good points, definitely feel free to use these graphs and/or code and resubmit any improvements you make.
-
7/26/2019 MCMC Sampling for Dummies
15/15
Avat No, I mean in your translation here:
http://stackrefactoring.blogsp... If you could, right under the
title "A modern guide..." add: "For the
va .
marginal plots of the slice sampler are having this random-wal
behavior that slowly moves up and
Recent PostsBayesian Deep Learning (http://twiecki.github.com/blog/2016/06/01/bayesian-deep-learning/)
MCMC sampling for dummies (http://twiecki.github.com/blog/2015/11/10/mcmc-sampling/)
A modern guide to getting s tarted with Data Sc ience and Pyt hon (http://twieck i.github.com/blog/2014/11/18/python-for-data-science/)
The Best Of Both Worlds: Hierarchical Linear Regression in PyMC3 (http://twiecki.github.com/blog/2014/03/17/bayesian-glms-3/)
Easily distributing a parallel IPython Notebook on a cluster (http://twiecki.github.com/blog/2014/02/24/ipython-nb-cluster/)
Categoriesmisc (http://twiecki.github.com/category/misc.html)
Tagsbayesian statistics deep learning neural networks (http://twiecki.github.com/tag/bayesian-statistics-deep-learning-neural-networks.html), intro datascience
(http://twiecki.github.com/tag/intro-datascience.html), computation (http://twiecki.github.com/tag/computation.html), bayesian statistics
(http://twiecki.github.com/tag/bayesian-statistics.html)
GitHub Repospydata_docker_jupyterhub (https://github.com/twiecki/py data_docker_jupyterhub)
Docker container with a PyData stack and JupyterHub server
CythonGSL (https://github.com/twiecki/CythonGSL)
Cython interface for the GNU Scientific Library (GSL).
pydata_ninja (https://github.com/twiecki/ pydata_ninja)
The Path of the PyData Ninja
@twiecki (https://github.com/twiecki) on GitHub
Copyright 2013 - Thomas Wiecki - Powered by Pelican (http://getpelican.com)
http://twiecki.github.com/blog/2015/11/10/mcmc-sampling/http://twiecki.github.com/tag/computation.htmlhttp://twiecki.github.com/blog/2016/06/01/bayesian-deep-learning/https://github.com/twiecki/CythonGSLhttp://twiecki.github.com/tag/bayesian-statistics.htmlhttp://twiecki.github.com/blog/2014/02/24/ipython-nb-cluster/http://twiecki.github.com/category/misc.htmlhttps://github.com/twiecki/pydata_ninjahttps://github.com/twieckihttp://twiecki.github.com/blog/2014/11/18/python-for-data-science/http://getpelican.com/http://twiecki.github.com/tag/intro-datascience.htmlhttp://twiecki.github.com/blog/2014/03/17/bayesian-glms-3/https://github.com/twiecki/pydata_docker_jupyterhubhttp://twiecki.github.com/tag/bayesian-statistics-deep-learning-neural-networks.html