MCMC Sampling for Dummies

download MCMC Sampling for Dummies

of 15

Transcript of MCMC Sampling for Dummies

  • 7/26/2019 MCMC Sampling for Dummies

    1/15

    While My MCMC Gently Samples(http://twiecki.github.com/)

    Bayesian modeling, Computational Psychiatry, and PythonRSS (/atom.xml)

    About (https: //sit es.google.com/a/brown.edu/lncc/home/members/thomas-wiecki)

    Archives (/archives.html)

    Publications (http://scholar.google.com/citations?hl=en&user=s-Ikj-MAAAAJ&sortby=pubdate&view_op=list_works&gmla=AJsN-

    F5Oqgc3UBzbTBAJACr4gTDyi09-

    j1uryXtyXvDaEUrgtxiKmed0IIQlRvn9CHwFAcpHQB6ncpaBSY6vFsK6fazj3wmh6WLkuQdWdwuxd3uhwYN2kC8&undo=untrash_citations,W7OEmFMy

    Contact

    MCMC sampling for dummiesNov 10, 2015

    When I give talks about probabilistic programming and Bayesian statistics, I usually gloss over the details of how inference is actually performed, treatin

    it as a black box essentially. The beauty of probabilistic programming is that you actually don't have to understand how the inference works in order

    build models, but it certainly helps.

    When I presented a new Bayesian model to Quantopian's (https://quantopian.com) CEO, Fawce (https://quantopian.com/about), who wasn't trained

    Bayesian stats but is eager to understand it, he started to ask about the part I usually gloss over: "Thomas, how does the inference actually work? How

    do we get these magical samples from the posterior?".

    Now I could have said: "Well that's easy, MCMC generates samples from the posterior distribution by constructing a reversible Markov-chain that has a

    its equilibrium distribution the target posterior distribution. Questions?".

    That statement is correct, but is it useful? My pet peeve with how math and stats are taught is t hat no one ever tells you about the intuition behind th

    concepts (which is usually quite simple) but only hands you some scary math. This is certainly the way I was taught and I had to spend countless hou

    banging my head against the wall until that euraka moment came about. Usually things weren't as scary or seemingly complex once I deciphered what

    meant.

    This blog post is an attempt at trying to explain the intuition behind MCMC sampling (specifically, the Metropolis algorith

    (https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm)). Critically, we'll be using code examples rather than formulas or math-spea

    Eventually you'll need that but I personally think it's better to start with the an example and build the intuition before you move on to the math.

    The problem and its unintuitive solution

    Lets take a look at Bayes formula (https://en.wikipedia.org/wiki/Bayes%27_theorem):

    We have , the probability of our model parameters given the data and thus our quantity of interest. To compute this we multiply the prior

    (what we think about before we have seen any data) and the likelihood , i.e. how we think our data is distributed. This nominator is pretty easy

    solve for.

    However, lets take a closer look at the denominator. which is also called the evidence (i.e. the evidence that the data x was generated by th

    model). We can compute this quantity by integrating over all possible parameter values:

    This is the key difficulty with Bayes formula -- while the formula looks innocent enough, for even slightly non-trivial models you just can't compute th

    posterior in a closed-form way.

    Now we might say "OK, if we can't solve something, could we try to approximate it? For example, if we could somehow draw samples from that posteri

    we can Monte Carlo approximate (https://en.wikipedia.org/wiki/Monte_Carlo_method) it." Unfortunately, to directly sample from that distribution you n

    only have to solve Bayes formula, but also invert it, so that's even harder.

    P ( | ) =

    P ( | ) P ( )

    P ( )

    P ( | ) P

    (

    P( | )

    P ( )

    P ( ) = P ( , ) d

    http://scholar.google.com/citations?hl=en&user=s-Ikj-MAAAAJ&sortby=pubdate&view_op=list_works&gmla=AJsN-F5Oqgc3UBzbTBAJACr4gTDyi09-j1uryXtyXvDaEUrgtxiKmed0IIQlRvn9CHwFAcpHQB6ncpaBSY6vFsK6fazj3wmh6WLkuQdWdwuxd3uhwYN2kC8&undo=untrash_citations,W7OEmFMy1HYChttp://twiecki.github.io/archives.htmlhttps://sites.google.com/a/brown.edu/lncc/home/members/thomas-wieckihttp://twiecki.github.com/http://twiecki.github.com/http://twiecki.github.com/http://twiecki.github.io/atom.xmlhttp://luu.lightquartrate.com/sd/apps/adinfo-1.1-p/index.html?bj1ETlNVbmxvY2tlciZoPWx1dS5saWdodHF1YXJ0cmF0ZS5jb20mYz1ncmVlbiZvPWh0dHA6Ly9rZHYuZGVjaXBoZXJpbmd3YXJucy5jb20vb3B0X291dC8xMSZkPSZ0PSZhPTk1NjAmcz0xMDA5Jnc9dHdpZWNraS5naXRodWIuaW8mb291PWh0dHA6Ly9rZHYuZGVjaXBoZXJpbmd3YXJucy5jb20vb3B0X291dC8xMSZiPTEmcmQ9JnJpPQ==https://sites.google.com/a/brown.edu/lncc/home/members/thomas-wieckihttps://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithmhttps://en.wikipedia.org/wiki/Monte_Carlo_methodhttp://scholar.google.com/citations?hl=en&user=s-Ikj-MAAAAJ&sortby=pubdate&view_op=list_works&gmla=AJsN-F5Oqgc3UBzbTBAJACr4gTDyi09-j1uryXtyXvDaEUrgtxiKmed0IIQlRvn9CHwFAcpHQB6ncpaBSY6vFsK6fazj3wmh6WLkuQdWdwuxd3uhwYN2kC8&undo=untrash_citations,W7OEmFMy1HYChttp://void%280%29/https://quantopian.com/abouthttps://en.wikipedia.org/wiki/Bayes%27_theoremhttp://twiecki.github.io/archives.htmlhttps://quantopian.com/
  • 7/26/2019 MCMC Sampling for Dummies

    2/15

    Then we might say "Well, instead lets construct a Markov chain that has as an equilibrium distribution which matches our posterior distribution". I'm jus

    kidding, most people wouldn't say that as it sounds bat-shit crazy. If you can't compute it, can't sample from it, then constructing that Markov chain wit

    all these properties must be even harder.

    The surprising insight though is that this is actually very easy and there exist a general class of algorithms that do this called Markov chain Monte Car

    (https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo) (constructing a Markov chain to do Monte Carlo approximation).

    Setting up the problem

    First, lets import our modules.

    In [1]: %matplotlib inline

    import numpy as npimport scipy as spimport pandas as pdimport matplotlib.pyplot as pltimport seaborn as sns

    from scipy.stats import norm

    sns.set_style('white')sns.set_context('talk')

    np.random.seed(123)

    Lets generate some data: 100 points from a normal centered around zero. Our goal will be to estimate the posterior of the mean mu (we'll assume that w

    know the standard deviation to be 1).

    In [2]: data = np.random.randn(20)

    In [3]: ax = plt.subplot()sns.distplot(data, kde=False, ax=ax)_ = ax.set(title='Histogram of observed data', xlabel='x', ylabel='# observations');

    Next, we have to define our model. In this simple case, we will assume that this data is normal distributed, i.e. the likelihood of the model is normal. A

    you know, a normal distribution has two parameters -- mean and standard deviati on . For simpl icit y, we'll assume we know that and we'll wa

    to infer the posterior for . For each parameter we want to infer, we have to chose a prior. For simplicity, lets also ass ume a Normal distribution as a pri

    for . Thus, in stats speak our model is:

    = 1

    N o r m a l ( 0 , 1 )

    | N o r m a l ( ; , 1 )

    https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo
  • 7/26/2019 MCMC Sampling for Dummies

    3/15

    What is convenient, is that for this model, we actually can compute the posterior analytically. That's because for a normal likelihood with known standar

    deviation, the normal prior for mu is conjugate (https://en.wikipedia.org/wiki/Conjugate_prior) (conjugate here means that our posterior will follow the sam

    distribution as the prior), so we know that our posterior for is also normal. We can easily look up on wikipedia how we can compute the parameters

    the posterior. For a mathemtical derivation of this, see he

    (http://www.bcs.rochester.edu/people/robbie/jacobslab/cheat_sheet/bayes_normal_normal.pdf).

    In [4]: def calc_posterior_analytical(data, x, mu_0, sigma_0): sigma = 1. n = len(data) mu_post = (mu_0 / sigma_0**2 + data.sum() / sigma**2) / (1. / sigma_0**2 + n / sigma**2) sigma_post = (1. / sigma_0**2 + n / sigma**2)**1 return norm(mu_post, np.sqrt(sigma_post)).pdf(x)

    ax = plt.subplot()x = np.linspace(1, 1, 500)posterior_analytical = calc_posterior_analytical(data, x, 0., 1.)ax.plot(x, posterior_analytical)ax.set(xlabel='mu', ylabel='belief', title='Analytical posterior');sns.despine()

    This shows our quantity of interest, the probability of 's values after having seen the data, taking our prior information into account. Lets assum

    however, that our prior wasn't conjugate and we couldn't solve this by hand which is usually the case.

    Explaining MCMC sampling with code

    Now on to the sampling logic. At first, you find starting parameter position (can be randomly chosen), lets fix it arbitrarily to:

    mu_current = 1.

    Then, you propose to move (jump) from that position somewhere else (that's the Markov part). You can be very dumb or very sophisticated about how yo

    come up with that proposal. The Metropolis sampler is very dumb and just takes a sample from a normal distribution (no relationship to the normal w

    assume for the model) centered around your current mu value (i.e. mu_current) with a certain standard deviation (proposal_width) that will determine how f

    you propose jumps (here we're use scipy.stats.norm):

    proposal = norm(mu_current, proposal_width).rvs()

    Next, you evaluate whether that's a good place to jump to or not. If the resulting normal distribution with that proposed mu explaines the data better tha

    your old mu, you'll definitely want to go there. What does "explains the data better" mean? We quantify fit by computing the probability of the data, give

    the likelihood (normal) with the proposed parameter values (proposed mu and a fixed sigma = 1). This can easily be computed by calculating the probabili

    for each data point using scipy.stats.normal(mu, sigma).pdf(data) and then multiplying the individual probabilities, i.e. compute the likelihood (usually yo

    would use log probabilities but we omit this here):

    https://en.wikipedia.org/wiki/Conjugate_priorhttp://www.bcs.rochester.edu/people/robbie/jacobslab/cheat_sheet/bayes_normal_normal.pdf
  • 7/26/2019 MCMC Sampling for Dummies

    4/15

    likelihood_current = norm(mu_current, 1).pdf(data).prod()

    likelihood_proposal = norm(mu_proposal, 1).pdf(data).prod()

    # Compute prior probability of current and proposed mu

    prior_current = norm(mu_prior_mu, mu_prior_sd).pdf(mu_current)

    prior_proposal = norm(mu_prior_mu, mu_prior_sd).pdf(mu_proposal)

    # Nominator of Bayes formula

    p_current = likelihood_current * prior_current

    p_proposal = likelihood_proposal * prior_proposal

    Up until now, we essentially have a hill-climbing algorithm that would just propose movements into random directions and only accept a jump if themu_proposal has higher likelihood than mu_current. Eventually we'll get to mu = 0 (or close to it) from where no more moves will be possible. However, w

    want to get a posterior so we'll also have to sometimes accept moves into the other direction. The key trick is by dividing the two probabilities,

    p_accept = p_proposal / p_current

    we get an acceptance probability. You can already see that if p_proposal is larger, that probability will be > 1 and we'll definitely accept. However,

    p_current is larger, say twice as large, there'll be a 50% chance of moving there:

    accept = np.random.rand() < p_accept

    if accept:

    # Update position

    cur_pos = proposal

    This simple procedure gives us samples from the posterior.

    Why does this make sense?

    Taking a step back, note that the above acceptance ratio is the reason this whole thing works out and we get around the integration. We can show this b

    computing the acceptance ratio over the normalized posterior and seeing how it's equivalent to the acceptance ratio of the unnormalized posterior (lets sa

    is our current posit ion, and is our proposal):

    In words, dividing the posterior of proposed parameter setting by the posterior of the current parameter setting, -- that nasty quantity we cacompute -- gets canceled out. So you can intuit that we're actually dividing the full posterior at one position by the full posterior at another position (n

    magic here). That way, we are visiting regions of high posterior probability relatively more often than those of low posterior probability.

    Putting it all together

    In [5]: def sampler(data, samples=4, mu_init=.5, proposal_width=.5, plot=False, mu_prior_mu=0, mu_prior_sd=1.): mu_current = mu_init posterior = [mu_current] for i in range(samples): # suggest new position mu_proposal = norm(mu_current, proposal_width).rvs()

    # Compute likelihood by multiplying probabilities of each data point likelihood_current = norm(mu_current, 1).pdf(data).prod()

    likelihood_proposal = norm(mu_proposal, 1).pdf(data).prod()

    # Compute prior probability of current and proposed muprior_current = norm(mu_prior_mu, mu_prior_sd).pdf(mu_current)

    prior_proposal = norm(mu_prior_mu, mu_prior_sd).pdf(mu_proposal)

    p_current = likelihood_current * prior_current p_proposal = likelihood_proposal * prior_proposal

    # Accept proposal? p_accept = p_proposal / p_current

    # Usually would include prior probability, which we neglect here for simplicity accept = np.random.rand() < p_accept

    if plot: plot_proposal(mu_current, mu_proposal, mu_prior_mu, mu_prior_sd, data, accept, posterior, i)

    0

    =

    P ( | ) P ( )

    P ( )

    P ( | )

    P( )

    0

    0

    P ( )

    P ( | ) P ( )

    P ( | ) P ( )

    0

    0

    P ( )

  • 7/26/2019 MCMC Sampling for Dummies

    5/15

    if accept: # Update position mu_current = mu_proposal

    posterior.append(mu_current)

    return posterior

    # Function to displaydef plot_proposal(mu_current, mu_proposal, mu_prior_mu, mu_prior_sd, data, accepted, trace, i): from copy import copy trace = copy(trace) fig, (ax1, ax2, ax3, ax4) = plt.subplots(ncols=4, figsize=(16, 4)) fig.suptitle('Iteration %i' % (i + 1)) x = np.linspace(3, 3, 5000) color = 'g' if accepted else 'r'

    # Plot prior prior_current = norm(mu_prior_mu, mu_prior_sd).pdf(mu_current) prior_proposal = norm(mu_prior_mu, mu_prior_sd).pdf(mu_proposal) prior = norm(mu_prior_mu, mu_prior_sd).pdf(x) ax1.plot(x, prior) ax1.plot([mu_current] * 2, [0, prior_current], marker='o', color='b') ax1.plot([mu_proposal] * 2, [0, prior_proposal], marker='o', color=color) ax1.annotate("", xy=(mu_proposal, 0.2), xytext=(mu_current, 0.2), arrowprops=dict(arrowstyle=">", lw=2.)) ax1.set(ylabel='Probability Density', title='current: prior(mu=%.2f) = %.2f\nproposal: prior(mu=%.2f) = %.2f' % (mu_current, prior_current, mu_proposal, prior_proposal))

    # Likelihood likelihood_current = norm(mu_current, 1).pdf(data).prod()

    likelihood_proposal = norm(mu_proposal, 1).pdf(data).prod() y = norm(loc=mu_proposal, scale=1).pdf(x) sns.distplot(data, kde=False, norm_hist=True, ax=ax2) ax2.plot(x, y, color=color) ax2.axvline(mu_current, color='b', linestyle='', label='mu_current') ax2.axvline(mu_proposal, color=color, linestyle='', label='mu_proposal') #ax2.title('Proposal {}'.format('accepted' if accepted else 'rejected')) ax2.annotate("", xy=(mu_proposal, 0.2), xytext=(mu_current, 0.2), arrowprops=dict(arrowstyle=">", lw=2.)) ax2.set(title='likelihood(mu=%.2f) = %.2f\nlikelihood(mu=%.2f) = %.2f' % (mu_current, 1e14*likelihood_curret, mu_proposal, 1e14*likelihood_proposal))

    # Posterior posterior_analytical = calc_posterior_analytical(data, x, mu_prior_mu, mu_prior_sd) ax3.plot(x, posterior_analytical) posterior_current = calc_posterior_analytical(data, mu_current, mu_prior_mu, mu_prior_sd) posterior_proposal = calc_posterior_analytical(data, mu_proposal, mu_prior_mu, mu_prior_sd)

    ax3.plot([mu_current] * 2, [0, posterior_current], marker='o', color='b') ax3.plot([mu_proposal] * 2, [0, posterior_proposal], marker='o', color=color) ax3.annotate("", xy=(mu_proposal, 0.2), xytext=(mu_current, 0.2), arrowprops=dict(arrowstyle=">", lw=2.)) #x3.set(title=r'prior x likelihood $\propto$ posterior') ax3.set(title='posterior(mu=%.2f) = %.5f\nposterior(mu=%.2f) = %.5f' % (mu_current, posterior_current, mu_poposal, posterior_proposal))

    if accepted: trace.append(mu_proposal) else: trace.append(mu_current) ax4.plot(trace) ax4.set(xlabel='iteration', ylabel='mu', title='trace') plt.tight_layout() #plt.legend()

    Visualizing MCMC

    To visualize the sampling, we'll create plots for some quantities that are computed. Each row below is a single iteration through our Metropolis sampler.

    The first columns is our prior distribution -- what our belief about is before seeing the data. You can see how the distribution is static and we only plug

    our proposals. The vertic al lines represent our current in blue and our proposed in either red or green (rejected or accept ed, respectively).

    The 2nd column is our likelihood and what we are using to evaluate how good our model explains the data. You can see that the likelihood functio

    changes in response to the proposed . The blue histogram which is our data. The solid line in green or red is the likeli hood with the currently proposed m

    Intuitively, the more overlap there is between likelihood and data, the better the model explains the data and the higher the resulting probability will be. Th

    dotted line of the same color is the proposed mu and the dotted blue line is the current mu.

  • 7/26/2019 MCMC Sampling for Dummies

    6/15

    The 3rd column is our posterior distribution. Here I am displaying the normalized posterior but as we found out above, we can just multiply the prior valu

    for the current and proposed 's by the likelihood value for the two 's to get the unnormalized posterior values (which we use for the actual computation

    and divide one by the other to get our acceptance probability.

    The 4th column is our trace (i.e. the posterior samples of we're generating) where we store each sample irrespect ive of whether it was accepted

    rejected (in which case the line just stays constant).

    Note that we always move to relatively more likely values (in terms of their posterior density), but only sometimes to relatively less likely values, a

    can be seen in iteration 14 (the iteration number can be found at the top center of each row).

    In [6]: np.random.seed(123)sampler(data, samples=8, mu_init=1., plot=True);

  • 7/26/2019 MCMC Sampling for Dummies

    7/15

    Now the magic of MCMC is that you just have to do that for a long time, and the samples that are generated in this way come from the posteri

    distribution of your model. There is a rigorous mathematical proof that guarantees this which I won't go into detail here.

    To get a sense of what this produces, lets draw a lot of samples and plot them.

    In [7]: posterior = sampler(data, samples=15000, mu_init=1.)

    fig, ax = plt.subplots()ax.plot(posterior)_ = ax.set(xlabel='sample', ylabel='mu');

  • 7/26/2019 MCMC Sampling for Dummies

    8/15

    This is usually called the trace. To now get an approxmation of the posterior (the reason why we're doing all this), we simply take the histogram of thi

    trace. It's important to keep in mind that although this looks similar to the data we sampled above to fit the model, the two are completely separate. Th

    below plot represents our belief in mu. In this case it just happens to also be normal but for a different model, it could have a completely different shap

    than the likelihood or prior.

    In [8]: ax = plt.subplot()

    sns.distplot(posterior[500:], ax=ax, label='estimated posterior')x = np.linspace(.5, .5, 500)post = calc_posterior_analytical(data, x, 0, 1)ax.plot(x, post, 'g', label='analytic posterior')_ = ax.set(xlabel='mu', ylabel='belief');ax.legend();

    As you can see, by following the above procedure, we get samples from t he s ame dis tribution as what we derived analytically .

    Proposal width

    Above we set the proposal width to 0.5. That turned out to be a pretty good value. In general you don't want the width to be too narrow because you

    sampling will be inefficient as it takes a long time to explore the whole parameter space and shows the typical random-walk behavior:

    In [9]: posterior_small = sampler(data, samples=5000, mu_init=1., proposal_width=.01)fig, ax = plt.subplots()ax.plot(posterior_small);_ = ax.set(xlabel='sample', ylabel='mu');

  • 7/26/2019 MCMC Sampling for Dummies

    9/15

    But you also don't want it to be so large that you never accept a jump:

    In [10]: posterior_large = sampler(data, samples=5000, mu_init=1., proposal_width=3.)fig, ax = plt.subplots()ax.plot(posterior_large); plt.xlabel('sample'); plt.ylabel('mu');

    _ = ax.set(xlabel='sample', ylabel='mu');

    Note, however, that we are still sampling from our target posterior distribution here as guaranteed by the mathemtical proof, just less efficiently:

    In [11]: sns.distplot(posterior_small[1000:], label='Small step size')sns.distplot(posterior_large[1000:], label='Large step size');_ = plt.legend();

  • 7/26/2019 MCMC Sampling for Dummies

    10/15

    With more samples this will eventually look like the true posterior. The key is that we want our samples to be independent of each other which cleary isn

    the case here. Thus, one common metric to evaluate the efficiency of our sampler is the autocorrelation -- i.e. how correlated a sample i is to sample i1

    i2, etc:

    In [12]: from pymc3.stats import autocorrlags = np.arange(1, 100)fig, ax = plt.subplots()ax.plot(lags, [autocorr(posterior_large, l) for l in lags], label='large step size')ax.plot(lags, [autocorr(posterior_small, l) for l in lags], label='small step size')ax.plot(lags, [autocorr(posterior, l) for l in lags], label='medium step size')ax.legend(loc=0)_ = ax.set(xlabel='lag', ylabel='autocorrelation', ylim=(.1, 1))

    Obviously we want to have a smart way of figuring out the right step width automatically. One common method is to keep adjusting the proposal width s

    that roughly 50% proposals are rejected.

    Extending to more complex models

    Now you can easily imagine that we could also add a sigma parameter for the standard-deviation and follow the same procedure for this second paramete

    In that case, we would be generating proposals for muand sigma but the algorithm logic would be nearly identical. Or, we could have data from a ve

    different distribution like a Binomial and still use the same algorithm and get the correct posterior. That's pretty cool and a huge benefit of probabilisti

    programming: Just define the model you want and let MCMC take care of the inference.

    For example, the below model can be written in PyMC3 quite easily. Below we also use the Metropolis sampler (which automatically tunes the propos

    width) and see that we get identical results. Feel free to play around with this and change the distributions. For more information, as well as more comple

    examples, see the PyMC3 documentation (http://pymc-devs.github.io/pymc3/getting_started/).

    In [13]: import pymc3 as pm

    with pm.Model(): mu = pm.Normal('mu', 0, 1) sigma = 1. returns = pm.Normal('returns', mu=mu, sd=sigma, observed=data)

    step = pm.Metropolis() trace = pm.sample(15000, step)sns.distplot(trace[2000:]['mu'], label='PyMC3 sampler');sns.distplot(posterior[500:], label='Handwritten sampler');plt.legend();

    [100%] 15000 of 15000 complete in 1.7 sec

    http://pymc-devs.github.io/pymc3/getting_started/
  • 7/26/2019 MCMC Sampling for Dummies

    11/15

    Conclusions

    We glossed over a lot of detail which is certainly important but there are many other posts that deal with that. Here, we really wanted to communicate th

    idea of MCMC and the Metropolis sampler. Hopefully you will have gathered some intuition which will equip you to read one of the more technicintroductions to this topic.

    Other, more fancy, MCMC algorithms like Hamiltonian Monte Carlo actually work very similar to this, they are just much more clever in proposing where t

    jump next.

    This blog post was written in a Jupyter Notebook, you can find the underlying NB with all its code her

    (https://github.com/twiecki/WhileMyMCMCGentlySamples/blob/master/content/downloads/notebooks/MCMC-sampling-for-dummies.ipynb).

    Posted by Thomas Wiecki Nov 10, 2015 bayesian statistics (http://twiecki.github.com/tag/bayesian-statistics.html)

    Comments

    21 Comments 1

    Josh L Spinoza

    This was extremely helpful. Thank you!

    Charles

    Very insightful. Great post, thanks!

    Rob Hicks

    This is a fantastic post and has helped my students in visualizing the sampling mechanism. I've adapted it for one of my lectur

    and wanted you to know I had to modify the visualization, since the maximum of the likelihood function in the second column

    of graphs in [6] is not constant. Since p(y|mu) is only maximized at 1 value of mu, the code (where you are plotting the

    likelihood function- not the value of the likelihood at the proposal) should be something like this:

    `y = np.array([norm(loc=i,scale=1).pdf(data).prod() for i in x])`

    rather than

    `y = norm(loc=mu_proposal, scale=1).pdf(x)`

    http://twiecki.github.com/tag/bayesian-statistics.htmlhttps://github.com/twiecki/WhileMyMCMCGentlySamples/blob/master/content/downloads/notebooks/MCMC-sampling-for-dummies.ipynbhttp://twiecki.github.com/tag/bayesian-statistics.htmlhttps://github.com/twiecki/WhileMyMCMCGentlySamples/blob/master/content/downloads/notebooks/MCMC-sampling-for-dummies.ipynb
  • 7/26/2019 MCMC Sampling for Dummies

    12/15

    in the plot_proposal function.

    Thomas Wiecki Mod

    Thanks, I'm glad it's useful.

    What I'm doing there is just plotting the likelihood function that will be used to evaluate the data, not the actual

    likelihood at that point. But I can see how that's a bit misleading.

    Josh L Spinoza

    Hi Thomas, I tried commenting on one of the posts below but it has been closed. I'm really interested in Spike-

    and-Slab priors as you mentioned below. Is there any way to implement this in PyMC2/3 ? I read a paper that

    implements a Bernoulli for Spike and a Gaussian for the Slab . Can that be done using PyMC2/3 ?

    Thomas Wiecki Mod

    Hi Josh, Yes, can certainly be done. There's no example but it shouldn't be too hard for you to code it up

    you do, please consider contributing it to pymc3 as an example.

    Josh L Spinoza Thanks for the response! Is there a tutorial on adding new priors? If not, how does the function need to b

    set up for use in PyMC3 ? I'm essentially trying to do this function in Python and PyMC3 seems to be the

    best way! http://rpackages.ianhowson.com... I'm new at PyMC3 but Iunderstand Bayesian Inference. I

    know Python really well so wanted to give it a shot before I give up and use the R package.

    Thomas Wiecki Mod

    Here is a description of how to add arbitrary distributions: http://pymc-devs.github.io/pym...

    warrenon

    Hey Thomas, great intro! Could you provide some links to the more technical sources you meantion in the conclusion? Thanks

    Warren

    Thomas Wiecki Mod

    I liked this talk by Ian Murray:

  • 7/26/2019 MCMC Sampling for Dummies

    13/15

    Jun

    Excellent post, this is the most intuitive explanation about MCMC I have ever read. I really like how you use code instead of

    math to explain the algorithm.

    Kevin

    Excellent article! Just one quick question Thomas: Looking at the plot where you drew 15k samples of mu from the posterior,

    why is it random in that range over multiple samples? Shouldnt the sample converge to a single value over samples rather than

    be consistent within that range? For example, in your multi chart image, you can see the trace converge toward mu=0 even wi

    a sample size of about 8! Shouldn't any large guesses after that NOT be accepted and thus every sample after that point be abo

    0? In code term, I'm referring to the sampler function.

    Thomas Wiecki Mod

    The key is this line: `accept = np.random.rand() < p_accept`So yes, it's not guaranteed but with some probability the sampler will jump to a value that has less (unnormalized)

    posterior probability. But certainly, very large guesses will be accepted only very rarely (that's why the sampler with the

    large step width only accepts jump so rarely).

    It's thus not "random" in the range, the trace plot only makes it look that way. Values closer to 0 are visited more often

    (you can see that in the histogram).

    Kevin

    Thanks for the quick reply! That explains it well and that's what I was originally thinking as well. Another

    question I've had was, correct me if i'm wrong, in terms of you saying that MCMC is able to approximate well

    distributions that may not have a trivial posterior. However, in our example the normal distribution was used anthus made the posterior calculation trivial. In many cases, we might not exactly know the distribution that the

    data is sampled from, would MCMC be able to still work when we don't assume that x | mu comes from a norm

    distribution or do we have to ALWAYS specify the underlying distribution in the likelihood?

    Thomas Wiecki Mod

    Yes, MCMC will work no matter what likelihood or p riors you use. But you do have to specify them

    somehow.

    antiquechrono

    I have a couple of questions that are bugging me if you don't mind answering.

    When you say you are calculating the probability of each data point these aren't really probability values, but probability densi

    values with units attached. How are you able to use the density (that doesn't sum to 1) in place of a probability value? Is it

    because when you divide the current and proposed density the units cancel creating a real probability value?

    Second, when most people are introduced to Bayes' Rule it's usually with a cancer example where you plug in probability value

    and calculate the new "posterior" probability. How are you able to go from plugging probabilities into Bayes' Rule to plugging i

    a likelihood and a distribution?

    Thomas Wiecki Mod

    1. Good question. It's probably easiest to think about the probability mass of an infinitesimally small region around thelikelihood value. See https://en.wikipedia.org/wiki/... for more info. But the normalization is a separate thing and also

  • 7/26/2019 MCMC Sampling for Dummies

    14/15

    Bayesian Deep Learning

    Avat Update: I gave using Lasagne a try and it works

    quite nicely, without any modifications. See here for an updated

    NB:

    Easily distributing a parallel IPython Notebook on a

    cluster

    Ava Nice post, Thomas. Similar to

    ipython_cluster_helper are pyina and pathos, which provide

    heterogeneous asynchronous parallel

    A modern guide to getting started with Data Science and

    Python

    Animating MCMC with PyMC3 and Matplotlib

    WHILE MY MCMC GENTLY SAMPLES

    present in the discrete case.

    2. Every model works with a likelihood function / distribution. In the cancer example you're using a Bernoulli distributi

    (https://en.wikipedia.org/wiki/... as the likelihood, they just don't tell that's how it's called so that it's less scary :). Baye

    formula stays the same in all cases though.

    antiquechrono

    Thank you for taking the time to reply.

    1. Are you referring to the Baye's normalization here? I was asking more along the lines of when you calculate

    p_accept = p_proposal / p_current. If for the sake of argument from the example in the post we say that mu is

    measured in grams.

    So when you calculate something like p_proposal it's not a normal probability value that ranges from 0-1. It's a

    density which is basically the change in probability per unit which is not a vanilla probability. When you calculat

    p_accept you get p_propsal g^-1 / p_current g^-1 so when you do the division you are left with a unit-less

    quantity which is now a real probability value?

    2. Well if you look at Bayes' Rule from your post you call the prior p(theta). I'm having a hard time understandi

    how that goes from being a vanilla probability like 0.5 to being an entire distribution like Normal(0,1).

    Charles

    I don't think of p_accept as being a formally defined probability. I think of it more as just being a simple

    comparison that tells whether the new value ( mu_proposal ) is better than the current one. In regards t

    your second question: when you tack on any value to your mu_current list you are essentially generating

    data from which to make a histogram out of. You can then use a density estimating function to

    approximate the density over the range of values you think your parameter will have. It has been a while

    since your question, so it may not be a question anymore. I hope this helps if it still is.

    aloctavodia

    Very nice post! I am been teaching structural bioinformatics (a branch of science that also uses MCMC methods) for biologist.

    The previous course have been highly conceptual and not as hand-ons as I would like it. I have completely changed the course

    for the next year I will be teaching them how to code and I have been preparing a MCMC chapter in line with your post. I thin

    I can use the visualization part you have created :-) I also have been thinking on replicating a visualization like this one

    http://blog.revolutionanalytic...

    What about plotting the first 5 movement and then every 100 or so steps, to see the convergence of the chain? I think I will als

    explored Ipython/Jupyter widgets (I have seen those, but I never has done anything "real").

    Well, thanks for sharing!

    Thomas Wiecki Mod

    Thanks!

    Good points, definitely feel free to use these graphs and/or code and resubmit any improvements you make.

  • 7/26/2019 MCMC Sampling for Dummies

    15/15

    Avat No, I mean in your translation here:

    http://stackrefactoring.blogsp... If you could, right under the

    title "A modern guide..." add: "For the

    va .

    marginal plots of the slice sampler are having this random-wal

    behavior that slowly moves up and

    Recent PostsBayesian Deep Learning (http://twiecki.github.com/blog/2016/06/01/bayesian-deep-learning/)

    MCMC sampling for dummies (http://twiecki.github.com/blog/2015/11/10/mcmc-sampling/)

    A modern guide to getting s tarted with Data Sc ience and Pyt hon (http://twieck i.github.com/blog/2014/11/18/python-for-data-science/)

    The Best Of Both Worlds: Hierarchical Linear Regression in PyMC3 (http://twiecki.github.com/blog/2014/03/17/bayesian-glms-3/)

    Easily distributing a parallel IPython Notebook on a cluster (http://twiecki.github.com/blog/2014/02/24/ipython-nb-cluster/)

    Categoriesmisc (http://twiecki.github.com/category/misc.html)

    Tagsbayesian statistics deep learning neural networks (http://twiecki.github.com/tag/bayesian-statistics-deep-learning-neural-networks.html), intro datascience

    (http://twiecki.github.com/tag/intro-datascience.html), computation (http://twiecki.github.com/tag/computation.html), bayesian statistics

    (http://twiecki.github.com/tag/bayesian-statistics.html)

    GitHub Repospydata_docker_jupyterhub (https://github.com/twiecki/py data_docker_jupyterhub)

    Docker container with a PyData stack and JupyterHub server

    CythonGSL (https://github.com/twiecki/CythonGSL)

    Cython interface for the GNU Scientific Library (GSL).

    pydata_ninja (https://github.com/twiecki/ pydata_ninja)

    The Path of the PyData Ninja

    @twiecki (https://github.com/twiecki) on GitHub

    Copyright 2013 - Thomas Wiecki - Powered by Pelican (http://getpelican.com)

    http://twiecki.github.com/blog/2015/11/10/mcmc-sampling/http://twiecki.github.com/tag/computation.htmlhttp://twiecki.github.com/blog/2016/06/01/bayesian-deep-learning/https://github.com/twiecki/CythonGSLhttp://twiecki.github.com/tag/bayesian-statistics.htmlhttp://twiecki.github.com/blog/2014/02/24/ipython-nb-cluster/http://twiecki.github.com/category/misc.htmlhttps://github.com/twiecki/pydata_ninjahttps://github.com/twieckihttp://twiecki.github.com/blog/2014/11/18/python-for-data-science/http://getpelican.com/http://twiecki.github.com/tag/intro-datascience.htmlhttp://twiecki.github.com/blog/2014/03/17/bayesian-glms-3/https://github.com/twiecki/pydata_docker_jupyterhubhttp://twiecki.github.com/tag/bayesian-statistics-deep-learning-neural-networks.html