Gentle Introduction to MCMC


  • 8/13/2019 Gentle Introduction to MCMC

    1/85

    Lecture I

    A Gentle Introduction to Markov Chain Monte Carlo (MCMC)

    Ed George, University of Pennsylvania

    Seminaire de Printemps, Villars-sur-Ollon, Switzerland

    March 2005


    1. MCMC: A New Approach to Simulation

    Consider the general problem of trying to calculate characteristics of a complicated multivariate probability distribution $f(x)$ on $x = (x_1, \ldots, x_p)$.

    For example, suppose we want to calculate the mean of $x_1$,

    $$\int\!\!\int x_1 f(x_1, x_2)\, dx_1\, dx_2$$

    where

    $$f(x_1, x_2) \propto (1 + x_1^2)^{-1}\, x_2^{-n} \exp\left\{ -\frac{1}{2 x_2^2} \sum_i (y_i - x_1)^2 - x_2 \right\}$$

    ($y_1, \ldots, y_n$ are fixed constants). Bad news: This calculation is analytically intractable.


    A Monte Carlo approach: Simulate $k$ observations $x^{(1)}, \ldots, x^{(k)}$ from $f(x)$ and use this sample to estimate the characteristics of interest. (Careful: Each $x^{(j)} = (x_1^{(j)}, \ldots, x_p^{(j)})$ is a multivariate observation.) For example, we could estimate the mean of $x_1$ by

    $$\bar{x}_1 = \frac{1}{k} \sum_j x_1^{(j)}.$$

    If $x^{(1)}, \ldots, x^{(k)}$ were independent observations (i.e. an iid sample), we could use standard central limit theorem results to draw inference about the quality of our estimate.

    Bad news: In many problems, methods are unavailable for direct simulation of an iid sample from $f(x)$.


    Good news: In many problems, methods such as the Gibbs sampler and the Metropolis-Hastings algorithms can be used to simulate a Markov chain $x^{(1)}, \ldots, x^{(k)}$ which is converging in distribution to $f(x)$ (i.e. as $k$ increases, the distribution of $x^{(k)}$ gets closer and closer to $f(x)$).

    Recall that a Markov chain $x^{(1)}, \ldots, x^{(k)}$ is a sequence such that for each $j \ge 1$, $x^{(j+1)}$ is sampled from a distribution $p(x \mid x^{(j)})$ which depends on $x^{(j)}$ (but not on $x^{(1)}, \ldots, x^{(j-1)}$).

    The function $p(x \mid x^{(j)})$ is called a Markov transition kernel. If $p(x \mid x^{(j)})$ is time-homogeneous (i.e. $p(x \mid x^{(j)})$ does not depend on $j$) and the transition kernel satisfies

    $$\int p(x \mid x')\, f(x')\, dx' = f(x),$$

    then the chain will converge to $f(x)$ if it converges at all.


    Simulation of a Markov chain requires a starting value $x^{(0)}$. If the chain is converging to $f(x)$, then the dependence between $x^{(j)}$ and $x^{(0)}$ diminishes as $j$ increases. After a suitable burn-in period of $l$ iterations, $x^{(l)}, \ldots, x^{(k)}$ behaves like a dependent sample from $f(x)$.

    Such behavior is illustrated by Figure 1.1 on page 6 of Gilks, Richardson & Spiegelhalter (1995).

    The output from such simulated chains can be used to estimate the characteristics of $f(x)$. For example, one can obtain approximate iid samples of size $m$ by taking the final $x^{(k)}$ values from $m$ separate chains.

    It is probably more efficient, however, to use all the simulated values. For example, $\bar{x}_1 = \frac{1}{k} \sum_j x_1^{(j)}$ will still converge to the mean of $x_1$.


    MCMC is the general procedure of simulating such Markov chains and using them to draw inference about the characteristics of $f(x)$.

    Methods which have ignited MCMC are the Gibbs sampler and the more general Metropolis-Hastings algorithms. As we will now see, these are simply prescriptions for constructing a Markov transition kernel $p(x \mid x')$ which generates a Markov chain $x^{(1)}, \ldots, x^{(k)}$ converging to $f(x)$.

    2. The Gibbs Sampler (GS)

    The GS is an algorithm for simulating a Markov chain $x^{(1)}, \ldots, x^{(k)}$ which is converging to $f(x)$, by successively sampling from the full conditional component distributions $f(x_i \mid x_{-i})$, $i = 1, \ldots, p$, where $x_{-i}$ denotes the components of $x$ other than $x_i$.


    For simplicity, consider the case where $p = 2$. The GS generates a Markov chain

    $$(x_1^{(1)}, x_2^{(1)}),\ (x_1^{(2)}, x_2^{(2)}),\ \ldots,\ (x_1^{(k)}, x_2^{(k)})$$

    converging to $f(x_1, x_2)$, by successively sampling

    $x_1^{(1)}$ from $f(x_1 \mid x_2^{(0)})$
    $x_2^{(1)}$ from $f(x_2 \mid x_1^{(1)})$
    $x_1^{(2)}$ from $f(x_1 \mid x_2^{(1)})$
    ...
    $x_1^{(k)}$ from $f(x_1 \mid x_2^{(k-1)})$
    $x_2^{(k)}$ from $f(x_2 \mid x_1^{(k)})$

    (To get started, prespecify an initial value for $x_2^{(0)}$.)


    For example, suppose

    $$f(x_1, x_2) \propto \binom{n}{x_1} x_2^{x_1 + \alpha - 1} (1 - x_2)^{n - x_1 + \beta - 1}, \quad x_1 = 0, 1, \ldots, n, \quad 0 \le x_2 \le 1.$$

    The GS proceeds by successively sampling from

    $$f(x_1 \mid x_2) = \text{Binomial}(n, x_2)$$

    $$f(x_2 \mid x_1) = \text{Beta}(x_1 + \alpha,\ n - x_1 + \beta)$$

    To illustrate the GS for the above, Figure 1 of Casella & George (1992) presents a histogram of a sample of $m = 500$ final values of $x_1$ from separate GS runs of length $k = 10$ when $n = 16$, $\alpha = 2$ and $\beta = 4$. This is compared with an iid sample from the actual distribution $f(x_1)$ (which here can be shown to be Beta-Binomial).
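The two conditional draws above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `gibbs_beta_binomial`, the starting value `x2_init = 0.5`, and the seed are choices made here, not part of the lecture. The settings n = 16, alpha = 2, beta = 4, k = 10 and m = 500 are the ones quoted from Casella & George (1992).

```python
import random

def gibbs_beta_binomial(n, alpha, beta, k, x2_init=0.5):
    """Run k Gibbs sweeps and return the final pair (x1, x2)."""
    x1, x2 = 0, x2_init
    for _ in range(k):
        # x1 | x2 ~ Binomial(n, x2), drawn as a sum of n Bernoulli trials
        x1 = sum(random.random() < x2 for _ in range(n))
        # x2 | x1 ~ Beta(x1 + alpha, n - x1 + beta)
        x2 = random.betavariate(x1 + alpha, n - x1 + beta)
    return x1, x2

random.seed(0)  # arbitrary seed, for reproducibility only
n, alpha, beta, k, m = 16, 2, 4, 10, 500
# m separate runs of length k, keeping only the final x1 of each run
finals = [gibbs_beta_binomial(n, alpha, beta, k)[0] for _ in range(m)]
mean_x1 = sum(finals) / m
# The Beta-Binomial marginal has mean n * alpha / (alpha + beta)
print(mean_x1, n * alpha / (alpha + beta))
```

Taking only the final value of each short run mirrors the "m separate chains" strategy described earlier; a single long run with burn-in would be the alternative.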


    Note that $f(x_1) = \int f(x_1, x_2)\, dx_2 = \int f(x_1 \mid x_2)\, f(x_2)\, dx_2$. This expression suggests that an improved estimate of $f(x_1)$ in this example can be obtained by inserting the $m$ values of $x_2^{(k)}$ into

    $$\hat{f}(x_1) = \frac{1}{m} \sum_{i=1}^{m} f(x_1 \mid x_2^{(i)}).$$

    Figure 3 of Casella & George (1992) illustrates the improvement obtained by this estimate.
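As a rough sketch of this averaged-conditional estimate, the code below reruns the Beta-Binomial Gibbs sampler, keeps the final $x_2$ of each run, and averages the Binomial probabilities. The helper names (`final_x2`, `binom_pmf`, `beta_binom_pmf`), the starting value 0.5 and the seed are assumptions made for illustration. The exact Beta-Binomial probabilities are computed only as a check.

```python
import math
import random

def binom_pmf(x, n, p):
    """f(x1 | x2): Binomial(n, p) probability of x."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def beta_binom_pmf(x, n, a, b):
    """Exact marginal f(x1) = C(n, x) B(x + a, n - x + b) / B(a, b)."""
    def log_beta(u, v):
        return math.lgamma(u) + math.lgamma(v) - math.lgamma(u + v)
    return math.comb(n, x) * math.exp(log_beta(x + a, n - x + b) - log_beta(a, b))

def final_x2(n, a, b, k):
    """Final x2 of one Gibbs run of length k (start value 0.5 is arbitrary)."""
    x2 = 0.5
    for _ in range(k):
        x1 = sum(random.random() < x2 for _ in range(n))
        x2 = random.betavariate(x1 + a, n - x1 + b)
    return x2

random.seed(1)
n, a, b, k, m = 16, 2, 4, 10, 500
x2_draws = [final_x2(n, a, b, k) for _ in range(m)]
# f_hat(x1) = (1/m) * sum_i f(x1 | x2_i), for every x1 in 0..n
f_hat = [sum(binom_pmf(x, n, x2) for x2 in x2_draws) / m for x in range(n + 1)]
f_true = [beta_binom_pmf(x, n, a, b) for x in range(n + 1)]
max_err = max(abs(h - t) for h, t in zip(f_hat, f_true))
print(max_err)
```

Because each term is a full conditional distribution rather than a single draw, the estimate is a smooth probability vector that sums to one by construction.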

    Note that the conditional distributions for the above setup, the Binomial and the Beta, can be simulated by routine methods. This is not always the case. For example, $f(x_1 \mid x_2)$ from page 2 is not of standard form. Fortunately, such distributions can be simulated using envelope methods such as rejection sampling, the ratio-of-uniforms method or adaptive rejection sampling. As we'll see, Metropolis-Hastings algorithms can also be used for this purpose.


    3. Metropolis-Hastings Algorithms (MH)

    MH algorithms generate Markov chains which converge to $f(x)$, by successively sampling from an (essentially) arbitrary proposal distribution $q(x \mid x')$ (i.e. a Markov transition kernel) and imposing a random rejection step at each transition.

    An MH algorithm for a candidate proposal distribution $q(x \mid x')$ entails simulating $x^{(1)}, \ldots, x^{(k)}$ as follows:

    Simulate a transition candidate $x_C$ from $q(x \mid x^{(j)})$.

    Set $x^{(j+1)} = x_C$ with probability

    $$\alpha(x^{(j)}, x_C) = \min\left\{ 1,\ \frac{q(x^{(j)} \mid x_C)}{q(x_C \mid x^{(j)})}\, \frac{f(x_C)}{f(x^{(j)})} \right\}.$$

    Otherwise set $x^{(j+1)} = x^{(j)}$.


    The original Metropolis algorithm was based on symmetric $q$ (i.e. $q(x \mid x') = q(x' \mid x)$), for which $\alpha$ is of the simple form

    $$\alpha(x^{(j)}, x_C) = \min\left\{ 1,\ \frac{f(x_C)}{f(x^{(j)})} \right\}.$$

    If $q(x \mid x')$ is chosen such that the Markov chain satisfies modest conditions (e.g. irreducibility and aperiodicity), then convergence to $f(x)$ is guaranteed. However, the rate of convergence will depend on the relationship between $q(x \mid x')$ and $f(x)$.

    When $x$ is continuous, a popular choice for $q(x \mid x')$ is $x = x' + z$ where $z \sim N_p(0, \Sigma)$. The resulting chain is called a random walk chain. Note that the choice of scale $\Sigma$ can critically affect the mixing (i.e. movement) of the chain. Figure 1.1 on page 6 of Gilks, Richardson & Spiegelhalter (1995) illustrates this when $p = 1$. Other distributions for $z$ can also be used.
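A minimal sketch of a random walk chain follows, assuming the unnormalized density $f(x_1, x_2)$ from the opening example as the target. The data `y`, the proposal scale 0.5, the starting point, the burn-in length and the seed are all hypothetical choices made here for illustration. Only the ratio $f(x_C)/f(x^{(j)})$ is needed, so the code works with the log of the unnormalized density.

```python
import math
import random

def log_f(x1, x2, y):
    """Log of the unnormalized toy density; -inf outside x2 > 0."""
    if x2 <= 0:
        return -math.inf
    ss = sum((yi - x1) ** 2 for yi in y)
    return -math.log1p(x1 * x1) - len(y) * math.log(x2) - ss / (2 * x2 * x2) - x2

def random_walk_metropolis(y, k, scale, start=(0.0, 1.0)):
    """Random walk chain with z ~ N_2(0, scale^2 I); returns (chain, acceptance rate)."""
    x, lf = start, log_f(start[0], start[1], y)
    chain, accepted = [], 0
    for _ in range(k):
        cand = (x[0] + random.gauss(0, scale), x[1] + random.gauss(0, scale))
        lf_cand = log_f(cand[0], cand[1], y)
        # Symmetric proposal: accept with probability min(1, f(cand) / f(x))
        if random.random() < math.exp(min(0.0, lf_cand - lf)):
            x, lf = cand, lf_cand
            accepted += 1
        chain.append(x)
    return chain, accepted / k

random.seed(2)
y = [random.gauss(3.0, 2.0) for _ in range(20)]    # hypothetical data
chain, acc_rate = random_walk_metropolis(y, k=20000, scale=0.5)
kept = chain[2000:]                                 # discard a burn-in period
mean_x1 = sum(x[0] for x in kept) / len(kept)
ybar = sum(y) / len(y)
print(round(acc_rate, 2), round(mean_x1, 2), round(ybar, 2))
```

Rerunning with a much smaller or much larger `scale` is a quick way to see the mixing problem described above: tiny steps are almost always accepted but move slowly, while huge steps are almost always rejected.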


    Another useful choice, called an independence sampler, is obtained when the proposal $q(x \mid x') = q(x)$ does not depend on $x'$. The resulting $\alpha$ is of the form

    $$\alpha(x^{(j)}, x_C) = \min\left\{ 1,\ \frac{q(x^{(j)})}{q(x_C)}\, \frac{f(x_C)}{f(x^{(j)})} \right\}.$$

    Such samplers work well when $q(x)$ is a good heavy-tailed approximation to $f(x)$.

    It may be preferable to use an MH algorithm which updates the components $x_i^{(j)}$ of $x$ one at a time. It can be shown that the Gibbs sampler is just a special case of such a single-component MH algorithm where $q$ is chosen so that $\alpha \equiv 1$.


    Finally, to see why MH algorithms work, it is not too hard to show that the implied transition kernel $p(x \mid x')$ of any MH algorithm satisfies

    $$p(x \mid x')\, f(x') = p(x' \mid x)\, f(x),$$

    a condition called detailed balance or reversibility. Integrating both sides of this identity with respect to $x'$ yields

    $$\int p(x \mid x')\, f(x')\, dx' = f(x),$$

    showing that $f(x)$ is the limiting distribution when the chain converges.
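On a finite state space the same argument can be checked numerically. The sketch below builds the MH transition matrix for a small hypothetical target `f` and an asymmetric hypothetical proposal `q`, then measures how far the chain is from detailed balance and from stationarity. The function name `mh_kernel` and all numbers are illustrative assumptions; sums replace the integrals of the continuous case.

```python
def mh_kernel(f, q):
    """Transition matrix P[a][b] = Pr(next = b | current = a) of an MH chain.

    f: unnormalized target weights; q[a][b]: probability of proposing b from a.
    """
    s = len(f)
    P = [[0.0] * s for _ in range(s)]
    for a in range(s):
        for b in range(s):
            if a != b and q[a][b] > 0:
                alpha = min(1.0, (q[b][a] * f[b]) / (q[a][b] * f[a]))
                P[a][b] = q[a][b] * alpha
        P[a][a] = 1.0 - sum(P[a])   # rejected or self proposals stay put
    return P

f = [1.0, 4.0, 2.0, 3.0]            # hypothetical unnormalized target
q = [[0.10, 0.50, 0.20, 0.20],      # hypothetical asymmetric proposal
     [0.30, 0.10, 0.30, 0.30],
     [0.25, 0.25, 0.25, 0.25],
     [0.40, 0.20, 0.30, 0.10]]
P = mh_kernel(f, q)
states = range(len(f))
# Detailed balance: f(a) P[a][b] == f(b) P[b][a] for every pair of states
balance_gap = max(abs(f[a] * P[a][b] - f[b] * P[b][a]) for a in states for b in states)
# Stationarity: sum_a f(a) P[a][b] == f(b)
stationary_gap = max(abs(sum(f[a] * P[a][b] for a in states) - f[b]) for b in states)
print(balance_gap, stationary_gap)
```

Both gaps come out at floating-point rounding level even though `f` is never normalized, which is the finite-state analogue of the identity above.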

    4. The Model Liberation Movement

    Advances in computing technology have unleashed the power of Monte Carlo methods, which in turn are now unleashing the potential of statistical modeling.


    Our new ability to simulate from complicated multivariate probability distributions via MCMC is having impact in many areas of Statistics, but most profoundly for Bayesian approaches to statistical modeling.

    The Bayesian paradigm uses probability to characterize ALL uncertainty as follows:

    Data is a realization from a model $p(\text{Data} \mid \theta)$, where $\theta$ is an unknown (possibly multivariate) parameter.

    $\theta$ is treated as a realization from a prior distribution $p(\theta)$.

    Post-data inference about $\theta$ is based on the posterior distribution

    $$p(\theta \mid \text{Data}) = \frac{p(\text{Data} \mid \theta)\, p(\theta)}{\int p(\text{Data} \mid \theta)\, p(\theta)\, d\theta}$$


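    In the past, analytical intractability of the expression for $p(\theta \mid \text{Data})$ severely stymied realistic practical Bayesian methods. Unrealistic, oversimplified models were too often used to facilitate calculations. MCMC has changed this, and opened up vast new realms of modeling possibilities.

    My initial example

    $$f(x_1, x_2) \propto (1 + x_1^2)^{-1}\, x_2^{-n} \exp\left\{ -\frac{1}{2 x_2^2} \sum_i (y_i - x_1)^2 - x_2 \right\}$$

    was just a disguised posterior distribution for the Bayesian setup

    $$y_1, \ldots, y_n \ \text{iid} \sim N(\mu, \sigma^2)$$

    $$\mu \sim \text{Cauchy}(0, 1), \quad \sigma \sim \text{Exponential}(1).$$

    The posterior of the parameters $\mu$ and $\sigma$ is

    $$p(\mu, \sigma \mid \text{Data}) \propto (1 + \mu^2)^{-1}\, \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} \sum_i (y_i - \mu)^2 - \sigma \right\}.$$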


    In the above example, $f(x)$ can only be specified up to a norming constant. This is typical of Bayesian formulations. A huge attraction of GS and MH algorithms is that these norming constants are not needed.

    The previous example is just a toy problem. MCMC is in fact enabling posterior calculation for extremely complicated models with hundreds and even thousands of parameters.

    Going even further, the Bayesian approach can be used to obtain posterior distributions over model spaces. Under such formulations, MCMC algorithms are leading to new search engines which automatically identify promising models.


    References For Getting Started

    Casella, G. & George, E.I. (1992) Explaining the Gibbs Sampler, The American Statistician, 46, 167-174.

    Chib, S. & Greenberg, E. (1995) Understanding the Metropolis-Hastings Algorithm, The American Statistician, 49, 327-335.

    Gilks, W.R., Richardson, S. & Spiegelhalter, D.J. (1995) Markov Chain Monte Carlo in Practice, Chapman & Hall, London.

    Robert, C.P. & Casella, G. (2004) Monte Carlo Statistical Methods, 2nd Edition, Springer, New York.


    Lecture II

    Bayesian Approaches for Model Uncertainty

    Ed George, University of Pennsylvania

    Seminaire de Printemps, Villars-sur-Ollon, Switzerland

    March 2005


    1. A Probabilistic Setup for Model Uncertainty

    Suppose a set of $K$ models $\{M_1, \ldots, M_K\}$ is under consideration for data $Y$.

    Under $M_k$, $Y$ has density $p(Y \mid \theta_k, M_k)$ where $\theta_k$ is a vector of unknown parameters that indexes the members of $M_k$. (More precisely, $M_k$ is a model class.)

    The Bayesian approach proceeds by assigning a prior probability distribution $p(\theta_k \mid M_k)$ to the parameters of each model, and a prior probability $p(M_k)$ to each model.

    Intuitively, this complete specification can be understood as a three stage hierarchical mixture model for generating the data $Y$: first the model $M_k$ is generated from $p(M_1), \ldots, p(M_K)$, second the parameter vector $\theta_k$ is generated from $p(\theta_k \mid M_k)$, and third the data $Y$ is generated from $p(Y \mid \theta_k, M_k)$.


    Letting $Y_f$ be a future unknown observation, this formulation induces a joint distribution

    $$p(Y_f, Y, \theta_k, M_k) = p(Y_f, Y \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, p(M_k).$$

    Conditioning on $Y$, all remaining uncertainty is captured by the joint posterior distribution $p(Y_f, \theta_k, M_k \mid Y)$. Through conditioning and marginalization, this can be used for a variety of Bayesian inferences and decisions.

    For example, for prediction one would margin out both $\theta_k$ and $M_k$ and use the predictive distribution $p(Y_f \mid Y)$, which in effect averages over all the unknown models.


    Of particular interest are the posterior model probabilities

    $$p(M_k \mid Y) = \frac{p(Y \mid M_k)\, p(M_k)}{\sum_j p(Y \mid M_j)\, p(M_j)}$$

    where

    $$p(Y \mid M_k) = \int p(Y \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k$$

    is the marginal or integrated likelihood of $M_k$.

    In terms of the three stage hierarchical mixture formulation, $p(M_k \mid Y)$ is the probability that $M_k$ generated the data, i.e. that $M_k$ was generated from $p(M_1), \ldots, p(M_K)$ in the first step.

    The model posterior distribution $p(M_1 \mid Y), \ldots, p(M_K \mid Y)$ provides a complete post-data representation of model uncertainty and is the fundamental object of interest for model selection and model averaging.


    A natural and simple strategy for model selection is to choose the most probable $M_k$, the one for which $p(M_k \mid Y)$ is largest. However, for the purpose of prediction with a single model, it may be better to use the median posterior model. Alternatively one might prefer to report a set of high posterior models along with their probabilities to convey the model uncertainty.

    Based on these posterior probabilities, pairwise comparison of models is summarized by the posterior odds

    $$\frac{p(M_1 \mid Y)}{p(M_2 \mid Y)} = \frac{p(Y \mid M_1)}{p(Y \mid M_2)} \times \frac{p(M_1)}{p(M_2)}.$$

    Note how the data, through the Bayes factor

    $$\frac{p(Y \mid M_1)}{p(Y \mid M_2)},$$

    updates the prior odds to yield the posterior odds.


    2. Examples

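    As a first example, consider the problem of choosing between two nonnested models, $M_1$ and $M_2$, for discrete count data $Y = (y_1, \ldots, y_n)$ where

    $$p(Y \mid \theta_1, M_1) = \alpha^n (1 - \alpha)^s, \quad s = \sum y_i,$$

    a geometric distribution where $\theta_1 = \alpha$, and

    $$p(Y \mid \theta_2, M_2) = \frac{e^{-n\lambda} \lambda^s}{\prod y_i!},$$

    a Poisson distribution where $\theta_2 = \lambda$.

    Suppose further that uncertainty about $\theta_1 = \alpha$ is described by a uniform prior

    $$p(\alpha \mid M_1) = 1 \quad \text{for } \alpha \in [0, 1]$$

    and uncertainty about $\theta_2 = \lambda$ is described by an exponential prior

    $$p(\lambda \mid M_2) = e^{-\lambda} \quad \text{for } \lambda \in [0, \infty).$$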


    Under these priors, the marginal distributions are

    $$p(Y \mid M_1) = \int_0^1 \alpha^n (1 - \alpha)^s\, d\alpha = \frac{n!\, s!}{(n + s + 1)!}$$

    $$p(Y \mid M_2) = \int_0^\infty \frac{e^{-(n+1)\lambda} \lambda^s}{\prod y_i!}\, d\lambda = \frac{s!}{(n + 1)^{s+1} \prod y_i!}$$

    The Bayes factor for $M_1$ vs $M_2$ is then

    $$\frac{p(Y \mid M_1)}{p(Y \mid M_2)} = \frac{n!\, (n + 1)^{s+1} \prod y_i!}{(n + s + 1)!}.$$

    When $p(M_1) = p(M_2) = 1/2$, this equals the posterior odds.

    Note that in contrast to the likelihood ratio statistic which compares maximized likelihoods, the Bayes factor compares averaged likelihoods.

    Caution - the choice of priors here can be very influential.
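For concreteness, the closed-form Bayes factor of this geometric versus Poisson comparison can be evaluated on the log scale to avoid overflowing factorials. The function name and the example counts are hypothetical. The formula is the one implied by the two marginals under the uniform and exponential priors.

```python
import math

def log_bayes_factor_geom_vs_poisson(y):
    """log of p(Y | M1) / p(Y | M2) under the uniform and exponential priors."""
    n, s = len(y), sum(y)
    # log p(Y | M1) = log( n! s! / (n + s + 1)! )
    log_m1 = math.lgamma(n + 1) + math.lgamma(s + 1) - math.lgamma(n + s + 2)
    # log p(Y | M2) = log( s! / ((n + 1)^(s + 1) * prod y_i!) )
    log_m2 = (math.lgamma(s + 1) - (s + 1) * math.log(n + 1)
              - sum(math.lgamma(v + 1) for v in y))
    return log_m1 - log_m2

y = [0, 0, 1, 0, 7, 0, 2, 0, 0, 5]          # hypothetical count data
log_bf = log_bayes_factor_geom_vs_poisson(y)
# With equal prior probabilities, the posterior odds equal the Bayes factor
post_prob_m1 = 1.0 / (1.0 + math.exp(-log_bf))
print(math.exp(log_bf), post_prob_m1)
```

Changing the priors (for example the rate of the exponential prior) would change both marginals, which is one way to explore the caution above.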


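    As our second example, consider the problem of testing $H_0: \mu = 0$ vs $H_1: \mu \ne 0$ when $y_1, \ldots, y_n$ iid $\sim N(\mu, 1)$.

    This can be treated as a Bayesian model selection problem by letting

    $$p(Y \mid \theta_1, M_1) = p(Y \mid \theta_2, M_2) = (2\pi)^{-n/2} \exp\left\{ -\frac{\sum (y_i - \mu)^2}{2} \right\}$$

    and assigning different priors to $\theta_1 = \theta_2 = \mu$, namely

    $$\Pr(\mu = 0 \mid M_1) = 1, \quad \text{i.e. a point mass at 0}$$

    $$p(\mu \mid M_2) = (2\pi\tau^2)^{-1/2} \exp\left\{ -\frac{\mu^2}{2\tau^2} \right\}$$

    These priors yield marginal distributions $p(Y \mid M_1)$ and $p(Y \mid M_2)$ that result in a Bayes factor of the form

    $$\frac{p(Y \mid M_1)}{p(Y \mid M_2)} = (1 + n\tau^2)^{1/2} \exp\left\{ -\frac{n^2 \tau^2 \bar{y}^2}{2(1 + n\tau^2)} \right\}$$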
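A small numerical sketch of this Bayes factor follows, with hypothetical values of $n$, $\bar{y}$ and $\tau^2$. As a sanity check it is compared with the ratio of the two marginal densities of $\bar{y}$, which is $N(0, 1/n)$ under $M_1$ and $N(0, \tau^2 + 1/n)$ under $M_2$; the function names are assumptions made here.

```python
import math

def bayes_factor(ybar, n, tau2):
    """Closed form p(Y | M1) / p(Y | M2) for the point null versus N(0, tau2) prior."""
    return math.sqrt(1 + n * tau2) * math.exp(
        -(n * n * tau2 * ybar * ybar) / (2 * (1 + n * tau2)))

def normal_pdf(x, var):
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def bayes_factor_via_ybar(ybar, n, tau2):
    """Same quantity from the marginal densities of the sufficient statistic ybar."""
    return normal_pdf(ybar, 1 / n) / normal_pdf(ybar, tau2 + 1 / n)

n, ybar = 100, 0.3                     # hypothetical data summary (z = 3)
tau2_values = (1.0, 1e2, 1e4, 1e6, 1e8)
bfs = [bayes_factor(ybar, n, t2) for t2 in tau2_values]
for t2, bf in zip(tau2_values, bfs):
    print(t2, bf)
```

Holding the data fixed, the factor grows roughly like $\tau$ once $n\tau^2$ is large, so a very diffuse prior under $M_2$ ends up favoring the point null even when $\bar{y}$ is three standard errors from zero.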


    3. General Considerations for Prior Selection

    For a given set of models $\mathcal{M}$, the effectiveness of the Bayesian approach rests firmly on the specification of the parameter priors $p(\theta_k \mid M_k)$ and the model space prior $p(M_1), \ldots, p(M_K)$.

    The most common and practical approach to prior specification in model uncertainty problems, especially large ones, is to try to construct noninformative, semi-automatic formulations, using subjective and empirical Bayes considerations where needed.

    A simple and popular choice for the model space prior is

    $$p(M_k) \equiv 1/K$$

    which is noninformative in the sense of favoring all models equally. However, this can be deceptive because it may not be uniform over other characteristics such as model size.


    Turning to the choice of parameter priors $p(\theta_k \mid M_k)$, the use of improper noninformative priors must be ruled out because their arbitrary norming constants are problematic for posterior odds comparisons.

    Proper priors guarantee the internal coherence of the Bayesian formulation and allow for meaningful hyperparameter specifications.

    An important consideration for prior specification is the analytical or numerical tractability for obtaining marginals $p(Y \mid M_k)$.

    For nested model formulations, centering priors is often straightforward. The crucial challenge is setting the prior dispersion. It should be large enough to avoid too much prior influence, but small enough to avoid overly diffuse specifications. Note that in our previous normal example, the Bayes factor goes to $\infty$ as $\tau \to \infty$, the Bartlett-Lindley paradox.


    4. Extracting Information from the Posterior

    When exact calculation of the posterior is not feasible, MCMC methods can often be used to simulate an approximate sample from the posterior. This can be used to estimate posterior characteristics or to search for high probability models.

    For a model characteristic $\eta$, MCMC methods such as the GS and MH algorithms entail simulation of a Markov chain, say $\eta^{(1)}, \eta^{(2)}, \ldots$, that is converging to its posterior distribution $p(\eta \mid Y)$.

    When $p(Y \mid M_k)$ can be obtained analytically, the GS and MH algorithms can be applied to directly simulate a model index from

    $$p(M_k \mid Y) \propto p(Y \mid M_k)\, p(M_k).$$

    Otherwise, one must simulate from $p(\theta_k, M_k \mid Y)$.


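    Conjugate priors are often used because of the computational advantages of having closed form expressions for $p(Y \mid M_k)$.

    Alternatively, it is sometimes useful to use a computable approximation for $p(Y \mid M_k)$ such as a Laplace approximation

    $$p(Y \mid M_k) \approx (2\pi)^{d_k/2}\, |H(\tilde{\theta}_k)|^{1/2}\, p(Y \mid \tilde{\theta}_k, M_k)\, p(\tilde{\theta}_k \mid M_k)$$

    where $d_k$ is the dimension of $\theta_k$, $\tilde{\theta}_k$ is the maximum of $h(\theta_k) \equiv \log p(Y \mid \theta_k, M_k)\, p(\theta_k \mid M_k)$, and $H(\tilde{\theta}_k)$ is minus the inverse Hessian of $h(\theta_k)$ evaluated at $\tilde{\theta}_k$.

    This is obtained by substituting the Taylor series approximation $h(\theta_k) \approx h(\tilde{\theta}_k) - \frac{1}{2} (\theta_k - \tilde{\theta}_k)' H(\tilde{\theta}_k)^{-1} (\theta_k - \tilde{\theta}_k)$ for $h(\theta_k)$ in $p(Y \mid M_k) = \int \exp\{h(\theta_k)\}\, d\theta_k$.

    Going further, people sometimes use the BIC approximation

    $$\log p(Y \mid M_k) \approx \log p(Y \mid \hat{\theta}_k, M_k) - (d_k/2) \log n$$

    obtained by using the MLE $\hat{\theta}_k$ and ignoring the terms that are constant in large samples.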
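To see the Laplace approximation at work, the sketch below applies it to the Poisson model with exponential prior from the first example of this lecture, where the exact marginal is available for comparison. Here $h(\lambda) = -(n+1)\lambda + s \log \lambda - \sum \log y_i!$, so the mode is $s/(n+1)$ and the second derivative at the mode is $-s/\tilde{\lambda}^2$. The function names and the data are hypothetical.

```python
import math

def log_marginal_exact(y):
    """log p(Y | M2) = log( s! / ((n + 1)^(s + 1) * prod y_i!) )."""
    n, s = len(y), sum(y)
    return (math.lgamma(s + 1) - (s + 1) * math.log(n + 1)
            - sum(math.lgamma(v + 1) for v in y))

def log_marginal_laplace(y):
    """Laplace approximation with d_k = 1 (requires sum(y) > 0)."""
    n, s = len(y), sum(y)
    log_fact = sum(math.lgamma(v + 1) for v in y)
    lam = s / (n + 1)                                 # maximizer of h
    h_mode = -(n + 1) * lam + s * math.log(lam) - log_fact
    H = lam * lam / s                                 # minus the inverse of h'' = -s / lam^2
    return 0.5 * math.log(2 * math.pi) + 0.5 * math.log(H) + h_mode

y = [3, 1, 4, 1, 5, 9, 2, 5]                          # hypothetical counts, s = 30
exact = log_marginal_exact(y)
approx = log_marginal_laplace(y)
print(exact, approx, exact - approx)
```

In this particular model the approximation amounts to replacing $s!$ by Stirling's formula, so the error on the log scale shrinks as the total count $s$ grows.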


    Lecture III

    Bayesian Variable Selection

    Ed George, University of Pennsylvania

    Seminaire de Printemps, Villars-sur-Ollon, Switzerland

    March 2005


    1. The Variable Selection Problem

    Suppose one wants to model the relationship between $Y$, a variable of interest, and a subset of $x_1, \ldots, x_p$, a set of potential explanatory variables or predictors, but there is uncertainty about which subset to use. Such a situation is particularly of interest when $p$ is large and $x_1, \ldots, x_p$ is thought to contain many redundant or irrelevant variables.

    This problem has received the most attention under the normal linear model

    $$Y = \beta_1 x_1 + \cdots + \beta_p x_p + \epsilon \quad \text{where } \epsilon \sim N_n(0, \sigma^2 I)$$

    when some unknown subset of regression coefficients are so small that it would be preferable to ignore them.

    This normal linear model setup is important not only because of its analytical tractability, but also because it is a canonical version of other important problems such as modern nonparametric regression.


    It will be convenient here to index each of the $2^p$ possible subset choices by

    $$\gamma = (\gamma_1, \ldots, \gamma_p)',$$

    where $\gamma_i = 0$ or $1$ according to whether $\beta_i$ is small or large, respectively. The size of the $\gamma$th subset is denoted $q_\gamma \equiv \gamma' 1$. We refer to $\gamma$ as a model since it plays the same role as $M_k$ described in Lecture II.

    2. Model Space Priors for Variable Selection

    For the specification of the model space prior, most Bayesian variable selection implementations have used independence priors of the form

    $$p(\gamma) = \prod_i w_i^{\gamma_i} (1 - w_i)^{1 - \gamma_i}.$$

    Under this prior, each $x_i$ enters the model independently with probability $p(\gamma_i = 1) = 1 - p(\gamma_i = 0) = w_i$.


    A useful simplification of this yields

    $$p(\gamma) = w^{q_\gamma} (1 - w)^{p - q_\gamma},$$

    where $w$ is the expected proportion of $x_i$'s in the model. A special case is the popular uniform prior

    $$p(\gamma) \equiv 1/2^p.$$

    Note that both of these priors are informative about the size of the model.

    Related priors that might also be considered are

    $$p(\gamma) = \frac{B(\alpha + q_\gamma,\ \beta + p - q_\gamma)}{B(\alpha, \beta)}$$

    obtained by putting a Beta prior on $w$, and more generally

    $$p(\gamma) = \binom{p}{q_\gamma}^{-1} h(q_\gamma)$$

    obtained by putting a prior $h(q_\gamma)$ on the model size.


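    3. Parameter Priors for Selection of Nonzero $\beta_i$

    When the goal is to ignore only those $x_i$ for which $\beta_i = 0$, the problem then becomes that of selecting a submodel of the form

    $$Y = X_\gamma \beta_\gamma + \epsilon, \quad \epsilon \sim N_n(0, \sigma^2 I)$$

    where $X_\gamma$ is the $n \times q_\gamma$ matrix whose columns correspond to the $\gamma$th subset of $x_1, \ldots, x_p$ and $\beta_\gamma$ is a $q_\gamma \times 1$ vector of unknown regression coefficients. Here, $(\beta_\gamma, \sigma^2)$ plays the role of $\theta_k$ described in Lecture II.

    Perhaps the most commonly applied parameter prior form for this setup is the conjugate normal-inverse-gamma prior

    $$p(\beta_\gamma \mid \sigma^2, \gamma) = N_{q_\gamma}(0, \sigma^2 \Sigma_\gamma),$$

    $$p(\sigma^2 \mid \gamma) = p(\sigma^2) = IG(\nu/2, \nu\lambda/2).$$

    ($p(\sigma^2)$ here is equivalent to $\nu\lambda/\sigma^2 \sim \chi^2_\nu$.)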


    A valuable feature of this prior is its analytical tractability; $\beta_\gamma$ and $\sigma^2$ can be eliminated by routine integration to yield

    $$p(Y \mid \gamma) \propto |X_\gamma' X_\gamma + \Sigma_\gamma^{-1}|^{-1/2}\, |\Sigma_\gamma|^{-1/2}\, (\nu\lambda + S_\gamma^2)^{-(n+\nu)/2}$$

    where $S_\gamma^2 = Y'Y - Y'X_\gamma (X_\gamma' X_\gamma + \Sigma_\gamma^{-1})^{-1} X_\gamma' Y$.

    The use of these closed form expressions can substantially speed up posterior evaluation and MCMC exploration, as we will see.

    In choosing values for the hyperparameters that control $p(\sigma^2)$, $\lambda$ may be thought of as a prior estimate of $\sigma^2$, and $\nu$ may be thought of as the prior sample size associated with this estimate.

    Let $\hat{\sigma}^2_{FULL}$ and $\hat{\sigma}^2_Y$ denote the traditional estimates of $\sigma^2$ based on the saturated and null models respectively. Treating $\hat{\sigma}^2_{FULL}$ and $\hat{\sigma}^2_Y$ as rough under- and over-estimates of $\sigma^2$, one might choose $\lambda$ and $\nu$ so that $p(\sigma^2)$ assigns substantial probability to the interval $(\hat{\sigma}^2_{FULL}, \hat{\sigma}^2_Y)$. This should at least avoid gross misspecification.


    Alternatively, the explicit choice of $\lambda$ and $\nu$ can be avoided by using $p(\sigma^2) \propto 1/\sigma^2$, the limit of the inverse-gamma prior as $\nu \to 0$.

    For choosing the prior covariance matrix $\Sigma_\gamma$ that controls $p(\beta_\gamma \mid \sigma^2, \gamma)$, specification is substantially simplified by setting $\Sigma_\gamma = c\, V_\gamma$, where $c$ is a scalar and $V_\gamma$ is a preset form such as $V_\gamma = (X_\gamma' X_\gamma)^{-1}$ or $V_\gamma = I_{q_\gamma}$, the $q_\gamma \times q_\gamma$ identity matrix.

    Having fixed $V_\gamma$, the goal is then to choose $c$ large enough so that $p(\beta_\gamma \mid \sigma^2, \gamma)$ is relatively flat over the region of plausible values of $\beta_\gamma$, thereby reducing prior influence. At the same time it is important to avoid excessively large values of $c$ because the Bayes factors will eventually put increasing weight on the null model as $c \to \infty$, the Bartlett-Lindley paradox. For practical purposes, a rough guide is to choose $c$ so that $p(\beta_\gamma \mid \sigma^2, \gamma)$ assigns substantial probability to the range of all plausible values for $\beta_\gamma$. Choices of $c$ between 10 and 10,000 seem to yield good results.


    4. Posterior Calculation and Exploration

    The previous conjugate prior formulations allow for analytical margining out of $\beta_\gamma$ and $\sigma^2$ from $p(Y, \beta_\gamma, \sigma^2 \mid \gamma)$ to yield a computable, closed form expression

    $$g(\gamma) \propto p(Y \mid \gamma)\, p(\gamma) \propto p(\gamma \mid Y)$$

    that can greatly facilitate posterior calculation and exploration.

    For example, when $\Sigma_\gamma = c\, (X_\gamma' X_\gamma)^{-1}$, we can obtain

    $$g(\gamma) = (1 + c)^{-q_\gamma/2}\, \left(\nu\lambda + Y'Y - (1 + 1/c)^{-1} W'W\right)^{-(n+\nu)/2}\, p(\gamma)$$

    where $W = T'^{-1} X_\gamma' Y$ for upper triangular $T$ such that $T'T = X_\gamma' X_\gamma$ (obtainable by the Cholesky decomposition). This representation allows for fast updating of $T$, and hence $W$ and $g(\gamma)$, when $\gamma$ is changed one component at a time, requiring $O(q_\gamma^2)$ operations per update, where $\gamma$ is the changed value.


    The availability of $g(\gamma) \propto p(\gamma \mid Y)$ allows for the flexible construction of MCMC algorithms that simulate a Markov chain

    $$\gamma^{(1)}, \gamma^{(2)}, \gamma^{(3)}, \ldots$$

    converging (in distribution) to $p(\gamma \mid Y)$.

    A variety of such MCMC algorithms can be conveniently obtained by applying the GS with $g(\gamma)$, for example by generating each component from the full conditionals

    $$p(\gamma_i \mid \gamma_{(i)}, Y)$$

    ($\gamma_{(i)} = \{\gamma_j : j \ne i\}$), where the $\gamma_i$ may be drawn in any fixed or random order.

    The generation of such components can be obtained rapidly as a sequence of Bernoulli draws using simple functions of the ratio

    $$\frac{p(\gamma_i = 1, \gamma_{(i)} \mid Y)}{p(\gamma_i = 0, \gamma_{(i)} \mid Y)} = \frac{g(\gamma_i = 1, \gamma_{(i)})}{g(\gamma_i = 0, \gamma_{(i)})}.$$
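The Bernoulli-draw scheme can be sketched as follows for the case $\Sigma_\gamma = c\,(X_\gamma' X_\gamma)^{-1}$ with the independence prior $p(\gamma) = w^{q_\gamma}(1-w)^{p-q_\gamma}$. This is an illustrative sketch, not the fast-updating implementation described in the lecture: it recomputes $W'W = Y'X_\gamma (X_\gamma'X_\gamma)^{-1} X_\gamma'Y$ from scratch with a small linear solve instead of updating a Cholesky factor. All function names, the simulated data and the hyperparameter values (`c`, `nu`, `lam`, `w`) are assumptions.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small systems)."""
    m = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, m):
            factor = M[r][col] / M[col][col]
            for cc in range(col, m + 1):
                M[r][cc] -= factor * M[col][cc]
    x = [0.0] * m
    for r in reversed(range(m)):
        x[r] = (M[r][m] - sum(M[r][cc] * x[cc] for cc in range(r + 1, m))) / M[r][r]
    return x

def log_g(gamma, XtX, XtY, YtY, n, c, nu, lam, w):
    """log g(gamma), up to a constant, for Sigma_gamma = c (X_gamma' X_gamma)^{-1}."""
    idx = [i for i, gi in enumerate(gamma) if gi]
    q, ww = len(idx), 0.0
    if q:
        A = [[XtX[i][j] for j in idx] for i in idx]
        b = [XtY[i] for i in idx]
        ww = sum(bi * xi for bi, xi in zip(b, solve(A, b)))   # W'W
    log_prior = q * math.log(w) + (len(gamma) - q) * math.log(1 - w)
    return (-0.5 * q * math.log(1 + c)
            - 0.5 * (n + nu) * math.log(nu * lam + YtY - ww / (1 + 1 / c))
            + log_prior)

def gibbs_gamma(XtX, XtY, YtY, n, sweeps, c=100.0, nu=1.0, lam=1.0, w=0.5):
    p = len(XtY)
    gamma, draws = [0] * p, []
    for _ in range(sweeps):
        for i in range(p):
            g1 = log_g(gamma[:i] + [1] + gamma[i + 1:], XtX, XtY, YtY, n, c, nu, lam, w)
            g0 = log_g(gamma[:i] + [0] + gamma[i + 1:], XtX, XtY, YtY, n, c, nu, lam, w)
            # Bernoulli draw with odds g(gamma_i = 1, rest) / g(gamma_i = 0, rest)
            prob1 = 1.0 / (1.0 + math.exp(min(g0 - g1, 700.0)))
            gamma[i] = 1 if random.random() < prob1 else 0
        draws.append(gamma[:])
    return draws

# Hypothetical data: only the first of p = 4 predictors has a nonzero coefficient.
random.seed(3)
n, p = 60, 4
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
Y = [3.0 * row[0] + random.gauss(0, 1) for row in X]
XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
XtY = [sum(r[i] * yv for r, yv in zip(X, Y)) for i in range(p)]
YtY = sum(yv * yv for yv in Y)
draws = gibbs_gamma(XtX, XtY, YtY, n, sweeps=300)
incl = [sum(d[i] for d in draws) / len(draws) for i in range(p)]
print(incl)   # empirical inclusion frequencies of each predictor
```

Working on the log scale keeps the ratio of the two $g$ values stable even when one model is overwhelmingly more probable than the other.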


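    Such $g(\gamma)$ also facilitates the use of MH algorithms. Because $g(\gamma)/g(\gamma') = p(\gamma \mid Y)/p(\gamma' \mid Y)$, these are of the form:

    1. Simulate a candidate $\gamma^*$ from a transition kernel $q(\gamma^* \mid \gamma^{(j)})$.

    2. Set $\gamma^{(j+1)} = \gamma^*$ with probability

    $$\alpha(\gamma^* \mid \gamma^{(j)}) = \min\left\{ \frac{q(\gamma^{(j)} \mid \gamma^*)}{q(\gamma^* \mid \gamma^{(j)})}\, \frac{g(\gamma^*)}{g(\gamma^{(j)})},\ 1 \right\}. \quad (1)$$

    Otherwise, $\gamma^{(j+1)} = \gamma^{(j)}$.

    A useful class of MH algorithms, the Metropolis algorithms, is obtained from the class of symmetric transition kernels of the form

    $$q(\gamma^1 \mid \gamma^0) = q_d \quad \text{if} \quad \sum_{i=1}^{p} |\gamma_i^0 - \gamma_i^1| = d, \quad (2)$$

    which simulate a candidate $\gamma^*$ by randomly changing $d$ components of $\gamma^{(j)}$ with probability $q_d$.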


    When available, fast updating schemes for $g(\gamma)$ can be exploited in all these MCMC algorithms.

    5. Extracting Information from the Output

    The simulated Markov chain sample $\gamma^{(1)}, \ldots, \gamma^{(K)}$ contains valuable information about the posterior $p(\gamma \mid Y)$.

    Empirical frequencies provide consistent estimates of individual model probabilities or characteristics such as $p(\gamma_i = 0 \mid Y)$.

    When closed form $g(\gamma)$ is available, we can do better. For example, the exact relative probability of any two values $\gamma^0$ and $\gamma^1$ in the sequence of simulated values is obtained as $g(\gamma^0)/g(\gamma^1)$.


    Such $g(\gamma)$ also facilitates estimation of the normalizing constant $C$ in $p(\gamma \mid Y) = C\, g(\gamma)$. Let $A$ be a preselected subset of $\gamma$ values and let $g(A) = \sum_{\gamma \in A} g(\gamma)$ so that $p(A \mid Y) = C\, g(A)$. Then, a consistent estimate of $C$ is

    $$\hat{C} = \frac{1}{g(A)\, K} \sum_{k=1}^{K} I_A(\gamma^{(k)})$$

    where $I_A(\cdot)$ is the indicator of the set $A$.

    This yields improved estimates of the probability of individual $\gamma$ values

    $$\hat{p}(\gamma \mid Y) = \hat{C}\, g(\gamma),$$

    as well as an estimate of the total visited probability

    $$\hat{p}(B \mid Y) = \hat{C}\, g(B),$$

    where $B$ is the set of visited $\gamma$ values.


    The simulated $\gamma^{(1)}, \ldots, \gamma^{(K)}$ can also play an important role in model averaging. For example, suppose one wanted to predict a quantity of interest $\Delta$ by the posterior mean

    $$E(\Delta \mid Y) = \sum_{\text{all } \gamma} E(\Delta \mid \gamma, Y)\, p(\gamma \mid Y).$$

    When $p$ is too large for exhaustive enumeration and $p(\gamma \mid Y)$ cannot be computed, $E(\Delta \mid Y)$ is unavailable and is typically approximated by something of the form

    $$\hat{E}(\Delta \mid Y) = \sum_{\gamma \in S} E(\Delta \mid \gamma, Y)\, \hat{p}(\gamma \mid Y, S)$$

    where $S$ is a manageable subset of models and $\hat{p}(\gamma \mid Y, S)$ is a probability distribution over $S$. (In some cases, $E(\Delta \mid \gamma, Y)$ will also need to be approximated.)


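    Letting $S$ be the sampled values, a natural and consistent choice for $\hat{E}(\Delta \mid Y)$ is

    $$\hat{E}_f(\Delta \mid Y) = \sum_{\gamma \in S} E(\Delta \mid \gamma, Y)\, \hat{p}_f(\gamma \mid Y, S)$$

    where $\hat{p}_f(\gamma \mid Y, S)$ is the relative frequency of $\gamma$ in $S$. However, it appears that when $g(\gamma)$ is available, one can do better by using

    $$\hat{E}_g(\Delta \mid Y) = \sum_{\gamma \in S} E(\Delta \mid \gamma, Y)\, \hat{p}_g(\gamma \mid Y, S)$$

    where $\hat{p}_g(\gamma \mid Y, S) = g(\gamma)/g(S)$ is the renormalized value of $g(\gamma)$.

    For example, when $S$ is an iid sample from $p(\gamma \mid Y)$, $\hat{E}_g(\Delta \mid Y)$ approximates the best unbiased estimator of $E(\Delta \mid Y)$ as the sample size increases. To see this, note that when $S$ is an iid sample, $\hat{E}_f(\Delta \mid Y)$ is unbiased for $E(\Delta \mid Y)$. Since $S$ (together with $g$) is sufficient, the Rao-Blackwellized estimator $E(\hat{E}_f(\Delta \mid Y) \mid S)$ is best unbiased. But as the sample size increases, $E(\hat{E}_f(\Delta \mid Y) \mid S) \to \hat{E}_g(\Delta \mid Y)$.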


    6. Calibration and Empirical Bayes Variable Selection

    Let us now focus on the special case when the conjugate normal-inverse-gamma prior,

    $$p(\beta_\gamma \mid \sigma^2, \gamma) = N_{q_\gamma}(0,\ c\, \sigma^2 (X_\gamma' X_\gamma)^{-1}),$$

    is combined with

    $$p(\gamma) = w^{q_\gamma} (1 - w)^{p - q_\gamma},$$

    the simple independence prior; for the moment, let's assume $\sigma^2$ is known.

    The hyperparameter $c$ controls the expected size of the nonzero coefficients of $\beta = (\beta_1, \ldots, \beta_p)'$. The hyperparameter $w$ controls the expected proportion of such nonzero components.


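    Surprise! We will see that this prior setup is related to the canonical penalized sum-of-squares criterion

    $$C_F(\gamma) \equiv SS_\gamma / \hat{\sigma}^2 - F\, q_\gamma$$

    where $SS_\gamma = \hat{\beta}_\gamma' X_\gamma' X_\gamma \hat{\beta}_\gamma$, $\hat{\beta}_\gamma \equiv (X_\gamma' X_\gamma)^{-1} X_\gamma' Y$ and $F$ is a fixed penalty value for adding a variable.

    Popular model selection criteria simply entail maximizing $C_F(\gamma)$ with particular choices of $F$ and $\sigma^2 = \hat{\sigma}^2$.

    For orthogonal variables, $x_i$ is added if and only if $t_i^2 > F$.

    Some choices for $F$:

    $F = 0$ : Select full model

    $F = 2$ : $C_p$ and AIC

    $F = \log n$ : BIC

    $F = 2 \log p$ : RIC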
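For orthogonal predictors, $SS_\gamma/\hat{\sigma}^2$ is the sum of the $t_i^2$ over the included variables, so $C_F(\gamma) = \sum_{i \in \gamma} (t_i^2 - F)$ and maximizing it reduces to thresholding. The sketch below applies the listed penalties to a hypothetical vector of t-statistics and confirms the thresholding rule against a brute-force search over all subsets. The function names, `n`, and the t-values are assumptions.

```python
import itertools
import math

def select_by_threshold(t_stats, F):
    """Indices i with t_i^2 > F, the selection rule for orthogonal predictors."""
    return [i for i, t in enumerate(t_stats) if t * t > F]

def select_by_search(t_stats, F):
    """Maximize C_F(gamma) = sum over included i of (t_i^2 - F) by enumeration."""
    p = len(t_stats)
    subsets = (s for r in range(p + 1) for s in itertools.combinations(range(p), r))
    return list(max(subsets, key=lambda s: sum(t_stats[i] ** 2 - F for i in s)))

t_stats = [5.1, -0.4, 1.6, 2.3, -3.0, 0.9]       # hypothetical t-statistics
n, p = 100, len(t_stats)
penalties = {"full": 0.0, "AIC/Cp": 2.0, "BIC": math.log(n), "RIC": 2.0 * math.log(p)}
chosen = {name: select_by_threshold(t_stats, F) for name, F in penalties.items()}
for name, subset in chosen.items():
    print(name, round(penalties[name], 3), subset)
```

The enumeration in `select_by_search` is only feasible for small $p$; it is included purely as a check on the thresholding rule.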


    The relationship with $C_F(\gamma)$ is obtained by reexpressing the model posterior under the prior setup as

    $$p(\gamma \mid Y) \propto \exp\left[ \frac{c}{2(1 + c)} \left\{ SS_\gamma / \sigma^2 - F(c, w)\, q_\gamma \right\} \right],$$

    where

    $$F(c, w) = \frac{1 + c}{c} \left\{ 2 \log \frac{1 - w}{w} + \log(1 + c) \right\}.$$

    As a function of $\gamma$ for fixed $Y$, $p(\gamma \mid Y)$ is increasing in $C_F(\gamma)$ when $F = F(c, w)$. Thus, Bayesian model selection based on $p(\gamma \mid Y)$ is equivalent to model selection based on the criterion $C_{F(c,w)}(\gamma)$. For example, by appropriate choice of $c, w$, the mode of $p(\gamma \mid Y)$ can be made to correspond to the best $C_p$, AIC, BIC or RIC models.

    Since $c$ and $w$ control the expected size and proportion of the nonzero components of $\beta$, the dependence of $F(c, w)$ on $c$ and $w$ provides an implicit connection between the penalty $F$ and the profile of models for which its value may be appropriate.
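The penalty $F(c, w)$ is easy to tabulate. The sketch below also checks numerically, for one hypothetical configuration, that the penalized form agrees with the unnormalized log posterior $q_\gamma \log w + (p - q_\gamma)\log(1-w) - \tfrac{q_\gamma}{2}\log(1+c) + \tfrac{c\, SS_\gamma}{2\sigma^2(1+c)}$ up to the constant $p \log(1-w)$. The function names and all numeric values are assumptions made for illustration.

```python
import math

def penalty_F(c, w):
    """F(c, w) = ((1 + c) / c) * (2 log((1 - w) / w) + log(1 + c))."""
    return (1 + c) / c * (2 * math.log((1 - w) / w) + math.log(1 + c))

def log_post_direct(q, p, ss, sigma2, c, w):
    """Unnormalized log p(gamma | Y) from the prior and marginal likelihood terms."""
    return (q * math.log(w) + (p - q) * math.log(1 - w)
            - 0.5 * q * math.log(1 + c) + c * ss / (2 * sigma2 * (1 + c)))

def log_post_penalized(q, p, ss, sigma2, c, w):
    """The same quantity written through the criterion SS / sigma^2 - F(c, w) q."""
    crit = ss / sigma2 - penalty_F(c, w) * q
    return c / (2 * (1 + c)) * crit + p * math.log(1 - w)

# Hypothetical values, used only to compare the two expressions
q, p, ss, sigma2, c, w = 3, 10, 25.0, 1.5, 7.0, 0.2
gap = abs(log_post_direct(q, p, ss, sigma2, c, w)
          - log_post_penalized(q, p, ss, sigma2, c, w))
print(gap)
for c_val, w_val in [(10.0, 0.5), (100.0, 0.5), (100.0, 0.1)]:
    print(c_val, w_val, penalty_F(c_val, w_val))
```

Tabulating `penalty_F` over a grid of `(c, w)` pairs is one way to find settings whose implied penalty matches a fixed choice such as 2 or log n.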


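    The awful truth: $c$ and $w$ are unknown.

    Empirical Bayes Idea: Use $\hat c$ and $\hat w$ which maximize the marginal likelihood

    $$L(c, w \mid Y, \sigma) \propto \sum_\gamma p(\gamma \mid w)\, p(Y \mid \gamma, \sigma, c) \propto \sum_\gamma w^{q_\gamma} (1 - w)^{p - q_\gamma} (1 + c)^{-q_\gamma/2} \exp\left\{ \frac{c\, SS_\gamma}{2\sigma^2 (1 + c)} \right\}.$$

    For orthogonal $x$'s (and $\sigma$ known), this simplifies to

    $$L(c, w \mid Y, \sigma) \propto \prod_{i=1}^{p} \left[ (1 - w)\, e^{-t_i^2/2} + w\, (1 + c)^{-1/2}\, e^{-t_i^2 / 2(1 + c)} \right]$$

    where $t_i = b_i v_i / \sigma$ is the t-statistic associated with $x_i$.

    At least in the orthogonal case, $\hat c$ and $\hat w$ can be found numerically using Gauss-Seidel, the EM algorithm, etc.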
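    As one possible illustration of the EM route in the orthogonal case, note that each factor of the product form is proportional to a two-component normal mixture density, $(1 - w)\,N(0, 1) + w\,N(0, 1 + c)$, evaluated at $t_i$. The sketch below runs a textbook EM iteration for that mixture. It is a minimal sketch, not the authors' implementation; the function names, starting values, iteration count, and simulated data are all assumptions made for the example.

```python
import math
import random


def component_terms(ti, c, w):
    """The two summands of the i-th factor of the orthogonal-case likelihood."""
    null = (1 - w) * math.exp(-ti * ti / 2)
    alt = w * math.exp(-ti * ti / (2 * (1 + c))) / math.sqrt(1 + c)
    return null, alt


def log_marginal(t, c, w):
    """log L(c, w | Y, sigma), up to an additive constant."""
    return sum(math.log(sum(component_terms(ti, c, w))) for ti in t)


def em_estimate(t, c=1.0, w=0.5, iters=300):
    """EM for the mixture (1 - w) N(0, 1) + w N(0, 1 + c); returns (c_hat, w_hat).

    Assumes moderate |t_i| so the exponentials do not underflow to zero.
    """
    for _ in range(iters):
        # E-step: posterior probability that coefficient i is nonzero.
        resp = []
        for ti in t:
            null, alt = component_terms(ti, c, w)
            resp.append(alt / (null + alt))
        # M-step: update the mixing weight and the variance 1 + c (kept >= 1).
        total = sum(resp)
        w = min(max(total / len(t), 1e-8), 1 - 1e-8)
        second_moment = sum(r * ti * ti for r, ti in zip(resp, t)) / total
        c = max(second_moment - 1.0, 1e-8)
    return c, w


# Simulated t-statistics: 200 of 1000 coefficients drawn from N(0, 25), sigma = 1.
random.seed(0)
p, q, c_true = 1000, 200, 25.0
t = [(random.gauss(0, math.sqrt(c_true)) if i < q else 0.0) + random.gauss(0, 1)
     for i in range(p)]
c_hat, w_hat = em_estimate(t)
print("c_hat =", round(c_hat, 2), "w_hat =", round(w_hat, 3))
```

    The estimates can then be plugged into the adaptive penalty $F(\hat c, \hat w)$.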


    The best marginal maximum likelihood model is then the one which maximizes the posterior $p(\gamma \mid Y, \hat c, \hat w, \sigma)$ or equivalently

    $$C_{MML} \equiv C_{F(\hat c, \hat w)}.$$

    In contrast to criteria of the form $C_F(\gamma)$ with prespecified fixed $F$, $C_{MML}$ uses an adaptive penalty $F(\hat c, \hat w)$ that is implicitly based on the estimated distribution of the regression coefficients.

    Estimating $\beta$ after selecting $\hat\gamma_{MML}$ might then proceed using

    $$E(\beta_\gamma \mid Y, \hat c, \hat w, \sigma, \hat\gamma_{MML}) = \frac{\hat c}{1 + \hat c}\, \hat\beta_{\hat\gamma_{MML}}.$$

    A computable conditional maximum likelihood approximation $C_{CML}$ for the nonorthogonal case is available.


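    Consider the simple model with $X = I$,

    $$Y = \beta + \epsilon \quad \text{where } \epsilon \sim N_n(0, I)$$

    where $\beta = (\beta_1, \ldots, \beta_p)'$ is such that

    $$\beta_1, \ldots, \beta_q \ \text{iid} \sim N(0, c), \qquad \beta_{q+1}, \ldots, \beta_p \equiv 0.$$

    For $p = n = 1000$, and fixed values of $c$ and $q$, simulate $Y$ from the above model.

    Evaluate $\hat\gamma$ by estimating

    $$R(\beta, \hat\gamma) \equiv E_{c,q} \sum_i \left( Y_i\, I[x_i \in \hat\gamma] - \beta_i \right)^2.$$

    Figures 1ab and 2 illustrate the adaptive advantages of the empirical Bayes selection criteria.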
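    A simulation of this kind can be sketched for the fixed-penalty criteria. With $X = I$ and $\sigma = 1$ the t-statistic is $t_i = Y_i$, so a criterion with penalty $F$ selects $x_i$ exactly when $Y_i^2 > F$. The code below is an illustrative sketch, not the study behind the figures; the chosen values of $q$ and $c$, the number of replications, and the function names are assumptions.

```python
import math
import random


def simulate(p, q, c, rng):
    """Draw beta (first q entries N(0, c), the rest 0) and Y = beta + N(0, I) noise."""
    beta = [rng.gauss(0.0, math.sqrt(c)) if i < q else 0.0 for i in range(p)]
    y = [b + rng.gauss(0.0, 1.0) for b in beta]
    return beta, y


def selection_loss(beta, y, F):
    """Loss sum_i (Y_i I[Y_i^2 > F] - beta_i)^2 of the fixed-penalty rule."""
    return sum(((yi if yi * yi > F else 0.0) - bi) ** 2 for bi, yi in zip(beta, y))


def average_loss(p, q, c, F, reps, rng):
    """Monte Carlo estimate of the risk R(beta, gamma_hat) for penalty F."""
    return sum(selection_loss(*simulate(p, q, c, rng), F) for _ in range(reps)) / reps


rng = random.Random(1)
p = 1000  # p = n = 1000, as in the text
penalties = {"AIC/Cp": 2.0, "BIC": math.log(p), "RIC": 2 * math.log(p)}
for q in (0, 100, 500):  # illustrative numbers of nonzero coefficients
    row = {name: round(average_loss(p, q, 25.0, F, 10, rng)) for name, F in penalties.items()}
    print("q =", q, row)
```

    No fixed penalty is expected to do well across all values of $q$, which is the motivation for the adaptive penalty $F(\hat c, \hat w)$.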


    [Figures 1a, 1b and 2 (plots not reproduced): loss of the selection criteria MML, CML, AIC/Cp, BIC, RIC, CBIC and MRIC. Vertical axis: Loss.]


    References For Getting Started

    Chipman, H., George, E.I. and McCulloch, R.E. (2001). The Practical Implementation of Bayesian Model Selection (with discussion). In Model Selection (P. Lahiri, ed.), IMS Lecture Notes Monograph Series, Volume 38, 65-134.

    George, E.I. and Foster, D.P. (2000). Calibration and empirical Bayes variable selection. Biometrika 87, 731-748.


    1. Estimating a Normal Mean: A Brief History

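    Observe $X \mid \mu \sim N_p(\mu, I)$ and estimate $\mu$ by $\hat\mu$ under

    $$R_Q(\mu, \hat\mu) = E_\mu \|\hat\mu(X) - \mu\|^2.$$

    $\hat\mu_{MLE}(X) = X$ is the MLE, best invariant and minimax with constant risk.

    Shocking Fact: $\hat\mu_{MLE}$ is inadmissible when $p \ge 3$. (Stein 1956)

    Bayes rules are a good place to look for improvements.

    For a prior $\pi(\mu)$, the Bayes rule $\hat\mu_\pi(X) = E_\pi(\mu \mid X)$ minimizes $E_\pi R_Q(\mu, \hat\mu)$.

    Remark: The (formal) Bayes rule under $\pi_U(\mu) \equiv 1$ is $\hat\mu_U(X) \equiv \hat\mu_{MLE}(X) = X$.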


    The Risk Functions of Two Minimax Estimators


    $\hat\mu_H(X)$, the Bayes rule under the Harmonic prior

    $$\pi_H(\mu) = \|\mu\|^{-(p-2)},$$

    dominates $\hat\mu_U$ when $p \ge 3$. (Stein 1974)

    $\hat\mu_a(X)$, the Bayes rule under $\pi_a(\mu)$ where

    $$\mu \mid s \sim N_p(0, s I), \qquad s \sim (1 + s)^{a-2},$$

    dominates $\hat\mu_U$ and is proper Bayes when $p = 5$ and $a \in [.5, 1)$ or when $p \ge 6$ and $a \in [0, 1)$. (Strawderman 1971)

    A Unifying Phenomenon: These domination results can be attributed to properties of the marginal distribution of $X$ under $\pi_H$ and $\pi_a$.


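    The Bayes rule under $\pi(\mu)$ can be expressed as

    $$\hat\mu_\pi(X) = E_\pi(\mu \mid X) = X + \nabla \log m_\pi(X)$$

    where

    $$m_\pi(X) \propto \int e^{-\|X - \mu\|^2/2}\, \pi(\mu)\, d\mu$$

    is the marginal of $X$ under $\pi(\mu)$, and $\nabla = \left( \frac{\partial}{\partial x_1}, \ldots, \frac{\partial}{\partial x_p} \right)'$. (Brown 1971)

    The risk improvement of $\hat\mu_\pi(X)$ over $\hat\mu_U(X)$ can be expressed as

    $$R_Q(\mu, \hat\mu_U) - R_Q(\mu, \hat\mu_\pi) = E_\mu \left[ \|\nabla \log m_\pi(X)\|^2 - 2\, \frac{\nabla^2 m_\pi(X)}{m_\pi(X)} \right] = E_\mu \left[ -4\, \frac{\nabla^2 \sqrt{m_\pi(X)}}{\sqrt{m_\pi(X)}} \right]$$

    where $\nabla^2 = \sum_i \frac{\partial^2}{\partial x_i^2}$. (Stein 1974, 1981)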
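    As a worked check of both expressions, consider the conjugate prior $\mu \sim N_p(0, sI)$ with $s > 0$ fixed. This prior is an illustrative assumption (it is the conditional prior inside $\pi_a$), chosen because everything is available in closed form:

```latex
\begin{align*}
m_\pi(X) &\propto \exp\left\{ -\frac{\|X\|^2}{2(1+s)} \right\}, \\
\nabla \log m_\pi(X) &= -\frac{X}{1+s}, \\
\hat\mu_\pi(X) &= X + \nabla \log m_\pi(X) = \frac{s}{1+s}\, X, \\
\|\nabla \log m_\pi(X)\|^2 - 2\, \frac{\nabla^2 m_\pi(X)}{m_\pi(X)}
  &= \frac{2p}{1+s} - \frac{\|X\|^2}{(1+s)^2}, \\
R_Q(\mu, \hat\mu_U) - R_Q(\mu, \hat\mu_\pi)
  &= \frac{2p}{1+s} - \frac{p + \|\mu\|^2}{(1+s)^2}.
\end{align*}
```

    The last line uses $E_\mu \|X\|^2 = p + \|\mu\|^2$ and agrees with the direct calculation $p - \{ s^2 p + \|\mu\|^2 \}/(1+s)^2$ for the linear shrinkage rule. The difference becomes negative for large $\|\mu\|$, so this fixed-$s$ rule does not dominate $\hat\mu_U$.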


    That $\hat\mu_H(X)$ dominates $\hat\mu_U$ when $p \ge 3$ follows from the fact that the marginal $m_\pi(X)$ under $\pi_H$ is superharmonic, i.e.

    $$\nabla^2 m_\pi(X) \le 0.$$

    That $\hat\mu_a(X)$ dominates $\hat\mu_U$ when $p \ge 5$ (and conditions on $a$) follows from the fact that the square root of the marginal under $\pi_a$ is superharmonic, i.e.

    $$\nabla^2 \sqrt{m_\pi(X)} \le 0.$$

    (Fourdrinier, Strawderman and Wells 1998)


    2. The Prediction Problem

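    Observe $X \mid \mu \sim N_p(\mu, v_x I)$ and predict $Y \mid \mu \sim N_p(\mu, v_y I)$.

    Conditionally on $\mu$, $Y$ is independent of $X$.

    $v_x$ and $v_y$ are known (for now).

    The Problem: To estimate $p(y \mid \mu)$ by $q(y \mid x)$.

    Measure closeness by Kullback-Leibler loss,

    $$L(\mu, q(y \mid x)) = \int p(y \mid \mu) \log \frac{p(y \mid \mu)}{q(y \mid x)}\, dy.$$

    Risk function

    $$R_{KL}(\mu, q) = \int L(\mu, q(y \mid x))\, p(x \mid \mu)\, dx = E_\mu [L(\mu, q(y \mid X))].$$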


    3. Bayes Rules for the Prediction Problem

    For a prior $\pi(\mu)$, the Bayes rule

    $$p_\pi(y \mid x) = \int p(y \mid \mu)\, \pi(\mu \mid x)\, d\mu = E_\pi [p(y \mid \mu) \mid X]$$

    minimizes $\int R_{KL}(\mu, q)\, \pi(\mu)\, d\mu$. (Aitchison 1975)

    Let $p_U(y \mid x)$ denote the Bayes rule under $\pi_U(\mu) \equiv 1$.

    $p_U(y \mid x)$ dominates $p(y \mid \hat\mu = x)$, the naive "plug-in" predictive distribution. (Aitchison 1975)

    $p_U(y \mid x)$ is best invariant and minimax with constant risk. (Murray 1977, Ng 1980, Barron and Liang 2003)

    Shocking Fact: $p_U(y \mid x)$ is inadmissible when $p \ge 3$.
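    The gap between the plug-in rule and $p_U$ can be checked numerically. The sketch below relies on the standard fact, not stated on the slide, that the flat-prior predictive density is $p_U(y \mid x) = N_p(x, (v_x + v_y) I)$, and on the closed-form KL divergence between spherical normals. Under those assumptions the plug-in risk is $p\, v_x / (2 v_y)$ and the risk of $p_U$ is $(p/2) \log(1 + v_x / v_y)$, both constant in $\mu$. Function names and the Monte Carlo settings are illustrative.

```python
import math
import random


def kl_normal(mu, center, v_true, v_pred):
    """KL( N_p(mu, v_true I) || N_p(center, v_pred I) ) in closed form."""
    p = len(mu)
    sq = sum((m - c) ** 2 for m, c in zip(mu, center))
    return 0.5 * (p * (v_true / v_pred - 1 + math.log(v_pred / v_true)) + sq / v_pred)


def mc_risks(mu, v_x, v_y, reps, rng):
    """Monte Carlo KL risks of the plug-in rule N(x, v_y I) and of p_U = N(x, (v_x + v_y) I)."""
    plug = flat = 0.0
    for _ in range(reps):
        x = [m + rng.gauss(0.0, math.sqrt(v_x)) for m in mu]
        plug += kl_normal(mu, x, v_y, v_y)
        flat += kl_normal(mu, x, v_y, v_x + v_y)
    return plug / reps, flat / reps


p, v_x, v_y = 3, 1.0, 0.2
risk_plugin_exact = p * v_x / (2 * v_y)
risk_pu_exact = 0.5 * p * math.log(1 + v_x / v_y)
risk_plugin_mc, risk_pu_mc = mc_risks([1.0] * p, v_x, v_y, 20000, random.Random(0))
print("plug-in:", risk_plugin_exact, "Monte Carlo:", round(risk_plugin_mc, 3))
print("p_U:    ", round(risk_pu_exact, 4), "Monte Carlo:", round(risk_pu_mc, 3))
```

    Since $t > \log(1 + t)$ for every $t > 0$, the plug-in risk exceeds the risk of $p_U$ for all $v_x, v_y > 0$, which is consistent with the domination result above.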


    $p_H(y \mid x)$, the Bayes rule under the Harmonic prior

    $$\pi_H(\mu) = \|\mu\|^{-(p-2)},$$

    dominates $p_U(y \mid x)$ when $p \ge 3$. (Komaki 2001)

    $p_a(y \mid x)$, the Bayes rule under $\pi_a(\mu)$ where

    $$\mu \mid s \sim N_p(0, s\, v_0 I), \qquad s \sim (1 + s)^{a-2},$$

    dominates $p_U(y \mid x)$ and is proper Bayes when $v_x \le v_0$ and when $p = 5$ and $a \in [.5, 1)$ or when $p \ge 6$ and $a \in [0, 1)$. (Liang 2002)

    Main Question: Are these domination results attributable to the properties of $m_\pi$?


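    4. A Key Representation for $p_\pi(y \mid x)$

    Let $m_\pi(x; v_x)$ denote the marginal of $X \mid \mu \sim N_p(\mu, v_x I)$ under $\pi(\mu)$.

    Lemma: The Bayes rule $p_\pi(y \mid x)$ can be expressed as

    $$p_\pi(y \mid x) = \frac{m_\pi(w; v_w)}{m_\pi(x; v_x)}\, p_U(y \mid x)$$

    where

    $$W = \frac{v_y X + v_x Y}{v_x + v_y} \sim N_p(\mu, v_w I), \qquad v_w = \frac{v_x v_y}{v_x + v_y}.$$

    Using this, the risk improvement can be expressed as

    $$R_{KL}(\mu, p_U) - R_{KL}(\mu, p_\pi) = \int\!\!\int p_{v_x}(x \mid \mu)\, p_{v_y}(y \mid \mu) \log \frac{p_\pi(y \mid x)}{p_U(y \mid x)}\, dx\, dy = E_{\mu, v_w} \log m_\pi(W; v_w) - E_{\mu, v_x} \log m_\pi(X; v_x).$$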


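    5. An Analogue of Stein's Unbiased Estimate of Risk

    Theorem:

    $$\frac{\partial}{\partial v} E_{\mu, v} \log m_\pi(Z; v) = E_{\mu, v} \left[ \frac{\nabla^2 m_\pi(Z; v)}{m_\pi(Z; v)} - \frac{1}{2} \|\nabla \log m_\pi(Z; v)\|^2 \right] = E_{\mu, v} \left[ 2\, \frac{\nabla^2 \sqrt{m_\pi(Z; v)}}{\sqrt{m_\pi(Z; v)}} \right].$$

    The proof relies on the heat equation

    $$\frac{\partial}{\partial v} m_\pi(z; v) = \frac{1}{2} \nabla^2 m_\pi(z; v).$$

    Remark: This shows that the risk improvement in the quadratic risk estimation problem can be expressed in terms of $\log m_\pi$ as

    $$R_Q(\mu, \hat\mu_U) - R_Q(\mu, \hat\mu_\pi) = -2 \left[ \frac{\partial}{\partial v} E_{\mu, v} \log m_\pi(Z; v) \right]_{v=1}.$$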
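    The heat equation can be verified directly in a simple case. Assume, purely for illustration, the conjugate prior $\mu \sim N_p(0, sI)$ with $s > 0$ fixed, so that the marginal of $Z$ is $N_p(0, (s + v) I)$:

```latex
\begin{align*}
m_\pi(z; v) &= \{ 2\pi (s + v) \}^{-p/2} \exp\left\{ -\frac{\|z\|^2}{2(s + v)} \right\}, \\
\frac{\partial}{\partial v} \log m_\pi(z; v)
  &= -\frac{p}{2(s + v)} + \frac{\|z\|^2}{2(s + v)^2}, \\
\frac{\nabla^2 m_\pi(z; v)}{m_\pi(z; v)}
  &= \nabla^2 \log m_\pi(z; v) + \|\nabla \log m_\pi(z; v)\|^2
   = -\frac{p}{s + v} + \frac{\|z\|^2}{(s + v)^2}.
\end{align*}
```

    The second line is exactly half of the third, so $\frac{\partial}{\partial v} m_\pi = \frac{1}{2} \nabla^2 m_\pi$ holds for this marginal.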


    6. General Conditions for Minimax Prediction

    Let $m_\pi(z; v)$ be the marginal distribution of $Z \mid \mu \sim N_p(\mu, vI)$ under $\pi(\mu)$.

    Theorem: If $m_\pi(z; v)$ is finite for all $z$, then $p_\pi(y \mid x)$ will be minimax if either of the following hold:

    (i) $\sqrt{m_\pi(z; v)}$ is superharmonic
    (ii) $m_\pi(z; v)$ is superharmonic

    Corollary: If $m_\pi(z; v)$ is finite for all $z$, then $p_\pi(y \mid x)$ will be minimax if $\pi(\mu)$ is superharmonic.

    $p_\pi(y \mid x)$ will dominate $p_U(y \mid x)$ in the above results if the superharmonicity is strict on some interval.
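    Superharmonicity of a candidate prior can be probed numerically away from the origin. The sketch below approximates the Laplacian by central finite differences and applies it to power priors $\|\mu\|^{-k}$, for which the exact Laplacian is $k(k + 2 - p)\|\mu\|^{-k-2}$; the Harmonic prior is the case $k = p - 2$. The helper names, step size, and test points are assumptions for the example, and a pointwise check of this kind is only a sanity check, not a proof.

```python
import math


def laplacian(f, z, h=1e-3):
    """Central finite-difference approximation to sum_i d^2 f / dz_i^2 at z."""
    f0 = f(z)
    total = 0.0
    for i in range(len(z)):
        up = list(z)
        dn = list(z)
        up[i] += h
        dn[i] -= h
        total += (f(up) - 2 * f0 + f(dn)) / (h * h)
    return total


def power_prior(k):
    """Return the function mu -> ||mu||^(-k)."""
    return lambda mu: math.sqrt(sum(m * m for m in mu)) ** (-k)


# Harmonic prior in p = 3 (k = p - 2 = 1): Laplacian is 0 away from the origin.
lap_harmonic_p3 = laplacian(power_prior(1), [1.0, 2.0, 2.0])

# Same power k = 1 in p = 5: exact Laplacian is k (k + 2 - p) r^(-k-2) = -2 / 27 at r = 3.
lap_k1_p5 = laplacian(power_prior(1), [1.0, 2.0, 2.0, 0.0, 0.0])

print("p = 3, k = 1:", lap_harmonic_p3)
print("p = 5, k = 1:", lap_k1_p5)
```

    A nonpositive value is consistent with superharmonicity at that point; a strictly negative value, as in the second case, indicates strict superharmonicity there.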


    7. Sufficient Conditions for Admissibility

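    Theorem (Blyth's Method): If there is a sequence of finite non-negative measures satisfying $\pi_n(\{\mu : \|\mu\| \le 1\}) \ge 1$ such that

    $$E_{\pi_n}[R_{KL}(\mu, q)] - E_{\pi_n}[R_{KL}(\mu, p_{\pi_n})] \to 0$$

    then $q(y \mid x)$ is admissible.

    Theorem: For any two Bayes rules $p_\pi$ and $p_{\pi_n}$,

    $$E_{\pi_n}[R_{KL}(\mu, p_\pi)] - E_{\pi_n}[R_{KL}(\mu, p_{\pi_n})] = \frac{1}{2} \int_{v_w}^{v_x} \int \frac{\|\nabla h_n(z; v)\|^2}{h_n(z; v)}\, m_\pi(z; v)\, dz\, dv$$

    where $h_n(z; v) = m_{\pi_n}(z; v) / m_\pi(z; v)$.

    Using the explicit construction of $\pi_n(\mu)$ from Brown and Hwang (1984), we obtain tail behavior conditions that prove admissibility of $p_U(y \mid x)$ when $p \le 2$, and admissibility of $p_H(y \mid x)$ when $p \ge 3$.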


    8. Minimax Shrinkage Towards 0

    Because $\pi_H$ and $\sqrt{m_a}$ are superharmonic under suitable conditions, the result that $p_H(y \mid x)$ and $p_a(y \mid x)$ dominate $p_U(y \mid x)$ and are minimax follows immediately from the Theorem.

    By the Theorem, any of the improper superharmonic t-priors of Faith (1978) or any of the proper generalized t-priors of Fourdrinier, Strawderman and Wells (1998) yield Bayes rules that dominate $p_U(y \mid x)$ and are minimax.

    The risk functions $R_{KL}(\mu, p_H)$ and $R_{KL}(\mu, p_a)$ take on their minima at $\mu = 0$, and then asymptote up to $R_{KL}(\mu, p_U)$ as $\|\mu\| \to \infty$.


    Figure 1a displays the difference between the risk functions

    $$[R_{KL}(\mu, p_U) - R_{KL}(\mu, p_H)]$$

    at $\mu = (c, \ldots, c)$, $0 \le c \le 4$, when $v_x = 1$ and $v_y = 0.2$ for dimensions $p = 3, 5, 7, 9$.

    Figure 1b displays the difference between the risk functions

    $$[R_{KL}(\mu, p_U) - R_{KL}(\mu, p_a)]$$

    at $\mu = (c, \ldots, c)$, $0 \le c \le 4$, when $a = 0.5$, $v_x = 1$ and $v_y = 0.2$ for dimensions $p = 3, 5, 7, 9$.


    Figure 1a. The risk difference between $p_U$ and $p_H$: $R_{KL}(\mu, p_U) - R_{KL}(\mu, p_H)$. Here $\mu = (c, \ldots, c)$, $v_x = 1$, $v_y = 0.2$.

    Figure 1b. The risk difference between $p_U$ and $p_a$ with $a = 0.5$: $R_{KL}(\mu, p_U) - R_{KL}(\mu, p_a)$. Here $\mu = (c, \ldots, c)$, $v_x = 1$, $v_y = 0.2$.

    Our Lemma representation

    $$p_H(y \mid x) = \frac{m_H(w; v_w)}{m_H(x; v_x)}\, p_U(y \mid x)$$

    shows how $p_H(y \mid x)$ shrinks $p_U(y \mid x)$ towards 0 by an adaptive multiplicative factor of the form

    $$b_H(x, y) = \frac{m_H(w; v_w)}{m_H(x; v_x)}.$$

    Figure 2 illustrates how this shrinkage occurs for various values of $x$ when $p = 5$.


    Figure 2. Shrinkage of $p_U(y \mid x)$ to obtain $p_H(y \mid x)$ when $p = 5$. Here $y = (y_1, y_2, \ldots)$. Panels: $x = (2, 0, 0, 0, 0)$, $x = (3, 0, 0, 0, 0)$, $x = (4, 0, 0, 0, 0)$.


    9. Shrinkage Towards Points or Subspaces

    We can trivially modify the previous priors and predictive distributions to shrink towards an arbitrary point $b \in R^p$.

    Consider the recentered prior

    $$\pi^b(\mu) = \pi(\mu - b)$$

    and corresponding recentered marginal

    $$m_\pi^b(z; v) = m_\pi(z - b; v).$$

    This yields a predictive distribution

    $$p_\pi^b(y \mid x) = \frac{m_\pi^b(w; v_w)}{m_\pi^b(x; v_x)}\, p_U(y \mid x)$$

    that now shrinks $p_U(y \mid x)$ towards $b$ rather than 0.


    More generally, we can shrink $p_U(y \mid x)$ towards any subspace $B$ of $R^p$ whenever $\pi$, and hence $m_\pi$, is spherically symmetric.

    Letting $P_B z$ be the projection of $z$ onto $B$, shrinkage towards $B$ is obtained by using the recentered prior

    $$\pi^B(\mu) = \pi(\mu - P_B \mu)$$

    which yields the recentered marginal

    $$m_\pi^B(z; v) := m_\pi(z - P_B z; v).$$

    This modification yields a predictive distribution

    $$p_\pi^B(y \mid x) = \frac{m_\pi^B(w; v_w)}{m_\pi^B(x; v_x)}\, p_U(y \mid x)$$

    that now shrinks $p_U(y \mid x)$ towards $B$.

    If $m_\pi^B(z; v)$ satisfies any of the conditions of the Theorem, then $p_\pi^B(y \mid x)$ will dominate $p_U(y \mid x)$ and be minimax.


    10. Minimax Multiple Shrinkage Prediction

    For any spherically symmetric prior, a set of subspaces $B_1, \ldots, B_N$, and corresponding probabilities $w_1, \ldots, w_N$, consider the recentered mixture prior

    $$\pi_*(\mu) = \sum_{i=1}^{N} w_i\, \pi^{B_i}(\mu),$$

    and corresponding recentered mixture marginal

    $$m_*(z; v) = \sum_{i=1}^{N} w_i\, m_\pi^{B_i}(z; v).$$

    Applying the $\hat\mu_\pi(X) = X + \nabla \log m_\pi(X)$ construction with $m_*(X; v)$ yields minimax multiple shrinkage estimators of $\mu$. (George 1986)


    Applying the predictive construction with $m_*(z; v)$ yields

    $$p_*(y \mid x) = \sum_{i=1}^{N} p(B_i \mid x)\, p^{B_i}(y \mid x)$$

    where $p^{B_i}(y \mid x)$ is a single target predictive distribution and

    $$p(B_i \mid x) = \frac{w_i\, m_\pi^{B_i}(x; v_x)}{\sum_{i=1}^{N} w_i\, m_\pi^{B_i}(x; v_x)}$$

    is the posterior weight on the $i$th prior component.

    Theorem: If each $m_\pi^{B_i}(z; v)$ is superharmonic, then $p_*(y \mid x)$ will dominate $p_U(y \mid x)$ and will be minimax.
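    The posterior weights $p(B_i \mid x)$ are simple to compute once the recentered marginals are available. The sketch below does this for point targets $b_i$, where the recentered marginal is $m_\pi(x - b_i; v_x)$. Because the Harmonic marginal has no elementary closed form, the code uses the marginal under a $N_p(0, sI)$ prior as a stand-in; this stand-in is an assumption made only to keep the example self-contained, and it is not claimed to satisfy the superharmonicity condition of the Theorem. All function names are illustrative.

```python
import math


def log_normal_marginal(z, v, s=1.0):
    """Log marginal density of Z ~ N_p(mu, v I) when mu ~ N_p(0, s I) (stand-in for m_pi)."""
    tau = s + v
    return -0.5 * len(z) * math.log(2 * math.pi * tau) - sum(zi * zi for zi in z) / (2 * tau)


def posterior_weights(x, targets, weights, v_x, log_m=log_normal_marginal):
    """Posterior weights w_i m(x - b_i; v_x) / sum_j w_j m(x - b_j; v_x)."""
    logs = [math.log(w) + log_m([xi - bi for xi, bi in zip(x, b)], v_x)
            for b, w in zip(targets, weights)]
    top = max(logs)  # subtract the maximum before exponentiating, for stability
    unnorm = [math.exp(value - top) for value in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Two point targets with equal prior weights, in p = 3 dimensions.
b1, b2 = [2.0, 2.0, 2.0], [-2.0, -2.0, -2.0]
near_b1 = posterior_weights([1.5, 1.8, 2.2], [b1, b2], [0.5, 0.5], v_x=1.0)
midway = posterior_weights([0.0, 0.0, 0.0], [b1, b2], [0.5, 0.5], v_x=1.0)
print("x near b1:", [round(w, 4) for w in near_b1])
print("x midway: ", midway)
```

    The weights adapt to the data: an observation close to one target puts nearly all posterior weight on that target, while an observation equidistant from both splits the weight evenly.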


    Figure 3 illustrates the risk reduction

    $$[R_{KL}(\mu, p_U) - R_{KL}(\mu, p_{H^*})]$$

    for $\mu = (c, \ldots, c)$ obtained by $p_{H^*}$, which adaptively shrinks $p_U(y \mid x)$ towards the closer of the two points $b_1 = (2, \ldots, 2)$ and $b_2 = (-2, \ldots, -2)$ using equal weights $w_1 = w_2 = 0.5$.


    Figure 3. The risk difference between $p_U$ and multiple shrinkage $p_{H^*}$: $R_{KL}(\mu, p_U) - R_{KL}(\mu, p_{H^*})$. Here $\mu = (c, \ldots, c)$, $v_x = 1$, $v_y = 0.2$, $a_1 = 2$, $a_2 = -2$, $w_1 = w_2 = 0.5$.

    11. The Case of Unknown Variance

    If $v_x$ and $v_y$ are unknown, suppose there exists an available independent estimate of $v_x$ of the form $s/k$ where $S \sim v_x \chi^2_k$. Also assume that $v_y = r\, v_x$ for a known constant $r$.

    Substitute the estimates $\hat v_x = s/k$, $\hat v_y = r s/k$ and $\hat v_w = \frac{r}{r+1}\, s/k$ for $v_x$, $v_y$ and $v_w$ respectively.

    The predictor

    $$\hat p_\pi(y \mid x) = \frac{m_\pi(w; \hat v_w)}{m_\pi(x; \hat v_x)}\, p_U(y \mid x)$$

    will still dominate $p_U(y \mid x)$ if any of the conditions of the Theorem are satisfied.

    Note however, $p_U(y \mid x)$ is no longer best invariant or minimax.


    References For Getting Started

    Brown, L.D., George, E.I. and Xu, X. (2005). Admissible Predictive Estimation. Working paper.

    George, E.I., Liang, F. and Xu, X. (2005). Improved Minimax Predictive Densities under Kullback-Leibler Loss. Annals of Statistics, to appear.