Gentle Introduction to MCMC


Lecture I

A Gentle Introduction to Markov Chain Monte Carlo (MCMC)

Ed George, University of Pennsylvania

Séminaire de Printemps, Villars-sur-Ollon, Switzerland

March 2005

1. MCMC: A New Approach to Simulation

Consider the general problem of trying to calculate characteristics of a complicated multivariate probability distribution $f(x)$ on $x = (x_1, \ldots, x_p)$.

For example, suppose we want to calculate the mean of $x_1$,

$$\int\!\!\int x_1 f(x_1, x_2)\, dx_1\, dx_2$$

where

$$f(x_1, x_2) \propto (1 + x_1^2)^{-1} x_2^{-n} \exp\left\{ -\frac{1}{2 x_2^2} \sum_i (y_i - x_1)^2 - x_2 \right\}$$

($y_1, \ldots, y_n$ are fixed constants). Bad news: This calculation is analytically intractable.

A Monte Carlo approach: Simulate $k$ observations $x^{(1)}, \ldots, x^{(k)}$ from $f(x)$ and use this sample to estimate the characteristics of interest. (Careful: Each $x^{(j)} = (x_1^{(j)}, \ldots, x_p^{(j)})$ is a multivariate observation.) For example, we could estimate the mean of $x_1$ by

$$\bar{x}_1 = \frac{1}{k} \sum_j x_1^{(j)}.$$

If $x^{(1)}, \ldots, x^{(k)}$ were independent observations (i.e. an iid sample), we could use standard central limit theorem results to draw inference about the quality of our estimate.

Bad news: In many problems, methods are unavailable for direct simulation of an iid sample from $f(x)$.

Good news: In many problems, methods such as the Gibbs sampler and the Metropolis-Hastings algorithms can be used to simulate a Markov chain $x^{(1)}, \ldots, x^{(k)}$ which is converging in distribution to $f(x)$ (i.e. as $k$ increases, the distribution of $x^{(k)}$ gets closer and closer to $f(x)$).

Recall that a Markov chain $x^{(1)}, \ldots, x^{(k)}$ is a sequence such that for each $j \ge 1$, $x^{(j+1)}$ is sampled from a distribution $p(x \mid x^{(j)})$ which depends on $x^{(j)}$ (but not on $x^{(1)}, \ldots, x^{(j-1)}$).

The function $p(x \mid x^{(j)})$ is called a Markov transition kernel. If $p(x \mid x^{(j)})$ is time-homogeneous (i.e. $p(x \mid x^{(j)})$ does not depend on $j$) and the transition kernel satisfies

$$\int p(x \mid x') f(x')\, dx' = f(x),$$

then the chain will converge to $f(x)$ if it converges at all.

Simulation of a Markov chain requires a starting value $x^{(0)}$. If the chain is converging to $f(x)$, then the dependence between $x^{(j)}$ and $x^{(0)}$ diminishes as $j$ increases. After a suitable burn-in period of $l$ iterations, $x^{(l)}, \ldots, x^{(k)}$ behaves like a dependent sample from $f(x)$.

Such behavior is illustrated by Figure 1.1 on page 6 of Gilks, Richardson & Spiegelhalter (1995).

The output from such simulated chains can be used to estimate the characteristics of $f(x)$. For example, one can obtain approximate iid samples of size $m$ by taking the final $x^{(k)}$ values from $m$ separate chains.

It is probably more efficient, however, to use all the simulated values. For example, $\bar{x}_1 = \frac{1}{k} \sum_j x_1^{(j)}$ will still converge to the mean of $x_1$.

MCMC is the general procedure of simulating such Markov chains and using them to draw inference about the characteristics of $f(x)$.

Methods which have ignited MCMC are the Gibbs sampler and the more general Metropolis-Hastings algorithms. As we will now see, these are simply prescriptions for constructing a Markov transition kernel $p(x \mid x')$ which generates a Markov chain $x^{(1)}, \ldots, x^{(k)}$ converging to $f(x)$.

2. The Gibbs Sampler (GS)

The GS is an algorithm for simulating a Markov chain $x^{(1)}, \ldots, x^{(k)}$ which is converging to $f(x)$, by successively sampling from the full conditional component distributions $f(x_i \mid x_{-i})$, $i = 1, \ldots, p$, where $x_{-i}$ denotes the components of $x$ other than $x_i$.

For simplicity, consider the case where $p = 2$. The GS generates a Markov chain

$$(x_1^{(1)}, x_2^{(1)}), (x_1^{(2)}, x_2^{(2)}), \ldots, (x_1^{(k)}, x_2^{(k)})$$

converging to $f(x_1, x_2)$, by successively sampling

$x_1^{(1)}$ from $f(x_1 \mid x_2^{(0)})$
$x_2^{(1)}$ from $f(x_2 \mid x_1^{(1)})$
$x_1^{(2)}$ from $f(x_1 \mid x_2^{(1)})$
...
$x_1^{(k)}$ from $f(x_1 \mid x_2^{(k-1)})$
$x_2^{(k)}$ from $f(x_2 \mid x_1^{(k)})$

(To get started, prespecify an initial value for $x_2^{(0)}$.)

For example, suppose

$$f(x_1, x_2) \propto \binom{n}{x_1} x_2^{x_1 + \alpha - 1} (1 - x_2)^{n - x_1 + \beta - 1}, \quad x_1 = 0, 1, \ldots, n, \quad 0 \le x_2 \le 1.$$

The GS proceeds by successively sampling from

$$f(x_1 \mid x_2) = \text{Binomial}(n, x_2)$$
$$f(x_2 \mid x_1) = \text{Beta}(x_1 + \alpha, n - x_1 + \beta)$$

To illustrate the GS for the above, Figure 1 of Casella & George (1992) presents a histogram of a sample of $m = 500$ final values of $x_1$ from separate GS runs of length $k = 10$ when $n = 16$, $\alpha = 2$ and $\beta = 4$. This is compared with an iid sample from the actual distribution $f(x_1)$ (which here can be shown to be Beta-Binomial).
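The two-conditional scheme above can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind the cited figure: the function name `gibbs_final_x1`, the starting value `x2_init = 0.5` and the seed are assumptions made here, and the Binomial draw is written as a sum of Bernoulli trials so that only the standard library is needed.

```python
import random

def gibbs_final_x1(n, alpha, beta, k, rng, x2_init=0.5):
    """Run one Gibbs chain of length k and return the final x1 value."""
    x2 = x2_init  # assumed starting value x2^(0)
    x1 = 0
    for _ in range(k):
        # x1 | x2 ~ Binomial(n, x2), drawn as a sum of n Bernoulli(x2) trials
        x1 = sum(1 for _ in range(n) if rng.random() < x2)
        # x2 | x1 ~ Beta(x1 + alpha, n - x1 + beta)
        x2 = rng.betavariate(x1 + alpha, n - x1 + beta)
    return x1

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
n, alpha, beta, k, m = 16, 2.0, 4.0, 10, 500

# m separate chains, keeping only the final x1 from each
finals = [gibbs_final_x1(n, alpha, beta, k, rng) for _ in range(m)]
sample_mean = sum(finals) / m
exact_mean = n * alpha / (alpha + beta)  # mean of the Beta-Binomial marginal
print(f"sample mean {sample_mean:.2f}, exact mean {exact_mean:.2f}")
```

A histogram of `finals` would play the role of the GS histogram described above; comparing `sample_mean` with `exact_mean` is only a rough sanity check.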


Note that $f(x_1) = \int f(x_1, x_2)\, dx_2 = \int f(x_1 \mid x_2) f(x_2)\, dx_2$. This expression suggests that an improved estimate of $f(x_1)$ in this example can be obtained by inserting the $m$ values of $x_2^{(k)}$ into

$$\hat{f}(x_1) = \frac{1}{m} \sum_{i=1}^{m} f(x_1 \mid x_2^{(i)}).$$

Figure 3 of Casella & George (1992) illustrates the improvement obtained by this estimate.

Note that the conditional distributions for the above setup, the Binomial and the Beta, can be simulated by routine methods. This is not always the case. For example, $f(x_1 \mid x_2)$ from the introductory example in Section 1 is not of standard form. Fortunately, such distributions can be simulated using envelope methods such as rejection sampling, the ratio-of-uniforms method or adaptive rejection sampling. As we'll see, Metropolis-Hastings algorithms can also be used for this purpose.

3. Metropolis-Hastings Algorithms (MH)

MH algorithms generate Markov chains which converge to $f(x)$ by successively sampling from an (essentially) arbitrary proposal distribution $q(x \mid x')$ (i.e. a Markov transition kernel) and imposing a random rejection step at each transition.

An MH algorithm for a candidate proposal distribution $q(x \mid x')$ entails simulating $x^{(1)}, \ldots, x^{(k)}$ as follows:

Simulate a transition candidate $x^C$ from $q(x \mid x^{(j)})$.

Set $x^{(j+1)} = x^C$ with probability

$$\alpha(x^{(j)}, x^C) = \min\left\{ 1, \frac{q(x^{(j)} \mid x^C)}{q(x^C \mid x^{(j)})} \frac{f(x^C)}{f(x^{(j)})} \right\}.$$

Otherwise set $x^{(j+1)} = x^{(j)}$.

The original Metropolis algorithm was based on symmetric $q$ (i.e. $q(x \mid x') = q(x' \mid x)$), for which $\alpha$ is of the simple form

$$\alpha(x^{(j)}, x^C) = \min\left\{ 1, \frac{f(x^C)}{f(x^{(j)})} \right\}.$$

If $q(x \mid x')$ is chosen such that the Markov chain satisfies modest conditions (e.g. irreducibility and aperiodicity), then convergence to $f(x)$ is guaranteed. However, the rate of convergence will depend on the relationship between $q(x \mid x')$ and $f(x)$.

When $x$ is continuous, a popular choice for $q(x \mid x')$ is $x = x' + z$ where $z \sim N_p(0, \Sigma)$. The resulting chain is called a random walk chain. Note that the choice of scale $\Sigma$ can critically affect the mixing (i.e. movement) of the chain. Figure 1.1 on page 6 of Gilks, Richardson & Spiegelhalter (1995) illustrates this when $p = 1$. Other distributions for $z$ can also be used.
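As a rough illustration of a random walk chain with $p = 1$, the sketch below runs a Metropolis sampler in Python. The target $\exp(-x^2/2)$, the starting value, the proposal scale 2.4, the burn-in length and the seed are all assumptions chosen for the example; only the unnormalized density is used, and the symmetric proposal cancels from the acceptance ratio.

```python
import math
import random

def log_f(x):
    """Log of an unnormalized target density; here exp(-x^2 / 2) (assumed toy target)."""
    return -0.5 * x * x

def random_walk_metropolis(log_target, x0, scale, k, rng):
    """Return k draws and the acceptance rate of a random walk Metropolis chain."""
    x, chain, accepted = x0, [], 0
    for _ in range(k):
        candidate = x + rng.gauss(0.0, scale)  # x^C = x^(j) + z, z ~ N(0, scale^2)
        log_ratio = log_target(candidate) - log_target(x)
        # Accept with probability min(1, f(x^C) / f(x^(j))); q cancels by symmetry
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = candidate
            accepted += 1
        chain.append(x)  # on rejection the current value is repeated
    return chain, accepted / k

rng = random.Random(1)  # arbitrary seed
chain, rate = random_walk_metropolis(log_f, x0=5.0, scale=2.4, k=20000, rng=rng)
kept = chain[1000:]  # discard an assumed burn-in of 1000 iterations
mean = sum(kept) / len(kept)
var = sum((x - mean) ** 2 for x in kept) / len(kept)
print(f"acceptance rate {rate:.2f}, mean {mean:.3f}, variance {var:.3f}")
```

Rerunning with a much smaller or much larger `scale` is a quick way to see the effect on mixing: the acceptance rate moves towards 1 or 0 and the chain explores the target more slowly.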


Another useful choice, called an independence sampler, is obtained when the proposal $q(x \mid x') = q(x)$ does not depend on $x'$. The resulting $\alpha$ is of the form

$$\alpha(x^{(j)}, x^C) = \min\left\{ 1, \frac{q(x^{(j)})}{q(x^C)} \frac{f(x^C)}{f(x^{(j)})} \right\}.$$

Such samplers work well when $q(x)$ is a good heavy-tailed approximation to $f(x)$.

It may be preferable to use an MH algorithm which updates the components $x_i^{(j)}$ of $x$ one at a time. It can be shown that the Gibbs sampler is just a special case of such a single-component MH algorithm where $q$ is chosen so that $\alpha \equiv 1$.

Finally, to see why MH algorithms work, it is not too hard to show that the implied transition kernel $p(x \mid x')$ of any MH algorithm satisfies

$$p(x \mid x') f(x') = p(x' \mid x) f(x),$$

a condition called detailed balance or reversibility. Integrating both sides of this identity with respect to $x'$ yields

$$\int p(x \mid x') f(x')\, dx' = f(x),$$

showing that $f(x)$ is the limiting distribution when the chain converges.

4. The Model Liberation Movement

Advances in computing technology have unleashed the power of Monte Carlo methods, which in turn are now unleashing the potential of statistical modeling.

Our new ability to simulate from complicated multivariate probability distributions via MCMC is having impact in many areas of Statistics, but most profoundly for Bayesian approaches to statistical modeling.

The Bayesian paradigm uses probability to characterize ALL uncertainty as follows:

Data is a realization from a model $p(\text{Data} \mid \theta)$, where $\theta$ is an unknown (possibly multivariate) parameter.

$\theta$ is treated as a realization from a prior distribution $p(\theta)$.

Post-data inference about $\theta$ is based on the posterior distribution

$$p(\theta \mid \text{Data}) = \frac{p(\text{Data} \mid \theta)\, p(\theta)}{\int p(\text{Data} \mid \theta)\, p(\theta)\, d\theta}$$

In the past, analytical intractability of the expression for $p(\theta \mid \text{Data})$ severely stymied realistic practical Bayesian methods. Unrealistic, oversimplified models were too often used to facilitate calculations. MCMC has changed this, and opened up vast new realms of modeling possibilities.

My initial example

$$f(x_1, x_2) \propto (1 + x_1^2)^{-1} x_2^{-n} \exp\left\{ -\frac{1}{2 x_2^2} \sum_i (y_i - x_1)^2 - x_2 \right\}$$

was just a disguised posterior distribution for the Bayesian setup

$$y_1, \ldots, y_n \ \text{iid} \sim N(\mu, \sigma^2)$$
$$\mu \sim \text{Cauchy}(0, 1), \quad \sigma \sim \text{Exponential}(1).$$

The posterior of the parameters $\mu$ and $\sigma$ is

$$p(\mu, \sigma \mid \text{Data}) \propto (1 + \mu^2)^{-1} \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} \sum_i (y_i - \mu)^2 - \sigma \right\}.$$

In the above example, $f(x)$ can only be specified up to a norming constant. This is typical of Bayesian formulations. A huge attraction of GS and MH algorithms is that these norming constants are not needed.

The previous example is just a toy problem. MCMC is in fact enabling posterior calculation for extremely complicated models with hundreds and even thousands of parameters.

Going even further, the Bayesian approach can be used to obtain posterior distributions over model spaces. Under such formulations, MCMC algorithms are leading to new search engines which automatically identify promising models.

References For Getting Started

Casella, G. & George, E.I. (1992) Explaining the Gibbs Sampler, The American Statistician, 46, 167-174.

Chib, S. & Greenberg, E. (1995) Understanding the Metropolis-Hastings Algorithm, The American Statistician, 49, 327-335.

Gilks, W.R., Richardson, S. & Spiegelhalter, D.J. (1995) Markov Chain Monte Carlo in Practice, Chapman & Hall, London.

Robert, C.P. & Casella, G. (2004) Monte Carlo Statistical Methods, 2nd Edition, Springer, New York.

Lecture II

Bayesian Approaches for Model Uncertainty



March 2005



1. A Probabilistic Setup for Model Uncertainty

Suppose a set of $K$ models $\{M_1, \ldots, M_K\}$ is under consideration for data $Y$.

Under $M_k$, $Y$ has density $p(Y \mid \theta_k, M_k)$ where $\theta_k$ is a vector of unknown parameters that indexes the members of $M_k$. (More precisely, $M_k$ is a model class.)

The Bayesian approach proceeds by assigning a prior probability distribution $p(\theta_k \mid M_k)$ to the parameters of each model, and a prior probability $p(M_k)$ to each model.

Intuitively, this complete specification can be understood as a three stage hierarchical mixture model for generating the data $Y$: first the model $M_k$ is generated from $p(M_1), \ldots, p(M_K)$, second the parameter vector $\theta_k$ is generated from $p(\theta_k \mid M_k)$, and third the data $Y$ is generated from $p(Y \mid \theta_k, M_k)$.

Letting $Y_f$ be a future unknown observation, this formulation induces a joint distribution

$$p(Y_f, Y, \theta_k, M_k) = p(Y_f, Y \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, p(M_k).$$

Conditioning on $Y$, all remaining uncertainty is captured by the joint posterior distribution $p(Y_f, \theta_k, M_k \mid Y)$. Through conditioning and marginalization, this can be used for a variety of Bayesian inferences and decisions.

For example, for prediction one would marginalize out both $\theta_k$ and $M_k$ and use the predictive distribution $p(Y_f \mid Y)$, which in effect averages over all the unknown models.

Of particular interest are the posterior model probabilities

$$p(M_k \mid Y) = \frac{p(Y \mid M_k)\, p(M_k)}{\sum_j p(Y \mid M_j)\, p(M_j)}$$

where

$$p(Y \mid M_k) = \int p(Y \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k$$

is the marginal or integrated likelihood of $M_k$.

In terms of the three stage hierarchical mixture formulation, $p(M_k \mid Y)$ is the probability that $M_k$ generated the data, i.e. that $M_k$ was generated from $p(M_1), \ldots, p(M_K)$ in the first step.

The model posterior distribution $p(M_1 \mid Y), \ldots, p(M_K \mid Y)$ provides a complete post-data representation of model uncertainty and is the fundamental object of interest for model selection and model averaging.

A natural and simple strategy for model selection is to choose the most probable $M_k$, the one for which $p(M_k \mid Y)$ is largest. However, for the purpose of prediction with a single model, it may be better to use the median posterior model. Alternatively one might prefer to report a set of high posterior models along with their probabilities to convey the model uncertainty.

Based on these posterior probabilities, pairwise comparison of models is summarized by the posterior odds

$$\frac{p(M_1 \mid Y)}{p(M_2 \mid Y)} = \frac{p(Y \mid M_1)}{p(Y \mid M_2)} \times \frac{p(M_1)}{p(M_2)}.$$

Note how the data, through the Bayes factor

$$\frac{p(Y \mid M_1)}{p(Y \mid M_2)},$$

updates the prior odds to yield the posterior odds.

2. Examples

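As a first example, consider the problem of choosing between two nonnested models, $M_1$ and $M_2$, for discrete count data $Y = (y_1, \ldots, y_n)$ where

$$p(Y \mid \theta_1, M_1) = \alpha^n (1 - \alpha)^s, \quad s = \sum y_i,$$

a geometric distribution where $\theta_1 = \alpha$, and

$$p(Y \mid \theta_2, M_2) = \frac{e^{-n\lambda} \lambda^s}{\prod y_i!},$$

a Poisson distribution where $\theta_2 = \lambda$.

Suppose further that uncertainty about $\theta_1 = \alpha$ is described by a uniform prior

$$p(\alpha \mid M_1) = 1 \quad \text{for } \alpha \in [0, 1]$$

and uncertainty about $\theta_2 = \lambda$ is described by an exponential prior

$$p(\lambda \mid M_2) = e^{-\lambda} \quad \text{for } \lambda \in [0, \infty).$$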

Under these priors, the marginal distributions are

$$p(Y \mid M_1) = \int_0^1 \alpha^n (1 - \alpha)^s\, d\alpha = \frac{n!\, s!}{(n + s + 1)!}$$

$$p(Y \mid M_2) = \int_0^\infty \frac{e^{-(n+1)\lambda} \lambda^s}{\prod y_i!}\, d\lambda = \frac{s!}{(n + 1)^{s+1} \prod y_i!}$$

The Bayes factor for $M_1$ vs $M_2$ is then

$$\frac{p(Y \mid M_1)}{p(Y \mid M_2)} = \frac{n!\, (n + 1)^{s+1} \prod y_i!}{(n + s + 1)!}.$$

When $p(M_1) = p(M_2) = 1/2$, this equals the posterior odds.

Note that in contrast to the likelihood ratio statistic, which compares maximized likelihoods, the Bayes factor compares averaged likelihoods.

Caution: the choice of priors here can be very influential.
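The closed forms above are easy to evaluate numerically. The following Python sketch computes both log marginals with `math.lgamma` (to avoid overflowing factorials) and then the Bayes factor; the function names and the three counts in `y` are hypothetical, chosen only so the result can be checked by hand.

```python
import math

def log_marginal_geometric(y):
    """log p(Y | M1) = log[ n! s! / (n + s + 1)! ] under the uniform prior."""
    n, s = len(y), sum(y)
    return math.lgamma(n + 1) + math.lgamma(s + 1) - math.lgamma(n + s + 2)

def log_marginal_poisson(y):
    """log p(Y | M2) = log[ s! / ((n + 1)^(s + 1) * prod y_i!) ] under the Exponential(1) prior."""
    n, s = len(y), sum(y)
    log_prod_fact = sum(math.lgamma(v + 1) for v in y)
    return math.lgamma(s + 1) - (s + 1) * math.log(n + 1) - log_prod_fact

def bayes_factor(y):
    """Bayes factor for M1 (geometric) vs M2 (Poisson)."""
    return math.exp(log_marginal_geometric(y) - log_marginal_poisson(y))

y = [0, 1, 2]  # hypothetical counts: n = 3, s = 3
bf = bayes_factor(y)
post_m1 = bf / (1.0 + bf)  # posterior probability of M1 when p(M1) = p(M2) = 1/2
print(f"Bayes factor {bf:.4f}, p(M1 | Y) {post_m1:.4f}")
```

For these counts the formula gives $3!\,4^4\,(0!\,1!\,2!)/7! = 3072/5040$, which is what the code evaluates.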



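As our second example, consider the problem of testing $H_0: \mu = 0$ vs $H_1: \mu \ne 0$ when $y_1, \ldots, y_n$ iid $\sim N(\mu, 1)$.

This can be treated as a Bayesian model selection problem by letting

$$p(Y \mid \theta_1, M_1) = p(Y \mid \theta_2, M_2) = (2\pi)^{-n/2} \exp\left\{ -\frac{\sum (y_i - \mu)^2}{2} \right\}$$

and assigning different priors to $\theta_1 = \theta_2 = \mu$, namely

$$\Pr(\mu = 0 \mid M_1) = 1, \quad \text{i.e. a point mass at 0}$$

$$p(\mu \mid M_2) = (2\pi\tau^2)^{-1/2} \exp\left\{ -\frac{\mu^2}{2\tau^2} \right\}$$

These priors yield marginal distributions $p(Y \mid M_1)$ and $p(Y \mid M_2)$ that result in a Bayes factor of the form

$$\frac{p(Y \mid M_1)}{p(Y \mid M_2)} = (1 + n\tau^2)^{1/2} \exp\left\{ -\frac{n^2 \tau^2 \bar{y}^2}{2(1 + n\tau^2)} \right\}$$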

3. General Considerations for Prior Selection

For a given set of models $\mathcal{M}$, the effectiveness of the Bayesian approach rests firmly on the specification of the parameter priors $p(\theta_k \mid M_k)$ and the model space prior $p(M_1), \ldots, p(M_K)$.

The most common and practical approach to prior specification in model uncertainty problems, especially large ones, is to try to construct noninformative, semi-automatic formulations, using subjective and empirical Bayes considerations where needed.

A simple and popular choice for the model space prior is

$$p(M_k) \equiv 1/K$$

which is noninformative in the sense of favoring all models equally. However, this can be deceptive because it may not be uniform over other characteristics such as model size.

Turning to the choice of parameter priors $p(\theta_k \mid M_k)$, the use of improper noninformative priors must be ruled out because their arbitrary norming constants are problematic for posterior odds comparisons.

Proper priors guarantee the internal coherence of the Bayesian formulation and allow for meaningful hyperparameter specifications.

An important consideration for prior specification is the analytical or numerical tractability for obtaining marginals $p(Y \mid M_k)$.

For nested model formulations, centering priors is often straightforward. The crucial challenge is setting the prior dispersion. It should be large enough to avoid too much prior influence, but small enough to avoid overly diffuse specifications. Note that in our previous normal example, the Bayes factor goes to $\infty$ as $\tau \to \infty$, the Bartlett-Lindley paradox.
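The paradox is easy to see numerically. The sketch below evaluates the Bayes factor from the normal mean example for increasingly diffuse priors; the function name and the data summary ($n = 25$, $\bar{y} = 0.5$, i.e. a z-statistic of 2.5) are assumptions made for illustration.

```python
import math

def bayes_factor_null_vs_alt(ybar, n, tau):
    """Bayes factor p(Y | M1) / p(Y | M2) for mu = 0 vs mu ~ N(0, tau^2), unit data variance."""
    shrink = n * tau**2 / (1.0 + n * tau**2)
    # exponent equals -n^2 tau^2 ybar^2 / (2 (1 + n tau^2))
    return math.sqrt(1.0 + n * tau**2) * math.exp(-0.5 * n * ybar**2 * shrink)

n, ybar = 25, 0.5  # hypothetical data summary
taus = [1.0, 10.0, 100.0, 1000.0]
bfs = [bayes_factor_null_vs_alt(ybar, n, tau) for tau in taus]
for tau, bf in zip(taus, bfs):
    print(f"tau = {tau:7.1f}   Bayes factor for the null = {bf:10.3f}")
```

With the data held fixed, the factor moves from favoring the alternative at $\tau = 1$ to favoring the null ever more strongly as $\tau$ grows, which is why the prior dispersion cannot simply be made arbitrarily large.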



4. Extracting Information from the Posterior

When exact calculation of the posterior is not feasible, MCMC methods can often be used to simulate an approximate sample from the posterior. This can be used to estimate posterior characteristics or to search for high probability models.

For a model characteristic $\eta$, MCMC methods such as the GS and MH algorithms entail simulation of a Markov chain, say $\eta^{(1)}, \eta^{(2)}, \ldots$, that is converging to its posterior distribution $p(\eta \mid Y)$.

When $p(Y \mid M_k)$ can be obtained analytically, the GS and MH algorithms can be applied to directly simulate a model index from

$$p(M_k \mid Y) \propto p(Y \mid M_k)\, p(M_k).$$

Otherwise, one must simulate from $p(\theta_k, M_k \mid Y)$.

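Conjugate priors are often used because of the computational advantages of having closed form expressions for $p(Y \mid M_k)$.

Alternatively, it is sometimes useful to use a computable approximation for $p(Y \mid M_k)$ such as a Laplace approximation

$$p(Y \mid M_k) \approx (2\pi)^{d_k/2} |H(\tilde\theta_k)|^{1/2} p(Y \mid \tilde\theta_k, M_k)\, p(\tilde\theta_k \mid M_k)$$

where $d_k$ is the dimension of $\theta_k$, $\tilde\theta_k$ is the maximum of $h(\theta_k) \equiv \log p(Y \mid \theta_k, M_k)\, p(\theta_k \mid M_k)$, and $H(\tilde\theta_k)$ is minus the inverse Hessian of $h(\theta_k)$ evaluated at $\tilde\theta_k$.

This is obtained by substituting the Taylor series approximation $h(\theta_k) \approx h(\tilde\theta_k) - \frac{1}{2} (\theta_k - \tilde\theta_k)' H(\tilde\theta_k)^{-1} (\theta_k - \tilde\theta_k)$ for $h(\theta_k)$ in $p(Y \mid M_k) = \int \exp\{h(\theta_k)\}\, d\theta_k$.

Going further, people sometimes use the BIC approximation

$$\log p(Y \mid M_k) \approx \log p(Y \mid \hat\theta_k, M_k) - (d_k/2) \log n$$

obtained by using the MLE $\hat\theta_k$ and ignoring the terms that are constant in large samples.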
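To get a feel for the accuracy of the Laplace approximation, it can be checked against a case where the marginal is known exactly. The sketch below does this for the geometric model $M_1$ of the earlier count-data example with its uniform prior, where $d_k = 1$ and $p(Y \mid M_1) = n!\,s!/(n+s+1)!$; the values $n = 30$, $s = 60$ and the function names are assumptions made for illustration.

```python
import math

def exact_log_marginal(n, s):
    """log of n! s! / (n + s + 1)!, the exact marginal under the uniform prior."""
    return math.lgamma(n + 1) + math.lgamma(s + 1) - math.lgamma(n + s + 2)

def laplace_log_marginal(n, s):
    """Laplace approximation with h(a) = n log a + s log(1 - a); the uniform prior adds 0."""
    mode = n / (n + s)                                   # maximizer of h
    h_mode = n * math.log(mode) + s * math.log(1.0 - mode)
    neg_hessian = n / mode**2 + s / (1.0 - mode) ** 2    # -h''(mode)
    H = 1.0 / neg_hessian                                # minus the inverse Hessian (a scalar here)
    d = 1                                                # dimension of the parameter
    return 0.5 * d * math.log(2.0 * math.pi) + 0.5 * math.log(H) + h_mode

n, s = 30, 60  # hypothetical sample size and total count (both must be positive)
exact = exact_log_marginal(n, s)
laplace = laplace_log_marginal(n, s)
rel_error = math.exp(laplace - exact) - 1.0
print(f"exact {exact:.4f}, Laplace {laplace:.4f}, relative error {rel_error:.4%}")
```

The relative error shrinks as $n$ and $s$ grow, consistent with the approximation being a large-sample device.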



Lecture III

Bayesian Variable Selection



March 2005



1. The Variable Selection Problem

Suppose one wants to model the relationship between $Y$, a variable of interest, and a subset of $x_1, \ldots, x_p$, a set of potential explanatory variables or predictors, but there is uncertainty about which subset to use. Such a situation is particularly of interest when $p$ is large and $x_1, \ldots, x_p$ is thought to contain many redundant or irrelevant variables.

This problem has received the most attention under the normal linear model

$$Y = \beta_1 x_1 + \cdots + \beta_p x_p + \epsilon \quad \text{where } \epsilon \sim N_n(0, \sigma^2 I)$$

when some unknown subset of regression coefficients are so small that it would be preferable to ignore them.

This normal linear model setup is important not only because of its analytical tractability, but also because it is a canonical version of other important problems such as modern nonparametric regression.

It will be convenient here to index each of the $2^p$ possible subset choices by

$$\gamma = (\gamma_1, \ldots, \gamma_p)',$$

where $\gamma_i = 0$ or 1 according to whether $\beta_i$ is small or large, respectively. The size of the $\gamma$th subset is denoted $q_\gamma \equiv \gamma' 1$. We refer to $\gamma$ as a model since it plays the same role as $M_k$ described in Lecture II.

2. Model Space Priors for Variable Selection

For the specification of the model space prior, most Bayesian variable selection implementations have used independence priors of the form

$$p(\gamma) = \prod w_i^{\gamma_i} (1 - w_i)^{1 - \gamma_i}.$$

Under this prior, each $x_i$ enters the model independently with probability $p(\gamma_i = 1) = 1 - p(\gamma_i = 0) = w_i$.

A useful simplification of this yields

$$p(\gamma) = w^{q_\gamma} (1 - w)^{p - q_\gamma},$$

where $w$ is the expected proportion of $x_i$'s in the model. A special case is the popular uniform prior

$$p(\gamma) \equiv 1/2^p.$$

Note that both of these priors are informative about the size of the model.

Related priors that might also be considered are

$$p(\gamma) = \frac{B(\alpha + q_\gamma, \beta + p - q_\gamma)}{B(\alpha, \beta)}$$

obtained by putting a Beta prior on $w$, and more generally

$$p(\gamma) = \binom{p}{q_\gamma}^{-1} h(q_\gamma)$$

obtained by putting a prior $h(q_\gamma)$ on the model size.
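One way to see what these priors say about model size is to compute the implied distribution of $q_\gamma$, obtained by multiplying $p(\gamma)$ by the number $\binom{p}{q}$ of models of each size. The Python sketch below does this for the independence prior and for the Beta-mixed prior; the function names and the choice $p = 10$ are assumptions made for illustration.

```python
import math

def size_prior_independence(p, w):
    """P(q_gamma = q) when p(gamma) = w^q (1 - w)^(p - q): a Binomial(p, w) pmf."""
    return [math.comb(p, q) * w**q * (1.0 - w) ** (p - q) for q in range(p + 1)]

def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def size_prior_beta(p, alpha, beta):
    """P(q_gamma = q) when w ~ Beta(alpha, beta): a Beta-Binomial pmf."""
    return [
        math.comb(p, q) * math.exp(log_beta(alpha + q, beta + p - q) - log_beta(alpha, beta))
        for q in range(p + 1)
    ]

p = 10
uniform_over_models = size_prior_independence(p, 0.5)  # the uniform prior 1 / 2^p
flat_over_sizes = size_prior_beta(p, 1.0, 1.0)         # Beta(1, 1) prior on w

for q in range(p + 1):
    print(f"q = {q:2d}   uniform over models: {uniform_over_models[q]:.4f}   "
          f"Beta(1,1) on w: {flat_over_sizes[q]:.4f}")
```

The uniform prior over models concentrates on sizes near $p/2$, whereas a Beta(1, 1) prior on $w$ spreads mass evenly over the sizes $0, \ldots, p$.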



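3. Parameter Priors for Selection of Nonzero $\beta_i$

When the goal is to ignore only those $x_i$ for which $\beta_i = 0$, the problem then becomes that of selecting a submodel of the form

$$Y = X_\gamma \beta_\gamma + \epsilon, \quad \epsilon \sim N_n(0, \sigma^2 I)$$

where $X_\gamma$ is the $n \times q_\gamma$ matrix whose columns correspond to the $\gamma$th subset of $x_1, \ldots, x_p$ and $\beta_\gamma$ is a $q_\gamma \times 1$ vector of unknown regression coefficients. Here, $(\beta_\gamma, \sigma^2)$ plays the role of $\theta_k$ described in Lecture II.

Perhaps the most commonly applied parameter prior form for this setup is the conjugate normal-inverse-gamma prior

$$p(\beta_\gamma \mid \sigma^2, \gamma) = N_{q_\gamma}(0, \sigma^2 \Sigma_\gamma),$$

$$p(\sigma^2 \mid \gamma) = p(\sigma^2) = IG(\nu/2, \nu\lambda/2).$$

($p(\sigma^2)$ here is equivalent to $\nu\lambda/\sigma^2 \sim \chi^2_\nu$.)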

A valuable feature of this prior is its analytical tractability; $\beta_\gamma$ and $\sigma^2$ can be eliminated by routine integration to yield

$$p(Y \mid \gamma) \propto |X_\gamma' X_\gamma + \Sigma_\gamma^{-1}|^{-1/2} |\Sigma_\gamma|^{-1/2} (\nu\lambda + S_\gamma^2)^{-(n+\nu)/2}$$

where $S_\gamma^2 = Y'Y - Y'X_\gamma (X_\gamma' X_\gamma + \Sigma_\gamma^{-1})^{-1} X_\gamma' Y$.

The use of these closed form expressions can substantially speed up posterior evaluation and MCMC exploration, as we will see.

In choosing values for the hyperparameters that control $p(\sigma^2)$, $\lambda$ may be thought of as a prior estimate of $\sigma^2$, and $\nu$ may be thought of as the prior sample size associated with this estimate.

Let $\hat\sigma^2_{FULL}$ and $\hat\sigma^2_Y$ denote the traditional estimates of $\sigma^2$ based on the saturated and null models respectively. Treating $\hat\sigma^2_{FULL}$ and $\hat\sigma^2_Y$ as rough under- and over-estimates of $\sigma^2$, one might choose $\lambda$ and $\nu$ so that $p(\sigma^2)$ assigns substantial probability to the interval $(\hat\sigma^2_{FULL}, \hat\sigma^2_Y)$. This should at least avoid gross misspecification.

Alternatively, the explicit choice of $\lambda$ and $\nu$ can be avoided by using $p(\sigma^2) \propto 1/\sigma^2$, the limit of the inverse-gamma prior as $\nu \to 0$.

For choosing the prior covariance matrix $\Sigma_\gamma$ that controls $p(\beta_\gamma \mid \sigma^2, \gamma)$, specification is substantially simplified by setting $\Sigma_\gamma = c V_\gamma$, where $c$ is a scalar and $V_\gamma$ is a preset form such as $V_\gamma = (X_\gamma' X_\gamma)^{-1}$ or $V_\gamma = I_{q_\gamma}$, the $q_\gamma \times q_\gamma$ identity matrix.

Having fixed $V_\gamma$, the goal is then to choose $c$ large enough so that $p(\beta_\gamma \mid \sigma^2, \gamma)$ is relatively flat over the region of plausible values of $\beta_\gamma$, thereby reducing prior influence. At the same time it is important to avoid excessively large values of $c$ because the Bayes factors will eventually put increasing weight on the null model as $c \to \infty$, the Bartlett-Lindley paradox. For practical purposes, a rough guide is to choose $c$ so that $p(\beta_\gamma \mid \sigma^2, \gamma)$ assigns substantial probability to the range of all plausible values for $\beta_\gamma$. Choices of $c$ between 10 and 10,000 seem to yield good results.

4. Posterior Calculation and Exploration

The previous conjugate prior formulations allow for analytical marginalizing out of $\beta$ and $\sigma^2$ from $p(Y, \beta, \sigma^2 \mid \gamma)$ to yield a computable, closed form expression

$$g(\gamma) \propto p(Y \mid \gamma)\, p(\gamma) \propto p(\gamma \mid Y)$$

that can greatly facilitate posterior calculation and exploration.

For example, when $\Sigma_\gamma = c\,(X_\gamma' X_\gamma)^{-1}$, we can obtain

$$g(\gamma) = (1 + c)^{-q_\gamma/2} (\nu\lambda + Y'Y - (1 + 1/c)^{-1} W'W)^{-(n+\nu)/2}\, p(\gamma)$$

where $W = T'^{-1} X_\gamma' Y$ for upper triangular $T$ such that $T'T = X_\gamma' X_\gamma$ (obtainable by the Cholesky decomposition). This representation allows for fast updating of $T$, and hence $W$ and $g(\gamma)$, when $\gamma$ is changed one component at a time, requiring $O(q_\gamma^2)$ operations per update, where $\gamma$ is the changed value.

The availability of $g(\gamma) \propto p(\gamma \mid Y)$ allows for the flexible construction of MCMC algorithms that simulate a Markov chain

$$\gamma^{(1)}, \gamma^{(2)}, \gamma^{(3)}, \ldots$$

converging (in distribution) to $p(\gamma \mid Y)$.

A variety of such MCMC algorithms can be conveniently obtained by applying the GS with $g(\gamma)$, for example, by generating each component from the full conditionals

$$p(\gamma_i \mid \gamma_{(i)}, Y)$$

($\gamma_{(i)} = \{\gamma_j : j \ne i\}$) where the $\gamma_i$ may be drawn in any fixed or random order.

The generation of such components can be obtained rapidly as a sequence of Bernoulli draws using simple functions of the ratio

$$\frac{p(\gamma_i = 1, \gamma_{(i)} \mid Y)}{p(\gamma_i = 0, \gamma_{(i)} \mid Y)} = \frac{g(\gamma_i = 1, \gamma_{(i)})}{g(\gamma_i = 0, \gamma_{(i)})}.$$
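The Bernoulli-draw scheme can be sketched as follows. The regression-based $g(\gamma)$ is replaced here by a deliberately simple toy `log_g` built from hypothetical per-variable scores (`SCORES`), so that the exact inclusion probabilities are known and the sampler can be checked; any function returning $\log g(\gamma)$ could be passed in its place.

```python
import math
import random

SCORES = [2.0, -1.0, 0.0]  # hypothetical per-variable log weights defining a toy g

def log_g(gamma):
    """Toy log g(gamma): sum of the scores of included variables (an assumption, not the regression g)."""
    return sum(s for s, included in zip(SCORES, gamma) if included)

def gibbs_over_models(log_g, p, sweeps, rng):
    """Gibbs sampler over gamma in {0, 1}^p using only ratios of g."""
    gamma = [0] * p
    draws = []
    for _ in range(sweeps):
        for i in range(p):  # fixed update order
            with_i = gamma[:i] + [1] + gamma[i + 1:]
            without_i = gamma[:i] + [0] + gamma[i + 1:]
            log_ratio = log_g(with_i) - log_g(without_i)
            # Bernoulli probability ratio / (1 + ratio), written on the log scale
            prob_one = 1.0 / (1.0 + math.exp(-log_ratio))
            gamma[i] = 1 if rng.random() < prob_one else 0
        draws.append(tuple(gamma))
    return draws

rng = random.Random(3)  # arbitrary seed
draws = gibbs_over_models(log_g, p=len(SCORES), sweeps=5000, rng=rng)
incl_freq = [sum(d[i] for d in draws) / len(draws) for i in range(len(SCORES))]
exact = [1.0 / (1.0 + math.exp(-s)) for s in SCORES]  # known only for this toy g
print("estimated inclusion frequencies:", [round(f, 3) for f in incl_freq])
print("exact inclusion probabilities:  ", [round(e, 3) for e in exact])
```

Because the sampler needs only the ratio $g(\gamma_i = 1, \gamma_{(i)})/g(\gamma_i = 0, \gamma_{(i)})$, the normalizing constant of $p(\gamma \mid Y)$ never has to be computed.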



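Such $g(\gamma)$ also facilitates the use of MH algorithms. Because $g(\gamma)/g(\gamma') = p(\gamma \mid Y)/p(\gamma' \mid Y)$, these are of the form:

1. Simulate a candidate $\gamma^*$ from a transition kernel $q(\gamma^* \mid \gamma^{(j)})$.

2. Set $\gamma^{(j+1)} = \gamma^*$ with probability

$$\alpha(\gamma^* \mid \gamma^{(j)}) = \min\left\{ \frac{q(\gamma^{(j)} \mid \gamma^*)}{q(\gamma^* \mid \gamma^{(j)})} \frac{g(\gamma^*)}{g(\gamma^{(j)})}, 1 \right\}. \quad (1)$$

Otherwise, $\gamma^{(j+1)} = \gamma^{(j)}$.

A useful class of MH algorithms, the Metropolis algorithms, are obtained from the class of symmetric transition kernels of the form

$$q(\gamma^1 \mid \gamma^0) = q_d \quad \text{if } \sum_{i=1}^{p} |\gamma_i^0 - \gamma_i^1| = d, \quad (2)$$

which simulate a candidate $\gamma^*$ by randomly changing $d$ components of $\gamma^{(j)}$ with probability $q_d$.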

When available, fast updating schemes for $g(\gamma)$ can be exploited in all these MCMC algorithms.

5. Extracting Information from the Output

The simulated Markov chain sample $\gamma^{(1)}, \ldots, \gamma^{(K)}$ contains valuable information about the posterior $p(\gamma \mid Y)$.

Empirical frequencies provide consistent estimates of individual model probabilities or characteristics such as $p(\beta_i \ne 0 \mid Y)$.

When closed form $g(\gamma)$ is available, we can do better. For example, the exact relative probability of any two values $\gamma^0$ and $\gamma^1$ in the sequence of simulated values is obtained as $g(\gamma^0)/g(\gamma^1)$.

Such $g(\gamma)$ also facilitates estimation of the normalizing constant $C$ in $p(\gamma \mid Y) = C\, g(\gamma)$. Let $A$ be a preselected subset of $\gamma$ values and let $g(A) = \sum_{\gamma \in A} g(\gamma)$ so that $p(A \mid Y) = C\, g(A)$. Then, a consistent estimate of $C$ is

$$\hat{C} = \frac{1}{g(A) K} \sum_{k=1}^{K} I_A(\gamma^{(k)})$$

where $I_A(\cdot)$ is the indicator of the set $A$.

This yields improved estimates of the probability of individual $\gamma$ values

$$\hat{p}(\gamma \mid Y) = \hat{C}\, g(\gamma),$$

as well as an estimate of the total visited probability

$$\hat{p}(B \mid Y) = \hat{C}\, g(B),$$

where $B$ is the set of visited $\gamma$ values.
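A small numerical sketch of the estimate $\hat{C}$ follows. The four values of $g$ on a $p = 2$ model space are hypothetical, and iid draws from $p(\gamma \mid Y)$ stand in for MCMC output so that the example stays self-contained; with a real chain the same two lines computing `C_hat` would apply.

```python
import random

# Hypothetical unnormalized values g(gamma) on the four models with p = 2
G = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}

rng = random.Random(2)  # arbitrary seed
models = list(G)
K = 20000
# Stand-in for MCMC output: iid draws from p(gamma | Y) = C g(gamma)
sample = rng.choices(models, weights=[G[m] for m in models], k=K)

A = {(1, 1)}                                  # preselected subset A
g_A = sum(G[m] for m in A)
C_hat = sum(1 for m in sample if m in A) / (g_A * K)
C_true = 1.0 / sum(G.values())                # available only because the toy space is tiny

visited = set(sample)                         # B, the set of visited values
p_hat_visited = C_hat * sum(G[m] for m in visited)
print(f"C_hat {C_hat:.4f}, true C {C_true:.4f}, estimated visited probability {p_hat_visited:.3f}")
```

The frequency of visits to $A$ is the only Monte Carlo ingredient; every individual probability estimate $\hat{C}\, g(\gamma)$ then inherits the exact relative sizes of $g$.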


The simulated $\gamma^{(1)}, \ldots, \gamma^{(K)}$ can also play an important role in model averaging. For example, suppose one wanted to predict a quantity of interest $\Delta$ by the posterior mean

$$E(\Delta \mid Y) = \sum_{\text{all } \gamma} E(\Delta \mid \gamma, Y)\, p(\gamma \mid Y).$$

When $p$ is too large for exhaustive enumeration and $p(\gamma \mid Y)$ cannot be computed, $E(\Delta \mid Y)$ is unavailable and is typically approximated by something of the form

$$\hat{E}(\Delta \mid Y) = \sum_{\gamma \in S} E(\Delta \mid \gamma, Y)\, \hat{p}(\gamma \mid Y, S)$$

where $S$ is a manageable subset of models and $\hat{p}(\gamma \mid Y, S)$ is a probability distribution over $S$. (In some cases, $E(\Delta \mid \gamma, Y)$ will also need to be approximated.)

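Letting $S$ be the sampled values, a natural and consistent choice for $\hat{E}(\Delta \mid Y)$ is

$$\hat{E}_f(\Delta \mid Y) = \sum_{\gamma \in S} E(\Delta \mid \gamma, Y)\, \hat{p}_f(\gamma \mid Y, S)$$

where $\hat{p}_f(\gamma \mid Y, S)$ is the relative frequency of $\gamma$ in $S$. However, it appears that when $g(\gamma)$ is available, one can do better by using

$$\hat{E}_g(\Delta \mid Y) = \sum_{\gamma \in S} E(\Delta \mid \gamma, Y)\, \hat{p}_g(\gamma \mid Y, S)$$

where $\hat{p}_g(\gamma \mid Y, S) = g(\gamma)/g(S)$ is the renormalized value of $g(\gamma)$.

For example, when $S$ is an iid sample from $p(\gamma \mid Y)$, $\hat{E}_g(\Delta \mid Y)$ approximates the best unbiased estimator of $E(\Delta \mid Y)$ as the sample size increases. To see this, note that when $S$ is an iid sample, $\hat{E}_f(\Delta \mid Y)$ is unbiased for $E(\Delta \mid Y)$. Since $S$ (together with $g$) is sufficient, the Rao-Blackwellized estimator $E(\hat{E}_f(\Delta \mid Y) \mid S)$ is best unbiased. But as the sample size increases, $E(\hat{E}_f(\Delta \mid Y) \mid S) \to \hat{E}_g(\Delta \mid Y)$.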

6. Calibration and Empirical Bayes Variable Selection

Let us now focus on the special case when the conjugate normal-inverse-gamma prior,

$$p(\beta_\gamma \mid \sigma^2, \gamma) = N_{q_\gamma}(0, c\,\sigma^2 (X_\gamma' X_\gamma)^{-1}),$$

is combined with

$$p(\gamma) = w^{q_\gamma} (1 - w)^{p - q_\gamma},$$

the simple independence prior; for the moment, let's assume $\sigma^2$ is known.

The hyperparameter $c$ controls the expected size of the nonzero coefficients of $\beta = (\beta_1, \ldots, \beta_p)'$. The hyperparameter $w$ controls the expected proportion of such nonzero components.

Surprise! We will see that this prior setup is related to the canonical penalized sum-of-squares criterion

$$C_F(\gamma) \equiv SS_\gamma/\hat\sigma^2 - F q_\gamma$$

where $SS_\gamma = \hat\beta_\gamma' X_\gamma' X_\gamma \hat\beta_\gamma$, $\hat\beta_\gamma \equiv (X_\gamma' X_\gamma)^{-1} X_\gamma' Y$ and $F$ is a fixed penalty value for adding a variable.

Popular model selection criteria simply entail maximizing $C_F(\gamma)$ with particular choices of $F$ and $\sigma^2 = \hat\sigma^2$.

For orthogonal variables, $x_i$ is added $\iff t_i^2 > F$.

Some choices for $F$:

$F = 0$: Select full model
$F = 2$: $C_p$ and AIC
$F = \log n$: BIC
$F = 2 \log p$: RIC

The relationship with $C_F(\gamma)$ is obtained by reexpressing the model posterior under the prior setup as

$$p(\gamma \mid Y) \propto \exp\left[ \frac{c}{2(1 + c)} \{ SS_\gamma/\sigma^2 - F(c, w)\, q_\gamma \} \right],$$

where

$$F(c, w) = \frac{1 + c}{c} \left\{ 2 \log\frac{1 - w}{w} + \log(1 + c) \right\}.$$

As a function of $\gamma$ for fixed $Y$, $p(\gamma \mid Y)$ is increasing in $C_F(\gamma)$ when $F = F(c, w)$. Thus, Bayesian model selection based on $p(\gamma \mid Y)$ is equivalent to model selection based on the criterion $C_{F(c,w)}(\gamma)$. For example, by appropriate choice of $c, w$, the mode of $p(\gamma \mid Y)$ can be made to correspond to the best $C_p$, AIC, BIC or RIC models.

Since $c$ and $w$ control the expected size and proportion of the nonzero components of $\beta$, the dependence of $F(c, w)$ on $c$ and $w$ provides an implicit connection between the penalty $F$ and the profile of models for which its value may be appropriate.
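The correspondence between $(c, w)$ and the penalty $F$ can be explored numerically. In the sketch below, `penalty_F` evaluates $F(c, w)$ and `select_orthogonal` applies the rule that, for orthogonal predictors, $x_i$ is included when $t_i^2 > F$. The function names, the t-statistics and the sample size $n = 100$ used for the BIC penalty are hypothetical; the value $c = 3.92$ with $w = 1/2$ is simply a choice that makes $F(c, w)$ close to 2.

```python
import math

def penalty_F(c, w):
    """F(c, w) = ((1 + c) / c) * { 2 log((1 - w) / w) + log(1 + c) }."""
    return (1.0 + c) / c * (2.0 * math.log((1.0 - w) / w) + math.log(1.0 + c))

def select_orthogonal(t_stats, F):
    """For orthogonal predictors, include x_i exactly when t_i^2 > F."""
    return [1 if t * t > F else 0 for t in t_stats]

t_stats = [0.5, 1.6, 2.5, -3.1]            # hypothetical t-statistics
F_half = penalty_F(3.92, 0.5)               # close to the AIC / Cp penalty F = 2
F_sparse = penalty_F(3.92, 0.1)             # smaller w implies a heavier penalty

aic_model = select_orthogonal(t_stats, 2.0)
bic_model = select_orthogonal(t_stats, math.log(100))  # BIC penalty with assumed n = 100
print(f"F(3.92, 0.5) = {F_half:.3f}, F(3.92, 0.1) = {F_sparse:.3f}")
print("AIC selection:", aic_model, " BIC selection:", bic_model)
```

Lowering $w$ (fewer nonzero coefficients expected a priori) raises the implied penalty, which is the implicit connection described above.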


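The awful truth: $c$ and $w$ are unknown.

Empirical Bayes Idea: Use $\hat{c}$ and $\hat{w}$ which maximize the marginal likelihood

$$L(c, w \mid Y, \sigma) \propto \sum_\gamma p(\gamma \mid w)\, p(Y \mid \gamma, \sigma, c) \propto \sum_\gamma w^{q_\gamma} (1 - w)^{p - q_\gamma} (1 + c)^{-q_\gamma/2} \exp\left\{ \frac{c\, SS_\gamma}{2\sigma^2 (1 + c)} \right\}.$$

For orthogonal $x$'s (and known $\sigma$), this simplifies to

$$L(c, w \mid Y, \sigma) \propto \prod_{i=1}^{p} \left[ (1 - w)\, e^{-t_i^2/2} + w (1 + c)^{-1/2} e^{-t_i^2/(2(1+c))} \right]$$

where $t_i = b_i v_i/\sigma$ is the t-statistic associated with $x_i$.

At least in the orthogonal case, $\hat{c}$ and $\hat{w}$ can be found numerically using Gauss-Seidel, the EM algorithm, etc.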
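As a rough sketch of the orthogonal case, the code below evaluates the log of the product-form likelihood and maximizes it over a coarse grid. The grid search is a stand-in chosen here for brevity, not the Gauss-Seidel or EM approaches mentioned above, and the t-statistics, grids and function names are hypothetical.

```python
import math

def log_marginal_likelihood(c, w, t_stats):
    """log L(c, w | Y, sigma) up to a constant, for orthogonal predictors."""
    total = 0.0
    for t in t_stats:
        null_part = (1.0 - w) * math.exp(-0.5 * t * t)
        signal_part = w * (1.0 + c) ** -0.5 * math.exp(-0.5 * t * t / (1.0 + c))
        total += math.log(null_part + signal_part)
    return total

def grid_maximize(t_stats, c_grid, w_grid):
    """Return the (c, w) grid point with the largest log marginal likelihood."""
    return max(
        ((c, w) for c in c_grid for w in w_grid),
        key=lambda cw: log_marginal_likelihood(cw[0], cw[1], t_stats),
    )

t_stats = [0.3, -0.8, 1.1, 4.2, -5.0, 0.2, 6.1, -0.5]  # hypothetical t-statistics
c_grid = [1.0, 5.0, 10.0, 20.0, 50.0]
w_grid = [0.1 * j for j in range(1, 10)]

c_hat, w_hat = grid_maximize(t_stats, c_grid, w_grid)
best = log_marginal_likelihood(c_hat, w_hat, t_stats)
print(f"c_hat = {c_hat}, w_hat = {w_hat:.1f}, log-likelihood = {best:.3f}")
```

A finer grid or a proper optimizer would sharpen the estimates; the point is only that, in the orthogonal case, each likelihood evaluation costs a single pass over the $p$ t-statistics.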


The best marginal maximum likelihood model is then the one which maximizes the posterior $p(\gamma \mid Y, \hat{c}, \hat{w}, \sigma)$ or equivalently

$$C_{MML} \equiv C_{F(\hat{c}, \hat{w})}$$

In contrast to criteria of the form $C_F(\gamma)$ with prespecified fixed $F$, $C_{MML}$ uses an adaptive penalty $F(\hat{c}, \hat{w})$ that is implicitly based on the estimated distribution of the regression coefficients.

Estimating $\beta$ after selecting $\hat\gamma_{MML}$ might then proceed using

$$E(\beta_\gamma \mid Y, \hat{c}, \hat{w}, \sigma, \hat\gamma_{MML}) = \frac{\hat{c}}{1 + \hat{c}}\, \hat\beta_{\hat\gamma_{MML}}$$

A computable conditional maximum likelihood approximation $C_{CML}$ for the nonorthogonal case is available.

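Consider the simple model with $X = I$,

$$Y = \beta + \epsilon \quad \text{where } \epsilon \sim N_n(0, I)$$

where $\beta = (\beta_1, \ldots, \beta_p)'$ is such that

$$\beta_1, \ldots, \beta_q \ \text{iid} \sim N(0, c)$$
$$\beta_{q+1}, \ldots, \beta_p \equiv 0$$

For $p = n = 1000$ and fixed values of $c$ and $q$, $Y$ was simulated from the above model.

Evaluate $\hat\gamma$ by estimating

$$R(\beta, \hat\beta_{\hat\gamma}) \equiv E_{c,q} \sum_i (Y_i I[x_i \in \hat\gamma] - \beta_i)^2$$

Figures 1ab and 2 illustrate the adaptive advantages of the empirical Bayes selection criteria.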

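[Figure: loss (vertical axis, 0 to 3000) for the selection criteria MML, CML, AIC/Cp, BIC, RIC, CBIC and MRIC]

[Figure: loss (vertical axis, 0 to 300) for the selection criteria MML, CML, BIC, RIC/CBIC and MRIC]

[Figure: loss (vertical axis, 0 to 3000) for the selection criteria MML, CML, AIC/Cp, BIC, RIC, CBIC and MRIC]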


Chipman, H., George, E.I. and McCulloch, R.E. (2001). The Practical Implementation of Bayesian Model Selection (with discussion). In Model Selection (P. Lahiri, ed.) IMS Lecture Notes Monograph Series, Volume 38, 65-134.

George, E.I. and Foster, D.P. (2000) Calibration and empirical Bayes variable selection. Biometrika 87, 731-748.

1. Estimating a Normal Mean: A Brief History

Observe $X \mid \mu \sim N_p(\mu, I)$ and estimate $\mu$ by $\hat\mu$ under

$$R_Q(\mu, \hat\mu) = E_\mu \|\hat\mu(X) - \mu\|^2$$

$\hat\mu_{MLE}(X) = X$ is the MLE, best invariant and minimax with constant risk.

Shocking Fact: $\hat\mu_{MLE}$ is inadmissible when $p \ge 3$. (Stein 1956)

Bayes rules are a good place to look for improvements.

For a prior $\pi(\mu)$, the Bayes rule $\hat\mu_\pi(X) = E_\pi(\mu \mid X)$ minimizes $E_\pi R_Q(\mu, \hat\mu)$.

Remark: The (formal) Bayes rule under $\pi_U(\mu) \equiv 1$ is $\hat\mu_U(X) \equiv \hat\mu_{MLE}(X) = X$.

[Figure: The Risk Functions of Two Minimax Estimators]

$\hat\mu_H(X)$, the Bayes rule under the Harmonic prior

$$\pi_H(\mu) = \|\mu\|^{-(p-2)},$$

dominates $\hat\mu_U$ when $p \ge 3$. (Stein 1974)

$\hat\mu_a(X)$, the Bayes rule under $\pi_a(\mu)$ where

$$\mu \mid s \sim N_p(0, s I), \quad s \sim (1 + s)^{a-2},$$

dominates $\hat\mu_U$ and is proper Bayes when $p = 5$ and $a \in [.5, 1)$ or when $p \ge 6$ and $a \in [0, 1)$. (Strawderman 1971)

A Unifying Phenomenon: These domination results can be attributed to properties of the marginal distribution of $X$ under $\pi_H$ and $\pi_a$.

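The Bayes rule under $\pi(\mu)$ can be expressed as

$$\hat\mu_\pi(X) = E_\pi(\mu \mid X) = X + \nabla \log m_\pi(X)$$

where

$$m_\pi(X) \propto \int e^{-\|X - \mu\|^2/2}\, \pi(\mu)\, d\mu$$

is the marginal of $X$ under $\pi(\mu)$. ($\nabla = (\partial/\partial x_1, \ldots, \partial/\partial x_p)'$) (Brown 1971)

The risk improvement of $\hat\mu_\pi(X)$ over $\hat\mu_U(X)$ can be expressed as

$$R_Q(\mu, \hat\mu_U) - R_Q(\mu, \hat\mu_\pi) = E_\mu\left[ \|\nabla \log m_\pi(X)\|^2 - 2\, \frac{\nabla^2 m_\pi(X)}{m_\pi(X)} \right] = E_\mu\left[ -4\, \frac{\nabla^2 \sqrt{m_\pi(X)}}{\sqrt{m_\pi(X)}} \right]$$

($\nabla^2 = \sum_i \partial^2/\partial x_i^2$) (Stein 1974, 1981)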

That $\hat\mu_H(X)$ dominates $\hat\mu_U$ when $p \ge 3$ follows from the fact that the marginal $m_\pi(X)$ under $\pi_H$ is superharmonic, i.e.

$$\nabla^2 m_\pi(X) \le 0$$

That $\hat\mu_a(X)$ dominates $\hat\mu_U$ when $p \ge 5$ (and conditions on $a$) follows from the fact that the square root of the marginal under $\pi_a$ is superharmonic, i.e.

$$\nabla^2 \sqrt{m_\pi(X)} \le 0$$

(Fourdrinier, Strawderman and Wells 1998)

$\hat{p}_H(y \mid x)$, the Bayes rule under the Harmonic prior

$$\pi_H(\mu) = \|\mu\|^{-(p-2)},$$

dominates $\hat{p}_U(y \mid x)$ when $p \ge 3$. (Komaki 2001)

$\hat{p}_a(y \mid x)$, the Bayes rule under $\pi_a(\mu)$ where

$$\mu \mid s \sim N_p(0, s v_0 I), \quad s \sim (1 + s)^{a-2},$$

dominates $\hat{p}_U(y \mid x)$ and is proper Bayes when $v_x \le v_0$ and when $p = 5$ and $a \in [.5, 1)$ or when $p \ge 6$ and $a \in [0, 1)$. (Liang 2002)

Main Question: Are these domination results attributable to the properties of $m_\pi$?

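5. An Analogue of Stein's Unbiased Estimate of Risk

Theorem:

$$\frac{\partial}{\partial v} E_{\mu,v} \log m_\pi(Z; v) = E_{\mu,v}\left[ \frac{\nabla^2 m_\pi(Z; v)}{m_\pi(Z; v)} - \frac{1}{2} \|\nabla \log m_\pi(Z; v)\|^2 \right] = E_{\mu,v}\left[ 2\, \frac{\nabla^2 \sqrt{m_\pi(Z; v)}}{\sqrt{m_\pi(Z; v)}} \right]$$

Proof relies on using the heat equation

$$\frac{\partial}{\partial v} m_\pi(z; v) = \frac{1}{2} \nabla^2 m_\pi(z; v)$$

Remark: This shows that the risk improvement in the quadratic risk estimation problem can be expressed in terms of $\log m_\pi$ as

$$R_Q(\mu, \hat\mu_U) - R_Q(\mu, \hat\mu_\pi) = -2 \left[ \frac{\partial}{\partial v} E_{\mu,v} \log m_\pi(Z; v) \right]_{v=1}$$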

6. General Conditions for Minimax Prediction

Let m(z; v) be the marginal distribution of Z| Np(,vI)under().

Theorem: If m(z; v) is finite for all z, then p(y| x) will beminimax if either of the following hold:

(i) m(z; v) is superharmonic(ii) m(z; v) is superharmonic

Corollary: If m(z; v) is finite for all z, then p(y| x) will beminimax if() is superharmonic

p(y | x) will dominate pU(y | x) in the above results if the super-harmonicity is strict on some interval.

7. Sufficient Conditions for Admissibility

Theorem (Blyth's Method): If there is a sequence of finite non-negative measures satisfying $\pi_n(\{\mu : \|\mu\| \le 1\}) \ge 1$ such that

$$E_{\pi_n}[R_{KL}(\mu, q)] - E_{\pi_n}[R_{KL}(\mu, p_{\pi_n})] \to 0$$

then $q(y \mid x)$ is admissible.

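Theorem: For any two Bayes rules $p_\pi$ and $p_{\pi_n}$

$$E_{\pi_n}[R_{KL}(\mu, p_\pi)] - E_{\pi_n}[R_{KL}(\mu, p_{\pi_n})] = \frac{1}{2} \int_{v_w}^{v_x} \int \frac{\|\nabla h_n(z; v)\|^2}{h_n(z; v)}\, m_\pi(z; v)\, dz\, dv$$

where $h_n(z; v) = m_{\pi_n}(z; v)/m_\pi(z; v)$.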

Using the explicit construction of $\pi_n(\mu)$ from Brown and Hwang (1984), we obtain tail behavior conditions that prove admissibility of $p_U(y \mid x)$ when $p \le 2$, and admissibility of $p_H(y \mid x)$ when $p \ge 3$.

8. Minimax Shrinkage Towards 0

Because $\pi_H$ and $\sqrt{m_a}$ are superharmonic under suitable conditions, the result that $p_H(y \mid x)$ and $p_a(y \mid x)$ dominate $p_U(y \mid x)$ and are minimax follows immediately from the Theorem.

By the Theorem, any of the improper superharmonic t-priors of Faith (1978) or any of the proper generalized t-priors of Fourdrinier, Strawderman and Wells (1998) yield Bayes rules that dominate $p_U(y \mid x)$ and are minimax.

The risk functions $R_{KL}(\mu, p_H)$ and $R_{KL}(\mu, p_a)$ take on their minima at $\mu = 0$, and then asymptote up to $R_{KL}(\mu, p_U)$ as $\|\mu\| \to \infty$.

Figure 1a displays the difference between the risk functions

$$R_{KL}(\mu, p_U) - R_{KL}(\mu, p_H)$$

at $\mu = (c, \ldots, c)$, $0 \le c \le 4$, when $v_x = 1$ and $v_y = 0.2$ for dimensions $p = 3, 5, 7, 9$.

Figure 1b displays the difference between the risk functions

$$R_{KL}(\mu, p_U) - R_{KL}(\mu, p_a)$$

at $\mu = (c, \ldots, c)$, $0 \le c \le 4$, when $a = 0.5$, $v_x = 1$ and $v_y = 0.2$ for dimensions $p = 3, 5, 7, 9$.

Figure 1a. The risk difference between $p_U$ and $p_H$: $R_{KL}(\mu, p_U) - R_{KL}(\mu, p_H)$. Here $\mu = (c, \ldots, c)$, $v_x = 1$, $v_y = 0.2$.


Figure 1b. The risk difference between $p_U$ and $p_a$ with $a = 0.5$: $R_{KL}(\mu, p_U) - R_{KL}(\mu, p_a)$. Here $\mu = (c, \ldots, c)$, $v_x = 1$, $v_y = 0.2$.


Our Lemma representation

$$p_H(y \mid x) = \frac{m_H(w; v_w)}{m_H(x; v_x)}\, p_U(y \mid x)$$

shows how $p_H(y \mid x)$ shrinks $p_U(y \mid x)$ towards 0 by an adaptive multiplicative factor of the form

$$b_H(x, y) = \frac{m_H(w; v_w)}{m_H(x; v_x)}.$$

Figure 2 illustrates how this shrinkage occurs for various values of $x$ when $p = 5$.
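
A minimal numerical sketch of the factor $b_H(x, y)$ follows. It rests on two assumptions that this section does not state. First, it takes $w = (v_y x + v_x y)/(v_x + v_y)$ and $v_w = v_x v_y/(v_x + v_y)$. Second, it evaluates $m_H$ up to a constant through the scale-mixture form $m_H(z; v) \propto \int_0^{1/v} \lambda^{p/2-2} e^{-\lambda \|z\|^2/2}\, d\lambda$, obtained by writing $\|\mu\|^{-(p-2)}$ as a mixture of $N_p(0, sI)$ densities over $s$. The constant cancels in the ratio. Simpson's rule and the function names are illustrative choices.

```python
import math

def m_H(z, v, n=4000):
    """Unnormalized marginal of Z ~ N_p(mu, v I) under pi_H(mu) = ||mu||^{-(p-2)}.

    Integrates lam^(p/2 - 2) * exp(-lam * ||z||^2 / 2) over lam in [0, 1/v]
    with composite Simpson's rule (n must be even). Intended for p >= 4,
    where the integrand is bounded at lam = 0.
    """
    p = len(z)
    r = 0.5 * sum(zi * zi for zi in z)
    f = lambda lam: lam ** (p / 2.0 - 2.0) * math.exp(-r * lam)
    b = 1.0 / v
    h = b / n
    total = f(0.0) + f(b)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(i * h)
    return total * h / 3.0

def shrink_factor(x, y, vx, vy):
    """b_H(x, y) = m_H(w; v_w) / m_H(x; v_x)."""
    # Assumed definitions of w and v_w (not stated in this section).
    w = [(vy * xi + vx * yi) / (vx + vy) for xi, yi in zip(x, y)]
    vw = vx * vy / (vx + vy)
    return m_H(w, vw) / m_H(x, vx)

x = [2.0, 0.0, 0.0, 0.0, 0.0]
# y at the origin: the factor exceeds 1, raising density near 0.
near = shrink_factor(x, [0.0] * 5, vx=1.0, vy=0.2)
# y far from the origin: the factor is below 1, lowering density there.
far = shrink_factor(x, [10.0, 0.0, 0.0, 0.0, 0.0], vx=1.0, vy=0.2)
```

The factor inflates $p_U(y \mid x)$ for $y$ near the origin and deflates it for $y$ far away, which is the adaptive shrinkage towards 0 described above.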

Figure 2. Shrinkage of $p_U(y \mid x)$ to obtain $p_H(y \mid x)$ when $p = 5$. Here $y = (y_1, y_2, \ldots)$. Panels: $x = (2, 0, 0, 0, 0)$, $x = (3, 0, 0, 0, 0)$, $x = (4, 0, 0, 0, 0)$.


9. Shrinkage Towards Points or Subspaces

We can trivially modify the previous priors and predictive distributions to shrink towards an arbitrary point $b \in R^p$.

Consider the recentered prior

$$\pi^b(\mu) = \pi(\mu - b)$$

and corresponding recentered marginal

$$m_\pi^b(z; v) = m_\pi(z - b; v).$$

This yields a predictive distribution

$$p_\pi^b(y \mid x) = \frac{m_\pi^b(w; v_w)}{m_\pi^b(x; v_x)}\, p_U(y \mid x)$$

that now shrinks $p_U(y \mid x)$ towards $b$ rather than 0.

More generally, we can shrink $p_U(y \mid x)$ towards any subspace $B$ of $R^p$ whenever $\pi$, and hence $m_\pi$, is spherically symmetric.

Letting $P_B z$ be the projection of $z$ onto $B$, shrinkage towards $B$ is obtained by using the recentered prior

$$\pi^B(\mu) = \pi(\mu - P_B \mu)$$

which yields the recentered marginal

$$m_\pi^B(z; v) := m_\pi(z - P_B z; v).$$

This modification yields a predictive distribution

$$p_\pi^B(y \mid x) = \frac{m_\pi^B(w; v_w)}{m_\pi^B(x; v_x)}\, p_U(y \mid x)$$

that now shrinks $p_U(y \mid x)$ towards $B$. If $m_\pi^B(z; v)$ satisfies any of the conditions of the Theorem, then $p_\pi^B(y \mid x)$ will dominate $p_U(y \mid x)$ and be minimax.
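
The recentering step can be sketched in a few lines. The example below assumes $B$ is given by an orthonormal basis and uses, purely for illustration, the one-dimensional subspace spanned by $(1, \ldots, 1)$, so that $z - P_B z$ subtracts the mean of the coordinates. The `toy_marginal` function is a hypothetical spherically symmetric stand-in, not one of the marginals from these slides.

```python
import math

def project_onto_span(z, basis):
    """Orthogonal projection P_B z onto B = span(basis); basis assumed orthonormal."""
    proj = [0.0] * len(z)
    for u in basis:
        coef = sum(zi * ui for zi, ui in zip(z, u))
        proj = [pi + coef * ui for pi, ui in zip(proj, u)]
    return proj

def recenter(z, basis):
    """z - P_B z, the argument passed to the spherically symmetric marginal."""
    proj = project_onto_span(z, basis)
    return [zi - pi for zi, pi in zip(z, proj)]

def recentered_marginal(m, z, v, basis):
    """m^B(z; v) = m(z - P_B z; v) for a spherically symmetric marginal m."""
    return m(recenter(z, basis), v)

def toy_marginal(z, v):
    """Hypothetical spherically symmetric marginal, used only for illustration."""
    return math.exp(-sum(zi * zi for zi in z) / (2.0 * (v + 1.0)))

p = 4
basis = [[1.0 / math.sqrt(p)] * p]           # B = span{(1, ..., 1)}
residual = recenter([1.0, 2.0, 3.0, 6.0], basis)
on_subspace = recentered_marginal(toy_marginal, [3.0] * p, 1.0, basis)
```

Points lying in $B$ have zero residual, so the recentered marginal is largest there, which is what pulls the predictive distribution towards the subspace.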

10. Minimax Multiple Shrinkage Prediction

For any spherically symmetric prior, a set of subspaces $B_1, \ldots, B_N$, and corresponding probabilities $w_1, \ldots, w_N$, consider the recentered mixture prior

$$\pi_*(\mu) = \sum_{i=1}^N w_i\, \pi^{B_i}(\mu),$$

and corresponding recentered mixture marginal

$$m_*(z; v) = \sum_{i=1}^N w_i\, m_\pi^{B_i}(z; v).$$

Applying the $\hat\mu_\pi(X) = X + \nabla \log m_\pi(X)$ construction with $m_*(X; v)$ yields minimax multiple shrinkage estimators of $\mu$ (George 1986).

Applying the predictive construction with $m_*(z; v)$ yields

$$p_*(y \mid x) = \sum_{i=1}^N p(B_i \mid x)\, p_\pi^{B_i}(y \mid x)$$

where $p_\pi^{B_i}(y \mid x)$ is a single target predictive distribution and

$$p(B_i \mid x) = \frac{w_i\, m_\pi^{B_i}(x; v_x)}{\sum_{j=1}^N w_j\, m_\pi^{B_j}(x; v_x)}$$

is the posterior weight on the $i$th prior component.

Theorem: If each $m_\pi^{B_i}(z; v)$ is superharmonic, then $p_*(y \mid x)$ will dominate $p_U(y \mid x)$ and will be minimax.
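
The posterior weights $p(B_i \mid x)$ are simple to compute once the recentered marginals are available. The sketch below uses two point targets and, as a loudly labeled stand-in, the marginal under a $N_p(0, \tau I)$ prior. That stand-in is chosen for its closed form only, and no claim is made that it satisfies the superharmonicity condition of the Theorem. All names and numerical values are hypothetical.

```python
import math

def posterior_weights(x, vx, weights, recentered_marginals):
    """p(B_i | x) = w_i m^{B_i}(x; v_x) / sum_j w_j m^{B_j}(x; v_x)."""
    terms = [w * m(x, vx) for w, m in zip(weights, recentered_marginals)]
    total = sum(terms)
    return [t / total for t in terms]

def toy_marginal(z, v, tau=1.0):
    """Illustrative stand-in: the N_p(0, (v + tau) I) density."""
    p = len(z)
    s = v + tau
    sq = sum(zi * zi for zi in z)
    return (2.0 * math.pi * s) ** (-p / 2.0) * math.exp(-sq / (2.0 * s))

def recentered_at(b):
    """Marginal shifted to shrink towards the point b: m^b(z; v) = m(z - b; v)."""
    return lambda z, v: toy_marginal([zi - bi for zi, bi in zip(z, b)], v)

p = 3
b1, b2 = [2.0] * p, [-2.0] * p
x = [1.5, 2.2, 1.9]   # observation close to b1
post = posterior_weights(x, 1.0, [0.5, 0.5],
                         [recentered_at(b1), recentered_at(b2)])
```

With `x` near `b1`, almost all of the posterior weight falls on the first component, so the mixture predictive distribution behaves like the single-target one that shrinks towards `b1`.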

Figure 3 illustrates the risk reduction

$$R_{KL}(\mu, p_U) - R_{KL}(\mu, p_{H^*})$$

for $\mu = (c, \ldots, c)$ obtained by $p_{H^*}$, which adaptively shrinks $p_U(y \mid x)$ towards the closer of the two points $b_1 = (2, \ldots, 2)$ and $b_2 = (-2, \ldots, -2)$ using equal weights $w_1 = w_2 = 0.5$.

Figure 3. The risk difference between $p_U$ and multiple shrinkage $p_{H^*}$: $R_{KL}(\mu, p_U) - R_{KL}(\mu, p_{H^*})$. Here $\mu = (c, \ldots, c)$, $v_x = 1$, $v_y = 0.2$, $a_1 = 2$, $a_2 = -2$, $w_1 = w_2 = 0.5$.


11. The Case of Unknown Variance

If $v_x$ and $v_y$ are unknown, suppose there exists an available independent estimate of $v_x$ of the form $s/k$ where

$$S \sim v_x\, \chi^2_k.$$

Also assume that $v_y = r\, v_x$, for a known constant $r$.

Substitute the estimates $\hat v_x = s/k$, $\hat v_y = r s/k$ and $\hat v_w = \frac{r}{r+1}\, s/k$ for $v_x$, $v_y$ and $v_w$ respectively.

The predictor

$$p_\pi(y \mid x) = \frac{m_\pi(w; \hat v_w)}{m_\pi(x; \hat v_x)}\, p_U(y \mid x)$$

will still dominate $p_U(y \mid x)$ if any of the conditions of the Theorem are satisfied.

Note however, $p_U(y \mid x)$ is no longer best invariant or minimax.
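
The plug-in step amounts to three lines of arithmetic. The sketch below computes the estimates and checks them against the relation $v_w = v_x v_y/(v_x + v_y)$, which is an assumption not stated in this section. With $v_y = r v_x$ that relation reduces to $v_w = \frac{r}{r+1} v_x$, matching the estimate above. The function name and the numerical inputs are hypothetical.

```python
def plug_in_variances(s, k, r):
    """Return (v_x_hat, v_y_hat, v_w_hat) given s ~ v_x * chi^2_k and v_y = r * v_x."""
    vx_hat = s / k
    vy_hat = r * s / k
    vw_hat = r / (r + 1.0) * s / k
    return vx_hat, vy_hat, vw_hat

# Illustrative inputs: s = 12 on k = 10 degrees of freedom, known ratio r = 0.2.
vx_hat, vy_hat, vw_hat = plug_in_variances(s=12.0, k=10, r=0.2)
# Assumed relation v_w = v_x v_y / (v_x + v_y), evaluated at the estimates.
vw_check = vx_hat * vy_hat / (vx_hat + vy_hat)
```

Because all three estimates are multiples of the single statistic $s/k$, the ratio $\hat v_w / \hat v_x = r/(r+1)$ is fixed and does not depend on the data.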


Brown, L.D., George, E.I. and Xu, X. (2005). Admissible Predictive Estimation. Working paper.

George, E.I., Liang, F. and Xu, X. (2005). Improved Minimax Predictive Densities under Kullback-Leibler Loss. Annals of Statistics, to appear.
