Lecture 9: Model Comparison using MCMC and further models

Transcript of Lecture 9: Model Comparison using MCMC and further models.

Page 1

Lecture 9

Model Comparison using MCMC and further models

Page 2

Lecture Contents

• Model comparison

• DIC diagnostic

• Random slopes regression model

• Priors for variance matrices

• MLwiN RSR demonstration

• Other predictor variables

• DIC in WinBUGS

Page 3

Bayesian Model Comparison

• In Bayesian statistics, model comparison is a thorny issue!

• In MLwiN we used to suggest running IGLS for model selection and then MCMC on your chosen model.

Why is it a thorny issue?

• The posterior f(θ|Y) does not allow criticism of the model in light of the observed data, nor comparison amongst models.

• It is f(Y) that can be used to assess model performance.

• Regardless of the model, f(Y) is a density over the space of observables which can be compared with what was actually observed.

Page 4

Bayes Factors

• If we observe $Y_{OBS}$ and have 2 models $M_1$ and $M_2$ then the Bayes factor is

$$BF = \frac{f(Y_{OBS} \mid M_1)}{f(Y_{OBS} \mid M_2)}.$$

This provides the relative weight of evidence for model $M_1$ compared to model $M_2$.

A rough calibration of the Bayes factor has been proposed:

BF         <1    1-3    3-12   12-150   >150
Evidence   -ve   Weak   +ve    Strong   V. Strong

Page 5

Problems with Bayes Factor

1. When the prior is vague, f(θ) is improper. This implies that even though f(θ|Y) may be proper, f(Y) is improper, so Bayes factors cannot be used!

2. Computation of the Bayes factor itself requires high-dimensional integration.

3. Lindley’s paradox – the data point to rejection but the prior is diffuse, so the denominator of the Bayes factor is much smaller than the numerator and too much weight is given to parsimonious models.

Page 6

Other related ideas

• Prior predictive distributions f(Y).

• Cross-validation predictive distributions f(y_r | Y_(r)).

• Posterior predictive distributions f(Y | Y_OBS).

• Model uncertainty – where the model is itself a parameter to be estimated.

• Bayesian model averaging.

• Reversible jump MCMC.

Page 7

Model Comparison for random effect models

• As we will typically use diffuse priors, Bayes factors are not an option here.

• The methods listed previously are possibilities but not built into software packages.

• The Deviance Information Criterion (DIC) is one possibility but is it a saviour for Bayesian model choice or a white elephant?

Page 8

DIC – Spiegelhalter et al. (2002)

Plus points:

1. The discussion paper proposing it was written by leading figures in Bayesian modelling.

2. It is available in both MLwiN and WinBUGS for standard models.

Minus points:

The paper was given a very mixed reception at the RSS when it was discussed!

Page 9

DIC

A natural way to compare models is to use a criterion based on a trade-off between the fit of the data to the model and the corresponding complexity of the model.

DIC does this in a Bayesian way.

DIC = ‘goodness of fit’ + ‘complexity’.

Fit is measured by the deviance

$$D(\theta) = -2 \log L(\text{data} \mid \theta).$$

Complexity is measured by an estimate of the ‘effective number of parameters’, defined as

$$p_D = E_{\theta|y}[D] - D(E_{\theta|y}[\theta]) = \bar{D} - D(\bar{\theta}),$$

i.e. the posterior mean deviance minus the deviance evaluated at the posterior mean of the parameters.

Page 10

DIC (continued)

• The DIC is then defined analogously to AIC as

$$DIC = D(\bar{\theta}) + 2p_D = \bar{D} + p_D.$$

Models with smaller DIC are better supported by the data.

• DIC can be monitored in WinBUGS from the Inference/DIC menu.

• DIC is available in MLwiN under the Model/MCMC menu.
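To see how these quantities fit together, here is a minimal Python sketch of computing Dbar, D(thetabar), pD and DIC from posterior draws. The toy Normal-with-known-variance example and all variable names are our own illustration, not MLwiN or WinBUGS output; with only the mean unknown, pD should come out close to 1.

import numpy as np

rng = np.random.default_rng(0)

def deviance(y, mu, sigma2):
    # D(theta) = -2 log L for an iid Normal(mu, sigma2) likelihood
    return np.sum(np.log(2 * np.pi * sigma2) + (y - mu) ** 2 / sigma2)

# Toy example: Normal data with known variance and a flat prior on mu,
# so the posterior for mu is Normal(ybar, sigma2/n) and can be sampled
# directly (standing in for retained MCMC draws).
sigma2 = 4.0
y = rng.normal(1.0, np.sqrt(sigma2), size=200)
mu_draws = rng.normal(y.mean(), np.sqrt(sigma2 / len(y)), size=5000)

D_draws = np.array([deviance(y, mu, sigma2) for mu in mu_draws])
Dbar = D_draws.mean()                             # posterior mean deviance
Dthetabar = deviance(y, mu_draws.mean(), sigma2)  # deviance at posterior mean
pD = Dbar - Dthetabar                             # effective no. of parameters (~1 here)
DIC = Dbar + pD                                   # equivalently D(thetabar) + 2*pD
print(Dbar, Dthetabar, pD, DIC)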

Page 11

Education dataset

• We can fit a simple (Bayesian) linear regression in MLwiN

• The DIC output is as follows:

Param         Value
Dbar          9763.54
D(thetabar)   9760.51
pD            3.02
DIC           9766.56

Note that pD ≈ 3, the actual number of parameters.

Page 12

Variance components model

Here we consider the random intercepts model from the earlier practicals.

This is the parallel lines model.

Page 13

Change in DIC

Model        Dbar      Dthetabar   PD      DIC
Regression   9763.54   9760.51     3.02    9766.56
VC           9209.15   9149.16     59.98   9269.13

Here we see the clear improvement from fitting random effects for school.

Note that the effective number of parameters is ~60, compared with 68 actual parameters in the model, because the school effects are random rather than fixed.

Page 14

Random slopes model (crossing lines)

$$y_{ij} = \beta_{0j} + \beta_{1j}x_{1ij} + e_{ij}$$
$$\beta_{0j} = \beta_0 + u_{0j}, \qquad \beta_{1j} = \beta_1 + u_{1j}$$
$$\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim N(0, \Omega_u), \qquad \Omega_u = \begin{pmatrix} \sigma_{u0}^2 & \\ \sigma_{u01} & \sigma_{u1}^2 \end{pmatrix}$$
$$e_{ij} \sim N(0, \sigma_e^2)$$

[Figure: crossing prediction lines for school 1 and school 2 around the average line $\beta_0 + \beta_1 x_{1ij}$, with school-level departures $u_{0,1}, u_{1,1}$ and $u_{0,2}, u_{1,2}$ in intercept and slope; x-axis ticks at -3, 0, 1, 3.]
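To make this structure concrete, here is a minimal Python sketch that simulates data from the random slopes model above. The parameter values are invented for illustration (loosely echoing the scale of the education-data estimates reported later in the lecture) and are not taken from the slides.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions and parameters: 65 schools, 30 pupils each
J, n_j = 65, 30
beta = np.array([0.0, 0.5])            # fixed intercept and slope
Omega_u = np.array([[0.090, 0.018],    # level-2 variance matrix
                    [0.018, 0.015]])
sigma2_e = 0.55                        # level-1 variance

school = np.repeat(np.arange(J), n_j)  # school index for each pupil
x = rng.standard_normal(J * n_j)       # standardised predictor
u = rng.multivariate_normal(np.zeros(2), Omega_u, size=J)  # (u0j, u1j)
y = (beta[0] + u[school, 0]) + (beta[1] + u[school, 1]) * x \
    + rng.normal(0.0, np.sqrt(sigma2_e), J * n_j)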

Page 15

Fitting an RSR in a Bayesian Framework

The basic random slopes regression model is as follows:

$$y_{ij} = \beta_{0j} + \beta_{1j}x_{1ij} + e_{ij}$$
$$\beta_{0j} = \beta_0 + u_{0j}, \qquad \beta_{1j} = \beta_1 + u_{1j}$$
$$\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim N(0, \Omega_u), \qquad e_{ij} \sim N(0, \sigma_e^2)$$

To this model we need to add priors for $\beta$, $\Omega_u$ and $\sigma_e^2$.

Page 16

Wishart priors

For a (k×k) variance matrix parameter in a Normal likelihood, the conjugate prior is the inverse Wishart distribution with parameters ν and S:

$$p(\Sigma) = \frac{|S|^{\nu/2}\,|\Sigma|^{-(\nu+k+1)/2}\exp\!\left(-\tfrac{1}{2}\mathrm{tr}(S\Sigma^{-1})\right)}{2^{\nu k/2}\,\pi^{k(k-1)/4}\,\prod_{i=1}^{k}\Gamma\!\left(\tfrac{\nu+1-i}{2}\right)}$$

This distribution looks complex but is simply a multivariate generalisation of the inverse Gamma distribution.
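The connection to the inverse Gamma can be checked numerically. Below is a short sketch assuming scipy's (df, scale) parameterisation of the inverse Wishart; in one dimension it collapses exactly to an inverse Gamma.

import numpy as np
from scipy.stats import invwishart, invgamma

# For k = 1, InvWishart(df=nu, scale=s) matches InvGamma(nu/2, scale=s/2)
nu, s = 5, 2.0
x = np.linspace(0.1, 3.0, 6)
print(invwishart.pdf(x, df=nu, scale=s))
print(invgamma.pdf(x, a=nu / 2, scale=s / 2))  # identical values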

Page 17

Wishart prior for $\Omega_u^{-1}$

In MLwiN we use a Wishart prior for the precision matrix $\Omega_u^{-1}$ (equivalently, an inverse Wishart prior for the variance matrix $\Omega_u$):

$$p(\Omega_u^{-1}) \sim \text{Wishart}_p(p, p\hat{\Omega}_u),$$

where $\hat{\Omega}_u$ is a prior estimate of $\Omega_u$.

Note that this is a weakly informative prior, as the first parameter represents the prior sample size and is set to the smallest feasible value. Browne and Draper have looked at alternative Wishart priors, as well as a Uniform prior, in simulation studies.

Page 18

Gibbs Sampling algorithm for RSR model

Repeat the following four steps (a sketch in code follows the list):

1. Generate β from its (Multivariate) Normal conditional distribution.

2. Generate each u_j from its (Multivariate) Normal conditional distribution.

3. Generate Ω_u^{-1} from its Wishart conditional distribution.

4. Generate 1/σ_e^2 from its Gamma conditional distribution.
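Below is a minimal Python sketch of this Gibbs sampler, not MLwiN's implementation: it assumes flat priors for β (as in the WinBUGS code later), a Gamma(a, b) prior for 1/σ_e², and a Wishart prior for Ω_u⁻¹ with prior degrees of freedom nu0 and inverse scale S0 (taking nu0 = p and S0 = p·Ω̂_u gives the MLwiN-style prior above). All function and variable names are our own.

import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(1)

def gibbs_rsr(y, x, school, n_iter=5000, a=0.001, b=0.001, nu0=2, S0=np.eye(2)):
    # Gibbs sampler for y_ij = b0 + b1*x_ij + u0j + u1j*x_ij + e_ij,
    # with school coded 0..J-1
    N, J = len(y), school.max() + 1
    X = np.column_stack([np.ones(N), x])   # fixed-part design; level-2 design Z = X
    groups = [np.where(school == j)[0] for j in range(J)]
    beta, u = np.zeros(2), np.zeros((J, 2))
    Omega_inv, tau = np.eye(2), 1.0        # tau = 1/sigma_e^2
    out = {"beta": [], "Omega_u": [], "sigma2_e": []}
    XtX = X.T @ X
    for _ in range(n_iter):
        # 1. beta | rest : multivariate Normal (flat prior)
        ystar = y - np.einsum("ij,ij->i", X, u[school])
        beta = rng.multivariate_normal(np.linalg.solve(XtX, X.T @ ystar),
                                       np.linalg.inv(XtX) / tau)
        # 2. each u_j | rest : multivariate Normal
        r = y - X @ beta
        for j, idx in enumerate(groups):
            Zj = X[idx]
            Dj = np.linalg.inv(tau * Zj.T @ Zj + Omega_inv)
            u[j] = rng.multivariate_normal(tau * Dj @ Zj.T @ r[idx], Dj)
        # 3. Omega_u^{-1} | rest : Wishart
        Omega_inv = wishart.rvs(df=nu0 + J, scale=np.linalg.inv(S0 + u.T @ u))
        # 4. 1/sigma_e^2 | rest : Gamma (numpy uses shape/scale, so scale = 1/rate)
        e = r - np.einsum("ij,ij->i", X, u[school])
        tau = rng.gamma(a + N / 2, 1.0 / (b + 0.5 * e @ e))
        out["beta"].append(beta)
        out["Omega_u"].append(np.linalg.inv(Omega_inv))
        out["sigma2_e"].append(1.0 / tau)
    return {k: np.array(v) for k, v in out.items()}

Discard an initial burn-in from the returned draws before summarising; applied to data simulated from the earlier sketch, the posterior means should sit close to the generating values.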

Page 19

Bayesian RSR Model for education dataset

Note that the IGLS estimates are used in the prior. The variance (posterior mean) estimates are bigger than the IGLS estimates.

Page 20

DIC for RSR model

Model   Dbar      Dthetabar   PD      DIC
RSR     9122.99   9031.32     91.67   9214.65
VC      9209.15   9149.16     59.98   9269.13

As with the frequentist approach, the random slopes model is an improvement over the random intercepts model. The additional 65 random parameters add only ~32 effective parameters.

Page 21

Trajectories for the RSR model

Page 22

MCMC Diagnostics for $\Omega_{u00}$

Page 23

Predictions for the RSR model with highlighted data

• Here the top and bottom schools are highlighted:

Page 24

Residuals for the RSR

Individually:

and pairwise:

Page 25

Uniform Priors

Here the level 2 variance estimates increase, as in Browne and Draper (2000).

Browne and Draper found that the Wishart priors were preferable, although the use of the IGLS estimate is not strictly Bayesian as we are using the data twice!

Page 26

Other predictors in the education dataset

This dataset has other predictors such as gender and school gender that can be considered in the practical.

In the next slide we see the Equations window for a model with these added, which has DIC 9189.26, a reduction of over 25 compared with the earlier RSR model.

Page 27

RSR + gender effects

Page 28

WinBUGS RSR & gender

model {
    # Level 1 definition
    for (i in 1:N) {
        normexam[i] ~ dnorm(mu[i], tau)
        mu[i] <- beta[1] * cons[i] + beta[2] * standlrt[i] +
                 beta[3] * girl[i] + beta[4] * boysch[i] +
                 beta[5] * girlsch[i] +
                 u2[school[i],1] * cons[i] + u2[school[i],2] * standlrt[i]
    }
    # Higher level definitions
    for (j in 1:n2) {
        u2[j,1:2] ~ dmnorm(zero2[1:2], tau.u2[1:2,1:2])
    }
    # Priors for fixed effects
    for (k in 1:5) { beta[k] ~ dflat() }
    # Priors for random terms
    tau ~ dgamma(0.001000, 0.001000)
    sigma2 <- 1/tau
    for (i in 1:2) { zero2[i] <- 0 }
    tau.u2[1:2,1:2] ~ dwish(R2[1:2,1:2], 2)
    sigma2.u2[1:2,1:2] <- inverse(tau.u2[,])
}

Here we see the WinBUGS code for our last model.

Notice how the MVN and Wishart distributions are specified in WinBUGS.

Page 29

DIC in WinBUGS

In WinBUGS DIC is available from the Inference menu:

DIC monitoring is set after the burn-in, and the DIC button is then pressed after running, giving:

Dbar = post.mean of -2logL; Dhat = -2logL at post.mean of stochastic nodes

           Dbar       Dhat       pD       DIC
normexam   9098.590   9007.960   90.631   9189.220
total      9098.590   9007.960   90.631   9189.220

Page 30

Parameter estimates in WinBUGS

• Note that WinBUGS gives similar estimates to MLwiN for this model. Note that for the fixed effects β, WinBUGS indexes from 1 while MLwiN indexes from 0.

node             mean      sd        (2.5%, 97.5%)
beta[1]          -0.19     0.05236   (-0.2996, -0.08937)
beta[2]          0.5539    0.01971   (0.5145, 0.5911)
beta[3]          0.1687    0.03399   (0.1019, 0.2349)
beta[4]          0.1775    0.1008    (-0.02581, 0.3781)
beta[5]          0.175     0.08212   (0.004677, 0.3318)
sigma2           0.5511    0.0125    (0.5272, 0.576)
sigma2.u2[1,1]   0.08777   0.01885   (0.05745, 0.1305)
sigma2.u2[1,2]   0.02141   0.00719   (0.009262, 0.0372)
sigma2.u2[2,2]   0.01545   0.00460   (0.008271, 0.02603)

Page 31

Next Practical

The next practical is free-ranging:

• You can follow the MLwiN chapter on RSR models that is given.

• You can try out RSR models in WinBUGS.

• You can try out fitting random effect models to the orthodont dataset using MCMC.

• You can try out DIC on other models.