Model selection and averaging – What, how, why?
Solve Sæbø, IKBM, UMB
Universitetet for miljø- og biovitenskap (UMB)
Outline
– Statistical background
  • Models
  • Parameter estimation
  • Hypothesis tests
  • Likelihood based tests
– Model selection
  • In what way should a model be best?
  • Choosing a criterion (RSS, Cp, R2, AIC, BIC, CIC, DIC, EIC, FIC, …)
  • What about prediction?
– Model averaging
  • Why? How?
www.umb.no
Some statistics
Typically we have gathered some data, say, a response variable y and a set of p predictor variables for n observations.
Example:
y = log(Weight of bears)
x's = Length, Sex, HeadLength, HeadWidth, NeckGirth, ChestGirth
We would like to fit a model relating the response to the predictors in a ”best way”.
Purpose:
1. To learn about bear physiology, and/or
2. To obtain a prediction model for Weight
A model
A statistical model describes the probability distribution of our data. A general model for our response may be
y = f(x; θ) + ε
describing how our response depends on the predictors x and a parameter set θ.
Example:
y = β0 + β1x1 + … + βpxp + ε,  with ε ~ N(0, σ²),
hence, a linear regression model with normally distributed errors.
The likelihood function
If we have n independent observations, the joint distribution of our data is given by
f(y1, …, yn; θ) = ∏i f(yi; θ)
which for our example equals
∏i (2πσ²)^(−1/2) exp(−(yi − β0 − β1x1i − … − βpxpi)² / (2σ²))
If we consider the data (y1, …, yn) as given and the parameters as variables, this function is known as the likelihood function, L(θ), for the parameters.
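As a small numerical sketch (toy data assumed, not from the slides), the normal log-likelihood can be evaluated directly; viewed as a function of μ it peaks at the sample mean:

```python
import math

def normal_loglik(mu, sigma, y):
    """Log-likelihood of i.i.d. N(mu, sigma^2) data: the sum of log densities."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (yi - mu)**2 / (2 * sigma**2) for yi in y)

y = [4.2, 5.1, 3.8, 4.9, 4.5]    # hypothetical observations
ybar = sum(y) / len(y)           # sample mean = 4.5
# The log-likelihood in mu is a concave quadratic maximized at ybar:
assert normal_loglik(ybar, 1.0, y) > normal_loglik(ybar + 0.5, 1.0, y)
assert normal_loglik(ybar, 1.0, y) > normal_loglik(ybar - 0.5, 1.0, y)
```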
Parameter estimation
Given a model and data we want to find an estimate for the unknown θ.
There are MANY ways to estimate parameters…
Each method optimizes some criterion reflecting an opinion of what is ”best”!
Examples: Find those parameter values which:
– Least squares: minimize the residual sum of squares ∑i (yi − ŷi)²
– Maximum likelihood: maximize L(θ)
– Bayes: minimize overall ”risk”
– Etc…
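A minimal sketch of the first two criteria (toy data assumed): for i.i.d. normal data, minimizing the residual sum of squares and maximizing the likelihood over a grid of candidate values pick out the same estimate of μ, the sample mean.

```python
import math

y = [2.0, 3.5, 2.5, 4.0]                     # hypothetical observations
grid = [i / 100 for i in range(100, 501)]    # candidate mu values 1.00 .. 5.00

rss = lambda mu: sum((yi - mu) ** 2 for yi in y)
loglik = lambda mu: sum(-0.5 * math.log(2 * math.pi)
                        - (yi - mu) ** 2 / 2 for yi in y)

mu_ls = min(grid, key=rss)      # least squares: minimize RSS
mu_ml = max(grid, key=loglik)   # maximum likelihood: maximize L
assert mu_ls == mu_ml == 3.0    # both equal the sample mean here
```

The two criteria agree only because the error model is normal; other error distributions would separate them.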
”Goodness of fit” and testing
After estimating the parameters we typically want to assess the ”goodness of fit” of the model and answer questions like:
May the model be simplified? Should it be more complex? Which variables are important?
Before we proceed we must discuss hypothesis tests in general.
Parameter space
We assume in general that θ ∈ Ω, that is, our unknown parameters are part of some parameter space Ω. E.g. for a normally distributed variable we can assume that this space is two-dimensional, consisting of all values of θ = (μ, σ²) for which −∞ < μ < ∞ and σ² > 0.
Restricted parameter space (simplified model)
Under H0 we assume that θ ∈ ω, such that H0 imposes a restriction on the parameter space. This means that ω ⊂ Ω, meaning that the parameter space under H0 is a subspace of Ω. E.g. ω is defined by μ = μ0 and σ² > 0.
Under H1 we assume that θ ∈ Ω, that is, μ may take any value and σ² > 0.
Hypothesis tests, H0 vs H1
In general, tests rely on test statistics: functions of our data.
The distribution of the test statistic must be known under H0.
This is because we want to compute the probability of observing the given test statistic under H0.
Wald type test: If your data are very unlikely under H0 (small p-value), then reject H0.
Likelihood based tests: If the likelihood of the data is significantly larger under H1, then reject H0.
Example: Test of expectation in normal distribution
Assume yi ~ N(μ, σ²) for i = 1, …, n
H0: μ = μ0 vs H1: μ ≠ μ0
Estimators:
1. Least squares: μ̂ = ȳ
2. Maximum likelihood: μ̂ = ȳ
Wald test:
Reject H0 if |ȳ − μ0| / (s/√n) exceeds the critical value of its reference t-distribution.
Likelihood ratio test (approximate):
Reject H0 if −2 log λ exceeds the critical value of the χ²-distribution with 1 degree of freedom, where λ is the ratio of maximum likelihoods under H0 and H1, respectively.
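The connection between the two tests can be checked numerically. In this hypothetical sketch σ is assumed known (so the Wald statistic is a z-statistic); then −2 log λ equals z² exactly, and ”reject if |z| > 1.96” and ”reject if −2 log λ > 1.96²” are the same decision.

```python
import math

y = [5.3, 4.8, 5.9, 5.1, 5.6, 5.4]   # hypothetical data
n, sigma, mu0 = len(y), 1.0, 4.0     # sigma assumed known for simplicity
ybar = sum(y) / n

z = (ybar - mu0) / (sigma / math.sqrt(n))      # Wald test statistic

def loglik(mu):
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (yi - mu)**2 / (2 * sigma**2) for yi in y)

lrt = -2 * (loglik(mu0) - loglik(ybar))        # -2 log(L0 / L1)
assert abs(lrt - z**2) < 1e-9                  # the two statistics coincide
assert (abs(z) > 1.96) == (lrt > 1.96**2)      # identical reject decisions
```

With σ unknown the equivalence is only approximate, which is why the slide calls the likelihood ratio test ”approximate”.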
Model/variable selection
Very often we want to select a ”best” model among a set of (nested) candidate models.
The result totally depends on what we mean by ”best”.
1. A simple model with only significantly contributing variables?
– Forward/backward variable selection routines with significance testing in each step
2. Smallest Mean Sum of Squares of Error (MSE) (= largest adjusted R2)?
– Best subset selection
3. Maximum likelihood?
– Pairwise comparisons using the likelihood ratio test
4. Likelihood based criterion with penalization of complexity?
– Best subset selection using AIC (BIC)
5. Minimum risk model?
– Bayes factor: Bayesian model selection
6. Minimum prediction error?
– Compare models in terms of prediction error for new observations
Total error as a function of model complexity
[Figure: error versus model complexity, with curves for model error and estimation error; underfitting at low complexity, overfitting at high complexity.]
AIC
The likelihood ratio test is useful for comparing two nested models. What if we have many models to compare?
Akaike's Information Criterion, AIC, is closely connected to the likelihood ratio statistic:
AIC = −2 log L(θ̂) + 2p
where p is the number of model parameters.
For BIC the latter term is p·log(n).
Hence, the AIC is a model selection criterion for likelihood-based estimation.
Benefit: Quick to compare many related (nested) models; penalizes complexity.
There are many criteria related to AIC: AICc, BIC, etc.
A typical approach is to select the model with smallest AIC as the best model (from a likelihood point of view).
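A sketch of the criterion on toy data (assumed example): AIC = −2 log L + 2p and BIC = −2 log L + p·log(n) for two nested linear models, using the maximized Gaussian log-likelihood log L = −(n/2)(log(2π·RSS/n) + 1).

```python
import math

x = [1, 2, 3, 4, 5, 6]
y = [1.1, 2.3, 2.8, 4.1, 5.2, 5.9]      # roughly linear in x (toy data)

def gaussian_aic_bic(rss, n, p):
    loglik = -n / 2 * (math.log(2 * math.pi * rss / n) + 1)
    return -2 * loglik + 2 * p, -2 * loglik + p * math.log(n)

n = len(x)
# Model 1: intercept only (p = 2: intercept + error variance)
ybar = sum(y) / n
rss1 = sum((yi - ybar) ** 2 for yi in y)
# Model 2: simple regression y = b0 + b1*x (p = 3)
xbar = sum(x) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
rss2 = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

aic1, bic1 = gaussian_aic_bic(rss1, n, 2)
aic2, bic2 = gaussian_aic_bic(rss2, n, 3)
assert aic2 < aic1 and bic2 < bic1      # the regression model wins here
```

For best subset selection one would loop this comparison over all predictor combinations and keep the smallest criterion value.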
Some problems with AIC
AIC should only be used to compare nested models (after my talk I have registered that this might be wrong… more investigation needed!)
Bad small-sample behaviour; better to use the modified AICc for small samples.
AIC is a random variable. Are models with similar AIC really different?
On the latter problem there is an approximate solution for simple cases.
Assume there are two nested models with p1 and p2 parameters respectively, and let p1 > p2. Then
(AIC2 − AIC1) + 2(p1 − p2) is approximately χ²-distributed with p1 − p2 degrees of freedom,
where AIC1 and AIC2 are the criterion values of the larger and the smaller model, respectively (this is the likelihood ratio statistic rewritten in terms of AIC).
Simulation example
Data were simulated from the smaller of two nested models.
Two models were fitted to the data:
– the larger model (p1 = 3)
– the smaller, true model (p2 = 2)
The value of (AIC2 − AIC1) + 2(p1 − p2) was computed.
The procedure was repeated 2500 times.
The next slide shows a histogram of the differences along with a χ²1-distribution superimposed.
Also, the critical value for a significant difference is shown as the upper 5% percentile of the distribution (blue line).
Simulation example
[Figure: histogram of the 2500 simulated differences with the χ²1 density superimposed; the blue line marks the upper 5% percentile.]
Non-nested models?
How to compare non-nested models?
– Are there any solutions to this?
– Can likelihoods be compared across non-nested models?
Vuong's non-nested hypothesis test… (R library pscl)
(Vuong, Q.H. 1989. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57:307-333)
My (limited) experience: a tendency to select too complex models.
What about prediction?
A model which predicts new data well must be a good model, right?
Why not select the model with smallest prediction error?
Popular among statisticians.
ANY model which may provide predictions may be compared!!! (No nesting required)
But… we need new data…
Criterion: For a set of m new samples with observed values y1, …, ym, choose the model minimizing the mean squared error of prediction:
MSEP = (1/m) ∑i (yi − ŷi)²
Hmm, no new data…? Cross-validation!
If new data are not obtainable, cross-validation may often be used.
Simplest version (leave-one-out cross-validation):
1. Leave one sample out of your data and fit the models to the n − 1 remaining samples.
2. Predict the left-out sample and record the squared error for this sample for all candidate models.
3. Put the sample back in and remove another; repeat 2.
4. When all samples have been left out once, compute for each model the cross-validated MSEP:
MSEP(CV) = (1/n) ∑i (yi − ŷ(−i))²
where ŷ(−i) is the prediction of yi from the model fitted without sample i.
5. Compare the models.
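The steps above can be sketched in a few lines (toy data and a simple regression model assumed): each sample is predicted from a fit to the other n − 1 samples, and the squared errors are averaged into MSEP(CV).

```python
def fit_simple_regression(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    return ybar - b1 * xbar, b1

def msep_cv(x, y):
    """Leave-one-out cross-validated mean squared error of prediction."""
    sq_errors = []
    for i in range(len(x)):                  # leave sample i out
        x_tr = x[:i] + x[i + 1:]
        y_tr = y[:i] + y[i + 1:]
        b0, b1 = fit_simple_regression(x_tr, y_tr)
        sq_errors.append((y[i] - (b0 + b1 * x[i])) ** 2)
    return sum(sq_errors) / len(sq_errors)

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1]   # hypothetical data
assert msep_cv(x, y) >= 0.0
```

In practice one computes MSEP(CV) for every candidate model on the same data and keeps the model with the smallest value.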
MSEP(CV) drawbacks
The distribution of MSEP(CV) is unknown. Difficult to assess whether models are significantly different.
Cross-validation may be difficult in situations where observations are partly dependent, e.g. for models with random effects.
– Then dependent blocks of data should be left out at each step.
For large sample sizes cross-validation may take a long time.
– Then larger blocks of data may be left out in each iteration.
Cross-validation routines are often not readily available in software. Requires some programming (e.g. in R).
A simulation study based on real data – bears
Sex  Head.L  Head.W  Neck.G  Length  Chest.G
0    10      5       15      45      23
1    11      6.5     20      47.5    24
1    12      6       17      57      27
1    12.5    5       20.5    59.5    38
1    12      6       18      62      31
0    11      5.5     16      53      26
0    12      5.5     17      56      30.5
0    16.5    9       28      67.5    45
0    16.5    9       27      78      49
0    15.5    8       31      72      54
0    17      10      31.5    72      49
0    15.5    7.5     32      75      54.5
0    17.5    8       32      75      55
0    15      9       33      75      49
1    15.5    6.5     22      62      35
1    13      7       21      70      41
0    15      6.5     28      78      45
0    15      7.5     26.5    73.5    41
0    13.5    8       27      68.5    49
1    13.5    7       20      64      38
1    12.5    6       18      58      31
0    16      9       29      73      44
0    9       4.5     13      37      19
1    12.5    4.5     10.5    63      32
0    14      5       21.5    67      37
Simulation study 1
”log-weights” were simulated according to a model with one predictor.
This is the true model, which should be selected by the various methods.
Model selection was performed by choosing, among all predictor combinations, the model with
– Minimum residual standard deviation, s
– Maximum R2 and adjusted R2
– Minimum AIC and BIC
– Minimum MSEP(CV)
The simulation was repeated 1000 times.
Results – Study 1
Mean model complexity of chosen model (true model=1)
s      r2     r2adj  AIC    BIC    MSEPCV
2.767  6.000  2.767  2.104  1.521  1.969
Rate of selecting the true model
s      r2     r2adj  AIC    BIC    MSEPCV
0.146  0.000  0.146  0.341  0.544  0.356
Study 2
”log-weights” were simulated according to a model with two predictors.
Mean model complexity of chosen model (true model=2)
s      r2     r2adj  AIC    BIC    MSEPCV
2.947  6.000  2.947  2.386  1.864  2.225
Rate of selecting the true model
s      r2     r2adj  AIC    BIC    MSEPCV
0.132  0.000  0.132  0.199  0.214  0.199
Study 3
”log-weights” were simulated according to a model with three predictors for the 25 bears in the data set.
Mean model complexity of chosen model (true model=3)
s      r2     r2adj  AIC    BIC    MSEPCV
3.356  6.000  3.356  2.814  2.156  2.681
Rate of selecting the true model
s      r2     r2adj  AIC    BIC    MSEPCV
0.148  0.000  0.148  0.165  0.133  0.177
Study 4
”log-weights” were simulated according to a model with four predictors for the 25 bears in the data set.
Mean model complexity of chosen model (true model=4)
s      r2     r2adj  AIC    BIC    MSEPCV
4.057  6.000  4.057  3.545  2.937  3.310
Rate of selecting the true model
s      r2     r2adj  AIC    BIC    MSEPCV
0.237  0.000  0.237  0.217  0.155  0.166
Model averaging
Is there such a thing as a ”true model”?
Maybe better to find a set of ”good approximations”?
Should we stick to one of these, or could the top candidates be combined?
Model averaging deals with combining models for various objectives:
– For inference on parameters of nested models: gives better quantification of the uncertainty of model parameters. Uncertainty measures given the selected model do not include your model selection uncertainty…
– For prediction: combining models may improve prediction performance.
Bayesian Model Averaging (BMA)
This is inference under the Bayesian estimation paradigm.
Strategy:
– Choose prior probabilities for the set of candidate models.
– Choose prior probabilities for the model parameters.
– By means of Bayes' theorem, a posterior distribution of the model parameters may (usually) be derived which incorporates the model uncertainty.
– Estimation may be complicated and often involves Markov chain Monte Carlo (MCMC) methods to approximate the posterior distributions.
Frequentist Model Averaging (FMA)
Same objectives as BMA.
Many proposed methods, including the AIC-based weighted average (Buckland, Burnham and Augustin, 1997) and recent work by Hjort and Claeskens.
Typically an FMA estimator for a parameter θ is derived as a weighted average of model-specific estimators:
θ̂ = ∑m wm θ̂m
The weights, wm, may be defined in many ways, but should sum to 1.
AIC-based weights
Buckland, Burnham and Augustin suggest AIC-based weights defined as
wm = exp(−AICm/2) / ∑k exp(−AICk/2)
This is motivated from a well-known result on BIC:
the approximate posterior probability of a model m being correct is
P(m | data) ≈ exp(−BICm/2) / ∑k exp(−BICk/2)
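The weight formula is easy to compute; as a check, plugging in the five AIC values from the ”Results – Model averaging” slide reproduces the weights reported there.

```python
import math

def akaike_weights(aics):
    """w_m = exp(-AIC_m/2) / sum_k exp(-AIC_k/2), shifted by min(AIC) for numerical stability."""
    best = min(aics)
    rel = [math.exp(-(a - best) / 2) for a in aics]
    total = sum(rel)
    return [r / total for r in rel]

aics = [19.46738, 19.90066, 20.48747, 20.50365, 21.66067]
w = akaike_weights(aics)
expected = [0.299822, 0.241422, 0.180033, 0.178583, 0.100137]
assert all(abs(wi - ei) < 1e-3 for wi, ei in zip(w, expected))
assert abs(sum(w) - 1.0) < 1e-9
```

Subtracting min(AIC) before exponentiating leaves the weights unchanged but avoids underflow when AIC values are large.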
Example – Bears modelling
Reconsider the bears data.
We use a data set from Study 4 and select
– the best model (AIC) and report parameter estimates
– the best 5 models (AIC) and report averaged estimates
Results – Model averaging
Best AIC values
19.46738  19.90066  20.48747  20.50365  21.66067
Weights
0.299822  0.241422  0.180033  0.178583  0.100137
Parameter estimates
          intercept  Sex    Head.L  Head.W  Neck.G  Length  Chest.G
True      1.450      0.400  0.000   0.000   0.040   0.045   -0.200
Top AIC   0.055      0.525  0.115   0.190   0.000   0.053   -0.231
Top5 AIC  0.172      0.560  0.074   0.166   0.021   0.058   -0.236
Concluding remarks
Averaging for inference on parameters is only sensible for nested models.
For non-nested models it is still a good idea to do model averaging for prediction purposes!
Then the predicted values should be averaged!
Also known in classification problems as ”majority voting”.
Main lesson: There is no best way of choosing models.
The method should reflect what the analyst thinks is important!
Some literature
Hjort and Claeskens, Frequentist Model Average Estimators, JASA, 2003.
Burnham and Anderson, Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, Springer, New York, 2002.
Buckland, Burnham and Augustin, Model selection: An integral part of inference, Biometrics, 1997.
Ripley, Pattern Recognition and Neural Networks, Cambridge Univ Press, 1996.