Model selection and averaging – What, how, why?
Solve Sæbø, IKBM, UMB
Universitetet for miljø- og biovitenskap (UMB)
Outline
– Statistical background
  • Models
  • Parameter estimation
  • Hypothesis tests
  • Likelihood based tests
– Model selection
  • In what way should a model be best?
  • Choosing a criterion (RSS, Cp, R2, AIC, BIC, CIC, DIC, EIC, FIC, …)
  • What about prediction?
– Model averaging
  • Why? How?
www.umb.no
Some statistics
Typically we have gathered some data, say, a response variable y and a set of p predictor variables for n observations.
Example:
y = log(Weight of bears)
x's = Length, Sex, HeadLength, HeadWidth, NeckGirth, ChestGirth
We would like to fit a model relating the response to the predictors in a ”best way”.
Purpose:
1. To learn about bear physiology, and/or
2. To obtain a prediction model for Weight
A model
A statistical model describes the probability distribution of our data. A general model for our response may be
y = f(x; θ) + ε
describing how our response depends on the predictors x and a parameter set θ.
Example:
y = β0 + β1x1 + … + βpxp + ε,  with ε ~ N(0, σ²),
hence, a linear regression model with normally distributed errors.
The likelihood function
If we have n independent observations, the joint distribution of our data is given by
f(y1, …, yn; θ) = ∏i f(yi; θ)
which for our example equals
∏i (2πσ²)^(−1/2) exp(−(yi − β0 − β1x1i − … − βpxpi)² / (2σ²))
If we consider the data (y1, …, yn) as given and the parameters as variables, this function is known as the likelihood function, L(θ), for the parameters.
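As a small numerical sketch (toy data assumed, not from the slides), the normal log-likelihood can be evaluated directly; viewed as a function of μ it peaks at the sample mean:

```python
import math

def normal_loglik(mu, sigma, y):
    """Log-likelihood of i.i.d. N(mu, sigma^2) data: the sum of log densities."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (yi - mu)**2 / (2 * sigma**2) for yi in y)

y = [4.2, 5.1, 3.8, 4.9, 4.5]    # hypothetical observations
ybar = sum(y) / len(y)           # sample mean = 4.5
# The log-likelihood in mu is a concave quadratic maximized at ybar:
assert normal_loglik(ybar, 1.0, y) > normal_loglik(ybar + 0.5, 1.0, y)
assert normal_loglik(ybar, 1.0, y) > normal_loglik(ybar - 0.5, 1.0, y)
```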
Parameter estimation
Given a model and data we want to find an estimate for the unknown θ.
There are MANY ways to estimate parameters…
Each method optimizes some criterion reflecting an opinion of what is ”best”!
Examples: Find those parameter values which:
– Least squares: minimize the residual sum of squares ∑i (yi − ŷi)²
– Maximum likelihood: maximize L(θ)
– Bayes: minimize overall ”risk”
– Etc…
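A minimal sketch of the first two criteria (toy data assumed): for i.i.d. normal data, minimizing the residual sum of squares and maximizing the likelihood over a grid of candidate values pick out the same estimate of μ, the sample mean.

```python
import math

y = [2.0, 3.5, 2.5, 4.0]                     # hypothetical observations
grid = [i / 100 for i in range(100, 501)]    # candidate mu values 1.00 .. 5.00

rss = lambda mu: sum((yi - mu) ** 2 for yi in y)
loglik = lambda mu: sum(-0.5 * math.log(2 * math.pi)
                        - (yi - mu) ** 2 / 2 for yi in y)

mu_ls = min(grid, key=rss)      # least squares: minimize RSS
mu_ml = max(grid, key=loglik)   # maximum likelihood: maximize L
assert mu_ls == mu_ml == 3.0    # both equal the sample mean here
```

The two criteria agree only because the error model is normal; other error distributions would separate them.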
”Goodness of fit” and testing
After estimating the parameters we typically want to assess the ”goodness of fit” of the model and answer questions like:
May the model be simplified? Should it be more complex? Which variables are important?
Before we proceed we must discuss hypothesis tests in general.
Parameter space
We assume in general that θ ∈ Ω, that is, our unknown parameters are part of some parameter space Ω. E.g. for a normally distributed variable we can assume that this space is two-dimensional, consisting of all values of θ = (μ, σ²) for which −∞ < μ < ∞ and σ² > 0.
Restricted parameter space (simplified model)
Under H0 we assume that θ ∈ ω, such that H0 imposes a restriction on the parameter space. This means that ω ⊂ Ω, meaning that the parameter space under H0 is a subspace of Ω. E.g. ω is defined by μ = μ0 and σ² > 0.
Under H1 we assume that θ ∈ Ω, that is, μ may take any value and σ² > 0.
Hypothesis tests, H0 vs H1
In general, tests rely on test statistics: functions of our data.
The distribution of the test statistic must be known under H0.
This is because we want to compute the probability of observing the given test statistic under H0.
Wald type test: If your data are very unlikely under H0 (small p-value), then reject H0.
Likelihood based tests: If the likelihood of the data is significantly larger under H1, then reject H0.
Example: Test of expectation in normal distribution
Assume yi ~ N(μ, σ²) for i = 1, …, n
H0: μ = μ0 vs H1: μ ≠ μ0
Estimators:
1. Least squares: μ̂ = ȳ
2. Maximum likelihood: μ̂ = ȳ
Wald test:
Reject H0 if |ȳ − μ0| / (s/√n) exceeds the critical value of its reference t-distribution.
Likelihood ratio test (approximate):
Reject H0 if −2 log λ exceeds the critical value of the χ²-distribution with 1 degree of freedom, where λ is the ratio of maximum likelihoods under H0 and H1, respectively.
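The connection between the two tests can be checked numerically. In this hypothetical sketch σ is assumed known (so the Wald statistic is a z-statistic); then −2 log λ equals z² exactly, and ”reject if |z| > 1.96” and ”reject if −2 log λ > 1.96²” are the same decision.

```python
import math

y = [5.3, 4.8, 5.9, 5.1, 5.6, 5.4]   # hypothetical data
n, sigma, mu0 = len(y), 1.0, 4.0     # sigma assumed known for simplicity
ybar = sum(y) / n

z = (ybar - mu0) / (sigma / math.sqrt(n))      # Wald test statistic

def loglik(mu):
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (yi - mu)**2 / (2 * sigma**2) for yi in y)

lrt = -2 * (loglik(mu0) - loglik(ybar))        # -2 log(L0 / L1)
assert abs(lrt - z**2) < 1e-9                  # the two statistics coincide
assert (abs(z) > 1.96) == (lrt > 1.96**2)      # identical reject decisions
```

With σ unknown the equivalence is only approximate, which is why the slide calls the likelihood ratio test ”approximate”.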
Model/variable selection
Very often we want to select a ”best” model among a set of (nested) candidate models.
The result totally depends on what we mean by ”best”.
1. A simple model with only significantly contributing variables?
– Forward/backward variable selection routines with significance testing in each step
2. Smallest Mean Sum of Squares of Error (MSE) (= largest adjusted R2)?
– Best subset selection
3. Maximum likelihood?
– Pairwise comparisons using the likelihood ratio test
4. Likelihood based criterion with penalization of complexity?
– Best subset selection using AIC (BIC)
5. Minimum risk model?
– Bayes factor: Bayesian model selection
6. Minimum prediction error?
– Compare models in terms of prediction error for new observations
Total error as a function of model complexity
[Figure: error versus model complexity, with curves for model error and estimation error; underfitting at low complexity, overfitting at high complexity.]
AIC
The likelihood ratio test is useful for comparing two nested models. What if we have many models to compare?
Akaike's Information Criterion, AIC, is closely connected to the likelihood ratio statistic:
AIC = −2 log L(θ̂) + 2p
where p is the number of model parameters.
For BIC the latter term is p·log(n).
Hence, the AIC is a model selection criterion for likelihood-based estimation.
Benefit: Quick to compare many related (nested) models; penalizes complexity.
There are many criteria related to AIC: AICc, BIC, etc.
A typical approach is to select the model with smallest AIC as the best model (from a likelihood point of view).
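A sketch of the criterion on toy data (assumed example): AIC = −2 log L + 2p and BIC = −2 log L + p·log(n) for two nested linear models, using the maximized Gaussian log-likelihood log L = −(n/2)(log(2π·RSS/n) + 1).

```python
import math

x = [1, 2, 3, 4, 5, 6]
y = [1.1, 2.3, 2.8, 4.1, 5.2, 5.9]      # roughly linear in x (toy data)

def gaussian_aic_bic(rss, n, p):
    loglik = -n / 2 * (math.log(2 * math.pi * rss / n) + 1)
    return -2 * loglik + 2 * p, -2 * loglik + p * math.log(n)

n = len(x)
# Model 1: intercept only (p = 2: intercept + error variance)
ybar = sum(y) / n
rss1 = sum((yi - ybar) ** 2 for yi in y)
# Model 2: simple regression y = b0 + b1*x (p = 3)
xbar = sum(x) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
rss2 = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

aic1, bic1 = gaussian_aic_bic(rss1, n, 2)
aic2, bic2 = gaussian_aic_bic(rss2, n, 3)
assert aic2 < aic1 and bic2 < bic1      # the regression model wins here
```

For best subset selection one would loop this comparison over all predictor combinations and keep the smallest criterion value.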
Some problems with AIC
AIC should only be used to compare nested models (after my talk I have registered that this might be wrong… more investigation needed!)
Bad small-sample behaviour; better to use the modified AICc for small samples.
AIC is a random variable. Are models with similar AIC really different?
On the latter problem there is an approximate solution for simple cases.
Assume there are two nested models with p1 and p2 parameters respectively, and let p1 > p2. Then
(AIC2 − AIC1) + 2(p1 − p2) is approximately χ²-distributed with p1 − p2 degrees of freedom,
where AIC1 and AIC2 are the criterion values of the larger and the smaller model, respectively (this is the likelihood ratio statistic rewritten in terms of AIC).
Simulation example
Data were simulated from the smaller of two nested models.
Two models were fitted to the data:
– the larger model (p1 = 3)
– the smaller, true model (p2 = 2)
The value of (AIC2 − AIC1) + 2(p1 − p2) was computed.
The procedure was repeated 2500 times.
The next slide shows a histogram of the differences along with a χ²1-distribution superimposed.
Also, the critical value for a significant difference is shown as the upper 5% percentile of the distribution (blue line).
Simulation example
[Figure: histogram of the 2500 simulated differences with the χ²1 density superimposed; the blue line marks the upper 5% percentile.]
Non-nested models?
How to compare non-nested models?
– Are there any solutions to this?
– Can likelihoods be compared across non-nested models?
Vuong's non-nested hypothesis test… (R library pscl)
(Vuong, Q.H. 1989. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57:307-333)
My (limited) experience: a tendency to select too complex models.
What about prediction?
A model which predicts new data well must be a good model, right?
Why not select the model with smallest prediction error?
Popular among statisticians.
ANY model which may provide predictions may be compared!!! (No nesting required)
But… we need new data…
Criterion: For a set of m new samples with observed values y1, …, ym, choose the model minimizing the mean squared error of prediction:
MSEP = (1/m) ∑i (yi − ŷi)²
Hmm, no new data…? Cross-validation!
If new data are not obtainable, cross-validation may often be used.
Simplest version (leave-one-out cross-validation):
1. Leave one sample out of your data and fit the models to the n − 1 remaining samples.
2. Predict the left-out sample and record the squared error for this sample for all candidate models.
3. Put the sample back in and remove another; repeat 2.
4. When all samples have been left out once, compute for each model the cross-validated MSEP:
MSEP(CV) = (1/n) ∑i (yi − ŷ(−i))²
where ŷ(−i) is the prediction of yi from the model fitted without sample i.
5. Compare the models.
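The steps above can be sketched in a few lines (toy data and a simple regression model assumed): each sample is predicted from a fit to the other n − 1 samples, and the squared errors are averaged into MSEP(CV).

```python
def fit_simple_regression(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    return ybar - b1 * xbar, b1

def msep_cv(x, y):
    """Leave-one-out cross-validated mean squared error of prediction."""
    sq_errors = []
    for i in range(len(x)):                  # leave sample i out
        x_tr = x[:i] + x[i + 1:]
        y_tr = y[:i] + y[i + 1:]
        b0, b1 = fit_simple_regression(x_tr, y_tr)
        sq_errors.append((y[i] - (b0 + b1 * x[i])) ** 2)
    return sum(sq_errors) / len(sq_errors)

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1]   # hypothetical data
assert msep_cv(x, y) >= 0.0
```

In practice one computes MSEP(CV) for every candidate model on the same data and keeps the model with the smallest value.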
MSEP(CV) drawbacks
The distribution of MSEP(CV) is unknown. Difficult to assess whether models are significantly different.
Cross-validation may be difficult in situations where observations are partly dependent, e.g. for models with random effects.
– Then dependent blocks of data should be left out at each step.
For large sample sizes cross-validation may take a long time.
– Then larger blocks of data may be left out in each iteration.
Cross-validation routines are often not readily available in software. Requires some programming (e.g. in R).
A simulation study based on real data – bears
Sex  Head.L  Head.W  Neck.G  Length  Chest.G
0    10      5       15      45      23
1    11      6.5     20      47.5    24
1    12      6       17      57      27
1    12.5    5       20.5    59.5    38
1    12      6       18      62      31
0    11      5.5     16      53      26
0    12      5.5     17      56      30.5
0    16.5    9       28      67.5    45
0    16.5    9       27      78      49
0    15.5    8       31      72      54
0    17      10      31.5    72      49
0    15.5    7.5     32      75      54.5
0    17.5    8       32      75      55
0    15      9       33      75      49
1    15.5    6.5     22      62      35
1    13      7       21      70      41
0    15      6.5     28      78      45
0    15      7.5     26.5    73.5    41
0    13.5    8       27      68.5    49
1    13.5    7       20      64      38
1    12.5    6       18      58      31
0    16      9       29      73      44
0    9       4.5     13      37      19
1    12.5    4.5     10.5    63      32
0    14      5       21.5    67      37
Simulation study 1
”log-weights” were simulated according to a model with one predictor.
This is the true model, which should be selected by the various methods.
Model selection was performed by choosing, among all predictor combinations, the model with
– Minimum residual standard deviation, s
– Maximum R2 and adjusted R2
– Minimum AIC and BIC
– Minimum MSEP(CV)
The simulation was repeated 1000 times.
Results – Study 1
Mean model complexity of chosen model (true model=1)
s      r2     r2adj  AIC    BIC    MSEPCV
2.767  6.000  2.767  2.104  1.521  1.969
Rate of selecting the true model
s      r2     r2adj  AIC    BIC    MSEPCV
0.146  0.000  0.146  0.341  0.544  0.356
Study 2
”log-weights” were simulated according to a model with two predictors.
Mean model complexity of chosen model (true model=2)
s      r2     r2adj  AIC    BIC    MSEPCV
2.947  6.000  2.947  2.386  1.864  2.225
Rate of selecting the true model
s      r2     r2adj  AIC    BIC    MSEPCV
0.132  0.000  0.132  0.199  0.214  0.199
Study 3
”log-weights” were simulated according to a model with three predictors for the 25 bears in the data set.
Mean model complexity of chosen model (true model=3)
s      r2     r2adj  AIC    BIC    MSEPCV
3.356  6.000  3.356  2.814  2.156  2.681
Rate of selecting the true model
s      r2     r2adj  AIC    BIC    MSEPCV
0.148  0.000  0.148  0.165  0.133  0.177
Study 4
”log-weights” were simulated according to a model with four predictors for the 25 bears in the data set.
Mean model complexity of chosen model (true model=4)
s      r2     r2adj  AIC    BIC    MSEPCV
4.057  6.000  4.057  3.545  2.937  3.310
Rate of selecting the true model
s      r2     r2adj  AIC    BIC    MSEPCV
0.237  0.000  0.237  0.217  0.155  0.166
Model averaging
Is there such a thing as a ”true model”?
Maybe better to find a set of ”good approximations”?
Should we stick to one of these, or could the top candidates be combined?
Model averaging deals with combining models for various objectives:
– For inference on parameters of nested models: gives better quantification of the uncertainty of model parameters. Uncertainty measures given the selected model do not include your model selection uncertainty…
– For prediction: combining models may improve prediction performance.
Bayesian Model Averaging (BMA)
This is inference under the Bayesian estimation paradigm.
Strategy:
– Choose prior probabilities for the set of candidate models.
– Choose prior probabilities for the model parameters.
– By means of Bayes' theorem, a posterior distribution of the model parameters may (usually) be derived which incorporates the model uncertainty.
– Estimation may be complicated and often involves Markov chain Monte Carlo (MCMC) methods to approximate the posterior distributions.
Frequentist Model Averaging (FMA)
Same objectives as BMA.
Many proposed methods, including the AIC-based weighted average (Buckland, Burnham and Augustin, 1997) and recent work by Hjort and Claeskens.
Typically an FMA estimator for a parameter θ is derived as a weighted average of model-specific estimators:
θ̂ = ∑m wm θ̂m
The weights, wm, may be defined in many ways, but should sum to 1.
AIC-based weights
Buckland, Burnham and Augustin suggest AIC-based weights defined as
wm = exp(−AICm/2) / ∑k exp(−AICk/2)
This is motivated from a well-known result on BIC:
the approximate posterior probability of a model m being correct is
P(m | data) ≈ exp(−BICm/2) / ∑k exp(−BICk/2)
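The weight formula is easy to compute; as a check, plugging in the five AIC values from the ”Results – Model averaging” slide reproduces the weights reported there.

```python
import math

def akaike_weights(aics):
    """w_m = exp(-AIC_m/2) / sum_k exp(-AIC_k/2), shifted by min(AIC) for numerical stability."""
    best = min(aics)
    rel = [math.exp(-(a - best) / 2) for a in aics]
    total = sum(rel)
    return [r / total for r in rel]

aics = [19.46738, 19.90066, 20.48747, 20.50365, 21.66067]
w = akaike_weights(aics)
expected = [0.299822, 0.241422, 0.180033, 0.178583, 0.100137]
assert all(abs(wi - ei) < 1e-3 for wi, ei in zip(w, expected))
assert abs(sum(w) - 1.0) < 1e-9
```

Subtracting min(AIC) before exponentiating leaves the weights unchanged but avoids underflow when AIC values are large.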
Example – Bears modelling
Reconsider the bears data.
We use a data set from Study 4 and select
– the best model (AIC) and report parameter estimates
– the best 5 models (AIC) and report averaged estimates
Results – Model averaging
Best AIC values
19.46738  19.90066  20.48747  20.50365  21.66067
Weights
0.299822  0.241422  0.180033  0.178583  0.100137
Parameter estimates
          intercept  Sex    Head.L  Head.W  Neck.G  Length  Chest.G
True      1.450      0.400  0.000   0.000   0.040   0.045   -0.200
Top AIC   0.055      0.525  0.115   0.190   0.000   0.053   -0.231
Top5 AIC  0.172      0.560  0.074   0.166   0.021   0.058   -0.236
Concluding remarks
Averaging for inference on parameters is only sensible for nested models.
For non-nested models it is still a good idea to do model averaging for prediction purposes!
Then the predicted values should be averaged!
Also known in classification problems as ”majority voting”.
Main lesson: There is no best way of choosing models.
The method should reflect what the analyst thinks is important!
Some literature
Hjort and Claeskens, Frequentist Model Average Estimators, JASA, 2003.
Burnham and Anderson, Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, Springer, New York, 2002.
Buckland, Burnham and Augustin, Model selection: An integral part of inference, Biometrics, 1997.
Ripley, Pattern Recognition and Neural Networks, Cambridge Univ Press, 1996.