Model Assessment and Selection: Subset, Ridge, Lasso, and PCR
Model Assessment and Selection: Subset, Ridge, Lasso, and PCR

Yuan Yao

Department of Mathematics, Hong Kong University of Science and Technology

Most of the materials here are from Chapters 5-6 of An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.

Spring, 2020
Yesterday is a bad day...
Model Assessment
- Cross Validation
- Bootstrap

Model Selection
- Subset selection
- Ridge Regression
- The Lasso
- Principal Component Regression
Outline
Model Assessment
- Cross Validation
- Bootstrap

Model Selection
- Subset selection
- Ridge Regression
- The Lasso
- Principal Component Regression
Model Assessment 4
Training error is not sufficient

- Training error is easily computable from the training data.
- Because of the possibility of over-fitting, it cannot be used to properly assess test error.
- It is possible to "estimate" the test error, for example by making adjustments to the training error.
  - The adjusted R-squared, Cp, AIC, BIC, etc. serve this purpose.
- General-purpose method of prediction/test error estimation: validation.
Model Assessment 5
Ideal scenario for performance assessment

- In a "data-rich" scenario, we can afford to separate the data into three parts:
  - training data: used to train various models.
  - validation data: used to assess the models and identify the best.
  - test data: test the results of the best model.
- Usually, people also call validation data or hold-out data test data.
Model Assessment 6
Cross validation: overcoming the drawback of the validation set approach

- Our ultimate goal is to produce the model with the best prediction accuracy.
- The validation set approach has the drawback of using ONLY the training data to fit the model.
- The validation data do not participate in model building, only in model assessment.
- A "waste" of data.
- We want more data to participate in model building.
Model Assessment 7
K-fold cross validation

- Divide the data into K subsets, usually of equal or similar sizes (about n/K each).
- Treat one subset as a validation set and the rest together as a training set. Fit the model on the training set and compute the test error estimate on the validation set, denoted MSE_i, say.
- Repeat the procedure over every subset.
- Average the above K estimates of the test error to obtain

  CV_(K) = (1/K) sum_{i=1}^K MSE_i

- Leave-One-Out Cross Validation (LOOCV) is the special case of K-fold cross validation with K = n, i.e. n-fold cross validation.
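The procedure above can be sketched in a few lines of numpy. This is a minimal illustration, not a production routine; the function name and the use of plain least squares as the model are my choices for the example.

```python
import numpy as np

def kfold_cv_mse(X, y, K=5, seed=0):
    """Estimate the test MSE of least-squares regression by K-fold CV."""
    n = len(y)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)                 # random split into K folds
    folds = np.array_split(idx, K)
    mses = []
    for k in range(K):
        val = folds[k]                       # fold k is the validation set
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        Xtr = np.column_stack([np.ones(len(tr)), X[tr]])
        Xval = np.column_stack([np.ones(len(val)), X[val]])
        beta, *_ = np.linalg.lstsq(Xtr, y[tr], rcond=None)
        mses.append(np.mean((y[val] - Xval @ beta) ** 2))
    return np.mean(mses)                     # CV_(K): average of K fold MSEs
```

Setting K = n recovers LOOCV, at the cost of fitting the model n times.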
Model Assessment 8
K-fold cross validation
Figure: 5.5. A schematic display of 5-fold CV. A set of n observations is randomly split into five non-overlapping groups. Each of these fifths acts as a validation set (shown in beige), and the remainder as a training set (shown in blue). The test error is estimated by averaging the five resulting MSE estimates.
Model Assessment 9
Inference of estimate uncertainty

- Suppose we have data x1, ..., xn, representing the ages of n randomly selected people in HK.
- Use the sample mean x̄ to estimate the population mean µ, the average age of all residents of HK.
- How do we justify the estimation error x̄ − µ? Usually by a t-confidence interval or a test of hypothesis.
- These rely on a normality assumption or the central limit theorem.
- Is there another reliable way?
- Just bootstrap:
Model Assessment 10
Bootstrap as a resampling procedure

- Take n random samples (with replacement) from x1, ..., xn.
- Calculate the sample mean of this "re-sample", denoted x̄*_1.
- Repeat the above a large number M of times. We have x̄*_1, x̄*_2, ..., x̄*_M.
- Use the distribution of x̄*_1 − x̄, ..., x̄*_M − x̄ to approximate that of x̄ − µ.
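The four steps above can be sketched directly in numpy. A minimal illustration; the function name and the choice to summarize the bootstrap distribution by its standard deviation and quantiles are mine.

```python
import numpy as np

def bootstrap_mean(x, M=2000, seed=0):
    """Approximate the distribution of xbar - mu by resampling."""
    rng = np.random.default_rng(seed)
    n = len(x)
    xbar = x.mean()
    # M bootstrap sample means xbar*_1, ..., xbar*_M
    boot_means = np.array([rng.choice(x, size=n, replace=True).mean()
                           for _ in range(M)])
    deltas = boot_means - xbar               # proxies for xbar - mu
    # bootstrap standard error and a 95% interval for the error xbar - mu
    return boot_means.std(ddof=1), np.quantile(deltas, [0.025, 0.975])
```

For roughly normal data the bootstrap standard error should approximate the usual s/sqrt(n), without invoking the central limit theorem.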
Model Assessment 11
- Essential idea: treat the data distribution (more formally called the empirical distribution) as a proxy for the population distribution.
- Mimic the data generation from the true population by resampling from the empirical distribution.
- Mimic your statistical procedure (such as computing an estimate x̄) on the data, by doing the same on the resampled data.
- Evaluate your statistical procedure (which may be difficult because it involves randomness and the unknown population distribution) by evaluating the analogous procedure on the re-samples.
Model Assessment 12
Example
- X and Y are two random variables. The minimizer of var(αX + (1 − α)Y) is

  α = (σ²_Y − σ_XY) / (σ²_X + σ²_Y − 2σ_XY)

- Data: (X1, Y1), ..., (Xn, Yn).
- We can compute sample variances and covariances.
- Estimate α by the plug-in estimator

  α̂ = (σ̂²_Y − σ̂_XY) / (σ̂²_X + σ̂²_Y − 2σ̂_XY)

- How do we evaluate α̂ − α? (Remember α̂ is random and α is unknown.)
- Use the bootstrap.
Model Assessment 13
Example
- Sample n pairs with replacement from (X1, Y1), ..., (Xn, Yn), and compute the sample variances and covariance for this resample. Then compute

  α̂* = ((σ̂*_Y)² − σ̂*_XY) / ((σ̂*_X)² + (σ̂*_Y)² − 2σ̂*_XY)

- Repeat this procedure; we obtain α̂*_1, ..., α̂*_M for a large M.
- Use the distribution of α̂*_1 − α̂, ..., α̂*_M − α̂ to approximate the distribution of α̂ − α.
- For example, we can use

  (1/M) sum_{j=1}^M (α̂*_j − α̂)²

  to estimate E(α̂ − α)².
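This bootstrap for α can be sketched as follows. A minimal illustration assuming numpy; note that resampling must keep the (X_i, Y_i) pairs together.

```python
import numpy as np

def alpha_hat(x, y):
    """Plug-in estimate of the variance-minimizing weight alpha."""
    c = np.cov(x, y)                          # 2x2 sample covariance matrix
    return (c[1, 1] - c[0, 1]) / (c[0, 0] + c[1, 1] - 2 * c[0, 1])

def bootstrap_alpha(x, y, M=1000, seed=0):
    """Return bootstrap replicates alpha*_1, ..., alpha*_M."""
    rng = np.random.default_rng(seed)
    n = len(x)
    reps = np.empty(M)
    for j in range(M):
        idx = rng.integers(0, n, size=n)      # resample pairs with replacement
        reps[j] = alpha_hat(x[idx], y[idx])
    return reps

# E(alpha_hat - alpha)^2 is then estimated by np.mean((reps - alpha_hat(x, y))**2)
```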
Model Assessment 14
Figure: 5.10. Left: A histogram of the estimates of α obtained by generating 1,000 simulated data sets from the true population. Center: A histogram of the estimates of α obtained from 1,000 bootstrap samples from a single data set. Right: The estimates of α displayed in the left and center panels are shown as boxplots. In each panel, the pink line indicates the true value of α.
Model Assessment 15
Original Data (Z):
  Obs  X    Y
  3    5.3  2.8
  2    2.1  1.1
  1    4.3  2.4

Bootstrap data sets (observations drawn with replacement):
  Z*1 (obs 3, 1, 3) → α̂*1
  Z*2 (obs 1, 3, 2) → α̂*2
  ...
  Z*B (obs 1, 2, 2) → α̂*B
Model Assessment 16
![Page 17: Model Assessment and Selection: Subset, Ridge, Lasso, and PCR](https://reader033.fdocuments.us/reader033/viewer/2022050100/626c3a548c46884acb52f454/html5/thumbnails/17.jpg)
Figure 5.11. A graphical illustration of the bootstrap approach on a small sample containing n = 3 observations. Each bootstrap data set contains n observations, sampled with replacement from the original data set. Each bootstrap data set is used to obtain an estimate of α.
Model Assessment 17
Outline
Model Assessment
- Cross Validation
- Bootstrap

Model Selection
- Subset selection
- Ridge Regression
- The Lasso
- Principal Component Regression
Model Selection 18
Interpretability vs. Prediction
(Axes: flexibility increases left to right; interpretability decreases top to bottom. From high interpretability/low flexibility to low interpretability/high flexibility: Subset Selection and Lasso; Least Squares; Generalized Additive Models and Trees; Bagging and Boosting; Support Vector Machines.)
Figure: 2.7. As models become more flexible, interpretability drops. Occam's razor principle: everything should be kept as simple as possible, but not simpler (attributed to Albert Einstein).
Model Selection 19
About this chapter

- The linear model was already addressed in detail in Chapter 3:

  Y = β0 + β1 X1 + ... + βp Xp + ε

- Model assessment: cross-validation (prediction) error, in Chapter 5.
- This chapter is about model selection for linear models.
- The model selection techniques can be extended beyond linear models.
- Details about AIC, BIC, and Mallows' Cp are mentioned in Chapter 3.
Model Selection 20
Feature/variable selection

- Not all available input variables are useful for predicting the output.
- Keeping redundant inputs in the model can lead to poor prediction and poor interpretation.
- We consider three ways of variable/model selection:
  - Subset selection.
  - Shrinkage/regularization: shrinking some regression parameters toward 0, e.g. ridge and lasso.
  - Dimension reduction: using "derived inputs", for example via the principal component approach.
Model Selection 21
Best subset selection
- Exhaust all possible combinations of inputs.
- With p variables, there are 2^p distinct combinations.
- Identify the best model among these candidate models.
Model Selection 22
Pros and Cons of best subset selection

- Seems straightforward to carry out.
- Conceptually clear.
- But the search space is too large (2^p models), which may lead to overfitting.
- Computationally infeasible: too many models to run.
- If p = 20, there are 2^20 > 1,000,000 models.
Model Selection 23
Forward stepwise selection

- Start with the null model.
- Find the best one-variable model.
- From the best one-variable model, add one more variable to get the best two-variable model.
- From the best two-variable model, add one more variable to get the best three-variable model.
- ...
- Finally, select the best among all these best k-variable models.
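The greedy search above can be sketched in numpy, using training RSS to rank candidate additions at each step (as in the slides; the final choice among the nested models would then use CV, AIC, BIC, etc.). Function names are illustrative.

```python
import numpy as np

def rss(X, y, cols):
    """Training RSS of the least-squares fit on the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ beta) ** 2)

def forward_stepwise(X, y):
    """Return the best k-variable model (as column indices) for each k."""
    p = X.shape[1]
    selected, path = [], []
    remaining = list(range(p))
    for _ in range(p):
        # greedily add the variable that most reduces RSS
        best = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected = selected + [best]
        remaining.remove(best)
        path.append(list(selected))
    return path   # choose among these p nested models by CV, AIC, BIC, ...
```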
Model Selection 24
Pros and Cons of forward stepwise selection

- Less computation.
- Fewer models: 1 + sum_{k=0}^{p-1} (p − k) = 1 + p(p + 1)/2 models, counting the null model.
- If p = 20, only 211 models, compared with more than 1 million for best subset selection.
- No problem for the first n steps even if p > n.
- Once an input is in, it does not get out.
Model Selection 25
Backward stepwise selection

- Start with the largest model (all p inputs in).
- Find the best (p − 1)-variable model by removing one variable from the largest model.
- Find the best (p − 2)-variable model by removing one variable from the best (p − 1)-variable model.
- Find the best (p − 3)-variable model by removing one variable from the best (p − 2)-variable model.
- ...
- Find the best 1-variable model by removing one variable from the best 2-variable model.
- End with the null model.
Model Selection 26
Pros and Cons of backward stepwise selection

- Less computation.
- Fewer models: 1 + sum_{k=0}^{p-1} (p − k) = 1 + p(p + 1)/2 models, counting the null model.
- If p = 20, only 211 models, compared with more than 1 million for best subset selection.
- Once an input is out, it does not get back in.
- Not applicable to the case p > n.
Model Selection 27
Find the best model based on prediction error

- General approach: validation/cross-validation (addressed in ISLR Chapter 5).
- Model-based approach: adjusted R², AIC, BIC, or Cp (ISLR Chapter 3).
Model Selection 28
R-squared

- Residual:

  ε̂_i = y_i − β̂0 − sum_{j=1}^p β̂_j x_ij

- Residual Sum of Squares, the training error:

  RSS = sum_{i=1}^n ε̂_i² = sum_{i=1}^n (y_i − β̂0 − sum_{j=1}^p β̂_j x_ij)²

- R-squared:

  R² = 1 − RSS/TSS

  where the Total Sum of Squares TSS := sum_{i=1}^n (y_i − ȳ)².
Model Selection 29
Example: Credit data
Figure: 6.1. For each possible model containing a subset of the ten predictors in the Credit data set, the RSS and R² are displayed. The red frontier tracks the best model for a given number of predictors, according to RSS and R². Though the data set contains only ten predictors, the x-axis ranges from 1 to 11, since one of the variables is categorical and takes on three values, leading to the creation of two dummy variables.
Model Selection 30
The issues of R-squared

- R-squared is the percentage of the total variation in the response explained by the inputs.
- R-squared reflects the training error.
- However, a model with larger R-squared is not necessarily better than a model with smaller R-squared when we consider test error!
- If model A contains all the inputs of model B, then model A's R-squared is always at least as large as that of model B.
- If model A's additional inputs are entirely uncorrelated with the response, model A contains more noise than model B. As a result, prediction based on model A will tend to be no better, and often poorer.
Model Selection 31
a) Adjusted R-squared

- The adjusted R-squared, taking into account the degrees of freedom, is defined as

  adjusted R² = 1 − [RSS/(n − p − 1)] / [TSS/(n − 1)]

- With more inputs, R² always increases, but the adjusted R² can decrease, since adding inputs is penalized through the smaller residual degrees of freedom.
- Maximizing the adjusted R-squared is equivalent to minimizing RSS/(n − p − 1).
Model Selection 32
b) Mallows' Cp

- Recall that our linear model has p covariates, and σ̂² = RSS/(n − p − 1) is the unbiased estimator of σ² based on the full model.
- Suppose we use only d predictors, and RSS(d) is the residual sum of squares for the linear model with those d predictors.
- Mallows' Cp statistic is defined as

  Cp = (RSS(d) + 2 d σ̂²) / n

- The smaller Mallows' Cp, the better the model.
- The following AIC is more often used, though Mallows' Cp and AIC usually select the same best model.
Model Selection 33
c) AIC

- AIC stands for the Akaike information criterion, defined as

  AIC = (RSS(d) + 2 d σ̂²) / (n σ̂²),

  for a linear model with d ≤ p predictors, where σ̂² = ‖y − ŷ‖²/(n − p − 1) is the unbiased estimator of σ² using the full model.
- AIC aims at maximizing the expected predictive likelihood (or minimizing the expected prediction error). The model with the smallest AIC is preferred.
Model Selection 34
d) BIC

- BIC stands for Schwarz's Bayesian information criterion, defined as

  BIC = (RSS(d) + d σ̂² log(n)) / (n σ̂²),

  for a linear model with d inputs.
- The model with the smallest BIC is preferred. The derivation of BIC comes from Bayesian statistics and has a Bayesian interpretation. BIC is formally similar to AIC, but penalizes models with more inputs more heavily (since log(n) > 2 once n > 7).
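The four criteria above can be computed together from the RSS of a submodel, using σ̂² from the full model exactly as the slides define it. A minimal numpy sketch; the function name is mine.

```python
import numpy as np

def criteria(X, y, cols):
    """Adjusted R^2, Mallows' Cp, AIC, BIC (as defined on these slides)
    for the submodel using columns `cols`."""
    n, p = X.shape

    def fit_rss(c):
        A = np.column_stack([np.ones(n)] + [X[:, j] for j in c])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        return np.sum((y - A @ beta) ** 2)

    sigma2 = fit_rss(range(p)) / (n - p - 1)   # unbiased, from the full model
    d = len(cols)
    rss_d = fit_rss(cols)
    tss = np.sum((y - y.mean()) ** 2)
    adj_r2 = 1 - (rss_d / (n - d - 1)) / (tss / (n - 1))
    cp = (rss_d + 2 * d * sigma2) / n
    aic = (rss_d + 2 * d * sigma2) / (n * sigma2)
    bic = (rss_d + d * sigma2 * np.log(n)) / (n * sigma2)
    return adj_r2, cp, aic, bic
```

Adjusted R² is maximized, while Cp, AIC, and BIC are minimized, over the candidate submodels.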
Model Selection 35
Penalized log-likelihood

- In general, AIC/BIC are penalized maximum likelihood criteria; e.g. BIC aims to

  minimize −(log-likelihood) + d log(n)/n,

  where the first term is related to the deviance (some define the deviance as −2 log-likelihood). In the case of linear regression with normal errors, the deviance is, up to additive constants, equivalent to n log(σ̂²).
Model Selection 36
Example: credit dataset
Variables | Best subset                    | Forward stepwise
one       | rating                         | rating
two       | rating, income                 | rating, income
three     | rating, income, student        | rating, income, student
four      | cards, income, student, limit  | rating, income, student, limit

TABLE 6.1. The first four selected models for best subset selection and forward stepwise selection on the Credit data set. The first three models are identical but the fourth models differ.
Model Selection 37
Example
Figure: 6.3. For the Credit data set, three quantities are displayed for the best model containing d predictors, for d ranging from 1 to 11. The overall best model, based on each of these quantities, is shown as a blue cross. Left: Square root of BIC. Center: Validation set errors (75% training data). Right: 10-fold cross-validation errors.
Model Selection 38
The one standard deviation rule

- In the figure above, the model with 6 inputs does not seem much better than the models with 4 or 3 inputs.
- Keep in mind Occam's razor: choose the simplest model if the candidates are similar by other criteria.
Model Selection 39
The one standard deviation rule

- Calculate the standard error of the estimated test MSE for each model size.
- Consider the models whose estimated test MSE is within one standard error of the smallest test MSE.
- Among them, select the one with the smallest model size.
- (Applying this rule to the example in Figure 6.3 gives the model with 3 variables.)
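The rule can be written as a one-liner over the per-size CV estimates. A small sketch under the assumption that `mse[k]` and `se[k]` are the CV error and its standard error for the model of size k + 1.

```python
import numpy as np

def one_sd_rule(mse, se):
    """Pick the smallest model whose CV error is within one standard
    error of the minimum; returns the model size (1-indexed)."""
    mse, se = np.asarray(mse), np.asarray(se)
    k_min = np.argmin(mse)
    threshold = mse[k_min] + se[k_min]
    # argmax on a boolean array returns the first index under the threshold
    return int(np.argmax(mse <= threshold)) + 1
```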
Model Selection 40
Ridge Regression

- The least squares estimator β̂ minimizes

  RSS = sum_{i=1}^n (y_i − β0 − sum_{j=1}^p β_j x_ij)²

- The ridge regression estimator β̂R_λ minimizes

  sum_{i=1}^n (y_i − β0 − sum_{j=1}^p β_j x_ij)² + λ sum_{j=1}^p β_j²

  where λ ≥ 0 is a tuning parameter.
- The first term measures goodness of fit: the smaller, the better.
- The second term, λ sum_{j=1}^p β_j², is called the shrinkage penalty, which shrinks the β_j toward 0.
- The shrinkage reduces variance (at the cost of increased bias)!
Model Selection 41
Tuning parameter λ

- λ = 0: no penalty; β̂R_0 = β̂LS.
- λ = ∞: infinite penalty; β̂R_∞ = 0.
- Large λ: heavy penalty, more shrinkage of the estimator.
- Note that β0 is not penalized.
Model Selection 42
Remark

- If p > n, ridge regression can still perform well, trading a small increase in bias for a large decrease in variance.
- Ridge regression works best in situations where the least squares estimates have high variance.
- Ridge regression also has substantial computational advantages:

  β̂R_λ = (X^T X + λ I)^{-1} X^T y

  where I is the (p + 1) × (p + 1) diagonal matrix with diagonal elements (0, 1, 1, ..., 1).
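This closed form translates directly into a few lines of numpy. A minimal sketch; the intercept column of ones corresponds to the 0 in the diagonal of I, so β0 is left unpenalized.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate; the intercept is not penalized."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])   # prepend the intercept column
    I = np.eye(p + 1)
    I[0, 0] = 0.0                          # diagonal (0, 1, 1, ..., 1)
    return np.linalg.solve(A.T @ A + lam * I, A.T @ y)
```

With λ = 0 this reproduces least squares; as λ grows, the slope coefficients shrink toward 0.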
Model Selection 43
Example: Ridge Regularization Path in Credit data
Figure: 6.4. The standardized ridge regression coefficients (Income, Limit, Rating, Student) are displayed for the Credit data set, as a function of λ (left) and of ‖β̂R_λ‖₂/‖β̂‖₂ (right), where ‖a‖₂ = sqrt(sum_{j=1}^p a_j²).
Model Selection 44
The Lasso
- Lasso stands for Least Absolute Shrinkage and Selection Operator.
- The lasso estimator β̂L_λ is the minimizer of

  sum_{i=1}^n (y_i − β0 − sum_{j=1}^p β_j x_ij)² + λ sum_{j=1}^p |β_j|

- We may write ‖β‖₁ = sum_{j=1}^p |β_j|, the ℓ1 norm.
- The lasso often shrinks coefficients to be identically 0 (this is not the case for ridge).
- Hence it performs variable selection and yields sparse models.
Model Selection 45
Example: Lasso Path in Credit data.
Figure: 6.6. The standardized lasso coefficients (Income, Limit, Rating, Student) on the Credit data set are shown as a function of λ (left) and of ‖β̂L_λ‖₁/‖β̂‖₁ (right).
Model Selection 46
Another formulation

- For the lasso: minimize

  sum_{i=1}^n (y_i − β0 − sum_{j=1}^p β_j x_ij)²  subject to  sum_{j=1}^p |β_j| ≤ s

- For ridge: minimize

  sum_{i=1}^n (y_i − β0 − sum_{j=1}^p β_j x_ij)²  subject to  sum_{j=1}^p β_j² ≤ s

- For ℓ0: minimize

  sum_{i=1}^n (y_i − β0 − sum_{j=1}^p β_j x_ij)²  subject to  sum_{j=1}^p I(β_j ≠ 0) ≤ s

  The ℓ0 method penalizes the number of non-zero coefficients; it is a difficult (NP-hard) optimization problem.
Model Selection 47
Variable selection property for Lasso
Figure: 6.7. Contours of the error and constraint functions for the lasso (left) and ridge regression (right). The solid blue areas are the constraint regions, |β1| + |β2| ≤ s and β1² + β2² ≤ s, while the red ellipses are the contours of the RSS.
Model Selection 48
Simple cases

- Consider the simple model y_i = β_i + ε_i, i = 1, ..., n, with n = p. Then:
  - Least squares: β̂_j = y_j;
  - Ridge: β̂R_j = y_j/(1 + λ);
  - Lasso: β̂L_j = sign(y_j)(|y_j| − λ/2)₊.
- Slightly more generally, suppose the input columns of X are standardized to mean 0 and variance 1 and are orthogonal. Then

  β̂R_j = β̂LSE_j / (1 + λ)

  β̂L_j = sign(β̂LSE_j)(|β̂LSE_j| − λ/2)₊

  for j = 1, ..., p.
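These two closed forms make the contrast concrete: ridge shrinks every coefficient proportionally, while the lasso soft-thresholds and sets small coefficients exactly to 0. A minimal sketch of the orthonormal-design case; function names are illustrative.

```python
import numpy as np

def ridge_shrink(b, lam):
    """Ridge in the orthonormal-design case: proportional shrinkage."""
    return b / (1 + lam)

def lasso_shrink(b, lam):
    """Lasso in the orthonormal-design case: soft-thresholding at lam/2."""
    return np.sign(b) * np.maximum(np.abs(b) - lam / 2, 0.0)
```

Note that `lasso_shrink` maps any coefficient with |b| ≤ λ/2 to exactly 0, which is the variable selection property; `ridge_shrink` never produces an exact zero for a nonzero input.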
Model Selection 49
Example
Figure: 6.10. The ridge regression and lasso coefficient estimates for a simple setting with n = p and X a diagonal matrix with 1's on the diagonal. Left: The ridge regression coefficient estimates are shrunken proportionally towards zero, relative to the least squares estimates. Right: The lasso coefficient estimates are soft-thresholded towards zero.
Model Selection 50
Dimension reduction methods (using derived inputs)

- When p is large, we may consider regressing not on the original inputs x, but on a small number of derived features φ1, ..., φk with k < p:

  y_i = θ0 + sum_{j=1}^k θ_j φ_j(x_i) + ε_i,  i = 1, ..., n.

  - φ_j can be linear: linear combinations of X1, ..., Xp.
  - φ_j can be nonlinear: basis functions, kernels, neural networks, trees, etc.
Model Selection 51
Principal Component Analysis (PCA)

- Suppose there are n observations of p variables presented as X = (x1, ..., xn)^T ∈ R^{n×p}, where each x_i ∈ R^p.
- Define the sample covariance matrix

  Σ̂ = (1/n) sum_{i=1}^n (x_i − µ̂)(x_i − µ̂)^T,

  where the sample mean µ̂ = (1/n) sum_i x_i.
- Σ̂ has an eigenvalue decomposition

  Σ̂ = U Λ U^T,

  with U^T U = I_p (U = [u1, ..., up]), Λ = diag(λ1, ..., λp), λ1 ≥ λ2 ≥ ... ≥ λp ≥ 0.
Model Selection 52
Principal Component Regression

For X = (X1, ..., Xp):

- Define φ_j as the projection onto the j-th eigenvector of the centered data:

  Z_j = φ_j(X) = u_j^T (X − µ̂)

- Principal Component Regression (PCR) model:

  y_i = θ0 + sum_{j=1}^k θ_j z_ij + ε_i,  i = 1, ..., n.
Model Selection 53
A summary table of PCs

           eigenvalue   eigenvector      percent of        P.C.s as
           (variance)   (combination     variation         projections
                        coefficients)    explained         of X − µ̂
1st P.C.   Z1: λ1       u1               λ1 / sum_j λj     Z1 = u1^T (X − µ̂)
2nd P.C.   Z2: λ2       u2               λ2 / sum_j λj     Z2 = u2^T (X − µ̂)
...
p-th P.C.  Zp: λp       up               λp / sum_j λj     Zp = up^T (X − µ̂)

- The top k principal components explain the following fraction of the total variation:

  sum_{j=1}^k λj / trace(Σ̂)
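The whole pipeline (covariance, eigendecomposition, scores, regression on the top-k scores) fits in a short numpy sketch. A minimal illustration; the function name and returned quantities are my choices.

```python
import numpy as np

def pcr(X, y, k):
    """Principal Component Regression: regress y on the top-k PC scores."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = Xc.T @ Xc / len(X)               # sample covariance (1/n convention)
    lam, U = np.linalg.eigh(Sigma)           # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]            # reorder to lam1 >= ... >= lamp
    lam, U = lam[order], U[:, order]
    Z = Xc @ U[:, :k]                        # scores z_ij = u_j^T (x_i - mu)
    A = np.column_stack([np.ones(len(y)), Z])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    explained = lam[:k].sum() / lam.sum()    # fraction of variation explained
    return theta, U[:, :k], mu, explained
```

With k = p, PCR reproduces the ordinary least squares fit; smaller k trades some fit for lower variance, with λ chosen here implicitly through k (typically by cross-validation).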
Model Selection 54