Lecture 15: Model Selection and Validation, Cross-Validation
Hao Helen Zhang
Fall, 2017
Outline
Read: ESL 7.4, 7.10
Goal: better understand cross-validation (CV)
In general,
the linear regression method like OLS has low bias (if the true model is linear) but high variance (especially if the data dimension is high), leading to poor predictions.
Modern methods, such as LASSO and ridge regression, solve
$$\min_{\beta} \|y - X\beta\|_2^2 + \lambda J(\beta).$$
They introduce some bias to reduce variance, leading to better prediction accuracy.
$J$ is called a penalty function. Regularization can help to achieve sparsity or smoothness in estimation.
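To make this concrete, here is a minimal sketch, assuming scikit-learn is available (the synthetic data, penalty weights, and estimator choices are illustrative, not from the lecture):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                 # sparse truth: only 3 active coefficients
y = X @ beta + rng.normal(scale=0.5, size=n)

ridge = Ridge(alpha=1.0).fit(X, y)          # J(beta) = ||beta||_2^2: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)          # J(beta) = ||beta||_1: can zero out coefficients (sparsity)
print("nonzero LASSO coefficients:", np.sum(lasso.coef_ != 0))
```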
Model Performance Measurements
Given the data $(x_i, y_i)$, $i = 1, \ldots, n$, where $x_i$ is the input and $y_i$ is the response, we assume
$$y_i = f(x_i) + \varepsilon_i, \quad i = 1, \ldots, n.$$
We fit the model using some method and obtain $\hat{f}$. If the model is indexed by a tuning parameter, we denote the fitted model by $\hat{f}_\alpha$ or $\hat{f}_\lambda$.
For example,
the OLS estimator: $\hat{f}(x) = x^T \hat{\beta}^{\mathrm{ols}}$
the LASSO estimator: $\hat{f}_\lambda(x) = x^T \hat{\beta}^{\mathrm{lasso}}_\lambda$
Bias-Variance Trade-off
As model complexity increases, the model tends to
adapt to more complicated underlying structures (decrease in bias)
have increased estimation error (increase in variance)
In between there is an optimal model complexity, giving the minimum test error.
Role of the tuning parameter $\alpha$ or $\lambda$:
The tuning parameter varies the model complexity.
Find the best $\alpha$ to produce the minimum test error.
Model Performance Evaluation
How do we evaluate the performance of $\hat{f}$? Consider
its generalization performance, i.e.,
its prediction capability on an independent set of data, the so-called "test data"
Model assessment is extremely important, as it
measures the quality of the final model.
allows us to compare different methods and make an optimal choice.
guides the choice of tuning parameters in regularization methods.
Model Selection and Assessment
Two Goals:
Model Selection: estimate the performance of different models and choose the best one.
Model Assessment: after choosing a final model, estimate its prediction error (generalization performance) on new data.
Common Techniques
The following are commonly used to estimate the prediction error:
Validation set (data-rich situation)
Resampling methods:
cross-validation
bootstrap
Analytical estimation criteria
Next, we will introduce their basic ideas.
See Efron (2004, JASA) for more details.
Different Types of Errors
In machine learning, we encounter different types of "errors":
training error
test error, prediction error
tuning error
cross-validation error
Example: If $Y$ is continuous, two commonly used loss functions are
squared error: $L(Y, \hat{f}(X)) = (Y - \hat{f}(X))^2$
absolute error: $L(Y, \hat{f}(X)) = |Y - \hat{f}(X)|$
Training error and Test error
Training error: the average loss over the training samples
TrainErr =1
n
n∑i=1
L(yi , f (xi ))
Prediction error: the expected error over all the input
PE = E(X,Y ′,data)[1
n
n∑i=1
L(Y ′i , f (Xi ))],
where Y ′i = f (Xi ) + ε′i , i = 1, · · · , n, are independent ofy1, · · · , yn. Expectation is taken w.r.t. all random quantities.Test error: Given a test data set (xi , y
′i ) for i = 1, . . . , n′,
TestErr =1
n′
n′∑i=1
L(Y ′i , f (Xi )).
TestErr is an estimate of PE. The larger n′, the betterestimate.
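A small simulation (illustrative, not from the lecture) computing TrainErr and TestErr under squared-error loss; with a large test set, TestErr approximates PE well, and the gap between the two errors grows with model complexity:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * x)                       # true regression function

x_tr = rng.uniform(0, 3, size=50)
y_tr = f(x_tr) + rng.normal(scale=0.3, size=50)
x_te = rng.uniform(0, 3, size=5000)               # large n' gives a good estimate of PE
y_te = f(x_te) + rng.normal(scale=0.3, size=5000)

for degree in (1, 5, 15):
    P = PolynomialFeatures(degree)
    model = LinearRegression().fit(P.fit_transform(x_tr[:, None]), y_tr)
    tr_err = np.mean((y_tr - model.predict(P.transform(x_tr[:, None]))) ** 2)
    te_err = np.mean((y_te - model.predict(P.transform(x_te[:, None]))) ** 2)
    print(f"degree {degree:2d}: TrainErr {tr_err:.3f}, TestErr {te_err:.3f}")
```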
Classification Problems: Y is discrete
If $Y$ takes values in $\{1, \ldots, K\}$, the common loss functions are
0-1 loss: $L(Y, \hat{f}(X)) = I(Y \neq \hat{f}(X))$
log-likelihood loss: $L(Y, \hat{p}(X)) = -2 \sum_{k=1}^K I(Y = k) \log \hat{p}_k(X) = -2 \log \hat{p}_Y(X)$
With the log-likelihood loss, the training error is the sample log-likelihood
$$\mathrm{TrainErr} = -\frac{2}{n} \sum_{i=1}^n \log \hat{p}_{y_i}(x_i)$$
The log-likelihood loss is also referred to as cross-entropy loss or deviance. It applies to general densities (Poisson, gamma, exponential, log-normal):
$$L(Y, \theta(X)) = -2 \log \Pr_{\theta(X)}(Y).$$
With the 0-1 loss, the test error is the expected misclassification rate.
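A minimal sketch (toy numbers, assumed class probabilities) of computing the 0-1 loss and the deviance for a classifier that outputs class probabilities:

```python
import numpy as np

def zero_one_error(y, y_hat):
    # average 0-1 loss = misclassification rate
    return np.mean(y != y_hat)

def deviance(y, probs):
    # -(2/n) * sum of log probabilities assigned to the observed classes
    return -2.0 * np.mean(np.log(probs[np.arange(len(y)), y]))

y = np.array([0, 1, 1])                           # observed classes (K = 2)
probs = np.array([[0.8, 0.2],                     # predicted p_hat_k(x_i)
                  [0.3, 0.7],
                  [0.6, 0.4]])
y_hat = probs.argmax(axis=1)                      # classify to the most probable class
print(zero_one_error(y, y_hat), deviance(y, probs))
```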
Training Error vs Test Error
Training error can be easily calculated by applying the statistical learning method to the training set.
Test error is the average error of a statistical learning method when it is used to predict the response for a new observation, one that was not used in training the method.
TrainErr is not a good estimate of TestErr:
The training error rate is often quite different from the test error rate.
TrainErr can dramatically underestimate TestErr.
[Figure (cf. ESL Figure 7.1): behavior of test-sample and training-sample error as model complexity is varied; training error decreases steadily with complexity, while test error is U-shaped.]
What is Wrong with Training Error
TrainErr is not an objective measure of the generalization performance of the method, since $\hat{f}$ is built from the training data.
TrainErr decreases with model complexity.
Training error drops to zero if the model is complex enough.
A model with zero training error overfits the training data; over-fitted models typically generalize poorly.
Question: How do we estimate TestErr accurately? We consider
the data-rich situation
the data-insufficient situation
Data-Rich Situation
Key Idea: Estimate the test error by holding out a subset of the training observations from the fitting process, and then applying the statistical learning method to those held-out observations.
Validation-set Approach: without tuning
Randomly divide the available set of samples into two parts: a training set and a validation (hold-out) set.
The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set.
The resulting validation-set error provides an estimate of the test error. It is typically assessed using MSE for a quantitative response and the misclassification rate for a qualitative (discrete) response.
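A minimal sketch of the validation-set approach (assuming scikit-learn; the synthetic data and 25% hold-out fraction are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

# hold out 25% of the samples as the validation set
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
fhat = LinearRegression().fit(X_tr, y_tr)              # fit on the training part only
val_mse = np.mean((y_val - fhat.predict(X_val)) ** 2)  # validation-set estimate of test error
print("validation MSE:", val_mse)
```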
Validation-set Approach: with tuning
Randomly divide the dataset into three parts:
training set: to fit the models
validation (tuning) set: to estimate the prediction error for model selection
test set: to assess the generalization error of the final chosen model
Typical proportions: 50%, 25%, 25%, respectively.
Drawbacks of Validation-set Approach
The validation estimate of the test error can be highly variable, depending on precisely which observations are included in the training set and which are included in the validation set.
In the validation approach, only a subset of the observations (those in the training set rather than the validation set) are used to fit the model.
This suggests that the validation-set error may tend to overestimate the test error for the model fit on the entire data set.
Data-Insufficient Situation
How much training data is enough?
It depends on the signal-to-noise ratio of the underlying function.
It depends on the complexity of the models.
It is too difficult to give a general rule.
Resampling Methods
These methods refit a model of interest to samples formed from the training set, in order to obtain additional information about the fitted model.
Nonparametric methods: efficient sample reuse
cross-validation, bootstrap techniques
They provide estimates of test-set prediction error, and of the standard deviation and bias of our parameter estimates.
K-fold Cross Validation
The simplest and most popular way to estimate the test error:
1. Randomly split the data into K roughly equal parts.
2. For each k = 1, ..., K:
leave the kth part out and fit the model using the other K − 1 parts; denote the solution by $\hat{f}^{-k}(x)$
calculate the prediction error of $\hat{f}^{-k}(x)$ on the kth (left-out) part
3. Average the errors.
Define the index function (allocating fold memberships)
$$\kappa: \{1, \ldots, n\} \to \{1, \ldots, K\}.$$
Then the K-fold cross-validation estimate of the prediction error is
$$\mathrm{CV} = \frac{1}{n} \sum_{i=1}^n L(y_i, \hat{f}^{-\kappa(i)}(x_i))$$
Typical choice: K = 5 or 10.
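A minimal sketch of K-fold CV (assuming scikit-learn for the fold splitting; OLS and the synthetic data are illustrative stand-ins for $\hat{f}$):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=100)

sq_err = np.empty(len(y))
for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    fhat = LinearRegression().fit(X[tr], y[tr])      # fit f^{-k} on the other K-1 parts
    sq_err[te] = (y[te] - fhat.predict(X[te])) ** 2  # loss on the kth (left-out) part
print("5-fold CV estimate:", sq_err.mean())          # CV = (1/n) sum_i L(y_i, f^{-kappa(i)}(x_i))
```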
Leave-One-Out (LOO) Cross Validation
If K = n, we have leave-one-out cross-validation:
$$\kappa(i) = i$$
LOO-CV is approximately unbiased for the true prediction error.
LOO-CV can have high variance because the n "training sets" are so similar to one another.
Computation:
The computational burden is generally considerable, requiring n model fits.
For special models such as linear smoothers, the computation can be done quickly, e.g., via Generalized Cross Validation (GCV).
Statistical Properties of Cross Validation Error Estimates
Cross-validation error is very nearly unbiased for the test error.
The slight bias arises because the training set in cross-validation is slightly smaller than the actual data set (e.g., for LOOCV the training-set size is n − 1 when there are n observed cases).
In nearly all situations, the effect of this bias is conservative, i.e., the estimated fit is slightly biased in the direction suggesting a poorer fit. In practice, this bias is rarely a concern.
The variance of the CV error can be large.
In practice, to reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
Choose α
Given the tuning parameter $\alpha$, denote the fold-$k$ model fit by $\hat{f}^{-k}(x, \alpha)$. The $\mathrm{CV}(\alpha)$ curve is
$$\mathrm{CV}(\alpha) = \frac{1}{n} \sum_{i=1}^n L(y_i, \hat{f}^{-\kappa(i)}(x_i, \alpha))$$
$\mathrm{CV}(\alpha)$ provides an estimate of the test error curve.
We find the best parameter $\hat{\alpha}$ by minimizing $\mathrm{CV}(\alpha)$.
The final model is $\hat{f}(x, \hat{\alpha})$.
Performance of CV
Example: a two-class classification problem
Linear model with best-subsets regression of subset size $p$
In Figure 7.9 ($p$ is the tuning parameter):
both curves pick the best $p = 10$ (the true model).
the CV curve is flat beyond $p = 10$.
The standard error bars are the standard errors of the individual misclassification error rates.
Standard Errors for K-fold CV
Let $\mathrm{CV}_k(\hat{f}_\alpha)$, or simply $\mathrm{CV}_k(\alpha)$, denote the validation error of $\hat{f}_\alpha$ on the $k$th fold, where $\alpha$ ranges over a set of candidate values $\{\alpha_1, \ldots, \alpha_m\}$. Then we have:
The cross-validation error
$$\mathrm{CV}(\alpha) = \frac{1}{K} \sum_{k=1}^K \mathrm{CV}_k(\alpha).$$
The standard error of $\mathrm{CV}(\alpha)$ is calculated as
$$\mathrm{SE}(\alpha) = \sqrt{\widehat{\mathrm{Var}}(\mathrm{CV}(\alpha))}.$$
One-standard Error for K-fold CV
Let
$$\hat{\alpha} = \arg\min_{\alpha \in \{\alpha_1, \ldots, \alpha_m\}} \mathrm{CV}(\alpha).$$
First we find $\hat{\alpha}$; then we move $\alpha$ in the direction of increasing regularization as far as we can, as long as
$$\mathrm{CV}(\alpha) \le \mathrm{CV}(\hat{\alpha}) + \mathrm{SE}(\hat{\alpha}).$$
That is, choose the most parsimonious model whose error is no more than one standard error above the best error.
Idea: "All else being equal (up to one standard error), go for the simplest model."
In this example, the one-SE rule would choose p = 9.
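A sketch putting the pieces together (assuming scikit-learn; the LASSO fitter and the grid of $\alpha$ values are illustrative): compute $\mathrm{CV}_k(\alpha)$ across folds, form $\mathrm{CV}(\alpha)$ and $\mathrm{SE}(\alpha)$, then apply the one-SE rule:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
y = X[:, 0] - X[:, 1] + rng.normal(size=100)

alphas = np.logspace(-3, 0, 20)            # candidate values, increasing regularization
K = 5
folds = list(KFold(n_splits=K, shuffle=True, random_state=0).split(X))
cv_k = np.empty((len(alphas), K))          # CV_k(alpha) for each candidate and fold
for j, a in enumerate(alphas):
    for k, (tr, te) in enumerate(folds):
        f = Lasso(alpha=a).fit(X[tr], y[tr])
        cv_k[j, k] = np.mean((y[te] - f.predict(X[te])) ** 2)

cv = cv_k.mean(axis=1)                               # CV(alpha)
se = cv_k.std(axis=1, ddof=1) / np.sqrt(K)           # the usual (approximate) SE across folds
best = int(cv.argmin())                              # alpha_hat minimizes CV(alpha)
# one-SE rule: most regularized alpha with CV(alpha) <= CV(alpha_hat) + SE(alpha_hat)
one_se = max(j for j in range(len(alphas)) if cv[j] <= cv[best] + se[best])
print("min-CV alpha:", alphas[best], " one-SE alpha:", alphas[one_se])
```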
[Figure 7.9 (ESL): prediction error and tenfold cross-validation curve with standard error bars, estimated from a single training set; x-axis: Subset Size p (5 to 20), y-axis: Misclassification Error (0.0 to 0.6).]
The Wrong and Right Way
Consider a simple classifier for microarrays:
1. Starting with 5,000 genes, find the top 200 genes having the largest correlation with the class label.
2. Carry out nearest-centroid classification using the top 200 genes.
How do we select the tuning parameter in the classifier?
Way 1: apply cross-validation to step 2 only.
Way 2: apply cross-validation to steps 1 and 2.
Which is right? Cross-validating the whole procedure (Way 2).
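A sketch of the right way (Way 2), with the gene screening performed inside each CV fold (scikit-learn's NearestCentroid serves as an illustrative classifier; the data are pure noise, so an honest CV error should be near 50%):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(5)
n, p, top = 50, 5000, 200
X = rng.normal(size=(n, p))                # pure noise: no gene is truly informative
y = rng.integers(0, 2, size=n)

def screen(X_tr, y_tr, top):
    # step 1 on the TRAINING folds only: correlation of each gene with the label
    Xc = X_tr - X_tr.mean(axis=0)
    yc = y_tr - y_tr.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(corr)[-top:]

errs = []
for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    keep = screen(X[tr], y[tr], top)       # screening inside the loop (Way 2)
    clf = NearestCentroid().fit(X[tr][:, keep], y[tr])
    errs.append(np.mean(clf.predict(X[te][:, keep]) != y[te]))
print("honest CV error:", np.mean(errs))   # close to 0.5, as it should be
```

Screening on the full data first (Way 1) would let the held-out labels leak into the gene selection and produce a wildly optimistic CV error.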
Generalized Cross Validation
If a linear fitting method $\hat{y} = Sy$ satisfies
$$y_i - \hat{f}^{-i}(x_i) = \frac{y_i - \hat{f}(x_i)}{1 - S_{ii}},$$
then we do NOT need to fit the model $n$ times.
Using the approximation $S_{ii} \approx \mathrm{trace}(S)/n$, we obtain the approximate score
$$\mathrm{GCV} = \frac{1}{n} \sum_{i=1}^n \left[ \frac{y_i - \hat{f}(x_i)}{1 - \mathrm{trace}(S)/n} \right]^2$$
computational advantage
$\mathrm{GCV}(\alpha)$ is generally smoother than $\mathrm{CV}(\alpha)$
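A sketch for ridge regression, which is a linear smoother with $S = X(X^TX + \lambda I)^{-1}X^T$ (numpy only; the synthetic data and $\lambda$ grid are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(size=n)

def gcv_ridge(X, y, lam):
    # ridge is a linear smoother: y_hat = S y with S = X (X'X + lam I)^{-1} X'
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    resid = y - S @ y
    df = np.trace(S)                       # trace(S) = effective degrees of freedom
    return np.mean((resid / (1.0 - df / len(y))) ** 2)

lams = np.logspace(-2, 2, 30)
scores = [gcv_ridge(X, y, lam) for lam in lams]
print("GCV-selected lambda:", lams[int(np.argmin(scores))])
```

Note that one model fit per $\lambda$ suffices; no refitting over $n$ leave-one-out training sets is needed.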
Application Examples
For regression problems, apply CV to
shrinkage methods: select λ
best subset selection: select the best model
nonparametric methods: select the smoothing parameter
For classification problems, apply CV to
kNN: select k
SVM: select λ
Model-Based Methods: Information Criteria
Covariance penalties
Cp
AIC
BIC
Stein’s UBR
They can be used for both model selection and test errorestimation.
Generalized Information Criteria (GIC)
Information criteria take the general form
$$\mathrm{GIC}(\mathrm{model}) = -2 \cdot \mathrm{loglik} + \alpha \cdot \mathrm{df},$$
where the degrees of freedom (df) of $\hat{f}$ is the number of effective parameters of the model.
Examples: In linear regression settings, we use
$$\mathrm{AIC} = n \log(\mathrm{RSS}/n) + 2 \cdot \mathrm{df},$$
$$\mathrm{BIC} = n \log(\mathrm{RSS}/n) + \log(n) \cdot \mathrm{df},$$
where $\mathrm{RSS} = \sum_{i=1}^n [y_i - \hat{f}(x_i)]^2$ is the residual sum of squares.
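A sketch of best-subset selection scored by AIC/BIC in a linear regression (numpy only; the synthetic data and exhaustive subset enumeration are illustrative):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
n = 100
X = rng.normal(size=(n, 5))
y = 2 * X[:, 0] - X[:, 2] + rng.normal(size=n)

def ic(cols):
    Xs = np.column_stack([np.ones(n), X[:, list(cols)]])  # intercept + chosen predictors
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta) ** 2)                    # residual sum of squares
    df = Xs.shape[1]                                      # effective number of parameters
    aic = n * np.log(rss / n) + 2 * df
    bic = n * np.log(rss / n) + np.log(n) * df
    return aic, bic

subsets = [c for r in range(1, 6) for c in combinations(range(5), r)]
print("best by AIC:", min(subsets, key=lambda c: ic(c)[0]))
print("best by BIC:", min(subsets, key=lambda c: ic(c)[1]))
```

BIC's $\log(n)$ penalty is heavier than AIC's factor 2 once $n > e^2 \approx 7.4$, so BIC tends to select sparser models.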
Optimism of Training Error Rate
training error:
$$\overline{\mathrm{err}} = \frac{1}{n} \sum_{i=1}^n L(y_i, \hat{f}(x_i))$$
test error:
$$\mathrm{Err} = E_{(X, Y, \mathrm{data})}[L(Y, \hat{f}(X))].$$
in-sample error rate (an estimate of the prediction error):
$$\mathrm{Err}_{\mathrm{in}} = \frac{1}{n} \sum_{i=1}^n E_{y^{\mathrm{new}}} E_{y} L(y_i^{\mathrm{new}}, \hat{f}(x_i))$$
Optimism:
$$\mathrm{op} = \mathrm{Err}_{\mathrm{in}} - E_{y}(\overline{\mathrm{err}})$$
Optimism
$\mathrm{op}$ is typically positive. (Why?)
For squared-error, 0-1, and other loss functions,
$$\mathrm{op} = \frac{2}{n} \sum_{i=1}^n \mathrm{Cov}(\hat{y}_i, y_i).$$
The amount by which $\overline{\mathrm{err}}$ underestimates the true error depends on how strongly $y_i$ is correlated with its prediction $\hat{y}_i$.
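A small Monte Carlo check of this formula (illustrative) for OLS with squared-error loss, where $\sum_i \mathrm{Cov}(\hat{y}_i, y_i) = \sigma^2\,\mathrm{trace}(H) = \sigma^2 p$, so $\mathrm{op} = 2p\sigma^2/n$:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, sigma = 50, 5, 1.0
X = rng.normal(size=(n, p))                      # fixed design
f_true = X @ rng.normal(size=p)                  # true mean function at the x_i
H = X @ np.linalg.solve(X.T @ X, X.T)            # OLS hat matrix, trace(H) = p

reps = 5000
ys = f_true + sigma * rng.normal(size=(reps, n)) # repeated draws of the training responses
yhats = ys @ H.T                                 # OLS fitted values for each draw

# Monte Carlo estimate of (2/n) * sum_i Cov(yhat_i, y_i)
cov_i = np.mean((yhats - yhats.mean(axis=0)) * (ys - ys.mean(axis=0)), axis=0)
print("empirical op:        ", 2 * cov_i.sum() / n)
print("theory 2*p*sigma^2/n:", 2 * p * sigma**2 / n)
```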