Cross-validation for detecting and preventing overfitting

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599
Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
A Regression Problem
[Figure: scatter plot of the data, y against x]
y = f(x) + noise
Can we learn f from this data?
Let’s consider three methods…
Linear Regression

Univariate linear regression with a constant term:

  X = [3, 1, ...]ᵀ   y = [7, 3, ...]ᵀ   i.e. x_1 = (3), ..., y_1 = 7, ...

Prepend a constant 1 to each input, z_k = (1, x_k):

  Z = [(1, 3), (1, 1), ...]   y = [7, 3, ...]   i.e. z_1 = (1, 3), ...

  β = (ZᵀZ)⁻¹(Zᵀy)
  yᵉˢᵗ = β₀ + β₁x
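As a quick illustration (not part of the original slides), the whole fit is a couple of lines of linear algebra in NumPy; the data values below are made up:

```python
import numpy as np

# Toy stand-ins for the slide's (x_k, y_k) records.
x = np.array([3.0, 1.0, 4.0, 2.0])
y = np.array([7.0, 3.0, 8.0, 5.0])

# Prepend the constant term: z_k = (1, x_k).
Z = np.column_stack([np.ones_like(x), x])

# beta = (Z^T Z)^{-1} (Z^T y); np.linalg.lstsq is the numerically
# safer equivalent in practice.
beta = np.linalg.solve(Z.T @ Z, Z.T @ y)

y_est = beta[0] + beta[1] * x   # fitted values
```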
Quadratic Regression

  X = [3, 1, ...]ᵀ   y = [7, 3, ...]ᵀ   i.e. x_1 = (3), ..., y_1 = 7, ...

Each input becomes z_k = (1, x_k, x_k²), so z_1 = (1, 3, 9), ...

  Z = [(1, 3, 9), (1, 1, 1), ...]   y = [7, 3, ...]

  β = (ZᵀZ)⁻¹(Zᵀy)
  yᵉˢᵗ = β₀ + β₁x + β₂x²
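The same sketch carries over with one extra basis column; the model is still linear in β (toy data again, not from the slides):

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 2.0])   # toy inputs, as before
y = np.array([7.0, 3.0, 8.0, 5.0])

# Basis expansion: z_k = (1, x_k, x_k^2).
Z = np.column_stack([np.ones_like(x), x, x**2])
beta = np.linalg.solve(Z.T @ Z, Z.T @ y)

y_est = Z @ beta   # beta_0 + beta_1*x + beta_2*x^2
```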
Much more about this in a future Andrew lecture: “Favorite Regression Algorithms”
Join-the-dots
[Figure: a fit joining every data point, y against x]

Also known as piecewise linear nonparametric regression, if that makes you feel better.
Which is best?
[Figures: two of the candidate fits to the same data, y against x]
Why not choose the method with the best fit to the data?
What do we really want?
[Figures: two of the candidate fits to the same data, y against x]
Why not choose the method with the best fit to the data?
“How well are you going to predict future data drawn from the same distribution?”
The test set method

[Figure: the data with 30% of the points held out as a test set; the linear fit to the training points is shown]

1. Randomly choose 30% of the data to be in a test set
2. The remainder is a training set
3. Perform your regression on the training set
4. Estimate your future performance with the test set

(Linear regression example)
Mean Squared Error = 2.4
The test set method

[Figure: quadratic fit to the training set, evaluated on the held-out test points]

(Quadratic regression example)
Mean Squared Error = 0.9
The test set method

[Figure: join-the-dots fit to the training set, evaluated on the held-out test points]

(Join the dots example)
Mean Squared Error = 2.2
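A minimal sketch of the four steps on synthetic data (the generating function and the split are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 0.8 * x + rng.normal(0.0, 1.5, 50)   # y = f(x) + noise

# Steps 1-2: randomly hold out 30% as the test set.
idx = rng.permutation(len(x))
n_test = int(0.3 * len(x))
test, train = idx[:n_test], idx[n_test:]

# Step 3: regress on the training set only (linear example).
Z = np.column_stack([np.ones_like(x[train]), x[train]])
beta = np.linalg.solve(Z.T @ Z, Z.T @ y[train])

# Step 4: estimate future performance as test-set mean squared error.
pred = beta[0] + beta[1] * x[test]
print("test-set MSE =", np.mean((y[test] - pred) ** 2))
```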
The test set method

Good news:
• Very, very simple
• Can then simply choose the method with the best test-set score

Bad news:
• Wastes data: we get an estimate of the best method to apply to 30% less data
• If we don’t have much data, our test set might just be lucky or unlucky

We say the test-set estimator of performance has high variance.
LOOCV (Leave-one-out Cross Validation)

For k = 1 to R:
1. Let (x_k, y_k) be the kth record
2. Temporarily remove (x_k, y_k) from the dataset
3. Train on the remaining R-1 datapoints
4. Note your error on (x_k, y_k)

When you’ve done all points, report the mean error.

[Figure: nine panels, one per held-out point, each showing the linear fit to the remaining points]

MSE_LOOCV = 2.12
LOOCV for Quadratic Regression

Same procedure: for each k, train on the other R-1 points and note the error on (x_k, y_k); report the mean.

[Figure: nine panels, each showing the quadratic fit to the remaining points]

MSE_LOOCV = 0.962
LOOCV for Join The Dots

Same procedure: for each k, train on the other R-1 points and note the error on (x_k, y_k); report the mean.

[Figure: nine panels, each showing the join-the-dots fit to the remaining points]

MSE_LOOCV = 3.33
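A sketch of the LOOCV loop as a generic routine; the fit/predict pair below is the linear regression from earlier, and the helper names are mine rather than the slides’:

```python
import numpy as np

def loocv_mse(x, y, fit, predict):
    """Hold out each record in turn, train on the remaining R-1
    points, and report the mean squared held-out error."""
    errs = []
    for k in range(len(x)):
        keep = np.arange(len(x)) != k              # remove record k
        model = fit(x[keep], y[keep])
        errs.append((y[k] - predict(model, x[k])) ** 2)
    return np.mean(errs)

def fit_linear(x, y):
    Z = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(Z.T @ Z, Z.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 2.0 + 0.8 * x + rng.normal(0.0, 1.5, 30)
print("MSE_LOOCV =", loocv_mse(x, y, fit_linear, lambda b, x: b[0] + b[1] * x))
```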
Which kind of Cross Validation?

Test-set:       Upside: cheap. Downside: variance; an unreliable estimate of future performance.
Leave-one-out:  Upside: doesn’t waste data. Downside: expensive, and has some weird behavior.

...can we get the best of both worlds?
k-fold Cross Validation

[Figure: the dataset split into k=3 colored partitions, with the fit trained on the other two partitions evaluated on each]

Randomly break the dataset into k partitions (in our example we’ll have k=3 partitions, colored red, green, and blue).

For the red partition: train on all the points not in the red partition. Find the test-set sum of errors on the red points.
For the green partition: train on all the points not in the green partition. Find the test-set sum of errors on the green points.
For the blue partition: train on all the points not in the blue partition. Find the test-set sum of errors on the blue points.

Then report the mean error:
Linear regression      MSE_3FOLD = 2.05
Quadratic regression   MSE_3FOLD = 1.11
Join-the-dots          MSE_3FOLD = 2.93
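The same skeleton with folds instead of single points; a sketch, where the random fold assignment and helper names are my own choices:

```python
import numpy as np

def kfold_mse(x, y, fit, predict, k=3, seed=0):
    """Randomly partition the data into k folds; train on the other
    k-1 folds, sum squared errors on the held-out fold, and report
    the mean error over all points."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(x)) % k      # random, near-equal folds
    sq_err = 0.0
    for fold in range(k):
        train, test = folds != fold, folds == fold
        model = fit(x[train], y[train])
        sq_err += np.sum((y[test] - predict(model, x[test])) ** 2)
    return sq_err / len(x)

def fit_linear(x, y):
    Z = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(Z.T @ Z, Z.T @ y)

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)
y = 2.0 + 0.8 * x + rng.normal(0.0, 1.5, 30)
print("MSE_3FOLD =", kfold_mse(x, y, fit_linear, lambda b, x: b[0] + b[1] * x))
```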
Which kind of Cross Validation?

Test-set:       Upside: cheap. Downside: variance; an unreliable estimate of future performance.
Leave-one-out:  Upside: doesn’t waste data. Downside: expensive, and has some weird behavior.
10-fold:        Upside: only wastes 10%, and is only 10 times more expensive instead of R times. Downside: wastes 10% of the data, and is 10 times more expensive than the test-set method.
3-fold:         Upside: slightly better than test-set. Downside: wastier than 10-fold, expensivier than test-set.
R-fold:         Identical to leave-one-out.
The bootstrap

CV uses sampling without replacement: the same instance, once selected, cannot be selected again for a particular training/test set.

The bootstrap uses sampling with replacement to form the training set:
• Sample a dataset of n instances n times with replacement to form a new dataset of n instances
• Use this data as the training set
• Use the instances from the original dataset that don’t occur in the new training set for testing
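A minimal sketch of that sampling step (the dataset size is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
instances = np.arange(n)

# Sample n instances WITH replacement to form the training set...
train = rng.choice(instances, size=n, replace=True)

# ...and test on the original instances that were never picked.
test = np.setdiff1d(instances, train)
print(len(np.unique(train)) / n, len(test) / n)  # roughly 0.632 vs 0.368
```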
The 0.632 bootstrap

This method is also called the 0.632 bootstrap. On any one draw, a particular instance has a probability of 1 − 1/n of not being picked. Thus its probability of ending up in the test data is:

  (1 − 1/n)ⁿ ≈ e⁻¹ ≈ 0.368

This means the training data will contain approximately 63.2% of the instances.
Estimating error with the bootstrap

The error estimate on the test data will be very pessimistic, because the model is trained on just ~63% of the instances. Therefore, combine it with the resubstitution error:

  err = 0.632 · e_test instances + 0.368 · e_training instances

The resubstitution error gets less weight than the error on the test data. Repeat the process several times with different replacement samples and average the results.
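A sketch of the full procedure; fit and mse below are illustrative stand-ins (here, the linear regression used throughout) rather than anything prescribed by the slides:

```python
import numpy as np

def fit(x, y):
    Z = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(Z.T @ Z, Z.T @ y)

def mse(beta, x, y):
    return np.mean((y - (beta[0] + beta[1] * x)) ** 2)

def bootstrap632(x, y, rounds=50, seed=0):
    rng = np.random.default_rng(seed)
    n, errs = len(x), []
    for _ in range(rounds):
        train = rng.choice(n, size=n, replace=True)     # with replacement
        test = np.setdiff1d(np.arange(n), train)        # never picked
        beta = fit(x[train], y[train])
        # err = 0.632 * e_test + 0.368 * e_resubstitution
        errs.append(0.632 * mse(beta, x[test], y[test])
                    + 0.368 * mse(beta, x[train], y[train]))
    return np.mean(errs)   # average over the replacement samples
```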
CV-based Model Selection

• We’re trying to decide which algorithm to use.
• We train each machine and make a table:

  f_i    TRAINERR    10-FOLD-CV-ERR    Choice
  f_1
  f_2
  f_3                                  ✓
  f_4
  f_5
  f_6
CV-based Model Selection

• Example: Choosing “k” for a k-nearest-neighbor regression.
• Step 1: Compute LOOCV error for six different model classes:

  Algorithm    TRAINERR    10-fold-CV-ERR    Choice
  K=1
  K=2
  K=3
  K=4                                        ✓
  K=5
  K=6

• Step 2: Whichever model class gave the best CV score: train it with all the data, and that’s the predictive model you’ll use.
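A sketch of this recipe for 1-D k-nearest-neighbor regression on synthetic data; knn_predict and loocv_err are hypothetical helper names, not from the slides:

```python
import numpy as np

def knn_predict(k, x_train, y_train, x0):
    """Average the y-values of the k training points nearest to x0."""
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return np.mean(y_train[nearest])

def loocv_err(k, x, y):
    return np.mean([(y[i] - knn_predict(k, np.delete(x, i),
                                        np.delete(y, i), x[i])) ** 2
                    for i in range(len(x))])

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 40)
y = np.sin(x) + rng.normal(0.0, 0.2, 40)

# Step 1: score each candidate k; Step 2: refit the winner on all the data.
best_k = min(range(1, 7), key=lambda k: loocv_err(k, x, y))
print("chosen k =", best_k)
```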
Note on parameter tuning

It is important that the test data is not used in any way to create the classifier.

Some learning schemes operate in two stages:
• Stage 1: build the basic structure
• Stage 2: optimize parameter settings

The test data can’t be used for parameter tuning! The proper procedure uses three sets: training data, validation data, and test data. Validation data is used to optimize parameters.
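A minimal sketch of the three-way split; the 60/20/20 proportions are an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
idx = rng.permutation(100)      # indices of 100 instances

train_idx = idx[:60]            # Stage 1: build the basic structure
valid_idx = idx[60:80]          # Stage 2: pick the parameter settings
test_idx  = idx[80:]            # touched exactly once, at the very end
```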
CV-based Algorithm Choice

• Example: Choosing which regression algorithm to use
• Step 1: Compute 10-fold-CV error for six different model classes:

  Algorithm       TRAINERR    10-fold-CV-ERR    Choice
  1-NN
  10-NN
  Linear Reg’n
  Quad reg’n                                    ✓
  LWR, KW=0.1
  LWR, KW=0.5

• Step 2: Whichever algorithm gave the best CV score: train it with all the data, and that’s the predictive model you’ll use.
Cross-validation for classification

• Instead of computing the sum squared errors on a test set, you should compute the total number of misclassifications on a test set.
Counting the cost

In practice, different types of classification errors often incur different costs.

Examples:
• Terrorist profiling: “not a terrorist” is correct 99.99% of the time
• Loan decisions
• Oil-slick detection
• Fault diagnosis
• Promotional mailing
Counting the cost

The confusion matrix:

                         Predicted class
                         Yes               No
  Actual class    Yes    True positive     False negative
                  No     False positive    True negative

There are many other types of cost! E.g. the cost of collecting training data.
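Filling in the four cells is a handful of comparisons; a sketch on made-up labels:

```python
import numpy as np

actual    = np.array([1, 0, 1, 1, 0, 0, 1])   # 1 = "yes", 0 = "no"
predicted = np.array([1, 0, 0, 1, 1, 0, 1])

tp = np.sum((predicted == 1) & (actual == 1))   # true positives
fn = np.sum((predicted == 0) & (actual == 1))   # false negatives
fp = np.sum((predicted == 1) & (actual == 0))   # false positives
tn = np.sum((predicted == 0) & (actual == 0))   # true negatives
```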
Lift charts

In practice, costs are rarely known. Decisions are usually made by comparing possible scenarios.

Example: promotional mailout to 1,000,000 households
• Mail to all; 0.1% respond (1000)
• Data mining tool identifies a subset of the 100,000 most promising; 0.4% of these respond (400). 40% of the responses for 10% of the cost may pay off
• Identify a subset of the 400,000 most promising; 0.2% respond (800)

A lift chart allows a visual comparison.
Generating a lift chart

Sort instances according to predicted probability of being positive:

  Predicted probability    Actual class
  0.951                    Yes
  0.933                    No
  0.932                    Yes
  0.884                    Yes
  ...                      ...

x axis is sample size; y axis is number of true positives.
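A sketch of the bookkeeping behind the chart, using the four ranked rows above plus two made-up ones:

```python
import numpy as np

prob   = np.array([0.951, 0.933, 0.932, 0.884, 0.40, 0.25])
actual = np.array([1, 0, 1, 1, 0, 1])          # 1 = "yes"

order = np.argsort(-prob)                      # descending probability
cum_tp = np.cumsum(actual[order])              # y axis: true positives
sample_size = np.arange(1, len(prob) + 1)      # x axis: sample size
for n, tp in zip(sample_size, cum_tp):
    print(n, tp)
```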
ROC curves

ROC curves are similar to lift charts:
• “ROC” stands for “receiver operating characteristic”
• Used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel

Differences from lift charts:
• y axis shows the percentage of true positives in the sample, rather than the absolute number
• x axis shows the percentage of false positives in the sample, rather than the sample size
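A sketch of how the two percentage axes fall out of the same ranking (toy scores again):

```python
import numpy as np

prob   = np.array([0.951, 0.933, 0.932, 0.884, 0.40, 0.25])
actual = np.array([1, 0, 1, 1, 0, 1])

labels = actual[np.argsort(-prob)]
# Each step down the ranking moves the curve up (a true positive)
# or right (a false positive); both axes are percentages.
tpr = np.cumsum(labels) / labels.sum()
fpr = np.cumsum(1 - labels) / (1 - labels).sum()
print(np.column_stack([fpr, tpr]))
```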
ROC curves for two schemes

• For a small, focused sample, use method A
• For a larger one, use method B
• In between, choose between A and B with appropriate probabilities
Precision and Recall

• Typically used in document retrieval
• Precision: how many of the returned documents are correct
  – precision(threshold)
• Recall: how many of the positives does the model return
  – recall(threshold)
• Precision/Recall curve: sweep thresholds
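A sketch of the sweep on toy scores, recomputing both quantities at each distinct threshold:

```python
import numpy as np

prob   = np.array([0.951, 0.933, 0.932, 0.884, 0.40, 0.25])
actual = np.array([1, 0, 1, 1, 0, 1])

for t in np.unique(prob)[::-1]:                # sweep thresholds downward
    returned = prob >= t                       # what the model returns
    tp = np.sum(returned & (actual == 1))
    precision = tp / returned.sum()            # correct among returned
    recall = tp / (actual == 1).sum()          # positives that are returned
    print(f"threshold={t:.3f}  precision={precision:.2f}  recall={recall:.2f}")
```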