Ridge Regression - CSE/STAT 416: Intro to Machine Learning (courses.cs.washington.edu)


Transcript of Ridge Regression - CSE/STAT 416: Intro to Machine Learning

Page 1

CSE/STAT 416: Intro to Machine Learning
Emily Fox, University of Washington
April 10, 2018
©2018 Emily Fox

Ridge Regression: Regulating overfitting when using many features

CSE/STAT 416: Intro to Machine Learning 2

Training, true, & test error vs. model complexity

[Figure: training, true, and test error curves plotted against model complexity, with two example fits of y vs. x]

Overfitting if:

Page 2

CSE/STAT 416: Intro to Machine Learning 3

Bias-variance tradeoff

[Figure: bias and variance plotted against model complexity, with two example fits of y vs. x]

CSE/STAT 416: Intro to Machine Learning 4

Error vs. amount of data

[Figure: error vs. # data points in training set]

Page 3

CSE/STAT 416: Intro to Machine Learning

Summary of assessing performance


CSE/STAT 416: Intro to Machine Learning 6

What you can do now…

• Describe what a loss function is and give examples
• Contrast training and test error
• Compute training and test error given a loss function
• Discuss issue of assessing performance on training set
• Describe tradeoffs in forming training/test splits
• List and interpret the 3 sources of avg. prediction error
  - Irreducible error, bias, and variance

Page 4

CSE/STAT 416: Intro to Machine Learning

Overfitting of polynomial regression


CSE/STAT 416: Intro to Machine Learning 8

Flexibility of high-order polynomials

[Figure: two fits fŵ of price ($) vs. square feet (sq.ft.); the high-order fit is labeled OVERFIT]

yi = w0 + w1 xi + w2 xi² + … + wp xi^p + εi
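To make the "OVERFIT" label concrete, here is a minimal sketch (not the course demo; synthetic data standing in for price vs. sq.ft., NumPy only) that fits a degree-p polynomial by plain least squares. With few points and a high degree, the estimated coefficients blow up, which is the symptom discussed on the next slide.

```python
# Minimal overfitting sketch: few points, high-degree polynomial, unregularized least squares.
import numpy as np

rng = np.random.default_rng(0)
N, p = 12, 10                                   # few observations, degree-10 polynomial
x = np.sort(rng.uniform(0.0, 1.0, N))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=N)   # synthetic stand-in for the housing data

H = np.vander(x, p + 1, increasing=True)        # columns 1, x, x^2, ..., x^p
w_hat, *_ = np.linalg.lstsq(H, y, rcond=None)   # minimize RSS(w) with no penalty

print(np.abs(w_hat).max())                      # typically enormous -> overfitting symptom
```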

Page 5

CSE/STAT 416: Intro to Machine Learning 9

Symptom of overfitting

Often, overfitting is associated with very large estimated parameters ŵ

CSE/STAT 416: Intro to Machine Learning

Overfitting of linear regression models more generically

Page 6

CSE/STAT 416: Intro to Machine Learning 11

Overfitting with many features

Not unique to polynomial regression, but also if lots of inputs (d large)

Or, generically, lots of features (D large)

yi = Σ (j=0 to D) wj hj(xi) + εi

- Square feet

- # bathrooms

- # bedrooms

- Lot size

- Year built

- …
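As a small illustration of this generic model (hypothetical numbers, not the course data), the feature matrix simply stacks one column per feature hj(x), with h0(x) = 1 as the constant:

```python
# Illustrative sketch: build H with h_0(x)=1 plus one column per input, so y ≈ H w.
import numpy as np

# toy rows of inputs: sq.ft., # bathrooms, # bedrooms, lot size, year built
X = np.array([
    [1400, 2, 3, 5000, 1978],
    [2300, 3, 4, 7200, 1995],
    [ 950, 1, 2, 3000, 1942],
], dtype=float)

h0 = np.ones((X.shape[0], 1))      # constant feature h_0(x) = 1
H = np.hstack([h0, X])             # columns h_0, ..., h_D (D = 5 here)
print(H.shape)                     # (N, D+1)
```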

CSE/STAT 416: Intro to Machine Learning 12

How does # of observations influence overfitting?

Few observations (N small) → rapidly overfit as model complexity increases
Many observations (N very large) → harder to overfit

[Figure: fits fŵ of price ($) vs. square feet (sq.ft.) for a small and a very large training set]

Page 7

CSE/STAT 416: Intro to Machine Learning 13

How does # of inputs influence overfitting?

1 input (e.g., sq.ft.): Data must include representative examples of all possible (sq.ft., $) pairs to avoid overfitting

HARD

[Figure: fit fŵ of price ($) vs. square feet (sq.ft.)]

CSE/STAT 416: Intro to Machine Learning 14

How does # of inputs influence overfitting?

d inputs (e.g., sq.ft., #bath, #bed, lot size, year, …): Data must include examples of all possible (sq.ft., #bath, #bed, lot size, year, …, $) combos to avoid overfitting

MUCH!!! HARDER

[Figure: fit of price ($) over square feet (sq.ft.) x[1] and # bathrooms x[2]]

Page 8

CSE/STAT 416: Intro to Machine Learning

Adding term to cost-of-fit to prefer small coefficients

CSE/STAT 416: Intro to Machine Learning 16

[Block diagram of the ML workflow: Training Data → Feature extraction h(x) → ML model → ŷ; y and ŷ feed the Quality metric, and the ML algorithm updates ŵ]

Page 9

CSE/STAT 416: Intro to Machine Learning 17

Desired total cost format

Want to balance:
i.  How well function fits data
ii. Magnitude of coefficients

Total cost = measure of fit + measure of magnitude of coefficients

measure of fit (measures quality of fit): small # = good fit to training data
measure of magnitude of coefficients: small # = not overfit
Want to balance the two terms.

CSE/STAT 416: Intro to Machine Learning 18

Measure of fit to training data

RSS(w) = Σ (i=1 to N) (yi − h(xi)ᵀw)²
       = Σ (i=1 to N) (yi − ŷi(w))²

[Figure: fit of price ($) over square feet (sq.ft.) x[1] and # bathrooms x[2]]
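A direct translation of this formula (synthetic data and variable names of my choosing, not from the slides):

```python
# Sketch: compute RSS(w) for a linear model y_i ≈ h(x_i)^T w.
import numpy as np

rng = np.random.default_rng(1)
N, D = 50, 3
H = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])   # h_0 = 1 plus D features
w_true = np.array([1.0, 0.5, -2.0, 0.3])
y = H @ w_true + rng.normal(scale=0.1, size=N)

def rss(w, H, y):
    """RSS(w) = sum over i of (y_i - h(x_i)^T w)^2."""
    y_hat = H @ w                     # predictions ŷ_i(w)
    return np.sum((y - y_hat) ** 2)

print(rss(w_true, H, y))
```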

Page 10

CSE/STAT 416: Intro to Machine Learning 19

Measure of magnitude of regression coefficient

What summary # is indicative of size of regression coefficients?

- Sum?
- Sum of absolute value?
- Sum of squares (L2 norm)

CSE/STAT 416: Intro to Machine Learning 20

Consider specific total cost

Total cost = measure of fit + measure of magnitude of coefficients

Page 11

CSE/STAT 416: Intro to Machine Learning 21

Consider specific total cost

Total cost = measure of fit + measure of magnitude of coefficients
           = RSS(w) + ||w||₂²

CSE/STAT 416: Intro to Machine Learning 22

Consider resulting objective

What if ŵ is selected to minimize

RSS(w) + λ||w||₂²

λ: tuning parameter = balance of fit and magnitude

If λ=0: reduces to minimizing RSS(w), i.e., the standard least squares fit
If λ=∞: only the magnitude penalty matters, so ŵ = 0
If λ in between: balances fit to training data against magnitude of coefficients

Page 12

CSE/STAT 416: Intro to Machine Learning 23

Consider resulting objective

What if ŵ is selected to minimize

RSS(w) + λ||w||₂²

λ: tuning parameter = balance of fit and magnitude

Ridge regression (a.k.a. L2 regularization)
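In code this total cost is a one-liner; a minimal NumPy sketch (variable names mine, not from the slides):

```python
# The ridge objective: measure of fit (RSS) plus λ times the measure of magnitude (||w||_2^2).
import numpy as np

def ridge_cost(w, H, y, lam):
    resid = y - H @ w
    return resid @ resid + lam * (w @ w)
# lam = 0 recovers the least squares objective; a huge lam drives the minimizer toward w = 0.
```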

CSE/STAT 416: Intro to Machine Learning 24

Bias-variance tradeoff

Large λ: high bias, low variance
(e.g., ŵ = 0 for λ=∞)

Small λ: low bias, high variance
(e.g., standard least squares (RSS) fit of high-order polynomial for λ=0)

In essence, λ controls model complexity

Page 13

CSE/STAT 416: Intro to Machine Learning 25

Revisit polynomial fit demo

What happens if we refit our high-order polynomial, but now using ridge regression?

Will consider a few settings of λ…
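A sketch of the demo's idea (synthetic data; scikit-learn, where `alpha` is its name for λ): refit a high-order polynomial with ridge for a few settings of λ and watch the coefficient magnitudes shrink.

```python
# Refit a degree-12 polynomial with ridge regression at several λ values.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 1.0, 15)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.1, size=15)

H = PolynomialFeatures(degree=12, include_bias=False).fit_transform(x)

for lam in [1e-10, 1e-4, 1e-1, 1e2]:        # first value is effectively λ = 0
    model = Ridge(alpha=lam).fit(H, y)
    print(f"λ = {lam:g}: max |ŵ_j| = {np.abs(model.coef_).max():.3g}")
```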

CSE/STAT 416: Intro to Machine Learning 26

Coefficient path

[Figure: "coefficient paths -- Ridge": coefficients ŵj vs. λ for features bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, yr_renovat, waterfront]
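The plotted quantity can be traced numerically; a sketch with a synthetic stand-in for the house data (feature names taken from the figure legend, everything else assumed):

```python
# Trace the ridge coefficient path ŵ_j(λ) over a grid of λ values.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
feature_names = ["bedrooms", "bathrooms", "sqft_living", "sqft_lot",
                 "floors", "yr_built", "yr_renovat", "waterfront"]
X = rng.normal(size=(200, len(feature_names)))
y = X @ rng.normal(size=len(feature_names)) + rng.normal(size=200)

Xs = StandardScaler().fit_transform(X)          # put features on a common scale
lambdas = np.logspace(-2, 5, 30)
path = np.array([Ridge(alpha=lam).fit(Xs, y).coef_ for lam in lambdas])

# path[k, j] = ŵ_j at λ = lambdas[k]; every coefficient shrinks toward 0 as λ grows
print(path.shape)
```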

Page 14

CSE/STAT 416: Intro to Machine Learning

How to choose λ


CSE/STAT 416: Intro to Machine Learning 28

The regression/ML workflow

1. Model selection
   Need to choose tuning parameters λ controlling model complexity

2. Model assessment
   Having selected a model, assess generalization error

Page 15

CSE/STAT 416: Intro to Machine Learning 29

Hypothetical implementation

[Data split: Training set | Test set]

1. Model selection
   For each considered λ:
   i.   Estimate parameters ŵλ on training data
   ii.  Assess performance of ŵλ on test data
   iii. Choose λ* to be the λ with lowest test error

2. Model assessment
   Compute test error of ŵλ* (fitted model for selected λ*) to approx. true error

Overly optimistic!

CSE/STAT 416: Intro to Machine Learning 30

Hypothetical implementation

Issue: Just like fitting ŵ and assessing its performance both on training data
• λ* was selected to minimize test error (i.e., λ* was fit on test data)
• If test data is not representative of the whole world, then ŵλ* will typically perform worse than test error indicates

[Data split: Training set | Test set]

Page 16

CSE/STAT 416: Intro to Machine Learning 31

Practical implementation

Solution: Create two “test” sets!

[Data split: Training set | Validation set | Test set]

1. Select λ* such that ŵλ* minimizes error on validation set
2. Approximate true error of ŵλ* using test set

CSE/STAT 416: Intro to Machine Learning 32

Practical implementation

[Data split: Training set | Validation set | Test set]
Training set: fit ŵλ
Validation set: test performance of ŵλ to select λ*
Test set: assess true error of ŵλ*
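A sketch of this recipe end to end (synthetic data, scikit-learn utilities; the 80/10/10 split matches the "Typical splits" slide that follows):

```python
# Fit ŵ_λ on the training set, pick λ* on the validation set, report test error of ŵ_λ*.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 8))
y = X @ rng.normal(size=8) + rng.normal(size=1000)

# 80% train, 10% validation, 10% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

lambdas = np.logspace(-3, 3, 13)
val_err = [mean_squared_error(y_val, Ridge(alpha=lam).fit(X_train, y_train).predict(X_val))
           for lam in lambdas]
lam_star = lambdas[int(np.argmin(val_err))]      # λ* chosen on the validation set

final = Ridge(alpha=lam_star).fit(X_train, y_train)
test_err = mean_squared_error(y_test, final.predict(X_test))   # approx. true error
print(lam_star, test_err)
```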

Page 17

CSE/STAT 416: Intro to Machine Learning 33

Typical splits

Training set | Validation set | Test set
     80%     |      10%       |   10%
     50%     |      25%       |   25%

CSE/STAT 416: Intro to Machine Learning

How to handle the intercept


PRACTICALITIES

Page 18

CSE/STAT 416: Intro to Machine Learning 35

Recall multiple regression model

Model: yi = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) + εi
          = Σ (j=0 to D) wj hj(xi) + εi

feature 1   = h0(x) … often 1 (constant)
feature 2   = h1(x) … e.g., x[1]
feature 3   = h2(x) … e.g., x[2]
…
feature D+1 = hD(x) … e.g., x[d]

CSE/STAT 416: Intro to Machine Learning 36

Do we penalize intercept?

Standard ridge regression cost:

RSS(w) + λ||w||₂²     (λ: strength of penalty)

Encourages intercept w0 to also be small

Do we want a small intercept? Conceptually, not indicative of overfitting…

Page 19

CSE/STAT 416: Intro to Machine Learning 37

Option 1: Don't penalize intercept

Modified ridge regression cost:

RSS(w0, wrest) + λ||wrest||₂²
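As far as I'm aware, scikit-learn's Ridge already follows Option 1 when fit_intercept=True: w0 is estimated but left out of the L2 penalty. A small check (synthetic data, my variable names):

```python
# With a very large penalty the slopes shrink toward 0 but the intercept stays near the mean of y.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
y = 50.0 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = Ridge(alpha=1e3, fit_intercept=True).fit(X, y)
print(model.intercept_)   # stays near 50 despite the large penalty
print(model.coef_)        # slope coefficients are shrunk toward 0
```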

CSE/STAT 416: Intro to Machine Learning 38

Option 2: Center data first

If data are first centered about 0, then favoring a small intercept is not so worrisome

Step 1: Transform y to have 0 mean
Step 2: Run ridge regression as normal (closed-form or gradient algorithms)
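A sketch of Option 2, assuming the usual ridge closed form ŵ = (HᵀH + λI)⁻¹Hᵀy (the slide defers the closed-form/gradient details; in practice the feature columns are often centered as well):

```python
# Step 1: center y. Step 2: solve ridge in closed form on the centered targets.
import numpy as np

rng = np.random.default_rng(6)
H = rng.normal(size=(100, 4))                     # features (no constant column)
y = 50.0 + H @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

# Step 1: transform y to have 0 mean
y_mean = y.mean()
y_c = y - y_mean

# Step 2: run ridge regression as normal (closed form shown; gradient descent also works)
lam = 10.0
D = H.shape[1]
w_hat = np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y_c)

# Predictions add the mean back in as the (unpenalized) intercept
y_hat = H @ w_hat + y_mean
```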


Page 20

CSE/STAT 416: Intro to Machine Learning

Feature normalization


PRACTICALITIES

CSE/STAT 416: Intro to Machine Learning 40

Normalizing features

Scale training columns (not rows!) as:

hj(xk) ← hj(xk) / zj,   with normalizer  zj = sqrt( Σ (i=1 to N) hj(xi)² )   (summing over training points)

Apply same training scale factors zj to test data:

hj(xk) ← hj(xk) / zj   (apply to test point)

[Figure: training features and test features tables with columns Feat. 0, Feat. 1, Feat. 2, …, Feat. D, each column divided by its normalizer zj]
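A direct sketch of this normalizer (synthetic feature matrices, names mine): z_j is computed from the training columns only and then applied to both training and test features.

```python
# Column-wise normalization: z_j = sqrt(sum over training points of h_j(x_i)^2).
import numpy as np

rng = np.random.default_rng(7)
scales = np.array([1.0, 10.0, 100.0, 0.1, 5.0])        # features on very different scales
H_train = rng.normal(size=(100, 5)) * scales
H_test = rng.normal(size=(20, 5)) * scales

z = np.sqrt(np.sum(H_train ** 2, axis=0))              # one normalizer per column, from training data

H_train_scaled = H_train / z     # scale training columns (not rows!)
H_test_scaled = H_test / z       # apply the SAME training scale factors to test data

print(np.sqrt(np.sum(H_train_scaled ** 2, axis=0)))    # all 1s by construction
```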

Page 21

CSE/STAT 416: Intro to Machine Learning

Summary for ridge regression


CSE/STAT 416: Intro to Machine Learning 42

What you can do now…

• Describe what happens to magnitude of estimated coefficients when model is overfit
• Motivate form of ridge regression cost function
• Describe what happens to estimated coefficients of ridge regression as tuning parameter λ is varied
• Interpret coefficient path plot
• Use a validation set to select the ridge regression tuning parameter λ
• Handle intercept and scale of features with care