Ridge Regression - CSE/STAT 416: Intro to Machine Learning (courses.cs.washington.edu)


Transcript of Ridge Regression - CSE/STAT 416: Intro to Machine Learning

Page 1

CSE/STAT 416: Intro to Machine Learning
Emily Fox, University of Washington
April 10, 2018
©2018 Emily Fox

Ridge Regression: Regulating overfitting when using many features

CSE/STAT 416: Intro to Machine Learning 2

Training, true, & test error vs. model complexity

[Figure: training, true, and test error curves plotted against model complexity, with two example fits of y vs. x]

Overfitting if:

Page 2

CSE/STAT 416: Intro to Machine Learning 3

Bias-variance tradeoff

[Figure: bias and variance plotted against model complexity, with two example fits of y vs. x]

CSE/STAT 416: Intro to Machine Learning 4

Error vs. amount of data

[Figure: error vs. # data points in training set]

Page 3

CSE/STAT 416: Intro to Machine Learning

Summary of assessing performance


CSE/STAT 416: Intro to Machine Learning 6

What you can do now…

• Describe what a loss function is and give examples
• Contrast training and test error
• Compute training and test error given a loss function
• Discuss issue of assessing performance on training set
• Describe tradeoffs in forming training/test splits
• List and interpret the 3 sources of avg. prediction error
  - Irreducible error, bias, and variance

Page 4

CSE/STAT 416: Intro to Machine Learning

Overfitting of polynomial regression


CSE/STAT 416: Intro to Machine Learning 8

Flexibility of high-order polynomials

[Figure: two fits fŵ of price ($) vs. square feet (sq.ft.); the high-order fit is labeled OVERFIT]

yi = w0 + w1 xi + w2 xi² + … + wp xi^p + εi
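To make the "OVERFIT" label concrete, here is a minimal sketch (not the course demo; synthetic data standing in for price vs. sq.ft., NumPy only) that fits a degree-p polynomial by plain least squares. With few points and a high degree, the estimated coefficients blow up, which is the symptom discussed on the next slide.

```python
# Minimal overfitting sketch: few points, high-degree polynomial, unregularized least squares.
import numpy as np

rng = np.random.default_rng(0)
N, p = 12, 10                                   # few observations, degree-10 polynomial
x = np.sort(rng.uniform(0.0, 1.0, N))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=N)   # synthetic stand-in for the housing data

H = np.vander(x, p + 1, increasing=True)        # columns 1, x, x^2, ..., x^p
w_hat, *_ = np.linalg.lstsq(H, y, rcond=None)   # minimize RSS(w) with no penalty

print(np.abs(w_hat).max())                      # typically enormous -> overfitting symptom
```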

Page 5

CSE/STAT 416: Intro to Machine Learning 9

Symptom of overfitting

Often, overfitting is associated with very large estimated parameters ŵ

CSE/STAT 416: Intro to Machine Learning

Overfitting of linear regression models more generically

Page 6

CSE/STAT 416: Intro to Machine Learning 11

Overfitting with many features

Not unique to polynomial regression, but also if lots of inputs (d large)

Or, generically, lots of features (D large)

yi = Σ (j=0 to D) wj hj(xi) + εi

- Square feet

- # bathrooms

- # bedrooms

- Lot size

- Year built

- …
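As a small illustration of this generic model (hypothetical numbers, not the course data), the feature matrix simply stacks one column per feature hj(x), with h0(x) = 1 as the constant:

```python
# Illustrative sketch: build H with h_0(x)=1 plus one column per input, so y ≈ H w.
import numpy as np

# toy rows of inputs: sq.ft., # bathrooms, # bedrooms, lot size, year built
X = np.array([
    [1400, 2, 3, 5000, 1978],
    [2300, 3, 4, 7200, 1995],
    [ 950, 1, 2, 3000, 1942],
], dtype=float)

h0 = np.ones((X.shape[0], 1))      # constant feature h_0(x) = 1
H = np.hstack([h0, X])             # columns h_0, ..., h_D (D = 5 here)
print(H.shape)                     # (N, D+1)
```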

CSE/STAT 416: Intro to Machine Learning 12

How does # of observations influence overfitting?

Few observations (N small) → rapidly overfit as model complexity increases
Many observations (N very large) → harder to overfit

[Figure: fits fŵ of price ($) vs. square feet (sq.ft.) for a small and a very large training set]

Page 7

CSE/STAT 416: Intro to Machine Learning 13

How does # of inputs influence overfitting?

1 input (e.g., sq.ft.): Data must include representative examples of all possible (sq.ft., $) pairs to avoid overfitting

HARD

[Figure: fit fŵ of price ($) vs. square feet (sq.ft.)]

CSE/STAT 416: Intro to Machine Learning 14

How does # of inputs influence overfitting?

d inputs (e.g., sq.ft., #bath, #bed, lot size, year, …): Data must include examples of all possible (sq.ft., #bath, #bed, lot size, year, …, $) combos to avoid overfitting

MUCH!!! HARDER

[Figure: fit of price ($) over square feet (sq.ft.) x[1] and # bathrooms x[2]]

Page 8

CSE/STAT 416: Intro to Machine Learning

Adding term to cost-of-fit to prefer small coefficients

CSE/STAT 416: Intro to Machine Learning 16

[Block diagram of the ML workflow: Training Data → Feature extraction h(x) → ML model → ŷ; y and ŷ feed the Quality metric, and the ML algorithm updates ŵ]

Page 9

CSE/STAT 416: Intro to Machine Learning 17

Desired total cost format

Want to balance:
i.  How well function fits data
ii. Magnitude of coefficients

Total cost = measure of fit + measure of magnitude of coefficients

measure of fit (measures quality of fit): small # = good fit to training data
measure of magnitude of coefficients: small # = not overfit
Want to balance the two terms.

CSE/STAT 416: Intro to Machine Learning 18

Measure of fit to training data

RSS(w) = Σ (i=1 to N) (yi − h(xi)ᵀw)²
       = Σ (i=1 to N) (yi − ŷi(w))²

[Figure: fit of price ($) over square feet (sq.ft.) x[1] and # bathrooms x[2]]
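A direct translation of this formula (synthetic data and variable names of my choosing, not from the slides):

```python
# Sketch: compute RSS(w) for a linear model y_i ≈ h(x_i)^T w.
import numpy as np

rng = np.random.default_rng(1)
N, D = 50, 3
H = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])   # h_0 = 1 plus D features
w_true = np.array([1.0, 0.5, -2.0, 0.3])
y = H @ w_true + rng.normal(scale=0.1, size=N)

def rss(w, H, y):
    """RSS(w) = sum over i of (y_i - h(x_i)^T w)^2."""
    y_hat = H @ w                     # predictions ŷ_i(w)
    return np.sum((y - y_hat) ** 2)

print(rss(w_true, H, y))
```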

Page 10

CSE/STAT 416: Intro to Machine Learning 19

Measure of magnitude of regression coefficient

What summary # is indicative of size of regression coefficients?

- Sum?
- Sum of absolute value?
- Sum of squares (L2 norm)

CSE/STAT 416: Intro to Machine Learning 20

Consider specific total cost

Total cost = measure of fit + measure of magnitude of coefficients

Page 11

CSE/STAT 416: Intro to Machine Learning 21

Consider specific total cost

Total cost = measure of fit + measure of magnitude of coefficients
           = RSS(w) + ||w||₂²

CSE/STAT 416: Intro to Machine Learning 22

Consider resulting objective

What if ŵ is selected to minimize

RSS(w) + λ||w||₂²

λ: tuning parameter = balance of fit and magnitude

If λ=0: reduces to minimizing RSS(w), i.e., the standard least squares fit
If λ=∞: only the magnitude penalty matters, so ŵ = 0
If λ in between: balances fit to training data against magnitude of coefficients

Page 12

CSE/STAT 416: Intro to Machine Learning 23

Consider resulting objective

What if ŵ is selected to minimize

RSS(w) + λ||w||₂²

λ: tuning parameter = balance of fit and magnitude

Ridge regression (a.k.a. L2 regularization)
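In code this total cost is a one-liner; a minimal NumPy sketch (variable names mine, not from the slides):

```python
# The ridge objective: measure of fit (RSS) plus λ times the measure of magnitude (||w||_2^2).
import numpy as np

def ridge_cost(w, H, y, lam):
    resid = y - H @ w
    return resid @ resid + lam * (w @ w)
# lam = 0 recovers the least squares objective; a huge lam drives the minimizer toward w = 0.
```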

CSE/STAT 416: Intro to Machine Learning 24

Bias-variance tradeoff

Large λ: high bias, low variance
(e.g., ŵ = 0 for λ=∞)

Small λ: low bias, high variance
(e.g., standard least squares (RSS) fit of high-order polynomial for λ=0)

In essence, λ controls model complexity

Page 13

CSE/STAT 416: Intro to Machine Learning 25

Revisit polynomial fit demo

What happens if we refit our high-order polynomial, but now using ridge regression?

Will consider a few settings of λ…
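A sketch of the demo's idea (synthetic data; scikit-learn, where `alpha` is its name for λ): refit a high-order polynomial with ridge for a few settings of λ and watch the coefficient magnitudes shrink.

```python
# Refit a degree-12 polynomial with ridge regression at several λ values.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 1.0, 15)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.1, size=15)

H = PolynomialFeatures(degree=12, include_bias=False).fit_transform(x)

for lam in [1e-10, 1e-4, 1e-1, 1e2]:        # first value is effectively λ = 0
    model = Ridge(alpha=lam).fit(H, y)
    print(f"λ = {lam:g}: max |ŵ_j| = {np.abs(model.coef_).max():.3g}")
```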

CSE/STAT 416: Intro to Machine Learning 26

Coefficient path

[Figure: "coefficient paths -- Ridge": coefficients ŵj vs. λ for features bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, yr_renovat, waterfront]
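The plotted quantity can be traced numerically; a sketch with a synthetic stand-in for the house data (feature names taken from the figure legend, everything else assumed):

```python
# Trace the ridge coefficient path ŵ_j(λ) over a grid of λ values.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
feature_names = ["bedrooms", "bathrooms", "sqft_living", "sqft_lot",
                 "floors", "yr_built", "yr_renovat", "waterfront"]
X = rng.normal(size=(200, len(feature_names)))
y = X @ rng.normal(size=len(feature_names)) + rng.normal(size=200)

Xs = StandardScaler().fit_transform(X)          # put features on a common scale
lambdas = np.logspace(-2, 5, 30)
path = np.array([Ridge(alpha=lam).fit(Xs, y).coef_ for lam in lambdas])

# path[k, j] = ŵ_j at λ = lambdas[k]; every coefficient shrinks toward 0 as λ grows
print(path.shape)
```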

Page 14

CSE/STAT 416: Intro to Machine Learning

How to choose λ


CSE/STAT 416: Intro to Machine Learning 28

The regression/ML workflow

1. Model selection
   Need to choose tuning parameters λ controlling model complexity

2. Model assessment
   Having selected a model, assess generalization error

Page 15

CSE/STAT 416: Intro to Machine Learning 29

Hypothetical implementation

[Data split: Training set | Test set]

1. Model selection
   For each considered λ:
   i.   Estimate parameters ŵλ on training data
   ii.  Assess performance of ŵλ on test data
   iii. Choose λ* to be the λ with lowest test error

2. Model assessment
   Compute test error of ŵλ* (fitted model for selected λ*) to approx. true error

Overly optimistic!

CSE/STAT 416: Intro to Machine Learning 30

Hypothetical implementation

Issue: Just like fitting ŵ and assessing its performance both on training data
• λ* was selected to minimize test error (i.e., λ* was fit on test data)
• If test data is not representative of the whole world, then ŵλ* will typically perform worse than test error indicates

[Data split: Training set | Test set]

Page 16

CSE/STAT 416: Intro to Machine Learning 31

Practical implementation

Solution: Create two “test” sets!

[Data split: Training set | Validation set | Test set]

1. Select λ* such that ŵλ* minimizes error on validation set
2. Approximate true error of ŵλ* using test set

CSE/STAT 416: Intro to Machine Learning 32

Practical implementation

[Data split: Training set | Validation set | Test set]
Training set: fit ŵλ
Validation set: test performance of ŵλ to select λ*
Test set: assess true error of ŵλ*
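A sketch of this recipe end to end (synthetic data, scikit-learn utilities; the 80/10/10 split matches the "Typical splits" slide that follows):

```python
# Fit ŵ_λ on the training set, pick λ* on the validation set, report test error of ŵ_λ*.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 8))
y = X @ rng.normal(size=8) + rng.normal(size=1000)

# 80% train, 10% validation, 10% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

lambdas = np.logspace(-3, 3, 13)
val_err = [mean_squared_error(y_val, Ridge(alpha=lam).fit(X_train, y_train).predict(X_val))
           for lam in lambdas]
lam_star = lambdas[int(np.argmin(val_err))]      # λ* chosen on the validation set

final = Ridge(alpha=lam_star).fit(X_train, y_train)
test_err = mean_squared_error(y_test, final.predict(X_test))   # approx. true error
print(lam_star, test_err)
```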

Page 17

CSE/STAT 416: Intro to Machine Learning 33

Typical splits

Training set | Validation set | Test set
     80%     |      10%       |   10%
     50%     |      25%       |   25%

CSE/STAT 416: Intro to Machine Learning

How to handle the intercept


PRACTICALITIES

Page 18

CSE/STAT 416: Intro to Machine Learning 35

Recall multiple regression model

Model: yi = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) + εi
          = Σ (j=0 to D) wj hj(xi) + εi

feature 1   = h0(x) … often 1 (constant)
feature 2   = h1(x) … e.g., x[1]
feature 3   = h2(x) … e.g., x[2]
…
feature D+1 = hD(x) … e.g., x[d]

CSE/STAT 416: Intro to Machine Learning 36

Do we penalize intercept?

Standard ridge regression cost:

RSS(w) + λ||w||₂²     (λ: strength of penalty)

Encourages intercept w0 to also be small

Do we want a small intercept? Conceptually, not indicative of overfitting…

Page 19

CSE/STAT 416: Intro to Machine Learning 37

Option 1: Don't penalize intercept

Modified ridge regression cost:

RSS(w0, wrest) + λ||wrest||₂²
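As far as I'm aware, scikit-learn's Ridge already follows Option 1 when fit_intercept=True: w0 is estimated but left out of the L2 penalty. A small check (synthetic data, my variable names):

```python
# With a very large penalty the slopes shrink toward 0 but the intercept stays near the mean of y.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
y = 50.0 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = Ridge(alpha=1e3, fit_intercept=True).fit(X, y)
print(model.intercept_)   # stays near 50 despite the large penalty
print(model.coef_)        # slope coefficients are shrunk toward 0
```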

CSE/STAT 416: Intro to Machine Learning 38

Option 2: Center data first

If data are first centered about 0, then favoring a small intercept is not so worrisome

Step 1: Transform y to have 0 mean
Step 2: Run ridge regression as normal (closed-form or gradient algorithms)
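A sketch of Option 2, assuming the usual ridge closed form ŵ = (HᵀH + λI)⁻¹Hᵀy (the slide defers the closed-form/gradient details; in practice the feature columns are often centered as well):

```python
# Step 1: center y. Step 2: solve ridge in closed form on the centered targets.
import numpy as np

rng = np.random.default_rng(6)
H = rng.normal(size=(100, 4))                     # features (no constant column)
y = 50.0 + H @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

# Step 1: transform y to have 0 mean
y_mean = y.mean()
y_c = y - y_mean

# Step 2: run ridge regression as normal (closed form shown; gradient descent also works)
lam = 10.0
D = H.shape[1]
w_hat = np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y_c)

# Predictions add the mean back in as the (unpenalized) intercept
y_hat = H @ w_hat + y_mean
```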


Page 20

CSE/STAT 416: Intro to Machine Learning

Feature normalization


PRACTICALITIES

CSE/STAT 416: Intro to Machine Learning 40

Normalizing features

Scale training columns (not rows!) as:

hj(xk) ← hj(xk) / zj,   with normalizer  zj = sqrt( Σ (i=1 to N) hj(xi)² )   (summing over training points)

Apply same training scale factors zj to test data:

hj(xk) ← hj(xk) / zj   (apply to test point)

[Figure: training features and test features tables with columns Feat. 0, Feat. 1, Feat. 2, …, Feat. D, each column divided by its normalizer zj]
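A direct sketch of this normalizer (synthetic feature matrices, names mine): z_j is computed from the training columns only and then applied to both training and test features.

```python
# Column-wise normalization: z_j = sqrt(sum over training points of h_j(x_i)^2).
import numpy as np

rng = np.random.default_rng(7)
scales = np.array([1.0, 10.0, 100.0, 0.1, 5.0])        # features on very different scales
H_train = rng.normal(size=(100, 5)) * scales
H_test = rng.normal(size=(20, 5)) * scales

z = np.sqrt(np.sum(H_train ** 2, axis=0))              # one normalizer per column, from training data

H_train_scaled = H_train / z     # scale training columns (not rows!)
H_test_scaled = H_test / z       # apply the SAME training scale factors to test data

print(np.sqrt(np.sum(H_train_scaled ** 2, axis=0)))    # all 1s by construction
```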

Page 21

CSE/STAT 416: Intro to Machine Learning

Summary for ridge regression


CSE/STAT 416: Intro to Machine Learning 42

What you can do now…

• Describe what happens to magnitude of estimated coefficients when model is overfit
• Motivate form of ridge regression cost function
• Describe what happens to estimated coefficients of ridge regression as tuning parameter λ is varied
• Interpret coefficient path plot
• Use a validation set to select the ridge regression tuning parameter λ
• Handle intercept and scale of features with care