Ridge Regression - courses.cs.washington.edu · 11 CSE/STAT 416: Intro to Machine Learning...
CSE/STAT 416: Intro to Machine Learning
Emily Fox, University of Washington
April 10, 2018
©2018 Emily Fox
Ridge Regression: Regulating overfitting when using many features
Training, true, & test error vs. model complexity
[Figure: training, true, and test error (y-axis) vs. model complexity (x-axis), with low- and high-complexity example fits; annotation: "Overfitting if:"]
Bias-variance tradeoff
[Figure: bias and variance vs. model complexity, with low- and high-complexity example fits.]
Error vs. amount of data
[Figure: error (y-axis) vs. # data points in training set (x-axis).]
Summary of assessing performance
What you can do now…
• Describe what a loss function is and give examples
• Contrast training and test error
• Compute training and test error given a loss function
• Discuss issue of assessing performance on training set
• Describe tradeoffs in forming training/test splits
• List and interpret the 3 sources of avg. prediction error
  - Irreducible error, bias, and variance
Overfitting of polynomial regression
Flexibility of high-order polynomials
[Figure: low- vs. high-order polynomial fits fŵ of price ($) vs. square feet (sq.ft.).]

yi = w0 + w1 xi + w2 xi^2 + … + wp xi^p + εi

OVERFIT
Symptom of overfitting
Often, overfitting is associated with very large estimated parameters ŵ
Overfitting of linear regression models more generically
Overfitting with many features
Not unique to polynomial regression, but also arises with lots of inputs (d large) or, generically, lots of features (D large)
yi = Σ_{j=0}^{D} wj hj(xi) + εi

Example features:
- Square feet
- # bathrooms
- # bedrooms
- Lot size
- Year built
- …
How does # of observations influence overfitting?
Few observations (N small) → rapidly overfit as model complexity increases
Many observations (N very large) → harder to overfit
[Figure: fits fŵ of price ($) vs. square feet (sq.ft.) for few vs. many observations.]
How does # of inputs influence overfitting?
1 input (e.g., sq.ft.): Data must include representative examples of all possible (sq.ft., $) pairs to avoid overfitting
HARD
[Figure: fit fŵ of price ($) vs. square feet (sq.ft.).]
How does # of inputs influence overfitting?
d inputs (e.g., sq.ft., #bath, #bed, lot size, year, …): Data must include examples of all possible (sq.ft., #bath, #bed, lot size, year, …, $) combos to avoid overfitting
MUCH HARDER!!!
[Figure: fitted surface of price ($, y) vs. square feet x[1] and # bathrooms x[2].]
Adding term to cost-of-fit to prefer small coefficients
[Diagram: ML pipeline: Training Data (y) → Feature extraction h(x) → ML model (prediction ŷ) → Quality metric → ML algorithm → estimated ŵ]
Desired total cost format
Want to balance:
i. How well function fits data (small # = good fit to training data)
ii. Magnitude of coefficients (small # = not overfit)

Total cost = measure of fit + measure of magnitude of coefficients
Measure of fit to training data
RSS(w) = Σ_{i=1}^{N} (yi − h(xi)^T w)^2 = Σ_{i=1}^{N} (yi − ŷi(w))^2

[Figure: fitted surface of price ($, y) vs. square feet x[1] and # bathrooms x[2].]
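The RSS formula above can be sketched in a few lines of NumPy. This is an illustrative toy example, not course code; the names `rss`, `H`, and `w_true` and the synthetic data are made up for demonstration.

```python
import numpy as np

# Toy data (made up): rows of H are the feature vectors h(x_i)^T.
np.random.seed(0)
N, D = 20, 3
H = np.random.randn(N, D)
w_true = np.array([1.0, -2.0, 0.5])
y = H @ w_true + 0.1 * np.random.randn(N)

def rss(w, H, y):
    """RSS(w) = sum_i (y_i - h(x_i)^T w)^2."""
    residuals = y - H @ w
    return float(residuals @ residuals)

# RSS near the true coefficients is small; at w = 0 it is just sum of y_i^2.
print(rss(w_true, H, y))
print(rss(np.zeros(D), H, y))
```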
Measure of magnitude of regression coefficients

What summary # is indicative of size of regression coefficients?
- Sum?
- Sum of absolute value?
- Sum of squares (L2 norm)
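A quick numeric sketch (with made-up coefficients) shows why the plain sum is a poor summary of size: coefficients of opposite sign cancel, while the absolute-value and squared summaries do not.

```python
import numpy as np

# Made-up coefficient vector with two huge coefficients of opposite sign.
w = np.array([1000.0, -1000.0, 3.0])

plain_sum = w.sum()        # cancels: looks tiny despite huge coefficients
l1 = np.abs(w).sum()       # sum of absolute values (L1 norm)
l2_sq = (w ** 2).sum()     # sum of squares (squared L2 norm) -- ridge's choice

print(plain_sum, l1, l2_sq)
```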
Consider specific total cost
Total cost = measure of fit + measure of magnitude of coefficients
Consider specific total cost

Total cost = RSS(w) + ||w||₂²
(measure of fit = RSS(w); measure of magnitude = ||w||₂²)
Consider resulting objective
What if ŵ selected to minimize

RSS(w) + λ||w||₂²

tuning parameter λ = balance of fit and magnitude

If λ=0: reduces to minimizing RSS(w) (standard least squares fit)
If λ=∞: ŵ = 0 (any nonzero coefficient incurs infinite cost)
If λ in between: balances fit to data against magnitude of coefficients
Consider resulting objective

What if ŵ selected to minimize

RSS(w) + λ||w||₂²

Ridge regression (a.k.a. L2 regularization)
tuning parameter λ = balance of fit and magnitude
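The ridge objective above has a well-known closed-form minimizer, ŵ = (HᵀH + λI)⁻¹Hᵀy. A minimal NumPy sketch on synthetic data (the name `ridge_fit` and the data are illustrative, not from the course):

```python
import numpy as np

# Synthetic data (made up) for demonstrating the ridge closed form.
np.random.seed(1)
N, D = 30, 5
H = np.random.randn(N, D)
y = H @ np.ones(D) + 0.1 * np.random.randn(N)

def ridge_fit(H, y, lam):
    """Minimize RSS(w) + lam * ||w||^2 via the normal equations."""
    D = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)

w_ls = ridge_fit(H, y, 0.0)     # λ=0: ordinary least squares
w_reg = ridge_fit(H, y, 10.0)   # λ>0: shrunken coefficients
w_big = ridge_fit(H, y, 1e8)    # λ→∞: ŵ driven toward 0

print(np.linalg.norm(w_ls), np.linalg.norm(w_reg), np.linalg.norm(w_big))
```

Increasing λ only ever shrinks the solution's norm, matching the λ=0 / λ=∞ cases on the slide.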
Bias-variance tradeoff
Large λ: high bias, low variance (e.g., ŵ = 0 for λ=∞)
Small λ: low bias, high variance (e.g., standard least squares (RSS) fit of high-order polynomial for λ=0)
In essence, λ controls model complexity
Revisit polynomial fit demo
What happens if we refit our high-order polynomial, but now using ridge regression?
Will consider a few settings of λ …
Coefficient path

[Figure: "coefficient paths -- Ridge": coefficients ŵj (y-axis, −100000 to 200000) vs. λ (x-axis, 0 to 200000) for features bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, yr_renovat, waterfront; all paths shrink toward 0 as λ grows.]
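A coefficient path like the one in the figure can be traced by solving the ridge problem over a grid of λ values. This sketch uses synthetic data rather than the house-sales features from the plot; the λ grid and names are illustrative.

```python
import numpy as np

# Synthetic data (made up) for tracing a ridge coefficient path.
np.random.seed(2)
N, D = 50, 4
H = np.random.randn(N, D)
y = H @ np.array([5.0, -3.0, 2.0, 0.5]) + 0.1 * np.random.randn(N)

lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
path = np.array([
    np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)  # ridge closed form
    for lam in lambdas
])  # row k holds ŵ for lambdas[k]

norms = np.linalg.norm(path, axis=1)
print(norms)  # coefficient magnitudes shrink as λ grows
```

Plotting each column of `path` against `lambdas` reproduces the qualitative shape of the slide's coefficient-path plot.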
How to choose λ
The regression/ML workflow
1. Model selection: Need to choose tuning parameters λ controlling model complexity
2. Model assessment: Having selected a model, assess generalization error
Hypothetical implementation
1. Model selection
   For each considered λ:
   i. Estimate parameters ŵλ on training data
   ii. Assess performance of ŵλ on test data
   iii. Choose λ* to be the λ with lowest test error
2. Model assessment
   Compute test error of ŵλ* (fitted model for selected λ*) to approx. true error
[Diagram: data split into Training set | Test set]
Overly optimistic!
Hypothetical implementation
Issue: Just like fitting ŵ and assessing its performance both on training data:
• λ* was selected to minimize test error (i.e., λ* was fit on test data)
• If test data is not representative of the whole world, then ŵλ* will typically perform worse than test error indicates
Practical implementation
Solution: Create two "test" sets!
1. Select λ* such that ŵλ* minimizes error on validation set
2. Approximate true error of ŵλ* using test set

[Diagram: data split into Training set | Validation set | Test set]
Practical implementation
Training set: fit ŵλ
Validation set: test performance of ŵλ to select λ*
Test set: assess true error of ŵλ*
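The three-way workflow can be sketched end to end in NumPy. This is a toy illustration on synthetic data; the split sizes, λ grid, and helper names (`fit`, `mse`) are assumptions for the example, not course code.

```python
import numpy as np

# Synthetic data (made up) for the train / validation / test workflow.
np.random.seed(3)
N, D = 200, 6
H = np.random.randn(N, D)
y = H @ np.random.randn(D) + 0.5 * np.random.randn(N)

# 80% / 10% / 10% split
H_tr, y_tr = H[:160], y[:160]
H_va, y_va = H[160:180], y[160:180]
H_te, y_te = H[180:], y[180:]

def fit(H, y, lam):
    """Ridge closed form: minimize RSS(w) + lam * ||w||^2."""
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

def mse(H, y, w):
    r = y - H @ w
    return float(r @ r) / len(y)

# 1. Model selection: pick λ* minimizing validation error
lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
val_errors = [mse(H_va, y_va, fit(H_tr, y_tr, lam)) for lam in lambdas]
lam_star = lambdas[int(np.argmin(val_errors))]

# 2. Model assessment: approximate true error on the held-out test set
test_error = mse(H_te, y_te, fit(H_tr, y_tr, lam_star))
print(lam_star, test_error)
```

Note the test set is touched exactly once, after λ* is fixed, so `test_error` stays an honest estimate.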
Typical splits
Training set | Validation set | Test set
    80%      |      10%       |   10%
    50%      |      25%       |   25%
How to handle the intercept
PRACTICALITIES
Recall multiple regression model
Model: yi = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) + εi
          = Σ_{j=0}^{D} wj hj(xi) + εi

feature 1 = h0(x) … often 1 (constant)
feature 2 = h1(x) … e.g., x[1]
feature 3 = h2(x) … e.g., x[2]
…
feature D+1 = hD(x) … e.g., x[d]
Do we penalize intercept?
Standard ridge regression cost:

RSS(w) + λ||w||₂²    (λ = strength of penalty)

Encourages intercept w0 to also be small.
Do we want a small intercept? Conceptually, not indicative of overfitting…
Option 1: Don’t penalize intercept
Modified ridge regression cost:

RSS(w0, wrest) + λ||wrest||₂²
Option 2: Center data first
If data are first centered about 0, then favoring a small intercept is not so worrisome.

Step 1: Transform y to have 0 mean
Step 2: Run ridge regression as normal (closed-form or gradient algorithms)
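Option 2 can be sketched as follows, on made-up data. The intercept is recovered as the mean of y, which is exact when the features are also centered (the randn features here are only approximately centered); names like `predict` are illustrative.

```python
import numpy as np

# Made-up data with a large true intercept of 7.0.
np.random.seed(4)
N, D = 100, 3
H = np.random.randn(N, D)
y = 7.0 + H @ np.array([1.0, 2.0, -1.0]) + 0.1 * np.random.randn(N)

# Step 1: transform y to have 0 mean
y_mean = y.mean()
y_centered = y - y_mean

# Step 2: run ridge regression as normal (closed form here), no intercept term
lam = 1.0
w = np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y_centered)

def predict(H_new):
    # Add the stored mean back in as the (unpenalized) intercept.
    return H_new @ w + y_mean

print(y_mean)
```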
Feature normalization
PRACTICALITIES
Normalizing features
Scale training columns (not rows!) as:

h̃j(xk) = hj(xk) / zj,  with normalizer zj = sqrt( Σ_{i=1}^{N} hj(xi)² )  (summing over training points)

Apply same training scale factors to test data:

h̃j(xk) = hj(xk) / zj  (same zj computed on training points, applied to test point)

[Diagram: training features and test features tables with columns Feat. 0, Feat. 1, Feat. 2, …, Feat. D; each column j divided by its normalizer zj.]
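The normalization above is a one-liner per step in NumPy. A sketch on made-up feature matrices, with the key point that `z` comes only from the training columns:

```python
import numpy as np

# Made-up train/test features with wildly different column scales.
np.random.seed(5)
H_train = np.random.randn(40, 3) * np.array([1.0, 100.0, 0.01])
H_test = np.random.randn(10, 3) * np.array([1.0, 100.0, 0.01])

# z_j = sqrt( sum_i h_j(x_i)^2 ), summing over TRAINING points only
z = np.sqrt((H_train ** 2).sum(axis=0))

H_train_norm = H_train / z
H_test_norm = H_test / z   # same training scale factors applied to test

# Each normalized training column now has unit sum of squares.
print((H_train_norm ** 2).sum(axis=0))
```

Computing `z` from the test set instead would leak test information into the model, which is why the slide stresses reusing the training normalizers.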
Summary for ridge regression
What you can do now…
• Describe what happens to magnitude of estimated coefficients when model is overfit
• Motivate form of ridge regression cost function
• Describe what happens to estimated coefficients of ridge regression as tuning parameter λ is varied
• Interpret coefficient path plot
• Use a validation set to select the ridge regression tuning parameter λ
• Handle intercept and scale of features with care