Lasso Regression - cs229.stanford.edu


Transcript of Lasso Regression - cs229.stanford.edu

Page 1: Lasso Regression - cs229.stanford.edu

CS229: Machine Learning
Carlos Guestrin, Stanford University
Slides include content developed by and co-developed with Emily Fox

©2021 Carlos Guestrin

Lasso Regression: Regularization for feature selection

Page 2: Lasso Regression - cs229.stanford.edu

Feature selection task

Page 3: Lasso Regression - cs229.stanford.edu

Why might you want to perform feature selection?

Efficiency:
- If size(w) = 100B, each prediction is expensive
- If ŵ is sparse, computation depends only on the # of non-zeros:

$$\hat{y}_i = \sum_{j:\, \hat{w}_j \neq 0} \hat{w}_j \, h_j(x_i)$$

Interpretability:
- Which features are relevant for prediction?
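To make the efficiency point concrete, here is a minimal sketch (illustrative names and data, not from the slides) of predicting with only the non-zero coefficients:

```python
# Sketch: with a sparse coefficient vector, prediction cost depends on the
# number of non-zeros, not on the full dimension D.
import numpy as np

def predict_sparse(w_hat, h_x):
    """y_hat = sum over {j: w_hat_j != 0} of w_hat_j * h_j(x)."""
    nz = np.flatnonzero(w_hat)      # indices j with w_hat_j != 0
    return w_hat[nz] @ h_x[nz]      # touches only the non-zero entries

D = 100_000
w_hat = np.zeros(D)
w_hat[[3, 17, 42]] = [1.5, -2.0, 0.7]   # a very sparse model
h_x = np.random.randn(D)                # feature values h_j(x) for one example
print(predict_sparse(w_hat, h_x))
```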

Page 4: Lasso Regression - cs229.stanford.edu

Sparsity: Housing application

Predict sale price ($?) from a long list of possible features:

Lot size, Single Family, Year built, Last sold price, Last sale price/sqft, Finished sqft, Unfinished sqft, Finished basement sqft, # floors, Flooring types, Parking type, Parking amount, Cooling, Heating, Exterior materials, Roof type, Structure style, Dishwasher, Garbage disposal, Microwave, Range/Oven, Refrigerator, Washer, Dryer, Laundry location, Heating type, Jetted Tub, Deck, Fenced Yard, Lawn, Garden, Sprinkler System, …

Page 5: Lasso Regression - cs229.stanford.edu

Sparsity: Reading your mind

Activity in which brain regions can predict happiness?

[Figure: brain activity maps paired with a response scale from "very sad" to "very happy"]

Page 6: Lasso Regression - cs229.stanford.edu

Explaining Predictions

"Why Should I Trust You?": Explaining the Predictions of Any Classifier. Ribeiro, Singh & Guestrin, KDD 2016.

Page 7: Lasso Regression - cs229.stanford.edu

Option 1: All subsets or greedy variants

Page 8: Lasso Regression - cs229.stanford.edu

Find best model for each size

[Figure: RSS(ŵ) vs. # of features, for 0 through 8 features]

Candidate features: # bedrooms, # bathrooms, sq.ft. living, sq.ft. lot, floors, year built, year renovated, waterfront

Page 9: Lasso Regression - cs229.stanford.edu

Complexity of "all subsets"

Each subset of features corresponds to one candidate model:

[0 0 0 … 0 0 0]:  $y_i = \varepsilon_i$
[1 0 0 … 0 0 0]:  $y_i = w_0 h_0(x_i) + \varepsilon_i$
[0 1 0 … 0 0 0]:  $y_i = w_1 h_1(x_i) + \varepsilon_i$
[1 1 0 … 0 0 0]:  $y_i = w_0 h_0(x_i) + w_1 h_1(x_i) + \varepsilon_i$
…
[1 1 1 … 1 1 1]:  $y_i = w_0 h_0(x_i) + w_1 h_1(x_i) + \ldots + w_D h_D(x_i) + \varepsilon_i$

$2^D$ subsets in total:
- $2^8$ = 256
- $2^{30}$ = 1,073,741,824
- $2^{1000}$ ≈ 1.071509 × 10^301
- $2^{100B}$ = HUGE!!!!!!

Typically, computationally infeasible.
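For intuition, a brute-force sketch (illustrative code, not from the slides) that fits least squares on every subset — workable only for tiny D:

```python
# Sketch of "all subsets": fit least squares on each of the 2^D feature
# subsets and keep the best RSS per subset size. Already ~10^9 fits at
# D = 30, which is why this is typically infeasible.
import itertools
import numpy as np

def all_subsets(X, y):
    n, D = X.shape
    best = {}                                   # size k -> (best RSS, subset)
    for k in range(D + 1):
        for S in itertools.combinations(range(D), k):
            if k == 0:
                rss = float(y @ y)              # empty model: y_i = eps_i
            else:
                cols = list(S)
                w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
                r = y - X[:, cols] @ w
                rss = float(r @ r)
            if k not in best or rss < best[k][0]:
                best[k] = (rss, S)
    return best

X = np.random.randn(50, 8)                      # D = 8 -> only 256 fits
y = 2.0 * X[:, 1] - X[:, 4] + 0.1 * np.random.randn(50)
for k, (rss, S) in sorted(all_subsets(X, y).items()):
    print(k, S, round(rss, 3))
```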

Page 10: Lasso Regression - cs229.stanford.edu

Greedy algorithms

Forward stepwise: start from a simple model and iteratively add the feature most useful to the fit (sketched below)

Backward stepwise: start with the full model and iteratively remove the feature least useful to the fit

Combining forward and backward steps: in the forward algorithm, insert steps that remove features which are no longer as important

Lots of other variants, too.
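A minimal sketch of forward stepwise selection, assuming plain least-squares fits at each step (illustrative code, not from the slides):

```python
# Forward stepwise: greedily add the feature that most reduces RSS.
# Roughly D * k least-squares fits instead of 2^D.
import numpy as np

def forward_stepwise(X, y, max_features):
    n, D = X.shape
    selected, remaining = [], list(range(D))
    for _ in range(max_features):
        def rss_with(j):
            cols = selected + [j]
            w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            r = y - X[:, cols] @ w
            return float(r @ r)
        best_j = min(remaining, key=rss_with)   # feature most useful to fit
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

X = np.random.randn(100, 10)
y = 3 * X[:, 2] - 2 * X[:, 7] + 0.1 * np.random.randn(100)
print(forward_stepwise(X, y, 3))   # should pick features 2 and 7 early
```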


Page 11: Lasso Regression - cs229.stanford.edu

Option 2: Regularize

Page 12: Lasso Regression - cs229.stanford.edu

Ridge regression: L2 regularized regression

Total cost = measure of fit + λ · measure of magnitude of coefficients

- measure of fit: RSS(w)
- measure of magnitude: $\|w\|_2^2 = w_0^2 + \ldots + w_D^2$
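A quick sketch of fitting ridge with scikit-learn on assumed synthetic data; sklearn's `alpha` plays the role of λ here, though its exact objective scaling differs slightly from RSS(w) + λ‖w‖₂²:

```python
# Ridge regression: coefficients shrink toward 0 as alpha (lambda) grows,
# but typically none become exactly 0.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
true_w = np.array([1.0, 0, 0, 2.0, 0, 0, -1.5, 0])
y = X @ true_w + 0.1 * rng.standard_normal(200)

ridge = Ridge(alpha=10.0).fit(X, y)
print(np.round(ridge.coef_, 3))   # shrunk everywhere, no exact zeros
```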

Page 13: Lasso Regression - cs229.stanford.edu

Coefficient path – ridge

[Figure: "coefficient paths −− Ridge" — coefficients ŵ_j (y-axis) vs. λ (x-axis, 0 to 200000), for bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, yr_renovat, waterfront]

Page 14: Lasso Regression - cs229.stanford.edu

Using regularization for feature selection

Instead of searching over a discrete set of solutions, can we use regularization?
- Start with the full model (all possible features)
- "Shrink" some coefficients exactly to 0 (i.e., knock out certain features)
- Non-zero coefficients indicate "selected" features

Page 15: Lasso Regression - cs229.stanford.edu

Thresholding ridge coefficients?

Why don't we just set small ridge coefficients to 0?

[Figure: bar chart of ridge coefficients ŵ_j for # bedrooms, # bathrooms, sq.ft. living, sq.ft. lot, floors, year built, year renovated, last sales price, cost per sq.ft., heating, # showers, waterfront]

Page 16: Lasso Regression - cs229.stanford.edu

Thresholding ridge coefficients?

Selected features for a given threshold value

[Figure: the same bar chart with a threshold band around 0; only features whose coefficients fall outside the band are selected]

Page 17: Lasso Regression - cs229.stanford.edu

Thresholding ridge coefficients?

Let's look at two related features…

[Figure: the same bar chart, highlighting two correlated features — # bathrooms and # showers — whose coefficients both fall inside the threshold band]

Nothing measuring bathrooms was included!

Page 18: Lasso Regression - cs229.stanford.edu

Thresholding ridge coefficients?

If only one of the features had been included…

[Figure: the same bar chart with # showers removed; the # bathrooms coefficient absorbs the shared weight and rises above the threshold]

Page 19: Lasso Regression - cs229.stanford.edu

Thresholding ridge coefficients?

Would have included bathrooms in the selected model

[Figure: the same bar chart with # bathrooms now above the threshold]

Can regularization lead directly to sparsity?

Page 20: Lasso Regression - cs229.stanford.edu

Try this cost instead of ridge…

Total cost = measure of fit + λ · measure of magnitude of coefficients

- measure of fit: RSS(w)
- measure of magnitude: $\|w\|_1 = |w_0| + \ldots + |w_D|$

Lasso regression (a.k.a. L1 regularized regression)

Leads to sparse solutions!
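A sketch of the contrast on assumed synthetic data, using sklearn's `Lasso` (its objective is a rescaled (1/2n)·RSS + alpha·‖w‖₁, so `alpha` corresponds to λ up to scaling):

```python
# L1 vs L2 penalty on the same data: lasso typically drives irrelevant
# coefficients exactly to 0, while ridge only shrinks them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
y = X @ np.array([1.0, 0, 0, 2.0, 0, 0, -1.5, 0]) + 0.1 * rng.standard_normal(200)

print("ridge:", np.round(Ridge(alpha=10.0).fit(X, y).coef_, 3))  # no exact zeros
print("lasso:", np.round(Lasso(alpha=0.1).fit(X, y).coef_, 3))   # exact zeros
```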

Page 21: Lasso Regression - cs229.stanford.edu

Lasso regression: L1 regularized regression

Objective: $\text{RSS}(w) + \lambda \|w\|_1$, where the tuning parameter λ sets the balance of fit and sparsity.

Just like ridge regression, the solution is governed by the continuous parameter λ:
- If λ = 0: no penalty; ŵ is the unregularized least squares solution ŵ_LS
- If λ = ∞: ŵ = 0, since any non-zero coefficient incurs infinite cost
- If λ in between: 0 ≤ ‖ŵ‖₁ ≤ ‖ŵ_LS‖₁

Page 22: Lasso Regression - cs229.stanford.edu

Coefficient path – ridge

[Figure: the ridge coefficient path plot from page 13, repeated for comparison]

Page 23: Lasso Regression - cs229.stanford.edu

Coefficient path – lasso

[Figure: "coefficient paths −− LASSO" — coefficients ŵ_j (y-axis) vs. λ (x-axis, 0 to 200000) for the same eight features; unlike ridge, coefficients hit exactly 0 as λ grows]
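Such a plot can be recreated with sklearn's `lasso_path`; a sketch on assumed synthetic data (sklearn picks its own `alpha` grid in place of the λ axis above):

```python
# Trace lasso coefficients across a grid of penalties and plot the paths.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
y = X @ np.array([1.0, 0, 0, 2.0, 0, 0, -1.5, 0]) + 0.1 * rng.standard_normal(200)

alphas, coefs, _ = lasso_path(X, y)      # coefs: shape (n_features, n_alphas)
for j in range(coefs.shape[0]):
    plt.plot(alphas, coefs[j], label=f"feature {j}")
plt.xlabel("alpha (lambda)")
plt.ylabel("coefficient w_j")
plt.legend()
plt.show()                               # coefficients reach exactly 0 one by one
```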

Page 24: Lasso Regression - cs229.stanford.edu

Practical concerns with lasso

Page 25: Lasso Regression - cs229.stanford.edu

Debiasing lasso

Lasso shrinks coefficients relative to the LS solution → more bias, less variance.

Can reduce bias as follows:
1. Run lasso to select features
2. Run least squares regression with only the selected features

"Relevant" features are then no longer shrunk relative to the LS fit of the same reduced model.
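A minimal sketch of this two-step procedure, on assumed synthetic data:

```python
# Step 1: lasso selects features; step 2: unpenalized least squares on
# just those features removes the shrinkage on the survivors.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
true_w = np.zeros(20)
true_w[[2, 5, 11]] = [3.0, -2.0, 1.5]
y = X @ true_w + 0.1 * rng.standard_normal(200)

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)           # features with w_j != 0

ols = LinearRegression().fit(X[:, selected], y)  # refit without penalty
print(selected, np.round(ols.coef_, 3))          # close to the true values
```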

[Figure used with permission of Mario Figueiredo (captions modified to fit course). Panels: True coefficients (D=4096, non-zero = 160); L1 reconstruction (non-zero = 1024, MSE = 0.0072); Debiased (non-zero = 1024, MSE = 3.26e-005); Least squares (non-zero = 0, MSE = 1.568)]

Page 26: Lasso Regression - cs229.stanford.edu

Issues with the standard lasso objective

1. With a group of highly correlated features, lasso tends to select among them arbitrarily
   - Often we would prefer to select the whole group together
2. Empirically, ridge often has better predictive performance than lasso, but lasso leads to a sparser solution

Elastic net aims to address these issues:
- a hybrid between lasso and ridge regression
- uses both L1 and L2 penalties

See Zou & Hastie '05 for further discussion.
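A sketch with sklearn's `ElasticNet`, where `l1_ratio` mixes the two penalties (1.0 = pure lasso, 0.0 = pure ridge); the correlated-features setup is assumed for illustration:

```python
# Elastic net on three nearly identical (highly correlated) features:
# the mixed penalty tends to keep the group together rather than pick one.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
z = rng.standard_normal((200, 1))
X = np.hstack([z + 0.01 * rng.standard_normal((200, 3)),   # correlated trio
               rng.standard_normal((200, 5))])              # unrelated features
y = X[:, 0] + X[:, 1] + X[:, 2] + 0.1 * rng.standard_normal(200)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.round(enet.coef_, 3))   # weight tends to spread across the trio
```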


Page 27: Lasso Regression - cs229.stanford.edu

Summary for feature selection and lasso regression

Page 28: Lasso Regression - cs229.stanford.edu

Impact of feature selection and lasso

Lasso has changed machine learning, statistics, & electrical engineering.

But, for feature selection in general, be careful about interpreting selected features:
- selection only considers the features included
- it is sensitive to correlations between features
- the result depends on the algorithm used
- there are theoretical guarantees for lasso under certain conditions

Page 29: Lasso Regression - cs229.stanford.edu

What you can do now…
• Describe "all subsets" and greedy variants for feature selection
• Analyze computational costs of these algorithms
• Formulate the lasso objective
• Describe what happens to estimated lasso coefficients as the tuning parameter λ is varied
• Interpret a lasso coefficient path plot
• Contrast ridge and lasso regression
• Implement K-fold cross validation to select the lasso tuning parameter λ (sketched below)
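A sketch of that last item using sklearn's `LassoCV`, which runs K-fold cross validation over a grid of penalties (synthetic data assumed):

```python
# Choose the lasso penalty by 5-fold cross validation.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
y = X @ np.array([1.0, 0, 0, 2.0, 0, 0, -1.5, 0]) + 0.1 * rng.standard_normal(200)

cv_model = LassoCV(cv=5).fit(X, y)       # searches its own alpha grid
print("chosen alpha:", cv_model.alpha_)
print("coefficients:", np.round(cv_model.coef_, 3))
```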
