Lecture 8: Beyond linearity


Transcript of Lecture 8: Beyond linearity

Page 1: Lecture 8: Beyond linearity

Lecture 8: Beyond linearity

STATS 202: Data Mining and Analysis

Linh [email protected]

Department of Statistics, Stanford University

July 19, 2021


Page 2: Lecture 8: Beyond linearity

Announcements

- HW2 is being graded.

- Midterm is this Wednesday (good luck!)

  - Will be available via Gradescope & Piazza
  - Covering material up to (and including) lecture 7
  - No coded portion
  - Submit completed exam to Gradescope

- No session this Friday.

Page 3: Lecture 8: Beyond linearity

Outline

- Extending linear models

- Basis functions

- Piecewise models

- Smoothing splines

- Local models

- Generalized additive models

Page 4: Lecture 8: Beyond linearity

Recall

We can use linear models for regression / classification, e.g.

E[Y | X] = β0 + β1 X1 + β2 X2 + ...   (1)

- When relationships are non-linear, we could e.g. include polynomial terms (e.g. X², X³, ...) or interactions (e.g. X1 * X2, ...)

- Could then use e.g. stepwise regression to do model fitting

- n.b. Recall that using too many parameters can result in overfitting

Question: Assuming n >> p, why not just do every transformation we can think of and throw all of them into e.g. the lasso?
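As a hypothetical illustration of the question's premise (not from the slides), here is a minimal sketch that generates many polynomial and interaction transformations and lets the lasso select among them; the data are simulated and all names are ours:

```python
# Expand a few predictors into many transformations, then let the
# lasso pick: a sketch of "throw everything into the lasso".
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 500, 5
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - X[:, 1] ** 2 + rng.normal(size=n)  # non-linear truth

# All polynomial terms up to degree 3, including interactions.
X_big = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X)

fit = LassoCV(cv=5).fit(X_big, y)
print(X_big.shape[1], "candidate features;",
      np.sum(fit.coef_ != 0), "kept by the lasso")
```

The example on the next page shows why this can still backfire when many candidate features are pure noise.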



Page 6: Lecture 8: Beyond linearity

Example

Consider a distribution where only 20 predictors are truly associated with the outcome (out of 20, 50, or 2000 total predictors).

[Figure: test error versus degrees of freedom, for p = 20, p = 50, and p = 2000]

- Test error increases with more predictors and parameters

- Lesson: Using too many predictors/parameters can hurt performance

- Selecting predictors carefully can/will lead to better performance


Page 8: Lecture 8: Beyond linearity

Follow-up

Question: What about if we have p ≥ n? (Pretty common now, since collecting data has become cheap and easy.) Examples:

- Medicine: Instead of regressing heart disease onto just a few clinical observations (blood pressure, salt consumption, age), we use in addition 500,000 single nucleotide polymorphisms.

- Marketing: Using search terms to understand online shopping patterns. A bag of words model defines one feature for every possible search term, which counts the number of times the term appears in a person's search. There can be as many features as words in the dictionary.

Page 9: Lecture 8: Beyond linearity

Follow-up

When n = p, we can find a perfect fit for the training data, e.g.

[Figure: two panels of Y versus X, illustrating a perfect (interpolating) fit to the training data]

When p > n, OLS doesn't have a unique solution.

- Can use e.g. stepwise variable selection, ridge, lasso, etc.

- Still have to deal with estimating σ̂².


Page 11: Lecture 8: Beyond linearity

Basis functions

Back to using polynomial terms

[Figure: degree-4 polynomial fit of Wage versus Age (left) and of Pr(Wage > 250 | Age) (right)]

- Generalizing the notation, we have

  E[Y | X] = β0 + β1 b1(X) + β2 b2(X) + · · · + βd bd(X)   (2)

  where bi(x) = x^i.

- We don't have to just use polynomials; we could use cutpoints instead,

  i.e. bi(x) = I(ci ≤ x < ci+1).
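A minimal numpy sketch of both choices of basis; the cut locations and degree are arbitrary assumptions, and the resulting design matrix can be handed to ordinary least squares:

```python
# Two choices of basis functions b_i(x) for a single predictor.
import numpy as np

x = np.linspace(20, 80, 200)  # e.g. Age

# Polynomial basis: b_i(x) = x**i for i = 1..d.
d = 4
B_poly = np.column_stack([x ** i for i in range(1, d + 1)])

# Cutpoint basis: b_i(x) = I(c_i <= x < c_{i+1}) for chosen cutpoints.
cuts = [20, 35, 50, 65, 80]
B_step = np.column_stack([(lo <= x) & (x < hi)
                          for lo, hi in zip(cuts[:-1], cuts[1:])]).astype(float)

# Either matrix (plus an intercept column) defines a linear model in
# the transformed features, fit e.g. with np.linalg.lstsq.
```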



Page 13: Lecture 8: Beyond linearity

Basis functions: Example

Example using Wage data.

[Figure: piecewise-constant fit of Wage versus Age (left) and of Pr(Wage > 250 | Age) (right)]

Page 14: Lecture 8: Beyond linearity

Basis function options

We can also use other basis functions, e.g. piecewise polynomials:

[Figure: fits of Wage versus Age using a piecewise cubic, a continuous piecewise cubic, a cubic spline, and a linear spline]

Note: the points where the coefficients change are called knots.

Page 15: Lecture 8: Beyond linearity

Piecewise polynomials

Example of a piecewise cubic model with 1 knot & d = 3:

yi = β01 + β11 xi + β21 xi² + β31 xi³ + εi   if xi < c
yi = β02 + β12 xi + β22 xi² + β32 xi³ + εi   if xi ≥ c   (3)

i.e. We're fitting two regression models (for xi < c and xi ≥ c).

- This can be generalized to any number of knots ≤ n

- Problem: the resulting function is discontinuous (see the sketch below)
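A minimal sketch of equation (3) on simulated data, with an arbitrary knot c; the jump between the two fits at c is exactly the discontinuity noted above:

```python
# Fit separate cubics on either side of a single knot c.
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(20, 70, 200))
y = np.sin(x / 8) + rng.normal(scale=0.2, size=x.size)
c = 45.0  # the knot

left, right = x < c, x >= c
beta_left = np.polyfit(x[left], y[left], deg=3)    # cubic for xi < c
beta_right = np.polyfit(x[right], y[right], deg=3) # cubic for xi >= c

# The two predictions at x = c generally disagree: a discontinuity.
print(np.polyval(beta_left, c), np.polyval(beta_right, c))
```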


Page 16: Lecture 8: Beyond linearity

Cubic splines

Very popular, since they give very smooth predictions over X.

- Define a set of knots ξ1 < ξ2 < · · · < ξK.

- We want the function Y = f(X) to:

  1. Be a cubic polynomial between every pair of knots ξi, ξi+1.
  2. Be continuous at each knot.
  3. Have continuous first and second derivatives at each knot.

Fact: Given these constraints, we need K + 3 basis functions:

f(X) = β0 + β1 X + β2 X² + β3 X³ + β4 h(X, ξ1) + · · · + βK+3 h(X, ξK)   (4)

where

h(x, ξ) = (x − ξ)³ if x > ξ, and 0 otherwise.
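A sketch of this truncated-power basis on simulated data; the helper names and knot locations are our own:

```python
# Build the K + 3 basis columns of equation (4) and fit by OLS.
import numpy as np

def h(x, xi):
    """Truncated cubic: (x - xi)^3 for x > xi, else 0."""
    return np.where(x > xi, (x - xi) ** 3, 0.0)

def cubic_spline_basis(x, knots):
    """Columns: x, x^2, x^3, h(x, xi_1), ..., h(x, xi_K)."""
    return np.column_stack([x, x ** 2, x ** 3] + [h(x, xi) for xi in knots])

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 300))
y = np.cos(x) + rng.normal(scale=0.3, size=x.size)

B = cubic_spline_basis(x, knots=[3.3, 6.6])   # K = 2 knots
X = np.column_stack([np.ones_like(x), B])     # add the intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # K + 4 = 6 coefficients
```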



Page 19: Lecture 8: Beyond linearity

Piecewise polynomials vs cubic splines

[Figure: fits of Wage versus Age using a piecewise cubic, a continuous piecewise cubic, a cubic spline, and a linear spline]

- The piecewise model has df = 8.

- The cubic spline has df = 5.

- In general, K knots result in df = 4 + K.

Page 20: Lecture 8: Beyond linearity

Natural cubic splines

Spline which is linear instead of cubic for X < ξ1 & X > ξK

[Figure: natural cubic spline vs. cubic spline fits of Wage versus Age]

Makes predictions more stable for extreme values of X.

- Will typically have fewer degrees of freedom than cubic splines.

Page 21: Lecture 8: Beyond linearity

Number and location of knots

- The locations of the knots are typically quantiles of X.

- The number of knots, K, is chosen by cross-validation (see the sketch after the figure).

[Figure: mean squared error versus degrees of freedom for the natural spline (left) and the cubic spline (right)]
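A minimal sketch of this selection on simulated data, reusing the truncated-power basis from equation (4); sklearn's cross-validation utilities do the bookkeeping:

```python
# Pick the number of knots K by 5-fold cross-validation.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

def spline_basis(x, knots):
    trunc = [np.where(x > k, (x - k) ** 3, 0.0) for k in knots]
    return np.column_stack([x, x ** 2, x ** 3] + trunc)

rng = np.random.default_rng(9)
x = np.sort(rng.uniform(0, 10, 300))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

for K in range(1, 8):
    knots = np.quantile(x, np.linspace(0, 1, K + 2)[1:-1])  # interior quantiles
    B = spline_basis(x, knots)
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    errs = [np.mean((y[te] - LinearRegression().fit(B[tr], y[tr]).predict(B[te])) ** 2)
            for tr, te in cv.split(B)]
    print(K, "knots: CV MSE =", round(float(np.mean(errs)), 4))
```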


Page 22: Lecture 8: Beyond linearity

Splines vs polynomial regression

- Splines can fit complex functions with few parameters

- Polynomials require high-degree terms to be flexible

- May also result in non-convergence

- High-degree polynomials (like vanilla cubic splines) can be unstable at the edges

[Figure: natural cubic spline vs. polynomial fit of Wage versus Age]

Page 23: Lecture 8: Beyond linearity

Splines vs polynomial regression

Applied example: California unemployment rates

[Figure: polynomial regression (df = 12) vs. natural cubic spline (df = 12) fits of the California unemployment rate versus year]


Page 25: Lecture 8: Beyond linearity

Smoothing splines

Our goal is to find the function f which minimizes

∑i=1..n (yi − f(xi))² + λ ∫ f″(x)² dx

- The first term is the RSS of the model.

- The second term is a penalty for the roughness of the function.

For regularization, we have λ ∈ [0, ∞):

- When λ = 0, f can be any function that interpolates the data.

- As λ → ∞, f approaches the simple least squares fit.


Page 27: Lecture 8: Beyond linearity

Smoothing splines

Notes:

- f̂ becomes more flexible as λ → 0.

- The minimizer f̂ is a natural cubic spline, with knots at each sample point x1, . . . , xn.

  - Though usually not the same as directly applying natural cubic splines, due to λ.

- Our function can therefore be represented as

  f̂(x) = ∑j=1..n Nj(x) θ̂j   (5)

- Obtaining f̂ is similar to a ridge regression, i.e.

  θ̂ = (NᵀN + λΣN)⁻¹ Nᵀ y   (6)

  where N is the basis matrix and {ΣN}jk = ∫ N″j(t) N″k(t) dt penalizes roughness.
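Given a precomputed basis matrix N and penalty matrix ΣN (hypothetical inputs here), equation (6) is a one-line ridge-type solve; in practice a library routine such as scipy's UnivariateSpline fits a smoothing spline directly, with its s argument playing a role loosely analogous to λ (larger s gives a smoother fit):

```python
# Equation (6) as a linear solve, plus a ready-made smoothing spline.
import numpy as np
from scipy.interpolate import UnivariateSpline

def smoothing_spline_coefs(N, Sigma_N, y, lam):
    """theta_hat = (N'N + lam * Sigma_N)^(-1) N'y, as in (6)."""
    return np.linalg.solve(N.T @ N + lam * Sigma_N, N.T @ y)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

interp = UnivariateSpline(x, y, s=0.0)   # lambda -> 0: interpolates the data
smooth = UnivariateSpline(x, y, s=20.0)  # heavier roughness penalty
```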


Page 28: Lecture 8: Beyond linearity

Natural cubic splines vs. Smoothing splines

Natural cubic splines:

- Fix K ≤ n and choose the locations of the K knots at quantiles of X.

- Find the natural cubic spline f̂ which minimizes the RSS:

  ∑i=1..n (yi − f(xi))²

- Choose K by cross-validation.

Smoothing splines:

- Put n knots at x1, . . . , xn.

- Obtain the fitted values which minimize:

  ∑i=1..n (yi − f(xi))² + λ ∫ f″(x)² dx

- Choose λ by cross-validation.

Page 29: Lecture 8: Beyond linearity

LOOCV

Similar to linear regression, there's a shortcut for LOOCV:

CV(n) = (1/n) ∑i=1..n (yi − f̂λ^(−i)(xi))² = (1/n) ∑i=1..n [ (yi − f̂λ(xi)) / (1 − {Sλ}ii) ]²   (7)

where Sλ is the n × n smoother matrix mapping the responses to the fitted values, f̂λ = Sλ y.
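A sketch of the shortcut in equation (7), assuming the smoother matrix Sλ is available; the demo uses the OLS hat matrix, which is also a linear smoother:

```python
# LOOCV from a single fit, given a linear smoother S with y_hat = S @ y.
import numpy as np

def loocv_shortcut(y, S):
    resid = (y - S @ y) / (1.0 - np.diag(S))  # leave-one-out residuals
    return np.mean(resid ** 2)

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=50)
S = X @ np.linalg.solve(X.T @ X, X.T)  # the hat matrix
print(loocv_shortcut(y, S))
```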


Page 30: Lecture 8: Beyond linearity

Smoothing splines

Example

[Figure: smoothing spline fits of Wage versus Age with 16 degrees of freedom and with 6.8 degrees of freedom, chosen by LOOCV]

n.b. We're measuring effective degrees of freedom.

Page 31: Lecture 8: Beyond linearity

Kernel smoothing

Idea: Why not just use the subset of observations closest to the point we're predicting at?

- Observations are averaged locally for predictions.

- Can use different weighting kernels, e.g.

  Kλ(x0, x) = D( |x − x0| / hλ(x0) )   (8)
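A minimal sketch of equation (8) as a local weighted average, assuming an Epanechnikov kernel D and a fixed bandwidth hλ(x0) = h; the data are simulated:

```python
# Kernel smoothing: a weighted local average around each target x0.
import numpy as np

def epanechnikov(t):
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t ** 2), 0.0)

def kernel_smooth(x0, x, y, h=0.2):
    w = epanechnikov(np.abs(x - x0) / h)  # K_lambda(x0, x) in (8)
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * x) + rng.normal(scale=0.3, size=x.size)
fitted = [kernel_smooth(g, x, y) for g in np.linspace(0.05, 0.95, 10)]
```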


Page 32: Lecture 8: Beyond linearity

Local linear regression

Idea: Kernel smoothing with regression.

[Figure: illustration of local regression on simulated data]

i.e. perform regression locally.

Page 33: Lecture 8: Beyond linearity

Local linear regression

To predict the regression function f at an input x:

1. Assign a weight Ki to each training point xi, such that:

   - Ki = 0 unless xi is one of the k nearest neighbors of x.
   - Ki decreases as the distance d(x, xi) increases.

2. Perform a weighted least squares regression; i.e. find (β0, β1) which minimize

   ∑i=1..n Ki (yi − β0 − β1 xi)².

3. Predict f̂(x) = β̂0 + β̂1 x.
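A minimal sketch of steps 1-3 on simulated data; the tricube kernel and the span are our own assumptions:

```python
# Local linear regression at a single target point x0.
import numpy as np

def local_linear(x0, x, y, span=0.3):
    k = int(np.ceil(span * len(x)))  # number of neighbors used
    d = np.abs(x - x0)
    d_k = np.sort(d)[k - 1]          # distance to the k-th neighbor
    w = np.where(d <= d_k, (1 - (d / d_k) ** 3) ** 3, 0.0)  # tricube K_i
    X = np.column_stack([np.ones_like(x), x])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # step 2
    return beta[0] + beta[1] * x0                     # step 3

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * x) + rng.normal(scale=0.3, size=x.size)
print(local_linear(0.5, x, y))
```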


Page 34: Lecture 8: Beyond linearity

Local linear regression

[Figure: illustration of local regression on simulated data]

Some notes:

- The span k/n is the fraction of training samples used in each regression (the most important hyperparameter for the model)

- Sometimes referred to as a memory-based procedure

- Can mix this with linear models (e.g. global in some variables, but local in others)

- Performs poorly in high-dimensional spaces

Page 35: Lecture 8: Beyond linearity

Local linear regression

Example

[Figure: local linear regression fits of Wage versus Age with span 0.2 (16.4 degrees of freedom) and span 0.7 (5.3 degrees of freedom)]

The span, k/n, is chosen by cross-validation.

Page 36: Lecture 8: Beyond linearity

Kernel smoothing vs Local linear regression

The main differences occur at the edges.

Page 37: Lecture 8: Beyond linearity

Local polynomial regression

We don’t have to stick with local linear models, e.g.

Can deal with the issue of “trimming the hills” and “filling thevalleys”.

STATS 202: Data Mining and Analysis L. Tran 29/38

Page 38: Lecture 8: Beyond linearity

Generalized additive models

The extension of basis functions to multiple predictors (while maintaining additivity), e.g.

Linear model:

wage = β0 + β1 year + β2 age + β3 education + ε   (9)

Additive model:

wage = β0 + f1(year) + f2(age) + f3(education) + ε   (10)

The functions f1, . . . , fp can be polynomials, natural splines, smoothing splines, local regressions, etc.
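When each fj has a basis representation, the additive model reduces to one big least-squares fit on the concatenated basis columns. A sketch with simulated stand-ins for year, age, and wage (education omitted for brevity), using small polynomial bases:

```python
# A GAM as a single least-squares fit over per-variable basis expansions.
import numpy as np

def poly_basis(x, d=3):
    return np.column_stack([x ** i for i in range(1, d + 1)])

rng = np.random.default_rng(7)
n = 400
year = rng.uniform(0, 1, n)
age = rng.uniform(0, 1, n)
wage = 0.5 * year + np.sin(3 * age) + rng.normal(scale=0.2, size=n)

# Intercept, then the basis columns for f1(year) and f2(age).
X = np.column_stack([np.ones(n), poly_basis(year), poly_basis(age)])
beta, *_ = np.linalg.lstsq(X, wage, rcond=None)
```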


Page 39: Lecture 8: Beyond linearity

Fitting GAMs

If the functions fi have basis representations, we can use least squares:

- Natural cubic splines

- Polynomials

- Step functions

Otherwise, we can use backfitting to fit our functions:

- Smoothing splines

- Local regression

Page 40: Lecture 8: Beyond linearity

Backfitting a GAM

Otherwise, we can use backfitting:

1. Keep f2, . . . , fp fixed, and fit f1 using the partial residuals

   yi − β0 − f2(xi2) − · · · − fp(xip)

   as the response.

2. Keep f1, f3, . . . , fp fixed, and fit f2 using the partial residuals

   yi − β0 − f1(xi1) − f3(xi3) − · · · − fp(xip)

   as the response.

3. ...

4. Iterate.

This works for smoothing splines and local regression (see the sketch below).
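A minimal backfitting sketch with a generic univariate smoother; here the smoother is scipy's UnivariateSpline with an arbitrary smoothing level, but any smoother (a smoothing spline with cross-validated λ, local regression, etc.) could be dropped in:

```python
# Backfitting: cycle through variables, smoothing partial residuals.
import numpy as np
from scipy.interpolate import UnivariateSpline

def smooth(xj, r):
    order = np.argsort(xj)
    spl = UnivariateSpline(xj[order], r[order], s=len(xj) * 0.05)
    return spl(xj)

def backfit(X, y, n_iter=20):
    n, p = X.shape
    beta0 = y.mean()
    f = np.zeros((p, n))  # current estimates f_j(x_ij)
    for _ in range(n_iter):
        for j in range(p):
            partial = y - beta0 - (f.sum(axis=0) - f[j])  # partial residuals
            f[j] = smooth(X[:, j], partial)
            f[j] -= f[j].mean()  # keep each f_j centered
    return beta0, f

rng = np.random.default_rng(8)
X = rng.uniform(0, 1, (300, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.2, size=300)
beta0, f = backfit(X, y)
```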


Page 41: Lecture 8: Beyond linearity

GAM properties

- GAMs are a step from linear regression toward a nonparametric method

- Mitigates the need to manually try out many different transformations on each variable individually

- We can report degrees of freedom for most non-linear functions (as a way of representing model complexity)

- The only constraint is additivity, which can be partially addressed through adding interaction variables, e.g. Xi Xj

- Similar to linear regression, we can examine the significance of each of the variables

Page 42: Lecture 8: Beyond linearity

Example: Regression for Wage

Using natural cubic splines:

[Figure: fitted functions f1(year), f2(age), and f3(education) for the Wage regression GAM]

- year: natural spline with df = 4.

- age: natural spline with df = 5.

- education: step function.

Page 43: Lecture 8: Beyond linearity

Example: Regression for Wage

Using smoothing splines:

[Figure: fitted functions f1(year), f2(age), and f3(education) for the Wage regression GAM with smoothing splines]

- year: smoothing spline with df = 4.

- age: smoothing spline with df = 5.

- education: step function.

Page 44: Lecture 8: Beyond linearity

GAMs for Classification

Recall: We can use logistic regression for estimating our conditional probabilities.

GAMs very naturally apply to this:

log [ P(Y = 1 | X) / P(Y = 0 | X) ] = β0 + f1(X1) + · · · + fp(Xp).

The fitting algorithm is a version of backfitting, but we won't discuss the details.

Page 45: Lecture 8: Beyond linearity

Example: Classification for Wage>250

[Figure: fitted functions f1(year), f2(age), and f3(education) for the classification GAM on Wage > 250]

- year: linear.

- age: smoothing spline with df = 5.

- education: step function.

n.b. The confidence interval for <HS is very large.

Page 46: Lecture 8: Beyond linearity

References

[1] ISL, Chapters 6.4-7.

[2] ESL, Chapter 5.