Lecture 8: Beyond linearity
STATS 202: Data Mining and Analysis
Linh Tran ([email protected])
Department of Statistics, Stanford University
July 19, 2021
STATS 202: Data Mining and Analysis L. Tran 1/38
Announcements
I HW2 is being graded
I Midterm is this Wednesday (good luck!)
I Will be available via Gradescope & Piazza
I Covering material up to (and including) lecture 7
I No coded portion
I Submit completed exam to Gradescope
I No session this Friday.
Outline
I Extending linear models
I Basis functions
I Piecewise models
I Smoothing splines
I Local models
I Generalized additive models
Recall
We can use linear models for regression / classification, e.g.
E[Y |X] = β0 + β1X1 + β2X2 + ... (1)
I When relationships are non-linear, could e.g. include polynomial terms (e.g. X^2, X^3, ...) or interactions (e.g. X1 ∗ X2, ...)
I Could then use e.g. stepwise regression to do model fitting
I n.b. Recall that using too many parameters can result in overfitting
Question: Assuming n >> p, why not just do every transformation we can think of and throw all of them into e.g. Lasso?
Example
Consider a distribution where only 20 predictors are truly associated with the outcome (out of 20, 50, or 2000 total predictors)
[Figure: test error vs. degrees of freedom, for p = 20, p = 50, and p = 2000 total predictors]
I Test error increases with more predictors and parameters
I Lesson: Using too many predictors/parameters can hurt performance
I Selecting predictors carefully can/will lead to better performance
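The lesson above is easy to reproduce by simulation. A minimal numpy sketch (not from the lecture; all names are illustrative): keep 20 true predictors, pad the design with pure-noise predictors, fit OLS, and watch the test error grow.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, p_true = 100, 1000, 20

def ols_test_error(p_noise):
    """Fit OLS with p_true signal + p_noise pure-noise predictors; return test MSE."""
    p = p_true + p_noise
    beta = np.concatenate([np.ones(p_true), np.zeros(p_noise)])
    X = rng.normal(size=(n_train + n_test, p))
    y = X @ beta + rng.normal(size=n_train + n_test)
    Xtr, ytr = X[:n_train], y[:n_train]
    Xte, yte = X[n_train:], y[n_train:]
    bhat, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    return np.mean((yte - Xte @ bhat) ** 2)

# Test MSE as noise predictors are added (total p = 20, 50, 80)
errs = [ols_test_error(q) for q in (0, 30, 60)]
```

With n_train fixed at 100, the estimator's variance grows with every useless column, so the last entry of `errs` is markedly worse than the first.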
Follow-up
Question: What about if we have p ≥ n? (Pretty common now, since collecting data has become cheap and easy.) Examples:
I Medicine: Instead of regressing heart disease onto just a few clinical observations (blood pressure, salt consumption, age), we use in addition 500,000 single nucleotide polymorphisms.
I Marketing: Using search terms to understand online shopping patterns. A bag of words model defines one feature for every possible search term, which counts the number of times the term appears in a person's search. There can be as many features as words in the dictionary.
Follow-up
When n = p, we can find a perfect fit for the training data, e.g.
[Figure: two fits of Y on X with n = p; each interpolates the training data exactly]
When p > n, OLS doesn’t have a unique solution.
I Can use e.g. stepwise variable selection, ridge, lasso, etc.
I Still have to deal with estimating σ̂²
Basis functions
Back to using polynomial terms
[Figure: left, degree-4 polynomial fit of Wage on Age; right, fitted Pr(Wage > 250 | Age)]
I Generalizing the notation, we have
E[Y |X] = β0 + β1b1(X) + β2b2(X) + ... + βd bd(X) (2)
I where b_i(x) = x^i
I We don't have to just use polynomials (e.g. could use cutpoints instead)
I i.e. b_i(x) = I(c_i ≤ x < c_{i+1})
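The cutpoint basis above turns each indicator b_i(x) = I(c_i ≤ x < c_{i+1}) into a dummy column, and least squares then fits one constant per bin. A minimal numpy sketch (illustrative names, not the lecture's code):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=300)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Cutpoints c_0 < c_1 < ... define basis columns b_i(x) = I(c_i <= x < c_{i+1})
cuts = np.linspace(0, 10, 6)  # edges 0, 2, 4, 6, 8, 10 -> 5 bins
B = np.stack([(x >= lo) & (x < hi) for lo, hi in zip(cuts[:-1], cuts[1:])],
             axis=1).astype(float)
B[x == cuts[-1], -1] = 1.0    # include the right endpoint in the last bin

# Since the indicator columns are disjoint, least squares fits each bin's mean
beta, *_ = np.linalg.lstsq(B, y, rcond=None)
yhat = B @ beta
```

Because the columns are disjoint indicators, each fitted coefficient is exactly the sample mean of y within its bin.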
Basis functions: Example
Example using Wage data.
[Figure: left, piecewise-constant fit of Wage on Age; right, fitted Pr(Wage > 250 | Age)]
Basis function options
We can also use other basis functions, e.g. piecewise polynomials
[Figure: fits of Wage on Age: piecewise cubic, continuous piecewise cubic, cubic spline, and linear spline]
Note: points where the coefficients change are called knots.
Piecewise polynomials
Example of a piecewise cubic model with 1 knot & d = 3:
y_i = β01 + β11 x_i + β21 x_i² + β31 x_i³ + ε_i   if x_i < c
y_i = β02 + β12 x_i + β22 x_i² + β32 x_i³ + ε_i   if x_i ≥ c   (3)
i.e. We’re fitting two regression models (for xi < c and xi ≥ c)
I This can be generalized to any number of knots ≤ n
I Problem: resulting function is discontinuous
Cubic splines
Very popular, since they give very smooth predictions over X .
I Define a set of knots ξ1 < ξ2 < · · · < ξK .
I We want the function Y = f (X ) to:
1. Be a cubic polynomial between every pair of knots ξi , ξi+1.
2. Be continuous at each knot.
3. Have continuous first and second derivatives at each knot.
Fact: Given these constraints, we need K + 3 basis functions:
f(X) = β0 + β1X + β2X² + β3X³ + β4 h(X, ξ1) + · · · + β_{K+3} h(X, ξK)   (4)
where
h(x, ξ) = (x − ξ)³ if x > ξ, and 0 otherwise
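Equation (4) translates directly into a design matrix. A hedged numpy sketch (names are illustrative): with K = 3 knots we get K + 4 = 7 columns including the intercept, and ordinary least squares does the rest.

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated power basis for a cubic spline: 1, x, x^2, x^3, (x - xi_k)_+^3."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.where(x > k, (x - k) ** 3, 0.0) for k in knots]
    return np.stack(cols, axis=1)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

knots = np.array([0.25, 0.5, 0.75])            # K = 3 knots
X = cubic_spline_basis(x, knots)               # 200 x 7 design matrix
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fhat = X @ beta
```

By construction the truncated cubics and their first two derivatives vanish at each knot, so the fitted function automatically satisfies the continuity constraints.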
Piecewise polynomials vs cubic splines
[Figure: fits of Wage on Age: piecewise cubic, continuous piecewise cubic, cubic spline, and linear spline]
I Piecewise model has df = 8.
I Cubic spline has df = 5.
I In general, K knots result in df = K + 4
Natural cubic splines
A spline which is linear, rather than cubic, for X < ξ1 and X > ξK
[Figure: natural cubic spline vs. cubic spline fits of Wage on Age]
Makes predictions more stable for extreme values of X
I Will typically have fewer degrees of freedom than cubic splines
Number and location of knots
I The locations of the knots are typically quantiles of X .
I The number of knots, K, is chosen by cross-validation.
[Figure: mean squared error vs. degrees of freedom, for a natural spline (left) and a cubic spline (right)]
Splines vs polynomial regression
I Splines can fit complex functions with few parameters
I Polynomials require high degree terms to be flexible
I May also result in non-convergence
I High-degree polynomials (like vanilla cubic splines) can be unstable at the edges
[Figure: natural cubic spline vs. polynomial fit of Wage on Age]
Splines vs polynomial regression
Applied example: California unemployment rates
[Figure: California unemployment rate by year: polynomial regression (df = 12) vs. natural cubic spline (df = 12)]
Smoothing splines
Our goal is to find the function f which minimizes
∑_{i=1}^{n} (y_i − f(x_i))² + λ ∫ f''(x)² dx
I The RSS of the model.
I A penalty for the roughness of the function.
For regularization, we have λ ∈ [0,∞)
I When λ = 0, f can be any function that interpolates the data.
I As λ → ∞, f becomes the linear least squares fit
Smoothing splines
Notes:
I f̂ becomes more flexible as λ→ 0.
I The minimizer f̂ is a natural cubic spline, with knots at each sample point x1, . . . , xn.
I Though usually not the same as directly applying natural cubic splines due to λ
I Our function can therefore be represented as
f̂(x) = ∑_{j=1}^{n} N_j(x) θ̂_j   (5)
I Obtaining f̂ is similar to a Ridge regression, i.e.
θ̂ = (NᵀN + λΣ_N)⁻¹ Nᵀ y   (6)
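The ridge-like solve in (6) can be illustrated with a discrete analogue rather than the exact natural-spline basis: a Whittaker-smoother sketch (an assumption of this example, not the lecture's method) replaces the integral penalty with squared second differences of the fitted values.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# Discrete analogue of the smoothing-spline solve:
# f_hat = (I + lam * D'D)^{-1} y, where D takes second differences
n = y.size
D = np.diff(np.eye(n), n=2, axis=0)   # (n-2) x n second-difference matrix

def smooth(lam):
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

def roughness(f):
    return np.sum(np.diff(f, n=2) ** 2)

f_small, f_big = smooth(0.1), smooth(1000.0)
```

At λ = 0 the "fit" reproduces y exactly (interpolation), and as λ grows the penalty forces the second differences toward zero, i.e. toward a linear fit, mirroring the two limiting cases above.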
Natural cubic splines vs. Smoothing splines
Natural cubic splines:
I Fix K ≤ n and choose the locations of the K knots at quantiles of X.
I Find the natural cubic spline f̂ which minimizes the RSS: ∑_{i=1}^{n} (y_i − f(x_i))²
I Choose K by cross-validation.
Smoothing splines:
I Put n knots at x_1, . . . , x_n.
I Obtain the fitted values which minimize: ∑_{i=1}^{n} (y_i − f(x_i))² + λ ∫ f''(x)² dx
I Choose λ by cross-validation.
LOOCV
Similar to linear regression, there’s a shortcut for LOOCV:
CV(n) = (1/n) ∑_{i=1}^{n} (y_i − f̂_λ^{(−i)}(x_i))² = (1/n) ∑_{i=1}^{n} [ (y_i − f̂_λ(x_i)) / (1 − {S_λ}_{ii}) ]²   (7)
for an n × n smoother matrix S_λ (the matrix with f̂_λ = S_λ y)
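The shortcut can be checked numerically for any linear smoother. Here is a sketch using OLS on a polynomial basis (illustrative names), where the identity in (7) holds exactly: the shortcut matches a brute-force leave-one-out loop.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
x = rng.uniform(size=n)
X = np.column_stack([np.ones(n), x, x**2])
y = np.sin(3 * x) + rng.normal(scale=0.2, size=n)

S = X @ np.linalg.solve(X.T @ X, X.T)   # smoother ("hat") matrix: yhat = S y
yhat = S @ y

# Shortcut: LOOCV residual_i = (y_i - yhat_i) / (1 - S_ii)
cv_shortcut = np.mean(((y - yhat) / (1 - np.diag(S))) ** 2)

# Brute force: actually refit n times, leaving one point out each time
errs = []
for i in range(n):
    keep = np.arange(n) != i
    b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    errs.append((y[i] - X[i] @ b) ** 2)
cv_brute = np.mean(errs)
```

The two numbers agree to machine precision, so LOOCV costs one fit instead of n.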
Smoothing splines
Example
20 30 40 50 60 70 80
050
100
200
300
Age
Wage
Smoothing Spline
16 Degrees of Freedom
6.8 Degrees of Freedom (LOOCV)
n.b. We’re measuring effective degrees of freedom
STATS 202: Data Mining and Analysis L. Tran 22/38
Kernel smoothing
Idea: Why not just use the subset of observations closest to the point we're predicting at?
I Observations averaged locally for predictions.
I Can use different weighting kernels, e.g.
K_λ(x_0, x) = D( |x − x_0| / h_λ(x_0) )   (8)
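One common choice of D is the Gaussian density. A minimal Nadaraya-Watson sketch (illustrative names; a fixed bandwidth h stands in for the adaptive h_λ(x_0) in (8)):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

def nw_smooth(x0, x, y, h):
    """Nadaraya-Watson estimate at x0: a kernel-weighted local average."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)   # Gaussian kernel weights K(x0, x)
    return np.sum(w * y) / np.sum(w)

grid = np.linspace(0, 1, 50)
fhat = np.array([nw_smooth(x0, x, y, h=0.05) for x0 in grid])
```

Every prediction is a weighted average of nearby responses, so the fit tracks the local mean of y; the bandwidth h plays the role of λ elsewhere in the lecture.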
Local linear regression
Idea: Kernel smoothing with regression.
[Figure: local regression fits on simulated data (two panels titled "Local Regression")]
e.g. Perform regression locally.
Local linear regression
To predict the regression function f at an input x :
1. Assign a weight Ki to the training point xi , such that:
I Ki = 0 unless xi is one of the k nearest neighbors of x .
I Ki decreases when the distance d(x , xi ) increases.
2. Perform a weighted least squares regression; i.e. find (β0, β1) which minimize
∑_{i=1}^{n} K_i (y_i − β_0 − β_1 x_i)².
3. Predict f̂ (x) = β̂0 + β̂1x .
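The three steps above can be sketched in numpy with a tricube kernel on the k nearest neighbors (an illustrative sketch, not the lecture's implementation; all names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

def local_linear(x0, x, y, span=0.3):
    """Weighted least squares line at x0, tricube weights on the k nearest points."""
    k = int(span * len(x))
    d = np.abs(x - x0)
    h = np.sort(d)[k - 1]                           # distance to k-th nearest neighbor
    w = np.clip(1 - (d / h) ** 3, 0, None) ** 3     # tricube kernel, 0 outside window
    X = np.column_stack([np.ones_like(x), x - x0])  # centered so beta0 = f_hat(x0)
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta[0]

grid = np.linspace(0, 1, 50)
fhat = np.array([local_linear(x0, x, y) for x0 in grid])
```

Centering the design at x0 makes the intercept of the weighted fit equal the prediction f̂(x0), which is step 3 of the recipe.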
Local linear regression
[Figure: local regression fits on simulated data (two panels titled "Local Regression")]
Some notes:
I The span k/n is the fraction of training samples used in each regression (the most important hyperparameter for the model)
I Sometimes referred to as a memory-based procedure
I Can mix this with linear models (e.g. global in some variables, but local in others)
I Performs poorly in high-dimensional spaces
Local linear regression
Example
[Figure: local linear regression fits of Wage on Age with span 0.2 (16.4 degrees of freedom) and span 0.7 (5.3 degrees of freedom)]
The span, k/n, is chosen by cross-validation.
Kernel smoothing vs Local linear regression
The main differences occur at the edges of the data.
Local polynomial regression
We don’t have to stick with local linear models, e.g.
Can deal with the issue of "trimming the hills" and "filling the valleys".
Generalized additive models
The extension of basis functions to multiple predictors (while maintaining additivity), e.g.
Linear model
wage = β0 + β1year + β2age + β3education + ε (9)
Additive model
wage = β0 + f1(year) + f2(age) + f3(education) + ε (10)
The functions f1, . . . , fp can be polynomials, natural splines, smoothing splines, local regressions, etc.
Fitting GAMs
If the functions fi have basis representations, we can use least squares
I Natural cubic splines
I Polynomials
I Step functions
Otherwise, we can use backfitting to fit our functions
I Smoothing splines
I Local regression
Backfitting a GAM
I Otherwise, we can use backfitting:
1. Keep f2, . . . , fp fixed, and fit f1 using the partial residuals:
yi − β0 − f2(xi2)− · · · − fp(xip),
as the response.
2. Keep f1, f3, . . . , fp fixed, and fit f2 using the partial residuals:
yi − β0 − f1(xi1)− f3(xi3)− · · · − fp(xip),
as the response.
3. ...
4. Iterate
I This works for smoothing splines and local regression.
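The backfitting loop above can be sketched with a toy polynomial "smoother" standing in for a smoothing spline or local regression (all names and the choice of smoother are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)
y = np.sin(np.pi * x1) + x2**2 + rng.normal(scale=0.2, size=n)

def poly_smooth(x, r, deg=5):
    """Toy smoother: regress the partial residuals r on a polynomial in x."""
    X = np.vander(x, deg + 1)
    fit = X @ np.linalg.lstsq(X, r, rcond=None)[0]
    return fit - fit.mean()   # center each f_j so beta0 absorbs the overall mean

beta0 = y.mean()
f1 = np.zeros(n)
f2 = np.zeros(n)
for _ in range(20):                          # backfitting iterations
    f1 = poly_smooth(x1, y - beta0 - f2)     # step 1: fit f1 to partial residuals
    f2 = poly_smooth(x2, y - beta0 - f1)     # step 2: fit f2 to partial residuals

resid = y - beta0 - f1 - f2
```

After a few passes the components stabilize: f1 recovers the sine term and f2 the quadratic term, each estimated while holding the other fixed.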
GAM properties
I GAMs are a step from linear regression toward a nonparametric method
I Mitigates the need to manually try out many different transformations on each variable individually
I We can report degrees of freedom for most non-linear functions (as a way of representing model complexity)
I The only constraint is additivity, which can be partially addressed through adding interaction variables, e.g. X_i X_j
I Similar to linear regression, we can examine the significance of each of the variables
Example: Regression for Wage
Using natural cubic splines:
[Figure: fitted functions f1(year), f2(age), and f3(education) from the GAM for Wage]
I year: natural spline with df=4.
I age: natural spline with df=5.
I education: step function.
Example: Regression for Wage
Using smoothing splines:
[Figure: fitted functions f1(year), f2(age), and f3(education) from the GAM for Wage]
I year: smoothing spline with df=4.
I age: smoothing spline with df=5.
I education: step function.
GAMs for Classification
Recall: We can use logistic regression for estimating our conditional probabilities.
GAMs extend naturally to this setting:
log[ P(Y = 1 | X) / P(Y = 0 | X) ] = β0 + f1(X1) + · · · + fp(Xp).
The fitting algorithm is a version of backfitting, but we won't discuss the details.
Example: Classification for Wage>250
[Figure: fitted functions f1(year), f2(age), and f3(education) from the GAM for classifying Wage > 250]
I year: linear
I age: smoothing spline with df=5
I education: step function
n.b. The confidence interval for <HS is very large
References
[1] ISL. Chapter 6.4-7
[2] ESL. Chapter 5