18-661 Introduction to Machine Learning
Linear Regression – II
Spring 2020
ECE – Carnegie Mellon University
-
Announcements

• Homework 1 due today.
• If you are not able to access Gradescope, the entry code is 9NEDVR.
• Python code will be graded for correctness, not efficiency.
• The class waitlist is almost clear now. Any questions about registration?
• The next few classes will be taught by Prof. Carlee Joe-Wong and broadcast from SV to Pittsburgh & Kigali.
-
Today’s Class: Practical Issues with Using Linear Regression and How to Address Them
-
Outline
1. Review of Linear Regression
2. Gradient Descent Methods
3. Feature Scaling
4. Ridge regression
5. Non-linear Basis Functions
6. Overfitting
-
Review of Linear Regression
-
Example: Predicting house prices
Sale price ≈ price per sqft × square footage + fixed expense
-
Minimize squared errors

Our model:

Sale price = price per sqft × square footage + fixed expense + unexplainable stuff

Training data:

sqft    sale price   prediction   error   squared error
2000    810K         720K         90K     90²
2100    907K         800K         107K    107²
1100    312K         350K         −38K    38²
5500    2,600K       2,600K       0       0
· · ·
Total: 90² + 107² + 38² + 0 + · · ·

Aim: adjust the price per sqft and the fixed expense such that the sum of the squared errors is minimized, i.e., the unexplainable stuff is minimized.
-
Linear regression

Setup:

• Input: x ∈ R^D (covariates, predictors, features, etc.)
• Output: y ∈ R (responses, targets, outcomes, outputs, etc.)
• Model: f : x → y, with f(x) = w_0 + \sum_{d=1}^{D} w_d x_d = w_0 + w^T x
• w = [w_1 w_2 · · · w_D]^T: weights, parameters, or parameter vector
• w_0 is called the bias
• Sometimes we also call w = [w_0 w_1 w_2 · · · w_D]^T the parameters
• Training data: D = {(x_n, y_n), n = 1, 2, . . . , N}

Minimize the residual sum of squares:

RSS(w) = \sum_{n=1}^{N} [y_n - f(x_n)]^2 = \sum_{n=1}^{N} \Big[ y_n - \big( w_0 + \sum_{d=1}^{D} w_d x_{nd} \big) \Big]^2
-
A simple case: x is just one-dimensional (D = 1)

Residual sum of squares:

RSS(w) = \sum_n [y_n - f(x_n)]^2 = \sum_n [y_n - (w_0 + w_1 x_n)]^2

Stationary points: take the derivative with respect to each parameter and set it to zero:

\frac{\partial RSS(w)}{\partial w_0} = 0 \Rightarrow -2 \sum_n [y_n - (w_0 + w_1 x_n)] = 0

\frac{\partial RSS(w)}{\partial w_1} = 0 \Rightarrow -2 \sum_n [y_n - (w_0 + w_1 x_n)] x_n = 0
-
A simple case: x is just one-dimensional (D = 1)

Simplify the stationary-point conditions to get the “normal equations”:

\sum_n y_n = N w_0 + w_1 \sum_n x_n

\sum_n x_n y_n = w_0 \sum_n x_n + w_1 \sum_n x_n^2

Solving the system, we obtain the least-squares coefficient estimates:

w_1 = \frac{\sum_n (x_n - \bar{x})(y_n - \bar{y})}{\sum_n (x_n - \bar{x})^2} \quad \text{and} \quad w_0 = \bar{y} - w_1 \bar{x}

where \bar{x} = \frac{1}{N} \sum_n x_n and \bar{y} = \frac{1}{N} \sum_n y_n.
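As a sanity check, the one-dimensional estimates can be computed directly in NumPy; a minimal sketch using the four-house example that appears later in the deck (sqft in 1000’s, sale price in $100k):

```python
import numpy as np

# Closed-form least-squares estimates for D = 1, straight from the
# normal-equation solution above:
#   w1 = sum((x_n - xbar)(y_n - ybar)) / sum((x_n - xbar)^2)
#   w0 = ybar - w1 * xbar
def simple_least_squares(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1

# The running house-price example: sqft in 1000's, sale price in $100k
x = [1.0, 2.0, 1.5, 2.5]
y = [2.0, 3.5, 3.0, 4.5]
w0, w1 = simple_least_squares(x, y)
print(round(w0, 4), round(w1, 4))  # 0.45 1.6
```

The function name is mine, not from the slides; the coefficients match the least-squares example later in the deck.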
-
Least Mean Squares when x is D-dimensional

RSS(w) in matrix form:

RSS(w) = \sum_n \Big[ y_n - \big( w_0 + \sum_d w_d x_{nd} \big) \Big]^2 = \sum_n [y_n - w^T x_n]^2,

where we have redefined some variables (by augmenting):

x ← [1 \; x_1 \; x_2 \; \ldots \; x_D]^T, \quad w ← [w_0 \; w_1 \; w_2 \; \ldots \; w_D]^T

Design matrix and target vector:

X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} \in R^{N \times (D+1)}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \in R^N

Compact expression:

RSS(w) = \|Xw - y\|_2^2 = \big\{ w^T X^T X w - 2 (X^T y)^T w \big\} + const
9
-
Example: RSS(w) in compact form

sqft (1000’s)   bedrooms   bathrooms   sale price (100k)
1               2          1           2
2               2          2           3.5
1.5             3          2           3
2.5             4          2.5         4.5

Design matrix and target vector:

X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} = \begin{bmatrix} 1 & 1 & 2 & 1 \\ 1 & 2 & 2 & 2 \\ 1 & 1.5 & 3 & 2 \\ 1 & 2.5 & 4 & 2.5 \end{bmatrix}, \quad y = \begin{bmatrix} 2 \\ 3.5 \\ 3 \\ 4.5 \end{bmatrix}

Compact expression:

RSS(w) = \|Xw - y\|_2^2 = \big\{ w^T X^T X w - 2 (X^T y)^T w \big\} + const
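The design matrix for this table and the compact RSS evaluate directly in code; a small sketch (the all-zero weight vector is an illustrative baseline, not a fitted solution):

```python
import numpy as np

# Design matrix from the table: bias column of ones, then sqft (1000's),
# bedrooms, bathrooms; target vector is the sale price in $100k.
X = np.array([[1, 1.0, 2, 1.0],
              [1, 2.0, 2, 2.0],
              [1, 1.5, 3, 2.0],
              [1, 2.5, 4, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])

def rss(w, X, y):
    r = X @ w - y        # residual vector Xw - y
    return float(r @ r)  # squared L2 norm ||Xw - y||_2^2

w = np.zeros(4)          # all-zero weights as a baseline
print(rss(w, X, y))      # 45.5, i.e., the sum of y_n^2
```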
-
Three Optimization Methods

We want to minimize

RSS(w) = \|Xw - y\|_2^2 = \big\{ w^T X^T X w - 2 (X^T y)^T w \big\} + const

• Least-squares solution: take the derivative and set it to zero
• Batch gradient descent
• Stochastic gradient descent
-
Least-Squares Solution

Compact expression:

RSS(w) = \|Xw - y\|_2^2 = \big\{ w^T X^T X w - 2 (X^T y)^T w \big\} + const

Gradients of linear and quadratic functions:

• \nabla_x (b^T x) = b
• \nabla_x (x^T A x) = 2 A x (for symmetric A)

Normal equation:

\nabla_w RSS(w) = 2 X^T X w - 2 X^T y = 0

This leads to the least-mean-squares (LMS) solution:

w_{LMS} = (X^T X)^{-1} X^T y
-
Gradient Descent Methods
-
Computational complexity

What is the bottleneck of computing the solution w = (X^T X)^{-1} X^T y? How many operations do we need?

• O(ND^2) for the matrix multiplication X^T X
• O(D^3) (e.g., using Gauss-Jordan elimination) or O(D^{2.373}) (recent theoretical advances) for the matrix inversion of X^T X
• O(ND) for the matrix multiplication X^T y
• O(D^2) for multiplying (X^T X)^{-1} by X^T y

Total: O(ND^2) + O(D^3). Impractical for very large D or N.
-
Alternative method: Batch Gradient Descent

(Batch) gradient descent:

• Initialize w to w^{(0)} (e.g., randomly); set t = 0; choose η > 0
• Loop until convergence:
  1. Compute the gradient: \nabla RSS(w) = X^T (X w^{(t)} - y)
  2. Update the parameters: w^{(t+1)} = w^{(t)} - η \nabla RSS(w)
  3. t ← t + 1

What is the complexity of each iteration? O(ND)
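The loop above, sketched in NumPy on the running example; the learning rate and iteration count are illustrative choices for this small, well-scaled dataset:

```python
import numpy as np

# Batch gradient descent on RSS, using the update
#   w^(t+1) = w^(t) - eta * X^T (X w^(t) - y)
X = np.array([[1, 1.0],
              [1, 2.0],
              [1, 1.5],
              [1, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])

eta = 0.1            # learning rate (illustrative)
w = np.zeros(2)      # initialization w^(0)
for t in range(2000):
    grad = X.T @ (X @ w - y)  # full-batch gradient: O(ND) per iteration
    w = w - eta * grad

print(np.round(w, 3))  # converges to the least-squares solution [0.45, 1.6]
```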
-
Why would this work?

If gradient descent converges, it will converge to the same solution as using matrix inversion, because RSS(w) is a convex function of its parameters w.

Hessian of RSS:

RSS(w) = w^T X^T X w - 2 (X^T y)^T w + const \;\Rightarrow\; \frac{\partial^2 RSS(w)}{\partial w \, \partial w^T} = 2 X^T X

X^T X is positive semidefinite, because for any v,

v^T X^T X v = \|X v\|_2^2 \geq 0
-
Stochastic gradient descent (SGD)

Widrow-Hoff rule: update the parameters using one example at a time.

• Initialize w to some w^{(0)}; set t = 0; choose η > 0
• Loop until convergence:
  1. Randomly choose a training sample x_t
  2. Compute its contribution to the gradient: g_t = (x_t^T w^{(t)} - y_t) x_t
  3. Update the parameters: w^{(t+1)} = w^{(t)} - η g_t
  4. t ← t + 1

How does the complexity per iteration compare with gradient descent? O(ND) for gradient descent versus O(D) for SGD.
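A minimal SGD sketch on the same data; the step size and iteration count are illustrative, and with a fixed step the iterates hover noisily near the minimum rather than converging exactly:

```python
import numpy as np

# Stochastic gradient descent (Widrow-Hoff rule): each step uses the
# gradient contribution of one randomly chosen sample, an O(D) update.
rng = np.random.default_rng(0)
X = np.array([[1, 1.0],
              [1, 2.0],
              [1, 1.5],
              [1, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])

eta = 0.05
w = np.zeros(2)
rss_init = float(np.sum((X @ w - y) ** 2))
for t in range(5000):
    n = rng.integers(len(y))        # randomly choose a training sample
    g = (X[n] @ w - y[n]) * X[n]    # single-sample gradient g_t
    w = w - eta * g

rss_final = float(np.sum((X @ w - y) ** 2))
print(rss_final < rss_init)  # True: the RSS has dropped substantially
```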
-
SGD versus Batch GD

• SGD reduces the per-iteration complexity from O(ND) to O(D)
• But it is noisier and can take longer to converge
-
Example: Comparing the Three Methods

sqft (1000’s)   sale price (100k)
1               2
2               3.5
1.5             3
2.5             4.5

[Figure: scatter plot of price ($) versus house size; which regression line fits the data?]
-
Example: Least Squares Solution

sqft (1000’s)   sale price (100k)
1               2
2               3.5
1.5             3
2.5             4.5

The w_0 and w_1 that minimize the RSS are given by w_{LMS} = (X^T X)^{-1} X^T y:

\begin{bmatrix} w_0 \\ w_1 \end{bmatrix} = \left( \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 1.5 & 2.5 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 1.5 \\ 1 & 2.5 \end{bmatrix} \right)^{-1} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 1.5 & 2.5 \end{bmatrix} \begin{bmatrix} 2 \\ 3.5 \\ 3 \\ 4.5 \end{bmatrix} = \begin{bmatrix} 0.45 \\ 1.6 \end{bmatrix}

The minimum RSS is RSS^* = \|X w_{LMS} - y\|_2^2 = 0.05 (equivalently, \|X w_{LMS} - y\|_2 = 0.2236).
-
Example: Batch Gradient Descent

sqft (1000’s)   sale price (100k)
1               2
2               3.5
1.5             3
2.5             4.5

w^{(t+1)} = w^{(t)} - η \nabla RSS(w^{(t)}) = w^{(t)} - η X^T (X w^{(t)} - y)

[Figure: RSS value versus number of iterations, η = 0.01]
-
Larger η gives faster convergence

sqft (1000’s)   sale price (100k)
1               2
2               3.5
1.5             3
2.5             4.5

w^{(t+1)} = w^{(t)} - η \nabla RSS(w^{(t)}) = w^{(t)} - η X^T (X w^{(t)} - y)

[Figure: RSS value versus number of iterations, η = 0.01 and η = 0.1; the larger rate converges faster]
-
But too large η makes GD unstable

sqft (1000’s)   sale price (100k)
1               2
2               3.5
1.5             3
2.5             4.5

w^{(t+1)} = w^{(t)} - η \nabla RSS(w^{(t)}) = w^{(t)} - η X^T (X w^{(t)} - y)

[Figure: RSS value versus number of iterations, η = 0.01, η = 0.1, and η = 0.12; the largest rate diverges]
-
Example: Stochastic Gradient Descent

sqft (1000’s)   sale price (100k)
1               2
2               3.5
1.5             3
2.5             4.5

w^{(t+1)} = w^{(t)} - η g_t = w^{(t)} - η (x_t^T w^{(t)} - y_t) x_t

[Figure: RSS value versus number of iterations, η = 0.05]
-
Larger η gives faster convergence

sqft (1000’s)   sale price (100k)
1               2
2               3.5
1.5             3
2.5             4.5

w^{(t+1)} = w^{(t)} - η g_t = w^{(t)} - η (x_t^T w^{(t)} - y_t) x_t

[Figure: RSS value versus number of iterations, η = 0.05 and η = 0.1; the larger rate converges faster]
-
But too large η makes SGD unstable

sqft (1000’s)   sale price (100k)
1               2
2               3.5
1.5             3
2.5             4.5

w^{(t+1)} = w^{(t)} - η g_t = w^{(t)} - η (x_t^T w^{(t)} - y_t) x_t

[Figure: RSS value versus number of iterations, η = 0.05, η = 0.1, and η = 0.25; the largest rate is unstable]
-
How to Choose Learning Rate η in practice?

• Try 0.0001, 0.001, 0.01, 0.1, etc. on a validation dataset (more on this later) and choose the one that gives the fastest, stable convergence
• Reduce η by a constant factor (e.g., 10) when learning saturates, so that we can get closer to the true minimum
• More advanced learning-rate schedules such as AdaGrad, Adam, and AdaDelta are used in practice
-
Summary of Gradient Descent Methods

• Batch gradient descent computes the exact gradient
• Stochastic gradient descent approximates the gradient with a single data point; its expectation equals the true gradient
• Mini-batch variant: set the batch size to trade off between the accuracy of the gradient estimate and the computational cost
• Similar ideas extend to other ML optimization problems
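The mini-batch variant sits between the two extremes; a sketch with batch size 2 on the running example (the batch size, step size, and iteration count are illustrative choices, not from the slides):

```python
import numpy as np

# Mini-batch gradient descent: each step uses a random subset of the data,
# trading gradient accuracy (larger batches) against per-step cost (smaller).
rng = np.random.default_rng(0)
X = np.array([[1, 1.0],
              [1, 2.0],
              [1, 1.5],
              [1, 2.5]])
y = np.array([2.0, 3.5, 3.0, 4.5])

batch_size, eta = 2, 0.05
w = np.zeros(2)
for t in range(5000):
    idx = rng.choice(len(y), size=batch_size, replace=False)
    grad = X[idx].T @ (X[idx] @ w - y[idx])  # gradient over the mini-batch
    w = w - eta * grad

rss = float(np.sum((X @ w - y) ** 2))
print(rss)  # small: the iterates hover near the least-squares minimum
```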
-
Feature Scaling
-
Batch Gradient Descent: Scaled Features

sqft (1000’s)   sale price (100k)
1               2
2               3.5
1.5             3
2.5             4.5

w^{(t+1)} = w^{(t)} - η \nabla RSS(w^{(t)}) = w^{(t)} - η X^T (X w^{(t)} - y)

[Figure: RSS value versus number of iterations, η = 0.01 and η = 0.1]
-
Batch Gradient Descent: Without Feature Scaling

sqft    sale price
1000    200,000
2000    350,000
1500    300,000
2500    450,000

• The least-squares solution is (w_0^*, w_1^*) = (45000, 160)
• \nabla RSS(w^{(t)}) = X^T (X w^{(t)} - y) becomes huge, causing instability
• We need a tiny η to compensate, but this leads to slow convergence

[Figure: RSS value versus number of iterations, η = 0.0000001]
-
Batch Gradient Descent: Without Feature Scaling

sqft    sale price
1000    200,000
2000    350,000
1500    300,000
2500    450,000

• The least-squares solution is (w_0^*, w_1^*) = (45000, 160)
• \nabla RSS(w) becomes huge, causing instability
• We need a tiny η to compensate, but this leads to slow convergence

[Figure: RSS value (scale ×10^110) versus number of iterations, η = 0.0000001 and η = 0.00001; the larger rate diverges]
-
How to Scale Features?

• Min-max normalization:

x'_d = \frac{x_d - \min_n x_d}{\max_n x_d - \min_n x_d}

The min and max are taken over the values x_d^{(1)}, \ldots, x_d^{(N)} of x_d in the dataset. All scaled features satisfy 0 ≤ x'_d ≤ 1.

• Mean normalization:

x'_d = \frac{x_d - \mathrm{avg}(x_d)}{\max_n x_d - \min_n x_d}

All scaled features satisfy -1 ≤ x'_d ≤ 1.

Several other methods exist, e.g., dividing by the standard deviation (Z-score normalization). The labels y^{(1)}, \ldots, y^{(N)} should be similarly re-scaled.
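Both normalizations are one-liners in NumPy; a sketch applied column-wise to a feature matrix (the helper names are mine, not from the slides):

```python
import numpy as np

def min_max_scale(X):
    """Min-max normalization: each column is mapped into [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def mean_normalize(X):
    """Mean normalization: each column is mapped into [-1, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - X.mean(axis=0)) / (hi - lo)

# The unscaled sqft feature from the slides above
sqft = np.array([[1000.0], [2000.0], [1500.0], [2500.0]])
print(min_max_scale(sqft).ravel())   # approximately [0, 0.667, 0.333, 1]
print(mean_normalize(sqft).ravel())  # centered on 0, within [-1, 1]
```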
-
Ridge regression
-
What if X^T X is not invertible?

w_{LMS} = (X^T X)^{-1} X^T y

Why might this happen?

• Answer 1: N < D. There is not enough data to estimate all the parameters, so X^T X is not full-rank.
• Answer 2: The columns of X are not linearly independent, e.g., some features are linear functions of other features. In this case the solution is not unique. Examples:
  • A feature is a re-scaled version of another, for example two features giving length in meters and in feet, respectively
  • The same feature is repeated twice, which can happen when there are many features
  • A feature has the same value for all data points
  • The sum of two features equals a third feature
-
Example: Matrix X^T X is not invertible

sqft (1000’s)   bathrooms   sale price (100k)
1               2           2
2               2           3.5
1.5             2           3
2.5             2           4.5

Design matrix, weights, and target vector:

X = \begin{bmatrix} 1 & 1 & 2 \\ 1 & 2 & 2 \\ 1 & 1.5 & 2 \\ 1 & 2.5 & 2 \end{bmatrix}, \quad w = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}, \quad y = \begin{bmatrix} 2 \\ 3.5 \\ 3 \\ 4.5 \end{bmatrix}

The ‘bathrooms’ feature is redundant, so we don’t need w_2:

y = w_0 + w_1 x_1 + w_2 x_2
  = w_0 + w_1 x_1 + 2 w_2, since x_2 is always 2!
  = w_{0,eff} + w_1 x_1, where w_{0,eff} = w_0 + 2 w_2
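The rank deficiency can be confirmed numerically; a small sketch with the design matrix above:

```python
import numpy as np

# The 'bathrooms' column is exactly 2x the bias column, so X has rank 2
# and X^T X (a 3x3 matrix) is singular: the normal equation has no
# unique solution.
X = np.array([[1, 1.0, 2],
              [1, 2.0, 2],
              [1, 1.5, 2],
              [1, 2.5, 2]])

A = X.T @ X
print(np.linalg.matrix_rank(A))  # 2, not the full rank of 3
```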
-
What does the RSS loss function look like?

• When X^T X is not invertible, the RSS objective function has a ridge; that is, the minimum is a line instead of a single point.

In our example, this line is w_{0,eff} = w_0 + 2 w_2.
-
How do you fix this issue?

sqft (1000’s)   bathrooms   sale price (100k)
1               2           2
2               2           3.5
1.5             2           3
2.5             2           4.5

• Manually remove redundant features
• But this can be tedious and non-trivial, especially when a feature is a linear combination of several other features

We need a general approach that doesn’t require manual feature engineering.

SOLUTION: Ridge Regression
-
Ridge regression

Intuition: what does a non-invertible X^T X mean? Consider the SVD of this matrix:

X^T X = V \, \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_r, 0, \ldots, 0) \, V^T

where \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_r > 0 and r < D. We will have a divide-by-zero issue when computing (X^T X)^{-1}.

Fix the problem by ensuring all singular values are non-zero:

X^T X + \lambda I = V \, \mathrm{diag}(\lambda_1 + \lambda, \lambda_2 + \lambda, \ldots, \lambda) \, V^T

where \lambda > 0 and I is the identity matrix.
-
Regularized least squares (ridge regression)

Solution:

w = (X^T X + \lambda I)^{-1} X^T y

This is equivalent to adding an extra term to RSS(w):

\underbrace{\frac{1}{2} \big\{ w^T X^T X w - 2 (X^T y)^T w \big\}}_{RSS(w)} + \underbrace{\frac{1}{2} \lambda \|w\|_2^2}_{\text{regularization}}

Benefits:

• Numerically more stable: the matrix is invertible
• Forces w to be small
• Prevents overfitting (more on this later)
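The ridge solution is again a single linear solve; a sketch on the redundant-feature example, which reproduces the deck’s λ = 0.5 numbers:

```python
import numpy as np

# Ridge regression closed form: w = (X^T X + lambda*I)^{-1} X^T y.
# The solve succeeds even though X^T X itself is singular here
# (the 'bathrooms' feature is constant).
X = np.array([[1, 1.0, 2],
              [1, 2.0, 2],
              [1, 1.5, 2],
              [1, 2.5, 2]])
y = np.array([2.0, 3.5, 3.0, 4.5])

lam = 0.5
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(np.round(w_ridge, 3))  # approximately [0.208, 1.247, 0.417]
```

Note that, as on the slide, the identity matrix here also regularizes the bias term w_0.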
-
Applying this to our example

sqft (1000’s)   bathrooms   sale price (100k)
1               2           2
2               2           3.5
1.5             2           3
2.5             2           4.5

The ‘bathrooms’ feature is redundant, so we don’t need w_2:

y = w_0 + w_1 x_1 + w_2 x_2
  = w_0 + w_1 x_1 + 2 w_2, since x_2 is always 2!
  = w_{0,eff} + w_1 x_1, where w_{0,eff} = w_0 + 2 w_2
  = 0.45 + 1.6 x_1 (the solution we should recover)

Compute the ridge solution for \lambda = 0.5:

\begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix} = (X^T X + \lambda I)^{-1} X^T y = \begin{bmatrix} 0.208 \\ 1.247 \\ 0.4166 \end{bmatrix}
-
How does λ affect the solution?

\begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix} = (X^T X + \lambda I)^{-1} X^T y

Let us plot w_{0,eff} = w_0 + 2 w_2 and w_1 for λ ∈ [0.01, 20]:

[Figure: parameter values w_{0,eff} and w_1 versus the hyperparameter λ]

Setting a small λ gives almost the least-squares solution, but it can cause numerical instability in the inversion.
-
How to choose λ?

λ is referred to as a hyperparameter:

• It is associated with the estimation method, not the dataset
• In contrast, w is the parameter vector
• Use a validation set or cross-validation to find a good choice of λ (more on this in the next lecture)

[Figure: parameter values w_{0,eff} and w_1 versus the hyperparameter λ]
-
Why is it called Ridge Regression?

• When X^T X is not invertible, the RSS objective function has a ridge; that is, the minimum is a line instead of a single point
• Adding the regularizer term \frac{1}{2} \lambda \|w\|_2^2 yields a unique minimum, thus avoiding instability in the matrix inversion
-
Probabilistic Interpretation of Ridge Regression
Add a term to the objective function.

• Choose the parameters not just to minimize the risk, but also to avoid
  being too large:

(1/2) { w^T X^T X w − 2 (X^T y)^T w } + (λ/2) ‖w‖₂²

Probabilistic interpretation: place a prior on our weights.

• Interpret w as a random variable
• Assume that each wd is centered around zero
• Use the observed data D to update our prior belief on w

Gaussian priors lead to ridge regression.
50
-
Review: Probabilistic interpretation of Linear Regression
Linear Regression model: Y = w^T X + η

η ∼ N(0, σ0²) is a Gaussian random variable, and Y ∼ N(w^T X, σ0²).

Frequentist interpretation: we assume that w is fixed.

• The likelihood function maps parameters to probabilities:

  L : (w, σ0²) ↦ p(D | w, σ0²) = p(y | X, w, σ0²) = Π_n p(yn | xn, w, σ0²)

• Maximizing the likelihood with respect to w minimizes the RSS and yields
  the LMS solution:

  w_LMS = w_ML = argmax_w L(w, σ0²)
51
-
Probabilistic interpretation of Ridge Regression
Ridge Regression model: Y = w^T X + η

• Y ∼ N(w^T X, σ0²) is a Gaussian random variable (as before)
• wd ∼ N(0, σ²) are i.i.d. Gaussian random variables (unlike before)
• Note that all wd share the same variance σ²
• To find w given data D, compute the posterior distribution of w:

  p(w | D) = p(D | w) p(w) / p(D)

• Maximum a posteriori (MAP) estimate:

  w_MAP = argmax_w p(w | D) = argmax_w p(D | w) p(w)
52
-
Estimating w
Let x1, . . . , xN be i.i.d. with y | w, x ∼ N(w^T x, σ0²) and wd ∼ N(0, σ²).

Joint likelihood of the data and parameters (given σ0, σ):

p(D, w) = p(D | w) p(w) = Π_n p(yn | xn, w) Π_d p(wd)

Plugging in the Gaussian PDF, we get:

log p(D, w) = Σ_n log p(yn | xn, w) + Σ_d log p(wd)
            = − Σ_n (w^T xn − yn)² / (2σ0²) − Σ_d wd² / (2σ²) + const

MAP estimate: w_MAP = argmax_w log p(D, w), i.e.,

w_MAP = argmin_w [ Σ_n (w^T xn − yn)² / (2σ0²) + ‖w‖₂² / (2σ²) ]
53
-
Maximum a posteriori (MAP) estimate

E(w) = Σ_n (w^T xn − yn)² + λ ‖w‖₂²,

where λ > 0 denotes σ0²/σ². The extra term ‖w‖₂² is called the
regularization term (regularizer) and controls the magnitude of w.

Intuitions

• If λ → +∞, then σ0² ≫ σ²: the variance of the noise is far greater than
  what our prior model can allow for w. In this case, our prior model on w
  will force w to be close to zero. Numerically,

  w_MAP → 0

• If λ → 0, then we trust our data more. Numerically,

  w_MAP → w_LMS = argmin_w Σ_n (w^T xn − yn)²
54
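As a sanity check, the objective E(w) can also be minimized directly with gradient descent (covered earlier in this lecture), rather than through the closed form. The sketch below is illustrative only, with an assumed step size and iteration count; on the housing data with λ = 0.5 it converges to the same solution as the (X^T X + λI)^(-1) X^T y formula.

```python
def ridge_gd(X, y, lam, lr=0.01, iters=20000):
    """Gradient descent on E(w) = sum_n (w.x_n - y_n)^2 + lam * ||w||^2."""
    d = len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        grad = [2.0 * lam * wj for wj in w]        # gradient of lam * ||w||^2
        for xn, yn in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xn)) - yn
            for j in range(d):
                grad[j] += 2.0 * err * xn[j]       # gradient of the squared error
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Housing table: columns are [bias, sqft (1000s), bathrooms].
X = [[1.0, 1.0, 2.0], [1.0, 2.0, 2.0], [1.0, 1.5, 2.0], [1.0, 2.5, 2.0]]
y = [2.0, 3.5, 3.0, 4.5]
w = ridge_gd(X, y, lam=0.5)
print([round(v, 3) for v in w])  # [0.208, 1.247, 0.417]
```

The step size must be small enough relative to the largest eigenvalue of 2(X^T X + λI) for the iteration to converge; 0.01 is a safe, assumed choice for this tiny dataset.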
-
Outline
1. Review of Linear Regression
2. Gradient Descent Methods
3. Feature Scaling
4. Ridge regression
5. Non-linear Basis Functions
6. Overfitting
55
-
Non-linear Basis Functions
-
Is a linear modeling assumption always a good idea?
Figure 1: Sale price can saturate as sq.footage increases
Figure 2: Temperature has cyclic variations over each year
57
-
General nonlinear basis functions
We can use a nonlinear mapping to a new feature vector:

φ(x) : x ∈ R^D → z ∈ R^M

• M is the dimensionality of the new features z (or φ(x))
• M could be greater than, less than, or equal to D

We can apply existing learning methods on the transformed data:

• linear methods: prediction is based on w^T φ(x)
• other methods: nearest neighbors, decision trees, etc.
58
-
Regression with nonlinear basis
Residual sum of squares:

Σ_n [w^T φ(xn) − yn]²,

where w ∈ R^M has the same dimensionality as the transformed features φ(x).

The LMS solution can be formulated with the new design matrix

Φ = [φ(x1)^T; φ(x2)^T; . . . ; φ(xN)^T] ∈ R^(N×M),   w_LMS = (Φ^T Φ)^(-1) Φ^T y
59
-
Example: Lot of Flexibility in Designing New Features!
x1, Area (1k sqft)   x1², Area²   Price (100k)
1                    1            2
2                    4            3.5
1.5                  2.25         3
2.5                  6.25         4.5

Figure 3: Add x1² as a feature to allow us to fit quadratic, instead of
linear, functions of the house area x1.
60
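The extra x1² column feeds straight into the least-squares machinery, w = (Φ^T Φ)^(-1) Φ^T y. A minimal pure-Python sketch (illustrative only; `solve`, `lstsq`, and `rss` are hypothetical helper names): since the linear model is nested inside the quadratic one, the quadratic feature map can only lower the residual sum of squares on the training data.

```python
def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

def lstsq(Phi, y):
    """Least squares via the normal equations: w = (Phi^T Phi)^(-1) Phi^T y."""
    d = len(Phi[0])
    A = [[sum(p[i] * p[j] for p in Phi) for j in range(d)] for i in range(d)]
    b = [sum(p[i] * yi for p, yi in zip(Phi, y)) for i in range(d)]
    return solve(A, b)

def rss(w, Phi, y):
    """Residual sum of squares of the fit w on (Phi, y)."""
    return sum((sum(wj * pj for wj, pj in zip(w, p)) - yi) ** 2
               for p, yi in zip(Phi, y))

x1 = [1.0, 2.0, 1.5, 2.5]
y = [2.0, 3.5, 3.0, 4.5]
Phi_lin = [[1.0, x] for x in x1]            # features [1, x1]
Phi_quad = [[1.0, x, x * x] for x in x1]    # features [1, x1, x1^2]
print(rss(lstsq(Phi_lin, y), Phi_lin, y))   # ≈ 0.05 for the linear fit
print(rss(lstsq(Phi_quad, y), Phi_quad, y))
```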
-
Example: Lot of Flexibility in Designing New Features!
x1, front (100 ft)   x2, depth (100 ft)   10·x1·x2, Lot (1k sqft)   Price (100k)
0.5                  0.5                  2.5                       2
0.5                  1                    5                         3.5
0.8                  1.5                  12                        3
1.0                  1.5                  15                        4.5

Figure 4: Instead of having frontage and depth as two separate features, it
may be better to consider the lot area, which is equal to frontage × depth.

61
-
Example with regression
Polynomial basis functions:

φ(x) = [1, x, x², . . . , x^M]^T  ⇒  f(x) = w0 + Σ_{m=1}^{M} wm x^m

Fitting samples from a sine function: underfitting, since f(x) is too simple.

[Plots: fits with M = 0 and M = 1 to the samples of t versus x on [0, 1]]
62
-
Adding high-order terms
M = 3:

[Plot: M = 3 polynomial fit to the sine samples]

M = 9: overfitting

[Plot: M = 9 polynomial fit to the sine samples]

More complex features lead to better results on the training data, but
potentially worse results on new data, e.g., test data!
63
-
Overfitting
-
Overfitting
Parameters for higher-order polynomials are very large
M = 0 M = 1 M = 3 M = 9
w0 0.19 0.82 0.31 0.35
w1 -1.27 7.99 232.37
w2 -25.43 -5321.83
w3 17.37 48568.31
w4 -231639.30
w5 640042.26
w6 -1061800.52
w7 1042400.18
w8 -557682.99
w9 125201.43
65
-
Overfitting can be quite disastrous
Fitting the housing price data with a large M: the predicted price goes to
zero (and is ultimately negative) if you buy a big enough house!

This is called poor generalization/overfitting.
66
-
Detecting overfitting
Plot model complexity versus the objective function:

• X axis: model complexity, e.g., M
• Y axis: error, e.g., RSS, RMS (square root of RSS), 0-1 loss

Compute the objective on a training and a test dataset.

[Plot: E_RMS versus M for the training and test sets]

As a model increases in complexity:

• Training error keeps improving
• Test error may first improve but eventually will deteriorate
67
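The training-error half of this curve is easy to reproduce. The sketch below (pure Python, illustrative only; `solve`, `fit_poly`, and `rms` are hypothetical helper names) fits polynomial bases of increasing M to noiseless sine samples and reports RMS error on the training points and on held-out points between them. Degrees are kept at or below 5 so the plain normal equations stay numerically well behaved; with noisy targets and M near N − 1, the test error would blow up as in the slide's plot.

```python
import math

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

def fit_poly(xs, ys, M):
    """Least-squares fit of a degree-M polynomial via the normal equations."""
    Phi = [[x ** m for m in range(M + 1)] for x in xs]
    d = M + 1
    A = [[sum(p[i] * p[j] for p in Phi) for j in range(d)] for i in range(d)]
    b = [sum(p[i] * yi for p, yi in zip(Phi, ys)) for i in range(d)]
    return solve(A, b)

def rms(w, xs, ys):
    """Root-mean-square error of the polynomial w on (xs, ys)."""
    se = sum((sum(wm * x ** m for m, wm in enumerate(w)) - yv) ** 2
             for x, yv in zip(xs, ys))
    return math.sqrt(se / len(xs))

x_train = [i / 9 for i in range(10)]                 # 10 points on [0, 1]
y_train = [math.sin(2 * math.pi * x) for x in x_train]
x_test = [(i + 0.5) / 9 for i in range(9)]           # held-out points in between
y_test = [math.sin(2 * math.pi * x) for x in x_test]
for M in [0, 1, 3, 5]:
    w = fit_poly(x_train, y_train, M)
    print(M, round(rms(w, x_train, y_train), 4), round(rms(w, x_test, y_test), 4))
```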
-
Dealing with overfitting
Try to use more training data:

[Plots: M = 9 fits with the original training samples, N = 15, and N = 100]

What if we do not have a lot of data?
68
-
Regularization methods
Intuition: give preference to ‘simpler’ models.

• How do we define a simple linear regression model w^T x?
• Intuitively, the weights should not be “too large”
M = 0 M = 1 M = 3 M = 9
w0 0.19 0.82 0.31 0.35
w1 -1.27 7.99 232.37
w2 -25.43 -5321.83
w3 17.37 48568.31
w4 -231639.30
w5 640042.26
w6 -1061800.52
w7 1042400.18
w8 -557682.99
w9 125201.43
69
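The regularizer from the earlier slides does exactly this: adding λ‖w‖₂² to the objective shrinks the fitted weights. A small pure-Python check (illustrative only; `solve` and `ridge_fit` are hypothetical helper names) compares weight norms for a polynomial fit to sine samples with λ = 0 versus λ = 0.1. Degree 5 is used rather than 9 to keep the normal equations well conditioned, and the bias is regularized along with the other weights, matching the slides' formulation.

```python
import math

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

def ridge_fit(xs, ys, M, lam):
    """w = (Phi^T Phi + lam*I)^(-1) Phi^T y for polynomial features."""
    Phi = [[x ** m for m in range(M + 1)] for x in xs]
    d = M + 1
    A = [[sum(p[i] * p[j] for p in Phi) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(p[i] * yv for p, yv in zip(Phi, ys)) for i in range(d)]
    return solve(A, b)

def norm(w):
    return math.sqrt(sum(v * v for v in w))

xs = [i / 9 for i in range(10)]
ys = [math.sin(2 * math.pi * x) for x in xs]
w_ls = ridge_fit(xs, ys, 5, 0.0)   # plain least squares
w_rr = ridge_fit(xs, ys, 5, 0.1)   # ridge-regularized fit
print(round(norm(w_ls), 2), round(norm(w_rr), 2))  # ridge norm is smaller
```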
-
Next Class: Overfitting, Regularization and
the Bias-variance trade-off
69