
18-661 Introduction to Machine Learning Linear Regression – II Spring 2020 ECE – Carnegie Mellon University


  • 18-661 Introduction to Machine Learning

    Linear Regression – II

    Spring 2020

    ECE – Carnegie Mellon University

  • Announcements

    • Homework 1 is due today.
    • If you are not able to access Gradescope, the entry code is 9NEDVR.
    • Python code will be graded for correctness, not efficiency.
    • The class waitlist is almost clear now. Any questions about registration?
    • The next few classes will be taught by Prof. Carlee Joe-Wong and broadcast from SV to Pittsburgh & Kigali.

    1


  • Today’s Class: Practical Issues with Using Linear Regression and How to Address Them

    1

  • Outline

    1. Review of Linear Regression

    2. Gradient Descent Methods

    3. Feature Scaling

    4. Ridge regression

    5. Non-linear Basis Functions

    6. Overfitting

    2

  • Review of Linear Regression

  • Example: Predicting house prices

    Sale price ≈ price per sqft × square footage + fixed expense

    3

  • Minimize squared errors

    Our model:

    Sale price = price per sqft × square footage + fixed expense + unexplainable stuff

    Training data:

    sqft    sale price    prediction    error    squared error
    2000    810K          720K          90K      90² = 8100
    2100    907K          800K          107K     107² = 11449
    1100    312K          350K          −38K     38² = 1444
    5500    2,600K        2,600K        0        0
    · · ·

    Total: 8100 + 11449 + 1444 + 0 + · · ·

    Aim: adjust the price per sqft and the fixed expense such that the sum of the squared errors is minimized — i.e., the unexplainable stuff is minimized.

    4
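    As a quick check of the arithmetic above, here is a minimal Python sketch; the per-row predictions are the illustrative values from the table, not the output of a fitted model.

        # Sale prices and the table's example predictions, in $1000s
        sale_prices = [810, 907, 312, 2600]
        predictions = [720, 800, 350, 2600]

        squared_errors = [(y - p) ** 2 for y, p in zip(sale_prices, predictions)]
        print(squared_errors)       # [8100, 11449, 1444, 0]
        print(sum(squared_errors))  # 20993 -- the quantity we want to minimize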

  • Linear regression

    Setup:

    • Input: x ∈ ℝ^D (covariates, predictors, features, etc.)
    • Output: y ∈ ℝ (responses, targets, outcomes, outputs, etc.)
    • Model: f : x → y, with f(x) = w_0 + ∑_{d=1}^D w_d x_d = w_0 + w⊤x
    • w = [w_1 w_2 · · · w_D]⊤: weights, parameters, or parameter vector
    • w_0 is called the bias
    • Sometimes we also call w = [w_0 w_1 w_2 · · · w_D]⊤ the parameters
    • Training data: D = {(x_n, y_n), n = 1, 2, . . . , N}

    Minimize the residual sum of squares:

    RSS(w) = ∑_{n=1}^N [y_n − f(x_n)]² = ∑_{n=1}^N [y_n − (w_0 + ∑_{d=1}^D w_d x_{nd})]²

    5

  • A simple case: x is just one-dimensional (D = 1)

    Residual sum of squares:

    RSS(w) = ∑_n [y_n − f(x_n)]² = ∑_n [y_n − (w_0 + w_1 x_n)]²

    6

  • A simple case: x is just one-dimensional (D = 1)

    Residual sum of squares:

    RSS(w) = ∑_n [y_n − f(x_n)]² = ∑_n [y_n − (w_0 + w_1 x_n)]²

    Stationary points: take the derivative with respect to each parameter and set it to zero.

    ∂RSS(w)/∂w_0 = 0 ⇒ −2 ∑_n [y_n − (w_0 + w_1 x_n)] = 0,
    ∂RSS(w)/∂w_1 = 0 ⇒ −2 ∑_n [y_n − (w_0 + w_1 x_n)] x_n = 0.

    7


  • A simple case: x is just one-dimensional (D = 1)

    ∂RSS(w)/∂w_0 = 0 ⇒ −2 ∑_n [y_n − (w_0 + w_1 x_n)] = 0
    ∂RSS(w)/∂w_1 = 0 ⇒ −2 ∑_n [y_n − (w_0 + w_1 x_n)] x_n = 0

    Simplify these expressions to get the “Normal Equations”:

    ∑ y_n = N w_0 + w_1 ∑ x_n
    ∑ x_n y_n = w_0 ∑ x_n + w_1 ∑ x_n²

    Solving the system, we obtain the least-squares coefficient estimates:

    w_1 = ∑_n (x_n − x̄)(y_n − ȳ) / ∑_n (x_n − x̄)²   and   w_0 = ȳ − w_1 x̄

    where x̄ = (1/N) ∑_n x_n and ȳ = (1/N) ∑_n y_n.

    8
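    A minimal NumPy sketch of these closed-form estimates (illustrative only, not homework code); the example data are the house sizes and prices used later in the lecture.

        import numpy as np

        def fit_simple_least_squares(x, y):
            """Return (w0, w1) minimizing sum_n [y_n - (w0 + w1 * x_n)]^2."""
            x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
            x_bar, y_bar = x.mean(), y.mean()
            w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
            w0 = y_bar - w1 * x_bar
            return w0, w1

        # sqft (1000's) vs. sale price (100k)
        w0, w1 = fit_simple_least_squares([1, 2, 1.5, 2.5], [2, 3.5, 3, 4.5])
        print(w0, w1)  # approximately 0.45 and 1.6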


  • Least Mean Squares when x is D-dimensional

    RSS(w) in matrix form:

    RSS(w) = ∑_n [y_n − (w_0 + ∑_d w_d x_{nd})]² = ∑_n [y_n − w⊤x_n]²,

    where we have redefined some variables (by augmenting):

    x ← [1 x_1 x_2 . . . x_D]⊤,   w ← [w_0 w_1 w_2 . . . w_D]⊤

    Design matrix and target vector (rows of X are the augmented data points):

    X = [x_1⊤; x_2⊤; . . . ; x_N⊤] ∈ ℝ^{N×(D+1)},   y = [y_1; y_2; . . . ; y_N] ∈ ℝ^N

    Compact expression:

    RSS(w) = ‖Xw − y‖₂² = {w⊤X⊤Xw − 2(X⊤y)⊤w} + const

    9


  • Example: RSS(w) in compact form

    sqft (1000's)    bedrooms    bathrooms    sale price (100k)
    1                2           1            2
    2                2           2            3.5
    1.5              3           2            3
    2.5              4           2.5          4.5

    Design matrix and target vector:

    X = [x_1⊤; x_2⊤; . . . ; x_N⊤] ∈ ℝ^{N×(D+1)},   y = [y_1; y_2; . . . ; y_N] ∈ ℝ^N

    Compact expression:

    RSS(w) = ‖Xw − y‖₂² = {w⊤X⊤Xw − 2(X⊤y)⊤w} + const

    10

  • Example: RSS(w) in compact form

    sqft (1000's)    bedrooms    bathrooms    sale price (100k)
    1                2           1            2
    2                2           2            3.5
    1.5              3           2            3
    2.5              4           2.5          4.5

    Design matrix and target vector:

    X = [1 1 2 1; 1 2 2 2; 1 1.5 3 2; 1 2.5 4 2.5],   y = [2; 3.5; 3; 4.5]

    Compact expression:

    RSS(w) = ‖Xw − y‖₂² = {w⊤X⊤Xw − 2(X⊤y)⊤w} + const

    11
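    A minimal NumPy sketch (illustrative only) that builds this design matrix by prepending a column of ones and evaluates ‖Xw − y‖₂² for a candidate w:

        import numpy as np

        # Features: sqft (1000's), bedrooms, bathrooms; target: sale price (100k)
        features = np.array([[1, 2, 1],
                             [2, 2, 2],
                             [1.5, 3, 2],
                             [2.5, 4, 2.5]])
        y = np.array([2, 3.5, 3, 4.5])

        # Augment with a leading 1 so w[0] plays the role of the bias w_0
        X = np.hstack([np.ones((features.shape[0], 1)), features])

        def rss(w):
            return np.sum((X @ w - y) ** 2)  # ||Xw - y||_2^2

        print(X.shape)           # (4, 4), i.e. N x (D+1)
        print(rss(np.zeros(4)))  # RSS of the all-zero weight vector: 45.5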

  • Three Optimization Methods

    We want to minimize

    RSS(w) = ‖Xw − y‖₂² = {w⊤X⊤Xw − 2(X⊤y)⊤w} + const

    • Least-squares solution: take the derivative and set it to zero
    • Batch gradient descent
    • Stochastic gradient descent

    12

  • Least-Squares Solution

    Compact expression:

    RSS(w) = ‖Xw − y‖₂² = {w⊤X⊤Xw − 2(X⊤y)⊤w} + const

    Gradients of linear and quadratic functions:

    • ∇_x(b⊤x) = b
    • ∇_x(x⊤Ax) = 2Ax (for symmetric A)

    Normal equation:

    ∇_w RSS(w) = 2X⊤Xw − 2X⊤y = 0

    This leads to the least-mean-squares (LMS) solution

    w_LMS = (X⊤X)⁻¹ X⊤y

    13
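    A minimal sketch of this closed-form solution in NumPy (illustrative; it uses a linear solve, which is the numerically safer way to evaluate (X⊤X)⁻¹X⊤y).

        import numpy as np

        def least_squares(X, y):
            """Solve X^T X w = X^T y, i.e. w = (X^T X)^{-1} X^T y."""
            return np.linalg.solve(X.T @ X, X.T @ y)

        # 1-D house-price example, with the bias column prepended
        X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])
        y = np.array([2, 3.5, 3, 4.5])
        print(least_squares(X, y))  # approximately [0.45, 1.6]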


  • Gradient Descent Methods

  • Outline

    Review of Linear Regression

    Gradient Descent Methods

    Feature Scaling

    Ridge regression

    Non-linear Basis Functions

    Overfitting

    14

  • Three Optimization Methods

    We want to minimize

    RSS(w) = ‖Xw − y‖₂² = {w⊤X⊤Xw − 2(X⊤y)⊤w} + const

    • Least-squares solution: take the derivative and set it to zero
    • Batch gradient descent
    • Stochastic gradient descent

    15

  • Computational complexity

    What is the bottleneck of computing the solution?

    w = (X⊤X)⁻¹ X⊤y

    How many operations do we need?

    • O(ND²) for the matrix multiplication X⊤X
    • O(D³) (e.g., using Gauss-Jordan elimination) or O(D^2.373) (recent theoretical advances) for inverting X⊤X
    • O(ND) for the matrix multiplication X⊤y
    • O(D²) for multiplying (X⊤X)⁻¹ by X⊤y

    Total: O(ND²) + O(D³) — impractical for very large D or N

    16


  • Alternative method: Batch Gradient Descent

    (Batch) gradient descent:

    • Initialize w to w^(0) (e.g., randomly); set t = 0; choose η > 0
    • Loop until convergence:
      1. Compute the gradient ∇RSS(w) = X⊤(Xw^(t) − y)
      2. Update the parameters: w^(t+1) = w^(t) − η ∇RSS(w)
      3. t ← t + 1

    What is the complexity of each iteration? O(ND)

    17
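    A minimal NumPy sketch of this loop (illustrative); it runs for a fixed number of iterations instead of testing convergence, and η and the iteration count are arbitrary choices.

        import numpy as np

        def batch_gradient_descent(X, y, eta=0.1, num_iters=200):
            w = np.zeros(X.shape[1])      # w^(0); could also be random
            for _ in range(num_iters):
                grad = X.T @ (X @ w - y)  # gradient of the RSS at the current w
                w = w - eta * grad        # gradient step, O(ND) per iteration
            return w

        X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])
        y = np.array([2, 3.5, 3, 4.5])
        print(batch_gradient_descent(X, y))  # converges toward [0.45, 1.6]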

  • Why would this work?

    If gradient descent converges, it will converge to the same solution as using matrix inversion.

    This is because RSS(w) is a convex function of its parameters w.

    Hessian of RSS:

    RSS(w) = w⊤X⊤Xw − 2(X⊤y)⊤w + const
    ⇒ ∂²RSS(w)/∂w∂w⊤ = 2X⊤X

    X⊤X is positive semidefinite, because for any v

    v⊤X⊤Xv = ‖Xv‖₂² ≥ 0

    18

  • Three Optimization Methods

    We want to minimize

    RSS(w) = ‖Xw − y‖₂² = {w⊤X⊤Xw − 2(X⊤y)⊤w} + const

    • Least-squares solution: take the derivative and set it to zero
    • Batch gradient descent
    • Stochastic gradient descent

    19

  • Stochastic gradient descent (SGD)

    Widrow-Hoff rule: update the parameters using one example at a time.

    • Initialize w to some w^(0); set t = 0; choose η > 0
    • Loop until convergence:
      1. Randomly choose a training sample x_t
      2. Compute its contribution to the gradient: g_t = (x_t⊤w^(t) − y_t) x_t
      3. Update the parameters: w^(t+1) = w^(t) − η g_t
      4. t ← t + 1

    How does the complexity per iteration compare with gradient descent?

    • O(ND) for gradient descent versus O(D) for SGD

    20
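    A minimal sketch of the Widrow-Hoff/SGD update in NumPy (illustrative; it uses a fixed η and a fixed number of updates, so the result keeps fluctuating around the least-squares solution rather than converging exactly).

        import numpy as np

        def sgd(X, y, eta=0.05, num_updates=2000, seed=0):
            rng = np.random.default_rng(seed)
            N, D = X.shape
            w = np.zeros(D)                   # w^(0)
            for _ in range(num_updates):
                t = rng.integers(N)           # pick one training sample at random
                g = (X[t] @ w - y[t]) * X[t]  # its contribution to the gradient, O(D)
                w = w - eta * g
            return w

        X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])
        y = np.array([2, 3.5, 3, 4.5])
        print(sgd(X, y))  # a noisy estimate near [0.45, 1.6]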

  • SGD versus Batch GD

    • SGD reduces the per-iteration complexity from O(ND) to O(D)
    • But it is noisier and can take longer to converge

    21

  • Example: Comparing the Three Methods

    sqft (1000's)    sale price (100k)
    1                2
    2                3.5
    1.5              3
    2.5              4.5

    [Figure: scatter plot of price ($) versus house size, to be fit by a regression line]

    22

  • Example: Least Squares Solution

    sqft (1000's)    sale price (100k)
    1                2
    2                3.5
    1.5              3
    2.5              4.5

    The w_0 and w_1 that minimize the RSS are given by w_LMS = (X⊤X)⁻¹ X⊤y:

    [w_0; w_1] = ( [1 1 1 1; 1 2 1.5 2.5] [1 1; 1 2; 1 1.5; 1 2.5] )⁻¹ [1 1 1 1; 1 2 1.5 2.5] [2; 3.5; 3; 4.5]

    23

  • Example: Least Squares Solution

    sqft (1000's)    sale price (100k)
    1                2
    2                3.5
    1.5              3
    2.5              4.5

    The w_0 and w_1 that minimize the RSS are given by w_LMS = (X⊤X)⁻¹ X⊤y:

    [w_0; w_1] = ( [1 1 1 1; 1 2 1.5 2.5] [1 1; 1 2; 1 1.5; 1 2.5] )⁻¹ [1 1 1 1; 1 2 1.5 2.5] [2; 3.5; 3; 4.5]

    [w_0; w_1] = [0.45; 1.6]

    The minimum RSS is RSS* = ‖X w_LMS − y‖₂² = 0.05 (equivalently, ‖X w_LMS − y‖₂ ≈ 0.2236).

    24
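    A short NumPy check of these numbers (illustrative):

        import numpy as np

        X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])
        y = np.array([2, 3.5, 3, 4.5])

        w = np.linalg.solve(X.T @ X, X.T @ y)
        residual = X @ w - y
        print(w)                         # approximately [0.45, 1.6]
        print(np.sum(residual ** 2))     # approximately 0.05 (the minimum RSS)
        print(np.linalg.norm(residual))  # approximately 0.2236 (the residual's l2 norm)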

  • Example: Batch Gradient Descent

    sqft (1000's)    sale price (100k)
    1                2
    2                3.5
    1.5              3
    2.5              4.5

    w^(t+1) = w^(t) − η ∇RSS(w) = w^(t) − η X⊤(Xw^(t) − y)

    [Plot: RSS value versus number of iterations for η = 0.01]

    25

  • Larger η gives faster convergence

    sqft (1000's)    sale price (100k)
    1                2
    2                3.5
    1.5              3
    2.5              4.5

    w^(t+1) = w^(t) − η ∇RSS(w) = w^(t) − η X⊤(Xw^(t) − y)

    [Plot: RSS value versus number of iterations for η = 0.01 and η = 0.1]

    26

  • But too large η makes GD unstable

    sqft (1000's)    sale price (100k)
    1                2
    2                3.5
    1.5              3
    2.5              4.5

    w^(t+1) = w^(t) − η ∇RSS(w) = w^(t) − η X⊤(Xw^(t) − y)

    [Plot: RSS value versus number of iterations for η = 0.01, η = 0.1, and η = 0.12]

    27

  • Example: Stochastic Gradient Descent

    sqft (1000's)    sale price (100k)
    1                2
    2                3.5
    1.5              3
    2.5              4.5

    w^(t+1) = w^(t) − η g_t = w^(t) − η (x_t⊤w^(t) − y_t) x_t

    [Plot: RSS value versus number of iterations for η = 0.05]

    28

  • Larger η gives faster convergence

    sqft (1000's)    sale price (100k)
    1                2
    2                3.5
    1.5              3
    2.5              4.5

    w^(t+1) = w^(t) − η g_t = w^(t) − η (x_t⊤w^(t) − y_t) x_t

    [Plot: RSS value versus number of iterations for η = 0.05 and η = 0.1]

    29

  • But too large η makes SGD unstable

    sqft (1000's)    sale price (100k)
    1                2
    2                3.5
    1.5              3
    2.5              4.5

    w^(t+1) = w^(t) − η g_t = w^(t) − η (x_t⊤w^(t) − y_t) x_t

    [Plot: RSS value versus number of iterations for η = 0.05, η = 0.1, and η = 0.25]

    30

  • How to Choose the Learning Rate η in Practice?

    • Try 0.0001, 0.001, 0.01, 0.1, etc. on a validation dataset (more on this later) and choose the one that gives the fastest stable convergence
    • Reduce η by a constant factor (e.g., 10) when learning saturates, so that we can get closer to the true minimum
    • More advanced learning-rate schedules such as AdaGrad, Adam, and AdaDelta are used in practice

    31
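    A minimal sketch of the second bullet, reducing η by a constant factor when learning saturates (illustrative; the saturation test, the tolerance, and the factor of 10 are arbitrary choices, not a prescribed recipe).

        import numpy as np

        def gd_with_step_decay(X, y, eta=0.1, num_iters=100, tol=1e-8):
            """Batch GD that divides eta by 10 whenever the RSS stops improving."""
            w = np.zeros(X.shape[1])
            prev_rss = np.inf
            for _ in range(num_iters):
                w = w - eta * (X.T @ (X @ w - y))
                rss = np.sum((X @ w - y) ** 2)
                if prev_rss - rss < tol:  # progress has saturated (or the step overshot)
                    eta /= 10.0
                prev_rss = rss
            return w

        X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])
        y = np.array([2, 3.5, 3, 4.5])
        print(gd_with_step_decay(X, y))  # close to the least-squares [0.45, 1.6]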

  • Summary of Gradient Descent Methods

    • Batch gradient descent computes the exact gradient.
    • Stochastic gradient descent approximates the gradient with a single data point; its expectation equals the true gradient.
    • Mini-batch variant: set the batch size to trade off the accuracy of the gradient estimate against the computational cost.
    • Similar ideas extend to other ML optimization problems.

    32

  • Feature Scaling

  • Outline

    Review of Linear Regression

    Gradient Descent Methods

    Feature Scaling

    Ridge regression

    Non-linear Basis Functions

    Overfitting

    33

  • Batch Gradient Descent: Scaled Features

    sqft (1000's)    sale price (100k)
    1                2
    2                3.5
    1.5              3
    2.5              4.5

    w^(t+1) = w^(t) − η ∇RSS(w) = w^(t) − η X⊤(Xw^(t) − y)

    [Plot: RSS value versus number of iterations for η = 0.01 and η = 0.1]

    34

  • Batch Gradient Descent: Without Feature Scaling

    sqft    sale price
    1000    200,000
    2000    350,000
    1500    300,000
    2500    450,000

    • The least-squares solution is (w*_0, w*_1) = (45000, 160)
    • ∇RSS(w^(t)) = X⊤(Xw^(t) − y) becomes HUGE, causing instability
    • We need a tiny η to compensate, but this leads to slow convergence

    [Plot: RSS value versus number of iterations for η = 0.0000001]

    35


  • Batch Gradient Descent: Without Feature Scaling

    sqft    sale price
    1000    200,000
    2000    350,000
    1500    300,000
    2500    450,000

    • The least-squares solution is (w*_0, w*_1) = (45000, 160)
    • ∇RSS(w) becomes HUGE, causing instability
    • We need a tiny η to compensate, but this leads to slow convergence

    [Plot: RSS value (y-axis scale ×10^110) versus number of iterations for η = 0.0000001 and η = 0.00001; the larger step size diverges]

    36

  • How to Scale Features?

    • Min-max normalization:

      x′_d = (x_d − min_n x_d) / (max_n x_d − min_n x_d)

      The min and max are taken over the values x_d^(1), . . . , x_d^(N) of x_d in the dataset. This results in all scaled features satisfying 0 ≤ x′_d ≤ 1.

    • Mean normalization:

      x′_d = (x_d − avg(x_d)) / (max_n x_d − min_n x_d)

      This results in all scaled features satisfying −1 ≤ x′_d ≤ 1.

    Several other methods exist, e.g., dividing by the standard deviation (Z-score normalization). The labels y^(1), . . . , y^(N) should be similarly re-scaled.

    37
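    A minimal column-wise NumPy sketch of the two normalizations above (illustrative):

        import numpy as np

        def min_max_normalize(X):
            """Scale each column (feature) to lie in [0, 1]."""
            x_min, x_max = X.min(axis=0), X.max(axis=0)
            return (X - x_min) / (x_max - x_min)

        def mean_normalize(X):
            """Center each column by its mean and divide by its range; values lie in [-1, 1]."""
            return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

        # Unscaled sqft values from the previous slides
        sqft = np.array([[1000.0], [2000.0], [1500.0], [2500.0]])
        print(min_max_normalize(sqft).ravel())  # approximately [0, 0.667, 0.333, 1]
        print(mean_normalize(sqft).ravel())     # approximately [-0.5, 0.167, -0.167, 0.5]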


  • Ridge regression

  • Outline

    Review of Linear Regression

    Gradient Descent Methods

    Feature Scaling

    Ridge regression

    Non-linear Basis Functions

    Overfitting

    38

  • What if X⊤X is not invertible?

    w_LMS = (X⊤X)⁻¹ X⊤y

    Why might this happen?

    • Answer 1: N < D. There is not enough data to estimate all the parameters, so X⊤X is not full-rank.
    • Answer 2: The columns of X are not linearly independent, e.g., some features are linear functions of other features. In this case the solution is not unique. Examples:
      • A feature is a re-scaled version of another, for example two features measuring length in meters and in feet
      • The same feature is repeated twice — this can happen when there are many features
      • A feature has the same value for all data points
      • The sum of two features equals a third feature

    39


  • Example: Matrix X⊤X is not invertible

    sqft (1000's)    bathrooms    sale price (100k)
    1                2            2
    2                2            3.5
    1.5              2            3
    2.5              2            4.5

    Design matrix and target vector:

    X = [1 1 2; 1 2 2; 1 1.5 2; 1 2.5 2],   w = [w_0; w_1; w_2],   y = [2; 3.5; 3; 4.5]

    The 'bathrooms' feature is redundant, so we don't need w_2:

    y = w_0 + w_1 x_1 + w_2 x_2
      = w_0 + w_1 x_1 + w_2 × 2, since x_2 is always 2!
      = w_0,eff + w_1 x_1, where w_0,eff = w_0 + 2w_2

    40


  • What does the RSS loss function look like?

    • When X⊤X is not invertible, the RSS objective function has a ridge, that is, the minimum is a line instead of a single point.

    In our example, this line is parameterized by w_0,eff = w_0 + 2w_2.

    41

  • How do you fix this issue?

    sqft (1000's)    bathrooms    sale price (100k)
    1                2            2
    2                2            3.5
    1.5              2            3
    2.5              2            4.5

    • Manually remove redundant features
    • But this can be tedious and non-trivial, especially when a feature is a linear combination of several other features

    We need a general approach that doesn't require manual feature engineering.

    SOLUTION: Ridge Regression

    42


  • Ridge regression

    Intuition: what does a non-invertible X⊤X mean?

    Consider the SVD of this matrix:

    X⊤X = V diag(λ_1, λ_2, . . . , λ_r, 0, . . . , 0) V⊤

    where λ_1 ≥ λ_2 ≥ · · · ≥ λ_r > 0 and r < D. We will have a divide-by-zero issue when computing (X⊤X)⁻¹.

    Fix the problem by ensuring all singular values are non-zero:

    X⊤X + λI = V diag(λ_1 + λ, λ_2 + λ, . . . , λ_r + λ, λ, . . . , λ) V⊤

    where λ > 0 and I is the identity matrix.

    43


  • Regularized least squares (ridge regression)

    Solution:

    w = (X⊤X + λI)⁻¹ X⊤y

    This is equivalent to adding an extra term to RSS(w):

    ½ {w⊤X⊤Xw − 2(X⊤y)⊤w}  +  ½ λ‖w‖₂²
      (= RSS(w), up to a constant)    (regularization)

    Benefits:

    • Numerically more stable — the matrix is invertible
    • Forces w to be small
    • Prevents overfitting — more on this later

    44
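    A minimal NumPy sketch of the regularized solution (illustrative; λ = 0.1 is an arbitrary choice, and the solve again avoids forming an explicit inverse).

        import numpy as np

        def ridge_regression(X, y, lam):
            """Solve (X^T X + lam*I) w = X^T y, i.e. w = (X^T X + lam*I)^{-1} X^T y."""
            return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

        # 1-D house-price example, bias column prepended
        X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])
        y = np.array([2, 3.5, 3, 4.5])
        print(ridge_regression(X, y, lam=0.1))  # approximately [0.53, 1.55];
        # as lam -> 0 this approaches the least-squares solution [0.45, 1.6]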


  • Applying this to our example

    sqft (1000's)    bathrooms    sale price (100k)
    1                2            2
    2                2            3.5
    1.5              2            3
    2.5              2            4.5

    The 'bathrooms' feature is redundant, so we don't need w_2:

    y = w_0 + w_1 x_1 + w_2 x_2
      = w_0 + w_1 x_1 + w_2 × 2, since x_2 is always 2!
      = w_0,eff + w_1 x_1, where w_0,eff = w_0 + 2w_2
      = 0.45 + 1.6 x_1   ← this is what we should get

    45


  • Applying this to our example

    The 'bathrooms' feature is redundant, so we don't need w_2:

    y = w_0 + w_1 x_1 + w_2 x_2
      = w_0 + w_1 x_1 + w_2 × 2, since x_2 is always 2!
      = w_0,eff + w_1 x_1, where w_0,eff = w_0 + 2w_2
      = 0.45 + 1.6 x_1   ← this is what we should get

    Compute the solution for λ = 0.5:

    [w_0; w_1; w_2] = (X⊤X + λI)⁻¹ X⊤y = [0.208; 1.247; 0.4166]

    46
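    A quick NumPy check of this λ = 0.5 solution (illustrative):

        import numpy as np

        # Columns: bias, sqft (1000's), bathrooms (always 2, i.e. redundant)
        X = np.array([[1, 1.0, 2], [1, 2.0, 2], [1, 1.5, 2], [1, 2.5, 2]])
        y = np.array([2, 3.5, 3, 4.5])
        lam = 0.5

        w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
        print(w)                # approximately [0.208, 1.247, 0.417]
        print(w[0] + 2 * w[2])  # w_0,eff = w_0 + 2*w_2, approximately 1.04 (plotted on the next slide)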


  • How does λ affect the solution?

    [w_0; w_1; w_2] = (X⊤X + λI)⁻¹ X⊤y

    Let us plot w_0,eff = w_0 + 2w_2 and w_1 for different λ ∈ [0.01, 20].

    [Plot: parameter values w_0,eff and w_1 versus the hyperparameter λ]

    Setting a small λ gives almost the least-squares solution, but it can cause numerical instability in the inversion.

    47

  • How to choose λ?

    λ is referred to as a hyperparameter:

    • It is associated with the estimation method, not the dataset
    • In contrast, w is the parameter vector
    • Use a validation set or cross-validation to find a good choice of λ (more on this in the next lecture)

    [Plot: parameter values w_0,eff and w_1 versus the hyperparameter λ]

    48

  • Why is it called Ridge Regression?

    • When X⊤X is not invertible, the RSS objective function has a ridge, that is, the minimum is a line instead of a single point.
    • Adding the regularizer term ½λ‖w‖₂² yields a unique minimum, thus avoiding instability in the matrix inversion.

    49

  • Probabilistic Interpretation of Ridge Regression

    Add a term to the objective function:

    • Choose the parameters not just to minimize the risk, but also to avoid being too large.

    ½ {w⊤X⊤Xw − 2(X⊤y)⊤w} + ½ λ‖w‖₂²

    Probabilistic interpretation: place a prior on our weights.

    • Interpret w as a random variable
    • Assume that each w_d is centered around zero
    • Use the observed data D to update our prior belief about w

    Gaussian priors lead to ridge regression.

    50


  • Review: Probabilistic interpretation of Linear Regression

    Linear regression model: Y = w⊤X + η

    η ∼ N(0, σ₀²) is a Gaussian random variable, and Y ∼ N(w⊤X, σ₀²).

    Frequentist interpretation: we assume that w is fixed.

    • The likelihood function maps parameters to probabilities:

      L : w, σ₀² ↦ p(D | w, σ₀²) = p(y | X, w, σ₀²) = ∏_n p(y_n | x_n, w, σ₀²)

    • Maximizing the likelihood with respect to w minimizes the RSS and yields the LMS solution:

      w_LMS = w_ML = arg max_w L(w, σ₀²)

    51


  • Probabilistic interpretation of Ridge Regression

    Ridge regression model: Y = w⊤X + η

    • Y ∼ N(w⊤X, σ₀²) is a Gaussian random variable (as before)
    • w_d ∼ N(0, σ²) are i.i.d. Gaussian random variables (unlike before)
    • Note that all w_d share the same variance σ²
    • To find w given the data D, compute the posterior distribution of w:

      p(w | D) = p(D | w) p(w) / p(D)

    • Maximum a posteriori (MAP) estimate:

      w_MAP = arg max_w p(w | D) = arg max_w p(D | w) p(w)

    52


  • Estimating w

    Let x_1, . . . , x_N be i.i.d. with y | w, x ∼ N(w⊤x, σ₀²) and w_d ∼ N(0, σ²).

    Joint likelihood of the data and the parameters (given σ₀, σ):

    p(D, w) = p(D | w) p(w) = ∏_n p(y_n | x_n, w) ∏_d p(w_d)

    Plugging in the Gaussian PDF, we get:

    log p(D, w) = ∑_n log p(y_n | x_n, w) + ∑_d log p(w_d)
                = − ∑_n (w⊤x_n − y_n)² / (2σ₀²) − ∑_d w_d² / (2σ²) + const

    MAP estimate: w_MAP = arg max_w log p(D, w), i.e.,

    w_MAP = arg min_w ∑_n (w⊤x_n − y_n)² / (2σ₀²) + ‖w‖₂² / (2σ²)

    53


  • Maximum a posteriori (MAP) estimate

    E(w) = ∑_n (w⊤x_n − y_n)² + λ‖w‖₂²

    where λ > 0 denotes σ₀²/σ². The extra term ‖w‖₂² is called the regularization term (regularizer) and controls the magnitude of w.

    Intuitions:

    • If λ → +∞, then σ₀² ≫ σ²: the variance of the noise is far greater than what our prior model allows for w. In this case, the prior forces w to be close to zero. Numerically, w_MAP → 0.
    • If λ → 0, then we trust our data more. Numerically,

      w_MAP → w_LMS = arg min_w ∑_n (w⊤x_n − y_n)²

    54

  • Outline

    1. Review of Linear Regression

    2. Gradient Descent Methods

    3. Feature Scaling

    4. Ridge regression

    5. Non-linear Basis Functions

    6. Overfitting

    55

  • Non-linear Basis Functions

  • Outline

    Review of Linear Regression

    Gradient Descent Methods

    Feature Scaling

    Ridge regression

    Non-linear Basis Functions

    Overfitting

    56

  • Is a linear modeling assumption always a good idea?

    Figure 1: Sale price can saturate as square footage increases.

    Figure 2: Temperature has cyclic variations over each year.

    57

  • General nonlinear basis functions

    We can use a nonlinear mapping to a new feature vector:

    φ(x) : x ∈ ℝ^D → z ∈ ℝ^M

    • M is the dimensionality of the new features z (or φ(x))
    • M could be greater than, less than, or equal to D

    We can apply existing learning methods to the transformed data:

    • Linear methods: the prediction is based on w⊤φ(x)
    • Other methods: nearest neighbors, decision trees, etc.

    58


  • Regression with nonlinear basis

    Residual sum of squares:

    ∑_n [w⊤φ(x_n) − y_n]²

    where w ∈ ℝ^M has the same dimensionality as the transformed features φ(x).

    The LMS solution can be formulated with the new design matrix:

    Φ = [φ(x_1)⊤; φ(x_2)⊤; . . . ; φ(x_N)⊤] ∈ ℝ^{N×M},   w_LMS = (Φ⊤Φ)⁻¹ Φ⊤y

    59
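    A minimal sketch of least squares with a polynomial basis φ(x) = [1, x, x², . . . , x^M]⊤ in NumPy (illustrative; the degrees and the noisy-sine data are arbitrary choices meant to mirror the figures on the following slides).

        import numpy as np

        def polynomial_design_matrix(x, M):
            """Rows are phi(x_n) = [1, x_n, x_n^2, ..., x_n^M]."""
            return np.vander(x, M + 1, increasing=True)

        def fit_polynomial(x, y, M):
            Phi = polynomial_design_matrix(x, M)
            # Numerically robust equivalent of w = (Phi^T Phi)^{-1} Phi^T y
            w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
            return w

        rng = np.random.default_rng(0)
        x = np.linspace(0, 1, 10)
        y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)  # noisy sine samples

        for M in (0, 1, 3, 9):
            w = fit_polynomial(x, y, M)
            rss = np.sum((polynomial_design_matrix(x, M) @ w - y) ** 2)
            print(M, rss)  # training RSS can only shrink as M grows; M = 9 nearly interpolates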


  • Example: Lot of Flexibility in Designing New Features!

    x_1, area (1k sqft)    x_1², area²    price (100k)
    1                      1              2
    2                      4              3.5
    1.5                    2.25           3
    2.5                    6.25           4.5

    Figure 3: Adding x_1² as a feature allows us to fit quadratic, instead of linear, functions of the house area x_1.

    60

  • Example: Lot of Flexibility in Designing New Features!

    x_1, frontage (100ft)    x_2, depth (100ft)    10·x_1·x_2, lot (1k sqft)    price (100k)
    0.5                      0.5                   2.5                          2
    0.5                      1                     5                            3.5
    0.8                      1.5                   12                           3
    1.0                      1.5                   15                           4.5

    Figure 4: Instead of having frontage and depth as two separate features, it may be better to consider the lot area, which is equal to frontage × depth.

    61

  • Example with regression

    Polynomial basis functions:

    φ(x) = [1, x, x², . . . , x^M]⊤  ⇒  f(x) = w_0 + ∑_{m=1}^M w_m x^m

    Fitting samples from a sine function: underfitting, since f(x) is too simple.

    [Plots: polynomial fits to the noisy sine data for M = 0 and M = 1]

    62


  • Adding high-order terms

    [Plots: polynomial fits for M = 3, and for M = 9 (overfitting)]

    More complex features lead to better results on the training data, but potentially worse results on new data, e.g., test data!

    63


  • Overfitting

  • Outline

    Review of Linear Regression

    Gradient Descent Methods

    Feature Scaling

    Ridge regression

    Non-linear Basis Functions

    Overfitting

    64

  • Overfitting

    The parameters of the higher-order polynomials are very large:

           M = 0    M = 1    M = 3      M = 9
    w_0    0.19     0.82     0.31       0.35
    w_1             -1.27    7.99       232.37
    w_2                      -25.43     -5321.83
    w_3                      17.37      48568.31
    w_4                                 -231639.30
    w_5                                 640042.26
    w_6                                 -1061800.52
    w_7                                 1042400.18
    w_8                                 -557682.99
    w_9                                 125201.43

    65

  • Overfitting can be quite disastrous

    Fitting the housing price data with large M: the predicted price goes to zero (and is ultimately negative) if you buy a big enough house!

    This is called poor generalization, or overfitting.

    66

  • Detecting overfitting

    Plot model complexity versus the objective function:

    • X axis: model complexity, e.g., M
    • Y axis: error, e.g., RSS, RMS (square root of RSS), 0-1 loss

    Compute the objective on a training and a test dataset.

    [Plot: E_RMS versus M for the training and test sets]

    As a model increases in complexity:

    • Training error keeps improving
    • Test error may first improve but eventually will deteriorate

    67
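    A minimal sketch of this diagnostic (illustrative; the noisy-sine data, the train/test split sizes, and the RMS error metric are arbitrary choices):

        import numpy as np

        def rms_error(Phi, w, y):
            return np.sqrt(np.mean((Phi @ w - y) ** 2))

        rng = np.random.default_rng(1)

        def make_data(n):
            x = rng.uniform(0, 1, n)
            return x, np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)

        x_train, y_train = make_data(10)
        x_test, y_test = make_data(100)

        for M in range(10):
            Phi_train = np.vander(x_train, M + 1, increasing=True)
            Phi_test = np.vander(x_test, M + 1, increasing=True)
            w, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
            # Training RMS keeps falling with M; test RMS typically rises for large M (overfitting)
            print(M, rms_error(Phi_train, w, y_train), rms_error(Phi_test, w, y_test))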

  • Dealing with overfitting

    Try to use more training data.

    [Plots: M = 9 polynomial fits with the original small dataset, with N = 15, and with N = 100 training points]

    What if we do not have a lot of data?

    68


  • Regularization methods

    Intuition: give preference to 'simpler' models.

    • How do we define a simple linear regression model w⊤x?
    • Intuitively, the weights should not be "too large"

           M = 0    M = 1    M = 3      M = 9
    w_0    0.19     0.82     0.31       0.35
    w_1             -1.27    7.99       232.37
    w_2                      -25.43     -5321.83
    w_3                      17.37      48568.31
    w_4                                 -231639.30
    w_5                                 640042.26
    w_6                                 -1061800.52
    w_7                                 1042400.18
    w_8                                 -557682.99
    w_9                                 125201.43

    69

  • Next Class: Overfitting, Regularization, and the Bias-Variance Trade-off

    69
