
18-661 Introduction to Machine Learning Linear Regression – II Spring 2020 ECE – Carnegie Mellon University


  • 18-661 Introduction to Machine Learning

    Linear Regression – II

    Spring 2020

    ECE – Carnegie Mellon University

  • Announcements

    • Homework 1 is due today.
    • If you are not able to access Gradescope, the entry code is 9NEDVR.
    • Python code will be graded for correctness, not efficiency.
    • The class waitlist is almost clear now. Any questions about registration?
    • The next few classes will be taught by Prof. Carlee Joe-Wong and broadcast from SV to Pittsburgh & Kigali.

    1


  • Today’s Class: Practical Issues with Using Linear Regression and How to Address Them

    1

  • Outline

    1. Review of Linear Regression

    2. Gradient Descent Methods

    3. Feature Scaling

    4. Ridge regression

    5. Non-linear Basis Functions

    6. Overfitting

    2

  • Review of Linear Regression

  • Example: Predicting house prices

    Sale price ≈ price per sqft × square footage + fixed expense

    3

  • Minimize squared errors

    Our model:

    Sale price = price per sqft × square footage + fixed expense + unexplainable stuff

    Training data:

    sqft    sale price    prediction    error    squared error
    2000    810K          720K          90K      90² = 8100
    2100    907K          800K          107K     107² = 11449
    1100    312K          350K          −38K     38² = 1444
    5500    2,600K        2,600K        0        0
    · · ·

    Total: 8100 + 11449 + 1444 + 0 + · · ·

    Aim: adjust the price per sqft and the fixed expense such that the sum of the squared errors is minimized — i.e., the unexplainable stuff is minimized.

    4
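    As a quick check of the arithmetic above, here is a minimal Python sketch; the per-row predictions are the illustrative values from the table, not the output of a fitted model.

        # Sale prices and the table's example predictions, in $1000s
        sale_prices = [810, 907, 312, 2600]
        predictions = [720, 800, 350, 2600]

        squared_errors = [(y - p) ** 2 for y, p in zip(sale_prices, predictions)]
        print(squared_errors)       # [8100, 11449, 1444, 0]
        print(sum(squared_errors))  # 20993 -- the quantity we want to minimize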

  • Linear regression

    Setup:

    • Input: x ∈ ℝ^D (covariates, predictors, features, etc.)
    • Output: y ∈ ℝ (responses, targets, outcomes, outputs, etc.)
    • Model: f : x → y, with f(x) = w_0 + ∑_{d=1}^D w_d x_d = w_0 + w⊤x
    • w = [w_1 w_2 · · · w_D]⊤: weights, parameters, or parameter vector
    • w_0 is called the bias
    • Sometimes we also call w = [w_0 w_1 w_2 · · · w_D]⊤ the parameters
    • Training data: D = {(x_n, y_n), n = 1, 2, . . . , N}

    Minimize the residual sum of squares:

    RSS(w) = ∑_{n=1}^N [y_n − f(x_n)]² = ∑_{n=1}^N [y_n − (w_0 + ∑_{d=1}^D w_d x_{nd})]²

    5

  • A simple case: x is just one-dimensional (D = 1)

    Residual sum of squares:

    RSS(w) = ∑_n [y_n − f(x_n)]² = ∑_n [y_n − (w_0 + w_1 x_n)]²

    6

  • A simple case: x is just one-dimensional (D = 1)

    Residual sum of squares:

    RSS(w) = ∑_n [y_n − f(x_n)]² = ∑_n [y_n − (w_0 + w_1 x_n)]²

    Stationary points: take the derivative with respect to each parameter and set it to zero.

    ∂RSS(w)/∂w_0 = 0 ⇒ −2 ∑_n [y_n − (w_0 + w_1 x_n)] = 0,
    ∂RSS(w)/∂w_1 = 0 ⇒ −2 ∑_n [y_n − (w_0 + w_1 x_n)] x_n = 0.

    7


  • A simple case: x is just one-dimensional (D = 1)

    ∂RSS(w)/∂w_0 = 0 ⇒ −2 ∑_n [y_n − (w_0 + w_1 x_n)] = 0
    ∂RSS(w)/∂w_1 = 0 ⇒ −2 ∑_n [y_n − (w_0 + w_1 x_n)] x_n = 0

    Simplify these expressions to get the “Normal Equations”:

    ∑ y_n = N w_0 + w_1 ∑ x_n
    ∑ x_n y_n = w_0 ∑ x_n + w_1 ∑ x_n²

    Solving the system, we obtain the least-squares coefficient estimates:

    w_1 = ∑_n (x_n − x̄)(y_n − ȳ) / ∑_n (x_n − x̄)²   and   w_0 = ȳ − w_1 x̄

    where x̄ = (1/N) ∑_n x_n and ȳ = (1/N) ∑_n y_n.

    8
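    A minimal NumPy sketch of these closed-form estimates (illustrative only, not homework code); the example data are the house sizes and prices used later in the lecture.

        import numpy as np

        def fit_simple_least_squares(x, y):
            """Return (w0, w1) minimizing sum_n [y_n - (w0 + w1 * x_n)]^2."""
            x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
            x_bar, y_bar = x.mean(), y.mean()
            w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
            w0 = y_bar - w1 * x_bar
            return w0, w1

        # sqft (1000's) vs. sale price (100k)
        w0, w1 = fit_simple_least_squares([1, 2, 1.5, 2.5], [2, 3.5, 3, 4.5])
        print(w0, w1)  # approximately 0.45 and 1.6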


  • Least Mean Squares when x is D-dimensional

    RSS(w) in matrix form:

    RSS(w) = ∑_n [y_n − (w_0 + ∑_d w_d x_{nd})]² = ∑_n [y_n − w⊤x_n]²,

    where we have redefined some variables (by augmenting):

    x ← [1 x_1 x_2 . . . x_D]⊤,   w ← [w_0 w_1 w_2 . . . w_D]⊤

    Design matrix and target vector (rows of X are the augmented data points):

    X = [x_1⊤; x_2⊤; . . . ; x_N⊤] ∈ ℝ^{N×(D+1)},   y = [y_1; y_2; . . . ; y_N] ∈ ℝ^N

    Compact expression:

    RSS(w) = ‖Xw − y‖₂² = {w⊤X⊤Xw − 2(X⊤y)⊤w} + const

    9


  • Example: RSS(w) in compact form

    sqft (1000's)    bedrooms    bathrooms    sale price (100k)
    1                2           1            2
    2                2           2            3.5
    1.5              3           2            3
    2.5              4           2.5          4.5

    Design matrix and target vector:

    X = [x_1⊤; x_2⊤; . . . ; x_N⊤] ∈ ℝ^{N×(D+1)},   y = [y_1; y_2; . . . ; y_N] ∈ ℝ^N

    Compact expression:

    RSS(w) = ‖Xw − y‖₂² = {w⊤X⊤Xw − 2(X⊤y)⊤w} + const

    10

  • Example: RSS(w) in compact form

    sqft (1000's)    bedrooms    bathrooms    sale price (100k)
    1                2           1            2
    2                2           2            3.5
    1.5              3           2            3
    2.5              4           2.5          4.5

    Design matrix and target vector:

    X = [1 1 2 1; 1 2 2 2; 1 1.5 3 2; 1 2.5 4 2.5],   y = [2; 3.5; 3; 4.5]

    Compact expression:

    RSS(w) = ‖Xw − y‖₂² = {w⊤X⊤Xw − 2(X⊤y)⊤w} + const

    11
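    A minimal NumPy sketch (illustrative only) that builds this design matrix by prepending a column of ones and evaluates ‖Xw − y‖₂² for a candidate w:

        import numpy as np

        # Features: sqft (1000's), bedrooms, bathrooms; target: sale price (100k)
        features = np.array([[1, 2, 1],
                             [2, 2, 2],
                             [1.5, 3, 2],
                             [2.5, 4, 2.5]])
        y = np.array([2, 3.5, 3, 4.5])

        # Augment with a leading 1 so w[0] plays the role of the bias w_0
        X = np.hstack([np.ones((features.shape[0], 1)), features])

        def rss(w):
            return np.sum((X @ w - y) ** 2)  # ||Xw - y||_2^2

        print(X.shape)           # (4, 4), i.e. N x (D+1)
        print(rss(np.zeros(4)))  # RSS of the all-zero weight vector: 45.5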

  • Three Optimization Methods

    We want to minimize

    RSS(w) = ‖Xw − y‖₂² = {w⊤X⊤Xw − 2(X⊤y)⊤w} + const

    • Least-squares solution: take the derivative and set it to zero
    • Batch gradient descent
    • Stochastic gradient descent

    12

  • Least-Squares Solution

    Compact expression:

    RSS(w) = ‖Xw − y‖₂² = {w⊤X⊤Xw − 2(X⊤y)⊤w} + const

    Gradients of linear and quadratic functions:

    • ∇_x(b⊤x) = b
    • ∇_x(x⊤Ax) = 2Ax (for symmetric A)

    Normal equation:

    ∇_w RSS(w) = 2X⊤Xw − 2X⊤y = 0

    This leads to the least-mean-squares (LMS) solution

    w_LMS = (X⊤X)⁻¹ X⊤y

    13
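    A minimal sketch of this closed-form solution in NumPy (illustrative; it uses a linear solve, which is the numerically safer way to evaluate (X⊤X)⁻¹X⊤y).

        import numpy as np

        def least_squares(X, y):
            """Solve X^T X w = X^T y, i.e. w = (X^T X)^{-1} X^T y."""
            return np.linalg.solve(X.T @ X, X.T @ y)

        # 1-D house-price example, with the bias column prepended
        X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])
        y = np.array([2, 3.5, 3, 4.5])
        print(least_squares(X, y))  # approximately [0.45, 1.6]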


  • Gradient Descent Methods

  • Outline

    Review of Linear Regression

    Gradient Descent Methods

    Feature Scaling

    Ridge regression

    Non-linear Basis Functions

    Overfitting

    14

  • Three Optimization Methods

    We want to minimize

    RSS(w) = ‖Xw − y‖₂² = {w⊤X⊤Xw − 2(X⊤y)⊤w} + const

    • Least-squares solution: take the derivative and set it to zero
    • Batch gradient descent
    • Stochastic gradient descent

    15

  • Computational complexity

    What is the bottleneck of computing the solution?

    w = (X⊤X)⁻¹ X⊤y

    How many operations do we need?

    • O(ND²) for the matrix multiplication X⊤X
    • O(D³) (e.g., using Gauss-Jordan elimination) or O(D^2.373) (recent theoretical advances) for inverting X⊤X
    • O(ND) for the matrix multiplication X⊤y
    • O(D²) for multiplying (X⊤X)⁻¹ by X⊤y

    Total: O(ND²) + O(D³) — impractical for very large D or N

    16


  • Alternative method: Batch Gradient Descent

    (Batch) gradient descent:

    • Initialize w to w^(0) (e.g., randomly); set t = 0; choose η > 0
    • Loop until convergence:
      1. Compute the gradient ∇RSS(w) = X⊤(Xw^(t) − y)
      2. Update the parameters: w^(t+1) = w^(t) − η ∇RSS(w)
      3. t ← t + 1

    What is the complexity of each iteration? O(ND)

    17
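    A minimal NumPy sketch of this loop (illustrative); it runs for a fixed number of iterations instead of testing convergence, and η and the iteration count are arbitrary choices.

        import numpy as np

        def batch_gradient_descent(X, y, eta=0.1, num_iters=200):
            w = np.zeros(X.shape[1])      # w^(0); could also be random
            for _ in range(num_iters):
                grad = X.T @ (X @ w - y)  # gradient of the RSS at the current w
                w = w - eta * grad        # gradient step, O(ND) per iteration
            return w

        X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])
        y = np.array([2, 3.5, 3, 4.5])
        print(batch_gradient_descent(X, y))  # converges toward [0.45, 1.6]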

  • Why would this work?

    If gradient descent converges, it will converge to the same solution as using matrix inversion.

    This is because RSS(w) is a convex function of its parameters w.

    Hessian of RSS:

    RSS(w) = w⊤X⊤Xw − 2(X⊤y)⊤w + const
    ⇒ ∂²RSS(w)/∂w∂w⊤ = 2X⊤X

    X⊤X is positive semidefinite, because for any v

    v⊤X⊤Xv = ‖Xv‖₂² ≥ 0

    18

  • Three Optimization Methods

    We want to minimize

    RSS(w) = ‖Xw − y‖₂² = {w⊤X⊤Xw − 2(X⊤y)⊤w} + const

    • Least-squares solution: take the derivative and set it to zero
    • Batch gradient descent
    • Stochastic gradient descent

    19

  • Stochastic gradient descent (SGD)

    Widrow-Hoff rule: update the parameters using one example at a time.

    • Initialize w to some w^(0); set t = 0; choose η > 0
    • Loop until convergence:
      1. Randomly choose a training sample x_t
      2. Compute its contribution to the gradient: g_t = (x_t⊤w^(t) − y_t) x_t
      3. Update the parameters: w^(t+1) = w^(t) − η g_t
      4. t ← t + 1

    How does the complexity per iteration compare with gradient descent?

    • O(ND) for gradient descent versus O(D) for SGD

    20
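    A minimal sketch of the Widrow-Hoff/SGD update in NumPy (illustrative; it uses a fixed η and a fixed number of updates, so the result keeps fluctuating around the least-squares solution rather than converging exactly).

        import numpy as np

        def sgd(X, y, eta=0.05, num_updates=2000, seed=0):
            rng = np.random.default_rng(seed)
            N, D = X.shape
            w = np.zeros(D)                   # w^(0)
            for _ in range(num_updates):
                t = rng.integers(N)           # pick one training sample at random
                g = (X[t] @ w - y[t]) * X[t]  # its contribution to the gradient, O(D)
                w = w - eta * g
            return w

        X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])
        y = np.array([2, 3.5, 3, 4.5])
        print(sgd(X, y))  # a noisy estimate near [0.45, 1.6]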

  • SGD versus Batch GD

    • SGD reduces the per-iteration complexity from O(ND) to O(D)
    • But it is noisier and can take longer to converge

    21

  • Example: Comparing the Three Methods

    sqft (1000's)    sale price (100k)
    1                2
    2                3.5
    1.5              3
    2.5              4.5

    [Figure: scatter plot of price ($) versus house size, to be fit by a regression line]

    22

  • Example: Least Squares Solution

    sqft (1000's)    sale price (100k)
    1                2
    2                3.5
    1.5              3
    2.5              4.5

    The w_0 and w_1 that minimize the RSS are given by w_LMS = (X⊤X)⁻¹ X⊤y:

    [w_0; w_1] = ( [1 1 1 1; 1 2 1.5 2.5] [1 1; 1 2; 1 1.5; 1 2.5] )⁻¹ [1 1 1 1; 1 2 1.5 2.5] [2; 3.5; 3; 4.5]

    23

  • Example: Least Squares Solution

    sqft (1000's)    sale price (100k)
    1                2
    2                3.5
    1.5              3
    2.5              4.5

    The w_0 and w_1 that minimize the RSS are given by w_LMS = (X⊤X)⁻¹ X⊤y:

    [w_0; w_1] = ( [1 1 1 1; 1 2 1.5 2.5] [1 1; 1 2; 1 1.5; 1 2.5] )⁻¹ [1 1 1 1; 1 2 1.5 2.5] [2; 3.5; 3; 4.5]

    [w_0; w_1] = [0.45; 1.6]

    The minimum RSS is RSS* = ‖X w_LMS − y‖₂² = 0.05 (equivalently, ‖X w_LMS − y‖₂ ≈ 0.2236).

    24
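    A short NumPy check of these numbers (illustrative):

        import numpy as np

        X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])
        y = np.array([2, 3.5, 3, 4.5])

        w = np.linalg.solve(X.T @ X, X.T @ y)
        residual = X @ w - y
        print(w)                         # approximately [0.45, 1.6]
        print(np.sum(residual ** 2))     # approximately 0.05 (the minimum RSS)
        print(np.linalg.norm(residual))  # approximately 0.2236 (the residual's l2 norm)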

  • Example: Batch Gradient Descent

    sqft (1000's)    sale price (100k)
    1                2
    2                3.5
    1.5              3
    2.5              4.5

    w^(t+1) = w^(t) − η ∇RSS(w) = w^(t) − η X⊤(Xw^(t) − y)

    [Plot: RSS value versus number of iterations for η = 0.01]

    25

  • Larger η gives faster convergence

    sqft (1000's)    sale price (100k)
    1                2
    2                3.5
    1.5              3
    2.5              4.5

    w^(t+1) = w^(t) − η ∇RSS(w) = w^(t) − η X⊤(Xw^(t) − y)

    [Plot: RSS value versus number of iterations for η = 0.01 and η = 0.1]

    26

  • But too large η makes GD unstable

    sqft (1000's)    sale price (100k)
    1                2
    2                3.5
    1.5              3
    2.5              4.5

    w^(t+1) = w^(t) − η ∇RSS(w) = w^(t) − η X⊤(Xw^(t) − y)

    [Plot: RSS value versus number of iterations for η = 0.01, η = 0.1, and η = 0.12]

    27

  • Example: Stochastic Gradient Descent

    sqft (1000's)    sale price (100k)
    1                2
    2                3.5
    1.5              3
    2.5              4.5

    w^(t+1) = w^(t) − η g_t = w^(t) − η (x_t⊤w^(t) − y_t) x_t

    [Plot: RSS value versus number of iterations for η = 0.05]

    28

  • Larger η gives faster convergence

    sqft (1000's)    sale price (100k)
    1                2
    2                3.5
    1.5              3
    2.5              4.5

    w^(t+1) = w^(t) − η g_t = w^(t) − η (x_t⊤w^(t) − y_t) x_t

    [Plot: RSS value versus number of iterations for η = 0.05 and η = 0.1]

    29

  • But too large η makes SGD unstable

    sqft (1000's)    sale price (100k)
    1                2
    2                3.5
    1.5              3
    2.5              4.5

    w^(t+1) = w^(t) − η g_t = w^(t) − η (x_t⊤w^(t) − y_t) x_t

    [Plot: RSS value versus number of iterations for η = 0.05, η = 0.1, and η = 0.25]

    30

  • How to Choose the Learning Rate η in Practice?

    • Try 0.0001, 0.001, 0.01, 0.1, etc. on a validation dataset (more on this later) and choose the one that gives the fastest stable convergence
    • Reduce η by a constant factor (e.g., 10) when learning saturates, so that we can get closer to the true minimum
    • More advanced learning-rate schedules such as AdaGrad, Adam, and AdaDelta are used in practice

    31
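    A minimal sketch of the second bullet, reducing η by a constant factor when learning saturates (illustrative; the saturation test, the tolerance, and the factor of 10 are arbitrary choices, not a prescribed recipe).

        import numpy as np

        def gd_with_step_decay(X, y, eta=0.1, num_iters=100, tol=1e-8):
            """Batch GD that divides eta by 10 whenever the RSS stops improving."""
            w = np.zeros(X.shape[1])
            prev_rss = np.inf
            for _ in range(num_iters):
                w = w - eta * (X.T @ (X @ w - y))
                rss = np.sum((X @ w - y) ** 2)
                if prev_rss - rss < tol:  # progress has saturated (or the step overshot)
                    eta /= 10.0
                prev_rss = rss
            return w

        X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])
        y = np.array([2, 3.5, 3, 4.5])
        print(gd_with_step_decay(X, y))  # close to the least-squares [0.45, 1.6]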

  • Summary of Gradient Descent Methods

    • Batch gradient descent computes the exact gradient.
    • Stochastic gradient descent approximates the gradient with a single data point; its expectation equals the true gradient.
    • Mini-batch variant: set the batch size to trade off the accuracy of the gradient estimate against the computational cost.
    • Similar ideas extend to other ML optimization problems.

    32

  • Feature Scaling

  • Outline

    Review of Linear Regression

    Gradient Descent Methods

    Feature Scaling

    Ridge regression

    Non-linear Basis Functions

    Overfitting

    33

  • Batch Gradient Descent: Scaled Features

    sqft (1000's)    sale price (100k)
    1                2
    2                3.5
    1.5              3
    2.5              4.5

    w^(t+1) = w^(t) − η ∇RSS(w) = w^(t) − η X⊤(Xw^(t) − y)

    [Plot: RSS value versus number of iterations for η = 0.01 and η = 0.1]

    34

  • Batch Gradient Descent: Without Feature Scaling

    sqft    sale price
    1000    200,000
    2000    350,000
    1500    300,000
    2500    450,000

    • The least-squares solution is (w*_0, w*_1) = (45000, 160)
    • ∇RSS(w^(t)) = X⊤(Xw^(t) − y) becomes HUGE, causing instability
    • We need a tiny η to compensate, but this leads to slow convergence

    [Plot: RSS value versus number of iterations for η = 0.0000001]

    35


  • Batch Gradient Descent: Without Feature Scaling

    sqft    sale price
    1000    200,000
    2000    350,000
    1500    300,000
    2500    450,000

    • The least-squares solution is (w*_0, w*_1) = (45000, 160)
    • ∇RSS(w) becomes HUGE, causing instability
    • We need a tiny η to compensate, but this leads to slow convergence

    [Plot: RSS value (y-axis scale ×10^110) versus number of iterations for η = 0.0000001 and η = 0.00001; the larger step size diverges]

    36

  • How to Scale Features?

    • Min-max normalization:

      x′_d = (x_d − min_n x_d) / (max_n x_d − min_n x_d)

      The min and max are taken over the values x_d^(1), . . . , x_d^(N) of x_d in the dataset. This results in all scaled features satisfying 0 ≤ x′_d ≤ 1.

    • Mean normalization:

      x′_d = (x_d − avg(x_d)) / (max_n x_d − min_n x_d)

      This results in all scaled features satisfying −1 ≤ x′_d ≤ 1.

    Several other methods exist, e.g., dividing by the standard deviation (Z-score normalization). The labels y^(1), . . . , y^(N) should be similarly re-scaled.

    37
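    A minimal column-wise NumPy sketch of the two normalizations above (illustrative):

        import numpy as np

        def min_max_normalize(X):
            """Scale each column (feature) to lie in [0, 1]."""
            x_min, x_max = X.min(axis=0), X.max(axis=0)
            return (X - x_min) / (x_max - x_min)

        def mean_normalize(X):
            """Center each column by its mean and divide by its range; values lie in [-1, 1]."""
            return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

        # Unscaled sqft values from the previous slides
        sqft = np.array([[1000.0], [2000.0], [1500.0], [2500.0]])
        print(min_max_normalize(sqft).ravel())  # approximately [0, 0.667, 0.333, 1]
        print(mean_normalize(sqft).ravel())     # approximately [-0.5, 0.167, -0.167, 0.5]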


  • Ridge regression

  • Outline

    Review of Linear Regression

    Gradient Descent Methods

    Feature Scaling

    Ridge regression

    Non-linear Basis Functions

    Overfitting

    38

  • What if X⊤X is not invertible?

    w_LMS = (X⊤X)⁻¹ X⊤y

    Why might this happen?

    • Answer 1: N < D. There is not enough data to estimate all the parameters, so X⊤X is not full-rank.
    • Answer 2: The columns of X are not linearly independent, e.g., some features are linear functions of other features. In this case the solution is not unique. Examples:
      • A feature is a re-scaled version of another, for example two features measuring length in meters and in feet
      • The same feature is repeated twice — this can happen when there are many features
      • A feature has the same value for all data points
      • The sum of two features equals a third feature

    39


  • Example: Matrix X⊤X is not invertible

    sqft (1000's)    bathrooms    sale price (100k)
    1                2            2
    2                2            3.5
    1.5              2            3
    2.5              2            4.5

    Design matrix and target vector:

    X = [1 1 2; 1 2 2; 1 1.5 2; 1 2.5 2],   w = [w_0; w_1; w_2],   y = [2; 3.5; 3; 4.5]

    The 'bathrooms' feature is redundant, so we don't need w_2:

    y = w_0 + w_1 x_1 + w_2 x_2
      = w_0 + w_1 x_1 + w_2 × 2, since x_2 is always 2!
      = w_0,eff + w_1 x_1, where w_0,eff = w_0 + 2w_2

    40


  • What does the RSS loss function look like?

    • When X⊤X is not invertible, the RSS objective function has a ridge, that is, the minimum is a line instead of a single point.

    In our example, this line is parameterized by w_0,eff = w_0 + 2w_2.

    41

  • How do you fix this issue?

    sqft (1000's)    bathrooms    sale price (100k)
    1                2            2
    2                2            3.5
    1.5              2            3
    2.5              2            4.5

    • Manually remove redundant features
    • But this can be tedious and non-trivial, especially when a feature is a linear combination of several other features

    We need a general approach that doesn't require manual feature engineering.

    SOLUTION: Ridge Regression

    42


  • Ridge regression

    Intuition: what does a non-invertible X⊤X mean?

    Consider the SVD of this matrix:

    X⊤X = V diag(λ_1, λ_2, . . . , λ_r, 0, . . . , 0) V⊤

    where λ_1 ≥ λ_2 ≥ · · · ≥ λ_r > 0 and r < D. We will have a divide-by-zero issue when computing (X⊤X)⁻¹.

    Fix the problem by ensuring all singular values are non-zero:

    X⊤X + λI = V diag(λ_1 + λ, λ_2 + λ, . . . , λ_r + λ, λ, . . . , λ) V⊤

    where λ > 0 and I is the identity matrix.

    43


  • Regularized least squares (ridge regression)

    Solution:

    w = (X⊤X + λI)⁻¹ X⊤y

    This is equivalent to adding an extra term to RSS(w):

    ½ {w⊤X⊤Xw − 2(X⊤y)⊤w}  +  ½ λ‖w‖₂²
      (= RSS(w), up to a constant)    (regularization)

    Benefits:

    • Numerically more stable — the matrix is invertible
    • Forces w to be small
    • Prevents overfitting — more on this later

    44
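    A minimal NumPy sketch of the regularized solution (illustrative; λ = 0.1 is an arbitrary choice, and the solve again avoids forming an explicit inverse).

        import numpy as np

        def ridge_regression(X, y, lam):
            """Solve (X^T X + lam*I) w = X^T y, i.e. w = (X^T X + lam*I)^{-1} X^T y."""
            return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

        # 1-D house-price example, bias column prepended
        X = np.array([[1, 1.0], [1, 2.0], [1, 1.5], [1, 2.5]])
        y = np.array([2, 3.5, 3, 4.5])
        print(ridge_regression(X, y, lam=0.1))  # approximately [0.53, 1.55];
        # as lam -> 0 this approaches the least-squares solution [0.45, 1.6]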


  • Applying this to our example

    sqft (1000's)    bathrooms    sale price (100k)
    1                2            2
    2                2            3.5
    1.5              2            3
    2.5              2            4.5

    The 'bathrooms' feature is redundant, so we don't need w_2:

    y = w_0 + w_1 x_1 + w_2 x_2
      = w_0 + w_1 x_1 + w_2 × 2, since x_2 is always 2!
      = w_0,eff + w_1 x_1, where w_0,eff = w_0 + 2w_2
      = 0.45 + 1.6 x_1   ← this is what we should get

    45


  • Applying this to our example

    The 'bathrooms' feature is redundant, so we don't need w_2:

    y = w_0 + w_1 x_1 + w_2 x_2
      = w_0 + w_1 x_1 + w_2 × 2, since x_2 is always 2!
      = w_0,eff + w_1 x_1, where w_0,eff = w_0 + 2w_2
      = 0.45 + 1.6 x_1   ← this is what we should get

    Compute the solution for λ = 0.5:

    [w_0; w_1; w_2] = (X⊤X + λI)⁻¹ X⊤y = [0.208; 1.247; 0.4166]

    46
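    A quick NumPy check of this λ = 0.5 solution (illustrative):

        import numpy as np

        # Columns: bias, sqft (1000's), bathrooms (always 2, i.e. redundant)
        X = np.array([[1, 1.0, 2], [1, 2.0, 2], [1, 1.5, 2], [1, 2.5, 2]])
        y = np.array([2, 3.5, 3, 4.5])
        lam = 0.5

        w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
        print(w)                # approximately [0.208, 1.247, 0.417]
        print(w[0] + 2 * w[2])  # w_0,eff = w_0 + 2*w_2, approximately 1.04 (plotted on the next slide)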


  • How does λ affect the solution?

    [w_0; w_1; w_2] = (X⊤X + λI)⁻¹ X⊤y

    Let us plot w_0,eff = w_0 + 2w_2 and w_1 for different λ ∈ [0.01, 20].

    [Plot: parameter values w_0,eff and w_1 versus the hyperparameter λ]

    Setting a small λ gives almost the least-squares solution, but it can cause numerical instability in the inversion.

    47

  • How to choose λ?

    λ is referred to as a hyperparameter:

    • It is associated with the estimation method, not the dataset
    • In contrast, w is the parameter vector
    • Use a validation set or cross-validation to find a good choice of λ (more on this in the next lecture)

    [Plot: parameter values w_0,eff and w_1 versus the hyperparameter λ]

    48

  • Why is it called Ridge Regression?

    • When X⊤X is not invertible, the RSS objective function has a ridge, that is, the minimum is a line instead of a single point.
    • Adding the regularizer term ½λ‖w‖₂² yields a unique minimum, thus avoiding instability in the matrix inversion.

    49

  • Probabilistic Interpretation of Ridge Regression

    Add a term to the objective function:

    • Choose the parameters not just to minimize the risk, but also to avoid being too large.

    ½ {w⊤X⊤Xw − 2(X⊤y)⊤w} + ½ λ‖w‖₂²

    Probabilistic interpretation: place a prior on our weights.

    • Interpret w as a random variable
    • Assume that each w_d is centered around zero
    • Use the observed data D to update our prior belief about w

    Gaussian priors lead to ridge regression.

    50


  • Review: Probabilistic interpretation of Linear Regression

    Linear regression model: Y = w⊤X + η

    η ∼ N(0, σ₀²) is a Gaussian random variable, and Y ∼ N(w⊤X, σ₀²).

    Frequentist interpretation: we assume that w is fixed.

    • The likelihood function maps parameters to probabilities:

      L : w, σ₀² ↦ p(D | w, σ₀²) = p(y | X, w, σ₀²) = ∏_n p(y_n | x_n, w, σ₀²)

    • Maximizing the likelihood with respect to w minimizes the RSS and yields the LMS solution:

      w_LMS = w_ML = arg max_w L(w, σ₀²)

    51


  • Probabilistic interpretation of Ridge Regression

    Ridge regression model: Y = w⊤X + η

    • Y ∼ N(w⊤X, σ₀²) is a Gaussian random variable (as before)
    • w_d ∼ N(0, σ²) are i.i.d. Gaussian random variables (unlike before)
    • Note that all w_d share the same variance σ²
    • To find w given the data D, compute the posterior distribution of w:

      p(w | D) = p(D | w) p(w) / p(D)

    • Maximum a posteriori (MAP) estimate:

      w_MAP = arg max_w p(w | D) = arg max_w p(D | w) p(w)

    52


  • Estimating w

    Let x_1, . . . , x_N be i.i.d. with y | w, x ∼ N(w⊤x, σ₀²) and w_d ∼ N(0, σ²).

    Joint likelihood of the data and the parameters (given σ₀, σ):

    p(D, w) = p(D | w) p(w) = ∏_n p(y_n | x_n, w) ∏_d p(w_d)

    Plugging in the Gaussian PDF, we get:

    log p(D, w) = ∑_n log p(y_n | x_n, w) + ∑_d log p(w_d)
                = − ∑_n (w⊤x_n − y_n)² / (2σ₀²) − ∑_d w_d² / (2σ²) + const

    MAP estimate: w_MAP = arg max_w log p(D, w), i.e.,

    w_MAP = arg min_w ∑_n (w⊤x_n − y_n)² / (2σ₀²) + ‖w‖₂² / (2σ²)

    53


  • Maximum a posteriori (MAP) estimate

    E(w) = ∑_n (w⊤x_n − y_n)² + λ‖w‖₂²

    where λ > 0 denotes σ₀²/σ². The extra term ‖w‖₂² is called the regularization term (regularizer) and controls the magnitude of w.

    Intuitions:

    • If λ → +∞, then σ₀² ≫ σ²: the variance of the noise is far greater than what our prior model allows for w. In this case, the prior forces w to be close to zero. Numerically, w_MAP → 0.
    • If λ → 0, then we trust our data more. Numerically,

      w_MAP → w_LMS = arg min_w ∑_n (w⊤x_n − y_n)²

    54

  • Outline

    1. Review of Linear Regression

    2. Gradient Descent Methods

    3. Feature Scaling

    4. Ridge regression

    5. Non-linear Basis Functions

    6. Overfitting

    55

  • Non-linear Basis Functions

  • Outline

    Review of Linear Regression

    Gradient Descent Methods

    Feature Scaling

    Ridge regression

    Non-linear Basis Functions

    Overfitting

    56

  • Is a linear modeling assumption always a good idea?

    Figure 1: Sale price can saturate as square footage increases.

    Figure 2: Temperature has cyclic variations over each year.

    57

  • General nonlinear basis functions

    We can use a nonlinear mapping to a new feature vector:

    φ(x) : x ∈ ℝ^D → z ∈ ℝ^M

    • M is the dimensionality of the new features z (or φ(x))
    • M could be greater than, less than, or equal to D

    We can apply existing learning methods to the transformed data:

    • Linear methods: the prediction is based on w⊤φ(x)
    • Other methods: nearest neighbors, decision trees, etc.

    58


  • Regression with nonlinear basis

    Residual sum of squares:

    ∑_n [w⊤φ(x_n) − y_n]²

    where w ∈ ℝ^M has the same dimensionality as the transformed features φ(x).

    The LMS solution can be formulated with the new design matrix:

    Φ = [φ(x_1)⊤; φ(x_2)⊤; . . . ; φ(x_N)⊤] ∈ ℝ^{N×M},   w_LMS = (Φ⊤Φ)⁻¹ Φ⊤y

    59
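    A minimal sketch of least squares with a polynomial basis φ(x) = [1, x, x², . . . , x^M]⊤ in NumPy (illustrative; the degrees and the noisy-sine data are arbitrary choices meant to mirror the figures on the following slides).

        import numpy as np

        def polynomial_design_matrix(x, M):
            """Rows are phi(x_n) = [1, x_n, x_n^2, ..., x_n^M]."""
            return np.vander(x, M + 1, increasing=True)

        def fit_polynomial(x, y, M):
            Phi = polynomial_design_matrix(x, M)
            # Numerically robust equivalent of w = (Phi^T Phi)^{-1} Phi^T y
            w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
            return w

        rng = np.random.default_rng(0)
        x = np.linspace(0, 1, 10)
        y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)  # noisy sine samples

        for M in (0, 1, 3, 9):
            w = fit_polynomial(x, y, M)
            rss = np.sum((polynomial_design_matrix(x, M) @ w - y) ** 2)
            print(M, rss)  # training RSS can only shrink as M grows; M = 9 nearly interpolates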


  • Example: Lot of Flexibility in Designing New Features!

    x_1, area (1k sqft)    x_1², area²    price (100k)
    1                      1              2
    2                      4              3.5
    1.5                    2.25           3
    2.5                    6.25           4.5

    Figure 3: Adding x_1² as a feature allows us to fit quadratic, instead of linear, functions of the house area x_1.

    60

  • Example: Lot of Flexibility in Designing New Features!

    x_1, frontage (100ft)    x_2, depth (100ft)    10·x_1·x_2, lot (1k sqft)    price (100k)
    0.5                      0.5                   2.5                          2
    0.5                      1                     5                            3.5
    0.8                      1.5                   12                           3
    1.0                      1.5                   15                           4.5

    Figure 4: Instead of having frontage and depth as two separate features, it may be better to consider the lot area, which is equal to frontage × depth.

    61

  • Example with regression

    Polynomial basis functions:

    φ(x) = [1, x, x², . . . , x^M]⊤  ⇒  f(x) = w_0 + ∑_{m=1}^M w_m x^m

    Fitting samples from a sine function: underfitting, since f(x) is too simple.

    [Plots: polynomial fits to the noisy sine data for M = 0 and M = 1]

    62


  • Adding high-order terms

    [Plots: polynomial fits for M = 3, and for M = 9 (overfitting)]

    More complex features lead to better results on the training data, but potentially worse results on new data, e.g., test data!

    63


  • Overfitting

  • Outline

    Review of Linear Regression

    Gradient Descent Methods

    Feature Scaling

    Ridge regression

    Non-linear Basis Functions

    Overfitting

    64

  • Overfitting

    The parameters of the higher-order polynomials are very large:

           M = 0    M = 1    M = 3      M = 9
    w_0    0.19     0.82     0.31       0.35
    w_1             -1.27    7.99       232.37
    w_2                      -25.43     -5321.83
    w_3                      17.37      48568.31
    w_4                                 -231639.30
    w_5                                 640042.26
    w_6                                 -1061800.52
    w_7                                 1042400.18
    w_8                                 -557682.99
    w_9                                 125201.43

    65

  • Overfitting can be quite disastrous

    Fitting the housing price data with large M: the predicted price goes to zero (and is ultimately negative) if you buy a big enough house!

    This is called poor generalization, or overfitting.

    66

  • Detecting overfitting

    Plot model complexity versus the objective function:

    • X axis: model complexity, e.g., M
    • Y axis: error, e.g., RSS, RMS (square root of RSS), 0-1 loss

    Compute the objective on a training and a test dataset.

    [Plot: E_RMS versus M for the training and test sets]

    As a model increases in complexity:

    • Training error keeps improving
    • Test error may first improve but eventually will deteriorate

    67
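    A minimal sketch of this diagnostic (illustrative; the noisy-sine data, the train/test split sizes, and the RMS error metric are arbitrary choices):

        import numpy as np

        def rms_error(Phi, w, y):
            return np.sqrt(np.mean((Phi @ w - y) ** 2))

        rng = np.random.default_rng(1)

        def make_data(n):
            x = rng.uniform(0, 1, n)
            return x, np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)

        x_train, y_train = make_data(10)
        x_test, y_test = make_data(100)

        for M in range(10):
            Phi_train = np.vander(x_train, M + 1, increasing=True)
            Phi_test = np.vander(x_test, M + 1, increasing=True)
            w, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
            # Training RMS keeps falling with M; test RMS typically rises for large M (overfitting)
            print(M, rms_error(Phi_train, w, y_train), rms_error(Phi_test, w, y_test))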

  • Dealing with overfitting

    Try to use more training data.

    [Plots: M = 9 polynomial fits with the original small dataset, with N = 15, and with N = 100 training points]

    What if we do not have a lot of data?

    68


  • Regularization methods

    Intuition: give preference to 'simpler' models.

    • How do we define a simple linear regression model w⊤x?
    • Intuitively, the weights should not be "too large"

           M = 0    M = 1    M = 3      M = 9
    w_0    0.19     0.82     0.31       0.35
    w_1             -1.27    7.99       232.37
    w_2                      -25.43     -5321.83
    w_3                      17.37      48568.31
    w_4                                 -231639.30
    w_5                                 640042.26
    w_6                                 -1061800.52
    w_7                                 1042400.18
    w_8                                 -557682.99
    w_9                                 125201.43

    69

  • Next Class: Overfitting, Regularization, and the Bias-Variance Trade-off

    69
