TMI04.2 Linear Regression

Transcript of TMI04.2 Linear Regression

  • Institut für Informatik

    Linear regression with gradient descent

    Ingmar Schuster

    Patrick Jähnichen

    using slides by Andrew Ng

  • Linear regression w. gradient descent 2

    This lecture covers

    Linear Regression

    Hypothesis formulation, hypothesis space

    Optimizing Cost with Gradient Descent

    Using multiple input features with Linear Regression

    Feature Scaling

    Nonlinear Regression

    Optimizing Cost using derivatives

  • Institut für Informatik

    Linear Regression

  • Linear regression w. gradient descent 4

    Price for buying a flat in Berlin

    Supervised learning problem

    Expected answer available for each example in data

    Regression Problem

    Prediction of continuous output

  • Linear regression w. gradient descent 5

    Training data of flat prices

    Notation

    m: number of training examples

    x: input (predictor) variable ("features" in ML-speak)

    y: output (response) variable

    Square meters | Price in 1000s
               73 | 174
              146 | 367
               38 |  69
              124 | 257
              ... | ...

  • Linear regression w. gradient descent 6

    Learning procedure

    Training data → Learning Algorithm → hypothesis h (a mapping between input and output)

    Size of flat → h → Estimated price

    Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$ (linear regression, one input variable: univariate)

    $\theta_0, \theta_1$ are the hypothesis parameters

    How to choose the parameters?

  • Optimization objective

    Purpose of learning algorithm is expressed in an optimization objective and a cost function (often called J)

    Possible objectives: fit data well, few false positives, few false negatives, ...

  • Linear regression w. gradient descent 8

    Fitting data well: least squares cost function

    In regression we almost always want to fit the data well:
    smallest average distance to points in training data (h(x) close to y for (x, y) in training data)

    Cost function, often named J:

    $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

    where m is the number of training instances

    Squaring makes the penalty for positive and negative deviations the same, and the penalty for large deviations stronger
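
    As a concrete illustration (not from the slides), a minimal NumPy sketch of this least-squares cost, assuming the univariate hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$:

        import numpy as np

        def cost(theta0, theta1, x, y):
            """J(theta0, theta1) = 1/(2m) * sum of squared deviations h(x) - y."""
            m = len(y)                   # number of training instances
            h = theta0 + theta1 * x      # hypothesis on all inputs at once
            return np.sum((h - y) ** 2) / (2 * m)

        # flat-price data from the earlier slide: square meters vs. price in 1000s
        x = np.array([73.0, 146.0, 38.0, 124.0])
        y = np.array([174.0, 367.0, 69.0, 257.0])
        print(cost(0.0, 2.5, x, y))      # cost of the rough guess h(x) = 2.5 x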

  • Linear regression w. gradient descent 9

    Optimizing Cost with Gradient Descent

  • Linear regression w. gradient descent 10

    Gradient Descent Outline

    Want to minimize $J(\theta_0, \theta_1)$

    Start with random $\theta_0, \theta_1$

    Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$ until we end up at a minimum

  • 3D plots and contour plots

    Stepwise descent towards the minimum

    Derivatives work only for few parameters

    [plot by Andrew Ng]

  • Linear regression w. gradient descent 12

    Gradient descent

    Repeat until convergence (simultaneously for $j = 0$ and $j = 1$):

    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$

    $\alpha$: learning rate, $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$: partial derivative

    Beware: incremental update (one parameter after the other) is incorrect; compute both partial derivatives before updating either parameter

    Steps become smaller near the minimum without changing the learning rate, because the gradient shrinks
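
    A minimal NumPy sketch of this update rule (an illustration, not code from the slides; the learning rate and iteration count are arbitrary choices for the unscaled flat data above):

        import numpy as np

        def gradient_descent(x, y, alpha=1e-4, num_iters=1000):
            """Univariate linear regression, h(x) = theta0 + theta1 * x."""
            theta0, theta1 = 0.0, 0.0
            m = len(y)
            for _ in range(num_iters):
                h = theta0 + theta1 * x
                # compute BOTH partial derivatives before updating (simultaneous update)
                grad0 = np.sum(h - y) / m
                grad1 = np.sum((h - y) * x) / m
                theta0 -= alpha * grad0
                theta1 -= alpha * grad1
            return theta0, theta1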

  • Linear regression w. gradient descent 13

    Learning Rate considerations

    Small learning rate leads to slow convergence

    Overly large learning rate may prevent convergence or even cause divergence

    Often several values of $\alpha$ are tried, as in the sketch below
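
    One common practical recipe (my illustration, reusing the gradient_descent and cost sketches above) is to compare the cost after a fixed number of steps for several candidate rates on a logarithmic scale:

        # hypothetical sweep over candidate learning rates
        for alpha in [1e-6, 1e-5, 1e-4]:
            t0, t1 = gradient_descent(x, y, alpha=alpha, num_iters=500)
            print(f"alpha={alpha:g}  J={cost(t0, t1, x, y):.2f}")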

  • Linear regression w. gradient descent 14

    Checking convergence

    Gradient descent works correctly if $J(\theta)$ decreases with every step

    Possible convergence criterion: converged if $J(\theta)$ decreases by less than a small constant $\epsilon$ in one step (sketch below)
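
    The criterion can be bolted onto the gradient-descent sketch above (eps is an arbitrary illustrative threshold; x, y and cost as before):

        def gradient_descent_converged(x, y, alpha=1e-4, eps=1e-9, max_iters=100_000):
            """Stop once J decreases by less than eps in a single step."""
            theta0, theta1 = 0.0, 0.0
            prev_J = cost(theta0, theta1, x, y)
            for _ in range(max_iters):
                h = theta0 + theta1 * x
                grad0 = (h - y).mean()
                grad1 = ((h - y) * x).mean()
                theta0 -= alpha * grad0
                theta1 -= alpha * grad1
                J = cost(theta0, theta1, x, y)
                if prev_J - J < eps:     # converged (or J increased: check the learning rate)
                    break
                prev_J = J
            return theta0, theta1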

  • Linear regression w. gradient descent 15

    Local Minima

    Gradient descent can get stuck at local minima (possible e.g. when J is not the squared-error cost; the squared-error cost of linear regression has a single minimum)

    Remedy: random restart with different initial parameter(s), as sketched below
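
    A sketch of random restarts for a general, possibly non-convex cost (my illustration; the toy cost below is hypothetical, and for the convex least-squares cost of linear regression restarts are unnecessary):

        import numpy as np

        def descend(theta, grad_J, alpha=0.01, num_iters=2000):
            """Plain gradient descent from a given starting point."""
            for _ in range(num_iters):
                theta = theta - alpha * grad_J(theta)
            return theta

        def J(theta):                    # toy cost with two separate minima
            return (theta[0] ** 2 - 1) ** 2 + theta[1] ** 2

        def grad_J(theta):
            return np.array([4 * theta[0] * (theta[0] ** 2 - 1), 2 * theta[1]])

        # run from several random initializations, keep the best result
        rng = np.random.default_rng(0)
        candidates = [descend(rng.uniform(-2, 2, size=2), grad_J) for _ in range(10)]
        best = min(candidates, key=J)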

  • Linear regression w. gradient descent 16

    Variants of Gradient Descent

    Using multiple input features

  • Linear regression w. gradient descent 17

    Multiple features

    Square meters | Bedrooms | Floors | Age of building (years) | Price in 1000s
        x1        |    x2    |   x3   |    x4                   |    y
        200       |    5     |   1    |    45                   |    460
        131       |    3     |   2    |    40                   |    232
        142       |    3     |   2    |    30                   |    315
        756       |    2     |   1    |    36                   |    178

    Notation: $x_j^{(i)}$ is the value of feature j in the i-th training example; n is the number of features

  • Linear regression w. gradient descent 18

    Hypothesis representation

    $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$

    More compact: $h_\theta(x) = \theta^T x$

    with the definition $x_0 = 1$
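
    In NumPy (my sketch, not from the slides), the compact form is a single matrix-vector product once a column of ones supplies $x_0 = 1$:

        import numpy as np

        # feature matrix from the table above, one row per flat
        X = np.array([[200, 5, 1, 45],
                      [131, 3, 2, 40],
                      [142, 3, 2, 30],
                      [756, 2, 1, 36]], dtype=float)
        X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x0 = 1
        theta = np.zeros(X.shape[1])                    # theta0 ... theta4
        h = X @ theta                                   # theta^T x for every example at once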

  • Linear regression w. gradient descent 19

    Gradient descent for multiple variables

    Generalized cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

    Generalized gradient descent: $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$ (simultaneously for all j)

  • Linear regression w. gradient descent 20

    Partial derivative of cost function for multiple variables

    Calculating the partial derivative gives $\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

  • Linear regression w. gradient descent 21

    Gradient descent for multiple variables

    Simplified gradient descent: $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ (simultaneously for all j)
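
    A vectorized sketch of this simultaneous update (illustrative; X is assumed to carry the x0 = 1 column as above, and alpha is tiny only because the features are unscaled, which the next slides address):

        def gradient_descent_multi(X, y, alpha=1e-7, num_iters=10_000):
            """All theta_j are updated simultaneously in each step."""
            m, n = X.shape
            theta = np.zeros(n)
            for _ in range(num_iters):
                grad = X.T @ (X @ theta - y) / m   # every partial derivative at once
                theta -= alpha * grad
            return theta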

  • Linear regression w. gradient descent 22

    Convergence considerations for multiple variables

    With multiple variables, feature scales can vary strongly, so variances are no longer comparable across features

    Gradient descent converges faster for features on similar scale

    Square meters 30 - 400

    Bedrooms 1 - 10

    Price 80 000 - 2 000 000

  • Linear regression w. gradient descent 23

    Feature Scaling

  • Linear regression w. gradient descent 24

    Feature scaling

    Different approaches for converting features to comparable scale

    Min-Max scaling makes all data fall into range [0, 1]:

    $x_j' = \frac{x_j - \min_j}{\max_j - \min_j}$ (for a single data point of feature j, with min and max taken over the training data; see the sketch after this list)

    Z-score conversion
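
    A minimal Min-Max scaling sketch in NumPy (my illustration; z-scores follow on the next slide):

        def min_max_scale(X):
            """Map each feature column into [0, 1]."""
            mins, maxs = X.min(axis=0), X.max(axis=0)
            return (X - mins) / (maxs - mins)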

  • Linear regression w. gradient descent 25

    Z-Score conversion

    Center data on 0

    Scale data so the majority falls into range [-1, 1]

    Z-score conversion of a single data point for feature j:

    $x_j' = \frac{x_j - \mu_j}{\sigma_j}$

    $\mu_j$: mean / empirical expected value, $\sigma_j$: empirical standard deviation
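
    The matching NumPy sketch (illustrative; ddof=1 gives the sample standard deviation, which I take "empirical" to mean, though ddof=0 is also common):

        def z_score(X):
            """Center each feature on 0 and scale by its standard deviation."""
            mu = X.mean(axis=0)
            sigma = X.std(axis=0, ddof=1)   # empirical (sample) standard deviation
            return (X - mu) / sigma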

  • Linear regression w. gradient descent 26

    Visualizing standard deviation

  • Linear regression w. gradient descent 27

    Nonlinear Regression (by cheap trickery)

  • Linear regression w. gradient descent 28

    Nonlinear Regression Problems

  • Linear regression w. gradient descent 29

    Nonlinear Regression Problems (linear approximation)

  • Linear regression w. gradient descent 30

    Nonlinear Regression Problems (nonlinear hypothesis)

  • Linear regression w. gradient descent 31

    Nonlinear Regression with cheap trickery

    Linear Regression can be used for nonlinear problems

    Choose a nonlinear hypothesis space, e.g. $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$, treating $x$ and $x^2$ as separate features (see the sketch below)
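
    A sketch of the trick (my illustration, reusing the hypothetical gradient_descent_multi from above): fit a parabola by handing $x$ and $x^2$ to plain linear regression as two features.

        import numpy as np

        x = np.linspace(0.0, 2.0, 50)
        y = 1.0 + 0.5 * x + 2.0 * x ** 2                     # toy quadratic data
        X = np.column_stack([np.ones_like(x), x, x ** 2])    # features: 1, x, x^2
        theta = gradient_descent_multi(X, y, alpha=0.05, num_iters=50_000)
        # theta approaches [1.0, 0.5, 2.0]: the linear machinery fit a parabola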

  • Linear regression w. gradient descent 32

    Optimizing cost using derivatives

  • Linear regression w. gradient descent 33

    Comparison Gradient Descent vs. Setting derivative = 0

    Instead of gradient descent, solve $\frac{\partial}{\partial \theta_i} J(\theta) = 0$ for all i
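
    For the least-squares cost this system has the standard closed-form solution $\theta = (X^T X)^{-1} X^T y$ (the normal equation). A NumPy sketch:

        import numpy as np

        def normal_equation(X, y):
            """Solve dJ/d(theta_i) = 0 for all i in closed form."""
            # solving the linear system is more stable than inverting X^T X
            return np.linalg.solve(X.T @ X, X.T @ y)
        # np.linalg.lstsq(X, y, rcond=None) also works when X^T X is singular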

  • Linear regression w. gradient descent 34

    Comparison Gradient Descent vs. Setting derivative = 0

    Gradient Descent

    Need to choose the learning rate $\alpha$

    Needs many iterations, random restarts etc.

    Works well for many features

    Setting derivative = 0

    No need to choose a learning rate

    No iterations

    Slow for many features

  • Linear regression w. gradient descent 35

    This lecture covers

    Linear Regression

    Hypothesis formulation, hypothesis space

    Optimizing Cost with Gradient Descent

    Using multiple input features with Linear Regression

    Feature Scaling

    Nonlinear Regression

    Optimizing Cost using derivatives

  • Linear regression w. gradient descent 36

    Pictures

    Some public domain plots from en.wikipedia.org and de.wikipedia.org
