Linear Regression
Marc Deisenroth
Centre for Artificial Intelligence
Department of Computer Science
University College London
https://deisenroth.cc
AIMS Rwanda and AIMS Ghana
March/April 2020
Reference
MATHEMATICS FOR MACHINE LEARNING
Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong

The fundamental mathematical tools needed to understand machine learning include linear algebra, analytic geometry, matrix decompositions, vector calculus, optimization, probability and statistics. These topics are traditionally taught in disparate courses, making it hard for data science or computer science students, or professionals, to efficiently learn the mathematics. This self-contained textbook bridges the gap between mathematical and machine learning texts, introducing the mathematical concepts with a minimum of prerequisites. It uses these concepts to derive four central machine learning methods: linear regression, principal component analysis, Gaussian mixture models and support vector machines. For students and others with a mathematical background, these derivations provide a starting point to machine learning texts. For those learning the mathematics for the first time, the methods help build intuition and practical experience with applying mathematical concepts. Every chapter includes worked examples and exercises to test understanding. Programming tutorials are offered on the book’s web site.

MARC PETER DEISENROTH is Senior Lecturer in Statistical Machine Learning at the Department of Computing, Imperial College London.

A. ALDO FAISAL leads the Brain & Behaviour Lab at Imperial College London, where he is also Reader in Neurotechnology at the Department of Bioengineering and the Department of Computing.

CHENG SOON ONG is Principal Research Scientist at the Machine Learning Research Group, Data61, CSIRO. He is also Adjunct Associate Professor at Australian National University.
https://mml-book.com
Chapter 9
Marc Deisenroth (UCL) Linear Regression March/April 2020 2
Regression Problems
Regression (curve fitting)
Given inputs x ∈ ℝ^D and corresponding observations y ∈ ℝ, find a function f that models the relationship between x and y.

[Figure: scatter plot of observations y against inputs x]

Typically we parametrize the function f with parameters θ.

Linear regression: consider functions f that are linear in the parameters.
Marc Deisenroth (UCL) Linear Regression March/April 2020 3
Linear Regression Functions
Straight lines:

y = f(x, θ) = θ_0 + θ_1 x = [θ_0  θ_1] [1, x]^⊤

Polynomials:

y = f(x, θ) = Σ_{m=0}^{M} θ_m x^m = [θ_0 ⋯ θ_M] [1, …, x^M]^⊤

Radial basis function networks:

y = f(x, θ) = Σ_{m=1}^{M} θ_m exp(−½ (x − μ_m)²)

[Figure: three panels plotting f(x) against x for example straight lines, polynomials, and RBF networks]
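All three parametrizations are linear in θ: each is a fixed feature vector dotted with the parameters. A minimal sketch, with hypothetical parameter values and RBF centres chosen purely for illustration:

```python
import numpy as np

x = np.linspace(-4, 4, 9)

theta_line = np.array([0.5, -1.0])            # hypothetical line parameters
theta_poly = np.array([0.0, 1.0, 0.0, -0.1])  # hypothetical cubic coefficients
mu = np.array([-2.0, 0.0, 2.0])               # hypothetical RBF centres
theta_rbf = np.array([1.0, -0.5, 0.8])

# Straight line: f(x) = theta_0 + theta_1 * x
f_line = np.stack([np.ones_like(x), x], axis=1) @ theta_line

# Polynomial: f(x) = sum_m theta_m * x^m
f_poly = np.vander(x, len(theta_poly), increasing=True) @ theta_poly

# RBF network: f(x) = sum_m theta_m * exp(-(x - mu_m)^2 / 2)
f_rbf = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2) @ theta_rbf
```

In every case the nonlinearity sits in the features, not in θ, which is what makes the closed-form estimators on the following slides possible.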
Linear Regression Model and Setting
y = x^⊤θ + ε,  ε ~ N(0, σ²)

Given a training set (x_1, y_1), …, (x_N, y_N), we seek optimal parameters θ*.

Approaches: maximum likelihood estimation; maximum a posteriori estimation.
Maximum Likelihood
Define X = [x_1, …, x_N]^⊤ ∈ ℝ^{N×D} and y = [y_1, …, y_N]^⊤ ∈ ℝ^N.

Find parameters θ* that maximize the likelihood

p(y_1, …, y_N | x_1, …, x_N, θ) = p(y | X, θ) = ∏_{n=1}^{N} N(y_n | x_n^⊤θ, σ²)

Log-transformation: maximize the log-likelihood

log p(y | X, θ) = Σ_{n=1}^{N} log N(y_n | x_n^⊤θ, σ²),

log N(y_n | x_n^⊤θ, σ²) = −(1/(2σ²)) (y_n − x_n^⊤θ)² + const
Maximum Likelihood (2)
With

log N(y_n | x_n^⊤θ, σ²) = −(1/(2σ²)) (y_n − x_n^⊤θ)² + const

we get

log p(y | X, θ) = Σ_{n=1}^{N} log N(y_n | x_n^⊤θ, σ²)
= −(1/(2σ²)) Σ_{n=1}^{N} (y_n − x_n^⊤θ)² + const
= −(1/(2σ²)) (y − Xθ)^⊤(y − Xθ) + const
= −(1/(2σ²)) ‖y − Xθ‖² + const

Computing the gradient with respect to θ and setting it to 0 gives the maximum likelihood estimator (least-squares estimator)

θ_ML = (X^⊤X)^{−1} X^⊤ y
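As a minimal numerical sketch (toy data with a hypothetical true parameter vector, not from the slides), the closed-form estimator can be computed by solving the normal equations rather than forming the matrix inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = x^T theta + noise, with a hypothetical true theta
N, D = 50, 2
X = rng.normal(size=(N, D))
theta_true = np.array([1.5, -0.7])
sigma = 0.1
y = X @ theta_true + sigma * rng.normal(size=N)

# Maximum likelihood (least-squares) estimator: theta_ML = (X^T X)^{-1} X^T y
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_ml)  # close to theta_true
```

Solving the linear system is numerically preferable to computing (X^⊤X)^{−1} and multiplying.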
Making Predictions
y = x^⊤θ + ε,  ε ~ N(0, σ²)

Given an arbitrary input x*, we can predict the corresponding observation y* using the maximum likelihood parameter:

p(y* | x*, θ_ML) = N(y* | x*^⊤ θ_ML, σ²)

The measurement noise variance σ² is assumed known.

In the absence of noise (σ² = 0), the prediction is deterministic.
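A small sketch of this predictive distribution, with a hypothetical fitted θ_ML and an assumed-known noise variance (neither comes from the slides' data):

```python
import numpy as np

theta_ml = np.array([1.5, -0.7])  # hypothetical fitted parameters
sigma2 = 0.01                     # assumed-known noise variance

x_star = np.array([0.3, 2.0])     # arbitrary query input

# Predictive distribution p(y* | x*, theta_ML) = N(y* | x*^T theta_ML, sigma^2):
# the mean comes from the fitted model, the variance only from the noise.
mean = x_star @ theta_ml
var = sigma2
print(mean, var)
```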
Example 1: Linear Functions
[Figure: training data and the MLE fit, y against x]

y = θ_0 + θ_1 x + ε,  ε ~ N(0, σ²)

At any query point x* we obtain the mean prediction as

E[y* | θ_ML, x*] = θ_0^ML + θ_1^ML x*
Nonlinear Functions
y = φ(x)^⊤θ + ε = Σ_{m=0}^{M} θ_m x^m + ε

Polynomial regression with features

φ(x) = [1, x, x², …, x^M]^⊤

Maximum likelihood estimator:

θ_ML = (Φ^⊤Φ)^{−1} Φ^⊤ y, where Φ = [φ(x_1), …, φ(x_N)]^⊤ is the feature (design) matrix.
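A sketch of polynomial regression under these definitions, on toy data drawn from a hypothetical sin ground truth; `np.vander` builds the feature matrix Φ row by row:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D data from a hypothetical nonlinear function plus noise
x = rng.uniform(-4, 4, size=60)
y = np.sin(x) + 0.1 * rng.normal(size=60)

M = 5  # polynomial degree

# Feature matrix Phi with rows phi(x_n) = [1, x_n, x_n^2, ..., x_n^M]
Phi = np.vander(x, M + 1, increasing=True)

# Maximum likelihood estimator theta_ML = (Phi^T Phi)^{-1} Phi^T y,
# computed via least squares for numerical stability
theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Mean prediction at query points
x_star = np.linspace(-4, 4, 5)
y_star = np.vander(x_star, M + 1, increasing=True) @ theta_ml
```

The model stays linear in θ even though the fitted curve is nonlinear in x.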
Example 2: Polynomial Regression
Figure: Training data
[Figures: MLE fits to the training data with polynomials of order 2, 4, 6, 8, and 10]
Overfitting
[Figure: RMSE against polynomial degree, for training error and test error]

Training error decreases with higher flexibility of the model.

We are not so much interested in the training error as in the generalization error: how well does the model perform when we predict at previously unseen input locations?

Maximum likelihood often runs into overfitting problems, i.e., we exploit the flexibility of the model to fit the noise in the data.
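The training/test behaviour can be reproduced on toy data (hypothetical sin ground truth and noise level; nothing here comes from the slides' dataset):

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    # Hypothetical ground-truth function sin(x) plus observation noise
    x = rng.uniform(-4, 4, size=n)
    return x, np.sin(x) + 0.2 * rng.normal(size=n)

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

def rmse(x, y, theta):
    pred = np.vander(x, len(theta), increasing=True) @ theta
    return np.sqrt(np.mean((pred - y) ** 2))

train_err, test_err = [], []
for M in range(11):
    Phi = np.vander(x_train, M + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
    train_err.append(rmse(x_train, y_train, theta))
    test_err.append(rmse(x_test, y_test, theta))

# Training error can only decrease as the degree grows (nested models);
# test error typically decreases and then rises again as the model overfits.
```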
MAP Estimation
Empirical observation: parametric models that overfit tend to have some extreme (large-amplitude) parameter values.

Mitigate the effect of overfitting by placing a prior distribution p(θ) on the parameters.

Penalize extreme values that are implausible under that prior.

Choose θ* as the parameter that maximizes the (log) parameter posterior

log p(θ | X, y) = log p(y | X, θ) + log p(θ) + const,

where the first term is the log-likelihood and the second the log-prior.

The log-prior induces a direct penalty on the parameters.

Maximum a posteriori estimate (regularized least squares).
MAP Estimation (2)
Gaussian parameter prior p(θ) = N(0, α²I)

Log-posterior distribution:

log p(θ | X, y) = −(1/(2σ²)) (y − Xθ)^⊤(y − Xθ) − (1/(2α²)) θ^⊤θ + const
= −(1/(2σ²)) ‖y − Xθ‖² − (1/(2α²)) ‖θ‖² + const

Compute the gradient with respect to θ and set it to 0.

Maximum a posteriori estimate:

θ_MAP = (X^⊤X + (σ²/α²) I)^{−1} X^⊤ y
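A minimal sketch of the MAP estimator on toy data, with hypothetical noise and prior variances; it coincides with ridge regression with regularizer λ = σ²/α²:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data with a hypothetical ground-truth parameter vector
N, D = 30, 4
X = rng.normal(size=(N, D))
theta_true = rng.normal(size=D)
sigma2, alpha2 = 0.1 ** 2, 1.0  # noise variance, prior variance (assumed)
y = X @ theta_true + np.sqrt(sigma2) * rng.normal(size=N)

# MAP estimate theta_MAP = (X^T X + (sigma^2/alpha^2) I)^{-1} X^T y
lam = sigma2 / alpha2
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Compared with maximum likelihood, the MAP estimate shrinks towards 0
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)
```

The shrinkage follows from the prior penalty: every singular direction of X is scaled by s²/(s² + λ) ≤ 1 relative to the least-squares solution.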
Example: Polynomial Regression
[Figure: Training data]

Mean prediction:

E[y* | x*, θ_MAP] = φ(x*)^⊤ θ_MAP
[Figures: MLE and MAP fits to the training data with polynomials of order 2, 4, 6, 8, and 10]
Generalization Error
[Figure: RMSE on the test set against polynomial degree, for maximum likelihood and maximum a posteriori]

MAP estimation “delays” the problem of overfitting.

It does not provide a general solution; we need a more principled solution.
Bayesian Linear Regression
y = φ(x)^⊤θ + ε,  ε ~ N(0, σ²)

Avoid overfitting by not fitting any parameters: integrate the parameters out instead of optimizing them.

Use a full parameter distribution p(θ), and not a single point estimate θ*, when making predictions:

p(y* | x*) = ∫ p(y* | x*, θ) p(θ) dθ

The prediction no longer depends on θ.

The predictive distribution reflects the uncertainty about the “correct” parameter setting.
Example
−4 −2 0 2 4x
−4
−2
0
2
4
y
Training data
MLE
MAP
BLR
−4 −2 0 2 4x
−4
−2
0
2
4
y
Light-gray: uncertainty due to noise (same as in MLE/MAP)
Dark-gray: uncertainty due to parameter uncertainty
Right: Plausible functions under the parameter distribution(every single parameter setting describes one function)
Marc Deisenroth (UCL) Linear Regression March/April 2020 18
Model for Bayesian Linear Regression

Prior: $p(\theta) = \mathcal{N}(m_0, S_0)$
Likelihood: $p(y \mid x, \theta) = \mathcal{N}(y \mid \phi^\top(x)\theta, \sigma^2)$

- The parameter $\theta$ becomes a latent (random) variable.
- The prior distribution induces a distribution over plausible functions.
- Choosing a conjugate Gaussian prior allows closed-form computations and yields a Gaussian posterior.
Parameter Posterior and Predictions

A Gaussian prior $p(\theta) = \mathcal{N}(m_0, S_0)$ gives a Gaussian posterior (exercise: derive this):

$p(\theta \mid X, y) = \mathcal{N}(m_N, S_N)$
$S_N = (S_0^{-1} + \sigma^{-2}\Phi^\top\Phi)^{-1}$
$m_N = S_N (S_0^{-1} m_0 + \sigma^{-2}\Phi^\top y)$

- The mean $m_N$ is identical to the MAP estimate.

Assume a Gaussian distribution $p(\theta) = \mathcal{N}(m_N, S_N)$. Then

$p(y_* \mid x_*) = \mathcal{N}\big(y_* \mid \phi^\top(x_*) m_N, \; \phi^\top(x_*) S_N \phi(x_*) + \sigma^2\big)$

- $\phi^\top(x_*) S_N \phi(x_*)$ accounts for parameter uncertainty in the predictive variance.
- More details: https://mml-book.com, Chapter 9.
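A minimal numpy sketch of these formulas. The dataset, the polynomial features $[1, x, x^2]$, and the hyperparameters below are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 0.1                                  # noise variance σ² (assumed)
m0, S0 = np.zeros(3), np.eye(3)               # prior N(m0, S0)

# Synthetic data from an assumed ground truth f(x) = 0.5 x² - x
X = np.linspace(-3, 3, 20)
y = 0.5 * X**2 - X + sigma2**0.5 * rng.standard_normal(X.size)

Phi = np.stack([np.ones_like(X), X, X**2], axis=1)   # N x 3 feature matrix

# S_N = (S0⁻¹ + σ⁻² Φᵀ Φ)⁻¹,  m_N = S_N (S0⁻¹ m0 + σ⁻² Φᵀ y)
SN = np.linalg.inv(np.linalg.inv(S0) + Phi.T @ Phi / sigma2)
mN = SN @ (np.linalg.inv(S0) @ m0 + Phi.T @ y / sigma2)

def predict(x_star):
    """Predictive mean and variance at a test input x*."""
    phi_s = np.array([1.0, x_star, x_star**2])
    mean = phi_s @ mN
    var = phi_s @ SN @ phi_s + sigma2         # parameter + noise uncertainty
    return mean, var

print(predict(0.0))
```

Note that the predictive variance is always at least $\sigma^2$; the extra term $\phi^\top(x_*) S_N \phi(x_*)$ grows away from the data.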
Marginal Likelihood

The marginal likelihood can be computed analytically. With $p(\theta) = \mathcal{N}(\mu, \Sigma)$:

$p(y \mid X) = \int p(y \mid X, \theta)\, p(\theta)\, \mathrm{d}\theta = \mathcal{N}(y \mid \Phi\mu, \; \Phi\Sigma\Phi^\top + \sigma^2 I)$

- Derivation via completing the squares (see Section 9.3.5 of the MML book).
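The closed form above is a multivariate Gaussian density in $y$, so it can be evaluated directly as a log-density. A sketch with assumed data and hyperparameters (none of the specific numbers come from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 0.2
mu, Sigma = np.zeros(2), np.eye(2)            # prior N(μ, Σ) on θ

X = np.linspace(-2, 2, 15)
Phi = np.stack([np.ones_like(X), X], axis=1)
theta_true = np.array([0.5, -1.0])            # assumed ground truth
y = Phi @ theta_true + sigma2**0.5 * rng.standard_normal(X.size)

def log_marginal_likelihood(y, Phi, mu, Sigma, sigma2):
    """log N(y | Φμ, ΦΣΦᵀ + σ²I), evaluated stably via slogdet/solve."""
    m = Phi @ mu                                        # mean Φμ
    K = Phi @ Sigma @ Phi.T + sigma2 * np.eye(len(y))   # cov ΦΣΦᵀ + σ²I
    diff = y - m
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (len(y) * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(K, diff))

print(log_marginal_likelihood(y, Phi, mu, Sigma, sigma2))
```

The marginal likelihood scores the model as a whole (features, prior, noise level), which is what makes it useful for model comparison.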
Distribution over Functions

Consider a linear regression setting

$y = f(x) + \varepsilon = a + bx + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \quad p(a, b) = \mathcal{N}(0, I)$

(Figure: contour of the prior $p(a, b)$ in parameter space.)
Sampling from the Prior over Functions

Consider a linear regression setting

$y = f(x) + \varepsilon = a + bx + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \quad p(a, b) = \mathcal{N}(0, I)$

$f_i(x) = a_i + b_i x, \quad [a_i, b_i] \sim p(a, b)$
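Drawing such function samples is two lines of numpy: sample $(a_i, b_i)$ pairs, then evaluate the corresponding lines on a grid (grid and sample count below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
xs = np.linspace(-5, 5, 101)                  # evaluation grid (assumed)

ab = rng.standard_normal((10, 2))             # 10 samples (a_i, b_i) ~ N(0, I)
fs = ab[:, [0]] + ab[:, [1]] * xs             # each row is one line f_i(x) = a_i + b_i x

print(fs.shape)
```

Plotting the rows of `fs` gives the familiar "spaghetti plot" of functions drawn from the prior.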
Sampling from the Posterior over Functions

Consider a linear regression setting

$y = f(x) + \varepsilon = a + bx + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \quad p(a, b) = \mathcal{N}(0, I)$

$X = [x_1, \ldots, x_N], \quad y = [y_1, \ldots, y_N]$ (training inputs/targets)

(Figure: training data in the $x$-$y$ plane.)
Sampling from the Posterior over Functions

Consider a linear regression setting

$y = f(x) + \varepsilon = a + bx + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \quad p(a, b) = \mathcal{N}(0, I)$

$p(a, b \mid X, y) = \mathcal{N}(m_N, S_N)$ (posterior)

(Figure: contour of the posterior $p(a, b \mid X, y)$ in parameter space.)
Sampling from the Posterior over Functions

Consider a linear regression setting

$y = f(x) + \varepsilon = a + bx + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2)$

$[a_i, b_i] \sim p(a, b \mid X, y), \quad f_i(x) = a_i + b_i x$
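Putting the last three slides together: compute the posterior $\mathcal{N}(m_N, S_N)$ over $(a, b)$ from data, then sample lines from it. The data-generating parameters and noise level below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2_n = 1.0                                # noise variance σ_n² (assumed)

# Assumed ground truth y = 1.0 + 0.8 x + noise
X = rng.uniform(-10, 10, size=30)
y = 1.0 + 0.8 * X + sigma2_n**0.5 * rng.standard_normal(X.size)

Phi = np.stack([np.ones_like(X), X], axis=1)  # features (1, x) for (a, b)
SN = np.linalg.inv(np.eye(2) + Phi.T @ Phi / sigma2_n)   # with S0 = I, m0 = 0
mN = SN @ (Phi.T @ y / sigma2_n)

# Draw parameter samples from the posterior and turn them into functions
ab = rng.multivariate_normal(mN, SN, size=5)  # [a_i, b_i] ~ p(a, b | X, y)
xs = np.linspace(-10, 10, 201)
fs = ab[:, [0]] + ab[:, [1]] * xs             # f_i(x) = a_i + b_i x
```

Compared with the prior samples, these lines all pass close to the training data: the posterior has concentrated around the plausible $(a, b)$ values.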
Fitting Nonlinear Functions

Fit nonlinear functions using (Bayesian) linear regression: a linear combination of nonlinear features.

Example: radial-basis-function (RBF) network

$f(x) = \sum_{i=1}^{n} \theta_i \phi_i(x), \quad \theta_i \sim \mathcal{N}(0, \sigma_p^2)$

where $\phi_i(x) = \exp\!\big(-\tfrac{1}{2}(x - \mu_i)^\top(x - \mu_i)\big)$ for given "centers" $\mu_i$.
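The feature map above is easy to write down for scalar inputs. This sketch uses the 25 centers in $[-5, 3]$ from the following illustration slide; the evaluation grid is an assumption:

```python
import numpy as np

mu = np.linspace(-5, 3, 25)                   # centers μ_i, linearly spaced

def rbf_features(x):
    """Feature vector (φ_1(x), ..., φ_n(x)) with φ_i(x) = exp(-(x - μ_i)²/2)."""
    return np.exp(-0.5 * (x - mu) ** 2)

# Feature matrix Φ: one row of 25 features per input location
Phi = np.stack([rbf_features(x) for x in np.linspace(-5, 5, 100)])
print(Phi.shape)
```

With `Phi` in hand, the Bayesian linear regression machinery from the earlier slides applies unchanged; only the features are nonlinear.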
Illustration: Fitting a Radial Basis Function Network

$\phi_i(x) = \exp\!\big(-\tfrac{1}{2}(x - \mu_i)^\top(x - \mu_i)\big)$

(Figure: the basis functions plotted over $x \in [-5, 5]$.)

Place Gaussian-shaped basis functions $\phi_i$ at 25 input locations $\mu_i$, linearly spaced in the interval $[-5, 3]$.
Samples from the RBF Prior

$f(x) = \sum_{i=1}^{n} \theta_i \phi_i(x), \quad p(\theta) = \mathcal{N}(0, I)$

(Figure: functions sampled from the RBF prior.)
Samples from the RBF Posterior

$f(x) = \sum_{i=1}^{n} \theta_i \phi_i(x), \quad p(\theta \mid X, y) = \mathcal{N}(m_N, S_N)$

(Figure: functions sampled from the RBF posterior.)
RBF Posterior

(Figure: posterior mean and uncertainty bands of the RBF model.)
Limitations

(Figure: RBF posterior; the uncertainty collapses where no basis functions are placed.)

- Feature engineering: which basis functions should we use?
- Finite number of features: in the figure above, there are no basis functions on the right, so we cannot express any variability of the function there.
- Ideally: add more (infinitely many) basis functions.
Approach

- Instead of sampling parameters, which induce a distribution over functions, sample functions directly.
- Place a prior on functions: make assumptions on the distribution of functions.
- Intuition: a function is an infinitely long vector of function values; make assumptions on the distribution of function values.
- This leads to the Gaussian process.
Summary

(Figures: test-set RMSE vs. polynomial degree for maximum likelihood and MAP; training data with MLE/MAP/BLR fits; posterior over $(a, b)$; functions sampled from the posterior.)

- Regression = curve fitting.
- Linear regression = linear in the parameters.
- Parameter estimation via maximum likelihood and MAP estimation can lead to overfitting.
- Bayesian linear regression addresses this issue, but may not be analytically tractable.
- Predictive uncertainty in Bayesian linear regression explicitly accounts for parameter uncertainty.
- A distribution over parameters induces a distribution over functions.
Appendix
Joint Gaussian Distribution

Joint Gaussian distribution

$p(x, y) = \mathcal{N}\!\left(\begin{bmatrix}\mu_x \\ \mu_y\end{bmatrix}, \begin{bmatrix}\Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy}\end{bmatrix}\right)$

Marginal:

$p(x) = \int p(x, y)\, \mathrm{d}y = \mathcal{N}(\mu_x, \Sigma_{xx})$

Conditional:

$p(x \mid y) = \mathcal{N}(\mu_{x|y}, \Sigma_{x|y})$
$\mu_{x|y} = \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y)$
$\Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$
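The conditioning formulas translate directly into a small helper. The example blocks below are made-up numbers for a 1-D/1-D joint:

```python
import numpy as np

def condition(mu_x, mu_y, Sxx, Sxy, Syy, y):
    """Parameters of p(x | y) for a joint Gaussian with the given blocks."""
    K = Sxy @ np.linalg.inv(Syy)
    mu_cond = mu_x + K @ (y - mu_y)           # μ_x + Σ_xy Σ_yy⁻¹ (y - μ_y)
    S_cond = Sxx - K @ Sxy.T                  # Σ_xx - Σ_xy Σ_yy⁻¹ Σ_yx (Σ_yx = Σ_xyᵀ)
    return mu_cond, S_cond

# Illustrative blocks (assumed values)
mu_x, mu_y = np.array([0.0]), np.array([1.0])
Sxx = np.array([[2.0]])
Sxy = np.array([[1.0]])
Syy = np.array([[1.5]])

m, S = condition(mu_x, mu_y, Sxx, Sxy, Syy, y=np.array([2.0]))
print(m, S)
```

Observing $y$ shifts the mean toward the observation and shrinks the covariance; note that $\Sigma_{x|y}$ does not depend on the observed value $y$.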
Linear Transformation of Gaussian Random Variables

If $x \sim \mathcal{N}(x \mid \mu, \Sigma)$ and $z = Ax + b$, then

$p(z) = \mathcal{N}(z \mid A\mu + b, \; A\Sigma A^\top)$
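This rule is easy to check empirically: transform samples of $x$ and compare the empirical moments of $z$ with $A\mu + b$ and $A\Sigma A^\top$. The particular $A$, $b$, $\mu$, $\Sigma$ below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
A = np.array([[1.0, 2.0], [0.0, 1.0]])
b = np.array([3.0, 0.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)
z = x @ A.T + b                               # z = A x + b, row-wise

print(z.mean(axis=0))                         # should approach A μ + b
print(np.cov(z.T))                            # should approach A Σ Aᵀ
```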
Product of Two Gaussians

Let $x \in \mathbb{R}^D$. Then

$\mathcal{N}(x \mid a, A)\, \mathcal{N}(x \mid b, B) = Z\, \mathcal{N}(x \mid c, C)$
$C = (A^{-1} + B^{-1})^{-1}$
$c = C(A^{-1}a + B^{-1}b)$
$Z = (2\pi)^{-\frac{D}{2}} |A + B|^{-\frac{1}{2}} \exp\!\big(-\tfrac{1}{2}(a - b)^\top (A + B)^{-1} (a - b)\big)$

- The product of two Gaussians is an unnormalized Gaussian.
- The "un-normalizer" $Z$ has a Gaussian functional form:

$Z = \mathcal{N}(a \mid b, A + B) = \mathcal{N}(b \mid a, A + B)$

- Note: this is not a distribution (no random variables).
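A numerical check of the identity: compute $C$, $c$, and $Z$, then verify that the product of the two densities equals $Z\,\mathcal{N}(x \mid c, C)$ at a test point. All numbers below are assumed example values:

```python
import numpy as np

def gauss_pdf(x, m, S):
    """Multivariate Gaussian density N(x | m, S)."""
    d = len(m)
    diff = x - m
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(S) ** -0.5
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(S, diff))

a, A = np.array([1.0, 0.0]), np.diag([2.0, 1.0])
b, B = np.array([0.0, 1.0]), np.diag([1.0, 3.0])

C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
c = C @ (np.linalg.solve(A, a) + np.linalg.solve(B, b))
Z = gauss_pdf(a, b, A + B)                    # Z = N(a | b, A + B)

# The identity N(x|a,A) N(x|b,B) = Z N(x|c,C) should hold pointwise:
x = np.array([0.3, -0.7])
lhs = gauss_pdf(x, a, A) * gauss_pdf(x, b, B)
rhs = Z * gauss_pdf(x, c, C)
print(lhs, rhs)
```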
Example: Marginalization of a Product

$p_1(x) = \mathcal{N}(x \mid a, A), \quad p_2(x) = \mathcal{N}(x \mid b, B)$

Then

$\int p_1(x)\, p_2(x)\, \mathrm{d}x = Z = \mathcal{N}(a \mid b, A + B) \in \mathbb{R}$

Note: in this context, $\mathcal{N}$ is used to describe the functional relationship between $a$ and $b$. Do not treat $a$ or $b$ as random variables; they are both deterministic quantities.
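In one dimension this marginalization can be checked by brute-force numerical integration (the grid and the example values of $a$, $A$, $b$, $B$ are assumptions):

```python
import numpy as np

def npdf(x, m, v):
    """Univariate Gaussian density N(x | m, v)."""
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

a, A = 1.0, 0.5
b, B = -0.5, 2.0

# Riemann-sum approximation of ∫ N(x|a,A) N(x|b,B) dx on a wide, fine grid
xs = np.linspace(-20, 20, 200_001)
dx = xs[1] - xs[0]
integral = np.sum(npdf(xs, a, A) * npdf(xs, b, B)) * dx

print(integral, npdf(a, b, A + B))            # the two numbers should agree
```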