Linear Regression
Marc Deisenroth
Centre for Artificial Intelligence
Department of Computer Science
University College London
https://deisenroth.cc
AIMS Rwanda and AIMS Ghana
March/April 2020
Reference
MATHEMATICS FOR MACHINE LEARNING
Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong

The fundamental mathematical tools needed to understand machine learning include linear algebra, analytic geometry, matrix decompositions, vector calculus, optimization, probability and statistics. These topics are traditionally taught in disparate courses, making it hard for data science or computer science students, or professionals, to efficiently learn the mathematics. This self-contained textbook bridges the gap between mathematical and machine learning texts, introducing the mathematical concepts with a minimum of prerequisites. It uses these concepts to derive four central machine learning methods: linear regression, principal component analysis, Gaussian mixture models and support vector machines. For students and others with a mathematical background, these derivations provide a starting point to machine learning texts. For those learning the mathematics for the first time, the methods help build intuition and practical experience with applying mathematical concepts. Every chapter includes worked examples and exercises to test understanding. Programming tutorials are offered on the book’s web site.

MARC PETER DEISENROTH is Senior Lecturer in Statistical Machine Learning at the Department of Computing, Imperial College London.

A. ALDO FAISAL leads the Brain & Behaviour Lab at Imperial College London, where he is also Reader in Neurotechnology at the Department of Bioengineering and the Department of Computing.

CHENG SOON ONG is Principal Research Scientist at the Machine Learning Research Group, Data61, CSIRO. He is also Adjunct Associate Professor at Australian National University.
https://mml-book.com
Chapter 9
Marc Deisenroth (UCL) Linear Regression March/April 2020 2
Regression Problems
Regression (curve fitting)
Given inputs x ∈ ℝ^D and corresponding observations y ∈ ℝ, find a function f that models the relationship between x and y.

[Figure: scatter plot of observations y against inputs x]

Typically we parametrize the function f with parameters θ.

Linear regression: consider functions f that are linear in the parameters.
Marc Deisenroth (UCL) Linear Regression March/April 2020 3
Linear Regression Functions
Straight lines:

y = f(x, θ) = θ_0 + θ_1 x = [θ_0  θ_1] [1, x]^⊤

Polynomials:

y = f(x, θ) = Σ_{m=0}^{M} θ_m x^m = [θ_0 ⋯ θ_M] [1, …, x^M]^⊤

Radial basis function networks:

y = f(x, θ) = Σ_{m=1}^{M} θ_m exp(−½ (x − μ_m)²)

[Figure: three panels plotting f(x) against x for example straight lines, polynomials, and RBF networks]
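All three parametrizations are linear in θ: each is a fixed feature vector dotted with the parameters. A minimal sketch, with hypothetical parameter values and RBF centres chosen purely for illustration:

```python
import numpy as np

x = np.linspace(-4, 4, 9)

theta_line = np.array([0.5, -1.0])            # hypothetical line parameters
theta_poly = np.array([0.0, 1.0, 0.0, -0.1])  # hypothetical cubic coefficients
mu = np.array([-2.0, 0.0, 2.0])               # hypothetical RBF centres
theta_rbf = np.array([1.0, -0.5, 0.8])

# Straight line: f(x) = theta_0 + theta_1 * x
f_line = np.stack([np.ones_like(x), x], axis=1) @ theta_line

# Polynomial: f(x) = sum_m theta_m * x^m
f_poly = np.vander(x, len(theta_poly), increasing=True) @ theta_poly

# RBF network: f(x) = sum_m theta_m * exp(-(x - mu_m)^2 / 2)
f_rbf = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2) @ theta_rbf
```

In every case the nonlinearity sits in the features, not in θ, which is what makes the closed-form estimators on the following slides possible.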
Linear Regression Model and Setting
y = x^⊤θ + ε,  ε ~ N(0, σ²)

Given a training set (x_1, y_1), …, (x_N, y_N), we seek optimal parameters θ*.

Approaches: maximum likelihood estimation; maximum a posteriori estimation.
Maximum Likelihood
Define X = [x_1, …, x_N]^⊤ ∈ ℝ^{N×D} and y = [y_1, …, y_N]^⊤ ∈ ℝ^N.

Find parameters θ* that maximize the likelihood

p(y_1, …, y_N | x_1, …, x_N, θ) = p(y | X, θ) = ∏_{n=1}^{N} N(y_n | x_n^⊤θ, σ²)

Log-transformation: maximize the log-likelihood

log p(y | X, θ) = Σ_{n=1}^{N} log N(y_n | x_n^⊤θ, σ²),

log N(y_n | x_n^⊤θ, σ²) = −(1/(2σ²)) (y_n − x_n^⊤θ)² + const
Maximum Likelihood (2)
With

log N(y_n | x_n^⊤θ, σ²) = −(1/(2σ²)) (y_n − x_n^⊤θ)² + const

we get

log p(y | X, θ) = Σ_{n=1}^{N} log N(y_n | x_n^⊤θ, σ²)
= −(1/(2σ²)) Σ_{n=1}^{N} (y_n − x_n^⊤θ)² + const
= −(1/(2σ²)) (y − Xθ)^⊤(y − Xθ) + const
= −(1/(2σ²)) ‖y − Xθ‖² + const

Computing the gradient with respect to θ and setting it to 0 gives the maximum likelihood estimator (least-squares estimator)

θ_ML = (X^⊤X)^{−1} X^⊤ y
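As a minimal numerical sketch (toy data with a hypothetical true parameter vector, not from the slides), the closed-form estimator can be computed by solving the normal equations rather than forming the matrix inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = x^T theta + noise, with a hypothetical true theta
N, D = 50, 2
X = rng.normal(size=(N, D))
theta_true = np.array([1.5, -0.7])
sigma = 0.1
y = X @ theta_true + sigma * rng.normal(size=N)

# Maximum likelihood (least-squares) estimator: theta_ML = (X^T X)^{-1} X^T y
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_ml)  # close to theta_true
```

Solving the linear system is numerically preferable to computing (X^⊤X)^{−1} and multiplying.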
Making Predictions
y = x^⊤θ + ε,  ε ~ N(0, σ²)

Given an arbitrary input x*, we can predict the corresponding observation y* using the maximum likelihood parameter:

p(y* | x*, θ_ML) = N(y* | x*^⊤ θ_ML, σ²)

The measurement noise variance σ² is assumed known.

In the absence of noise (σ² = 0), the prediction is deterministic.
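A small sketch of this predictive distribution, with a hypothetical fitted θ_ML and an assumed-known noise variance (neither comes from the slides' data):

```python
import numpy as np

theta_ml = np.array([1.5, -0.7])  # hypothetical fitted parameters
sigma2 = 0.01                     # assumed-known noise variance

x_star = np.array([0.3, 2.0])     # arbitrary query input

# Predictive distribution p(y* | x*, theta_ML) = N(y* | x*^T theta_ML, sigma^2):
# the mean comes from the fitted model, the variance only from the noise.
mean = x_star @ theta_ml
var = sigma2
print(mean, var)
```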
Example 1: Linear Functions
[Figure: training data and the MLE fit, y against x]

y = θ_0 + θ_1 x + ε,  ε ~ N(0, σ²)

At any query point x* we obtain the mean prediction as

E[y* | θ_ML, x*] = θ_0^ML + θ_1^ML x*
Nonlinear Functions
y = φ(x)^⊤θ + ε = Σ_{m=0}^{M} θ_m x^m + ε

Polynomial regression with features

φ(x) = [1, x, x², …, x^M]^⊤

Maximum likelihood estimator:

θ_ML = (Φ^⊤Φ)^{−1} Φ^⊤ y, where Φ = [φ(x_1), …, φ(x_N)]^⊤ is the feature (design) matrix.
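A sketch of polynomial regression under these definitions, on toy data drawn from a hypothetical sin ground truth; `np.vander` builds the feature matrix Φ row by row:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D data from a hypothetical nonlinear function plus noise
x = rng.uniform(-4, 4, size=60)
y = np.sin(x) + 0.1 * rng.normal(size=60)

M = 5  # polynomial degree

# Feature matrix Phi with rows phi(x_n) = [1, x_n, x_n^2, ..., x_n^M]
Phi = np.vander(x, M + 1, increasing=True)

# Maximum likelihood estimator theta_ML = (Phi^T Phi)^{-1} Phi^T y,
# computed via least squares for numerical stability
theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Mean prediction at query points
x_star = np.linspace(-4, 4, 5)
y_star = np.vander(x_star, M + 1, increasing=True) @ theta_ml
```

The model stays linear in θ even though the fitted curve is nonlinear in x.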
Example 2: Polynomial Regression
Figure: Training data
[Figures: MLE fits to the training data with polynomials of order 2, 4, 6, 8, and 10]
Overfitting
[Figure: RMSE against polynomial degree, for training error and test error]

Training error decreases with higher flexibility of the model.

We are not so much interested in the training error as in the generalization error: how well does the model perform when we predict at previously unseen input locations?

Maximum likelihood often runs into overfitting problems, i.e., we exploit the flexibility of the model to fit the noise in the data.
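The training/test behaviour can be reproduced on toy data (hypothetical sin ground truth and noise level; nothing here comes from the slides' dataset):

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    # Hypothetical ground-truth function sin(x) plus observation noise
    x = rng.uniform(-4, 4, size=n)
    return x, np.sin(x) + 0.2 * rng.normal(size=n)

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

def rmse(x, y, theta):
    pred = np.vander(x, len(theta), increasing=True) @ theta
    return np.sqrt(np.mean((pred - y) ** 2))

train_err, test_err = [], []
for M in range(11):
    Phi = np.vander(x_train, M + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
    train_err.append(rmse(x_train, y_train, theta))
    test_err.append(rmse(x_test, y_test, theta))

# Training error can only decrease as the degree grows (nested models);
# test error typically decreases and then rises again as the model overfits.
```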
MAP Estimation
Empirical observation: parametric models that overfit tend to have some extreme (large-amplitude) parameter values.

Mitigate the effect of overfitting by placing a prior distribution p(θ) on the parameters.

Penalize extreme values that are implausible under that prior.

Choose θ* as the parameter that maximizes the (log) parameter posterior

log p(θ | X, y) = log p(y | X, θ) + log p(θ) + const,

where the first term is the log-likelihood and the second the log-prior.

The log-prior induces a direct penalty on the parameters.

Maximum a posteriori estimate (regularized least squares).
MAP Estimation (2)
Gaussian parameter prior p(θ) = N(0, α²I)

Log-posterior distribution:

log p(θ | X, y) = −(1/(2σ²)) (y − Xθ)^⊤(y − Xθ) − (1/(2α²)) θ^⊤θ + const
= −(1/(2σ²)) ‖y − Xθ‖² − (1/(2α²)) ‖θ‖² + const

Compute the gradient with respect to θ and set it to 0.

Maximum a posteriori estimate:

θ_MAP = (X^⊤X + (σ²/α²) I)^{−1} X^⊤ y
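A minimal sketch of the MAP estimator on toy data, with hypothetical noise and prior variances; it coincides with ridge regression with regularizer λ = σ²/α²:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data with a hypothetical ground-truth parameter vector
N, D = 30, 4
X = rng.normal(size=(N, D))
theta_true = rng.normal(size=D)
sigma2, alpha2 = 0.1 ** 2, 1.0  # noise variance, prior variance (assumed)
y = X @ theta_true + np.sqrt(sigma2) * rng.normal(size=N)

# MAP estimate theta_MAP = (X^T X + (sigma^2/alpha^2) I)^{-1} X^T y
lam = sigma2 / alpha2
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Compared with maximum likelihood, the MAP estimate shrinks towards 0
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)
```

The shrinkage follows from the prior penalty: every singular direction of X is scaled by s²/(s² + λ) ≤ 1 relative to the least-squares solution.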
Example: Polynomial Regression
[Figure: Training data]

Mean prediction:

E[y* | x*, θ_MAP] = φ(x*)^⊤ θ_MAP
[Figures: MLE and MAP fits to the training data with polynomials of order 2, 4, 6, 8, and 10]
Generalization Error
[Figure: RMSE on the test set against polynomial degree, for maximum likelihood and maximum a posteriori]

MAP estimation “delays” the problem of overfitting.

It does not provide a general solution; we need a more principled solution.
Bayesian Linear Regression
y = φ(x)^⊤θ + ε,  ε ~ N(0, σ²)

Avoid overfitting by not fitting any parameters: integrate the parameters out instead of optimizing them.

Use a full parameter distribution p(θ), and not a single point estimate θ*, when making predictions:

p(y* | x*) = ∫ p(y* | x*, θ) p(θ) dθ

The prediction no longer depends on θ.

The predictive distribution reflects the uncertainty about the “correct” parameter setting.
Example
−4 −2 0 2 4x
−4
−2
0
2
4
y
Training data
MLE
MAP
BLR
−4 −2 0 2 4x
−4
−2
0
2
4
y
Light-gray: uncertainty due to noise (same as in MLE/MAP)
Dark-gray: uncertainty due to parameter uncertainty
Right: Plausible functions under the parameter distribution(every single parameter setting describes one function)
Marc Deisenroth (UCL) Linear Regression March/April 2020 18
Model for Bayesian Linear Regression

Prior: $p(\theta) = \mathcal{N}(m_0, S_0)$
Likelihood: $p(y \mid x, \theta) = \mathcal{N}(y \mid \phi^\top(x)\theta, \sigma^2)$

- The parameter $\theta$ becomes a latent (random) variable.
- The prior distribution induces a distribution over plausible functions.
- Choosing a conjugate Gaussian prior allows closed-form computations and yields a Gaussian posterior.
Parameter Posterior and Predictions

A Gaussian prior $p(\theta) = \mathcal{N}(m_0, S_0)$ gives a Gaussian posterior (exercise: derive this):

$p(\theta \mid X, y) = \mathcal{N}(m_N, S_N)$
$S_N = (S_0^{-1} + \sigma^{-2}\Phi^\top\Phi)^{-1}$
$m_N = S_N (S_0^{-1} m_0 + \sigma^{-2}\Phi^\top y)$

- The mean $m_N$ is identical to the MAP estimate.

Assume a Gaussian distribution $p(\theta) = \mathcal{N}(m_N, S_N)$. Then

$p(y_* \mid x_*) = \mathcal{N}\big(y_* \mid \phi^\top(x_*) m_N, \; \phi^\top(x_*) S_N \phi(x_*) + \sigma^2\big)$

- $\phi^\top(x_*) S_N \phi(x_*)$ accounts for parameter uncertainty in the predictive variance.
- More details: https://mml-book.com, Chapter 9.
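A minimal numpy sketch of these formulas. The dataset, the polynomial features $[1, x, x^2]$, and the hyperparameters below are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 0.1                                  # noise variance σ² (assumed)
m0, S0 = np.zeros(3), np.eye(3)               # prior N(m0, S0)

# Synthetic data from an assumed ground truth f(x) = 0.5 x² - x
X = np.linspace(-3, 3, 20)
y = 0.5 * X**2 - X + sigma2**0.5 * rng.standard_normal(X.size)

Phi = np.stack([np.ones_like(X), X, X**2], axis=1)   # N x 3 feature matrix

# S_N = (S0⁻¹ + σ⁻² Φᵀ Φ)⁻¹,  m_N = S_N (S0⁻¹ m0 + σ⁻² Φᵀ y)
SN = np.linalg.inv(np.linalg.inv(S0) + Phi.T @ Phi / sigma2)
mN = SN @ (np.linalg.inv(S0) @ m0 + Phi.T @ y / sigma2)

def predict(x_star):
    """Predictive mean and variance at a test input x*."""
    phi_s = np.array([1.0, x_star, x_star**2])
    mean = phi_s @ mN
    var = phi_s @ SN @ phi_s + sigma2         # parameter + noise uncertainty
    return mean, var

print(predict(0.0))
```

Note that the predictive variance is always at least $\sigma^2$; the extra term $\phi^\top(x_*) S_N \phi(x_*)$ grows away from the data.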
Marginal Likelihood

The marginal likelihood can be computed analytically. With $p(\theta) = \mathcal{N}(\mu, \Sigma)$:

$p(y \mid X) = \int p(y \mid X, \theta)\, p(\theta)\, \mathrm{d}\theta = \mathcal{N}(y \mid \Phi\mu, \; \Phi\Sigma\Phi^\top + \sigma^2 I)$

- Derivation via completing the squares (see Section 9.3.5 of the MML book).
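The closed form above is a multivariate Gaussian density in $y$, so it can be evaluated directly as a log-density. A sketch with assumed data and hyperparameters (none of the specific numbers come from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 0.2
mu, Sigma = np.zeros(2), np.eye(2)            # prior N(μ, Σ) on θ

X = np.linspace(-2, 2, 15)
Phi = np.stack([np.ones_like(X), X], axis=1)
theta_true = np.array([0.5, -1.0])            # assumed ground truth
y = Phi @ theta_true + sigma2**0.5 * rng.standard_normal(X.size)

def log_marginal_likelihood(y, Phi, mu, Sigma, sigma2):
    """log N(y | Φμ, ΦΣΦᵀ + σ²I), evaluated stably via slogdet/solve."""
    m = Phi @ mu                                        # mean Φμ
    K = Phi @ Sigma @ Phi.T + sigma2 * np.eye(len(y))   # cov ΦΣΦᵀ + σ²I
    diff = y - m
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (len(y) * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(K, diff))

print(log_marginal_likelihood(y, Phi, mu, Sigma, sigma2))
```

The marginal likelihood scores the model as a whole (features, prior, noise level), which is what makes it useful for model comparison.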
Distribution over Functions

Consider a linear regression setting

$y = f(x) + \varepsilon = a + bx + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \quad p(a, b) = \mathcal{N}(0, I)$

(Figure: contour of the prior $p(a, b)$ in parameter space.)
Sampling from the Prior over Functions

Consider a linear regression setting

$y = f(x) + \varepsilon = a + bx + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \quad p(a, b) = \mathcal{N}(0, I)$

$f_i(x) = a_i + b_i x, \quad [a_i, b_i] \sim p(a, b)$
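Drawing such function samples is two lines of numpy: sample $(a_i, b_i)$ pairs, then evaluate the corresponding lines on a grid (grid and sample count below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
xs = np.linspace(-5, 5, 101)                  # evaluation grid (assumed)

ab = rng.standard_normal((10, 2))             # 10 samples (a_i, b_i) ~ N(0, I)
fs = ab[:, [0]] + ab[:, [1]] * xs             # each row is one line f_i(x) = a_i + b_i x

print(fs.shape)
```

Plotting the rows of `fs` gives the familiar "spaghetti plot" of functions drawn from the prior.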
Sampling from the Posterior over Functions

Consider a linear regression setting

$y = f(x) + \varepsilon = a + bx + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \quad p(a, b) = \mathcal{N}(0, I)$

$X = [x_1, \ldots, x_N], \quad y = [y_1, \ldots, y_N]$ (training inputs/targets)

(Figure: training data in the $x$-$y$ plane.)
Sampling from the Posterior over Functions

Consider a linear regression setting

$y = f(x) + \varepsilon = a + bx + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \quad p(a, b) = \mathcal{N}(0, I)$

$p(a, b \mid X, y) = \mathcal{N}(m_N, S_N)$ (posterior)

(Figure: contour of the posterior $p(a, b \mid X, y)$ in parameter space.)
Sampling from the Posterior over Functions

Consider a linear regression setting

$y = f(x) + \varepsilon = a + bx + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2)$

$[a_i, b_i] \sim p(a, b \mid X, y), \quad f_i(x) = a_i + b_i x$
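Putting the last three slides together: compute the posterior $\mathcal{N}(m_N, S_N)$ over $(a, b)$ from data, then sample lines from it. The data-generating parameters and noise level below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2_n = 1.0                                # noise variance σ_n² (assumed)

# Assumed ground truth y = 1.0 + 0.8 x + noise
X = rng.uniform(-10, 10, size=30)
y = 1.0 + 0.8 * X + sigma2_n**0.5 * rng.standard_normal(X.size)

Phi = np.stack([np.ones_like(X), X], axis=1)  # features (1, x) for (a, b)
SN = np.linalg.inv(np.eye(2) + Phi.T @ Phi / sigma2_n)   # with S0 = I, m0 = 0
mN = SN @ (Phi.T @ y / sigma2_n)

# Draw parameter samples from the posterior and turn them into functions
ab = rng.multivariate_normal(mN, SN, size=5)  # [a_i, b_i] ~ p(a, b | X, y)
xs = np.linspace(-10, 10, 201)
fs = ab[:, [0]] + ab[:, [1]] * xs             # f_i(x) = a_i + b_i x
```

Compared with the prior samples, these lines all pass close to the training data: the posterior has concentrated around the plausible $(a, b)$ values.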
Fitting Nonlinear Functions

Fit nonlinear functions using (Bayesian) linear regression: a linear combination of nonlinear features.

Example: radial-basis-function (RBF) network

$f(x) = \sum_{i=1}^{n} \theta_i \phi_i(x), \quad \theta_i \sim \mathcal{N}(0, \sigma_p^2)$

where $\phi_i(x) = \exp\!\big(-\tfrac{1}{2}(x - \mu_i)^\top(x - \mu_i)\big)$ for given "centers" $\mu_i$.
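The feature map above is easy to write down for scalar inputs. This sketch uses the 25 centers in $[-5, 3]$ from the following illustration slide; the evaluation grid is an assumption:

```python
import numpy as np

mu = np.linspace(-5, 3, 25)                   # centers μ_i, linearly spaced

def rbf_features(x):
    """Feature vector (φ_1(x), ..., φ_n(x)) with φ_i(x) = exp(-(x - μ_i)²/2)."""
    return np.exp(-0.5 * (x - mu) ** 2)

# Feature matrix Φ: one row of 25 features per input location
Phi = np.stack([rbf_features(x) for x in np.linspace(-5, 5, 100)])
print(Phi.shape)
```

With `Phi` in hand, the Bayesian linear regression machinery from the earlier slides applies unchanged; only the features are nonlinear.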
Illustration: Fitting a Radial Basis Function Network

$\phi_i(x) = \exp\!\big(-\tfrac{1}{2}(x - \mu_i)^\top(x - \mu_i)\big)$

(Figure: the basis functions plotted over $x \in [-5, 5]$.)

Place Gaussian-shaped basis functions $\phi_i$ at 25 input locations $\mu_i$, linearly spaced in the interval $[-5, 3]$.
Samples from the RBF Prior

$f(x) = \sum_{i=1}^{n} \theta_i \phi_i(x), \quad p(\theta) = \mathcal{N}(0, I)$

(Figure: functions sampled from the RBF prior.)
Samples from the RBF Posterior

$f(x) = \sum_{i=1}^{n} \theta_i \phi_i(x), \quad p(\theta \mid X, y) = \mathcal{N}(m_N, S_N)$

(Figure: functions sampled from the RBF posterior.)
RBF Posterior

(Figure: posterior mean and uncertainty bands of the RBF model.)
Limitations

(Figure: RBF posterior; the uncertainty collapses where no basis functions are placed.)

- Feature engineering: which basis functions should we use?
- Finite number of features: in the figure above, there are no basis functions on the right, so we cannot express any variability of the function there.
- Ideally: add more (infinitely many) basis functions.
Approach

- Instead of sampling parameters, which induce a distribution over functions, sample functions directly.
- Place a prior on functions: make assumptions on the distribution of functions.
- Intuition: a function is an infinitely long vector of function values; make assumptions on the distribution of function values.
- This leads to the Gaussian process.
Summary

(Figures: test-set RMSE vs. polynomial degree for maximum likelihood and MAP; training data with MLE/MAP/BLR fits; posterior over $(a, b)$; functions sampled from the posterior.)

- Regression = curve fitting.
- Linear regression = linear in the parameters.
- Parameter estimation via maximum likelihood and MAP estimation can lead to overfitting.
- Bayesian linear regression addresses this issue, but may not be analytically tractable.
- Predictive uncertainty in Bayesian linear regression explicitly accounts for parameter uncertainty.
- A distribution over parameters induces a distribution over functions.
Appendix
Joint Gaussian Distribution

Joint Gaussian distribution

$p(x, y) = \mathcal{N}\!\left(\begin{bmatrix}\mu_x \\ \mu_y\end{bmatrix}, \begin{bmatrix}\Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy}\end{bmatrix}\right)$

Marginal:

$p(x) = \int p(x, y)\, \mathrm{d}y = \mathcal{N}(\mu_x, \Sigma_{xx})$

Conditional:

$p(x \mid y) = \mathcal{N}(\mu_{x|y}, \Sigma_{x|y})$
$\mu_{x|y} = \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y)$
$\Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$
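The conditioning formulas translate directly into a small helper. The example blocks below are made-up numbers for a 1-D/1-D joint:

```python
import numpy as np

def condition(mu_x, mu_y, Sxx, Sxy, Syy, y):
    """Parameters of p(x | y) for a joint Gaussian with the given blocks."""
    K = Sxy @ np.linalg.inv(Syy)
    mu_cond = mu_x + K @ (y - mu_y)           # μ_x + Σ_xy Σ_yy⁻¹ (y - μ_y)
    S_cond = Sxx - K @ Sxy.T                  # Σ_xx - Σ_xy Σ_yy⁻¹ Σ_yx (Σ_yx = Σ_xyᵀ)
    return mu_cond, S_cond

# Illustrative blocks (assumed values)
mu_x, mu_y = np.array([0.0]), np.array([1.0])
Sxx = np.array([[2.0]])
Sxy = np.array([[1.0]])
Syy = np.array([[1.5]])

m, S = condition(mu_x, mu_y, Sxx, Sxy, Syy, y=np.array([2.0]))
print(m, S)
```

Observing $y$ shifts the mean toward the observation and shrinks the covariance; note that $\Sigma_{x|y}$ does not depend on the observed value $y$.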
Linear Transformation of Gaussian Random Variables

If $x \sim \mathcal{N}(x \mid \mu, \Sigma)$ and $z = Ax + b$, then

$p(z) = \mathcal{N}(z \mid A\mu + b, \; A\Sigma A^\top)$
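This rule is easy to check empirically: transform samples of $x$ and compare the empirical moments of $z$ with $A\mu + b$ and $A\Sigma A^\top$. The particular $A$, $b$, $\mu$, $\Sigma$ below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
A = np.array([[1.0, 2.0], [0.0, 1.0]])
b = np.array([3.0, 0.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)
z = x @ A.T + b                               # z = A x + b, row-wise

print(z.mean(axis=0))                         # should approach A μ + b
print(np.cov(z.T))                            # should approach A Σ Aᵀ
```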
Product of Two Gaussians

Let $x \in \mathbb{R}^D$. Then

$\mathcal{N}(x \mid a, A)\, \mathcal{N}(x \mid b, B) = Z\, \mathcal{N}(x \mid c, C)$
$C = (A^{-1} + B^{-1})^{-1}$
$c = C(A^{-1}a + B^{-1}b)$
$Z = (2\pi)^{-\frac{D}{2}} |A + B|^{-\frac{1}{2}} \exp\!\big(-\tfrac{1}{2}(a - b)^\top (A + B)^{-1} (a - b)\big)$

- The product of two Gaussians is an unnormalized Gaussian.
- The "un-normalizer" $Z$ has a Gaussian functional form:

$Z = \mathcal{N}(a \mid b, A + B) = \mathcal{N}(b \mid a, A + B)$

- Note: this is not a distribution (no random variables).
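A numerical check of the identity: compute $C$, $c$, and $Z$, then verify that the product of the two densities equals $Z\,\mathcal{N}(x \mid c, C)$ at a test point. All numbers below are assumed example values:

```python
import numpy as np

def gauss_pdf(x, m, S):
    """Multivariate Gaussian density N(x | m, S)."""
    d = len(m)
    diff = x - m
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(S) ** -0.5
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(S, diff))

a, A = np.array([1.0, 0.0]), np.diag([2.0, 1.0])
b, B = np.array([0.0, 1.0]), np.diag([1.0, 3.0])

C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
c = C @ (np.linalg.solve(A, a) + np.linalg.solve(B, b))
Z = gauss_pdf(a, b, A + B)                    # Z = N(a | b, A + B)

# The identity N(x|a,A) N(x|b,B) = Z N(x|c,C) should hold pointwise:
x = np.array([0.3, -0.7])
lhs = gauss_pdf(x, a, A) * gauss_pdf(x, b, B)
rhs = Z * gauss_pdf(x, c, C)
print(lhs, rhs)
```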
Example: Marginalization of a Product

$p_1(x) = \mathcal{N}(x \mid a, A), \quad p_2(x) = \mathcal{N}(x \mid b, B)$

Then

$\int p_1(x)\, p_2(x)\, \mathrm{d}x = Z = \mathcal{N}(a \mid b, A + B) \in \mathbb{R}$

Note: in this context, $\mathcal{N}$ is used to describe the functional relationship between $a$ and $b$. Do not treat $a$ or $b$ as random variables; they are both deterministic quantities.
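In one dimension this marginalization can be checked by brute-force numerical integration (the grid and the example values of $a$, $A$, $b$, $B$ are assumptions):

```python
import numpy as np

def npdf(x, m, v):
    """Univariate Gaussian density N(x | m, v)."""
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

a, A = 1.0, 0.5
b, B = -0.5, 2.0

# Riemann-sum approximation of ∫ N(x|a,A) N(x|b,B) dx on a wide, fine grid
xs = np.linspace(-20, 20, 200_001)
dx = xs[1] - xs[0]
integral = np.sum(npdf(xs, a, A) * npdf(xs, b, B)) * dx

print(integral, npdf(a, b, A + B))            # the two numbers should agree
```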