Transcript of: Distributed Gaussian Processes (mpd37/talks/2015-09-15-gpss.pdf)
Distributed Gaussian Processes
Marc Deisenroth
Department of Computing, Imperial College London
http://wp.doc.ic.ac.uk/sml/marc-deisenroth
Gaussian Process Summer School, University of Sheffield, 15th September 2015
Table of Contents
Gaussian Processes
Sparse Gaussian Processes
Distributed Gaussian Processes
Distributed Gaussian Processes Marc Deisenroth @GPSS, 15th September 2015 2
Problem Setting
[Figure: noisy 1D data and a function fit; horizontal axis x from −5 to 10, vertical axis f(x) from −2 to 2]
Objective
For a set of N observations y_i = f(x_i) + ε, ε ∼ N(0, σ²_ε), find a distribution over functions p(f | X, y) that explains the data

A GP is a good solution to this probabilistic regression problem
GP Training via Marginal Likelihood Maximization
GP Training
Maximize the evidence/marginal likelihood p(y | X, θ) with respect to the hyper-parameters θ:

θ* ∈ arg max_θ log p(y | X, θ)

log p(y | X, θ) = −(1/2) yᵀK⁻¹y − (1/2) log|K| + const

- Automatic trade-off between data fit and model complexity
- Gradient-based optimization possible:

∂ log p(y | X, θ)/∂θ = (1/2) yᵀK⁻¹ (∂K/∂θ) K⁻¹y − (1/2) tr(K⁻¹ ∂K/∂θ)

- Computational complexity: O(N³) for |K| and K⁻¹
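The log marginal likelihood above can be sketched in a few lines of numpy. This is a minimal illustration, not the talk's own code: it assumes an RBF (squared-exponential) kernel, and the function names are illustrative. The Cholesky factorization is the O(N³) step the slide refers to.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale, signal_var):
    """Squared-exponential kernel k(x, x') = s² exp(-||x - x'||² / (2ℓ²))."""
    d2 = (np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :]
          - 2.0 * X1 @ X2.T)
    return signal_var * np.exp(-0.5 * d2 / lengthscale**2)

def log_marginal_likelihood(X, y, lengthscale, signal_var, noise_var):
    """log p(y | X, θ) = -½ yᵀK⁻¹y - ½ log|K| - (N/2) log(2π)."""
    N = X.shape[0]
    K = rbf_kernel(X, X, lengthscale, signal_var) + noise_var * np.eye(N)
    L = np.linalg.cholesky(K)                             # O(N³)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K⁻¹y
    # log|K| = 2 Σ_i log L_ii, so -½ log|K| = -Σ_i log L_ii
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * N * np.log(2.0 * np.pi))
```

In practice one would hand this function (and its gradient) to a gradient-based optimizer to find θ*.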
GP Predictions
At a test point x*, the predictive (posterior) distribution is Gaussian:

p(f(x*) | x*, X, y, θ) = N(f* | m*, σ²_*)
m* = k(X, x*)ᵀ K⁻¹ y
σ²_* = k(x*, x*) − k(X, x*)ᵀ K⁻¹ k(X, x*)

When K⁻¹ and K⁻¹y are cached after training:

- The mean prediction can be computed in O(N)
- The variance prediction can be computed in O(N²)
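These two predictive equations can be sketched directly; again a minimal numpy illustration with illustrative names, where the Cholesky factor and K⁻¹y are the quantities one would cache after training.

```python
import numpy as np

def gp_predict(K, K_s, K_ss_diag, y, noise_var):
    """Posterior mean and variance at T test points.

    K:         N x N kernel matrix of the training inputs
    K_s:       N x T cross-kernel matrix k(X, x*)
    K_ss_diag: length-T vector of prior variances k(x*, x*)
    """
    L = np.linalg.cholesky(K + noise_var * np.eye(len(K)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # cache: K⁻¹y
    mean = K_s.T @ alpha                                 # O(N) per test point
    V = np.linalg.solve(L, K_s)                          # L⁻¹ k(X, x*)
    var = K_ss_diag - np.sum(V**2, axis=0)               # O(N²) per test point
    return mean, var
```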
Application Areas
- Bayesian optimization (experimental design): model unknown utility functions with GPs
- Reinforcement learning and robotics: model value functions and/or dynamics with GPs
- Data visualization: nonlinear dimensionality reduction (GP-LVM)
Limitations of Gaussian Processes
Computational and memory complexity:

- Training scales in O(N³)
- Prediction (variances) scales in O(N²)
- Memory requirement: O(ND + N²)

Practical limit: N ≈ 10,000
Sparse Gaussian Processes
GP Factor Graph
[Factor graph: training function values f(x_1), …, f(x_N) and test function values f*_1, …, f*_L, all connected through the latent function]

- Probabilistic graphical model (factor graph) of a GP
- All function values are jointly Gaussian distributed (e.g., training and test function values)
- GP prior:

p(f, f*) = N( [0, 0]ᵀ, [[K_ff, K_f*], [K_*f, K_**]] )
Inducing Variables
[Factor graph: inducing function values f(u_1), …, f(u_M) placed alongside the training function values f(x_1), …, f(x_N) and test function values f*_1, …, f*_L]

- Introduce inducing function values f_u: "hypothetical" function values at inputs u_j
- All function values are still jointly Gaussian distributed (e.g., training, test and inducing function values)
- Approach: "compress" real function values into inducing function values
Central Approximation Scheme
[Factor graph: every connection between training and test function values is routed through the inducing function values f(u_1), …, f(u_M)]

- Approximation: training and test set are conditionally independent given the inducing function values: f ⊥⊥ f* | f_u
- Then the effective GP prior is

q(f, f*) = ∫ p(f | f_u) p(f* | f_u) p(f_u) df_u = N( [0, 0]ᵀ, [[K_ff, Q_f*], [Q_*f, K_**]] )

Q_*f := K_{* f_u} K_{f_u f_u}⁻¹ K_{f_u f}  (Nyström approximation)
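The Nyström matrix Q is cheap to build because only the M × M matrix K_{f_u f_u} is inverted. A minimal sketch, assuming the inducing inputs U are given and using an illustrative function name:

```python
import numpy as np

def nystrom_q(kernel, X, U, jitter=1e-10):
    """Nyström approximation Q = K_fu K_uu⁻¹ K_uf of the full kernel
    matrix K_ff, built from M inducing inputs U (jitter for stability)."""
    K_uu = kernel(U, U) + jitter * np.eye(len(U))
    K_fu = kernel(X, U)
    L = np.linalg.cholesky(K_uu)        # M x M factorization, O(M³)
    A = np.linalg.solve(L, K_fu.T)      # M x N, O(NM²) overall
    return A.T @ A                      # N x N, but rank at most M
```

When U coincides with the training inputs X, Q recovers K_ff exactly; with M ≪ N points it is a low-rank compression.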
FI(T)C Sparse Approximation
[Factor graph: the training function values are mutually independent given the inducing function values f(u_1), …, f(u_M)]

- Assume that training (and test) sets are fully independent given the inducing variables (Snelson & Ghahramani, 2006)
- Effective GP prior with this approximation:

q(f, f*) = N( [0, 0]ᵀ, [[Q_ff − diag(Q_ff − K_ff), Q_f*], [Q_*f, K_**]] )

- Q_** − diag(Q_** − K_**) can be used instead of K_** (FIC)
- Training: O(NM²), prediction: O(M²)
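The FITC training covariance in the prior above is the Nyström matrix with its diagonal corrected back to the exact kernel. A small illustrative sketch (the helper name is ours):

```python
import numpy as np

def fitc_train_cov(K_ff, Q_ff, noise_var):
    """FITC training covariance: Q_ff - diag(Q_ff - K_ff) + σ² I.
    Off-diagonals come from the low-rank Q_ff; the diagonal is exact."""
    C = Q_ff - np.diag(np.diag(Q_ff - K_ff))
    return C + noise_var * np.eye(len(K_ff))
```

Because the correction is diagonal plus low-rank, this matrix can be inverted in O(NM²) via the matrix inversion lemma, which is where the training complexity on the slide comes from.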
Inducing Inputs
- The FI(T)C sparse approximation exploits inducing function values f(u_i), where the u_i are the corresponding inputs
- These inputs are unknown a priori → find "optimal" ones
- Find them by maximizing the FI(T)C marginal likelihood with respect to the inducing inputs (and the standard hyper-parameters):

u*_{1:M} ∈ arg max_{u_{1:M}} q_FITC(y | X, u_{1:M}, θ)

- Intuitively: the marginal likelihood is parameterized not only by the hyper-parameters θ, but also by the inducing inputs u_{1:M}
- We end up with a high-dimensional non-convex optimization problem with MD additional parameters
FITC Example
Figure from Ed Snelson
- Pink: original data
- Red crosses: initialization of inducing inputs
- Blue crosses: location of inducing inputs after optimization
- Efficient compression of the original data set
Summary Sparse Gaussian Processes
- Sparse approximations typically approximate a GP with N data points by a model with M ≪ N data points
- Selection of these M data points can be tricky and may involve non-trivial computations (e.g., optimizing inducing inputs)
- Simple (random) subset selection is fast and generally robust (Chalupka et al., 2013)
- Computational complexity: O(M³) or O(NM²) for training
- Practical limit M ≤ 10⁴; often M ∈ O(10²) in the case of inducing variables
- If we set M = N/100, i.e., each inducing function value summarizes 100 real function values, the practical limit is N ∈ O(10⁶)
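The random subset selection mentioned above (the subset-of-data baseline) is almost trivial to implement; a sketch with an illustrative function name:

```python
import numpy as np

def subset_of_data(X, y, M, seed=0):
    """Subset-of-data baseline (cf. Chalupka et al., 2013): keep M points
    chosen uniformly at random and train a standard GP on them only,
    reducing training cost from O(N³) to O(M³)."""
    idx = np.random.default_rng(seed).choice(len(X), size=M, replace=False)
    return X[idx], y[idx]
```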
Distributed Gaussian Processes
Joint work with Jun Wei Ng
An Orthogonal Approximation: Distributed GPs
[Figure: the standard GP inverts the full kernel matrix of the whole data set in O(N³); the distributed GP splits the data across independent GP experts, which amounts to a block-diagonal approximation of the kernel matrix whose blocks are inverted in O(MP³)]
- Randomly split the full data set into M chunks
- Place M independent GP models (experts) on these small chunks
- Independent computations can be distributed
- Block-diagonal approximation of the kernel matrix K
- Combine the independent computations into an overall result
Training the Distributed GP
- Split the data set of size N into M chunks of size P
- Independence of experts → factorization of the marginal likelihood:

log p(y | X, θ) ≈ Σ_{k=1}^{M} log p_k(y^(k) | X^(k), θ)

- Distributed optimization and training are straightforward
- Computational complexity: O(MP³) [instead of O(N³)], but distributed over many machines
- Memory footprint: O(MP² + ND) [instead of O(N² + ND)]
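The factorized training objective above can be sketched as follows; a minimal illustration with illustrative names, where each per-chunk term is independent of the others and could therefore be evaluated (together with its gradient) on a separate machine.

```python
import numpy as np

def distributed_log_marginal(X, y, num_experts, lml_fn, seed=0):
    """Approximate log p(y|X,θ) ≈ Σ_k log p_k(y^(k)|X^(k),θ) by summing
    per-expert log marginal likelihoods over random chunks of the data.
    lml_fn(X_chunk, y_chunk) evaluates one expert's log marginal likelihood."""
    idx = np.random.default_rng(seed).permutation(len(X))
    chunks = np.array_split(idx, num_experts)      # M chunks of size ≈ P
    return sum(lml_fn(X[c], y[c]) for c in chunks)  # embarrassingly parallel
```

Shared hyper-parameters θ are then optimized against this sum, exactly as in standard GP training but at O(MP³) cost.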
Distributed Gaussian Processes Marc Deisenroth @GPSS, 15th September 2015 19
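The factorized training objective can be sketched in a few lines. This is a minimal illustration, assuming a squared-exponential kernel; the function names (`rbf_kernel`, `gp_nlml`, `distributed_nlml`) and the random chunking are choices made for the example, not part of the slides:

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale, signal_var):
    # Squared-exponential kernel (an illustrative choice; any kernel works).
    d2 = ((X1[:, None, :] - X2[None, :, :]) / lengthscale) ** 2
    return signal_var * np.exp(-0.5 * d2.sum(-1))

def gp_nlml(X, y, lengthscale, signal_var, noise_var):
    # Standard GP negative log marginal likelihood, O(P^3) for P points.
    K = rbf_kernel(X, X, lengthscale, signal_var) + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() \
        + 0.5 * len(X) * np.log(2 * np.pi)

def distributed_nlml(X, y, M, lengthscale, signal_var, noise_var):
    # Randomly split the data into M chunks and sum the per-expert NLMLs.
    # Each term is independent, so in practice this loop is distributed.
    idx = np.array_split(np.random.permutation(len(X)), M)
    return sum(gp_nlml(X[i], y[i], lengthscale, signal_var, noise_var)
               for i in idx)
```

All experts share one set of hyper-parameters θ, so the gradients of the per-chunk terms are simply summed as well.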
Empirical Training Time
[Figure: log–log plot of the computation time (s) of the NLML and its gradients vs. training data set size (10³–10⁸) for the distributed GP, the full GP, and FITC; a second axis shows the number of GP experts (10¹–10⁷) used by the DGP.]
§ Computation time of the NLML (and its gradients) is proportional to the overall training time
§ Full GP (16K training points) ≈ sparse GP (50K training points) ≈ distributed GP (16M training points)
⇒ Push practical limit by order(s) of magnitude
Practical Training Times
§ Training* with N = 10⁶, D = 1 on a laptop: ≈ 10–30 min
§ Training* with N = 5×10⁶, D = 8 on a workstation: ≈ 4 hours

*: Maximize the marginal likelihood, stop when converged**
**: Convergence often after 30–80 line searches***
***: Line search ≈ 2–3 evaluations of the marginal likelihood and its gradient (usually O(N³))
Predictions with the Distributed GP
§ The prediction of each GP expert is Gaussian, $\mathcal{N}(\mu_i, \sigma_i^2)$
§ How to combine them into an overall prediction $\mathcal{N}(\mu, \sigma^2)$?
Product-of-GP-experts
§ PoE (product of experts; Ng & Deisenroth, 2014)
§ gPoE (generalized product of experts; Cao & Fleet, 2014)
§ BCM (Bayesian Committee Machine; Tresp, 2000)
§ rBCM (robust BCM; Deisenroth & Ng, 2015)
Objectives
[Figure: two computational graphs. Expert predictions (μ_11, σ_11), ..., (μ_23, σ_23) are combined into one overall prediction (μ, σ), either directly or via intermediate nodes (μ_1, σ_1), (μ_2, σ_2).]
§ Scale to large data sets ✓
§ Good approximation of the full GP (“ground truth”)
§ Predictions independent of the computational graph ⇒ runs on heterogeneous computing infrastructures (laptop, cluster, ...)
§ Reasonable predictive variances
Running Example
Investigate various product-of-experts models: same training procedure, but different mechanisms for predictions.
Product of GP Experts
§ Prediction model (independent predictors):

$$p(f_* \mid \mathbf{x}_*, \mathcal{D}) \propto \prod_{k=1}^{M} p_k\big(f_* \mid \mathbf{x}_*, \mathcal{D}^{(k)}\big), \qquad p_k\big(f_* \mid \mathbf{x}_*, \mathcal{D}^{(k)}\big) = \mathcal{N}\big(f_* \mid \mu_k(\mathbf{x}_*),\, \sigma_k^2(\mathbf{x}_*)\big)$$

§ Predictive precision (inverse variance) and mean:

$$\big(\sigma_*^{\mathrm{poe}}\big)^{-2} = \sum_k \sigma_k^{-2}(\mathbf{x}_*), \qquad \mu_*^{\mathrm{poe}} = \big(\sigma_*^{\mathrm{poe}}\big)^{2} \sum_k \sigma_k^{-2}(\mathbf{x}_*)\, \mu_k(\mathbf{x}_*)$$
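Per test input, the PoE combination is a precision-weighted average of the expert means; a minimal sketch (the function name and array layout are illustrative):

```python
import numpy as np

def poe_combine(means, variances):
    """Product-of-GP-experts (PoE) combination at one test input x*.

    means, variances: length-M arrays with each expert's predictive
    mean mu_k(x*) and variance sigma_k^2(x*).
    """
    precisions = 1.0 / np.asarray(variances, dtype=float)
    prec = precisions.sum()          # (sigma_poe)^-2 = sum_k sigma_k^-2
    mean = (precisions * np.asarray(means, dtype=float)).sum() / prec
    return mean, 1.0 / prec
```

For example, two experts predicting N(0, 1) and N(2, 1) combine to N(1, 0.5): the precisions add, which is exactly the overconfidence discussed below.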
Computational Graph
[Figure: GP experts aggregated in two layers of PoE nodes.]

Prediction:

$$p(f_* \mid \mathbf{x}_*, \mathcal{D}) \propto \prod_{k=1}^{M} p_k\big(f_* \mid \mathbf{x}_*, \mathcal{D}^{(k)}\big)$$

Multiplication is associative: $a \cdot b \cdot c \cdot d = (a \cdot b)(c \cdot d)$

$$\prod_{k=1}^{M} p_k\big(f_* \mid \mathcal{D}^{(k)}\big) = \prod_{k=1}^{L} \prod_{i=1}^{L_k} p_{ki}\big(f_* \mid \mathcal{D}^{(ki)}\big), \qquad \sum_k L_k = M$$

⇒ Independent of the computational graph ✓
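Because precisions simply add, grouping the experts in any tree gives the same answer as one flat product; a quick numeric check with illustrative values (the helper `poe` is defined for this example):

```python
import numpy as np

def poe(means, variances):
    # PoE: precisions add; the mean is the precision-weighted average.
    p = 1.0 / np.asarray(variances, dtype=float)
    return (p * np.asarray(means, dtype=float)).sum() / p.sum(), 1.0 / p.sum()

mu = np.array([0.3, -1.2, 0.7, 2.0])
var = np.array([0.5, 2.0, 1.0, 4.0])

# Flat graph: combine all four experts at once.
flat_mu, flat_var = poe(mu, var)

# Two-level graph: combine experts {0,1} and {2,3} first, then the results.
m1, v1 = poe(mu[:2], var[:2])
m2, v2 = poe(mu[2:], var[2:])
tree_mu, tree_var = poe(np.array([m1, m2]), np.array([v1, v2]))

assert np.isclose(flat_mu, tree_mu) and np.isclose(flat_var, tree_var)
```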
Product of GP Experts
§ Unreasonable variances for M > 1:

$$\big(\sigma_*^{\mathrm{poe}}\big)^{-2} = \sum_k \sigma_k^{-2}(\mathbf{x}_*)$$

§ The more experts, the more certain the prediction, even if every expert itself is very uncertain ✗ ⇒ Cannot fall back to the prior
Generalized Product of GP Experts
§ Weight the responsibility of each expert in the PoE with β_k
§ Prediction model (independent predictors):

$$p(f_* \mid \mathbf{x}_*, \mathcal{D}) \propto \prod_{k=1}^{M} p_k^{\beta_k}\big(f_* \mid \mathbf{x}_*, \mathcal{D}^{(k)}\big), \qquad p_k\big(f_* \mid \mathbf{x}_*, \mathcal{D}^{(k)}\big) = \mathcal{N}\big(f_* \mid \mu_k(\mathbf{x}_*),\, \sigma_k^2(\mathbf{x}_*)\big)$$

§ Predictive precision and mean:

$$\big(\sigma_*^{\mathrm{gpoe}}\big)^{-2} = \sum_k \beta_k \sigma_k^{-2}(\mathbf{x}_*), \qquad \mu_*^{\mathrm{gpoe}} = \big(\sigma_*^{\mathrm{gpoe}}\big)^{2} \sum_k \beta_k \sigma_k^{-2}(\mathbf{x}_*)\, \mu_k(\mathbf{x}_*)$$

§ With $\sum_k \beta_k = 1$, the model can fall back to the prior ✓
⇒ “Log-opinion pool” model (Heskes, 1998)
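The gPoE combination differs from the PoE only in the β weights; a minimal sketch (names illustrative), defaulting to β_k = 1/M so the weights sum to 1:

```python
import numpy as np

def gpoe_combine(means, variances, betas=None):
    """Generalized PoE: beta-weighted expert precisions."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    M = len(means)
    if betas is None:
        betas = np.full(M, 1.0 / M)   # beta_k = 1/M a priori
    betas = np.asarray(betas, dtype=float)
    w = betas / variances             # beta_k * sigma_k^-2
    prec = w.sum()                    # (sigma_gpoe)^-2
    return (w * means).sum() / prec, 1.0 / prec
```

Note the conservatism: with β_k = 1/M, M experts that all predict variance σ² combine to variance σ² again, not σ²/M as in the plain PoE.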
Computational Graph
[Figure: GP experts aggregated by PoE nodes, with a gPoE node at the top.]

Prediction:

$$p(f_* \mid \mathbf{x}_*, \mathcal{D}) \propto \prod_{k=1}^{M} p_k^{\beta_k}\big(f_* \mid \mathbf{x}_*, \mathcal{D}^{(k)}\big) = \prod_{k=1}^{L} \prod_{i=1}^{L_k} p_{ki}^{\beta_{ki}}\big(f_* \mid \mathcal{D}^{(ki)}\big), \qquad \sum_{k,i} \beta_{ki} = 1$$

§ Independent of the computational graph if $\sum_{k,i} \beta_{ki} = 1$ ✓
§ A priori setting of β_ki required ✗ ⇒ β_ki = 1/M a priori (✓)
Generalized Product of GP Experts
§ Same mean as PoE
§ Model no longer overconfident and falls back to the prior ✓
§ Very conservative variances ✗
Bayesian Committee Machine
§ Apply Bayes’ theorem when combining predictions (and not only for computing predictions)
§ Prediction model ($\mathcal{D}^{(j)} \perp\!\!\!\perp \mathcal{D}^{(k)} \mid f_*$):

$$p(f_* \mid \mathbf{x}_*, \mathcal{D}) \propto \frac{\prod_{k=1}^{M} p_k\big(f_* \mid \mathbf{x}_*, \mathcal{D}^{(k)}\big)}{p^{M-1}(f_*)}$$

§ Predictive precision and mean:

$$\big(\sigma_*^{\mathrm{bcm}}\big)^{-2} = \sum_{k=1}^{M} \sigma_k^{-2}(\mathbf{x}_*) - (M-1)\,\sigma_{**}^{-2}, \qquad \mu_*^{\mathrm{bcm}} = \big(\sigma_*^{\mathrm{bcm}}\big)^{2} \sum_{k=1}^{M} \sigma_k^{-2}(\mathbf{x}_*)\, \mu_k(\mathbf{x}_*)$$

§ Product of GP experts, divided by M−1 times the prior
§ Guaranteed to fall back to the prior outside the data regime ✓
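Beyond the PoE quantities, the BCM correction only needs the prior variance σ²_** at the test input; a minimal sketch (names illustrative):

```python
import numpy as np

def bcm_combine(means, variances, prior_var):
    """Bayesian Committee Machine: PoE corrected by (M-1) copies of the prior.

    prior_var: GP prior variance at the test input, sigma_**^2.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    M = len(means)
    # Subtract the prior precision counted M-1 times too often by the PoE.
    prec = (1.0 / variances).sum() - (M - 1) / prior_var
    mean = (means / variances).sum() / prec
    return mean, 1.0 / prec
```

If every expert just returns the prior, the correction cancels exactly and the combined prediction is the prior itself, which is the fallback property stated above.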
Computational Graph
[Figure: GP experts aggregated by PoE nodes; the prior correction is applied at the top.]

Prediction:

$$p(f_* \mid \mathbf{x}_*, \mathcal{D}) \propto \frac{\prod_{k=1}^{M} p_k\big(f_* \mid \mathbf{x}_*, \mathcal{D}^{(k)}\big)}{p^{M-1}(f_*)} = \frac{\prod_{k=1}^{L} \prod_{i=1}^{L_k} p_{ki}\big(f_* \mid \mathcal{D}^{(ki)}\big)}{p^{M-1}(f_*)}$$

⇒ Independent of the computational graph ✓
Bayesian Committee Machine
§ Variance estimates are about right ✓
§ When leaving the data regime, the BCM can produce junk ✗
⇒ Robustify
Robust Bayesian Committee Machine
§ Merge the gPoE (weighting of experts) with the BCM (Bayes’ theorem when combining predictions)
§ Prediction model (conditional independence $\mathcal{D}^{(j)} \perp\!\!\!\perp \mathcal{D}^{(k)} \mid f_*$):

$$p(f_* \mid \mathbf{x}_*, \mathcal{D}) \propto \frac{\prod_{k=1}^{M} p_k^{\beta_k}\big(f_* \mid \mathbf{x}_*, \mathcal{D}^{(k)}\big)}{p^{\sum_k \beta_k - 1}(f_*)}$$

§ Predictive precision and mean:

$$\big(\sigma_*^{\mathrm{rbcm}}\big)^{-2} = \sum_{k=1}^{M} \beta_k \sigma_k^{-2}(\mathbf{x}_*) + \Big(1 - \sum_{k=1}^{M} \beta_k\Big)\sigma_{**}^{-2}, \qquad \mu_*^{\mathrm{rbcm}} = \big(\sigma_*^{\mathrm{rbcm}}\big)^{2} \sum_k \beta_k \sigma_k^{-2}(\mathbf{x}_*)\, \mu_k(\mathbf{x}_*)$$
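A sketch of the rBCM combination (names illustrative). The slides leave β_k open; as a default this sketch uses the entropy-difference weights suggested by Deisenroth & Ng (2015), β_k = ½(log σ²_** − log σ²_k), which is an assumption beyond what this slide states:

```python
import numpy as np

def rbcm_combine(means, variances, prior_var, betas=None):
    """Robust BCM: beta-weighted experts plus a matching prior correction."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    if betas is None:
        # Assumed default: difference in differential entropy between
        # prior and expert posterior (Deisenroth & Ng, 2015).
        betas = 0.5 * (np.log(prior_var) - np.log(variances))
    betas = np.asarray(betas, dtype=float)
    prec = (betas / variances).sum() + (1.0 - betas.sum()) / prior_var
    mean = (betas * means / variances).sum() / prec
    return mean, 1.0 / prec
```

With these weights, an expert whose predictive variance equals the prior variance gets β_k = 0 and contributes nothing, which is what makes the combination robust to weak experts.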
Computational Graph
[Figure: GP experts aggregated by PoE nodes, with the gPoE weighting and the prior correction at the top.]

Prediction:

$$p(f_* \mid \mathbf{x}_*, \mathcal{D}) \propto \frac{\prod_{k=1}^{M} p_k^{\beta_k}\big(f_* \mid \mathbf{x}_*, \mathcal{D}^{(k)}\big)}{p^{\sum_k \beta_k - 1}(f_*)} = \frac{\prod_{k=1}^{L} \prod_{i=1}^{L_k} p_{ki}^{\beta_{ki}}\big(f_* \mid \mathcal{D}^{(ki)}\big)}{p^{\sum_{k,i} \beta_{ki} - 1}(f_*)}$$

⇒ Independent of the computational graph, even with arbitrary β_ki ✓
Robust Bayesian Committee Machine
§ Does not break down in case of weak experts ⇒ Robustified ✓
§ Robust version of the BCM ⇒ Reasonable predictions ✓
§ Independent of the computational graph (for all choices of β_k) ✓
Empirical Approximation Error
[Figure: RMSE vs. gradient computation time (s) for rBCM, BCM, gPoE, PoE, SOD, and the full GP; the top axis shows the number of points per expert (39–10000).]
§ Simulated robot-arm data (10K training, 10K test)
§ Hyper-parameters of the ground-truth full GP
§ RMSE as a function of the training time
§ Sparse GP (SOD) performs worse than any distributed GP
§ rBCM performs best with “weak” GP experts
Empirical Approximation Error (2)
[Figure: NLPD vs. gradient time in seconds (log scale, 10^0–10^3); top axis: #points per expert (39, 156, 625, 2500, 10000); curves: rBCM, BCM, gPoE, PoE, SOD, full GP]

§ NLPD as a function of the training time (accounts for both mean and variance)
§ BCM and PoE are not robust for weak experts
§ gPoE suffers from too conservative variances
§ rBCM consistently outperforms the other methods
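The two evaluation metrics used in these plots can be computed as follows (a small sketch; the function names are my own). RMSE scores only the predictive means, while the NLPD also penalises over- or under-confident predictive variances:

```python
import numpy as np

def rmse(y, mu):
    """Root-mean-squared error of the predictive means."""
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    return np.sqrt(np.mean((y - mu) ** 2))

def nlpd(y, mu, var):
    """Mean negative log predictive density under Gaussian predictions
    N(y_i | mu_i, var_i); lower is better."""
    y, mu, var = (np.asarray(a, float) for a in (y, mu, var))
    return np.mean(0.5 * np.log(2.0 * np.pi * var)
                   + 0.5 * (y - mu) ** 2 / var)
```

This is why a method can look good under RMSE yet poor under NLPD: an accurate mean paired with a badly calibrated variance is punished only by the second metric.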
Large Data Sets: Airline Data (700K)
[Figure: RMSE on the Airline Delay data (700K) — bars for rBCM, BCM, gPoE, PoE (DGP with random data assignment) and for the sparse GPs Dist-VGP and SVI-GP; y-axis range 24–34]

§ (r)BCM and (g)PoE with 4096 GP experts
§ Gradient time: 13 seconds (12 cores)
§ Inducing-input baselines: Dist-VGP (Gal et al., 2014), SVI-GP (Hensman et al., 2013)
§ rBCM performs best
§ (g)PoE and BCM perform no worse than sparse GPs
§ KD-tree data assignment clearly helps the (r)BCM
Large Data Sets: Airline Data (700K)
[Figure: RMSE on the Airline Delay data (700K) — bars for rBCM, BCM, gPoE, PoE with both KD-tree and random data assignment (DGP), and for the sparse GPs Dist-VGP and SVI-GP; y-axis range 24–34]

§ (r)BCM and (g)PoE with 4096 GP experts
§ Gradient time: 13 seconds (12 cores)
§ Inducing-input baselines: Dist-VGP (Gal et al., 2014), SVI-GP (Hensman et al., 2013)
§ rBCM performs best
§ (g)PoE and BCM perform no worse than sparse GPs
§ KD-tree data assignment clearly helps the (r)BCM
Summary: Distributed Gaussian Processes
[Diagram: two-level computational graph — the root (µ, σ) aggregates intermediate nodes (µ₁, σ₁) and (µ₂, σ₂), each of which aggregates three leaf experts (µ₁₁, σ₁₁), …, (µ₂₃, σ₂₃)]

§ Scales Gaussian processes to large data sets (beyond 10^6 points)
§ Model is conceptually straightforward and easy to train
§ Key: distributed computation
§ Currently tested with N ∈ O(10^7)
§ Scales to arbitrarily large data sets (given enough computing power)
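The graph-independence that makes this distribution possible can be checked numerically for the plain PoE, whose Gaussian combination (precisions add, means are precision-weighted) is associative. This sketch (the helper name `poe_combine` and the toy numbers are my own) combines six hypothetical leaf experts once flat and once through a two-level tree like the one above:

```python
import numpy as np

def poe_combine(mu, var):
    """Product-of-experts combination of Gaussian predictions:
    precisions add, means are precision-weighted."""
    mu, var = np.asarray(mu, float), np.asarray(var, float)
    prec = np.sum(1.0 / var)
    return np.sum(mu / var) / prec, 1.0 / prec

# Toy per-expert predictions at one test input
mus = [0.1, 0.3, -0.2, 0.5, 0.0, 0.4]
vrs = [0.5, 0.8, 0.6, 0.9, 0.7, 1.0]

# Flat: combine all six leaf experts at once
flat = poe_combine(mus, vrs)

# Hierarchical: combine two groups of three, then combine the group results
g1 = poe_combine(mus[:3], vrs[:3])
g2 = poe_combine(mus[3:], vrs[3:])
hier = poe_combine([g1[0], g2[0]], [g1[1], g2[1]])
```

Both routes give the same (µ, σ²), so intermediate nodes can be placed on separate machines without changing the prediction; the (r)BCM is likewise graph-independent as long as the prior correction is applied exactly once, as the prediction formula on the "Computational Graph" slide states.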
Thank you for your attention
References
[1] Y. Cao and D. J. Fleet. Generalized Product of Experts for Automatic and Principled Fusion of Gaussian Process Predictions. http://arxiv.org/abs/1410.7827, October 2014.
[2] K. Chalupka, C. K. I. Williams, and I. Murray. A Framework for Evaluating Approximate Methods for Gaussian Process Regression. Journal of Machine Learning Research, 14:333–350, February 2013.
[3] M. P. Deisenroth and J. W. Ng. Distributed Gaussian Processes. In Proceedings of the International Conference on Machine Learning, 2015.
[4] M. P. Deisenroth, C. E. Rasmussen, and D. Fox. Learning to Control a Low-Cost Manipulator using Data-Efficient Reinforcement Learning. In Proceedings of Robotics: Science and Systems, Los Angeles, CA, USA, June 2011.
[5] Y. Gal, M. van der Wilk, and C. E. Rasmussen. Distributed Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models. In Advances in Neural Information Processing Systems, 2014.
[6] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian Processes for Big Data. In A. Nicholson and P. Smyth, editors, Proceedings of the Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2013.
[7] T. Heskes. Selecting Weighting Factors in Logarithmic Opinion Pools. In Advances in Neural Information Processing Systems, pages 266–272. Morgan Kaufmann, 1998.
[8] N. Lawrence. Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models. Journal of Machine Learning Research, 6:1783–1816, November 2005.
[9] J. Ng and M. P. Deisenroth. Hierarchical Mixture-of-Experts Model for Large-Scale Gaussian Process Regression. http://arxiv.org/abs/1412.3078, December 2014.
[10] J. Quiñonero-Candela and C. E. Rasmussen. A Unifying View of Sparse Approximate Gaussian Process Regression. Journal of Machine Learning Research, 6:1939–1959, 2005.
[11] E. Snelson and Z. Ghahramani. Sparse Gaussian Processes using Pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1257–1264. The MIT Press, Cambridge, MA, USA, 2006.
[12] M. K. Titsias. Variational Learning of Inducing Variables in Sparse Gaussian Processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2009.
[13] V. Tresp. A Bayesian Committee Machine. Neural Computation, 12(11):2719–2741, 2000.