Transcript of: Variational Inference in Gaussian Processes for non-linear time series (mlg.eng.cam.ac.uk/carl/talks/brusselsbench.pdf)

Variational Inference in Gaussian Processes for non-linear time series

Carl Edward Rasmussen

Workshop on Nonlinear System Identification Benchmarks

Brussels, Belgium, April 25-27, 2016

with many people from the machine learning group including:

Thang Bui, Yutian Chen, Roger Frigola, Rowan McAllister, Andrew McHutchon,Rich Turner and Mark van der Wilk

Rasmussen (Engineering, Cambridge) Var Inf in GP for time series Brussels, Belgium, April 25-27, 2016 1 / 22


Motivation

Many control problems require that the model of the dynamics be partially or entirely derived from measurements.

Therefore, the dynamics model must be stochastic, to reflect the inevitable lack of certainty in model predictions.

Flexible models are required to model a wide variety of dynamics.

We seek automatic model inference and training (fitting) algorithms.


An inconvenient truth

[Graphical model: latent states x(t-1) → x(t), each emitting observations y(t-1), y(t).]

Gaussian processes (GPs) are an extremely powerful framework for learning and inference about non-linear functions.

Irritating fact: although desirable, it is not easy to apply GPs to latent-variable time series.


Non-linear transition model is enough

[Graphical model: latent states x(t-1) → x(t), each emitting observations y(t-1), y(t).]

It is the non-linearity in the transition model which is essential.

A non-linear observation model can be moved into the transition model without loss of generality (but possibly at the cost of needing additional state coordinates).

Thus, in the remainder of the talk, we focus on non-linear transition models.
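A minimal numerical sketch of this state-augmentation trick (the transition f_true and observation map g below are hypothetical choices, not from the talk): a non-linear observation y = g(x) becomes a linear read-out of an augmented state, at the cost of one extra coordinate.

```python
import numpy as np

def f_true(x):          # non-linear transition (illustrative choice)
    return 0.9 * np.tanh(1.5 * x)

def g(x):               # non-linear observation map (illustrative choice)
    return np.sin(x)

def step_augmented(z):
    """Transition for the augmented state z = (x, g(x))."""
    x_next = f_true(z[0])
    return np.array([x_next, g(x_next)])

C = np.array([0.0, 1.0])     # observation is now linear: y = C @ z = g(x)

z = np.array([0.5, g(0.5)])  # initial augmented state
for _ in range(5):
    z = step_augmented(z)
    y = C @ z                # identical to g(x), but linear in z
    assert np.isclose(y, g(z[0]))
```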


Gaussian Process

[Figure: sample functions drawn from a GP; output f(x) against input x ∈ [−5, 5].]

Gaussian processes (GPs) are flexible stochastic models.

A GP specifies a conditional joint distribution over (any number of) function values, given the inputs (index set): p(f(x_1), . . . , f(x_n) | x_1, . . . , x_n).

A GP is a non-parametric model; the 'parameters' of the model are the function itself. This creates some interesting opportunities and some extra challenges.
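As a small illustration of the joint over function values, here is a sketch (assuming a squared-exponential covariance, a common but here arbitrary choice) that draws functions from a GP prior:

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(x, x')."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 100)
K = rbf_kernel(x, x) + 1e-9 * np.eye(len(x))  # jitter for numerical stability

# Draw three functions from the prior p(f(x_1), ..., f(x_n) | x_1, ..., x_n)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
assert samples.shape == (3, 100)
```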


Generative Model

Gaussian Process State Space Model

f(x | θ_x) ∼ GP(m(x), k(x, x′)),  (1)
x_t | f_t ∼ N(x_t | f_t, Q), Q diagonal,  (2)
y_t | x_t ∼ N(y_t | C x_t, R),  (3)

with hyperparameters θ = (θ_x, Q, C, R).

The joint probability is the prior times the likelihood:

p(y, x, f | θ) = p(x_0) ∏_{t=1}^T p(f_t | f_{1:t−1}, x_{0:t−1}, θ_x) p(x_t | f_t, Q) p(y_t | x_t, C, R).  (4)

Note that the joint distribution of the f values isn't even Gaussian! The marginal likelihood is

p(y | θ) = ∫ p(y, x, f | θ) df dx.  (5)

That's really awful! But we need it to train the model.
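The generative model (1)-(3) can at least be simulated by ancestral sampling: each f_t is drawn from the GP conditioned on the previously sampled (x, f) pairs, then x_t and y_t add Gaussian noise. A sketch for a 1-D state, with assumed kernel and noise settings:

```python
import numpy as np

def k(a, b, ell=1.0, sf2=1.0):
    """Assumed squared-exponential covariance."""
    d = np.asarray(a)[:, None] - np.asarray(b)[None, :]
    return sf2 * np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(1)
Q, R, C, T = 0.01, 0.1, 1.0, 50       # illustrative hyperparameter values
xs, fs = [rng.normal()], []           # x_0 ~ p(x_0)
for t in range(T):
    x_in, x_new = np.array(xs[:-1]), np.array(xs[-1:])
    if len(fs) == 0:
        mu, var = 0.0, k(x_new, x_new)[0, 0]
    else:
        Kinv = np.linalg.inv(k(x_in, x_in) + 1e-9 * np.eye(len(fs)))
        ks = k(x_new, x_in)
        mu = (ks @ Kinv @ np.array(fs))[0]
        var = k(x_new, x_new)[0, 0] - (ks @ Kinv @ ks.T)[0, 0]
    f_t = rng.normal(mu, np.sqrt(max(var, 1e-12)))  # f_t | f_{1:t-1}, x_{0:t-1}
    fs.append(f_t)
    xs.append(rng.normal(f_t, np.sqrt(Q)))          # x_t | f_t, eq. (2)
ys = C * np.array(xs[1:]) + rng.normal(0, np.sqrt(R), T)  # y_t | x_t, eq. (3)
assert len(ys) == T
```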


Picture of part of generative process

[Figure: one step of the generative process: the transition function f(x) and the identity line x = y map x(t) through f(t+1) to the distribution p(x(t+1)), and onwards to f(t+2).]


Let’s lower bound the marginal likelihood

log p(y|θ) ≥ ∫ q(x, f) log [ p(x_0) ∏_{t=1}^T p(f_t | f_{1:t−1}, x_{0:t−1}) p(x_t | f_t) p(y_t | x_t) / q(x, f) ] dx df
           = ∫ q(x, f) log [ p(y|θ) p(x, f | y, θ) / q(x, f) ] dx df
           = log p(y|θ) − KL(q(x, f) || p(x, f | y, θ)),  (6)

for any distribution q(x, f), by Jensen's inequality.

Let's choose the q(x, f) distribution within some restricted family to maximize the lower bound, or equivalently to minimize the KL divergence.

This is still nasty because of the annoying prior, unless we choose:

q(x, f) = q(x) ∏_{t=1}^T p(f_t | f_{1:t−1}, x_{0:t−1}).

This choice makes the lower bound simple but terribly loose! Why? Because the approximating distribution over the f's doesn't depend on the observations y.
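To see Jensen's bound (6) concretely, here is a toy model where everything is tractable (a single Gaussian latent z ~ N(0,1) with y | z ~ N(z, σ²); all choices here are illustrative, not the state-space model): the bound holds for any Gaussian q and is tight exactly at the posterior.

```python
import numpy as np

def log_norm(y, mu, var):
    """log N(y; mu, var)."""
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (y - mu) ** 2 / var

def elbo(y, m, s2, sigma2):
    """Analytic lower bound for q(z) = N(m, s2)."""
    e_loglik = log_norm(y, m, sigma2) - 0.5 * s2 / sigma2  # E_q log p(y|z)
    e_logprior = log_norm(m, 0.0, 1.0) - 0.5 * s2          # E_q log p(z)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s2)          # H(q)
    return e_loglik + e_logprior + entropy

y, sigma2 = 1.3, 0.5
log_py = log_norm(y, 0.0, 1.0 + sigma2)   # exact log marginal likelihood

# The bound holds for any q(z)...
for m, s2 in [(0.0, 1.0), (2.0, 0.1), (0.5, 3.0)]:
    assert elbo(y, m, s2, sigma2) <= log_py + 1e-12

# ...and the gap is KL(q || posterior), so it closes at the exact posterior
m_post, s2_post = y / (1 + sigma2), sigma2 / (1 + sigma2)
assert np.isclose(elbo(y, m_post, s2_post, sigma2), log_py)
```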


Augment GP with inducing variables

Augment the model with an additional set of input-output pairs {z_i, u_i | i = 1, . . . , M}.

Joint distribution:

p(y, x, f, u | z) = p(x, f | u) p(u | z) ∏_{t=1}^T p(y_t | x_t).  (7)

Consistency (or the marginalization property) of GPs ensures that it is straightforward to augment with extra variables.

This step seemingly makes our problem worse, because we have more latent variables.


Lower bound revisited

Lower bound on marginal likelihood

log p(y|θ) ≥ ∫ q(x, f, u) log [ p(u) p(x_0) ∏_{t=1}^T p(f_t | f_{1:t−1}, x_{0:t−1}, u) p(x_t | f_t) p(y_t | x_t) / q(x, f, u) ] dx df du,  (8)

for any distribution q(x, f, u), by Jensen’s inequality.

Now choose

q(x, f, u) = q(u) q(x) ∏_{t=1}^T p(f_t | f_{1:t−1}, x_{0:t−1}, u).  (9)

This choice conveniently makes the troublesome p(f_t | f_{1:t−1}, x_{0:t−1}, u) term cancel. But the u values can be chosen (via q(u)) to reflect the observations, so the bound may be tight(er).

The lower bound is L(y|q(x), q(u), q(f|x, u), θ).
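The same inducing-variable construction is easiest to check in plain GP regression rather than the state-space model; this sketch follows a Titsias-style variational treatment (an assumption here, not the talk's exact algorithm) and compares the sparse predictive mean, which uses only M inducing points, against the full GP:

```python
import numpy as np

def k(a, b, ell=1.0, sf2=1.0):
    """Assumed squared-exponential covariance."""
    return sf2 * np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

rng = np.random.default_rng(2)
N, M, noise = 200, 20, 0.1
X = np.sort(rng.uniform(-5, 5, N))
y = np.sin(X) + rng.normal(0, np.sqrt(noise), N)
Z = np.linspace(-5, 5, M)            # inducing inputs z
Xs = np.linspace(-5, 5, 50)          # test inputs

# Full GP predictive mean: K_*n (K_nn + noise I)^{-1} y
full = k(Xs, X) @ np.linalg.solve(k(X, X) + noise * np.eye(N), y)

# Sparse predictive mean from the optimal Gaussian q(u):
# (1/noise) K_*m Sigma K_mn y, with Sigma = (K_mm + (1/noise) K_mn K_nm)^{-1}
Kmm, Kmn = k(Z, Z), k(Z, X)
Sigma = np.linalg.inv(Kmm + Kmn @ Kmn.T / noise + 1e-9 * np.eye(M))
sparse = k(Xs, Z) @ Sigma @ Kmn @ y / noise

assert np.max(np.abs(full - sparse)) < 0.1   # M = 20 is already close to full
```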


What do we get for all our troubles?

• For certain choices of covariance function, f can be integrated out:

L(y | q(x), q(u), θ) = ∫ L(y | q(x), q(u), q(f | x, u), θ) df.  (10)

• The optimal q(u) is found by calculus of variations, turns out to be Gaussian, and can be maximized out:

L(y | q(x), θ) = max_{q(u)} L(y | q(x), q(u), θ).  (11)

• The optimal q(x) has Markovian structure; making a further Gaussian assumption, the lower bound can be evaluated analytically, as a function of its parameters µ_t, Σ_{t,t} and Σ_{t−1,t}.

Algorithm: optimize the lower bound L(y | q(x), θ) with respect to the parameters of the Gaussian q(x) and the remaining parameters θ (we need derivatives for the optimisation).


Sparse approximations are built in

The computational cost is dominated by the GP. But the effective 'training set size' for the GP is given by M, the number of inducing variables.

Therefore, we can choose M to trade off accuracy against computational demand.

The computational cost is linear in the length of the time-series.


Data from non-linear Dynamical System

[Figure: observed time series over t = 0 to 100.]


State space representation: f(t) as a function of f(t-1)

[Figure: scatter of f(t) against f(t-1).]


With explicit transitions

[Figure: f(t) against f(t-1), with the explicit transitions between successive states drawn in.]


Data from non-linear Dynamical System

[Figure: state-space view of data from the non-linear dynamical system.]


Parameterisation

Let's use the method on a sequence of length T, with D-dimensional observations and a D-dimensional latent state, i.e. we have TD observations.

The lower bound is to be optimized with respect to all the parameters:

• the pairwise-marginal Gaussian distribution q(x): TD parameters for the means and 2TD² for the covariances.
• the inducing inputs z: for each of the D GPs there are MD inducing-input coordinates, i.e. a total of MD².
• parameters of the observation model C: D² of these.
• noise parameters: 2D.
• GP hyperparameters: ∼ D².

for a grand total of roughly (2T + M)D².

Example: cart and inverted pendulum, D = 4 and 10 time series each of length 100, so T = 1000 and M = 50. So we have 4000 observations and 36832 free parameters. This large number of parameters may be inconvenient, but it doesn't lead to overfitting!
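A hypothetical helper reproducing this bookkeeping (exact totals depend on how the ∼D² hyperparameters are counted, so only the rough formula is checked, not the exact 36832 on the slide):

```python
def param_count(T, M, D):
    """Rough free-parameter count for the variational GPSSM as itemized above."""
    mean = T * D              # means of q(x)
    cov = 2 * T * D * D       # pairwise-marginal covariances of q(x)
    inducing = M * D * D      # M D-dimensional inducing inputs for each of D GPs
    obs = D * D               # observation matrix C
    noise = 2 * D             # process and observation noise
    return mean + cov + inducing + obs + noise  # plus ~D^2 hyperparameters

total = param_count(T=1000, M=50, D=4)
# Dominated by the "roughly (2T + M) D^2" term from the slide
assert abs(total - (2 * 1000 + 50) * 4 ** 2) < 5000
```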


Implementation

Careful implementation is necessary:

• q(x) is assumed Markovian. Thus the precision matrix is block tridiagonal. The covariance is full; don't write it out, it's huge!
• q(x) has to be parameterised carefully to allow all positive definite matrices, but without writing out the covariances.
• initialization of parameters is important.
• most of the free parameters are in the covariance of q(x). Initially train with shared covariances across time, then continue training with free covariances.
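The block-tridiagonal structure can be verified directly. As a stand-in for the Markovian q(x), this sketch builds the precision of a linear-Gaussian chain x_t = A x_{t-1} + w_t (A and the noise are arbitrary illustrative choices) and checks that all blocks beyond the first off-diagonal are exactly zero:

```python
import numpy as np

T, D = 6, 2
A = 0.9 * np.eye(D)
Qi = np.eye(D)                       # inverse process-noise covariance
P = np.zeros((T * D, T * D))         # joint precision over (x_1, ..., x_T)
for t in range(T):
    i = slice(t * D, (t + 1) * D)
    P[i, i] += Qi                    # own-noise term (x_0 given a unit prior)
    if t + 1 < T:
        j = slice((t + 1) * D, (t + 2) * D)
        P[i, i] += A.T @ Qi @ A      # coupling to the next state
        P[i, j] += -A.T @ Qi
        P[j, i] += -Qi @ A

# Markov structure: zero blocks beyond the first off-diagonal
for t in range(T):
    for s in range(T):
        if abs(t - s) > 1:
            assert np.all(P[t*D:(t+1)*D, s*D:(s+1)*D] == 0)
assert np.allclose(P, P.T)
assert np.linalg.eigvalsh(P).min() > 0   # valid (positive definite) precision
```

The covariance P⁻¹, by contrast, is dense, which is why it should never be written out explicitly.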


Conclusions

• Principled, approximate inference in flexible non-linear non-parametric models for time series is possible and practical.
• Allows integrating over the dynamics.
• Automatically discovers the dimensionality of the hidden space: it is not the case that more latent dimensions lead to higher marginal likelihood.

Some other interesting properties include:

• handles missing variables
• multivariate latent variables and observations are straightforward
• flexible non-parametric dynamics model
• Occam's Razor automatically at work, no overfitting
• framework includes computational control by limiting M


Marginalize, don’t optimize

Model with parameters θ:  y = f_θ(x) + ε.

Training data set D = {x_n, y_n}, n = 1, . . . , N.

Make a prediction about the output y* given a new (test) input x*:

p(y* | x*, D) = ∫ p(y*, θ | x*, D) dθ
              = ∫ p(y* | x*, θ) p(θ | D) dθ,

where p(θ | D) is the posterior (proportional to the prior times the likelihood).

Notice: no optimization, 'estimation', 'fitting', 'tuning', 'adapting', etc.
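A conjugate example where the integral over θ is exact (a scalar linear model, chosen purely for illustration): the predictive averages over the posterior p(θ | D) rather than plugging in a single fitted θ, which correctly inflates the predictive variance.

```python
import numpy as np

# Model: y = theta * x + eps, theta ~ N(0, 1), eps ~ N(0, noise)
rng = np.random.default_rng(3)
noise, theta_true = 0.25, 0.7
x = rng.uniform(-2, 2, 30)
y = theta_true * x + rng.normal(0, np.sqrt(noise), 30)

# Posterior p(theta | D) = N(mu, v): precision = prior + data terms
v = 1.0 / (1.0 + x @ x / noise)
mu = v * (x @ y) / noise

# Predictive p(y* | x*, D) = ∫ N(y*; theta x*, noise) N(theta; mu, v) dtheta
#                          = N(y*; mu x*, noise + v x*^2)
x_star = 1.5
pred_mean = mu * x_star
pred_var = noise + v * x_star ** 2

assert pred_var > noise   # marginalizing theta adds parameter uncertainty
```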


[Figure: Bayesian Occam's razor: evidence P(Y|Mi) plotted over all possible data sets Y, for models that are too simple, too complex, and "just right".]

• Bayesian inference is not generally equivalent to regularization.
• Maximizing likelihood times prior is penalized maximum likelihood.
• Problem: integrals are difficult.
