Transcript of "Fitting Covariance and Multioutput Gaussian Processes"
Fitting Covariance and Multioutput Gaussian Processes
Neil D. Lawrence
GPSS, 16th September 2014
Outline
Parametric Models are a Bottleneck
Constructing Covariance
GP Limitations
Kalman Filter
Nonparametric Gaussian Processes
- We've seen how we go from parametric to non-parametric.
- The limit implies infinite dimensional w.
- Gaussian processes are generally non-parametric: combine data with covariance function to get model.
- This representation cannot be summarized by a parameter vector of a fixed size.
The Parametric Bottleneck
- Parametric models have a representation that does not respond to increasing training set size.
- Bayesian posterior distributions over parameters contain the information about the training data.
- Use Bayes' rule on the training data, p(w|y, X), then make predictions on test data:

p(y∗|X∗, y, X) = ∫ p(y∗|w, X∗) p(w|y, X) dw.

- w becomes a bottleneck for information about the training set to pass to the test set.
- Solution: increase m so that the bottleneck is so large that it no longer presents a problem.
- How big is big enough for m? Non-parametrics says m → ∞.
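The bottleneck is easy to see in Bayesian linear regression, where everything the test data can learn from the training data passes through the posterior p(w|y, X), and the predictive integral is available in closed form. A minimal sketch; the polynomial basis, prior scale and noise level here are illustrative assumptions, not values from the talk:

```python
import numpy as np

def posterior_w(Phi, y, alpha=1.0, sigma2=0.1):
    """Posterior p(w|y,X) for y = Phi w + noise, with prior w ~ N(0, alpha I)."""
    m = Phi.shape[1]
    A = Phi.T @ Phi / sigma2 + np.eye(m) / alpha   # posterior precision
    C = np.linalg.inv(A)                           # posterior covariance
    mu = C @ Phi.T @ y / sigma2                    # posterior mean
    return mu, C

def predict(Phi_star, mu, C, sigma2=0.1):
    """p(y*|X*,y,X) = integral of p(y*|w,X*) p(w|y,X) dw, in closed form."""
    mean = Phi_star @ mu
    var = np.sum(Phi_star @ C * Phi_star, axis=1) + sigma2
    return mean, var

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=20)
y = np.sin(X) + 0.1 * rng.standard_normal(20)
Phi = np.vander(X, 4)                # m = 4 basis functions: the bottleneck
mu, C = posterior_w(Phi, y)
mean, var = predict(np.vander(np.array([0.5]), 4), mu, C)
```

However many training points arrive, only the m-dimensional pair (mu, C) carries information forward to the test set.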
The Parametric Bottleneck
- Now no longer possible to manipulate the model through the standard parametric form.
- However, it is possible to express parametric models as GPs:

k(xᵢ, xⱼ) = φ(xᵢ)ᵀ φ(xⱼ).

- These are known as degenerate covariance matrices.
- Their rank is at most m; non-parametric models have full rank covariance matrices.
- The most well known is the "linear kernel", k(xᵢ, xⱼ) = xᵢᵀxⱼ.
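The degeneracy is easy to verify numerically: a covariance built from m basis functions has rank at most m no matter how many data points we evaluate it on. A small sketch (the polynomial basis is an illustrative assumption):

```python
import numpy as np

def degenerate_cov(X, basis):
    """k(x_i, x_j) = phi(x_i)^T phi(x_j) for a finite basis."""
    Phi = basis(X)                  # n x m design matrix
    return Phi @ Phi.T              # n x n covariance, rank <= m

basis = lambda X: np.vander(X, 3)   # m = 3 polynomial basis functions
X = np.linspace(-1, 1, 50)
K = degenerate_cov(X, basis)        # 50 x 50 matrix, but rank <= 3
```

The linear kernel is the extreme one-dimensional-input case: `np.outer(X, X)` has rank 1.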
Making Predictions
- For non-parametrics, prediction at new points f∗ is made by conditioning on f in the joint distribution.
- In GPs this involves combining the training data with the covariance function and the mean function.
- Parametric is a special case where conditional prediction can be summarized in a fixed number of parameters.
- Complexity of a parametric model remains fixed regardless of the size of our training data set.
- For a non-parametric model the required number of parameters grows with the size of the training data.
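Conditioning f∗ on f in the joint Gaussian uses the standard partitioned-Gaussian formulae. A sketch with a zero mean function; the RBF covariance and its lengthscale are illustrative assumptions:

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0):
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_condition(X, f, X_star):
    """Mean and covariance of f* | f under a zero-mean GP prior."""
    K = rbf(X, X) + 1e-9 * np.eye(len(X))   # tiny jitter for stability
    K_s = rbf(X_star, X)
    K_ss = rbf(X_star, X_star)
    mean = K_s @ np.linalg.solve(K, f)
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, cov

X = np.array([-1.0, 0.0, 1.0])
f = np.sin(X)
mean, cov = gp_condition(X, f, np.array([0.0, 2.0]))
```

At a training input the (noiseless) conditional reproduces the observed value with essentially zero variance; far from the data the variance returns towards the prior.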
Covariance Functions and Mercer Kernels
- Mercer kernels and covariance functions are similar.
- The kernel perspective does not make a probabilistic interpretation of the covariance function.
- Algorithms can be simpler, but the probabilistic interpretation is crucial for kernel parameter optimization.
Constructing Covariance Functions
- The sum of two covariance functions is also a covariance function.
k(x, x′) = k1(x, x′) + k2(x, x′)
Constructing Covariance Functions
- The product of two covariance functions is also a covariance function.
k(x, x′) = k1(x, x′)k2(x, x′)
Multiply by Deterministic Function
- If f(x) is a Gaussian process,
- and g(x) is a deterministic function,
- then h(x) = f(x)g(x) is a Gaussian process with covariance

kₕ(x, x′) = g(x) k_f(x, x′) g(x′)

where kₕ is the covariance for h(·) and k_f is the covariance for f(·).
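These closure rules (sum, product, and scaling by a deterministic function) can each be checked numerically: every construction should leave the Gram matrix positive semi-definite. A sketch with illustrative base kernels and an illustrative choice of g:

```python
import numpy as np

def rbf(X, lengthscale=1.0):
    d2 = (X[:, None] - X[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

def min_eig(K):
    """Smallest eigenvalue of a symmetric matrix; >= 0 (up to rounding) iff PSD."""
    return np.linalg.eigvalsh(K).min()

X = np.linspace(-2, 2, 30)
K1 = rbf(X, 0.5)
K2 = np.outer(X, X)                        # linear kernel
g = np.cos(X)                              # a deterministic function g(x)

K_sum = K1 + K2                            # sum of covariances
K_prod = K1 * K2                           # element-wise (Schur) product
K_scaled = g[:, None] * K1 * g[None, :]    # g(x) k(x, x') g(x')
```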
Covariance Functions
MLP Covariance Function
k(x, x′) = α asin( (wxᵀx′ + b) / ( √(wxᵀx + b + 1) √(wx′ᵀx′ + b + 1) ) )

- Based on the infinite neural network model.
- Parameters shown: w = 40, b = 4.
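A direct transcription of the MLP (arcsine) covariance for one-dimensional inputs. The +1 terms in the denominator keep the argument of asin inside [−1, 1], so the function is always well defined; the test inputs below are illustrative:

```python
import numpy as np

def mlp_kernel(X1, X2, alpha=1.0, w=40.0, b=4.0):
    """MLP covariance: alpha*asin((w x x' + b)/sqrt((w x^2+b+1)(w x'^2+b+1)))."""
    num = w * X1[:, None] * X2[None, :] + b
    d1 = np.sqrt(w * X1**2 + b + 1.0)
    d2 = np.sqrt(w * X2**2 + b + 1.0)
    return alpha * np.arcsin(num / (d1[:, None] * d2[None, :]))

X = np.linspace(-1, 1, 5)
K = mlp_kernel(X, X)   # symmetric, entries bounded by alpha * pi/2
```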
Covariance Functions
Linear Covariance Function
k(x, x′) = αxᵀx′

- Corresponds to Bayesian linear regression.
- Parameter shown: α = 1.
Gaussian Process Interpolation
Figure: Real example, BACCO (see e.g. Oakley and O'Hagan, 2002). Interpolation of f(x) against x through outputs from slow computer simulations (e.g. atmospheric carbon levels).
Gaussian Noise
- Gaussian noise model,

p(yᵢ|fᵢ) = N(yᵢ|fᵢ, σ²)

where σ² is the variance of the noise.
- Equivalent to a covariance function of the form

k(xᵢ, xⱼ) = δᵢ,ⱼσ²

where δᵢ,ⱼ is the Kronecker delta function.
- The additive nature of Gaussians means we can simply add this term to existing covariance matrices.
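In code, the noise covariance is just σ² added to the diagonal of the Gram matrix, which also regularises the matrix inversion in GP regression. A sketch; the RBF kernel, σ², and the sine data are illustrative assumptions:

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0):
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

sigma2 = 0.05
X = np.linspace(-2, 2, 15)
y = np.sin(X)

K = rbf(X, X)
K_noisy = K + sigma2 * np.eye(len(X))   # k(x_i, x_j) + delta_ij * sigma^2

# GP regression posterior mean at a test point
X_star = np.array([0.0])
mean = rbf(X_star, X) @ np.linalg.solve(K_noisy, y)
```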
Gaussian Process Regression
Figure: Gaussian process regression of y(x) against x. Examples include WiFi localization and the C14 calibration curve.
General Noise Models
Graph of a GP:
- Relates input variables, X, to the vector, y, through f, given kernel parameters θ.
- Plate notation indicates the independence of yᵢ|fᵢ.
- In general p(yᵢ|fᵢ) is non-Gaussian.
- We approximate it with a Gaussian, p(yᵢ|fᵢ) ≈ N(mᵢ|fᵢ, βᵢ⁻¹).

Figure: The Gaussian process depicted graphically (nodes yᵢ, fᵢ, X, θ; plate over i = 1 … n).
Gaussian Noise

Figure: Inclusion of a data point with Gaussian noise. The plot shows the prediction p(f∗|X, x∗, y), the likelihood p(y∗ = 0.6|f∗), and the updated posterior p(f∗|X, x∗, y, y∗).
Expectation Propagation
Local Moment Matching
- Easiest to consider a single previously unseen data point, y∗, x∗.
- Before seeing the data point, the prediction of f∗ is a GP, q(f∗|y, X).
- Update the prediction using Bayes' rule,

p(f∗|y, y∗, X, x∗) = p(y∗|f∗) p(f∗|y, X, x∗) / p(y, y∗|X, x∗).

This posterior is not a Gaussian process if p(y∗|f∗) is non-Gaussian.
Classification Noise Model
Probit Noise Model

Figure: The probit model (classification). The plot shows p(yᵢ|fᵢ) against fᵢ for yᵢ = −1 and yᵢ = 1. For yᵢ = 1 we have

p(yᵢ|fᵢ) = Φ(fᵢ) = ∫₋∞^fᵢ N(z|0, 1) dz.
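With Φ the cumulative Gaussian, the two class probabilities are Φ(fᵢ) and Φ(−fᵢ), which sum to one. A minimal sketch using scipy; writing the likelihood as Φ(yᵢfᵢ) for labels in {−1, +1} is a standard compact form, not notation from the slide:

```python
import numpy as np
from scipy.stats import norm

def probit_likelihood(y, f):
    """p(y_i|f_i) = Phi(y_i * f_i) for labels y_i in {-1, +1}."""
    return norm.cdf(y * f)

f = np.linspace(-4, 4, 9)
p_pos = probit_likelihood(+1, f)   # p(y_i = +1 | f_i)
p_neg = probit_likelihood(-1, f)   # p(y_i = -1 | f_i)
```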
Expectation Propagation II
Match Moments
- Idea behind EP: approximate with a Gaussian process at this stage by matching moments.
- This is equivalent to minimizing the following KL divergence, where q(f∗|y, y∗, X, x∗) is constrained to be a GP:

q(f∗|y, y∗, X, x∗) = argmin_q KL( p(f∗|y, y∗, X, x∗) ‖ q(f∗|y, y∗, X, x∗) )

- This is equivalent to setting

⟨f∗⟩_q = ⟨f∗⟩_p   and   ⟨f∗²⟩_q = ⟨f∗²⟩_p,

where the expectations are taken under q(f∗|y, y∗, X, x∗) and p(f∗|y, y∗, X, x∗) respectively.
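The moment-matching step can be checked numerically: take a Gaussian prediction for f∗, tilt it by a probit likelihood, compute the tilted mean and variance by quadrature, and read those off as the matched Gaussian. The standard-normal prediction below is an illustrative assumption; the expected values in the comments come from the known closed-form moments of a probit-tilted Gaussian:

```python
import numpy as np
from scipy.stats import norm

# Gaussian prediction of f* before seeing y*, and a probit likelihood p(y*=1|f*)
mu0, var0 = 0.0, 1.0
f = np.linspace(-10, 10, 20001)
df = f[1] - f[0]
prior = norm.pdf(f, mu0, np.sqrt(var0))
lik = norm.cdf(f)                      # p(y* = 1 | f*)

tilted = prior * lik
Z = tilted.sum() * df                  # normaliser p(y* = 1); here 0.5
post = tilted / Z
mean = (f * post).sum() * df           # matched first moment, approx 0.5642
second = (f**2 * post).sum() * df
var = second - mean**2                 # matched second moment, approx 0.6817
```

The matched Gaussian N(mean, var) is the moment-matched (EP-style) approximation to the non-Gaussian tilted posterior; note the variance shrinks below the prediction's variance.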
Expectation Propagation III
Equivalent Gaussian
- This is achieved by replacing p(y∗|f∗) with a Gaussian distribution:

p(f∗|y, y∗, X, x∗) = p(y∗|f∗) p(f∗|y, X, x∗) / p(y, y∗|X, x∗)

becomes

q(f∗|y, y∗, X, x∗) = N(m∗|f∗, βₘ⁻¹) p(f∗|y, X, x∗) / p(y, y∗|X, x∗).
Classification
Figure: An EP style update with a classification noise model. The plot shows the prediction p(f∗|X, x∗, y), the likelihood p(y∗ = 1|f∗), the resulting posterior p(f∗|X, x∗, y, y∗), and its Gaussian approximation q(f∗|X, x∗, y).
Ordinal Noise Model
Ordered Categories

Figure: The ordered categorical noise model (ordinal regression). The plot shows p(yᵢ|fᵢ) against fᵢ for yᵢ = −1, yᵢ = 0 and yᵢ = 1. Here we have assumed three categories.
Laplace Approximation

I Equivalent Gaussian is found by making a local 2nd order Taylor approximation at the mode.
I Laplace was the first to suggest this, so it's known as the Laplace approximation.
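As a 1-d illustration (my own sketch, not from the slides): Newton's method finds the mode of an unnormalized log density, and the negative inverse curvature there gives the variance of the approximating Gaussian. The target here is an assumed example combining a logistic likelihood with a standard Gaussian prior.

```python
import numpy as np

def log_density(f):
    # log sigmoid(f) + log N(f | 0, 1), up to a constant.
    return -np.log1p(np.exp(-f)) - 0.5 * f ** 2

def grad(f):
    # Derivative of log_density: (1 - sigmoid(f)) - f.
    sigma = 1.0 / (1.0 + np.exp(-f))
    return (1.0 - sigma) - f

def hess(f):
    # Second derivative: -sigmoid(f)(1 - sigmoid(f)) - 1, always negative here.
    sigma = 1.0 / (1.0 + np.exp(-f))
    return -sigma * (1.0 - sigma) - 1.0

# Newton's method to find the mode of the log density.
f = 0.0
for _ in range(20):
    f = f - grad(f) / hess(f)

mode = f
variance = -1.0 / hess(mode)  # Laplace: 2nd order Taylor expansion at the mode
```

The approximating Gaussian is N( f | mode, variance); because the Hessian is strictly negative, the approximation is always well defined for this likelihood.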
Learning Covariance Parameters
Can we determine covariance parameters from the data?

N(y | 0, K) = (2π)^(−n/2) |K|^(−1/2) exp(−y⊤K⁻¹y / 2)

The parameters are inside the covariance function (matrix),

ki,j = k(xi, xj; θ).

Taking the log,

log N(y | 0, K) = −(1/2) log |K| − y⊤K⁻¹y / 2 − (n/2) log 2π,

and dropping the constant gives the negative log likelihood objective

E(θ) = (1/2) log |K| + y⊤K⁻¹y / 2.
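A quick numerical check (my own, not from the slides) that the objective E(θ) differs from the negative log density of N(y | 0, K) only by the constant (n/2) log 2π:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.normal(size=(n, n))
K = A @ A.T + n * np.eye(n)   # a random symmetric positive definite covariance
y = rng.normal(size=n)

# Objective from the slides: E = 1/2 log|K| + y^T K^{-1} y / 2.
sign, logdetK = np.linalg.slogdet(K)
E = 0.5 * logdetK + 0.5 * y @ np.linalg.solve(K, y)

# Density evaluated independently from the Gaussian formula (det and inv).
pdf = np.exp(-0.5 * y @ np.linalg.inv(K) @ y) / np.sqrt((2 * np.pi) ** n * np.linalg.det(K))
```

The two paths use different linear algebra routines (slogdet/solve versus det/inv), so agreement is a genuine cross-check rather than a tautology.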
Eigendecomposition of Covariance

A useful decomposition for understanding the objective function:

K = RΛ²R⊤,

where Λ is a diagonal matrix and R⊤R = I. The diagonal of Λ represents distance along the axes; R gives a rotation of these axes.

Useful representation since |K| = |Λ²| = |Λ|².
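A small numerical check (my own, not from the slides) that for K = RΛ²R⊤ with orthogonal R, the determinant satisfies |K| = |Λ|²:

```python
import numpy as np

rng = np.random.default_rng(0)
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # a random orthogonal matrix, R^T R = I
lam = np.array([1.0, 2.0, 0.5])                # diagonal of Lambda
K = R @ np.diag(lam ** 2) @ R.T

det_K = np.linalg.det(K)
det_Lambda = np.prod(lam)
# det_K should equal det_Lambda ** 2 = (1 * 2 * 0.5)^2 = 1
```

The rotation drops out because |R| = ±1, so only the eigenvalues on the diagonal of Λ contribute to the volume.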
Capacity control: log |K|

For a diagonal

Λ = [ λ1 0
      0 λ2 ]

the determinant |Λ| = λ1λ2 is the area of the box with sides λ1 and λ2. In three dimensions,

Λ = [ λ1 0 0
      0 λ2 0
      0 0 λ3 ]

gives |Λ| = λ1λ2λ3, a volume. Applying the rotation R changes neither: |RΛ| = λ1λ2 since |R| = 1. So log |K| = 2 log |Λ| measures the volume occupied by the density, and the log |K| term in E(θ) penalizes large volumes: it controls capacity.
Data Fit: y⊤K⁻¹y / 2

Figure : Contours of the Gaussian density over (y1, y2), with λ1 and λ2 marking the extent of the density along its principal axes. The data fit term y⊤K⁻¹y / 2 is small when the contours are large enough to cover the observed y, so it pushes against the volume penalty above.
Learning Covariance Parameters
Can we determine length scales and noise levels from the data?

E(θ) = (1/2) log |K| + y⊤K⁻¹y / 2

Figure : A sequence of GP fits y(x) (for x from −2 to 2), each shown alongside the objective E(θ) plotted against the length scale ` on a log scale from 10⁻¹ to 10¹.
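The scan over length scales can be reproduced numerically; a minimal sketch (my own code, not from the talk), assuming an exponentiated quadratic covariance with a fixed small noise variance:

```python
import numpy as np

def rbf_kernel(X, lengthscale, variance=1.0):
    # Exponentiated quadratic covariance k(x, x') = v exp(-(x - x')^2 / (2 l^2)).
    sq = (X[:, None] - X[None, :]) ** 2
    return variance * np.exp(-0.5 * sq / lengthscale ** 2)

def objective(y, K):
    # E(theta) = 1/2 log|K| + y^T K^{-1} y / 2, via a Cholesky factorization.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = K^{-1} y
    return np.sum(np.log(np.diag(L))) + 0.5 * y @ alpha  # sum log L_ii = 1/2 log|K|

# Draw a sample from a GP with length scale 1 plus a little noise,
# then scan the objective over candidate length scales.
rng = np.random.default_rng(0)
X = np.linspace(-2, 2, 30)
noise = 0.01 * np.eye(len(X))
y = rng.multivariate_normal(np.zeros(len(X)), rbf_kernel(X, 1.0) + noise)

lengthscales = np.logspace(-1, 1, 21)
E = np.array([objective(y, rbf_kernel(X, l) + noise) for l in lengthscales])
best = lengthscales[np.argmin(E)]
```

In practice one would minimize E with a gradient-based optimizer over all hyperparameters jointly; the grid scan is only to mirror the plots above.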
Gene Expression Example
I Given given expression levels in the form of a time seriesfrom Della Gatta et al. (2008).
I Want to detect if a gene is expressed or not, fit a GP to eachgene (Kalaitzis and Lawrence, 2011).
RESEARCH ARTICLE (Open Access)

A Simple Approach to Ranking Differentially Expressed Gene Expression Time Courses through Gaussian Process Regression
Alfredo A Kalaitzis and Neil D Lawrence

Background: The analysis of gene expression from time series underpins many biological studies. Two basic forms of analysis recur for data of this type: removing inactive (quiet) genes from the study and determining which genes are differentially expressed. Often these analysis stages are applied disregarding the fact that the data is drawn from a time series. In this paper we propose a simple model for accounting for the underlying temporal nature of the data based on a Gaussian process.

Results: We review Gaussian process (GP) regression for estimating the continuous trajectories underlying gene expression time-series. We present a simple approach which can be used to filter quiet genes or, for the case of time series in the form of expression ratios, quantify differential expression. We assess via ROC curves the rankings produced by our regression framework and compare them to a recently proposed hierarchical Bayesian model for the analysis of gene expression time-series (BATS). We compare on both simulated and experimental data, showing that the proposed approach considerably outperforms the current state of the art.

Conclusions: Gaussian processes offer an attractive trade-off between efficiency and usability for the analysis of microarray time series. The Gaussian process framework offers a natural way of handling biological replicates and missing values and provides confidence intervals along the estimated curves of gene expression. Therefore, we believe Gaussian processes should be a standard tool in the analysis of gene expression time series.

Kalaitzis and Lawrence, BMC Bioinformatics 2011, 12:180. http://www.biomedcentral.com/1471-2105/12/180
Figure : Contour plot of the Gaussian process likelihood as a function of log10 length scale and log10 SNR.
Figure : The likelihood contour plot together with the corresponding fit y(x). Optimum: length scale of 1.2221 and log10 SNR of 1.9654; log likelihood is −0.22317.
Figure : The likelihood contour plot together with the corresponding fit y(x). Optimum: length scale of 1.5162 and log10 SNR of 0.21306; log likelihood is −0.23604.
Figure : The likelihood contour plot together with the corresponding fit y(x). Optimum: length scale of 2.9886 and log10 SNR of −4.506; log likelihood is −2.1056.
Outline
Parametric Models are a Bottleneck
Constructing Covariance
GP Limitations
Kalman Filter
Limitations of Gaussian Processes

I Inference is O(n³) due to the matrix inverse (in practice, use the Cholesky decomposition).
I Gaussian processes don't deal well with discontinuities (financial crises, phosphorylation, collisions, edges in images).
I The widely used exponentiated quadratic covariance (RBF) can be too smooth in practice (but there are many alternatives!).
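To illustrate the first point, a minimal sketch (my own, not from the talk) of GP posterior-mean prediction using a Cholesky factorization rather than an explicit inverse:

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    # Exponentiated quadratic covariance between two sets of 1-d inputs.
    sq = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * sq / lengthscale ** 2)

def gp_predict(X, y, X_star, k, noise=0.1):
    # The Cholesky factorization is still O(n^3), but it is faster and more
    # numerically stable than forming K^{-1} explicitly with np.linalg.inv.
    K = k(X, X) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)                            # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = K^{-1} y
    return k(X_star, X) @ alpha                          # posterior mean

X = np.linspace(-2, 2, 20)
y = np.sin(X)
mean = gp_predict(X, y, np.array([0.0]), rbf)
```

Two triangular solves replace the matrix inverse; the factor L can also be reused for the log determinant when evaluating E(θ).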
Outline
Parametric Models are a Bottleneck
Constructing Covariance
GP Limitations
Kalman Filter
Simple Markov Chain

I Assume a 1-d latent state, a vector over time, x = [x1 . . . xT].
I Markov property:

  xi = xi−1 + εi,   εi ∼ N(0, α)   =⇒   xi ∼ N(xi−1, α)

I Initial state: x0 ∼ N(0, α0).
I If x0 ∼ N(0, α) we have a Markov chain for the latent states.
I The Markov chain is specified by an initial distribution (Gaussian) and a transition distribution (Gaussian).
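A minimal simulation of this chain (my own sketch, not from the talk), with x0 = 0 and α = 1 as in the figures that follow:

```python
import numpy as np

# Simulate x_i = x_{i-1} + eps_i, eps_i ~ N(0, alpha), starting from x_0 = 0.
rng = np.random.default_rng(42)
T = 10
alpha = 1.0
eps = rng.normal(0.0, np.sqrt(alpha), size=T)
x = np.concatenate([[0.0], np.cumsum(eps)])  # x_i is the running sum of the noise
```

The running-sum form is exactly what the matrix representation below makes explicit.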
Gauss Markov Chain

Figure : A sample path of the chain for t = 0, . . . , 9 with x0 = 0 and εi ∼ N(0, 1), built up one step at a time:

x1 = 0.000 − 2.24 = −2.24
x2 = −2.24 + 0.457 = −1.78
x3 = −1.78 + 0.178 = −1.6
x4 = −1.6 − 0.292 = −1.89
x5 = −1.89 − 0.501 = −2.39
x6 = −2.39 + 1.32 = −1.08
x7 = −1.08 + 0.989 = −0.0881
x8 = −0.0881 − 0.842 = −0.93
x9 = −0.93 − 0.410 = −1.34
Multivariate Gaussian Properties: Reminder

If z ∼ N(µ, C) and x = Wz + b, then

x ∼ N(Wµ + b, WCW⊤).

Simplified: if z ∼ N(0, σ²I) and x = Wz, then

x ∼ N(0, σ²WW⊤).
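A quick Monte Carlo check (my own, not from the talk) of the affine property — if z ∼ N(µ, C) and x = Wz + b, the sample mean and covariance of x should approach Wµ + b and WCW⊤:

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -2.0])
A = rng.normal(size=(2, 2))
C = A @ A.T + np.eye(2)            # a random positive definite covariance
W = np.array([[2.0, 0.0], [1.0, 1.0]])
b = np.array([0.5, -0.5])

z = rng.multivariate_normal(mu, C, size=200_000)
x = z @ W.T + b                    # apply the affine map to every sample

emp_mean = x.mean(axis=0)          # should approach W mu + b
emp_cov = np.cov(x.T)              # should approach W C W^T
```

With 200,000 samples the empirical moments match the closed forms to a couple of decimal places.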
Matrix Representation of Latent Variables

Stacking the states gives a linear relationship between x and ε:

[ x1 ]   [ 1 0 0 0 0 ]   [ ε1 ]
[ x2 ]   [ 1 1 0 0 0 ]   [ ε2 ]
[ x3 ] = [ 1 1 1 0 0 ] × [ ε3 ]
[ x4 ]   [ 1 1 1 1 0 ]   [ ε4 ]
[ x5 ]   [ 1 1 1 1 1 ]   [ ε5 ]

so that

x1 = ε1
x2 = ε1 + ε2
x3 = ε1 + ε2 + ε3
x4 = ε1 + ε2 + ε3 + ε4
x5 = ε1 + ε2 + ε3 + ε4 + ε5.

In short, x = L1 × ε, where L1 is the lower triangular matrix of ones.
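Multiplying by the lower triangular matrix of ones is just a cumulative sum; a small check (my own, not from the talk):

```python
import numpy as np

T = 5
L1 = np.tril(np.ones((T, T)))                 # lower triangular matrix of ones
eps = np.array([0.5, -1.0, 2.0, 0.25, -0.75])
x = L1 @ eps                                  # chain states from the noise vector
# x equals np.cumsum(eps): [0.5, -0.5, 1.5, 1.75, 1.0]
```

This is why sampling the chain never needs the matrix in practice: the O(T²) product collapses to an O(T) running sum.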
Multivariate Process

I Since x is linearly related to ε, we know x is a Gaussian process.
I Trick: we only need to compute the mean and covariance of x to determine that Gaussian.
Latent Process Mean

Starting from x = L1ε with ε ∼ N(0, αI),

⟨x⟩ = ⟨L1ε⟩ = L1⟨ε⟩ = L1 0 = 0.

Latent Process Covariance

Using x⊤ = ε⊤L1⊤,

⟨xx⊤⟩ = ⟨L1εε⊤L1⊤⟩ = L1⟨εε⊤⟩L1⊤ = αL1L1⊤,

since ⟨εε⊤⟩ = αI.
Latent Process

x = L1ε,   ε ∼ N(0, αI)

=⇒ x ∼ N(0, αL1L1^⊤)
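This construction is easy to check numerically. A minimal sketch (not from the slides; variable names are illustrative) draws the latent process x = L1ε, where L1 is the lower-triangular matrix of ones, so x is a cumulative sum of independent Gaussian steps — a discrete random walk:

```python
import numpy as np

# Sample x = L1 @ eps with L1 the lower-triangular matrix of ones.
T = 5
alpha = 2.0
rng = np.random.default_rng(0)

L1 = np.tril(np.ones((T, T)))          # lower-triangular matrix of ones
eps = rng.normal(0.0, np.sqrt(alpha), size=T)
x = L1 @ eps                           # equivalently np.cumsum(eps)

# The implied covariance of x is alpha * L1 @ L1.T,
# whose (i, j) entry is alpha * min(i+1, j+1).
K = alpha * L1 @ L1.T
```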
![Page 129: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/129.jpg)
Covariance for Latent Process II
I Make the variance dependent on the time interval.
I Assume variance grows linearly with time.
I Justification: the sum of two Gaussian distributed random variables is distributed as a Gaussian with the sum of the variances.
I If the variable’s movement is additive over time (as described), the variance scales linearly with time.
![Page 130: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/130.jpg)
Covariance for Latent Process II
I Given

ε ∼ N(0, αI) =⇒ x ∼ N(0, αL1L1^⊤).

Then

ε ∼ N(0, ∆t αI) =⇒ x ∼ N(0, ∆t αL1L1^⊤),

where ∆t is the time interval between observations.
![Page 131: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/131.jpg)
Covariance for Latent Process II

ε ∼ N(0, α∆tI),   x ∼ N(0, α∆tL1L1^⊤)

K = α∆tL1L1^⊤

k_{i,j} = α∆t l_{:,i}^⊤ l_{:,j}

where l_{:,k} is a vector from the kth row of L1: the first k elements are one, the next T − k are zero.

k_{i,j} = α∆t min(i, j)

Define t_i = ∆t · i, so

k_{i,j} = α min(t_i, t_j) = k(t_i, t_j)
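The two expressions for K can be checked against each other. A small sketch (illustrative names; assumes evenly spaced times t_i = ∆t · i as above):

```python
import numpy as np

# Build the Brownian-motion covariance k(t, t') = alpha * min(t, t')
# two ways: directly from the inputs, and via alpha * dt * L1 @ L1.T.
alpha, dt, T = 2.0, 0.5, 6
t = dt * np.arange(1, T + 1)                  # t_i = dt * i

K_direct = alpha * np.minimum.outer(t, t)     # alpha * min(t_i, t_j)

L1 = np.tril(np.ones((T, T)))
K_chol = alpha * dt * L1 @ L1.T               # alpha * dt * min(i, j)
```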
![Page 135: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/135.jpg)
Covariance Functions
Where did this covariance matrix come from?

Markov Process

k(t, t′) = α min(t, t′)

I Covariance matrix is built using the inputs to the function, t.
![Page 137: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/137.jpg)
Covariance Functions
Where did this covariance matrix come from?

Markov Process

Visualization of the inverse covariance (precision).

I Precision matrix is sparse: only neighbours in the matrix are non-zero.
I This reflects conditional independencies in the data.
I In this case, Markov structure.
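The sparsity claim can be verified directly: inverting the Brownian/Markov covariance yields a tridiagonal precision matrix. A sketch (illustrative, not from the slides):

```python
import numpy as np

# The precision (inverse covariance) of k(t, t') = alpha * min(t, t')
# is tridiagonal: each point depends only on its neighbours.
alpha, T = 1.0, 8
t = np.arange(1, T + 1, dtype=float)
K = np.minimum.outer(t, t) * alpha
P = np.linalg.inv(K)

# Entries beyond the first off-diagonal are zero up to numerical noise.
mask = np.abs(P) > 1e-8
```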
![Page 138: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/138.jpg)
Covariance Functions
Where did this covariance matrix come from?

Exponentiated Quadratic Kernel Function (RBF, Squared Exponential, Gaussian)

k(x, x′) = α exp( −‖x − x′‖₂² / (2ℓ²) )

I Covariance matrix is built using the inputs to the function, x.
I For the example above it was based on Euclidean distance.
I The covariance function is also known as a kernel.
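A sketch implementation of the exponentiated quadratic covariance (function and parameter names are illustrative, not a library API):

```python
import numpy as np

def exp_quad(X, X2=None, alpha=1.0, lengthscale=1.0):
    """k(x, x') = alpha * exp(-||x - x'||^2 / (2 l^2)) between rows of X, X2."""
    if X2 is None:
        X2 = X
    sq_dist = np.sum((X[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return alpha * np.exp(-sq_dist / (2.0 * lengthscale ** 2))

X = np.linspace(0, 1, 5)[:, None]
K = exp_quad(X, alpha=2.0, lengthscale=0.3)
```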
![Page 140: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/140.jpg)
Covariance Functions
Where did this covariance matrix come from?

Exponentiated Quadratic

Visualization of the inverse covariance (precision).

I Precision matrix is not sparse.
I Each point is dependent on all the others.
I In this case, non-Markovian.
![Page 142: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/142.jpg)
Simple Kalman Filter I

I We have a state vector X = [x_{:,1} . . . x_{:,q}] ∈ ℝ^{T×q} and if each state evolves independently we have

p(X) = ∏_{i=1}^{q} p(x_{:,i}),   p(x_{:,i}) = N(x_{:,i} | 0, K).

I We want to obtain outputs through:

y_{i,:} = W x_{i,:}
![Page 143: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/143.jpg)
Stacking and Kronecker Products I
I Represent with a ‘stacked’ system:

p(x) = N(x | 0, I ⊗ K)

where the stacking is placing each column of X one on top of another as

x = [x_{:,1}; x_{:,2}; . . . ; x_{:,q}]
![Page 144: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/144.jpg)
Kronecker Product
[a b; c d] ⊗ K = [aK bK; cK dK]
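The Kronecker product tiles K scaled by each entry of the left factor. A quick numerical check (illustrative values):

```python
import numpy as np

# [[a, b], [c, d]] kron K places a*K, b*K, c*K, d*K as blocks.
a, b, c, d = 1.0, 2.0, 3.0, 4.0
A = np.array([[a, b], [c, d]])
K = np.array([[2.0, 1.0],
              [1.0, 2.0]])
AK = np.kron(A, K)
```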
![Page 147: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/147.jpg)
Column Stacking
![Page 148: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/148.jpg)
For this stacking the marginal distribution over time is given bythe block diagonals.
![Page 153: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/153.jpg)
Two Ways of Stacking
Can also stack each row of X to form a column vector:

x = [x_{1,:}; x_{2,:}; . . . ; x_{T,:}]

p(x) = N(x | 0, K ⊗ I)
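The two stackings describe the same joint distribution, related by a permutation of the entries. A sketch verifying that I ⊗ K (column stacking) and K ⊗ I (row stacking) are permutation-equivalent (illustrative sizes):

```python
import numpy as np

T, q = 4, 3
t = np.arange(1, T + 1, dtype=float)
K = np.minimum.outer(t, t)            # any valid T x T covariance

C_col = np.kron(np.eye(q), K)         # covariance of column-stacked X
C_row = np.kron(K, np.eye(q))         # covariance of row-stacked X
```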
![Page 154: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/154.jpg)
Row Stacking
![Page 155: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/155.jpg)
For this stacking the marginal distribution over the latentdimensions is given by the block diagonals.
![Page 160: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/160.jpg)
Observed Process
The observations are related to the latent points by a linear mapping matrix,

y_{i,:} = W x_{i,:} + ε_{i,:},   ε ∼ N(0, σ²I)
![Page 161: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/161.jpg)
Mapping from Latent Process to Observed
[W 0 0; 0 W 0; 0 0 W] × [x_{1,:}; x_{2,:}; x_{3,:}] = [W x_{1,:}; W x_{2,:}; W x_{3,:}]
![Page 162: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/162.jpg)
Output Covariance
This leads to a covariance of the form

(I ⊗ W)(K ⊗ I)(I ⊗ W^⊤) + σ²I

Using (A ⊗ B)(C ⊗ D) = AC ⊗ BD this leads to

K ⊗ WW^⊤ + σ²I

or, for the column stacking,

y ∼ N(0, WW^⊤ ⊗ K + σ²I)
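The mixed-product identity is easy to confirm numerically. A sketch (illustrative sizes; W and K are arbitrary valid choices, not from the slides):

```python
import numpy as np

# Verify (I kron W)(K kron I)(I kron W.T) = K kron (W @ W.T).
rng = np.random.default_rng(2)
T, q, p = 4, 3, 2
t = np.arange(1, T + 1, dtype=float)
K = np.minimum.outer(t, t)            # T x T latent covariance
W = rng.normal(size=(p, q))           # output mapping

lhs = np.kron(np.eye(T), W) @ np.kron(K, np.eye(q)) @ np.kron(np.eye(T), W.T)
rhs = np.kron(K, W @ W.T)
```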
![Page 163: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/163.jpg)
Kernels for Vector Valued Outputs: A Review
M. A. Alvarez, L. Rosasco and N. D. Lawrence. Kernels for Vector-Valued Functions: A Review. Foundations and Trends® in Machine Learning, Vol. 4, No. 3 (2011), pp. 195–266. DOI: 10.1561/2200000036.
![Page 164: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/164.jpg)
Kronecker Structure GPs
I This Kronecker structure leads to several published models:

(K(x, x′))_{d,d′} = k(x, x′) k_T(d, d′),

where k takes x as input and k_T takes the output index d as input.
I Can think of multiple output covariance functions as covariances with an augmented input.
I Alongside x we also input the d associated with the output of interest.
![Page 165: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/165.jpg)
Separable Covariance Functions
I Taking B = WW^⊤ we have a matrix expression across outputs:

K(x, x′) = k(x, x′) B,

where B is a p × p symmetric and positive semi-definite matrix.
I B is called the coregionalization matrix.
I We call this class of covariance functions separable due to their product structure.
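A separable multi-output covariance is just a Kronecker product of B with the kernel matrix on the inputs. A sketch with a rank-one B = ww^⊤ (illustrative values; the exponentiated quadratic is one possible choice of k):

```python
import numpy as np

def exp_quad(X, lengthscale=0.3):
    # k(x, x') = exp(-||x - x'||^2 / (2 l^2))
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * lengthscale ** 2))

X = np.linspace(0, 1, 6)[:, None]
w = np.array([[1.0], [5.0]])
B = w @ w.T                       # coregionalization matrix [[1, 5], [5, 25]]
K_full = np.kron(B, exp_quad(X))  # K(X, X) = B kron k(X, X)
```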
![Page 166: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/166.jpg)
Sum of Separable Covariance Functions
I In the same spirit a more general class of kernels is given by

K(x, x′) = ∑_{j=1}^{q} k_j(x, x′) B_j.

I This can also be written as

K(X, X) = ∑_{j=1}^{q} B_j ⊗ k_j(X, X).

I This is like several Kalman filter-type models added together, but each one with a different set of latent functions.
I We call this class of kernels sum of separable kernels (SoS kernels).
![Page 167: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/167.jpg)
Geostatistics
I Use of GPs in geostatistics is called kriging.
I These multi-output GPs were pioneered in geostatistics: prediction over vector-valued output data is known as cokriging.
I The model in geostatistics is known as the linear model of coregionalization (LMC; Journel and Huijbregts (1978); Goovaerts (1997)).
I Most machine learning multitask models can be placed in the context of the LMC model.
![Page 168: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/168.jpg)
Weighted sum of Latent Functions
I In the linear model of coregionalization (LMC) outputs are expressed as linear combinations of independent random functions.
I In the LMC, each component f_d is expressed as a linear sum

f_d(x) = ∑_{j=1}^{q} w_{d,j} u_j(x),

where the latent functions are independent and have covariance functions k_j(x, x′).
I The processes {u_j(x)}_{j=1}^{q} are mutually independent: u_j and u_{j′} are independent for j ≠ j′.
![Page 169: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/169.jpg)
Kalman Filter Special Case
I The Kalman filter is an example of the LMC where u_i(x) → x_i(t).
I I.e. we’ve moved from a time input to a more general input space.
I In matrix notation:
  1. Kalman filter: F = WX
  2. LMC: F = WU
where the rows of the matrices F, X, U each contain q samples from their corresponding functions at a different time (Kalman filter) or spatial location (LMC).
![Page 170: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/170.jpg)
Intrinsic Coregionalization Model
I If one covariance is used for all latent functions (as in the Kalman filter).
I This is called the intrinsic coregionalization model (ICM; Goovaerts (1997)).
I The kernel matrix corresponding to a dataset X takes the form

K(X, X) = B ⊗ k(X, X).
![Page 171: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/171.jpg)
Autokrigeability
I If outputs are noise-free, maximum likelihood is equivalent to independent fits of B and k(x, x′) (Helterbrand and Cressie, 1994).
I In geostatistics this is known as autokrigeability (Wackernagel, 2003).
I In multitask learning it’s the cancellation of intertask transfer (Bonilla et al., 2008).
![Page 172: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/172.jpg)
Intrinsic Coregionalization Model
K(X, X) = ww^⊤ ⊗ k(X, X).

w = [1; 5]   ⇒   B = ww^⊤ = [1 5; 5 25]
![Page 177: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/177.jpg)
Intrinsic Coregionalization Model
K(X, X) = B ⊗ k(X, X).

B = [1 0.5; 0.5 1.5]
![Page 182: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/182.jpg)
LMC Samples
K(X, X) = B1 ⊗ k1(X, X) + B2 ⊗ k2(X, X)

B1 = [1.4 0.5; 0.5 1.2],   ℓ1 = 1
B2 = [1 0.5; 0.5 1.3],   ℓ2 = 0.2
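Samples like those on this slide can be drawn by assembling the sum-of-separable covariance and taking a Cholesky factor. A sketch with the B1, B2 and length-scales shown above (exponentiated quadratic kernels assumed; names illustrative):

```python
import numpy as np

def exp_quad(X, lengthscale):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * lengthscale ** 2))

X = np.linspace(0, 1, 20)[:, None]
B1 = np.array([[1.4, 0.5], [0.5, 1.2]])
B2 = np.array([[1.0, 0.5], [0.5, 1.3]])
K = np.kron(B1, exp_quad(X, 1.0)) + np.kron(B2, exp_quad(X, 0.2))

rng = np.random.default_rng(3)
jitter = 1e-6 * np.eye(K.shape[0])          # numerical stabiliser
f = np.linalg.cholesky(K + jitter) @ rng.normal(size=K.shape[0])
f1, f2 = f[:20], f[20:]                     # the two correlated outputs
```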
![Page 187: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/187.jpg)
LMC in Machine Learning and Statistics
I Used in machine learning for GPs for multivariate regression and in statistics for computer emulation of expensive multivariate computer codes.
I Imposes the correlation of the outputs explicitly through the set of coregionalization matrices.
I Setting B = I_p assumes the outputs are conditionally independent given the parameters θ (Minka and Picard, 1997; Lawrence and Platt, 2004; Yu et al., 2005).
I More recent approaches for multiple output modelling are different versions of the linear model of coregionalization.
![Page 188: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/188.jpg)
Semiparametric Latent Factor Model
I Coregionalization matrices are rank 1 (Teh et al., 2005). Rewrite the sum of separable kernels as

K(X, X) = ∑_{j=1}^{q} w_{:,j} w_{:,j}^⊤ ⊗ k_j(X, X).

I Like the Kalman filter, but each latent function has a different covariance.
I Authors suggest using an exponentiated quadratic with a characteristic length-scale for each input dimension.
![Page 189: Fitting Covariance and Multioutput Gaussian Processesml.dcs.shef.ac.uk/gpss/gpss14/talks/gp_gpss14_session2.pdf · Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence](https://reader031.fdocuments.us/reader031/viewer/2022021904/5ba46aec09d3f2af168d6de6/html5/thumbnails/189.jpg)
Semiparametric Latent Factor Model Samples

$$K(X,X) = w_{:,1} w_{:,1}^\top \otimes k_1(X,X) + w_{:,2} w_{:,2}^\top \otimes k_2(X,X)$$

$$w_1 = \begin{bmatrix} 0.5 \\ 1 \end{bmatrix}, \qquad w_2 = \begin{bmatrix} 1 \\ 0.5 \end{bmatrix}$$
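The rank-1 structure is easy to verify in code: each coregionalization matrix w w^⊤ has rank one, so the SLFM is the LMC restricted to rank-1 B_j. A sketch using the slide's weight vectors, with NumPy; the length-scales are illustrative choices, not given on the slide:

```python
import numpy as np

def rbf(X, X2, lengthscale):
    """Exponentiated quadratic covariance between 1-D input sets."""
    d2 = (X[:, None, 0] - X2[None, :, 0]) ** 2
    return np.exp(-0.5 * d2 / lengthscale ** 2)

w1 = np.array([0.5, 1.0])
w2 = np.array([1.0, 0.5])

# Rank-1 coregionalization matrices: B_j = w_j w_j^T.
B1, B2 = np.outer(w1, w1), np.outer(w2, w2)

X = np.linspace(0, 1, 30)[:, None]
# Length-scales 1.0 and 0.2 are assumptions for illustration.
K = np.kron(B1, rbf(X, X, 1.0)) + np.kron(B2, rbf(X, X, 0.2))
```

Each term contributes one latent function, scaled into each output by the corresponding entry of w_j, which is why the two sampled outputs share structure at both length-scales.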
Gaussian Processes for Multi-task, Multi-output and Multi-class

- Bonilla et al. (2008) suggest the ICM for multitask learning.
- Use a PPCA form for B: similar to our Kalman filter example.
- Refer to the autokrigeability effect as the cancellation of inter-task transfer.
- Also discuss the similarities between the multi-task GP and the ICM, and its relationship to the SLFM and the LMC.
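The contrast between a PPCA-style B and the independent case B = I_p shows up directly in the kernel matrix: with B = I_p the joint covariance is block-diagonal, so no information transfers between tasks. A sketch, assuming NumPy; the number of tasks, rank, noise level, and `rbf` helper are illustrative assumptions:

```python
import numpy as np

def rbf(X, X2, lengthscale=1.0):
    """Exponentiated quadratic covariance between 1-D input sets."""
    d2 = (X[:, None, 0] - X2[None, :, 0]) ** 2
    return np.exp(-0.5 * d2 / lengthscale ** 2)

p, r = 3, 1                              # number of tasks, rank of shared part
W = np.random.default_rng(1).standard_normal((p, r))
B_ppca = W @ W.T + 0.1 * np.eye(p)       # PPCA form: low rank plus diagonal

X = np.linspace(0, 1, 20)[:, None]
K_icm = np.kron(B_ppca, rbf(X, X))       # off-diagonal blocks couple the tasks
K_indep = np.kron(np.eye(p), rbf(X, X))  # B = I_p: block-diagonal, no transfer
```

With `K_indep`, predictions for one task are unaffected by observations of the others; the off-diagonal blocks of `K_icm` are what carry the inter-task transfer.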
Multitask Classification

- Mostly restricted to the case where the outputs are conditionally independent given the hyperparameters φ (Minka and Picard, 1997; Williams and Barber, 1998; Lawrence and Platt, 2004; Seeger and Jordan, 2004; Yu et al., 2005; Rasmussen and Williams, 2006).
- The intrinsic coregionalization model has been used in the multiclass scenario. Skolidis and Sanguinetti (2011) use it for classification by introducing a probit noise model as the likelihood.
- The posterior distribution is no longer analytically tractable: approximate inference is required.
Computer Emulation

- A statistical model used as a surrogate for a computationally expensive computer model.
- Higdon et al. (2008) use the linear model of coregionalization to model images representing the evolution of the implosion of steel cylinders.
- Conti and O'Hagan (2009) use the ICM to model a vegetation model: the Sheffield Dynamic Global Vegetation Model (Woodward et al., 1998).
References I
E. V. Bonilla, K. M. Chai, and C. K. I. Williams. Multi-task Gaussian process prediction. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20, Cambridge, MA, 2008. MIT Press.

S. Conti and A. O'Hagan. Bayesian emulation of complex multi-output and dynamic computer models. Journal of Statistical Planning and Inference, 140(3):640–651, 2009. [DOI].

G. Della Gatta, M. Bansal, A. Ambesi-Impiombato, D. Antonini, C. Missero, and D. di Bernardo. Direct targets of the trp63 transcription factor revealed by a combination of gene expression profiling and reverse engineering. Genome Research, 18(6):939–948, Jun 2008. [URL]. [DOI].

P. Goovaerts. Geostatistics For Natural Resources Evaluation. Oxford University Press, 1997. [Google Books].

J. D. Helterbrand and N. A. C. Cressie. Universal cokriging under intrinsic coregionalization. Mathematical Geology, 26(2):205–226, 1994.

D. M. Higdon, J. Gattiker, B. Williams, and M. Rightley. Computer model calibration using high dimensional output. Journal of the American Statistical Association, 103(482):570–583, 2008.

A. G. Journel and C. J. Huijbregts. Mining Geostatistics. Academic Press, London, 1978. [Google Books].

A. A. Kalaitzis and N. D. Lawrence. A simple approach to ranking differentially expressed gene expression time courses through Gaussian process regression. BMC Bioinformatics, 12(180), 2011. [DOI].

N. D. Lawrence and J. C. Platt. Learning to learn with the informative vector machine. In R. Greiner and D. Schuurmans, editors, Proceedings of the International Conference in Machine Learning, volume 21, pages 512–519. Omnipress, 2004. [PDF].

T. P. Minka and R. W. Picard. Learning how to learn is learning with point sets. Available on-line, 1997. [URL]. Revised 1999, available at http://www.stat.cmu.edu/˜minka/.

J. Oakley and A. O'Hagan. Bayesian inference for the uncertainty distribution of computer model outputs. Biometrika, 89(4):769–784, 2002.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006. [Google Books].
M. Seeger and M. I. Jordan. Sparse Gaussian Process Classification With Multiple Classes. Technical Report 661, Department of Statistics, University of California at Berkeley, 2004.
G. Skolidis and G. Sanguinetti. Bayesian multitask classification with Gaussian process priors. IEEE Transactions on Neural Networks, 22(12):2011–2021, 2011.

Y. W. Teh, M. Seeger, and M. I. Jordan. Semiparametric latent factor models. In R. G. Cowell and Z. Ghahramani, editors, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 333–340, Barbados, 6-8 January 2005. Society for Artificial Intelligence and Statistics.

H. Wackernagel. Multivariate Geostatistics: An Introduction With Applications. Springer-Verlag, 3rd edition, 2003. [Google Books].

C. K. Williams and D. Barber. Bayesian Classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.

I. Woodward, M. R. Lomas, and R. A. Betts. Vegetation-climate feedbacks in a greenhouse world. Philosophical Transactions: Biological Sciences, 353(1365):29–39, 1998.
K. Yu, V. Tresp, and A. Schwaighofer. Learning Gaussian processes from multiple tasks. In Proceedings of the 22ndInternational Conference on Machine Learning (ICML 2005), pages 1012–1019, 2005.