Transcript of "Uncertainty Quantification Using Deep Gaussian Processes"

Page 1:

Uncertainty Quantification Using Deep Gaussian Processes

Alireza Daneshkhah
Warwick Centre for Predictive Modelling, The University of Warwick


Page 2: Motivation

[Workflow figure: a deterministic solver maps the observed input through reduction, density estimation, and reconstruction; Bayesian training of a surrogate model links the reduced input space, A = {α(s)}_{s=1}^{S_A}, to the output space. Key steps: tree construction, HDMR terms, experimental design, output correlations, data collection; the outputs are statistics, PDFs, and error bars.]

Bilionis and Zabaras (2012).

Page 3: The Multiscale Modelling Challenges

1. Challenges of complex multiscale physical models:
   - Curse of dimensionality
   - Computational complexity (and limited data)
   - Discontinuity of the model output
2. Current solutions:
   - Probabilistic neural networks
   - Traditional Gaussian processes
   - Multi-output separable Gaussian processes
3. Deep Gaussian processes:
   - Probabilistic representation
   - An analytical solution is available
   - Model dimensionality is no longer an issue


Page 4: Deep Neural Network

The idea is taken from the deep neural network with l hidden layers. Given x:

h_1 = φ(W_1 x)
h_2 = φ(W_2 h_1)
h_3 = φ(W_3 h_2)
y = w_4^T h_3

[Figure: a fully connected network with inputs x_1, ..., x_6 and three hidden layers h_1, h_2, h_3.]

Lawrence et al. (2014).
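
A minimal NumPy sketch of this forward pass; the layer widths and the tanh activation are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def forward(x, Ws, w_out, phi=np.tanh):
    """Forward pass: h_i = phi(W_i h_{i-1}), then y = w_out^T h_l."""
    h = x
    for W in Ws:
        h = phi(W @ h)
    return w_out @ h

rng = np.random.default_rng(0)
sizes = [6, 8, 6, 4]                                   # x in R^6, three hidden layers
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
w_out = rng.standard_normal(sizes[-1])
y = forward(rng.standard_normal(sizes[0]), Ws, w_out)  # scalar output
```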


Page 5: Problems with Deep Neural Networks

As the number of nodes in neighbouring layers increases, the corresponding weight matrix W (in h = φ(Wx)) becomes very large, leading to an overfitted model.

Solution: replace each W_i with a lower-rank form

W_i = U_i V_i^T

so that if W is k_1 × k_2, then U is k_1 × q and V is k_2 × q, with q small. Given x:

f_1 = V_1^T x,    h_1 = g(U_1 f_1)
f_2 = V_2^T h_1,  h_2 = g(U_2 f_2)
f_3 = V_3^T h_2,  h_3 = g(U_3 f_3)
y = w_4^T h_3

[Figure: the same network with low-rank bottleneck layers f_i inserted between the hidden layers h_i.]

Lawrence et al. (2014).
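
A minimal sketch of the parameter saving from the factorisation (all shapes here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
k1, k2, q = 1000, 1000, 10            # layer widths and bottleneck size (illustrative)
U = rng.standard_normal((k1, q))
V = rng.standard_normal((k2, q))

x = rng.standard_normal(k2)
f = V.T @ x                           # project into the q-dimensional bottleneck
h = np.tanh(U @ f)                    # equivalent to tanh((U @ V.T) @ x)

# Full W has k1*k2 = 1,000,000 parameters; U and V together have only
# (k1 + k2)*q = 20,000 -- the low-rank form controls overfitting.
```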


Page 6: Alternative Solution: Deep GP

Put a GP prior over the weights and take the width of each layer to infinity:

Y = f_3(f_2(· · · f_1(X))),    H_i = f_i(H_{i−1})


Page 7: Deep Gaussian Process

[Figure: a three-layer deep GP with f_1 ~ GP, f_2 ~ GP, f_3 ~ GP.]

1. Deep Gaussian process:
   - Bayesian belief network (DAG)
   - Non-parametric, non-linear mappings f_l
   - The likelihood is a non-linear function of the inputs
2. Challenges:
   - How to learn the intermediate hidden layers?
   - How to efficiently train the model?
3. Solution:
   - Variational compression
   - Provides a probabilistic representation of the model evidence

Page 8: Non-linear Mapping Using a GP

Non-linear regression problem: learn f, with error bars, from data D = {X, y}.

Place a GP prior on the N function values f = {f_i}_{i=1}^N given the inputs X = {x_i}_{i=1}^N:

p(f | X) = N(0, K_N)

with the ARD exponentiated-quadratic covariance

K(x_i, x_j) = τ² exp( −(1/2) Σ_{k=1}^q ( (x_i^(k) − x_j^(k)) / ω_k )² )
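
A short NumPy sketch of this ARD covariance; the hyper-parameter values are assumptions:

```python
import numpy as np

def ard_kernel(X1, X2, tau2=1.0, omega=None):
    """K(x_i, x_j) = tau2 * exp(-0.5 * sum_k ((x1_k - x2_k) / omega_k)^2)."""
    omega = np.ones(X1.shape[1]) if omega is None else np.asarray(omega)
    D = (X1[:, None, :] - X2[None, :, :]) / omega     # scaled pairwise differences
    return tau2 * np.exp(-0.5 * (D ** 2).sum(axis=-1))

X = np.random.default_rng(0).uniform(size=(5, 3))     # N = 5 points in q = 3 dims
KN = ard_kernel(X, X)                                 # the 5 x 5 prior covariance K_N
```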


Page 9: GP Regression

y_i = f_i + ε_i,    ε_i ~ N(0, σ²)

Marginal likelihood: p(y | X) = N(0, K_N + σ²I)

Predictive distribution: p(y* | x*, y, X) = N(μ*, σ*²), with

μ* = K_*N (K_N + σ²I)^(−1) y
σ*² = K_** − K_*N (K_N + σ²I)^(−1) K_N* + σ²

Problem: O(N³) computation.

Solution: use a sparse GP approximation based on a small set of M pseudo-inputs, or inducing variables.
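
A sketch of these predictive equations, using a Cholesky factorisation for the O(N³) solve; the kernel is passed in (e.g. the ard_kernel sketch above):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_predict(X, y, Xstar, kern, sigma2):
    """Predictive mean and variance for GP regression (O(N^3) Cholesky)."""
    c = cho_factor(kern(X, X) + sigma2 * np.eye(len(X)))
    KsN = kern(Xstar, X)
    mu = KsN @ cho_solve(c, y)                                 # K_*N (K_N + s2 I)^-1 y
    var = np.diag(kern(Xstar, Xstar) - KsN @ cho_solve(c, KsN.T)) + sigma2
    return mu, var

# e.g. mu, var = gp_predict(X, y, Xstar, ard_kernel, sigma2=0.01)
```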


Page 10: Sparse GP Using Pseudo-Inputs

1. Choose any set of M ≪ N inducing inputs X̄.
2. Draw the corresponding function values f̄ from the prior:
   p(f̄ | X̄) = N(0, K_M)
3. Draw f conditioned on f̄:
   p(f | f̄) = N( K_NM K_M^(−1) f̄,  Σ = K_N − K_NM K_M^(−1) K_MN )


Page 11: Sparse GP Approximation

p(f_i | f̄) = N( μ_i = K_iM K_M^(−1) f̄,  λ_i = K_ii − K_iM K_M^(−1) K_Mi )

Approximate: p(f | f̄) ≈ Π_{i=1}^N p(f_i | f̄) = N(μ, Λ),    Λ = diag(λ)

This is the minimum-KL factorised approximation: min_{q_i} KL[ p(f | f̄) ‖ Π_i q_i(f_i) ]

Integrate out f̄ to obtain: p(f) = ∫ p(f̄) p(f | f̄) df̄


Page 12: Sparse Pseudo-Input GP (SPGP)

GP prior: N(0, K_N)  ≈  SPGP prior: p(f) = N(0, K_NM K_M^(−1) K_MN + Λ)

[Figure: the full N × N covariance approximated as a low-rank matrix plus a diagonal.]

SPGP covariance computation: O(M²N)
Predictive mean computational complexity: O(M)
Predictive variance computational complexity: O(M²)
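
A sketch of assembling the SPGP prior covariance, making the low-rank-plus-diagonal structure explicit; kern is any kernel function (e.g. the earlier ard_kernel sketch):

```python
import numpy as np

def spgp_prior_cov(X, Xbar, kern, jitter=1e-8):
    """SPGP prior covariance K_NM K_M^-1 K_MN + Lambda (low rank + diagonal)."""
    KM = kern(Xbar, Xbar) + jitter * np.eye(len(Xbar))
    KNM = kern(X, Xbar)
    Q = KNM @ np.linalg.solve(KM, KNM.T)          # the O(M^2 N) low-rank term
    Lam = np.diag(np.diag(kern(X, X) - Q))        # Lambda = diag(K_N - Q)
    return Q + Lam
```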


Page 13: How to Find the Sparse Pseudo-Inputs?

Treat the pseudo-inputs as extra hyper-parameters and maximise the marginal likelihood w.r.t. (X̄, τ, σ, ω):

p(y | X, X̄, τ, σ, ω)

This joint optimisation avoids the discontinuities that arise when the design points are selected.

We use this augmented-variable method, followed by a collapsed variational approximation, for learning the deep GP.


Page 14: Sparse Pseudo-Input Positions

[Figure: two 1-D regression examples plotting y against x, showing the fitted GP and the optimised pseudo-input locations X̄; the panels indicate the learned amplitude, lengthscale, and noise hyper-parameters.]

Page 15: Bayesian GP Latent Variable Model (GP-LVM)

Start with a standard GP-LVM:

p(Y | X) = Π_{j=1}^p N(y_:,j | 0, K)

Apply the standard latent-variable approach:
- Define a Gaussian prior over the latent space X:
  p(X) = Π_{j=1}^q N(x_:,j | 0, α_j² I)
- Integrate out the latent variables to get p(Y) = ?
- The integration is intractable.

Page 16: Standard Variational Inference

The standard variational bound has the form

L = ⟨log p(y | X)⟩_{q(X)} − KL( q(X) ‖ p(X) )

This requires the expectation of log p(y | X) under q(X), where

log p(y | X) = −(1/2) yᵀ(K_ff + σ²I)^(−1) y − (1/2) log|K_ff + σ²I| − (N/2) log 2π

Computing this expectation under q(X) is extremely difficult.

Instead, augment the GP model with inducing variables (Z, u = f(Z)):

p(f, u | Z, X) = N( (f, u) | 0, [ K_ff  K_fu ; K_uf  K_uu ] )

log p(y | X, Z) ≥ log N(y | 0, K_fu K_uu^(−1) K_uf + σ²I) − (1/(2σ²)) tr(Σ)

Σ = K_ff − K_fu K_uu^(−1) K_uf
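
A sketch of this collapsed bound; the jitter term and the kernel interface are assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

def titsias_bound(y, X, Z, kern, sigma2, jitter=1e-8):
    """Collapsed bound: log N(y | 0, Qff + s2 I) - tr(Kff - Qff) / (2 s2)."""
    Kuu = kern(Z, Z) + jitter * np.eye(len(Z))
    Kfu = kern(X, Z)
    Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)       # Kfu Kuu^-1 Kuf
    fit = multivariate_normal.logpdf(y, np.zeros(len(y)),
                                     Qff + sigma2 * np.eye(len(y)))
    return fit - np.trace(kern(X, X) - Qff) / (2.0 * sigma2)
```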


Page 17: Bayesian Variational Inference

Treat u as extra parameters of the model, with prior

p(u) = N(u | 0, K_uu)

Applying parametric variational Bayes (Titsias and Lawrence, 2010) with variational distribution q(u) = N(u | m, S) gives

log p(y | X) ≥ log N(y | K_fu K_uu^(−1) m, σ²I) − (1/(2σ²)) tr( S K_uu^(−1) K_uf K_fu K_uu^(−1) )
              − KL( q(u) ‖ p(u) ) − (1/(2σ²)) tr(Σ)
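
A sketch of this uncollapsed bound with the Gaussian KL written out; the kernel interface is assumed as before:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gauss_kl(m, S, K):
    """KL( N(m, S) || N(0, K) ) between M-dimensional Gaussians."""
    M = len(m)
    return 0.5 * (np.trace(np.linalg.solve(K, S)) + m @ np.linalg.solve(K, m)
                  - M + np.linalg.slogdet(K)[1] - np.linalg.slogdet(S)[1])

def uncollapsed_bound(y, X, Z, m, S, kern, sigma2, jitter=1e-8):
    Kuu = kern(Z, Z) + jitter * np.eye(len(Z))
    Kfu = kern(X, Z)
    A = np.linalg.solve(Kuu, Kfu.T).T                 # Kfu Kuu^-1
    fit = multivariate_normal.logpdf(y, A @ m, sigma2 * np.eye(len(y)))
    trS = np.trace(S @ A.T @ A) / (2.0 * sigma2)      # tr(S Kuu^-1 Kuf Kfu Kuu^-1)
    Sigma = kern(X, X) - Kfu @ np.linalg.solve(Kuu, Kfu.T)
    return fit - trS - gauss_kl(m, S, Kuu) - np.trace(Sigma) / (2.0 * sigma2)
```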


Page 18: Deep GP Representation - Process Composition

Deep GP: y = h_l(h_{l−1}(. . . h_1(X))) + ε

Joint pdf:

p(y, {h_i}_{i=1}^l | x) = p(y | h_l) Π_{i=2}^l p(h_i | h_{i−1}) p(h_1 | x)

h_1 | X ~ N(0, K_{h1h1} + σ_1² I)
h_i | h_{i−1} ~ N(0, K_{hihi} + σ_i² I)
y | h_l ~ N(0, K_{hlhl} + σ_l² I)

The direct computation of p(y | X) is intractable (O(N³)).
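
Although the density is intractable, sampling from the composition is direct: draw a GP sample at each layer and feed it into the next. A sketch, with illustrative hyper-parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gp_layer(H, tau2=1.0, omega=0.3, jitter=1e-6):
    """Draw a GP sample at the inputs H (one output column per input column)."""
    d2 = ((H[:, None, :] - H[None, :, :]) ** 2).sum(-1)
    K = tau2 * np.exp(-0.5 * d2 / omega ** 2)
    L = np.linalg.cholesky(K + jitter * np.eye(len(H)))
    return L @ rng.standard_normal(H.shape)

X = np.linspace(0, 1, 100)[:, None]
H = X
for _ in range(3):                    # y = h3(h2(h1(X))) + noise, layer by layer
    H = sample_gp_layer(H)
y = H + 0.01 * rng.standard_normal(H.shape)
```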


Page 19: Computational Challenges in Learning Deep GPs

1. Marginalise out all the hidden layers in a Bayesian framework (Titsias et al. 2010):
   - The number of parameters is drastically reduced.
   - The deep network structure can be determined automatically.
2. The direct marginalisation of the h_i is intractable:

p(y | x) = ∫ p(y | h_1) ( ∫ p(h_1 | h_2) p(h_2 | x) dh_2 ) dh_1

p(h_1 | x) = ∫ p(h_1 | f_2) p(f_2 | h_2) p(h_2 | x) dh_2 df_2

where p(f_2 | h_2) contains the non-linear kernel term K_{f2f2}^(−1), making the integral over h_2 intractable.

Page 20: Deep GP Augmented by Inducing Variables: An Example

[Figure: the three-layer deep GP (f_1 ~ GP, f_2 ~ GP, f_3 ~ GP), with each layer augmented by its own set of inducing variables.]

Page 21: Variational (Compression) Inference for Deep GPs

Augment each layer h_i with a set of inducing variables u_i and apply Bayesian variational inference within each layer. The bound on the conditional probability is:

p(y, {h_i}_{i=1}^l | {u_i}_{i=1}^l, x) ≥ p(y | h_l, u_l) Π_{i=2}^l p(h_i | h_{i−1}, u_i) p(h_1 | x, u_1) × exp( Σ_{i=1}^l −(1/(2σ_i²)) tr(Σ_i) )

p(h_i | u_i, h_{i−1}) = N( h_i | K_{hiui} K_{uiui}^(−1) u_i, σ_i² I )

Σ_i = K_{hihi} − K_{hiui} K_{uiui}^(−1) K_{uihi}

Page 22: Variational Compression for Deep GP (3)

Given x and a fixed q(u_1) = N(u_1 | m_1, S_1), compute

q(h_1) = ∫ p(h_1 | u_1, x) q(u_1) du_1

Given q(h_1), we can variationally propagate using q(u_2) and marginalise out h_1:

log p(h_2 | x, u_2) ≥ −⟨(1/(2σ_2²)) tr(Σ_1)⟩_{q(h_1)} − (1/(2σ_1²)) tr(Σ_0)
    − KL( q(u_1) ‖ p(u_1) ) + log N( h_2 | Ψ_2 K_{u2u2}^(−1) u_2, σ_2² I )
    − (1/σ_2²) tr( (Φ_2 − Ψ_2ᵀ Ψ_2) K_{u2u2}^(−1) u_2 u_2ᵀ K_{u2u2}^(−1) )

Page 23: The Marginal Likelihood Bound

Continuing to feed forward to the bottom layer, using the variational propagation at each layer, the marginal likelihood bound is

log p(y | X) ≥ −Σ_{i=2}^l (1/(2σ_i²)) ( ψ_i − tr(Φ_i K_{uiui}^(−1)) ) − (1/(2σ_1²)) tr(Σ_1)
    − Σ_{i=1}^l KL( q(u_i) ‖ p(u_i) ) + log N( y | Ψ_l K_{ulul}^(−1) m_l, σ_l² I )
    − Σ_{i=1}^l (1/σ_i²) tr( (Φ_i − Ψ_iᵀ Ψ_i) K_{uiui}^(−1) ⟨u_i u_iᵀ⟩_{q(u_i)} K_{uiui}^(−1) )

where Φ_i = ⟨K_{uihi} K_{hiui}⟩_{q(h_{i−1})},  Ψ_i = ⟨K_{hiui}⟩_{q(h_{i−1})},  ψ_i = ⟨tr(K_{hihi})⟩_{q(h_{i−1})}
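
The Ψ_i, Φ_i, ψ_i statistics are kernel expectations under q(h_{i−1}); for the exponentiated-quadratic kernel they have closed forms, but a Monte Carlo sketch (names and interface are illustrative) makes explicit what is being computed:

```python
import numpy as np

def psi_stats_mc(mu, var, Z, kern, n_samples=1000):
    """MC estimates of psi = <tr Khh>, Psi = <Khu>, Phi = <Kuh Khu>
    under q(h) = N(mu, diag(var)), with inducing inputs Z."""
    rng = np.random.default_rng(0)
    N, M = len(mu), len(Z)
    psi, Psi, Phi = 0.0, np.zeros((N, M)), np.zeros((M, M))
    for _ in range(n_samples):
        H = mu + np.sqrt(var) * rng.standard_normal(mu.shape)   # sample q(h)
        Khu = kern(H, Z)
        psi += np.trace(kern(H, H)) / n_samples
        Psi += Khu / n_samples
        Phi += Khu.T @ Khu / n_samples
    return psi, Psi, Phi
```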


Page 24: VC for Deep GP - Points

1. All the terms of the given bound are tractable, including the KL term.
2. However, tractability depends on the selected covariance function (as in the GP-LVM) and on how easily it can be convolved with q(h_i).
3. A gradient-based optimisation method can be used to maximise the final form of the variational lower bound (see the sketch below) w.r.t.:
   - model parameters: {σ_i², θ_i}_{i=2}^{l+1}
   - variational parameters: {Z_i, m_i, S_i, μ_{i+1}, Σ_{i+1}}_{i=1}^l
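
A hedged sketch of the optimisation pattern: flatten all free parameters into one vector and hand the negative bound to an off-the-shelf optimiser. Here the titsias_bound sketch from Page 16 stands in for the deep-GP bound; the parameterisation is an illustrative assumption:

```python
import numpy as np
from scipy.optimize import minimize

def neg_bound(theta, X, y, M):
    """Negative lower bound as a function of a flat parameter vector."""
    q = X.shape[1]
    Z = theta[:M * q].reshape(M, q)                   # inducing inputs
    log_tau2, log_omega, log_sigma2 = theta[M * q:]   # log hyper-parameters
    kern = lambda A, B: np.exp(log_tau2) * np.exp(
        -0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / np.exp(log_omega) ** 2)
    return -titsias_bound(y, X, Z, kern, np.exp(log_sigma2))

# theta0 = np.concatenate([Z0.ravel(), np.zeros(3)])
# res = minimize(neg_bound, theta0, args=(X, y, M), method="L-BFGS-B")
```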


Page 25: Elliptic PDE Example

−∇·( a(ω, x) ∇u(ω, x) ) = f(·)  in D
u(ω, x) = 0  on ∂D

The physical domain is D = [0, 1]². The log-conductivity Z(ω, x) = log(a(ω, x)) is a Gaussian random field with covariance

C(x_1, x_2) = σ_rf² exp( −Σ_{i=1}^{k_I} (x_{1,i} − x_{2,i})² / λ )

We generate N = 250 realisations of Z by truncating the KLE at q_1 = 50 terms, with λ = 0.1.

The boundary problem is then solved with FEM over a 16 × 16 grid; the response is observed on a 20 × 20 grid.
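
A sketch of drawing one realisation of Z = log(a) via a truncated (discrete) Karhunen-Loève expansion on a grid; the eigendecomposition of the gridded covariance stands in for the continuous KLE, and σ_rf² = 1 is an assumed value:

```python
import numpy as np

n, n_terms, lam = 16, 50, 0.1
g = np.linspace(0.0, 1.0, n)
grid = np.stack(np.meshgrid(g, g), -1).reshape(-1, 2)       # n^2 points in D
d2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(-1)
C = np.exp(-d2 / lam)                                       # sigma_rf^2 = 1 (assumed)
w, V = np.linalg.eigh(C)                                    # discrete KLE
w, V = w[::-1][:n_terms], V[:, ::-1][:, :n_terms]           # 50 leading terms
xi = np.random.default_rng(0).standard_normal(n_terms)      # KLE coefficients
Z = V @ (np.sqrt(np.clip(w, 0.0, None)) * xi)               # one realisation of log a
a = np.exp(Z).reshape(n, n)                                 # conductivity field
```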


Page 26: Input & Output Realisations of the Elliptic PDEs

Training data: D = {(Z_r, u_r), r = 1, . . . , 200}

[Figure: four sampled input fields Z_r over [0, 1]² and the four corresponding FEM solution fields u_r.]

Page 27: Hidden-Layer Demonstrations - Elliptic Problem

[Figure: relevance weights of the latent dimensions in Layer 1 and Layer 2, and scatter plots of the dominant latent dimensions at each layer.]

Page 28: Posterior Mean and Variance of Response - Elliptic Problem

[Figure: an input realisation, the posterior mean of the response, and the posterior variance fields over [0, 1]²; the variance is of order 10⁻⁵ to 10⁻³.]

Page 29: Mean of Variance & Variance of Mean - Elliptic Problem

[Figure: mean of the predictive variance and variance of the predictive mean of the response over [0, 1]²; the latter is of order 10⁻⁵.]

Page 30: Flow Through Porous Media

∇·u = 0,    u = −K(x, ω) ∇p,    ∀x ∈ X_s = [0, 1]²
p = 1 − x_1  on ∂X_s

Deterministic solver: mixed FEM on a 20 × 20 grid; the response is observed on a 20 × 20 grid.

G(x, ω) = log(K(x, ω)) is an exponential random field with

COV_G(x_s1, x_s2) = s_G² exp{ −Σ_{k=1}^{k_s} |x_{s1,k} − x_{s2,k}| / λ_k }

We employ the KLE on G and truncate it after 50 terms, with λ_k = 0.1.

Page 31: Input & Output Realisations of the Permeability Problem

400 data points are generated; the first 300 are used for training the deep GP.

[Figure: four sampled log-permeability fields G over [0, 1]² and the four corresponding pressure fields p.]

Page 32: Hidden-Layer Demonstrations - Permeability Problem

A deep GP with 2 hidden layers is fitted to the data, with K = 80 inducing variables.

[Figure: relevance weights of the latent dimensions in Layer 1 and Layer 2, and scatter plots of the dominant latent dimensions at each layer.]

Page 33: Posterior Mean and Variance of Pressure - Permeability Problem

[Figure: posterior mean of the pressure field over [0, 1]² and the corresponding posterior variance fields; the variance is of order 10⁻⁵ to 10⁻².]

Page 34: Mean of Variance & Variance of Mean of Pressure

[Figure: mean of the predictive variance (of order 10⁻³) and variance of the predictive mean (of order 10⁻⁴) of the pressure over the unit square.]

Page 35: Summary

- The final model is not a GP!
- The deep GP provides a probabilistic approximation of the model, which is useful for UQ and also guards against overfitting.
- Deep GPs allow both unsupervised and supervised deep learning.
- With deep GPs, the curse of dimensionality is no longer an issue.
- Variational compression algorithms show promise for scaling these models to massive data sets.
- Sampling is straightforward.