Transcript of "Exploiting code structure for statistical learning" (62 pages)

Page 1:

Exploiting code structure for

statistical learning

G. Perrin⋆

in collaboration with J. Garnier†, C. Haberstich⋆,◇, S. Marque-Pucheu⋆ and A. Nouy◇.

⋆ CEA/DAM/DIF, Arpajon, France; † CMAP, Ecole Polytechnique, Palaiseau, France; ◇ LMJL, ECN, Nantes, France

CMAP ∣ March 7th, 2019

Page 2:

Taking advantage of ever-increasing computational resources, the importance of simulation keeps growing.

It is now fully integrated into most of the decision-making processes of our society.

Thus, simulation not only has to be descriptive, it also needs to be predictive.

In the following, let us focus on a system S, whose design (dimensions, materials, initial conditions...) is characterized by a vector x ∈ X, and whose behavior is analyzed through the response function y(x) ∈ Y. To predict the value of y(x), we assume we have access to a parametric simulator, ysim.

An increasing role for simulation in our society

Page 3:

ytrue(xtrue) =?

Uncertainty quantification to generate trust

Page 4:

ysim(xmes; βnom, νnom) = 0.42367589235124

ytrue(xtrue) = 0.42367589235124?

Uncertainty quantification to generate trust

Page 5:

ysim(xmes;βnom,νnom + εν) = 0.42367589235124

ytrue(xtrue) ∈ [0.42367589235 ± 10^-11]?

⇒ numerical uncertainties (finite arithmetic, compilation, resolution schemes...),

Uncertainty quantification to generate trust

Page 6:

ysim(xmes;βnom + εβ ,νnom + εν) = 0.42367589235124

ytrue(xtrue) ∈ [0.4236 ± 10^-4]?

⇒ numerical uncertainties (finite arithmetic, compilation, resolution schemes...),

⇒ parametric uncertainties (physical parameters, code thresholds...),

Uncertainty quantification to generate trust

Page 7:

ysim(xmes + εxmes; βnom + εβ, νnom + εν) = 0.42367589235124

ytrue(xtrue) ∈ [0.42 ± 10^-2]?

⇒ numerical uncertainties (finite arithmetic, compilation, resolution schemes...),

⇒ parametric uncertainties (physical parameters, code thresholds...),

⇒ experimental uncertainties (construction tolerance, boundary and initial conditions...).

Uncertainty quantification to generate trust

Page 8:

(ysim + εmod)(xmes + εxmes; βnom + εβ, νnom + εν) = 0.42367589235124

ytrue(xtrue) ∈ [0.4 ± 10^-1]?

⇒ numerical uncertainties (finite arithmetic, compilation, resolution schemes...),

⇒ parametric uncertainties (physical parameters, code thresholds...),

⇒ experimental uncertainties (construction tolerance, boundary and initial conditions...).

⇒ model uncertainties (simplifications, missing phenomena...).

Uncertainty quantification to generate trust

Page 9:

(ysim + εmod)(xmes + εxmes; βnom + εβ, νnom + εν) = 0.42367589235124

ytrue(xtrue) ∈ [0.4 ± 10^-1]?

⇒ numerical uncertainties (finite arithmetic, compilation, resolution schemes...),

⇒ parametric uncertainties (physical parameters, code thresholds...),

⇒ experimental uncertainties (construction tolerance, boundary and initial conditions...).

⇒ model uncertainties (simplifications, missing phenomena...).

To be predictive, simulation needs to be able to take into account the different sources of uncertainty!

Uncertainty quantification to generate trust

Page 10:

To identify all these sources of uncertainty, we need:

DATA

Big data vs Scarce data

Page 11:

To identify all these sources of uncertainty, we need:

BIG DATA (in theory)

Big data vs Scarce data

Page 12:

To identify all these sources of uncertainty, we need:

BIG ideas for small DATA (in practice)

Big data vs Scarce data

Page 13:

To identify all these sources of uncertainty, we need:

BIG ideas for small DATA (in practice)

as at CEA,

the time for one code evaluation is generally around 10,000 CPU hours...

and the cost per experiment on one of the main experimental facilities (EPURE, LMJ) easily exceeds 1 million euros...

Big data vs Scarce data

Page 14:

LMJ: towards inertial confinement fusion...

EPURE: controlling the implosion of non-spherical systems...

Two examples...

Page 15:

ytrue(xtrue) = (ysim + εmod)(xmes + εxmes; βnom + εβ, νnom + εν)
             = (ymes + εymes)(xmes + εxmes)

1 Sensitivity analysis: find the elements of x, β and ν that play the most important roles in the variability of ysim(x; β, ν).

2 Code validation: given a set of (noisy) experimental measurements, identify εβ, εν and εmod.

3 Robust optimization: find x⋆ (using the stochastic simulator ysim) such that C(ytrue(x⋆)) is minimal, with C a cost function.

4 System certification: guarantee (using the stochastic simulator ysim) that the risk that ytrue(x⋆) exceeds a threshold S is lower than α (with a satisfying confidence).

Some challenges of VV-UQ approaches

Page 16:

To solve the former problems, we generally need a huge number of code evaluations.

Hence, when confronted with very costly simulators, the code response has to be replaced by a surrogate model (or metamodel).

ysim = ymeta + εmeta.

⇒ Warning ! Replacing the true code by its surrogate introduces another source of uncertainty that also needs to be taken into account.

The crucial role of statistical learning in VV-UQ

Page 17:

Classical formalism

The code is modeled as a black box (point-wise approach), whose response is supposed to be square integrable over X.

The maximal information consists of N code evaluations (x(n), y(x(n))), 1 ≤ n ≤ N (constraint on the maximal budget).

The points x(n), 1 ≤ n ≤ N, generally correspond to a set of iid elements or to a deterministic space-filling design.

Objective: based on X = (x(n), y(x(n))), 1 ≤ n ≤ N, construct a predictor ŷ such that ∥y − ŷ∥L2 is minimal.
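
Once a predictor is available, this L2 objective can be estimated by plain Monte Carlo. A minimal sketch (the functions y and ŷ below are made up for illustration, and a uniform measure on X = [0, 1]^d is assumed):

```python
import numpy as np

def relative_l2_error(y, y_hat, d, n_mc=100_000, seed=0):
    """Monte Carlo estimate of ||y - y_hat||_L2 / ||y||_L2 under a uniform measure on [0, 1]^d."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=(n_mc, d))
    num = np.mean((y(x) - y_hat(x)) ** 2)
    den = np.mean(y(x) ** 2)
    return np.sqrt(num / den)

# Toy check with an exactly known function and a crude approximation of it.
y = lambda x: np.sin(np.pi * x[:, 0]) * x[:, 1]
y_hat = lambda x: (np.pi * x[:, 0] - (np.pi * x[:, 0]) ** 3 / 6) * x[:, 1]  # 3rd-order Taylor of sin
print(relative_l2_error(y, y_hat, d=2))
```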

Reference framework for statistical learning in computer experiments

Page 18:

There exists a wide variety of surrogate models. In the rest of this presentation, we will focus on:

Gaussian process regression (or kriging) → "GPR":

⇒ y is a realization of a Gaussian process Y ∼ GP(µ(⋅; θµ), C(⋅, ⋅; θC)).
⇒ ŷ = E[Y ∣ X, θµMLE, θCMLE] ("plug-in" approach).

Tree-based composed functions → "TBCF":

⇒ Let T be a given dimension tree over D = {1, . . . , d}.
⇒ Let FT be the set of functions written as a series of compositions of functions defined on subspaces of X controlled by T. For instance:

y(x) = f{1,2,3,4,5}(f{1,2,3}(x1, f{2,3}(x2, x3)), f{4,5}(x4, x5))

⇒ ŷ is the (empirical) projection (using X) of y on FT.
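
A minimal sketch of such a "plug-in" GPR predictor, here built with scikit-learn (the test function, design size and kernel choice are purely illustrative):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

rng = np.random.default_rng(1)
d, N = 3, 40
X_train = rng.uniform(-1.0, 1.0, size=(N, d))              # crude iid design on X = [-1, 1]^d
y = lambda x: np.sin(np.pi * x[:, 0]) + 0.5 * x[:, 1] * x[:, 2]
y_train = y(X_train)

# Plug-in GPR: hyperparameters set to their marginal-likelihood estimates, posterior mean used as predictor.
kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(d))
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X_train, y_train)

X_test = rng.uniform(-1.0, 1.0, size=(2000, d))
y_hat, y_std = gpr.predict(X_test, return_std=True)         # posterior mean and pointwise uncertainty
err = np.sqrt(np.mean((y(X_test) - y_hat) ** 2) / np.mean(y(X_test) ** 2))
print(f"relative L2 error: {err:.3e}")
```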

Different types of surrogate models

Page 19:

Problem 1

Complex codes often aggregate several "black-box codes" ⇒ how to integrate information about the code's inner structure to improve the prediction capacity of ŷ?

⇒ The inner structure may be unknown.

⇒ The intermediary inputs/outputs may be high dimensional.

⇒ It may not be possible to have access to the values of the intermediary inputs/outputs.

⇒ It may not be possible to separately call each code of the network.

From single black-box codes to code networks

Page 20:

If X = [0,1]^d, x1, x2 ∼ U(X) and Z = ∥x1 − x2∥₂², then

E[Z] = d/6 → +∞ as d → +∞,   δ²(Z) = Var(Z) / E[Z]² = 1.4/d → 0 as d → +∞.
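
A quick Monte Carlo check of these two limits (the sample sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 500):
    x1 = rng.uniform(size=(10_000, d))
    x2 = rng.uniform(size=(10_000, d))
    z = np.sum((x1 - x2) ** 2, axis=1)   # squared distances between independent uniform points
    print(f"d={d:4d}  E[Z]≈{z.mean():7.2f} (d/6={d/6:7.2f})   "
          f"δ²(Z)≈{z.var() / z.mean() ** 2:.4f} (1.4/d={1.4 / d:.4f})")
```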

Problem 2

When the dimension of the input space increases, all points are far from one another ⇒ which (non-space-filling) strategy can be proposed to improve the prediction capacity of ŷ when d increases?

⇒ There may be constraints on the input space.

⇒ We may be interested in particular values of y (around a threshold for certification, a minimum value for design, ...).

⇒ It may be useful to exploit the code's inner structure to adapt the sampling strategies.

From static learning to active learning

Page 21:

1 Introduction

2 GPR when the intermediary values are available

3 TBCF when the intermediary values are not available

4 Conclusion

Outline

Page 22:

We consider a function y(x1, . . . , xd) in the space H = L²µ(X, R^dy) of square-integrable functions with values in R^dy, defined on X = X1 × ⋯ × Xd equipped with a probability measure µ = µ1 ⊗ ⋯ ⊗ µd.

The inner product and the norm in H are respectively written ⟨⋅, ⋅⟩H and ∥⋅∥H, such that for all u, v ∈ H,

⟨u, v⟩H = ∫X u(x)ᵀ v(x) dµ(x),   ∥u∥²H = ⟨u, u⟩H.

Each evaluation of y is supposed to be time consuming.

In the following, X = [−1, 1]^d and dµ(x) = dx / 2^d.

Theoretical framework

Page 23:

We assume that the code associated with x ↦ y(x) consists of a series of nested codes.

The inner structure is supposed to be known (without loops), and the intermediary results are available.

The intermediary quantities may be scalars or vectors.

It is possible to launch each code of the network separately.

Figure: Example of a code network

Existence of a code structure

Page 24:

Each function yi is considered as a particular realization of a GP Yi.

Each GP is conditioned by a series of Ni code evaluations, and we write µi + εi = Yi ∣ (xi(ni), yi(xi(ni))), 1 ≤ ni ≤ Ni, with εi the (non-stationary) centered GP characterizing the metamodel uncertainty.

Figure: Replacement of the code functions by their GP surrogates

GP approximation of the code network

Page 25:

The combination of Gaussian processes is not Gaussian. Let Ynest ∣ X be the nested random process that aggregates all these conditioned GPs.

Using sampling techniques (such as Monte Carlo sampling), it is possible to propagate the surrogate uncertainties through the GP network and to compute an empirical estimate of the mean of Ynest ∣ X based on M independent realizations of each GP, written {Ynest(ωm) ∣ X, 1 ≤ m ≤ M}. This gives:

ŷ = (1/M) ∑_{m=1..M} Ynest(ωm) ∣ X.
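
A pointwise sketch of this Monte Carlo propagation for two nested codes y(x) = y2(y1(x1), x2), with scikit-learn GPs standing in for the conditioned processes (the node functions and designs are illustrative, and the dependence between prediction points is ignored for brevity):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(2)
y1 = lambda x1: np.sin(np.pi * x1[:, 0])                    # first (hypothetical) code
y2 = lambda x2, z1: np.exp(-z1 ** 2) + 0.3 * x2[:, 0]       # second code, fed by x2 and the output of code 1

# Condition a GP on each node separately (small illustrative designs).
X1 = rng.uniform(-1, 1, size=(15, 1))
gp1 = GaussianProcessRegressor(RBF(0.5), normalize_y=True).fit(X1, y1(X1))

X2 = rng.uniform(-1, 1, size=(15, 1))
Z1_design = rng.uniform(-1, 1, size=(15, 1))                # intermediary values covering the range of y1
gp2 = GaussianProcessRegressor(RBF([0.5, 0.5]), normalize_y=True).fit(
    np.hstack([X2, Z1_design]), y2(X2, Z1_design[:, 0]))

def nested_mean(x1, x2, M=500):
    """Pointwise MC estimate of the mean of Y_nest at x = (x1, x2), by sampling each conditioned GP."""
    m1, s1 = gp1.predict(np.array([[x1]]), return_std=True)
    z1 = rng.normal(m1[0], s1[0], size=M)                   # M realizations of the intermediary output
    m2, s2 = gp2.predict(np.column_stack([np.full(M, x2), z1]), return_std=True)
    return rng.normal(m2, s2).mean()                        # average of the sampled nested outputs

print(nested_mean(0.3, -0.2), y2(np.array([[-0.2]]), y1(np.array([[0.3]])))[0])
```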

Construction of the predictor

Page 26:

If the dimension dyi of yi is (much) higher than that of xi, the dimensions of the input spaces can quickly increase, making the learning more and more difficult.

To avoid this problem, it is often interesting to exploit potential correlations between the components of yi.

In each node of the network, this amounts to searching for the (dyi × dyi)-dimensional matrix Di, with rank ri ≪ dyi, such that:

∥yi+1(⋅, Di yi) − yi+1(⋅, yi)∥²H is minimal.

Hence, Di is supposed to condense the statistical information of yi in a small number of parameters while being adapted to yi+1.

Some words on reduction techniques (1/2)

Page 27:

To find interesting approximations of Di using a reduced number of code evaluations, we proposed in [Marque-Pucheu et al., 2019] to follow a two-step procedure, introducing a linear relationship between yi and yi+1. Di is therefore searched as the rank-ri matrix such that:

Ai = argmin_{A ∈ M(dyi+1 × dyi)(R)}  ∑_{ni=1..Ni} ∥yi+1(xi+1(ni), yi(xi(ni))) − A yi(xi(ni))∥₂²,

Di = argmin_{rank(D) ≤ ri}  ∑_{ni=1..Ni} ∥yi+1(xi+1(ni), yi(xi(ni))) − Ai D yi(xi(ni))∥₂²,

where M(dyi+1 × dyi)(R) is the set of (dyi+1 × dyi)-dimensional real matrices.

⇒ These problems can be solved explicitly (introducing regularizations).

⇒ Ai can be seen as the mean gradient of yi+1 with respect to yi.

⇒ Constraints can be added to Ai to integrate information about yi+1 (causality, monotonicity...).
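
A rough numpy sketch of the two-step idea on synthetic data; here Ai is fitted by ridge least squares and the rank-ri step uses a truncated SVD of Ai, which is only a heuristic stand-in for the regularized closed-form solutions of [Marque-Pucheu et al., 2019]:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d_y1, d_y2, r = 60, 30, 6, 3

# Synthetic data: high-dimensional intermediary outputs y_i driven by r latent directions,
# and next-code outputs y_{i+1} that only depend on y_i through those directions.
latent = rng.normal(size=(N, r))
Y_i = latent @ rng.normal(size=(r, d_y1)) + 0.01 * rng.normal(size=(N, d_y1))
Y_ip1 = latent @ rng.normal(size=(r, d_y2)) + 0.01 * rng.normal(size=(N, d_y2))

# Step 1: ridge least-squares estimate of the "mean gradient" A_i of y_{i+1} with respect to y_i.
lam = 1e-6
A = np.linalg.solve(Y_i.T @ Y_i + lam * np.eye(d_y1), Y_i.T @ Y_ip1).T   # shape (d_y2, d_y1)

# Step 2: rank-r_i reduction D_i, here a projection onto the top right singular vectors of A_i.
_, _, Vt = np.linalg.svd(A, full_matrices=False)
D = Vt[:r].T @ Vt[:r]                                                     # (d_y1, d_y1) projector of rank r

resid = np.linalg.norm(Y_ip1 - Y_i @ (A @ D).T) / np.linalg.norm(Y_ip1)
print(f"relative residual after the rank-{r} reduction: {resid:.3e}")
```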

Some words on dimension reduction (2/2)

Page 28:

Until now, each GP Yi is supposed to be conditioned by the evaluations of yi at the points x(ni), 1 ≤ ni ≤ Ni.

The choice of these Ni points is generally motivated by the fact that the pointwise error between yi and its GP approximation is controlled by the minimax criterion CmM (up to a normalization of the inputs), which reads:

CmM(x(1), . . . , x(Ni)) = max_{x ∈ Xi} min_{1 ≤ ni ≤ Ni} ∥x − x(ni)∥₂.

Hence, the points x(ni), 1 ≤ ni ≤ Ni, most of the time correspond to space-filling designs (with or without constraints on the projections onto subspaces of Xi, see [Perrin and Cannamela, 2017]).
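
This criterion is easy to estimate numerically; a sketch in which the max over Xi is approximated by a dense random reference sample (the designs below are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

def minimax_criterion(design, n_ref=50_000, seed=0):
    """Approximate C_mM(design) = max_x min_n ||x - x^(n)||_2 on [0, 1]^d via a dense reference sample."""
    rng = np.random.default_rng(seed)
    ref = rng.uniform(0.0, 1.0, size=(n_ref, design.shape[1]))
    return cdist(ref, design).min(axis=1).max()

rng = np.random.default_rng(4)
iid_design = rng.uniform(0.0, 1.0, size=(20, 2))
grid_design = np.stack(np.meshgrid(np.linspace(0.1, 0.9, 5), np.linspace(0.1, 0.9, 4)), -1).reshape(-1, 2)
print(minimax_criterion(iid_design), minimax_criterion(grid_design))  # the more space-filling design scores lower
```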

Design of experiments and error control

Page 29:

To reduce the prediction error on the final output, a first idea consists in minimizing the errors in each node of the network.

Hence, the objective is to add new points in the empty regions of the different input spaces.

[Diagram: two-node code network with inputs x1, x2, intermediary output y1 and final output y2.]

Code networks add errors (⇒ local enrichment)

Page 30:

To reduce the prediction error on the final output, a first idea consists in minimizing the errors in each node of the network.

Hence, the objective is to add new points in the empty regions of the different input spaces.

[Diagram: two-node code network with inputs x1, x2, intermediary output y1 and final output y2.]

To avoid useless code evaluations, the surrogate models before enrichment can be used to generate a huge number of points that are likely to be in the output spaces at the different nodes. Clustering methods can then allow selecting interesting batches of points for the enrichment.

Code networks add errors (⇒ local enrichment)

Page 31:

As an alternative, sequential designs based on a stepwise uncertainty reduction (SUR) can be proposed [Marque-Pucheu et al., 2018]:

If the codes cannot be run separately → Chained I-optimal:

xnew = argmin_{x ∈ X} ∫X V(Ynest(x′) ∣ X ∪ x) dx′,

If the codes can be run separately → Best I-optimal:

(inew, xnew_inew) = argmax_{xi ∈ Xi, i ≤ d} (1/τi) ∫X [V(Ynest(x′) ∣ X) − V(Ynest(x′) ∣ X ∪ xi)] dx′,

where

(xi, Xi) := (xi, Xi) if we are at a leaf of the code network,
            ((xi, yj), Xi × µj(Xj)) if we are at a node,

and τi is the computational cost of code i.

Code networks compensate errors (⇒ global enrichment)

Page 32:

xnew = argmin_{x ∈ X} E[V(Ynest(x′) ∣ X ∪ x)],   x′ ∼ U(X).

✗ There is no reason for x ↦ E[V(Ynest(x′) ∣ X ∪ x)] to be convex.

✗ As Ynest is not Gaussian, there is a priori no closed-form expression for E[V(Ynest(x′) ∣ X ∪ x)].

Remarks on the optimization problem

Page 33:

xnew = argmin_{x ∈ X} E[V(Ylin_nest(x′) ∣ X ∪ x)],   x′ ∼ U(X).

✗ There is no reason for x ↦ E[V(Ynest(x′) ∣ X ∪ x)] to be convex.

✓ Ylin_nest is the Gaussian approximation of Ynest based on a series of Taylor expansions, such that:

(µi+1 + εi+1)(xi+1, (µi + εi)(xi)) ≈ µi+1(xi+1, µi(xi)) + (∂µi+1/∂yi)(xi+1, µi(xi)) εi(xi) + εi+1(xi+1, µi(xi)).

Its variance is therefore explicitly known, and the former optimization problem becomes a classical robust optimization problem.

✓ No code evaluation is required to compute E[V(Ylin_nest(x′) ∣ X ∪ x)] for any x ∈ X.
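
A tiny numeric check of this Taylor-based variance, on a scalar toy composition with known means and independent Gaussian residuals (all values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
mu1, sig1 = 0.4, 0.05             # posterior mean / std of the inner GP at some x1
sig2 = 0.02                       # residual std of the outer GP at (x2, mu1)
mu2 = lambda z: np.exp(-3.0 * z)  # outer posterior mean as a function of the intermediary value
dmu2 = lambda z: -3.0 * np.exp(-3.0 * z)

# Monte Carlo variance of the (non-Gaussian) nested process vs. the linearized (Gaussian) one.
eps1 = rng.normal(0.0, sig1, size=200_000)
eps2 = rng.normal(0.0, sig2, size=200_000)
var_mc = np.var(mu2(mu1 + eps1) + eps2)
var_lin = dmu2(mu1) ** 2 * sig1 ** 2 + sig2 ** 2
print(var_mc, var_lin)            # close when sig1 is small, i.e. when the Taylor expansion is accurate
```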

Remarks on the optimization problem

Page 34:

[Figure: Code 1 (y1 vs x1), Code 2 (y2 vs y1) and Code 3 (y3 vs y2), each showing the code output and its polynomial trend, together with the nested code response y3 vs x1.]

y(x) = y3(y2(y1(x1))).

Analytical example (1/2)

Page 35:

Sequential vs clustering-based designs / Distribution (Best)

[Figure: left, log L2 error ∥y − ŷ∥ versus computational cost for the clustering-based, Chained I-optimal and Best I-optimal designs; right, number of runs of Code 1, Code 2 and Code 3 versus computational cost.]

Analytical example (2/2)

Page 36:

Physical problem

y(x) = y2(x2, y1(x1)): coupling of a detonation code with a structural dynamics code ⇒ functional outputs.

Figure: (a) tank, (b) first code, (c) second code.

"Blind-box" ↔ the code's inner structure is not taken into account.

Industrial example (1/3)

Page 37:

First code (detonation):
  (x1)1  Radius of the explosive charge (m)
  (x1)2  Temporal dilatation parameter (-)
  (x1)3  Shock magnitude parameter (-)
  (x1)4  Attenuation parameter (-)
  y1     Pressure generated by the shock wave

[Figure: y1(⋅, t) versus t]

Second code (structural dynamics):
  (x2)1  Internal radius of the tank (m)
  (x2)2  Thickness (m)
  (x2)3  Young modulus of the tank (Pa)
  (x2)4  Young modulus of the tap (Pa)
  y2     Maximal Von Mises stress over the inner surface of the tank

[Figure: y2(⋅, t) versus t]

Industrial example (2/3)

Page 38:

Prediction accuracy along the sequential designs:

[Figure: log10(∥y − ŷ∥L2) versus computational cost for the Blind-box I-optimal, Linearized Chained I-optimal and Linearized Best I-optimal designs.]

Industrial example (3/3)

Page 39:

1 Introduction

2 GPR when the intermediary values are available

3 TBCF when the intermediary values are not available

4 Conclusion

Outline

Page 40:

We assume that the code associated with x ↦ y(x) consists of a series of (time-consuming) nested codes.

The inner structure is supposed to be known (without loops), and the intermediary results are available.

The intermediary quantities may be scalars or vectors.

It is possible to launch each code of the network separately.

Figure: Example of a code network

Existence of a code structure

Page 41:

We assume that the code associated with x ↦ y(x) consists of a series of (time-consuming) nested codes.

The inner structure is supposed to be known (without loops), BUT the intermediary results are NOT available.

The intermediary quantities may be scalars or vectors.

It is NOT possible to launch each code of the network separately.

Figure: Example of a code network

Existence of a code structure

Page 42:

Let us first consider the case of two nested codes, where x1 feeds the first code and its output, together with x2, feeds the second:

y(x) := y2(y1(x1), x2),

with x := (x1, x2) ∈ R^(d1+d2), y1(x1) ∈ R^dy1, y(x) ∈ R.

Problem

If there is no observation of the intermediary variable y1, how can we take into account the available prior knowledge of the nested structure of y for the construction of its predictor?

The case of two nested codes (1/3)

Page 43:

In theory, there is no obstacle to replacing y1 and y2 by Gaussian processes, whose statistical moments are controlled by a reduced number of deterministic parameters:

y1 ∼ GP(µ1(θµ1), C1(θC1)),
y2 ∼ GP(µ2(θµ2), C2(θC2)).

Constructing a predictor for y amounts to searching for the values of θµ1, θC1, θµ2 and θC2 that best fit the available data (gathered in X = (x(n), y(x(n))), 1 ≤ n ≤ N).

✗ As the composition of two Gaussian processes is not Gaussian, the likelihood L(θµ1, θC1, θµ2, θC2) is not easy to evaluate (integral in dimension N × dy1...).

✗ Even if the likelihood were easy to compute, its maximization would be extremely challenging.

The case of two nested codes (2/3)

Page 44:

To solve the former problem, a series of simplifications can be found in [Cutajar et al., 2016] (factorization of the likelihood, use of mini-batches, approximations of the derivatives, Gaussian priors...), but they generally require big datasets (N > 10^4). To deal with problems with relatively few data (N ∼ 10^2 − 10^3), the following parametrization presents many advantages:

y(x1, x2) ∼ GP(µ2(x2, µ1(x1; θµ1); θµ2), C1,2((x1, x2), (x1′, x2′); θC1,2)).

✓ As the code structure is taken into account by the mean function only, the likelihood becomes explicit.

✓ If the relationship between µ2 and θµ2 is linear, the maximum likelihood estimator of θµ2 can be written as a function of θµ1 and θC1,2.

✗ (θµ1, θC1,2) ↦ L(θµ1, θC1,2) can be strongly non-convex ⇒ heuristics were proposed in [Perrin et al., 2017] to find relevant values of θµ1, θC1,2 at a reasonable computational time.
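
A simplified two-stage heuristic in the spirit of this parametrization: the nested mean µ2(x2, µ1(x1; θµ1); θµ2) is first fitted by nonlinear least squares, and a GP is then fitted to the residuals. This is only a sketch with illustrative trend functions, not the coupled maximum-likelihood procedure of [Perrin et al., 2017]:

```python
import numpy as np
from scipy.optimize import least_squares
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(6)
N = 60
X = rng.uniform(-1, 1, size=(N, 2))                                  # columns: x1, x2
y_code = lambda x: np.cos(2.0 * x[:, 0]) * (1.0 + 0.5 * x[:, 1])     # "true" nested code, unknown in practice
y_obs = y_code(X)

# Nested parametric mean: mu1(x1; t) = t0 + t1*x1 + t2*x1^2, mu2(x2, z; t) = t3*z + t4*z*x2.
def nested_mean(theta, x):
    z = theta[0] + theta[1] * x[:, 0] + theta[2] * x[:, 0] ** 2
    return theta[3] * z + theta[4] * z * x[:, 1]

theta = least_squares(lambda t: nested_mean(t, X) - y_obs, np.ones(5)).x  # stage 1: fit the nested trend

resid = y_obs - nested_mean(theta, X)
gp = GaussianProcessRegressor(RBF([0.5, 0.5]), normalize_y=True).fit(X, resid)  # stage 2: GP on residuals

X_test = rng.uniform(-1, 1, size=(1000, 2))
y_hat = nested_mean(theta, X_test) + gp.predict(X_test)
print(np.sqrt(np.mean((y_hat - y_code(X_test)) ** 2)))
```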

The case of two nested codes (3/3)

Page 45:

It is clear that the more nodes there are in the code network, the more trouble we will have trying to adapt the former methodology.

In the same manner, the more complex the code network, the more sensitive the results will be to the data and to the values of the parameters.

⇒ let us try tree-based composed functions (TBCF) instead of GPs to stabilize the learning!

To this end, the space H = L²µ(X, R^dy) is identified with a tensor product of Hilbert spaces Hi of functions defined on Xi:

H = H1 ⊗ ⋯ ⊗ Hd.

In the following, we focus on the case where the dataset is not already generated for the statistical learning [Nouy, 2019] (alternative algorithms could be proposed for cases where the dataset is already generated, see [Grelier et al., 2019]).

Limitations when dealing with more complex code networks

Page 46:

y(x) = f{1,2,3,4,5}(f{1,2,3}(x1, f{2,3}(x2, x3)), f{4,5}(x4, x5))

Leaves to root active learning strategy

Page 47:

y(x) = f{1,2,3,4,5}(f{1,2,3}(x1, f{2,3}(x2, x3)), f{4,5}(x4, x5))

For each leaf i (here, i ∈ {1, 2, 3, 4, 5}) of the tree, we introduce an approximation space Hi (Legendre polynomials for instance). We then search for Ui, the ri-dimensional principal subspace of Hi for y.

Leaves to root active learning strategy

Page 48:

y(x) = f{1,2,3,4,5}(f{1,2,3}(x1, f{2,3}(x2, x3)), f{4,5}(x4, x5))

For each leaf i (here, i ∈ {1, 2, 3, 4, 5}) of the tree, we introduce an approximation space Hi (Legendre polynomials for instance). We then search for Ui, the ri-dimensional principal subspace of Hi for y.

For each intermediary node α (here, α ∈ {{1, 2, 3}, {2, 3}, {4, 5}}), we introduce the approximation space Hα = ⊗β∈S(α) Uβ, where S(α) denotes the "sons" of α. We then search for Uα, the rα-dimensional principal subspace of Hα for y.

Leaves to root active learning strategy

Page 49:

y(x) = f{1,2,3,4,5}(f{1,2,3}(x1, f{2,3}(x2, x3)), f{4,5}(x4, x5))

For each leaf i (here, i ∈ {1, 2, 3, 4, 5}) of the tree, we introduce an approximation space Hi (Legendre polynomials for instance). We then search for Ui, the ri-dimensional principal subspace of Hi for y.

For each intermediary node α (here, α ∈ {{1, 2, 3}, {2, 3}, {4, 5}}), we introduce the approximation space Hα = ⊗β∈S(α) Uβ, where S(α) denotes the "sons" of α. We then search for Uα, the rα-dimensional principal subspace of Hα for y.

The final approximation ŷ is given by the projection of y onto the tensor product space of the principal subspaces:

ŷ = ΠUD y,   UD = ⊗β∈S({1,...,d}) Uβ.

Leaves to root active learning strategy

Page 50:

For each node α ⊂ {1, . . . , d} (which can be a leaf or not), the approximation of the rα-dimensional principal subspace is based on the identification of y with a (bivariate) function z of the groups of variables xα and xαc, which admits the following singular value decomposition:

y(x) = z(xα, xαc) = ∑_{i=1..+∞} √λi uα,i(xα) uαc,i(xαc),   λi+1 ≤ λi.

(Simplified) algorithm

Let xαc(1), . . . , xαc(Mα) be Mα values of xαc chosen at random.

Let vα,1, . . . , vα,Qα be an orthonormal basis of the approximation space Vα.

Let U S Vᵀ be the SVD of the (Qα × Mα) matrix W gathering the LS coefficients of the projections of xα ↦ z(xα, xαc(m)) on vα,1, . . . , vα,Qα, each obtained using Pα (well-chosen) evaluations of y.

Uα ≈ Ûα := span{ ∑_{q=1..Qα} (U)q,1 vα,q , . . . , ∑_{q=1..Qα} (U)q,rα vα,q }.
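
A numpy sketch of this simplified algorithm for a single leaf α = {1} with a Legendre basis (the test function, Mα, Pα and the rank are illustrative; the basis is not normalized, which does not matter for the sketch):

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(7)
y = lambda x: np.exp(x[:, 0] * x[:, 1]) + np.sin(np.pi * x[:, 2])  # test function on [-1, 1]^3
alpha, alpha_c = [0], [1, 2]                                        # node alpha = {1}; complement {2, 3}
Q, M, P, r = 6, 8, 30, 2                                            # basis size, #fibers, #evals per fiber, rank

def basis(t):  # Legendre polynomials P_0, ..., P_{Q-1} evaluated at the points t
    return np.column_stack([legendre.legval(t, np.eye(Q)[q]) for q in range(Q)])

W = np.zeros((Q, M))
for m in range(M):
    x_c = rng.uniform(-1, 1, size=len(alpha_c))                     # random value of x_{alpha^c}
    t = rng.uniform(-1, 1, size=P)                                  # P evaluation points along x_alpha
    X = np.zeros((P, 3)); X[:, alpha] = t[:, None]; X[:, alpha_c] = x_c
    coef, *_ = np.linalg.lstsq(basis(t), y(X), rcond=None)          # LS coefficients of the fiber on the basis
    W[:, m] = coef

U, s, _ = np.linalg.svd(W, full_matrices=False)
print("singular values:", np.round(s, 3))                           # fast decay => low-dimensional principal subspace
U_alpha = U[:, :r]                                                   # coefficients (in the Legendre basis) spanning U_alpha
```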

Practical identification of the principal subspaces

Page 51:

[Figure: (a) space-filling design for GPR and (b) goal-oriented design for TBCF, in the (x1, x2, x3) input space [−1, 1]³.]

Focusing on the leaves, we notice an important difference between GPR and TBCF in the positions where the code is evaluated.

✓ Even if d is high, the different approximations are only made in small-dimensional spaces.

✓ The algorithm is quick and stable, as it is only based on a series of SVDs.

Goal-oriented designs vs. space-filling designs

Page 52:

[Figure: relative squared L2 error ∥y − ŷ∥²L2 / ∥y∥²L2 versus the number of evaluations N, for (c) g2D, (d) g3D and (e) g6D.]

Figure: Black: classical GPR with an optimized (but linear) polynomial trend. Red: GPR using a nested polynomial trend. Blue: TBCF with prescribed ranks.

g2D(x) = (1 − x1²) cos(7x1) × (1 − x2²) sin(5x2),
g3D(x) = sin(πx1) + 7 sin(πx2)² + 0.1 (πx3)⁴ sin(πx1),
g6D(x) = g(1) ○ g(2)(x),   g(1)(z) = 0.1 cos(∑_{i=1..6} zi) + ∑_{i=1..6} zi²,
g(2)(x) = (cos(πx1 + 1), cos(πx2 + 2), . . . , cos(πx6 + 6)).
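
For reference, these three test functions can be coded directly (inputs in [−1, 1]^d):

```python
import numpy as np

def g2d(x):
    return (1 - x[:, 0] ** 2) * np.cos(7 * x[:, 0]) * (1 - x[:, 1] ** 2) * np.sin(5 * x[:, 1])

def g3d(x):  # Ishigami-type function with rescaled inputs
    return np.sin(np.pi * x[:, 0]) + 7 * np.sin(np.pi * x[:, 1]) ** 2 \
        + 0.1 * (np.pi * x[:, 2]) ** 4 * np.sin(np.pi * x[:, 0])

def g6d(x):
    z = np.cos(np.pi * x + np.arange(1, 7))                      # g(2): component-wise cosines
    return 0.1 * np.cos(z.sum(axis=1)) + (z ** 2).sum(axis=1)    # g(1)
```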

Analytical examples

Page 53:

[Figure: relative squared L2 error ∥y − ŷ∥²L2 / ∥y∥²L2 versus the number of evaluations N, for (a) g2D, (b) g3D and (c) g6D.]

Figure: Black: classical GPR with an optimized (but linear) polynomial trend. Red: GPR using a nested polynomial trend. Blue: TBCF with prescribed ranks.

g2D(x) = (1 − x1²) cos(7x1) × (1 − x2²) sin(5x2),
g3D(x) = sin(πx1) + 7 sin(πx2)² + 0.1 (πx3)⁴ sin(πx1),
g6D(x) = g(1) ○ g(2)(x),   g(1)(z) = 0.1 cos(∑_{i=1..6} zi) + ∑_{i=1..6} zi²,
g(2)(x) = (cos(πx1 + 1), cos(πx2 + 2), . . . , cos(πx6 + 6)).

Analytical examples

Page 54:

Control of the error

There are two sources for the approximation error.

Under additional hypotheses on the regularity of y, it is possible to propose values of Mα guaranteeing that E[∥y − ΠUα y∥²H] ≤ C rα^(−p), p ≥ 1. For the moment, the quality of the empirical principal subspace Ûα is controlled using cross-validation procedures, and in practice we take Mα = O(rα).

The quality of the LS projection strongly depends on the positions of the points where y is evaluated. Boosted strategies based on optimal sampling [Cohen and Migliorati, 2017] can also be proposed to make the best compromise between error control (↔ E[∥z(⋅, xαc(m)) − ΠVα z(⋅, xαc(m))∥²] small) and efficiency (↔ Pα small).

Ongoing and future works (1/3)

Page 55:

Cases when the inner structure is unknown

The former algorithms assume that the code's inner structure is known.

If it is unknown, a tree-based tensor format has to be proposed a priori.

However, unsurprisingly, the approximation error can strongly depend on this choice of tree.

⇒ an important challenge for these TBCF is the development of efficient methods to infer the "optimal tree" for y.

Ongoing and future works (2/3)

Page 56:

y(x) = ∑_{i1=1..r{1,2,3,4,5}} c(i1){1,2,3,4,5} u(i1){1,2,3}(x{1,2,3}) u(i1){4,5}(x{4,5}),

u(i1){1,2,3}(x{1,2,3}) = ∑_{i21=1..r{1,2,3}} c(i21,i1){1,2,3} u(i21){1}(x1) u(i21){2,3}(x{2,3}),

u(i1){4,5}(x{4,5}) = ∑_{i22=1..r{4,5}} c(i22,i1){4,5} u(i22){4}(x4) u(i22){5}(x5),

u(i21){2,3}(x{2,3}) = ∑_{i3=1..r{2,3}} c(i3,i21){2,3} u(i3){2}(x2) u(i3){3}(x3),

⇒ y ∈ span(v{1}i1 × ⋯ × v{d}id, 1 ≤ i1 ≤ Q1, . . . , 1 ≤ id ≤ Qd).

When considering nested codes, it can be difficult to identify small-dimensional approximation spaces for the leaves when only considering multilinear representations. How can non-linearity be added while keeping stable identification algorithms?

Ongoing and future works (3/3)

Page 57:

1 Introduction

2 GPR when the intermediary values are available

3 TBCF when the intermediary values are not available

4 Conclusion

Outline

Page 58:

This presentation focuses on the prediction of the output of a code network.

Taking into account the intermediary observations can strongly reduce the prediction error when considering GP networks.

Being able to launch each node of the code network separately also opens opportunities to construct emulators at a reduced computational cost.

When the intermediary information is not available, the code structure can be taken into account for the statistical learning as a dimension tree, in order to approximate the code output in a tree-based tensor format.

Conclusion

Page 59:

Cohen, A. and Migliorati, G. (2017). Optimal weighted least-squares methods. SMAI Journal of Computational Mathematics, 3:181–203.

Cutajar, K., Bonilla, E., Michiardi, P., and Filippone, M. (2016). Random feature expansions for deep Gaussian processes. arXiv:1610.04386.

Grelier, E., Nouy, A., and Chevreuil, M. (2019). Learning with tree-based tensor formats. arXiv:1811.04455.

Marque-Pucheu, S., Perrin, G., and Garnier, J. (2018). Efficient sequential experimental design for surrogate modeling of nested codes. ESAIM: Probability and Statistics.

References I

Page 60:

Marque-Pucheu, S., Perrin, G., and Garnier, J. (2019). An efficient dimension reduction for the sequential improvement of a Gaussian process emulator of two nested codes with functional outputs. Submitted to Computational Statistics.

Nouy, A. (2019). Higher-order principal component analysis for the approximation of tensors in tree-based low-rank formats. Numerische Mathematik, 141(3):743–789.

Perrin, G. and Cannamela, C. (2017). A repulsion-based method for the definition and the enrichment of optimized space-filling designs in constrained input spaces. Journal de la Societe Francaise de Statistique, 158(1):37–67.

References II

Page 61:

Perrin, G., Soize, C., Marque-Pucheu, S., and Garnier, J. (2017). Nested polynomial trends for the improvement of Gaussian process-based predictors. Journal of Computational Physics, 346:389–402.

References III

Page 62:

Thank you for your attention!

Commissariat a l’energie atomique et aux energies alternatives

Centre de Saclay ∣ 91191 Gif-sur-Yvette Cedex

Etablissement public a caractere industriel et commercial ∣ RCS Paris B 775 685 019