arXiv:1802.00021v2 [stat.ME] 24 Sep 2018

ANOTHER LOOK AT STATISTICAL CALIBRATION: A NON-ASYMPTOTIC THEORY AND PREDICTION-ORIENTED OPTIMALITY

By Xiaowu Dai∗,† and Peter Chien†

University of Wisconsin-Madison

We provide another look at the statistical calibration problem in computer models. This viewpoint is inspired by two overarching practical considerations of computer models: (i) many computer models are inadequate for perfectly modeling physical systems, even with the best-tuned calibration parameters; (ii) only a finite number of data points are available from the physical experiment associated with a computer model. Following this new line of thinking, we provide a non-asymptotic theory and derive a prediction-oriented calibration method. Our calibration method minimizes the predictive mean squared error for a finite sample size with statistical guarantees. We introduce an algorithm to perform the proposed calibration method and connect it to existing Bayesian calibration methods. Synthetic and real examples are provided to corroborate the derived theory and illustrate some advantages of the proposed calibration method.

1. Introduction. In engineering and the sciences, computer models are increasingly used for studying complex physical systems such as those in cosmology, weather forecasting, material science, and shock physics [17]. Let Y denote the output from a physical system ζ(·) with a set of inputs X. For i = 1, . . . , n, assume Yi has the following form:

(1.1) Yi = ζ(Xi) + εi,

where the random errors εi are independent N(0, σ²) and the design points Xi have support on Ω = [0, 1]^d. Let η(x, θ) denote a computer model for approximating ζ(x) with inputs x ∈ Ω and calibration parameters θ ∈ Θ ⊂ R^p. The values of θ cannot be directly measured and are typically unknown in the physical data. As George Box famously stated, “All models are wrong, but essentially some are useful”; even the best computer models are only approximations of reality. It is possible to enhance the quality of the computer model η(x, θ) by tuning or calibrating the calibration parameters θ. But in most practical scenarios, the computer output η(x, θ) cannot fit the physical response ζ(x) perfectly, regardless of how well the calibration parameters θ are tuned [9, 17].

∗Supported in part by NSF Grant DMS-1308877.
†Supported in part by NSF Grant DMS-1564376.
MSC 2010 subject classifications: Primary 62F35, 62P30; secondary 62G08, 62F10.
Keywords and phrases: Calibration, computer experiments, identifiability, model discrepancy, prediction, non-asymptotic theory, reproducing kernel Hilbert space.

Another practical fact is that only a finite number, n, of data points are available from the physical experiment in (1.1) to tune θ.

By simultaneously following these two practical considerations, we take a new look at the calibration problem. Our purpose is to optimally predict ζ(·) by calibrating θ in η(·, θ) and estimating the model discrepancy ζ(x) − η(x, θ) based on a finite number of physical data. To this end, we use nonparametric approaches to modeling the physical system and the discrepancy function. We establish a non-asymptotic minimax estimation risk for nonparametric regression and achieve the optimal risk by using regularized estimators in the reproducing kernel Hilbert space (RKHS) [1, 21]. We show that our problem of prediction-oriented calibration is equivalent to finding the minimizer of the model discrepancy under the RKHS norm. We further establish an exact statistical guarantee in the sense that, for a finite sample of physical observations, the prediction error is minimized by using the computer model calibrated with the proposed method. Furthermore, we provide an algorithm to estimate the optimal calibration parameters and the model discrepancy.

1.1. Comparison with existing work. We discuss the differences between our frequentist calibration method and other frequentist calibration methods in the literature. [7] considers calibration using a parametric form of discrepancy. We use a nonparametric approach to better model the physical system and the model discrepancy. The L2-calibration method in [18] minimizes the model discrepancy under the L2-norm. [22] performs calibration by minimizing the model discrepancy under the empirical l2-norm. Below are the main differences between [18, 22] and our work.

• Different method. Calibration in [18, 22] and in our work minimizes different norms of the model discrepancy: the L2-norm in [18], the empirical l2-norm in [22], and the RKHS norm in our method.
• Different analysis. Theoretical results in [18, 22] are based on asymptotic theory, assuming that the number of physical observations goes to infinity. Our theory is based on finite-sample properties of calibration and prediction, following the fact that usually only a finite number of physical data are available.
• Different results. The L2-calibration in [18] minimizes the distance between the physical system and the imperfect computer model, but does not directly target prediction of the physical system. [22] performs the least square calibration and then estimates the model discrepancy in the RKHS. However, for a finite number of physical observations, the estimation error of the discrepancy can be large. To overcome this difficulty, our calibration method minimizes the predictive mean squared error for a finite sample of physical data with statistical guarantees.

Bayesian calibration was studied by [9, 14, 5, 6, 8, 15, 19], among others. Our frequentist calibration method is easier to compute and complements these Bayesian methods. Furthermore, we will discuss an interesting connection between our method and these Bayesian methods in Section 4. This link offers a new justification for the successful prediction performance of the Bayesian calibration methods observed in various fields.

Our non-asymptotic minimax theory is inspired by recent developments of concentration inequalities that provide valid statistical inference and estimation results for finite samples. Existing research on concentration inequalities typically addresses finite-dimensional parameters for parametric models. Because our interest is computer model calibration, we develop a non-asymptotic minimax theory for nonparametric models.

The remainder of the article is organized as follows. In Section 2, we discuss the identifiability issue and formulate a prediction-oriented optimal calibration method. In Section 3, we establish a non-asymptotic minimax theory and apply it to the prediction-oriented calibration method. In Section 4, we develop an algorithm for solving our calibration problem and build a connection between our method and the Bayesian calibration method. In Section 5, we provide synthetic and real examples to corroborate the derived theory and illustrate some advantages of the proposed calibration method. We give concluding remarks in Section 6. All other proofs and algorithms are relegated to the supplementary material.

2. Prediction-oriented calibration. Since the computer model is imperfect for modeling the physical system, η(x, θ) ≢ ζ(x) for any θ ∈ Θ. A model discrepancy function δ(x, θ) = ζ(x) − η(x, θ) is commonly used [9, 13]. Equivalently, write

(2.1) ζ(x) = η(x, θ) + δ(x, θ), ∀x ∈ Ω, θ ∈ Θ.

The goal is to accurately predict ζ(·) using the computer model, which requires calibrating θ and estimating δ(·, θ) simultaneously with a finite physical sample in (1.1). Two main difficulties arise: the identifiability issue of θ, to be discussed in Section 2.1, and the non-negligible estimation error of δ(·, θ) due to the finite sample of the physical data. These two issues motivate our calibration method in Section 2.2.

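2.1. The identifiability issue. Suppose that ζ(·) resides in an RKHS (H, ‖·‖H) on Ω = [0, 1]^d. One example of H is the mth order Sobolev space W^m_2(Ω) with 2m > d:

\[
W^m_2(\Omega) = \Big\{ g(\cdot) \in L_2(\Omega) \;\Big|\; \frac{\partial^{\alpha_1+\cdots+\alpha_d}}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}\, g(\cdot) \in L_2(\Omega),\ \forall\, \alpha_1, \ldots, \alpha_d \in \mathbb{N} \text{ with } \alpha_1+\cdots+\alpha_d \le m \Big\}.
\]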

Since η(·, θ) approximates the physical system ζ(·), we assume the following regularity condition for η(·, θ).

Assumption 2.1. For any θ ∈ Θ, the computer model η(·, θ) ∈ H.

An analogue of Assumption 2.1 was already used in [15]. But instead of assuming the function space is an RKHS, [15] considers a function space of bounded mixed derivatives. Assumption 2.1 implies that for any θ ∈ Θ, δ(·, θ) = ζ(·) − η(·, θ) ∈ H. This observation leads to a potential identifiability issue for θ. For example, for two different θ1 ≠ θ2 ∈ Θ, their corresponding model discrepancies δ(x, θ1) = ζ(x) − η(x, θ1) and δ(x, θ2) = ζ(x) − η(x, θ2) are both in H. This implies that both pairs (θ1, δ(·, θ1)) and (θ2, δ(·, θ2)) in Θ × H are true for model (2.1). More generally, there are infinitely many true pairs (θ, δ(·, θ)) ∈ Θ × H for (2.1), obtained by choosing an arbitrary θ ∈ Θ and setting δ(·, θ) = ζ(·) − η(·, θ). This identifiability issue was first noticed by K. Beven and P. Diggle in their discussion of [9].

If Assumption 2.1 fails, our results hold after replacing δ(·, θ) = ζ(·) − η(·, θ) with δ(·, θ) = P_H{ζ(·) − η(·, θ)}, where P_H denotes the projection from L2(Ω) to H in the L2-norm. Hereinafter, we focus on the imperfect computer model under Assumption 2.1.

2.2. Definition of prediction-oriented calibration. Denote by Π the sampling distribution of Xi in (1.1), which is independent of εi and satisfies Π(Ω) = 1. Assume that the density of Π is bounded away from zero and infinity on Ω. Let X* be drawn from Π and Y* = ζ(X*) + ε* = η(X*, θ) + δ(X*, θ) + ε* with ε* ∼ N(0, σ²). Then for a fixed θ ∈ Θ, the minimal predictive mean squared error for predicting Y* is

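\[
\inf_{\tilde\delta} E\big\{ Y^* - [\eta(X^*, \theta) + \tilde\delta(X^*, \theta)] \big\}^2 = \sigma^2 + \inf_{\tilde\delta} \|\delta(\cdot,\theta) - \tilde\delta(\cdot,\theta)\|^2_{L_2(\Pi)}, \tag{2.2}
\]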


where the infimum is over all measurable estimators.

The identifiability issue discussed in Section 2.1 indicates that there are infinitely many pairs (θ, δ(·, θ)) ∈ Θ × H satisfying model (2.1). However, for a finite sample size n, the minimal estimation error inf_δ̃ ‖δ(·, θ) − δ̃(·, θ)‖_{L2(Π)} does not vanish [2]. This fact motivates us to define the optimal calibration θ^{opt-pred} as the minimizer over θ ∈ Θ of the minimal predictive mean squared error (2.2). Equivalently, we define

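\[
\theta^{\text{opt-pred}} \equiv \arg\min_{\theta\in\Theta} \Big\{ \inf_{\tilde\delta} \|\delta(\cdot,\theta) - \tilde\delta(\cdot,\theta)\|_{L_2(\Pi)} \Big\}, \tag{2.3}
\]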

where the superscript “opt-pred” denotes “optimal for prediction”. By definition, θ^{opt-pred} in (2.3) is optimal for predicting ζ(·). We will establish a non-asymptotic theory for nonparametric regression in Section 3 and introduce an algorithm in Section 4. The formulation of θ^{opt-pred} in (2.3) is frequentist. We will discuss in Section 3.4 the differences between θ^{opt-pred} and other frequentist calibration methods, including the L2-calibration method [18] and the least square calibration method [22].

3. Main results. Existing theory for calibration, including [18, 22], adopts the asymptotic regime in which the number of data points for the physical experiment goes to infinity. In Section 3.1, we present the non-asymptotic minimax risk for nonparametric models, and we apply it to the model calibration problem in Section 3.2. In Section 3.3, we show the improvement in prediction achieved by incorporating data from computer models. In Section 3.4, we compare our calibration method with some existing frequentist methods.

3.1. Non-asymptotic minimax theory for nonparametric regressions. We consider the nonparametric regression in (1.1), where ζ(·) resides in the RKHS H. Let ‖·‖H be the corresponding RKHS norm and K : Ω × Ω → R be a Mercer kernel generating (H, ‖·‖H). By the spectral theorem, K admits the eigenvalue decomposition:

\[
K(x, x') = \sum_{\nu \ge 1} \lambda_\nu \phi_\nu(x) \phi_\nu(x'),
\]

where λ1 ≥ λ2 ≥ · · · ≥ 0 are the eigenvalues and {φν : ν ≥ 1} are the corresponding eigenfunctions such that ⟨φν, φν′⟩_{L2(Π)} = δνν′. Here, δνν′ is the Kronecker delta. We assume the polynomial decay rate of eigenvalues in Assumption 3.1. This common assumption holds for the Sobolev space H = W^m_2(Ω) with Lebesgue measure on Ω.


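Assumption 3.1. For 2m > d, suppose that for any ν ≥ 1, the eigenvalues satisfy c_λ ν^{−2m/d} ≤ λ_ν ≤ C_λ ν^{−2m/d} with constants 0 < c_λ < C_λ < ∞.

Theorem 3.1. Under the regression model (1.1) where ζ(·) ∈ H and Assumption 3.1, there exists a constant C1 > 0 not depending on n, σ, m, d, ‖ζ‖H such that for any n ≥ 1,

\[
\inf_{\tilde\zeta} \sup_{\zeta\in\mathcal{H}} P\Big\{ \|\tilde\zeta - \zeta\|^2_{L_2(\Pi)} \ge C_1 \Big[ 1 + n^{-\frac{2m-d}{4m+2d}} \big(1 + \sigma/\|\zeta\|_{\mathcal{H}}\big)^{-\frac{2d}{2m+d}} \Big]^2 \cdot n^{-\frac{2m}{2m+d}} \big(\|\zeta\|_{\mathcal{H}} + \sigma\big)^2 \big(1 + \sigma/\|\zeta\|_{\mathcal{H}}\big)^{-\frac{2d}{2m+d}} \Big\} > 0.
\]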

The proof of this theorem, given in the supplementary material, is based on Fano’s lemma [2]. Now, we show that the non-asymptotic lower bound of the theorem can be achieved by the regularized estimator in the RKHS:

\[
\hat\zeta_{n\lambda} = \arg\min_{g\in\mathcal{H}} \Big\{ \frac{1}{n} \sum_{i=1}^n [Y_i - g(X_i)]^2 + \lambda \|g\|^2_{\mathcal{H}} \Big\}, \tag{3.1}
\]

where λ > 0 is the tuning parameter. The following result shows that ζ̂_{nλ} is minimax optimal for a finite sample.

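Theorem 3.2. Under the regression model (1.1) where ζ(·) ∈ H and Assumption 3.1, there exists a constant C2 > 0 not depending on n, σ, m, d, ‖ζ‖H such that for any n ≥ 1 and any α > 0, with probability at least 1 − 8e^{−α²},

\[
\|\hat\zeta_{n\lambda} - \zeta\|^2_{L_2(\Pi)} \le C_2 \Big[ 1 + \alpha^{\frac{2m-d}{2m+d}} n^{-\frac{2m-d}{4m+2d}} \big(1 + \sigma/\|\zeta\|_{\mathcal{H}}\big)^{-\frac{2d}{2m+d}} \Big]^2 \cdot \alpha^{\frac{4m}{2m+d}} n^{-\frac{2m}{2m+d}} \big(\|\zeta\|_{\mathcal{H}} + \sigma\big)^2 \big(1 + \sigma/\|\zeta\|_{\mathcal{H}}\big)^{-\frac{2d}{2m+d}},
\]

where ζ̂_{nλ} is defined by (3.1) with

\[
\lambda = n^{-2m/(2m+d)} \Big\{ 4 m^{1/2} (4m-d)^{-1/2} C_\lambda^{d/(4m)} \big[ \alpha A + \sigma c_\phi (1 + \sqrt{2}\,\alpha)/\|\zeta\|_{\mathcal{H}} \big] \Big\}^{4m/(2m+d)}.
\]

Here, A is a constant not depending on n, σ, ‖ζ‖H.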

We relegate the proof of Theorem 3.2 to the supplementary material. We establish the proof by using results from empirical processes, such as the maximal inequalities and the concentration inequalities [11], and by deriving some new techniques.

We make three remarks on this theorem. First, the conventional asymptotic convergence rate can be recovered from Theorem 3.2. Since 2m > d in Assumption 3.1,

\[
\alpha^{\frac{2m-d}{2m+d}} n^{-\frac{2m-d}{4m+2d}} \big(1 + \sigma/\|\zeta\|_{\mathcal{H}}\big)^{-\frac{2d}{2m+d}} = o(1), \quad \text{as } n \to \infty.
\]

Thus, Theorem 3.2 yields that ‖ζ̂_{nλ} − ζ‖²_{L2(Π)} = O_P{n^{−2m/(2m+d)}} as n → ∞. This is a well-known rate ([3, 20, 21]).

Second, Theorems 3.1 and 3.2 together immediately imply that the non-asymptotic minimax optimal risk for estimating ζ ∈ H is

\[
\|\tilde\zeta - \zeta\|^2_{L_2(\Pi)} = C \Big[ 1 + n^{-\frac{2m-d}{4m+2d}} \big(1 + \sigma/\|\zeta\|_{\mathcal{H}}\big)^{-\frac{2d}{2m+d}} \Big]^2 \cdot n^{-\frac{2m}{2m+d}} \big(\|\zeta\|_{\mathcal{H}} + \sigma\big)^2 \big(1 + \sigma/\|\zeta\|_{\mathcal{H}}\big)^{-\frac{2d}{2m+d}}, \tag{3.2}
\]

for some constant C. This minimax risk holds for a finite sample and indicates the dependence on the signal-to-noise ratio ‖ζ‖H/σ and on the magnitude of signal and noise ‖ζ‖H + σ.

Third, Theorem 3.2 suggests a tradeoff between the approximation error and the prediction error. A smaller λ in (3.1) corresponds to a smaller approximation error (1/n) Σ_{i=1}^n [Yi − ζ̂_{nλ}(Xi)]²; see, e.g., [21]. In practice, we usually do not know the values of ‖ζ‖H or σ. Theorem 3.2 implies that, for achieving the minimax optimal risk (3.2), a smaller optimal λ corresponds to a larger ‖ζ‖H, which yields a larger risk for prediction in (3.2).
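The dependence of the risk (3.2) on n, ‖ζ‖H and σ can be explored numerically. The following is a minimal sketch, assuming the unknown constant C is set to 1, so only relative comparisons are meaningful; the function name `risk_shape` and the parameter values are illustrative and do not come from the paper.

```python
def risk_shape(n, m, d, h_norm, sigma):
    """Right-hand side of (3.2) with the unknown constant C set to 1 (assumption)."""
    if 2 * m <= d:
        raise ValueError("the theory assumes 2m > d")
    # (1 + sigma / ||zeta||_H)^(-2d / (2m + d)): signal-to-noise factor.
    snr_factor = (1.0 + sigma / h_norm) ** (-2.0 * d / (2 * m + d))
    # Finite-sample correction term inside the square brackets of (3.2).
    correction = 1.0 + n ** (-(2.0 * m - d) / (4 * m + 2 * d)) * snr_factor
    # Classical nonparametric rate n^(-2m / (2m + d)).
    rate = n ** (-2.0 * m / (2 * m + d))
    return correction ** 2 * rate * (h_norm + sigma) ** 2 * snr_factor


# Illustrative values only: m = 2, d = 1, sigma = 0.5.
for n in (50, 500, 5000):
    print(n, risk_shape(n, m=2, d=1, h_norm=1.0, sigma=0.5))
```

For fixed n and σ the expression increases with the RKHS norm passed as `h_norm`, which is the mechanism behind the comparison in Section 3.3: replacing ‖ζ‖H by a smaller discrepancy norm lowers the bound.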

3.2. Optimal calibration and prediction. We apply the non-asymptotic minimax theory in Section 3.1 to derive an equivalent form for the calibration method in (2.3). The following theorem leads to an algorithm for computing θ^{opt-pred} in Section 4.

Theorem 3.3. Under Assumptions 2.1 and 3.1, for any n ≥ 1, the calibration procedure in (2.3) is equivalent to finding the minimizer of the model discrepancy equipped with the RKHS norm:

\[
\theta^{\text{opt-pred}} = \arg\min_{\theta\in\Theta} \big\{ \|\zeta(\cdot) - \eta(\cdot,\theta)\|_{\mathcal{H}} \big\}.
\]

The proof of this theorem is relegated to the supplementary material. Here are some explanations. If the model discrepancy ζ(·) − η(·, θ) has a small RKHS norm, the discrepancy is a simple function in the RKHS. Since the sample size n is fixed and finite, a simpler function should have a more accurate estimator when only limited information from data is available. As discussed in Section 2, a more accurate discrepancy estimator gives a smaller prediction error for the physical system. Unlike the existing work in model calibration, Theorem 3.3 justifies the use of the RKHS norm to measure the model discrepancy, which is different from the L2-norm [18] and the empirical l2-norm [22].

We now discuss the non-asymptotic minimax risk for predicting ζ(·) based on the calibration parameters θ^{opt-pred}. As a corollary of Theorem 3.1, the following result gives a lower bound of the predictive mean squared error when using the computer model calibrated by θ^{opt-pred} and an estimator of the model discrepancy.

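Corollary 3.4. Under the regression model (1.1) where ζ(·) ∈ H and Assumptions 2.1 and 3.1 hold, for any n ≥ 1,

\[
\begin{aligned}
\inf_{\tilde\delta} \sup_{\zeta\in\mathcal{H}} P\Big\{ & \big\|[\eta(\cdot,\theta^{\text{opt-pred}}) + \tilde\delta(\cdot)] - \zeta(\cdot)\big\|^2_{L_2(\Pi)} \\
& \ge C_1 \Big[ 1 + n^{-\frac{2m-d}{4m+2d}} \big(1 + \sigma/\|\zeta - \eta(\cdot,\theta^{\text{opt-pred}})\|_{\mathcal{H}}\big)^{-\frac{2d}{2m+d}} \Big]^2 n^{-\frac{2m}{2m+d}} \\
& \quad \cdot \big(\|\zeta - \eta(\cdot,\theta^{\text{opt-pred}})\|_{\mathcal{H}} + \sigma\big)^2 \big(1 + \sigma/\|\zeta - \eta(\cdot,\theta^{\text{opt-pred}})\|_{\mathcal{H}}\big)^{-\frac{2d}{2m+d}} \Big\} > 0.
\end{aligned}
\]

Here, the constant C1 is the same as in Theorem 3.1 and does not depend on n, σ, m, d, ‖ζ − η(·, θ^{opt-pred})‖H.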

As a corollary of Theorem 3.2, we show that the non-asymptotic lower bound in Corollary 3.4 can be achieved by the regularized estimator for the model discrepancy in the RKHS H:

\[
\hat\delta_{n\lambda}(\cdot,\theta) = \arg\min_{h\in\mathcal{H}} \Big\{ \frac{1}{n} \sum_{i=1}^n [Y_i - \eta(X_i,\theta) - h(X_i)]^2 + \lambda \|h\|^2_{\mathcal{H}} \Big\}, \tag{3.3}
\]

where λ > 0 is the tuning parameter.

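Corollary 3.5. Under the regression model (1.1) where ζ(·) ∈ H and Assumptions 2.1 and 3.1, for any n ≥ 1 and any α > 0, with probability at least 1 − 8e^{−α²},

\[
\begin{aligned}
& \big\|[\eta(\cdot,\theta^{\text{opt-pred}}) + \hat\delta_{n\lambda}(\cdot,\theta^{\text{opt-pred}})] - \zeta(\cdot)\big\|^2_{L_2(\Pi)} \\
& \quad \le C_2 \Big[ 1 + \alpha^{\frac{2m-d}{2m+d}} n^{-\frac{2m-d}{4m+2d}} \big(1 + \sigma/\|\zeta(\cdot) - \eta(\cdot,\theta^{\text{opt-pred}})\|_{\mathcal{H}}\big)^{-\frac{2d}{2m+d}} \Big]^2 \alpha^{\frac{4m}{2m+d}} \\
& \qquad \cdot\, n^{-\frac{2m}{2m+d}} \big(\|\zeta(\cdot) - \eta(\cdot,\theta^{\text{opt-pred}})\|_{\mathcal{H}} + \sigma\big)^2 \big(1 + \sigma/\|\zeta(\cdot) - \eta(\cdot,\theta^{\text{opt-pred}})\|_{\mathcal{H}}\big)^{-\frac{2d}{2m+d}},
\end{aligned}
\]

where δ̂_{nλ} is defined by (3.3) with

\[
\lambda = n^{-2m/(2m+d)} \Big\{ 4 m^{1/2} (4m-d)^{-1/2} C_\lambda^{d/(4m)} \big[ \alpha A + \sigma c_\phi (1 + \sqrt{2}\,\alpha)/\|\zeta(\cdot) - \eta(\cdot,\theta^{\text{opt-pred}})\|_{\mathcal{H}} \big] \Big\}^{4m/(2m+d)}.
\]

Here, C2 and A are the same as in Theorem 3.2 and do not depend on n, σ, ‖ζ − η(·, θ^{opt-pred})‖H.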

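Corollaries 3.4 and 3.5 together imply that the non-asymptotic minimax optimal risk for predicting ζ ∈ H by the computer model with θ^{opt-pred} is

\[
\begin{aligned}
\big\|[\eta(\cdot,\theta^{\text{opt-pred}}) + \tilde\delta(\cdot)] - \zeta\big\|^2_{L_2(\Pi)}
= {} & C' \Big[ 1 + n^{-\frac{2m-d}{4m+2d}} \Big(1 + \frac{\sigma}{\min_{\theta\in\Theta} \|\zeta - \eta(\cdot,\theta)\|_{\mathcal{H}}}\Big)^{-\frac{2d}{2m+d}} \Big]^2 \\
& \cdot\, n^{-\frac{2m}{2m+d}} \Big(\min_{\theta\in\Theta} \|\zeta - \eta(\cdot,\theta)\|_{\mathcal{H}} + \sigma\Big)^2 \Big(1 + \frac{\sigma}{\min_{\theta\in\Theta} \|\zeta - \eta(\cdot,\theta)\|_{\mathcal{H}}}\Big)^{-\frac{2d}{2m+d}}.
\end{aligned}
\]

Moreover, this minimax optimal risk can be obtained by

\[
\zeta^{\text{opt-pred}}_{n\lambda}(\cdot) \equiv \eta(\cdot,\theta^{\text{opt-pred}}) + \hat\delta_{n\lambda}(\cdot,\theta^{\text{opt-pred}}), \tag{3.4}
\]

where δ̂_{nλ} is defined by (3.3).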

3.3. Improved prediction using the computer model. The minimax optimal predictor (3.4) combines the merits of parametric modeling (for the computer model) and nonparametric modeling (for the discrepancy estimator). We compare (3.4) with its counterpart, the minimax optimal predictor (3.1) that does not use the computer model.

Theorem 3.6. Under the regression model (1.1) where ζ(·) ∈ H and Assumptions 2.1 and 3.1, if

\[
\min_{\theta\in\Theta} \|\zeta(\cdot) - \eta(\cdot,\theta)\|_{\mathcal{H}} < \|\zeta\|_{\mathcal{H}}, \tag{3.5}
\]

then ζ^{opt-pred}_{nλ}(·) defined by (3.4) with the aid of the computer model achieves a smaller upper bound of the predictive mean squared error than ζ̂_{nλ}(·) defined by (3.1) without using the information of the computer model.


The proof of this theorem is given in the supplementary material. Here is a remark on the condition (3.5). The computer model η(·, θ) is built based on some physics for the system ζ(·), and Theorem 3.3 shows that η(·, θ^{opt-pred}) is the best approximation to ζ(·) within the family {η(·, θ), θ ∈ Θ}. Although the computer model is imperfect for modeling the physical system, η(·, θ^{opt-pred}) can still capture the major shape of ζ(·), and consequently ζ(·) − η(·, θ^{opt-pred}) has less variation, or is smoother, in H than the original ζ(·). Hence, we assume that ‖ζ(·) − η(·, θ^{opt-pred})‖H < ‖ζ‖H, which is exactly the condition (3.5).

The predictor (3.4) is a parametrically guided nonparametric predictor, where η(·, θ) can have a parametric form.

3.4. Comparison with existing frequentist calibration methods. We compare the calibration procedure θ^{opt-pred} defined by (2.3) with two other frequentist calibration methods: the L2-calibration and the least square calibration. The L2-calibration method in [18] is defined as follows:

\[
\hat\theta^{L_2}_n = \arg\min_{\theta\in\Theta} \big\{ \|\hat\zeta_{n\lambda}(\cdot) - \eta(\cdot,\theta)\|_{L_2(\Pi)} \big\},
\]

where ζ̂_{nλ}(·) is defined by (3.1). The least square calibration method in [22] minimizes the model discrepancy equipped with the empirical l2-norm:

\[
\hat\theta^{l_2}_n = \arg\min_{\theta\in\Theta} \Big\{ \frac{1}{n} \sum_{i=1}^n [Y_i - \eta(X_i,\theta)]^2 \Big\}. \tag{3.6}
\]

In particular, [22] estimates the model discrepancy after calibrating θ = θ̂^{l2}_n:

\[
\hat\delta_{n\lambda}(\cdot,\hat\theta^{l_2}_n) = \arg\min_{\delta(\cdot)\in\mathcal{H}} \Big\{ \frac{1}{n} \sum_{i=1}^n [Y_i - \eta(X_i,\hat\theta^{l_2}_n) - \delta(X_i)]^2 + \lambda \|\delta\|^2_{\mathcal{H}} \Big\},
\]

and a predictor for ζ(·) is given by η(·, θ̂^{l2}_n) + δ̂_{nλ}(·, θ̂^{l2}_n).

Remark 3.1. The differences between θ^{opt-pred}, θ̂^{L2}_n, and θ̂^{l2}_n are as follows.

• Calibration results. We have that θ^{opt-pred} = arg min_{θ∈Θ} {‖ζ(·) − η(·, θ)‖H} and

\[
P\Big\{ \hat\theta^{L_2}_n, \hat\theta^{l_2}_n \to \theta^{L_2} \equiv \arg\min_{\theta\in\Theta} \big\{ \|\zeta(\cdot) - \eta(\cdot,\theta)\|_{L_2(\Pi)} \big\} \text{ as } n \to \infty \Big\} > 0.
\]

Since θ̂^{L2}_n and θ̂^{l2}_n converge to a different minimizer than θ^{opt-pred}, the calibration result in θ^{opt-pred} is different from θ̂^{L2}_n and θ̂^{l2}_n.
• Prediction. For a finite sample size n, the predictor with θ^{opt-pred} achieves a smaller upper bound of the predictive mean squared error compared with the predictor with θ̂^{L2}_n in [18] and the predictor with θ̂^{l2}_n in [22].

We provide a proof of Remark 3.1 in the supplementary material.

4. An algorithm. We propose an algorithm to compute the optimal calibration θ^{opt-pred} in (2.3). From Theorem 3.3,

\[
\theta^{\text{opt-pred}} = \arg\min_{\theta\in\Theta} \big\{ \|\delta(\cdot,\theta)\|^2_{\mathcal{H}} \big\}, \tag{4.1}
\]

where the discrepancy δ(·, θ) is subject to the constraint (2.1). By evaluating (2.1) at the training data, we have

\[
Y_i = \zeta(X_i) + \varepsilon_i = \eta(X_i,\theta) + \delta(X_i,\theta) + \varepsilon_i, \quad \forall \theta\in\Theta,\ i = 1, \ldots, n. \tag{4.2}
\]

By using the Lagrange multiplier method for (4.1) with the constraint (4.2), we need to find θ ∈ Θ and δ(·) ∈ H minimizing

\[
\frac{1}{n} \sum_{i=1}^n [Y_i - \eta(X_i,\theta) - \delta(X_i)]^2 + \lambda \|\delta\|^2_{\mathcal{H}}, \tag{4.3}
\]

where λ > 0 is a tuning parameter.

The optimization problem in (4.3) can be solved iteratively as follows. We introduce some notation. Recall that K is the reproducing kernel of (H, ‖·‖H). Let Σ be the n × n kernel matrix with ijth entry K(Xi, Xj). Denote by Y⃗ = (Y1, . . . , Yn)^⊤, η(X⃗, θ) = (η(X1, θ), . . . , η(Xn, θ))^⊤, δ(X⃗, θ) = (δ(X1, θ), . . . , δ(Xn, θ))^⊤, and ζ(X⃗) = (ζ(X1), . . . , ζ(Xn))^⊤.

For any fixed θ ∈ Θ, the minimizer δ(·) of (4.3) is the same as the regularized estimator δ̂_{nλ}(·, θ) in (3.3). By the representer lemma [10],

\[
\hat\delta_{n\lambda}(\cdot,\theta) = \sum_{i=1}^n c_i K(X_i, \cdot), \tag{4.4}
\]

where the coefficient c = (c1, . . . , cn)^⊤ ∈ R^n is given by

\[
c = c(\theta) = (\Sigma + n\lambda I)^{-1} [\vec{Y} - \eta(\vec{X},\theta)]. \tag{4.5}
\]
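The closed form (4.4)-(4.5) amounts to a single linear solve. Below is a minimal NumPy sketch for one-dimensional inputs, using the Matérn kernel from Section 5; the function names, the value of λ, and the synthetic residuals are illustrative assumptions, not part of the paper.

```python
import numpy as np


def matern_kernel(a, b, psi=1.0):
    """K(x1, x2) = (1 + |x1 - x2| / psi) * exp(-|x1 - x2| / psi) for 1-d inputs."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dist = np.abs(a[:, None] - b[None, :])
    return (1.0 + dist / psi) * np.exp(-dist / psi)


def fit_discrepancy(X, residual, lam, psi=1.0):
    """Coefficients c = (Sigma + n * lam * I)^{-1} residual, as in (4.5)."""
    n = len(X)
    Sigma = matern_kernel(X, X, psi)
    return np.linalg.solve(Sigma + n * lam * np.eye(n), residual)


def predict_discrepancy(X_train, c, X_new, psi=1.0):
    """Evaluate (4.4), sum_i c_i K(X_i, x), at the new points."""
    return matern_kernel(X_new, X_train, psi) @ c


rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=20)
# Synthetic stand-in for the residual vector Y - eta(X, theta) at a fixed theta.
residual = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(20)
c = fit_discrepancy(X, residual, lam=1e-3)
fitted = predict_discrepancy(X, c, X)
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a numerical choice of this sketch; it computes the same coefficients as (4.5).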

The tuning parameter λ can be selected by the generalized cross-validation (GCV) in [4]. Let A(λ) be the influence matrix satisfying δ̂_{nλ}(X⃗, θ) = A(λ)[Y⃗ − η(X⃗, θ)]. The GCV estimate of λ is the minimizer of

\[
\mathrm{GCV}(\lambda) = \frac{n^{-1} \|\vec{Y} - \eta(\vec{X},\theta) - \hat\delta_{n\lambda}(\vec{X},\theta)\|^2}{[n^{-1} \mathrm{tr}(I - A(\lambda))]^2}. \tag{4.6}
\]

GCV gives a consistent estimate of the optimal λ in Corollary 3.5 [12, 21]. Another popular technique for selecting the tuning parameter is five-fold or ten-fold cross-validation. Since the computational load of GCV is smaller, it will be used for simulations in Section 5.
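A minimal sketch of the GCV criterion (4.6) follows. It assumes the influence matrix takes the form A(λ) = Σ(Σ + nλI)^{-1}, which follows from (4.4)-(4.5) evaluated at the training points; the search over a fixed logarithmic grid and all names and data are illustrative assumptions.

```python
import numpy as np


def gcv_score(Sigma, residual, lam):
    """GCV(lam) from (4.6) for a residual vector Y - eta(X, theta)."""
    n = len(residual)
    # (Sigma + n*lam*I)^{-1} Sigma equals Sigma (Sigma + n*lam*I)^{-1}
    # because the two symmetric matrices commute.
    A = np.linalg.solve(Sigma + n * lam * np.eye(n), Sigma)
    numerator = np.sum((residual - A @ residual) ** 2) / n
    denominator = (np.trace(np.eye(n) - A) / n) ** 2
    return numerator / denominator


def select_lambda(Sigma, residual, grid):
    """Return the grid value of lam with the smallest GCV score."""
    scores = [gcv_score(Sigma, residual, lam) for lam in grid]
    return grid[int(np.argmin(scores))]


rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=30)
dist = np.abs(X[:, None] - X[None, :])
Sigma = (1.0 + dist) * np.exp(-dist)  # Matern kernel with psi = 1
residual = np.sin(2 * np.pi * X) + 0.2 * rng.standard_normal(30)
grid = np.logspace(-6, 0, 25)
lam_hat = select_lambda(Sigma, residual, grid)
```

A continuous one-dimensional optimizer could replace the grid; the grid keeps the sketch dependency-free beyond NumPy.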

For any fixed δ(·) = δ̂_{nλ}(·, θ) from (4.4), the minimizer θ of (4.3) is the same as

\[
\arg\min_{\theta\in\Theta} \Big\{ (\vec{Y} - \eta(\vec{X},\theta))^\top (\Sigma + n\lambda I)^{-1} (\vec{Y} - \eta(\vec{X},\theta)) \Big\}. \tag{4.7}
\]

Since the objective function in (4.7) is a weighted version of the empirical l2-norm ‖Y⃗ − η(X⃗, θ)‖²/n, (4.7) gives a different calibration result than the least square calibration method (3.6).
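The weighted form in (4.7) can be checked by substituting (4.4) and (4.5) back into (4.3). The short derivation below writes r for the residual vector (a shorthand introduced here, not notation from the paper) and uses the standard RKHS identity ‖Σ_i c_i K(X_i, ·)‖²_H = c^⊤Σc.

```latex
\begin{aligned}
r &= \vec{Y} - \eta(\vec{X},\theta), \qquad
c = (\Sigma + n\lambda I)^{-1} r
\;\Longrightarrow\; r - \Sigma c = n\lambda c, \\
\frac{1}{n}\|r - \Sigma c\|^2 + \lambda c^\top \Sigma c
&= n\lambda^2 c^\top c + \lambda c^\top \Sigma c
 = \lambda c^\top (\Sigma + n\lambda I) c
 = \lambda\, r^\top (\Sigma + n\lambda I)^{-1} r .
\end{aligned}
```

Since λ is held fixed, minimizing this profiled objective over θ is the same as minimizing the quadratic form in (4.7).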

Putting the above building blocks together, our algorithm for optimizing (4.3) iterates between (4.4) and (4.7). In each iteration, (4.3) is decreased. The algorithm can start with the calibration parameters from the least square calibration method. Applying later iterations of the algorithm continuously improves the initial values for prediction. Our experience indicates that a small number of iterations is sufficient to obtain good performance of the algorithm. In our simulation studies, we find that the objective function (4.3) decreases fast in the first iteration and becomes close to the objective function at convergence. This motivates a one-step update procedure:

1. Initialization: Solve the least square calibration problem in (3.6) to obtain θ = θ̂^{l2}_n.
2. Solve for δ̂_{nλ}(·, θ) in (4.4), and tune λ according to GCV. Fix λ at the chosen value in later steps.
3. For δ̂_{nλ}(·, θ) obtained in step 2, solve for θ in (4.7).
4. With the new θ, solve for δ̂_{nλ}(·, θ) in (4.4).
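The four steps above can be sketched end to end as follows. This is an illustration under stated assumptions rather than the authors' implementation: both minimizations over θ are done by grid search, λ is chosen by GCV over a fixed grid, the kernel scale is fixed at ψ = 1, and the data follow the setting of Example 5.1 with σ² = 0.1. The function and variable names are hypothetical.

```python
import numpy as np


def kernel(a, b, psi=1.0):
    dist = np.abs(a[:, None] - b[None, :])
    return (1.0 + dist / psi) * np.exp(-dist / psi)


def one_step_calibration(X, Y, eta, theta_grid, lam_grid, psi=1.0):
    """One-step update: LS start, GCV for lam, weighted step (4.7), refit (4.5)."""
    n = len(Y)
    Sigma = kernel(X, X, psi)
    resid = {t: Y - eta(X, t) for t in theta_grid}

    # Step 1: least square calibration (3.6), here by grid search.
    theta_ls = min(theta_grid, key=lambda t: np.mean(resid[t] ** 2))

    # Step 2: choose lam by GCV (4.6) at theta_ls and keep it fixed afterwards.
    def gcv(lam):
        A = np.linalg.solve(Sigma + n * lam * np.eye(n), Sigma)
        r = resid[theta_ls]
        return np.mean((r - A @ r) ** 2) / (np.trace(np.eye(n) - A) / n) ** 2

    lam = min(lam_grid, key=gcv)
    M = Sigma + n * lam * np.eye(n)

    # Step 3: minimize the weighted criterion (4.7) over theta.
    theta_new = min(theta_grid, key=lambda t: resid[t] @ np.linalg.solve(M, resid[t]))

    # Step 4: refit the discrepancy coefficients (4.5) at the new theta.
    c = np.linalg.solve(M, resid[theta_new])
    return theta_ls, theta_new, lam, c


def zeta(x):
    return np.exp(np.pi * x / 5) * np.sin(2 * np.pi * x)


def eta(x, theta):
    amp = np.sqrt(theta ** 2 - theta + 1)
    return zeta(x) - amp * (np.sin(2 * np.pi * theta * x) + np.cos(2 * np.pi * theta * x))


rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=50)
Y = zeta(X) + np.sqrt(0.1) * rng.standard_normal(50)
theta_grid = [float(t) for t in np.linspace(-1.0, 1.0, 201)]
lam_grid = [float(l) for l in np.logspace(-6, 0, 13)]
theta_ls, theta_new, lam, c = one_step_calibration(X, Y, eta, theta_grid, lam_grid)
```

Because the least square start lies on the same grid, the penalized objective (4.3) at the updated θ is never larger than at the starting value for the chosen λ.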

4.1. Connection with Bayesian calibration. We now link our frequentist calibration method to the Bayesian calibration method in [9]. [9] uses Gaussian processes [16] as priors for the computer model η(·, θ) and the discrepancy δ(·, θ). Recall that K is the reproducing kernel of the RKHS (H, ‖·‖H). We consider the following priors

\[
\zeta(x) = \eta(x,\theta) + \delta(x), \quad \text{where } \eta(x,\theta) = \sum_{j=1}^p \theta_j h_j(x), \quad \theta \sim N(0, \alpha I), \quad \delta(x) \sim N(0, \beta K(\cdot,\cdot)). \tag{4.8}
\]

Here, hj(x) are deterministic functions and α, β are positive hyperparameters. The following proposition builds a connection between the Bayesian calibration method and the optimization step in (4.3) for our calibration method.

Proposition 4.1. Suppose that the prior for ζ(·) is given by (4.8) and denote the posterior mean by ζ̂_α(x) = E[ζ(x) | Y1, . . . , Yn]. Let the predictor corresponding to the optimal calibration be ζ̂^{opt-pred}_{nλ}(·) = η(·, θ̂^{opt-pred}) + δ̂_{nλ}(·), which is an estimator of (3.4), where

\[
(\hat\theta^{\text{opt-pred}}, \hat\delta_{n\lambda}) = \arg\min_{\theta\in\Theta,\, \delta\in\mathcal{H}} \Big\{ \frac{1}{n} \sum_{i=1}^n [Y_i - \eta(X_i,\theta) - \delta(X_i)]^2 + \lambda \|\delta\|^2_{\mathcal{H}} \Big\}
\]

with λ = σ²/(nβ). Here, α and β are defined in (4.8). Then lim_{α→∞} ζ̂_α(x) = ζ̂^{opt-pred}_{nλ}(x) for any x ∈ Ω.

The proof of this proposition is given in the supplementary material. The proposition sheds new light on why the prediction of the Bayesian calibration method works well in practice.
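The role of the choice λ = σ²/(nβ) can be illustrated for the discrepancy term alone. The sketch below holds θ fixed, so it checks only one ingredient of Proposition 4.1 and not the limit α → ∞: the Gaussian process posterior mean with prior covariance βK and noise variance σ² coincides with the regularized estimator (4.4)-(4.5). The data and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2, beta = 15, 0.25, 2.0
X = rng.uniform(0.0, 1.0, size=n)
r = rng.standard_normal(n)  # stands in for Y - eta(X, theta) at a fixed theta
x_new = np.linspace(0.0, 1.0, 7)


def kernel(a, b):
    dist = np.abs(a[:, None] - b[None, :])
    return (1.0 + dist) * np.exp(-dist)


Sigma = kernel(X, X)
k_new = kernel(x_new, X)

# Gaussian process posterior mean: prior covariance beta * K, noise variance sigma2.
gp_mean = (beta * k_new) @ np.linalg.solve(beta * Sigma + sigma2 * np.eye(n), r)

# Regularized estimator (4.4)-(4.5) with lam = sigma2 / (n * beta).
lam = sigma2 / (n * beta)
krr_mean = k_new @ np.linalg.solve(Sigma + n * lam * np.eye(n), r)

print(np.max(np.abs(gp_mean - krr_mean)))
```

The two predictions agree up to floating-point error because β(βΣ + σ²I)^{-1} = (Σ + (σ²/β)I)^{-1} and nλ = σ²/β.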

The Bayesian calibration method is more time consuming to computethan frequentist calibration methods such as ours.

5. Simulation and real examples. We illustrate the prediction accuracy of the proposed calibration method using several examples. Our simulation study consists of Examples 5.1, 5.2, and 5.3, where the tuning parameters for regularized estimators are selected by GCV. The prediction accuracy is measured by the predictive mean squared error estimated by a Monte Carlo sample of 1000,000 test points from the same distribution as the training points. A real data example is given in Example 5.4.

Example 5.1. Consider a physical system ζ(x) = exp(πx/5) sin 2πx, x ∈ [0, 1]. The physical data are generated by (1.1) with Xi ∼ Unif([0, 1]), εi ∼ N(0, σ²) for i = 1, . . . , 50. Four different noise variances are investigated: σ² = 0.1, 0.25, 0.5, 1. Suppose that the computer model is

\[
\eta(x,\theta) = \zeta(x) - \sqrt{\theta^2 - \theta + 1}\,(\sin 2\pi\theta x + \cos 2\pi\theta x) \quad \text{for } \theta \in [-1, 1].
\]

Since θ² − θ + 1 ≥ 3/4 for any −1 ≤ θ ≤ 1, the model discrepancy between η(·, θ) and ζ(·) always exists no matter how θ is chosen. We use the Matérn kernel K(x1, x2) = (1 + |x1 − x2|/ψ) exp{−|x1 − x2|/ψ}, where the scale parameter ψ is chosen by five-fold cross-validation. Figure 1 plots the squared model discrepancy with different norms: ‖ζ(·) − η(·, θ)‖²_{L2(Π)} and ‖ζ(·) − η(·, θ)‖²_H. The corresponding minimizers are different: θ^{L2} ≈ −0.1780 and θ^{opt-pred} ≈ 0.3740 (a local minimizer of ‖ζ(·) − η(·, θ)‖²_H in [−0.4, 0] is θ ≈ −0.1230), which illustrates the first part of Remark 3.1.

Fig 1: Normalized model discrepancy equipped with L2(Π)-norm and RKHS-norm in Example 5.1.
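The L2(Π) curve in Figure 1 can be approximated with the standard library alone, since Π is uniform on [0, 1] in this example. The sketch below uses a midpoint rule for the integral and a grid search over θ; the grid sizes and function names are illustrative choices, and the RKHS-norm curve is not reproduced here because it depends on the kernel and its tuned scale.

```python
import math


def squared_l2_discrepancy(theta, grid_size=400):
    """Midpoint-rule approximation of ||zeta - eta(., theta)||^2 in L2 of Unif[0, 1]."""
    amp2 = theta * theta - theta + 1.0  # squared amplitude, never below 3/4
    total = 0.0
    for k in range(grid_size):
        x = (k + 0.5) / grid_size
        wave = math.sin(2 * math.pi * theta * x) + math.cos(2 * math.pi * theta * x)
        total += wave * wave
    return amp2 * total / grid_size


# Grid over the calibration range [-1, 1] with step 0.002.
thetas = [-1.0 + 0.002 * k for k in range(1001)]
theta_l2 = min(thetas, key=squared_l2_discrepancy)
print(round(theta_l2, 3))
```

The grid minimizer should land close to the value θ^{L2} ≈ −0.1780 reported in the text, up to the grid resolution.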

We compare the prediction accuracy of four frequentist calibration methods:

1) The computer model with L2-calibration (abbreviated as “No Bias Corr.”), which is η(·, θ̂^{L2}_n) in Section 3.4 without the model discrepancy correction;
2) The nonparametric predictor (abbreviated as “N.P.”), which is ζ̂_{nλ}(·) obtained by (3.1) in Section 3.1;
3) The predictor in [22] (abbreviated as “LS. Cal.”), which is η(·, θ̂^{l2}_n) + δ̂_{nλ}(·, θ̂^{l2}_n) in Section 3.4 based on the least square calibration;
4) Our predictor (abbreviated as “Opt. Cal.”), which is ζ^{opt-pred}_{nλ} in Section 3.2 based on the proposed calibration and computed by the algorithm in Section 4.

For each chosen noise variance, we replicate the data generation, calibration and prediction procedures 1,000 times and average the results for each method across the replicates. The resulting average predictive mean squared errors and their associated standard errors are given in Table 1. The computer model with L2-calibration (i.e., “No Bias Corr.”) gives the largest predictive mean squared error, which shows that an estimator for the model discrepancy is necessary. “No Bias Corr.” has the smallest standard error, which is not surprising because the parametric estimation of θ here has a faster rate of convergence than the nonparametric estimation required by the other three methods. Table 1 indicates that our method “Opt. Cal.” and the predictor in [22] “LS. Cal.” outperform the nonparametric predictor “N.P.”. This advantage illustrates the improved predictions by computer models as indicated in Theorem 3.6. Furthermore, “Opt. Cal.” gives smaller prediction errors than “LS. Cal.”, which agrees with the second part of Remark 3.1. Overall, for a finite sample size n = 50, the proposed method “Opt. Cal.” outperforms the other frequentist predictors in the settings studied.

Table 1. Comparison of predictive mean squared errors for Example 5.1. PMSE = predictive mean squared error, SE = standard error.

                  σ² = 0.1                        σ² = 0.25
Method            Average of PMSE   SE of PMSE    Average of PMSE   SE of PMSE
No Bias Corr.     0.3492            0.0174        0.3601            0.0270
N.P.              0.1701            0.0594        0.2327            0.0680
LS. Cal.          0.1152            0.0604        0.1693            0.0580
Opt. Cal.         0.0922            0.0515        0.1434            0.0552

                  σ² = 0.5                        σ² = 1
Method            Average of PMSE   SE of PMSE    Average of PMSE   SE of PMSE
No Bias Corr.     0.3744            0.0407        0.3985            0.0666
N.P.              0.2810            0.0930        0.3417            0.1367
LS. Cal.          0.1924            0.0683        0.2074            0.0853
Opt. Cal.         0.1684            0.0656        0.1838            0.0788

Example 5.2. Consider a two-dimensional physical system

\[
\zeta(x_1, x_2) = \frac{2}{3} \exp(x_1 + 0.2) - x_2 \sin 0.4 + 0.4 + \exp(-x_1) \Big(x_1 + \frac{1}{2}\Big) \big(x_2^2 + x_2 + 1\big), \quad (x_1, x_2) \in [0, 1]^2.
\]

Suppose that the computer model is

\[
\eta(x_1, x_2; \theta_1, \theta_2) = \frac{2}{3} \exp(x_1 + \theta_1) - x_2 \sin \theta_2 + \theta_2, \quad (\theta_1, \theta_2) \in [0, 1]^2.
\]

The model discrepancy exists between η(·; θ1, θ2) and ζ(·). Assume the physical data are generated by (1.1) with a uniform design on [0, 1]² and n = 50. We consider four levels of σ² = 0.03, 0.05, 0.07, 0.1, and use the Matérn kernel K(x1, x2) = (1 + ‖x1 − x2‖/ψ) exp{−‖x1 − x2‖/ψ} with ψ chosen by five-fold cross-validation. For each level of σ², we replicate the data generation, calibration and prediction procedures 1,000 times for all the methods and average the results for each method across the replicates. Table 2 summarizes the results, where the abbreviations of the methods are the same as those in Example 5.1. Once again, the proposed method “Opt. Cal.” outperforms the other three frequentist calibration methods in terms of the predictive mean squared error.




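Table 2. Comparison of predictive mean squared errors for Example 5.2. PMSE = predictive mean squared error, SE = standard error.

                  σ² = 0.03                       σ² = 0.05
Method            Average of PMSE   SE of PMSE    Average of PMSE   SE of PMSE
No Bias Corr.     0.1690            0.0027        0.1691            0.0027
N.P.              0.1155            0.0512        0.1823            0.0850
LS. Cal.          0.0611            0.0198        0.0743            0.0212
Opt. Cal.         0.0564            0.0187        0.0691            0.0205

                  σ² = 0.07                       σ² = 0.1
Method            Average of PMSE   SE of PMSE    Average of PMSE   SE of PMSE
No Bias Corr.     0.1692            0.0028        0.1694            0.0028
N.P.              0.2425            0.1159        0.3327            0.1453
LS. Cal.          0.0825            0.0224        0.0906            0.0235
Opt. Cal.         0.0776            0.0220        0.0863            0.0234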

Example 5.3. We now compare the proposed calibration method with two Bayesian calibration methods. Consider a falling ball example in [15] where the physical system is

\[
\zeta(x) = 8 + \frac{5}{2} \log\Big( \frac{50}{49} - \frac{50}{49} \tanh\big( \tanh^{-1}(\sqrt{0.02}) + \sqrt{2}\,x \big)^2 \Big), \quad x \in [0, 1],
\]

and the computer model derived from Newton’s second law is η(x; v0, g) = 8 + v0 x − g x²/2. Here, the calibration parameters (v0, g) are the vertical velocity and the acceleration rate, respectively. The model discrepancy exists between ζ(·) and η(·; v0, g). Suppose that the physical data are generated by (1.1) with a uniform design on [0, 1] and n = 30. We compare the proposed method “Opt. Cal.” with two Bayesian methods in terms of prediction accuracy:

1) The Bayesian method in Kennedy and O’Hagan [9] (abbreviated as “KO”);
2) The Bayesian predictor using an orthogonal Gaussian process in L2-norm as the prior (abbreviated as “OGP”) proposed by [15].

The Matérn kernel K(x1, x2) = (1 + |x1 − x2|/ψ) exp{−|x1 − x2|/ψ} with parameter ψ = 1 is used as the reproducing kernel for “Opt. Cal.” and the prior covariance function for both “KO” and “OGP”. Four levels of σ² = 0.0025, 0.01, 0.0625, 0.25 are considered. For each level of σ², we replicate the data generation, calibration and prediction procedures 1,000 times. Table 3 summarizes the prediction results. Here “KO” has large prediction errors and large posterior variances compared with “OGP” and “Opt. Cal.”. “OGP” provides stable and accurate predictions, and our method “Opt. Cal.” gives even smaller prediction errors at all four σ² levels considered.
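The physical system and the computer model of this example are easy to evaluate directly. The sketch below reads the last term inside the hyperbolic tangent as √2 · x (an interpretation of the extracted formula), and the parameter values passed to `eta` are arbitrary illustrations, not calibrated values from the paper.

```python
import math


def zeta(x):
    """Physical system of Example 5.3, reading the inner term as sqrt(2) * x."""
    inner = math.tanh(math.atanh(math.sqrt(0.02)) + math.sqrt(2.0) * x)
    return 8.0 + 2.5 * math.log(50.0 / 49.0 - 50.0 / 49.0 * inner ** 2)


def eta(x, v0, g):
    """Computer model from Newton's second law: 8 + v0 * x - g * x^2 / 2."""
    return 8.0 + v0 * x - 0.5 * g * x * x


# v0 = 0 and g = 9.8 are illustrative values only.
for x in (0.0, 0.5, 1.0):
    print(x, zeta(x), eta(x, v0=0.0, g=9.8))
```

Both functions start at height 8 when x = 0, since tanh² of the initial argument equals 0.02 and the logarithm then vanishes; they separate as x grows, which is the model discrepancy the example relies on.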




KO          0.5413   2.3944    5.8596   29.0264
OGP         0.0147   0.0132    0.0230    0.0249
Opt. Cal.   0.0091   0.0084    0.0166    0.0168

KO          1.4644   4.3802   18.5433   97.3962
OGP         0.0500   0.0478    0.0952    0.0995
Opt. Cal.   0.0318   0.0338    0.0620    0.0677
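The data-generating setup of Example 5.3 can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function names, the default noise level σ² = 0.0025 (one of the four levels considered) and the seed are choices made here for illustration. The sketch codes the physical system ζ, the computer model η(x; v0, g), and one draw of n = 30 observations from model (1.1) with a uniform design on [0, 1].

```python
import math
import random


def zeta(x):
    """Physical system of Example 5.3 (falling ball)."""
    inner = math.tanh(math.atanh(math.sqrt(0.02)) + math.sqrt(2.0) * x)
    return 8.0 + 2.5 * math.log(50.0 / 49.0 - (50.0 / 49.0) * inner ** 2)


def eta(x, v0, g):
    """Computer model from Newton's second law: 8 + v0 * x - g * x^2 / 2."""
    return 8.0 + v0 * x - 0.5 * g * x ** 2


def simulate(n=30, sigma2=0.0025, seed=0):
    """Draw (X_i, Y_i) with Y_i = zeta(X_i) + eps_i and eps_i ~ N(0, sigma2)."""
    rng = random.Random(seed)  # seed is arbitrary, for reproducibility only
    xs = [rng.random() for _ in range(n)]  # uniform design on [0, 1]
    ys = [zeta(x) + rng.gauss(0.0, math.sqrt(sigma2)) for x in xs]
    return xs, ys


xs, ys = simulate()
```

Both ζ and η take the value 8 at x = 0, so the two curves share their starting point and differ only in how they evolve over [0, 1].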

Example 5.4 (Real data example). We analyze a real dataset from a single voltage clamp experiment on sodium ion channels of cardiac cell membranes. This dataset consists of 19 outputs and is from [15]. The response variable is the normalized current for maintaining a fixed membrane potential of −35 mV and the input variable is the logarithm of time. Suppose the computer model for this experiment is the Markov model for sodium ion channels given by η(x, θ) = e1ᵀ exp(exp(x)A(θ)) e4, where θ = (θ1, θ2, θ3)ᵀ ∈ R³, e1 = (1, 0, 0, 0)ᵀ, e4 = (0, 0, 0, 1)ᵀ, and

\[
A(\theta) =
\begin{pmatrix}
-\theta_2 - \theta_3 & \theta_1 & 0 & 0 \\
\theta_2 & -\theta_1 - \theta_2 & \theta_1 & 0 \\
0 & \theta_2 & -\theta_1 - \theta_2 & \theta_1 \\
0 & 0 & \theta_2 & -\theta_1
\end{pmatrix}.
\]
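To make the Markov model concrete, the sketch below evaluates η(x, θ) as the (1, 4) entry of the matrix exponential exp(exp(x)A(θ)). It is an illustration under assumptions, not the authors' implementation: the helper names are hypothetical, and the matrix exponential is computed with a simple scaling-and-squaring Taylor scheme so that the block depends only on the standard library. In practice a library routine such as `scipy.linalg.expm` would be the usual choice.

```python
import math


def rate_matrix(theta):
    """A(theta) for theta = (theta1, theta2, theta3), as defined in Example 5.4."""
    t1, t2, t3 = theta
    return [
        [-t2 - t3, t1, 0.0, 0.0],
        [t2, -t1 - t2, t1, 0.0],
        [0.0, t2, -t1 - t2, t1],
        [0.0, 0.0, t2, -t1],
    ]


def matmul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm(m, terms=20):
    """Matrix exponential by scaling and squaring with a truncated Taylor series."""
    n = len(m)
    norm = max(sum(abs(v) for v in row) for row in m)
    # Halve the matrix until its norm is at most 1/2, then square back up.
    squarings = max(0, math.ceil(math.log2(norm))) + 1 if norm > 0 else 0
    scaled = [[v / 2 ** squarings for v in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in matmul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result


def eta(x, theta):
    """eta(x, theta) = e1^T exp(exp(x) * A(theta)) e4, i.e. the (1, 4) entry."""
    time = math.exp(x)  # x is the logarithm of time
    scaled = [[time * v for v in row] for row in rate_matrix(theta)]
    return expm(scaled)[0][3]


# theta below is an arbitrary illustrative value, not an estimate from the paper.
value = eta(0.0, (1.0, 2.0, 0.5))
```

Because the off-diagonal entries of A(θ) are nonnegative and its column sums are nonpositive when θ has nonnegative components, the entries of the exponential lie in [0, 1], which is consistent with a normalized current.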

For this example, we compare the frequentist methods in Examples 5.1 and 5.2, and the Bayesian methods in Example 5.3. The Matérn kernel K(x1, x2) = (1 + |x1 − x2|/ψ) exp{−|x1 − x2|/ψ} with ψ = 1 is used for all methods and the Metropolis-Hastings algorithm is applied to sample from the posterior of θ for the Bayesian methods. In each experiment, we perform five-fold cross-validation where the data are randomly partitioned into five roughly equal-sized parts: four parts are used for training and the remaining part is used for testing. The cross-validation process is repeated five times, with each of the five parts used once for testing. Then, the five predictive mean squared errors are averaged to give a single predictive mean squared error. We replicate the data generation, calibration and prediction procedure 100 times and average the results.
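The cross-validation procedure described above can be sketched as follows. This is a hedged illustration rather than the authors' code: `random_folds`, `five_fold_pmse`, and the `fit` and `predict` callbacks are hypothetical names, and the training-mean predictor at the end is only a placeholder standing in for any of the calibration methods compared in this example.

```python
import random


def random_folds(n, k=5, seed=0):
    """Randomly partition indices 0..n-1 into k roughly equal-sized parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def five_fold_pmse(xs, ys, fit, predict, k=5, seed=0):
    """Average of the k per-fold predictive mean squared errors."""
    fold_errors = []
    for test_idx in random_folds(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        # Train on the other k - 1 parts, test on the held-out part.
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        squared = [(ys[i] - predict(model, xs[i])) ** 2 for i in test_idx]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / k


# Placeholder predictor (training mean), used only to exercise the procedure.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model
```

With the 19 observations of this dataset, the partition yields four parts of size four and one part of size three. Repeating the call with different seeds and averaging corresponds to the replication step in the text.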

Table 4 summarizes the prediction results, with the abbreviations of the methods given in Examples 5.1 and 5.3. “No Bias Corr.” gives the largest predictive mean squared error among the four frequentist methods, indicating the existence of model discrepancy. Here all the frequentist methods outperform the two Bayesian methods. “N.P.” outperforms “LS. Cal.”. This indicates that if the calibration parameter is not chosen well, the use of the computer model does not improve prediction. Overall, the proposed method “Opt. Cal.” gives the smallest prediction error among all the methods applied to this example.



Method          Average of PMSE    SE of PMSE
KO              0.0045             0.0131
OGP             9.4387 × 10^-4     0.0022
No Bias Corr.   4.2823 × 10^-4     6.2333 × 10^-4
N.P.            2.5916 × 10^-4     2.2522 × 10^-4
LS. Cal.        3.0521 × 10^-4     6.3797 × 10^-4
Opt. Cal.       1.6323 × 10^-4     2.1335 × 10^-4


6. Concluding remarks and discussions. We have proposed a new look at the model calibration issue in computer models. This viewpoint simultaneously considers two facts regarding how computer models are used in practice: computer models are inadequate for physical systems, regardless of how the calibration parameters are tuned; and only a finite number of data points are available from the physical experiment associated with a computer model. We establish a non-asymptotic minimax theory and derive an optimal prediction-oriented calibration method. Several examples show that the proposed calibration method has some advantages in prediction when compared with some existing calibration methods. We have developed an algorithm to carry out the proposed calibration method and built a link between our method and the Bayesian calibration method. Beyond calibration of computer models, our method can be applied to calibrate unknown parameters for general misspecified models in statistics and engineering. In many applications, bounded linear functional information such as derivative data is observed or can be easily calculated together with the function observations. It would be interesting to include all these data in our proposed calibration method.

SUPPLEMENTARY MATERIAL

Supplement to “Another look at Statistical Calibration: Non-Asymptotic Theory and Prediction-Oriented Optimality”. The supplementary materials contain proofs of technical results in this paper.

REFERENCES

[1] Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc. 68 337–404. MR0051437

[2] Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory, 2nd ed. John Wiley & Sons, New York. MR2239987

[3] Cox, D. D. (1984). Multivariate smoothing spline functions. SIAM J. Numer. Anal. 21 789–813. MR0749371

[4] Craven, P. and Wahba, G. (1978). Smoothing noisy data with spline functions. Numer. Math. 31 377–403. MR0516581

[5] Higdon, D., Kennedy, M., Cavendish, J. C., Cafeo, J. A. and Ryne, R. D. (2004). Combining field data and computer simulations for calibration and prediction. SIAM J. Sci. Comput. 26 448–466. MR2116355

[6] Higdon, D., Gattiker, J., Williams, B. and Rightley, M. (2008). Computer model calibration using high-dimensional output. J. Amer. Statist. Assoc. 103 570–583. MR2523994

[7] Joseph, V. R. and Melkote, S. N. (2009). Statistical adjustments to engineering models. Journal of Quality Technology 41 362–375.

[8] Joseph, V. R. and Yan, H. (2015). Engineering-driven statistical adjustment and calibration. Technometrics 57 257–267. MR3369681



[9] Kennedy, M. C. and O’Hagan, A. (2001). Bayesian calibration of computer models. J. R. Stat. Soc. Ser. B Stat. Methodol. 63 425–464. MR1858398

[10] Kimeldorf, G. and Wahba, G. (1971). Some results on Tchebycheffian spline functions. J. Math. Anal. Appl. 33 82–95. MR0290013

[11] Kosorok, M. R. (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer, New York. MR2724368

[12] Li, K.-C. (1985). From Stein’s unbiased risk estimates to the method of generalized cross validation. Ann. Statist. 13 1352–1377. MR0811497

[13] Martins-Filho, C., Mishra, S. and Ullah, A. (2008). A class of improved parametrically guided nonparametric regression estimators. Econometric Rev. 27 542–573. MR2439867

[14] Oakley, J. E. and O’Hagan, A. (2004). Probabilistic sensitivity analysis of complex models: a Bayesian approach. J. R. Stat. Soc. Ser. B Stat. Methodol. 66 751–769. MR2088780

[15] Plumlee, M. (2017). Bayesian calibration of inexact computer models. J. Amer. Statist. Assoc. 112 1274–1285. MR3735376

[16] Sacks, J., Welch, W. J., Mitchell, T. J. and Wynn, H. P. (1989). Design and analysis of computer experiments. Statist. Sci. 4 409–435. MR1041765

[17] Santner, T. J., Williams, B. J. and Notz, W. I. (2003). The Design and Analysis of Computer Experiments. Springer, New York. MR2160708

[18] Tuo, R. and Wu, C. F. J. (2015). Efficient calibration for imperfect computer models. Ann. Statist. 43 2331–2352. MR3405596

[19] Tuo, R. and Wu, C. F. J. (2018). Prediction based on the Kennedy-O’Hagan calibration model: asymptotic consistency and other properties. Statist. Sinica 28 743–759. MR3791086

[20] Wahba, G. (1979). Convergence rates of “thin plate” smoothing splines when the data are noisy. In Smoothing Techniques for Curve Estimation 233–245. Springer, Berlin. MR0564261

[21] Wahba, G. (1990). Spline Models for Observational Data 59. SIAM, Philadelphia, PA. MR1045442

[22] Wong, R. K. W., Storlie, C. B. and Lee, T. C. M. (2017). A frequentist approach to computer model calibration. J. R. Stat. Soc. Ser. B Stat. Methodol. 79 635–648. MR3611763

Department of Statistics
University of Wisconsin-Madison
1300 University Avenue
Madison, Wisconsin 53706
USA
E-mail: [email protected]

[email protected]

