
Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/JCGS

Tapered Covariance: Bayesian Estimation and Asymptotics

Benjamin SHABY and David RUPPERT

The method of maximum tapered likelihood has been proposed as a way to quickly estimate covariance parameters for stationary Gaussian random fields. We show that under a useful asymptotic regime, maximum tapered likelihood estimators are consistent and asymptotically normal for covariance models in common use. We then formalize the notion of tapered quasi-Bayesian estimators and show that they too are consistent and asymptotically normal. We also present asymptotic confidence intervals for both types of estimators and show via simulation that they accurately reflect sampling variability, even at modest sample sizes. Proofs, an example, and detailed derivations are provided in the supplementary materials, available online.

Key Words: Bayesian inference; Consistency; Covariance estimation; Gaussian process; Tapering.

1. INTRODUCTION

Covariance tapering was introduced as a way to mitigate the computational burdens required for calculating statistically relevant quantities involving large covariance matrices arising from irregularly spaced spatial data. These computations typically require $O(n^3)$ operations, where $n$ is the number of spatial observations. The idea behind tapering is to introduce, in a principled way, many zeros into the covariance matrices, enabling the use of sparse matrix algorithms, which have computational complexities that are generally functions of the number of nonzero elements in the matrix.

Tapering has been studied as a way to speed up computations required for optimal spatial prediction (Furrer, Genton, and Nychka 2006; Furrer and Sain 2009) and for Kalman filter updates (Furrer and Bengtsson 2007). Kaufman (2006) and Kaufman, Schervish, and Nychka (2008) introduced the maximum tapered likelihood estimate as a way to use tapered covariance matrices to quickly estimate covariance function parameters. Zhang and Du (2008) and Du, Zhang, and Mandrekar (2009) further explicated the properties of these estimators. In addition, Kaufman (2006) discussed approximating Bayesian estimation using tapered likelihood functions.

Benjamin Shaby is Postdoctoral Associate, Department of Statistical Science, Duke University, Durham, NC 27708 (E-mail: [email protected]). David Ruppert is Andrew Schultz, Jr. Professor of Engineering, School of Operations Research and Information Engineering, Cornell University, Ithaca, NY 14853 (E-mail: [email protected]).

© 2012 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America

Journal of Computational and Graphical Statistics, Volume 21, Number 2, Pages 433–452. DOI: 10.1080/10618600.2012.680819

Here, we examine the behavior of maximum tapered likelihood estimators, as well as what we will call tapered quasi-Bayesian estimators [we use the term quasi-Bayesian here despite its previous introduction in Berger (2000) to describe, pejoratively, something completely different].

Tapering is not the only approach that has been proposed to quickly compute approximations to the likelihood function for large spatial datasets. When the data are sampled on a regular spatial grid, the resulting structure of the covariance matrix may be exploited to increase computational efficiency (Whittle 1954; Zimmerman 1989). When data locations are not gridded, it is still possible to use Fourier transform methods for approximate inference either by integrating locations onto a latent grid (Fuentes 2007) or by employing a nonstandard periodogram formulation (Matsuda and Yajima 2009). Another approach for nongridded data is to factor the full likelihood into conditional likelihoods that ignore dependence on far-away observations (Vecchia 1988; Stein, Chi, and Welty 2004). Composite likelihood approaches have also been considered (Heagerty and Lele 1998; Curriero and Lele 1999). Still another approach is to project the data onto a lower-dimensional space (Banerjee et al. 2008; Cressie and Johannesson 2008; Finley et al. 2009), although theoretical properties of these techniques are not known.

Here, like Kaufman (2006), Kaufman, Schervish, and Nychka (2008), Zhang and Du (2008), and Du, Zhang, and Mandrekar (2009), we study the use of tapering for parameter estimation. Indeed, the present study may be seen as a follow-up to these works, and makes use of some of the proof techniques contained in Kaufman (2006) and Kaufman, Schervish, and Nychka (2008). Unlike the previous works, which introduced the method and considered the asymptotic behavior of tapering with the popular Matérn covariance function, we do not restrict ourselves to a single covariance model. We also devote considerably more theoretical attention to the quasi-Bayesian perspective than Kaufman (2006).

In addition, while the previous asymptotic studies of the maximum tapered likelihood estimator are as much results on inconsistency as they are on consistency (see Section 2.1), we provide proofs for consistency and asymptotic normality of both the maximum tapered likelihood estimator and the tapered quasi-Bayesian estimators. The key reason for the stronger results presented here is that we consider a different type of asymptotic regime.

Asymptotics for random fields, unlike the case of asymptotics for independent data, are somewhat ill-defined because the manner in which sample points are added such that their number increases to infinity is not clear-cut. There are two standard approaches to increasing the number of observations toward infinity (Cressie 1991). The first is called increasing-domain asymptotics, where the domain expands in spatial extent, while the sampling density stays constant. The second is called infill, or fixed-domain, asymptotics, where the domain stays constant, while the sampling density increases to infinity.
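The two regimes can be made concrete with a small sketch. The functions below are hypothetical helpers written for illustration only (they are not from the article); they place points on a square grid and report the spatial extent of the sample as the number of points grows.

```python
def square_grid(side_points, spacing):
    """Return the points of a side_points x side_points grid with the given spacing."""
    return [(i * spacing, j * spacing)
            for i in range(side_points) for j in range(side_points)]


def increasing_domain(side_points):
    # Spacing stays at 1, so the side length of the domain grows with the sample size.
    return square_grid(side_points, 1.0)


def infill(side_points):
    # The domain stays the unit square, so the spacing shrinks as points are added.
    return square_grid(side_points, 1.0 / (side_points - 1))


def extent(points):
    """Largest coordinate value, i.e. the side length of the sampled square."""
    return max(max(x for x, _ in points), max(y for _, y in points))


for m in (5, 10, 20):
    print(m * m, extent(increasing_domain(m)), extent(infill(m)))
```

Under the first scheme the extent grows without bound while neighboring points stay one unit apart; under the second the extent stays at 1 while neighbors move arbitrarily close together.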

For spatial prediction, Stein (1999, chap. 3.3) preferred infill asymptotics, arguing essentially that because the usual goal for spatial prediction is interpolation, and it is reasonable to posit that the denser the sample gets, the better the interpolation ought to get, we are led toward fixed-domain asymptotics. For parameter estimation, however, the situation is somewhat different in that it is not immediately clear which asymptotics better represents the case of infinitely increasing information.


The maximum likelihood estimator for the parameters of popular covariance models has been studied under both types of asymptotics. Mardia and Marshall (1984) showed that the maximum likelihood estimate is consistent and asymptotically normal for many covariance models under asymptotic sampling that includes increasing-domain asymptotics as a special case. On the other hand, Zhang (2004) proved that the parameters of the popular Matérn covariance model (2.1) are not individually consistently estimable under infill asymptotics. The parameter estimation analogue to Stein's (1999) argument for prediction, then, seems to be that because information increases infinitely in the case of increasing-domain asymptotics, we might prefer this scheme over fixed-domain asymptotics, where information does not increase to infinity for all parameters as the number of sample points increases to infinity. In addition, relative to infill, results concerning expanding-domain asymptotics are available for much more general cases.

In terms of how well the limiting distributions under the two asymptotic regimes approximate their finite-sample correspondents, Zhang and Zimmerman (2005) found that infill asymptotics are preferable in some situations. However, as we will see from the simulations in Section 2.2, as long as the spatial extent of the sampling region is large compared with the range of dependence of the process, increasing-domain asymptotics provide a very accurate description of the behavior of the maximum tapered likelihood estimate.

More concretely, let $Z(s)$ be a Gaussian random field, where $s$ is a location index that varies continuously over a domain $D$. Suppose also that we have observed $Z$ at $n$ locations $s_1, \ldots, s_n$. The covariance between measurements at two locations, $\mathrm{cov}(Z(s_i), Z(s_j)) = C(\theta; s_i, s_j)$, is assumed to be a function of only the locations themselves, and is known up to a parameter $\theta$. We will further assume that $Z(s)$ is second-order stationary and, for simplicity, has mean zero. For convenience, $Z(s)$ may also be assumed to be isotropic, so that $C(\theta; s_i, s_j) = C(\theta; h_{ij})$, where $h_{ij} = \|s_i - s_j\|$, although this assumption may be dropped.

The log-likelihood of $Z_n$, a vector of $n$ observations of $Z(s)$, is

$$\ell_n(\theta; Z_n) = -\frac{1}{2}\log(|\Sigma_n(\theta)|) - \frac{1}{2} Z_n' \Sigma_n(\theta)^{-1} Z_n, \qquad (1.1)$$

up to an additive constant (with respect to $\theta$), where $\Sigma_{ij,n}(\theta) = C(\theta; h_{ij})$.

Dependence of $\ell_n(\theta; Z_n)$ and $\Sigma_n(\theta)$ on $\theta$ will from here on be suppressed for convenience. Also, for a matrix $A$, $A'$ will refer to the transpose of $A$. The notation $\dot{f}$ will refer to the vector of first derivatives of the scalar function $f$, and $\ddot{f}$ will refer to the matrix of second derivatives of $f$. All derivatives are with respect to the vector $\theta$.
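As a rough illustration of how (1.1) is typically evaluated, the following sketch computes the log-likelihood through a Cholesky factorization, which is the $O(n^3)$ step that tapering aims to avoid. The function names, the exponential covariance, and the simulated locations are assumptions made for this example; they are not taken from the article.

```python
import numpy as np


def gaussian_loglik(z, Sigma):
    """Log-likelihood (1.1) of a mean-zero Gaussian vector, up to an additive constant."""
    # Cholesky factor: Sigma = C C', so log|Sigma| = 2 * sum(log(diag(C))).
    C = np.linalg.cholesky(Sigma)
    logdet = 2.0 * np.sum(np.log(np.diag(C)))
    # Solve C w = z; then z' Sigma^{-1} z = w' w.
    w = np.linalg.solve(C, z)
    return -0.5 * logdet - 0.5 * float(w @ w)


def exponential_cov(locs, sigma2, alpha):
    """Sigma_ij = sigma2 * exp(-alpha * h_ij) for locations given as an (n, d) array."""
    h = np.linalg.norm(locs[:, None, :] - locs[None, :, :], axis=-1)
    return sigma2 * np.exp(-alpha * h)


rng = np.random.default_rng(0)
locs = rng.uniform(0.0, 10.0, size=(50, 2))          # illustrative locations
Sigma = exponential_cov(locs, sigma2=1.0, alpha=0.2)
z = np.linalg.cholesky(Sigma) @ rng.standard_normal(50)
print(gaussian_loglik(z, Sigma))
```

The dense factorization here is only for exposition; it does not exploit any sparsity.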

Let $\rho_t(L; h)$ be a correlation function and $T$ be a "taper matrix" such that $T_{ij} = \rho_t(L; h_{ij})$. The choice of $L$ is discussed below. One example taken from a class of compactly supported polynomials from Wendland (1995, 1998) is

$$\rho_t(L; h) = (1 - h/L)_+^6 \left(1 + 6h/L + 35h^2/(3L^2)\right). \qquad (1.2)$$

Other useful examples of such functions may be found in Wendland (1995, 1998).

Define the tapered log-likelihood as

$$\ell_{t,n} = -\frac{1}{2}\log(|\Sigma_n \circ T_n|) - \frac{1}{2} Z_n' \left((\Sigma_n \circ T_n)^{-1} \circ T_n\right) Z_n, \qquad (1.3)$$


where the $\circ$ notation denotes the element-wise product, sometimes called the Hadamard or Schur product. Note that (1.3) does not correspond to the log density of any random vector. Importantly, $\Sigma_n \circ T_n$ is guaranteed to be positive definite as long as both $\Sigma_n$ and $T_n$ are positive definite (Horn and Johnson 1991, p. 458). Equation (1.3) is referred to in Kaufman (2006) and Kaufman, Schervish, and Nychka (2008) as the two-taper approximation.
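A minimal sketch of the taper (1.2) and the two-taper objective (1.3) follows. It uses dense NumPy arrays purely for readability, so it shows the arithmetic but not the speed gain, which would require sparse storage and sparse factorizations. The function names and the simulated data are assumptions of this example.

```python
import numpy as np


def wendland_taper(h, L):
    """Taper (1.2): (1 - h/L)_+^6 * (1 + 6 h/L + 35 h^2 / (3 L^2))."""
    r = np.asarray(h, dtype=float) / L
    return np.clip(1.0 - r, 0.0, None) ** 6 * (1.0 + 6.0 * r + 35.0 * r**2 / 3.0)


def tapered_loglik(z, Sigma, T):
    """Two-taper log-likelihood (1.3), up to an additive constant."""
    St = Sigma * T                         # Hadamard product Sigma o T
    _, logdet = np.linalg.slogdet(St)
    # (Sigma o T)^{-1} o T: invert the tapered matrix, then taper the inverse.
    M = np.linalg.inv(St) * T
    return -0.5 * logdet - 0.5 * float(z @ M @ z)


rng = np.random.default_rng(1)
locs = rng.uniform(0.0, 20.0, size=(200, 2))          # illustrative locations
h = np.linalg.norm(locs[:, None, :] - locs[None, :, :], axis=-1)
Sigma = 1.0 * np.exp(-0.2 * h)                        # exponential covariance
T = wendland_taper(h, L=5.0)
z = np.linalg.cholesky(Sigma) @ rng.standard_normal(200)
print("fraction of nonzeros in Sigma o T:", np.mean(Sigma * T != 0.0))
print("tapered log-likelihood:", tapered_loglik(z, Sigma, T))
```

With a taper matrix of all ones, `tapered_loglik` reduces to the full log-likelihood (1.1), which gives a convenient sanity check.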

Their one-taper approximation, which was studied by Zhang and Du (2008) and Du, Zhang, and Mandrekar (2009), does not correspond to an unbiased estimating equation and can produce significantly biased estimates (Kaufman, Schervish, and Nychka 2008). We will not consider it here.
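For readers who want to see why the second taper in (1.3) matters, the following derivation sketch shows that the score of the two-taper objective has mean zero at the true parameter. It is not reproduced from the article; it relies only on the standard identity $E[Z'BZ] = \mathrm{tr}\{B\Sigma\}$ for a mean-zero vector and on $\mathrm{tr}\{(M \circ T)\Sigma\} = \mathrm{tr}\{M(\Sigma \circ T)\}$ for symmetric $T$. The shorthand $\Sigma_k = \partial\Sigma_n/\partial\theta_k$ and $M_k$ is introduced here for illustration.

```latex
\begin{align*}
\frac{\partial \ell_{t,n}}{\partial \theta_k}
  &= -\tfrac12 \operatorname{tr}\{(\Sigma_n \circ T_n)^{-1}(\Sigma_k \circ T_n)\}
     + \tfrac12 Z_n' (M_k \circ T_n) Z_n, \\
M_k &= (\Sigma_n \circ T_n)^{-1}(\Sigma_k \circ T_n)(\Sigma_n \circ T_n)^{-1}, \\
E_0\bigl[Z_n'(M_k \circ T_n) Z_n\bigr]
  &= \operatorname{tr}\{(M_k \circ T_n)\Sigma_n\}
   = \operatorname{tr}\{M_k(\Sigma_n \circ T_n)\}
   = \operatorname{tr}\{(\Sigma_n \circ T_n)^{-1}(\Sigma_k \circ T_n)\}.
\end{align*}
```

The two terms of the expected score therefore cancel when evaluated at $\theta_0$. Without the outer Hadamard product with $T_n$, the middle equality fails in general, which is consistent with the bias reported for the one-taper form.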

Throughout this study, we favor simplicity of assumptions over slightly more general results. In many cases, weaker but more elaborate assumptions are possible, requiring only minimal changes to the proofs.

2. THE MAXIMUM TAPERED LIKELIHOOD ESTIMATOR

The maximum tapered likelihood estimate is defined as $\hat{\theta}_{t,n} = \arg\max_{\theta \in \Theta} \ell_{t,n}(\theta)$. In Section 2.1, we consider this estimator from within the framework of extremum estimators and investigate its asymptotic properties. In Section 2.2, we conduct a simulation experiment to determine, first, how quickly and to what extent asymptotic sampling distributions approximate empirical sampling distributions, and, second, how the taper range affects sampling variability.

2.1 ASYMPTOTIC BEHAVIOR OF THE MAXIMUM TAPERED LIKELIHOOD ESTIMATOR

Here, we study a form of increasing-domain asymptotics. The requirement of an expanding domain is not stated explicitly, but rather, as in Mardia and Marshall (1984), is implied by eigenvalue conditions on the covariance matrix and its derivatives.

In contrast, Kaufman (2006) and Kaufman, Schervish, and Nychka (2008) studied infill asymptotics. Specifically, they considered the Matérn covariance model, defined by

$$C(\sigma^2, \alpha, \nu; h) = \sigma^2 \frac{(\alpha h)^{\nu}}{\Gamma(\nu)\, 2^{\nu-1}} K_\nu(\alpha h), \qquad \sigma^2, \alpha, \nu > 0, \qquad (2.1)$$

where $K_\nu$ is the modified Bessel function of the second kind of order $\nu$. It is not uncommon to assume that $\nu$ is fixed and known, a practice followed later in this article. They showed that for a known $\nu$ and some fixed $\alpha^*$, $\hat{\sigma}^2 \alpha^{*2\nu} \xrightarrow{\text{a.s.}} \sigma_0^2 \alpha_0^{2\nu}$, where $\sigma_0^2$ and $\alpha_0$ are the true parameters and $\hat{\sigma}^2$ is the value that maximizes (1.3). Unfortunately, under infill asymptotics, $\sigma^2$ and $\alpha$ cannot both be estimated consistently (Zhang 2004).
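As a worked check on (2.1), the two half-integer cases used later in the article can be written in closed form. The sketch below uses the standard expressions for $K_{1/2}$ and $K_{3/2}$ together with $\Gamma(1/2) = \sqrt{\pi}$ and $\Gamma(3/2) = \sqrt{\pi}/2$; it is supplied here for illustration and is not part of the original text.

```latex
\begin{align*}
K_{1/2}(x) &= \sqrt{\tfrac{\pi}{2x}}\, e^{-x},
\qquad
K_{3/2}(x) = \sqrt{\tfrac{\pi}{2x}}\, e^{-x}\Bigl(1 + \tfrac{1}{x}\Bigr), \\
C(\sigma^2, \alpha, \tfrac12; h)
  &= \sigma^2 \frac{(\alpha h)^{1/2}}{\Gamma(1/2)\, 2^{-1/2}} K_{1/2}(\alpha h)
   = \sigma^2 e^{-\alpha h}, \\
C(\sigma^2, \alpha, \tfrac32; h)
  &= \sigma^2 \frac{(\alpha h)^{3/2}}{\Gamma(3/2)\, 2^{1/2}} K_{3/2}(\alpha h)
   = \sigma^2 (1 + \alpha h)\, e^{-\alpha h}.
\end{align*}
```

The first case is the exponential covariance function and the second is the Matérn model with $\nu = 3/2$.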

Kaufman (2006) and Kaufman, Schervish, and Nychka (2008) argued heuristically that a good way to measure sampling variability of the estimator is to study a sandwich estimator of the sampling covariance matrix (e.g., Durbin 1960; Bhapkar 1972; Morton 1981; Ferreira 1982; Godambe and Heyde 1987; Kauermann and Carroll 2001), and mentioned that, under conditions that are not met in the context of Gaussian random fields, quasi-likelihood theory says that the quasi-likelihood estimator of $\theta$ is asymptotically normal with covariance equal to the sandwich matrix (Heyde 1997). We will show formally that under the asymptotic regime we consider, this sandwich matrix is in fact the asymptotic covariance of $\hat{\theta}_{t,n}$.


2.1.1 Consistency. Let $\theta_0$ be the true parameter vector and $P_0$ the probability measure under $\theta_0$. Also, let $E_0$ and $\mathrm{cov}_0$ denote the expectation and the covariance, respectively, under $\theta_0$. Let $\lambda_{\max}\{A\}$ and $\lambda_{\min}\{A\}$ denote, respectively, the largest and smallest absolute eigenvalues of a matrix $A$. We assume throughout that standard measurability conditions hold. We also assume that $\theta_0 \in \Theta$, a convex compact subset of $\mathbb{R}^p$.

In addition, we refer to the following assumptions on the covariance function $C(\theta; h)$, the taper function $\rho_t(L; h)$, and the increasing sequence of domains $\{D_n\}$:

(A0) $C(\theta; h)$ is twice continuously differentiable in $\theta$.

(A1) $\inf_n \lambda_{\min}\{\Sigma_n\} > 0$ and $\sup_n \lambda_{\max}\{\Sigma_n\} < \infty$ for all $\theta \in \Theta$.

(A2) $\sup_n \lambda_{\max}\left\{\frac{\partial \Sigma_n}{\partial \theta_k} \circ T_n\right\} < \infty$ for all $\theta \in \Theta$, $k = 1, \ldots, p$.

(A3) For some $N > 0$ and $\gamma > 0$, the minimum eigenvalue of the matrix $-E_0[n^{-1}\ddot{\ell}_{t,n}(\theta_0)]$ is greater than $\gamma$ for all $n \ge N$.

(A4) $\sup_n \lambda_{\max}\left\{\frac{\partial^2 \Sigma_n}{\partial \theta_j \partial \theta_k} \circ T_n\right\} < \infty$ for all $\theta \in \Theta$, $j, k = 1, \ldots, p$.

(B) $\theta_1 \neq \theta_2 \Rightarrow (\Sigma_n(\theta_1) \circ T_n) \neq (\Sigma_n(\theta_2) \circ T_n)$.

(C) $\rho_t(L; h)$ is a valid correlation function with support on $[0, L)$ for some appropriately chosen $L > 0$, and is greater than zero for $h < L$.

Theorem 1 (Consistency). Assume conditions (A0)–(A3), (B), and (C). Then $\hat{\theta}_{t,n} = \arg\max_{\theta \in \Theta} \ell_{t,n}(\theta)$ is consistent.

Proof. See Section S-3 in the supplementary materials. □

We note that the dimension of the domain $D$ does not play a direct role in any of the assumptions. For the covariance models often used in practice, the identifiability assumption (B) is usually satisfied by making sure the taper range $L$ is larger than the smallest interpoint distances in the data. Assumptions (A1)–(A3) take a bit more work to check, but are not prohibitively difficult in most useful contexts. The following example shows one common case.

Example 1 (Exponential Covariance Function on a Rectangular Lattice (I)). A commonly used covariance function is the exponential covariance function $C(\theta; h) = \sigma^2 \exp\{-\alpha h\}$, where $\theta = (\sigma^2, \alpha)'$. Consider the increasing-domain scenario where sampling points $s_1, \ldots, s_n$ are placed on a rectangular lattice $\{D_n\} \subset \Delta\mathbb{Z}^d$, for a fixed $0 < \Delta < L$, and $D_n \subset D_{n+1}$ for all $n$.

Define the distance matrix $H_n$, with $H_{ij,n} = \|s_i - s_j\|$, where $\|\cdot\|$ denotes the Euclidean norm. Let $\Sigma_n = \sigma^2 \Gamma_n$ so that the correlation matrix $\Gamma_{ij,n} = e^{-\alpha H_{ij,n}}$. Taking derivatives, we get $\frac{\partial \Sigma_n}{\partial \sigma^2} = \Gamma_n$ and $\frac{\partial \Sigma_n}{\partial \alpha} = -\sigma^2 H_n \circ \Gamma_n$.

For any matrix norm, the spectral radius $\lambda_{\max}\{A\}$ of an $n \times n$ matrix $A$ satisfies $\lambda_{\max}\{A\} \le \|A\|$ (Horn and Johnson 1991, p. 297). If we choose the maximum row sum norm (Horn and Johnson 1991, p. 295),

$$\lambda_{\max}\{\Gamma_n\} \le \|\Gamma_n\|_\infty = \sup_{1 \le i \le n} \sum_{j=1}^{n} e^{-\alpha H_{ij}} \le \sum_{s \in \Delta\mathbb{Z}^d} e^{-\alpha\|s\|}. \qquad (2.2)$$

One can check that $\int_{s \in \mathbb{R}^d} e^{-\alpha\|s\|}\, ds < \infty$ for $\alpha > 0$, which implies that the last line in (2.2) is finite, so $\sup_n \lambda_{\max}\{\Gamma_n\}$ is finite. We finish confirming (A1) by noting that all interpoint distances are fixed away from zero, ensuring that $\Sigma_n$ is strictly positive definite as $n \to \infty$.
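The row-sum bound in (2.2) can be checked numerically. The sketch below assumes a two-dimensional square lattice with unit spacing and $\alpha = 0.2$ (values chosen for illustration), and compares the maximum row sum of the correlation matrix on growing lattices with a large partial sum of the lattice series. Function names are hypothetical.

```python
import math


def max_row_sum(m, alpha, spacing=1.0):
    """Maximum row sum of the exponential correlation matrix on an m x m lattice."""
    pts = [(i * spacing, j * spacing) for i in range(m) for j in range(m)]
    return max(
        sum(math.exp(-alpha * math.dist(p, q)) for q in pts)
        for p in pts
    )


def lattice_sum(alpha, radius, spacing=1.0):
    """Partial sum of exp(-alpha * ||s||) over lattice offsets with |coordinate| <= radius."""
    k = range(-radius, radius + 1)
    return sum(math.exp(-alpha * spacing * math.hypot(i, j)) for i in k for j in k)


alpha = 0.2
for m in (5, 10, 20):
    print(m * m, round(max_row_sum(m, alpha), 3))
print("partial lattice sum:", round(lattice_sum(alpha, 200), 3))
```

Each row of the correlation matrix sums over a subset of the lattice offsets, so the row sums increase with the lattice size but never exceed the lattice series, which is what keeps the largest eigenvalue bounded.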

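Next, for a taper function $\rho_t(L; h)$ with support on $[0, L)$,

$$\begin{aligned}
\lambda_{\max}\{(H_n \circ \Gamma_n \circ T_n)\} &\le \|(H_n \circ \Gamma_n \circ T_n)\|_\infty \\
&= \sup_{1 \le i \le n} \sum_{j=1}^{n} H_{ij}\, e^{-\alpha H_{ij}} \rho_t(L; H_{ij}) \\
&\le \sup_{1 \le i \le n} \sum_{j=1}^{n} H_{ij}\, e^{-\alpha H_{ij}} 1_{\{H_{ij} < L\}} \\
&\le \sum_{s \in \Delta\mathbb{Z}^d} \|s\|\, e^{-\alpha\|s\|} 1_{\{\|s\| < L\}} \\
&= \sum_{s \in \Delta\mathbb{Z}^d \cap \{s : \|s\| < L\}} \|s\|\, e^{-\alpha\|s\|}. \qquad (2.3)
\end{aligned}$$

The last line in (2.3) is the sum of a finite number of bounded summands, and is therefore itself finite. Thus, for $\sigma^2 < \infty$, $\sup_n \lambda_{\max}\{(H_n \circ \Gamma_n \circ T_n)\} < \infty$. Now we can see that $\sup_n \lambda_{\max}\left\{\frac{\partial \Sigma_n}{\partial \theta_k} \circ T_n\right\} < \infty$ for $k = 1, 2$, thus verifying Assumption (A2).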
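The final expression in (2.3) is a finite sum that does not depend on $n$, and it is easy to evaluate directly. The helper below is a hypothetical illustration for a lattice in $d$ dimensions; the spacing, $\alpha$, and taper ranges are example values.

```python
import math
from itertools import product


def tapered_bound(alpha, L, spacing=1.0, d=2):
    """Last line of (2.3): sum of ||s|| exp(-alpha ||s||) over lattice points with ||s|| < L."""
    k = int(math.ceil(L / spacing))
    total = 0.0
    for idx in product(range(-k, k + 1), repeat=d):
        norm = spacing * math.sqrt(sum(i * i for i in idx))
        if norm < L:                       # only points inside the taper range contribute
            total += norm * math.exp(-alpha * norm)
    return total


for L in (3.0, 5.0, 15.0):
    print(L, round(tapered_bound(0.2, L), 3))
```

Because the taper removes every lattice point at distance $L$ or more, the bound grows with $L$ but stays fixed as the lattice expands.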

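For Assumption (A3), it is possible to write down the form of the two eigenvalues analytically. Let $q_{ij}$ denote the $ij$th entry of $-E_0[n^{-1}\ddot{\ell}_{t,n}(\theta_0)]$. Following the computations in (S-9), we see that, suppressing the dependence on $n$ for the moment,

$$\begin{aligned}
q_{11} &= \frac{1}{2\sigma^4} \mathrm{tr}\left\{(\Gamma \circ T)^{-1}(\Gamma \circ T)(\Gamma \circ T)^{-1}(\Gamma \circ T)\right\} = \frac{1}{2\sigma^4} \mathrm{tr}\{I_n\} = \frac{n}{2\sigma^4}, \\
q_{12} = q_{21} &= -\frac{1}{2\sigma^2} \mathrm{tr}\left\{(\Gamma \circ T)^{-1}(H \circ \Gamma \circ T)\right\}, \\
q_{22} &= \frac{1}{2} \mathrm{tr}\left\{(\Gamma \circ T)^{-1}(H \circ \Gamma \circ T)(\Gamma \circ T)^{-1}(H \circ \Gamma \circ T)\right\}.
\end{aligned}$$

The eigenvalues of $-E_0[n^{-1}\ddot{\ell}_{t,n}(\theta_0)]$ are

$$\frac{1}{2}\left(q_{11} + q_{22} \pm \sqrt{4q_{12}^2 + (q_{11} - q_{22})^2}\right). \qquad (2.4)$$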


Noting that $q_{11}$ and $q_{22}$ are both positive, the smaller of the eigenvalues is positive when $(q_{11} + q_{22})^2 > 4q_{12}^2 + (q_{11} - q_{22})^2$, or, after rearranging terms, when the determinant $q_{11}q_{22} - q_{12}^2 > 0$. Letting $A_n = (\Gamma_n \circ T_n)^{-1}(H_n \circ \Gamma_n \circ T_n)$, we can write the determinant as

$$\begin{aligned}
q_{11}q_{22} - q_{12}^2 &= \frac{n}{4\sigma^4} \mathrm{tr}\{A_n^2\} - \frac{1}{4\sigma^4} \mathrm{tr}^2\{A_n\} \\
&= \frac{n}{4\sigma^4} \sum_{i=1}^{n} (\lambda_i\{A_n\})^2 - \frac{1}{4\sigma^4} \left(\sum_{i=1}^{n} \lambda_i\{A_n\}\right)^2. \qquad (2.5)
\end{aligned}$$

The right-hand side of (2.5) will decay to zero with increasing $n$ only if the eigenvalues of $A_n$ converge to a common value. Because $A_n$ is symmetric, this can only happen if $(\Gamma_n \circ T_n)^{-1}(H_n \circ \Gamma_n \circ T_n) - a_n I_n \to 0$ for some scalar $a_n$, which is clearly not the case. The same argument holds for any subsequence $A_{n_m}$. Thus, neither the sequence defined by (2.5) nor any subsequence thereof can decay to zero, and Assumption (A3) is satisfied.
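A numerical look at the quantities in (2.4) and (2.5) may help build intuition. The sketch below assumes a two-dimensional unit lattice, the taper (1.2), $\sigma^2 = 1$, $\alpha = 0.2$, and $L = 3$; these settings and the function names are chosen for illustration. It evaluates the $q$ entries as written in Example 1 and prints the eigenvalues from (2.4) scaled by $1/n$.

```python
import numpy as np


def wendland_taper(h, L):
    """Taper (1.2) applied element-wise to an array of distances."""
    r = h / L
    return np.clip(1.0 - r, 0.0, None) ** 6 * (1.0 + 6.0 * r + 35.0 * r**2 / 3.0)


def q_entries(m, sigma2, alpha, L):
    """q11, q12, q22 as written in Example 1 on an m x m unit lattice."""
    g = np.arange(m, dtype=float)
    locs = np.array([(x, y) for x in g for y in g])
    H = np.linalg.norm(locs[:, None, :] - locs[None, :, :], axis=-1)
    GT = np.exp(-alpha * H) * wendland_taper(H, L)     # Gamma o T
    A = np.linalg.solve(GT, H * GT)                    # (Gamma o T)^{-1} (H o Gamma o T)
    n = m * m
    q11 = n / (2.0 * sigma2**2)
    q12 = -np.trace(A) / (2.0 * sigma2)
    q22 = 0.5 * np.trace(A @ A)
    return n, q11, q12, q22


def eigenvalues_2x2(q11, q12, q22):
    """Formula (2.4) for the eigenvalues of a symmetric 2 x 2 matrix."""
    root = np.sqrt(4.0 * q12**2 + (q11 - q22) ** 2)
    return 0.5 * (q11 + q22 - root), 0.5 * (q11 + q22 + root)


for m in (5, 10, 15):
    n, q11, q12, q22 = q_entries(m, sigma2=1.0, alpha=0.2, L=3.0)
    lo, hi = eigenvalues_2x2(q11, q12, q22)
    print(n, lo / n, hi / n)
```

If the argument above holds, the smaller scaled eigenvalue should stay bounded away from zero as the lattice grows.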

To show Assumption (B), we write $(\Sigma_n(\theta) \circ T_n)_{ij} = \sigma^2 \exp\{-\alpha h_{ij}\}\rho_t(L; h_{ij})$. Since $\rho_t(L; h) > 0$ when $h < L$, Assumption (B) holds as long as $h_{ij} < L$ for at least two pairs of points $ij$ and $i'j'$ for which $h_{ij} \neq h_{i'j'}$. For the rectangular grid, this is true for $n \ge 3$.

We now see that the exponential covariance function on an expanding grid satisfies Assumptions (A) and (B), so the maximum tapered likelihood estimate of $\theta = (\sigma^2, \alpha)'$ is consistent.

Although covariance tapering is designed for use on irregularly spaced datasets, a regular grid was chosen for Example 1 to make the analysis tractable. We demonstrate the use of the tapered likelihood on irregularly spaced datasets in simulations in Sections 2.2 and 3.2.

The argument in Example 1 is trivially repeated for other classes of covariance functions. Notably, the Matérn ($\nu = 3/2$) model, with $C(\theta; h) = \sigma^2(1 + \alpha h)\exp\{-\alpha h\}$, $\theta = (\sigma^2, \alpha)'$, and the squared exponential model, with $C(\theta; h) = \sigma^2 \exp\{-(\alpha h)^2\}$, $\theta = (\sigma^2, \alpha)'$, are easily accommodated.

2.1.2 Asymptotic Normality. Let us first define some important matrices. Let

$$P_n = E_0[\dot{\ell}_{t,n}\dot{\ell}_{t,n}'], \qquad Q_n = -E_0[\ddot{\ell}_{t,n}], \qquad J_n = Q_n P_n^{-1} Q_n. \qquad (2.6)$$

The matrix $J_n^{-1}$ is familiar to generalized estimating equations, quasi-likelihood, and other areas, and is referred to by various names, including the sandwich matrix, the Godambe information criterion, and the robust information criterion (e.g., Durbin 1960; Bhapkar 1972; Morton 1981; Ferreira 1982; Godambe and Heyde 1987; Heyde 1997). We are now ready to prove two useful lemmas.

Lemma 1. Assume conditions (A0)–(A3), (B), and (C). Then $\dot{\ell}_{t,n} \xrightarrow{D} N(0, P_n)$, where $P_n = E_0[\dot{\ell}_{t,n}\dot{\ell}_{t,n}']$.


Proof. See Section S-3 in the supplementary materials. □

Lemma 2. Assume conditions (A0)–(A4), (B), and (C). Then $n^{-1}\ddot{\ell}_{t,n} - n^{-1}Q_n \xrightarrow{P_0} 0$, where $Q_n = -E_0[\ddot{\ell}_{t,n}]$.

Proof. See Section S-3 in the supplementary materials. □

Introducing some more notation, a $p \times p$ symmetric positive definite matrix $A$ can be written as $A = ODO'$ with $O$ orthogonal and $D$ diagonal. Define $A^{1/2}$ to be the square root $OD^{1/2}O'$.
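The square root defined above can be computed from a symmetric eigendecomposition. The short sketch below uses `numpy.linalg.eigh`; the function name `sym_sqrt` and the example matrix are illustrative choices.

```python
import numpy as np


def sym_sqrt(A):
    """Square root O D^{1/2} O' of a symmetric positive definite matrix A = O D O'."""
    d, O = np.linalg.eigh(A)          # eigenvalues d, orthonormal eigenvectors in columns of O
    return (O * np.sqrt(d)) @ O.T     # broadcasting scales column j of O by sqrt(d_j)


A = np.array([[2.0, 0.5], [0.5, 1.0]])
S = sym_sqrt(A)
print(np.allclose(S @ S, A))
```

The result is itself symmetric and positive definite, which distinguishes it from a Cholesky factor.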

Theorem 2 (Asymptotic Normality). Assume conditions (A0)–(A4), (B), and (C). Then

$$J_n^{1/2}(\hat{\theta}_{t,n} - \theta_0) \xrightarrow{D} N(0, I),$$

where $J_n$ is defined as in (2.6).

Proof. See Section S-3 in the supplementary materials. □
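To make the matrices in (2.6) and the resulting intervals concrete, here is a sketch for a mean-zero Gaussian vector. The expression used for $Q_n$ matches the form of the $q$ entries in Example 1. The expression used for $P_n$ is an assumption of this example: it is derived from the standard covariance of Gaussian quadratic forms, $\mathrm{cov}(Z'B_jZ, Z'B_kZ) = 2\,\mathrm{tr}\{B_j\Sigma B_k\Sigma\}$, and is not copied from the supplementary materials. Function names and parameter values are hypothetical.

```python
import numpy as np


def sandwich(Sigma, dSigma, T):
    """Return (P, Q, J) of (2.6) for the two-taper likelihood of a mean-zero Gaussian vector.

    Sigma  : (n, n) covariance matrix at the parameter value of interest
    dSigma : list of (n, n) derivative matrices, one per parameter
    T      : (n, n) taper matrix
    """
    St_inv = np.linalg.inv(Sigma * T)
    W = [St_inv @ (dS * T) for dS in dSigma]      # (Sigma o T)^{-1} (dSigma_k o T)
    B = [(Wk @ St_inv) * T for Wk in W]           # matrices of the score's quadratic forms
    p = len(dSigma)
    Q = np.array([[0.5 * np.trace(W[j] @ W[k]) for k in range(p)] for j in range(p)])
    P = np.array([[0.5 * np.trace(B[j] @ Sigma @ B[k] @ Sigma) for k in range(p)]
                  for j in range(p)])
    J = Q @ np.linalg.solve(P, Q)                 # J = Q P^{-1} Q
    return P, Q, J


def wald_intervals(theta_hat, J, z=1.959964):
    """Approximate 95% intervals using the diagonal of J^{-1} as the variances."""
    se = np.sqrt(np.diag(np.linalg.inv(J)))
    return np.column_stack([theta_hat - z * se, theta_hat + z * se])


# Exponential model on an 8 x 8 unit lattice (illustrative settings).
g = np.arange(8, dtype=float)
locs = np.array([(x, y) for x in g for y in g])
H = np.linalg.norm(locs[:, None, :] - locs[None, :, :], axis=-1)
sigma2, alpha, L = 1.0, 0.2, 3.0
Gamma = np.exp(-alpha * H)
Sigma = sigma2 * Gamma
dSigma = [Gamma, -sigma2 * H * Gamma]             # derivatives from Example 1
r = H / L
T = np.clip(1.0 - r, 0.0, None) ** 6 * (1.0 + 6.0 * r + 35.0 * r**2 / 3.0)
P, Q, J = sandwich(Sigma, dSigma, T)
print(wald_intervals(np.array([sigma2, alpha]), J))
```

When the taper matrix is all ones, the two-taper objective is the true log-likelihood, $P_n$ equals $Q_n$, and $J_n$ collapses to the usual Fisher information. This is a useful check on any implementation.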

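Example 2 (Exponential Covariance Function on a Rectangular Lattice (II)). We consider the same setting as in Example 1. Since we have already shown that conditions (A1)–(A3) are satisfied, all we have left is (A4).

The second partial derivatives are

$$\frac{\partial^2 \Sigma_n}{\partial \sigma^2 \partial \alpha} = -H_n \circ \Gamma_n, \qquad \frac{\partial^2 \Sigma_n}{\partial (\sigma^2)^2} = 0, \qquad \frac{\partial^2 \Sigma_n}{\partial \alpha^2} = \sigma^2 H_n \circ H_n \circ \Gamma_n.$$

We showed in Example 1 that $\sup_n \lambda_{\max}\{(H_n \circ \Gamma_n \circ T_n)\} < \infty$, and obviously $\sup_n \lambda_{\max}\{0\} < \infty$. So, all that is left to do to demonstrate (A4) is show that $\sup_n \lambda_{\max}\{(H_n \circ H_n \circ \Gamma_n \circ T_n)\} < \infty$.

As before, for a taper function $\rho_t(L; h)$ with support on $[0, L)$,

$$\begin{aligned}
\lambda_{\max}\{(H_n \circ H_n \circ \Gamma_n \circ T_n)\} &\le \|(H_n \circ H_n \circ \Gamma_n \circ T_n)\|_\infty \\
&= \sup_{1 \le i \le n} \sum_{j=1}^{n} H_{ij}^2\, e^{-\alpha H_{ij}} \rho_t(L; H_{ij}) \\
&\le \sup_{1 \le i \le n} \sum_{j=1}^{n} H_{ij}^2\, e^{-\alpha H_{ij}} 1_{\{H_{ij} < L\}} \\
&\le \sum_{s \in \Delta\mathbb{Z}^d} \|s\|^2 e^{-\alpha\|s\|} 1_{\{\|s\| < L\}} \\
&= \sum_{s \in \Delta\mathbb{Z}^d \cap \{s : \|s\| < L\}} \|s\|^2 e^{-\alpha\|s\|}. \qquad (2.7)
\end{aligned}$$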


Again, as before, the last line in (2.7) is the sum of a finite number of bounded summands, and is therefore itself bounded. Thus, for $\sigma^2 < \infty$, we see that $\sup_n \lambda_{\max}\{(H_n \circ H_n \circ \Gamma_n \circ T_n)\} = (1/\sigma^2) \sup_n \lambda_{\max}\left\{\frac{\partial^2 \Sigma_n}{\partial \alpha^2} \circ T_n\right\} < \infty$. So, (A4) is satisfied, and the maximum tapered likelihood estimate of $\theta = (\sigma^2, \alpha)$ is asymptotically normal with asymptotic covariance $J_n^{-1}$.

We note again that the calculations in Example 2 are trivially modified to accommodate the Matérn ($\nu = 3/2$) and squared exponential models.

2.2 SIMULATION EXAMPLE

To explore the sampling characteristics of the maximum tapered likelihood estimator, we simulated datasets with exponential and Matérn ($\nu = 3/2$) covariance functions, a setup following Examples 1 and 2. Data locations are sited on a unit grid perturbed with iid uniform(−1/3, 1/3) deviations, closely following Stein, Chi, and Welty (2004) and Kaufman, Schervish, and Nychka (2008).

Each dataset consisted of 1000 random samples drawn from an $N(0, \Sigma)$, where $\Sigma_{ij} = \sigma^2 \exp\{-\alpha h_{ij}\}$ in the exponential case and $\Sigma_{ij} = \sigma^2(1 + \alpha h_{ij})\exp\{-\alpha h_{ij}\}$ in the Matérn ($\nu = 3/2$) case. Here $\sigma^2$ was set at 1, and the $\alpha$ parameters were set at 0.2 and 0.316, respectively, for the exponential and Matérn ($\nu = 3/2$) models. This gives an effective range, the distance at which correlations drop below 0.05, of 15 for all simulations. We used the reparameterization $\theta = (\sigma^2, c)'$, where $c = \sigma^2\alpha^{2\nu}$, because this parameterization is more easily identified by the data (Zhang 2004; Kaufman 2006). For the taper function, we chose (1.2) from the class of compactly supported polynomial correlation functions introduced by Wendland (1995) and first used for tapering by Furrer, Genton, and Nychka (2006).
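The stated parameter values can be checked with a few lines of arithmetic. The sketch below evaluates both correlation functions at distance 15 and computes the reparameterized value $c = \sigma^2\alpha^{2\nu}$ for each model; the function and variable names are illustrative.

```python
import math


def exponential_corr(h, alpha):
    """Exponential correlation exp(-alpha h)."""
    return math.exp(-alpha * h)


def matern32_corr(h, alpha):
    """Matern correlation with nu = 3/2: (1 + alpha h) exp(-alpha h)."""
    return (1.0 + alpha * h) * math.exp(-alpha * h)


sigma2 = 1.0
settings = {
    "exponential": (exponential_corr, 0.2, 0.5),     # (correlation, alpha, nu)
    "Matern 3/2": (matern32_corr, 0.316, 1.5),
}
for name, (corr, alpha, nu) in settings.items():
    c = sigma2 * alpha ** (2.0 * nu)                  # reparameterization c = sigma^2 alpha^(2 nu)
    print(name, "correlation at h = 15:", round(corr(15.0, alpha), 4), "c =", round(c, 4))
```

Both correlations come out close to 0.05 at distance 15, in line with the stated effective range.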

Datasets were sampled for $n$ = 100, 400, 1600, and 2500. All computations were carried out on a 2.3 GHz Linux machine using the R package spam for handling sparse matrices.
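The sampling design described above can be sketched as follows. This is a NumPy illustration under the stated settings, not the authors' R code; the function names and the random seed are assumptions of this example, and dense matrices are used for simplicity.

```python
import numpy as np


def perturbed_grid(side, rng, jitter=1.0 / 3.0):
    """Unit grid of side x side points, each coordinate shifted by uniform(-jitter, jitter)."""
    g = np.arange(side, dtype=float)
    base = np.array([(x, y) for x in g for y in g])
    return base + rng.uniform(-jitter, jitter, size=base.shape)


def simulate_field(locs, sigma2, alpha, rng, model="exponential"):
    """Draw one mean-zero Gaussian vector with exponential or Matern (nu = 3/2) covariance."""
    h = np.linalg.norm(locs[:, None, :] - locs[None, :, :], axis=-1)
    if model == "exponential":
        Sigma = sigma2 * np.exp(-alpha * h)
    else:
        Sigma = sigma2 * (1.0 + alpha * h) * np.exp(-alpha * h)
    return np.linalg.cholesky(Sigma) @ rng.standard_normal(len(locs))


rng = np.random.default_rng(2012)
locs = perturbed_grid(20, rng)                 # n = 400
z = simulate_field(locs, 1.0, 0.2, rng)
print(locs.shape, z.shape)
```

Repeating the draw 1000 times for each sample size and maximizing the tapered log-likelihood on each draw would mirror the structure of the experiment.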

The covariance function and tapered covariance functions are shown in Figure 1 for the three different taper ranges used in this simulation experiment. The severity of the taper at $L = 3$ is evident. The times required to compute a single evaluation of the tapered and full log-likelihood functions are shown in Figure 2.

For the longest taper range ($L = 15$), the overhead required for the sparse matrix operations and the additional complexity induced by the second tapering operation in (1.3) (tapering the inverse) overwhelms the computational advantages gained by the sparsity. However, for the smaller taper ranges, especially as $n$ gets larger, the computational benefits of tapering become clear.

For each dataset, the maximum tapered likelihood estimate of $\theta = (\sigma^2, c)'$ was calculated using three different taper ranges, $L$ = 15, 5, and 3. In addition, confidence intervals for $\theta$ were computed from $J(\hat{\theta})$ for each $\hat{\theta}$. $L = 15$ was chosen because it is roughly equal to the effective range of the process. $L = 5$ is equal to $1/\alpha$, the range parameter of the exponential covariance function in its usual parameterization. Finally, $L = 3$ was chosen because it represents an extreme case of being close to the smallest value for which each sample point is guaranteed to have its eight immediate neighbors contained within the taper range.
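One way to read the remark about the eight immediate neighbors is through a worst-case distance calculation on the perturbed unit grid. The arithmetic below is an interpretation offered for illustration, assuming perturbations of at most 1/3 in each coordinate.

```python
import math

jitter = 1.0 / 3.0
# Diagonal neighbors on the unit grid differ by 1 in each coordinate. In the worst case
# the perturbations push them a further 2 * jitter apart along each axis.
worst_axis_gap = 1.0 + 2.0 * jitter
worst_diagonal = math.hypot(worst_axis_gap, worst_axis_gap)
print(round(worst_diagonal, 3))
```

Under this reading, any taper range above roughly 2.36 keeps all eight neighbors inside the support of the taper, so $L = 3$ sits only slightly above that threshold.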

Figure 1. Covariance and tapered covariance functions for L = 3, 5, and 15. [Plot titled "Tapered covariance functions"; horizontal axis: distance h, from 0 to 20; vertical axis from 0.0 to 1.0; curves: exponential covariance function, tapered exponential covariance function, Matérn (ν = 3/2) covariance function, tapered Matérn (ν = 3/2) covariance function.]

Figure 2. Time required, in seconds, for a single evaluation of the tapered and full log-likelihood functions. [Plot; horizontal axis: n; vertical axis: seconds; curves: full, L = 15, L = 5, L = 3.]

Finally, for each combination of $L$ and $n$, the sample covariance matrix was calculated for the 1000 $\hat{\theta}_t$'s and compared with the matrix $J^{-1}$, the asymptotic covariance matrix, calculated using (S-9) and (S-11) in Section S-1 in the supplementary materials.

It is interesting to note that the domain corresponding to the largest sample size considered here is a square grid of size 50 units on a side. The domain then is just a few times larger than the 15-unit effective range of the simulated process. This configuration is important to keep in mind when considering the applicability of expanding-domain asymptotics to parameter estimation with data from small or moderately sized domains.

The resulting empirical density estimates from the simulations are shown in Figures 3–6.In all of these figures, as expected, we see the empirical densities becoming more symmetric

Dow

nloa

ded

by [

Uni

vers

ity o

f C

alif

orni

a, B

erke

ley]

at 1

2:01

15

Janu

ary

2013

Page 11: Tapered Covariance: Bayesian Estimation and AsymptoticsTAPERED COVARIANCE 437 2.1.1 Consistency. Let θ 0 be the true parameter vector and P 0 the probability measure under θ 0.Also,

TAPERED COVARIANCE 443

0 2 4

0.0

1.5

n = 100

L=3

0 2 4

0.0

1.5

n = 400

0 2 4

0.0

1.5

n = 1600

0 2 4

0.0

1.5

n = 2500

0 2 4

0.0

1.5

L=5

0 2 40.

01.

50 2 4

0.0

1.5

0 2 4

0.0

1.5

0 2 4

0.0

1.0

2.0

L=15

0 2 4

0.0

1.0

2.0

0 2 4

0.0

1.0

2.0

0 2 4

0.0

1.0

2.0

Figure 3. Kernel density estimates for σ 2t for the exponential model.

[Figure 4: grid of density plots; rows correspond to taper ranges L = 3, 5, and 15, and columns to sample sizes n = 100, 400, 1600, and 2500.]

Figure 4. Kernel density estimates for $\hat{c}_t$ for the exponential model.

and more sharply peaked as the sample size increases. Comparing across taper ranges, empirical density plots for $\hat{\sigma}^2_t$ at the three taper ranges look very similar for both covariance models. The densities of $\hat{c}_t$ show small differences for the exponential model and marked differences for the Matérn $\nu = 3/2$ model, but they are not as large as one might expect when tapering so severely. Kaufman, Schervish, and Nychka (2008) noted the similarities between the empirical densities computed from maximum likelihood estimates and maximum tapered likelihood estimates, when applied to the exponential model.

[Figure 5: grid of density plots; rows correspond to taper ranges L = 3, 5, and 15, and columns to sample sizes n = 100, 400, 1600, and 2500.]

Figure 5. Kernel density estimates for $\hat{\sigma}^2_t$ for the Matérn $\nu = 3/2$ model.

[Figure 6: grid of density plots; rows correspond to taper ranges L = 3, 5, and 15, and columns to sample sizes n = 100, 400, 1600, and 2500.]

Figure 6. Kernel density estimates for $\hat{c}_t$ for the Matérn $\nu = 3/2$ model.

Coverage rates for nominal 95% confidence intervals are shown in Tables 1 and 2. The normal approximation for the $c$ parameter is excellent (see QQ-plots, Figures S-6 and S-8 in the supplementary materials), and the coverage rates are spot-on for both covariance models. Despite notable departures from normality for the $\sigma^2$ estimates (see Figures S-5 and S-7 in the supplementary materials), coverage rates are quite reasonable for moderately sized datasets and excellent for large datasets.

Table 1. Coverage rates for nominal 95% confidence intervals, constructed using the sandwich formula, for the maximum tapered likelihood estimates of the $\sigma^2$ parameter of exponential and Matérn $\nu = 3/2$ covariance models

                                n
Model                   L     100    400    1600   2500
Exponential             3     0.81   0.85   0.92   0.91
Exponential             5     0.76   0.87   0.92   0.94
Exponential            15     0.79   0.85   0.89   0.94
Matérn $\nu = 3/2$      3     0.78   0.88   0.92   0.93
Matérn $\nu = 3/2$      5     0.79   0.86   0.92   0.92
Matérn $\nu = 3/2$     15     0.81   0.89   0.92   0.92

Table 2. Coverage rates for nominal 95% confidence intervals, constructed using the sandwich formula, for the maximum tapered likelihood estimates of the $c$ parameter of exponential and Matérn $\nu = 3/2$ covariance models

                                n
Model                   L     100    400    1600   2500
Exponential             3     0.94   0.94   0.95   0.94
Exponential             5     0.95   0.94   0.96   0.96
Exponential            15     0.95   0.94   0.95   0.94
Matérn $\nu = 3/2$      3     0.94   0.94   0.96   0.94
Matérn $\nu = 3/2$      5     0.95   0.95   0.95   0.96
Matérn $\nu = 3/2$     15     0.95   0.95   0.93   0.95
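
A coverage rate of the kind tabulated here is the proportion of simulated datasets whose nominal 95% interval contains the true parameter value. The following is a minimal Python sketch of that calculation, assuming each simulated fit supplies a point estimate and a sandwich-based standard error. The function names `wald_interval` and `coverage_rate` and the toy inputs are illustrative assumptions; they are not taken from the authors' R code.

```python
from statistics import NormalDist


def wald_interval(estimate, std_error, level=0.95):
    """Symmetric normal-approximation interval: estimate +/- z * std_error."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * std_error, estimate + z * std_error


def coverage_rate(estimates, std_errors, true_value, level=0.95):
    """Fraction of intervals (one per simulated dataset) containing true_value."""
    hits = 0
    for estimate, std_error in zip(estimates, std_errors):
        lower, upper = wald_interval(estimate, std_error, level)
        if lower <= true_value <= upper:
            hits += 1
    return hits / len(estimates)


# Toy inputs (made up for illustration, not simulation output from the article).
toy_estimates = [1.0, 1.1, 3.0]
toy_std_errors = [0.1, 0.1, 0.1]
toy_coverage = coverage_rate(toy_estimates, toy_std_errors, true_value=1.0)
```

In the simulation design described in the text, `estimates` would hold the 1000 maximum tapered likelihood estimates for one combination of L and n.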

For each sample size and taper range, tables comparing sample covariance matrices of the 1000 estimates of $\theta$ with their corresponding asymptotic calculations using the sandwich matrix (2.6) can be found in Section S-5. Importantly, Tables S-1 and S-2 show that asymptotic calculations of sampling variability become accurate at moderate sample sizes.

3. THE TAPERED QUASI-BAYESIAN ESTIMATOR

Often one might prefer Bayesian estimation over maximum likelihood-type estimation for covariance parameters of Gaussian random fields for all the usual reasons: the ability to incorporate prior knowledge, the natural inclusion of parameter shrinkage, the straightforward extension to larger hierarchical models, and so on.

Here, we investigate the properties of tapered quasi-Bayesian estimators, which are analogous to Bayesian estimators except that the likelihood is replaced by the tapered likelihood. Specifically, we show that tapered quasi-Bayesian estimators are consistent in that the quasi-posterior, defined in (3.1), converges to a point mass at the true parameter $\theta_0$. We also show that the quasi-posterior is asymptotically normal, and hence that samples from the quasi-posterior [e.g., generated by Markov chain Monte Carlo (MCMC)] can be used to construct consistent confidence intervals for $\theta$.

Following Chernozhukov and Hong (2003), we define the tapered quasi-posterior distribution as

$$\pi_{t,n}(\theta \mid Z_n) = \frac{L_{t,n}(\theta; Z_n)\,\pi(\theta)}{\int_{\Theta} L_{t,n}(\theta; Z_n)\,\pi(\theta)\,d\theta}, \tag{3.1}$$

where $L_{t,n}(\theta; Z_n) = \exp\{\ell_{t,n}(\theta; Z_n)\}$, and $\pi(\theta)$ is a prior density on $\theta$. We will assume, for convenience, that $\pi(\theta)$ is proper. Recall that $L_{t,n}$ is not a density, and thus $\pi_{t,n}(\theta \mid Z_n)$ is not a true posterior. We are guaranteed, however, that as long as the prior $\pi(\theta)$ is proper, then $\pi_{t,n}(\theta \mid Z_n)$ will be a proper density (Kaufman 2006).

In Section 3.1, we closely follow Chernozhukov and Hong (2003), who studied quasi-Bayesian estimation in a more general context. However, while their formulation was the same, their outlook was quite different. They viewed quasi-Bayesian estimation as a tool to enable the use of MCMC to maximize objective functions that are not differentiable or are otherwise poorly behaved. That is, they viewed it as a way to use Bayesian computational machinery to compute frequentist estimators. Although we study frequentist properties of quasi-Bayesian estimators here, we consider the quasi-Bayesian framework as a computationally tractable alternative to exact Bayesian methods.

3.1 ASYMPTOTIC BEHAVIOR OF TAPERED QUASI-BAYESIAN ESTIMATORS

The results in this section are applications of theorems in Chernozhukov and Hong (2003), which we restate in the language of tapering. The main difficulty in applying their theory to the case of the tapered likelihood is showing that their assumptions 1–4 are satisfied. In particular, Chernozhukov and Hong (2003) assumed consistency of the extremum estimator and asymptotic normality of the first derivative of the objective function, which we have shown in Theorem 1 and Lemma 2. The remaining assumptions of Chernozhukov and Hong (2003) follow easily from the conditions for Theorem 1 and Lemmas 1 and 2, so we will assume that they hold throughout this section.

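3.1.1 Convergence of the Tapered Quasi-Posterior. First, we define the total variation of moments norm of a real-valued measurable function $f$ as

$$\|f\|_{TVM(\omega)} \equiv \int (1 + \|\delta\|^{\omega})\,|f(\delta)|\,d\delta.$$

Note that the special case of $\omega = 0$ is the usual total variation norm.

We now define the parameter $\delta$, the scaled deviation from $\theta_0$, centered at the tapered score $\dot{\ell}_{t,n}(\theta_0)$ [scaled by $\ddot{\ell}_{t,n}(\theta_0)$], as

$$\delta = \sqrt{n}\,(\theta - \theta_0) - \sqrt{n}\,\ddot{\ell}_{t,n}(\theta_0)^{-1}\,\dot{\ell}_{t,n}(\theta_0).$$

The tapered quasi-posterior density of $\delta$ is then

$$\pi^*_{t,n}(\delta \mid Z_n) = \frac{1}{\sqrt{n}}\,\pi_{t,n}\!\left(\frac{\delta}{\sqrt{n}} + \theta_0 + \ddot{\ell}_{t,n}(\theta_0)^{-1}\,\dot{\ell}_{t,n}(\theta_0)\right).$$

We are now ready to state a consistency result about the tapered quasi-posterior distribution.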

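Theorem 3 (Theorem 1 of Chernozhukov and Hong (2003)). For any $0 \le \omega < \infty$,

$$\|\pi^*_{t,n}(\delta \mid Z_n) - \pi^*_{t,\infty}(\delta \mid Z_n)\|_{TVM(\omega)} \xrightarrow{P} 0,$$

where

$$\pi^*_{t,\infty}(\delta \mid Z_n) = \left(\frac{|Q_n/n|}{(2\pi)^p}\right)^{1/2} \exp\left\{-\frac{1}{2}\,\delta'(Q_n/n)\,\delta\right\}.$$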

From Theorem 3, we see that $\pi_{t,n}(\theta \mid Z_n)$ concentrates its mass at $\theta_0$ at a rate of $1/\sqrt{n}$, as measured by the total variation of moments norm. Then, asymptotically, $\pi_{t,n}(\theta \mid Z_n)$ approximates a normal random variable with mean $\theta_0 + Q_n^{-1}\dot{\ell}_{t,n}(\theta_0)$ and covariance matrix $Q_n^{-1}$.

3.1.2 Tapered Quasi-Bayesian Point Estimates. We will construct tapered quasi-Bayesian point estimates in a manner analogous to the construction of proper Bayesian estimators. Let the scalar function $\kappa_n(u)$ be a loss function. For simplicity, we will only consider symmetric loss functions, with $\kappa_n(u) = \kappa_n(-u)$, although this restriction is not necessary. Common symmetric loss functions include the quadratic and absolute loss functions.

We can now define the tapered quasi-posterior risk function as the expected loss (with respect to the quasi-posterior), $R_n(\theta) = \int_{\Theta} \kappa_n(\theta - \theta^*)\,\pi_{t,n}(\theta^* \mid Z_n)\,d\theta^*$. Then, for a given choice of loss function $\kappa_n(u)$, the tapered quasi-Bayesian estimator is the value of $\theta$ that minimizes the tapered quasi-posterior risk, $\hat{\theta}_{QB} = \arg\min_{\theta \in \Theta} R_n(\theta)$. As usual, quadratic and absolute loss functions will lead, respectively, to the tapered quasi-posterior mean and median as the quasi-Bayesian estimators.
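
The link between the loss function and the resulting estimator can be checked numerically. The sketch below, for a scalar parameter, approximates the quasi-posterior risk by an average over draws and minimizes it over a grid of candidate values. The draws, the grid, and all function names are illustrative assumptions, and a grid search is used only because it is easy to follow.

```python
from statistics import fmean, median


def quadratic_loss(u):
    return u * u


def absolute_loss(u):
    return abs(u)


def quasi_posterior_risk(theta, draws, loss):
    """Monte Carlo estimate of R_n(theta): average loss over quasi-posterior draws."""
    return fmean(loss(theta - draw) for draw in draws)


def minimize_risk_on_grid(draws, loss, grid):
    """Return the grid point with the smallest estimated quasi-posterior risk."""
    return min(grid, key=lambda theta: quasi_posterior_risk(theta, draws, loss))


# Toy draws for a scalar parameter (made up for illustration).
draws = [0.0, 1.0, 2.0, 3.0, 10.0]
grid = [i / 10 for i in range(101)]  # candidate values 0.0, 0.1, ..., 10.0

qb_quadratic = minimize_risk_on_grid(draws, quadratic_loss, grid)
qb_absolute = minimize_risk_on_grid(draws, absolute_loss, grid)
```

For these toy draws, the quadratic-loss minimizer coincides with the sample mean of the draws and the absolute-loss minimizer with their sample median, as the text states for the quasi-posterior mean and median.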

Theorem 4 (Theorem 2 of Chernozhukov and Hong (2003)). For a symmetric loss function $\kappa_n(u)$,

$$J_n^{1/2}(\hat{\theta}_{QB} - \theta_0) \xrightarrow{D} N(0, I).$$

What we see from Theorem 4 is that tapered quasi-Bayesian estimators such as the quasi-posterior mean and median are asymptotically normal with covariance equal to the sandwich matrix $J_n^{-1}$, the same asymptotic distribution as the maximum tapered likelihood estimator.

3.1.3 Tapered Quasi-Bayesian Confidence Regions. We now turn from the question of constructing point estimates to the question of constructing confidence regions from the tapered quasi-posterior. Here, we continue to use a frequentist vocabulary to derive frequentist properties, even though we are studying quasi-Bayesian inference.

In constructing intervals, what we would like, from a practical point of view, is to directly use the empirical quantiles of a sample (generated from MCMC) from the tapered quasi-posterior as our confidence region. What we will see, however, is that this approach does not yield asymptotically valid confidence intervals. The question of how good such regions are is investigated in Section 3.2.1.

Although the quantiles of $\pi_{t,n}(\theta \mid Z_n)$ do not converge to the quantiles of the limiting normal distribution of $\hat{\theta}_{QB}$, we can still use a quasi-posterior sample to construct intervals that are consistent using the delta method.

Theorem 5 (Theorem 4 of Chernozhukov and Hong (2003)). Let $g(\theta)$ be some scalar function of $\theta$, and let $\hat{J}_n$ be a consistent estimator for $J_n$. That is, $J_n^{-1}\hat{J}_n \xrightarrow{P} I$.

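Define

$$c_{t,g,n}(\alpha) = g(\hat{\theta}_{QB}) + z_{\alpha}\sqrt{\nabla g(\hat{\theta}_{QB})'\,\hat{J}_n^{-1}\,\nabla g(\hat{\theta}_{QB})},$$

where $z_{\alpha}$ is the $\alpha$th quantile of the standard normal distribution. Then,

$$\lim_{n \to \infty} P\{c_{t,g,n}(\alpha/2) \le g(\theta_0) \le c_{t,g,n}(1 - \alpha/2)\} = 1 - \alpha.$$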

This method, of course, requires one to somehow calculate $J_n^{-1}$, which itself requires $P_n$ and $Q_n$. Assuming a sample from $\pi_{t,n}(\theta \mid Z_n)$, generated by MCMC, is available, a simple way of estimating $Q_n^{-1}$ that immediately presents itself is to compute the sample covariance matrix of the chain. Another possibility is to plug $\hat{\theta}_{QB}$ into (S-9) and (S-11) to estimate $P_n$ and $Q_n$. We compare these methods in Section 3.2.1.
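
The interval in Theorem 5 can be sketched in a few lines of Python once estimates of $Q_n^{-1}$ and $P_n$ are in hand. The sketch assumes the usual sandwich form $J_n^{-1} = Q_n^{-1} P_n Q_n^{-1}$ (the article's definition, equation (2.6), lies outside this section), takes the gradient of $g$ as an input, and uses made-up numbers. The helper names `mat_mul`, `sandwich_covariance`, and `delta_method_interval` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mat_mul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def sandwich_covariance(q_inverse, p):
    """Assumed sandwich form Q^{-1} P Q^{-1} for the estimate of J^{-1}."""
    return mat_mul(mat_mul(q_inverse, p), q_inverse)


def delta_method_interval(g_value, g_gradient, j_inverse, alpha=0.05):
    """Interval [c(alpha/2), c(1 - alpha/2)], c(a) = g + z_a * sqrt(grad' J^{-1} grad)."""
    dim = len(g_gradient)
    variance = sum(g_gradient[i] * j_inverse[i][j] * g_gradient[j]
                   for i in range(dim) for j in range(dim))
    std_error = sqrt(variance)
    normal = NormalDist()
    lower = g_value + normal.inv_cdf(alpha / 2.0) * std_error
    upper = g_value + normal.inv_cdf(1.0 - alpha / 2.0) * std_error
    return lower, upper


# Toy numbers (made up): theta = (sigma^2, c) and g(theta) = c, so grad g = (0, 1).
q_inverse = [[0.04, 0.0], [0.0, 0.0002]]
p = [[25.0, 0.0], [0.0, 10000.0]]
j_inverse = sandwich_covariance(q_inverse, p)
lower, upper = delta_method_interval(0.2, [0.0, 1.0], j_inverse)
```

In the MCMC variant described above, `q_inverse` would be replaced by the sample covariance matrix of the chain.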

3.2 SIMULATION EXAMPLE

Simulations to explore the sampling properties of tapered quasi-Bayesian estimators were set up in exactly the same way as in Section 2.2, with 1000 samples drawn from an $N(0, \Sigma)$, where $\Sigma$ comes from an exponential model and a Matérn $\nu = 3/2$ model, with $\sigma^2 = 1$ and $c = \sigma^2\alpha = 0.2$ for the exponential case and $c = \sigma^2\alpha^3 \approx 0.03$ for the Matérn $\nu = 3/2$ case, giving effective ranges of around 15. We again used the taper function (1.2) from Wendland (1995). Sample points were again located on an expanding perturbed square grid, with n = 100, 400, 1600, and 2500. Both parameters were assigned Cauchy prior distributions with scale parameter 10, truncated to have nonnegative support. This prior is proper but extremely weakly informative.

Because of the computational infeasibility of conducting 1000 long MCMC runs for each model, sample size, and taper range, MCMC was avoided by defining the estimators $\hat{\theta}_{QB}$ as quasi-posterior modes. This enabled $\hat{\theta}_{QB}$ to be computed much faster by using numerical optimization routines on the unnormalized quasi-posteriors.
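
The unnormalized log quasi-posterior is the tapered log-likelihood plus the log prior, so its mode can be found without computing the normalizing integral in (3.1). The following minimal sketch uses the truncated Cauchy prior with scale 10 described above. The quadratic `toy_log_lik` is a hypothetical stand-in for the tapered log-likelihood, and an exhaustive grid search replaces the numerical optimization routines used in the article purely to keep the example short.

```python
from math import log, pi


def log_half_cauchy(x, scale=10.0):
    """Log density of a Cauchy(0, scale) prior truncated to x >= 0."""
    if x < 0.0:
        return float("-inf")
    return log(2.0 / (pi * scale * (1.0 + (x / scale) ** 2)))


def log_quasi_posterior(theta, log_tapered_lik):
    """Unnormalized log quasi-posterior: tapered log-likelihood plus log prior."""
    sigma2, c = theta
    log_prior = log_half_cauchy(sigma2) + log_half_cauchy(c)
    if log_prior == float("-inf"):
        return log_prior
    return log_tapered_lik(theta) + log_prior


def grid_mode(log_tapered_lik, sigma2_grid, c_grid):
    """Crude mode search: evaluate every grid point and keep the best one."""
    candidates = [(s, c) for s in sigma2_grid for c in c_grid]
    return max(candidates, key=lambda th: log_quasi_posterior(th, log_tapered_lik))


def toy_log_lik(theta):
    """Hypothetical stand-in for the tapered log-likelihood (a quadratic bowl)."""
    sigma2, c = theta
    return -50.0 * (sigma2 - 1.0) ** 2 - 5000.0 * (c - 0.2) ** 2


sigma2_grid = [0.5 + 0.05 * i for i in range(21)]  # 0.50, 0.55, ..., 1.50
c_grid = [0.10 + 0.01 * i for i in range(21)]      # 0.10, 0.11, ..., 0.30
mode = grid_mode(toy_log_lik, sigma2_grid, c_grid)
```

Because the prior is nearly flat over this grid, the mode in this toy case sits at the maximizer of the stand-in likelihood.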

Empirical density estimates of the resulting estimates are almost indistinguishable from those in Section 2.2, and can be found in Section S-4 of the supplementary materials. Figures S-1, S-2, S-3, and S-4 show the same trend as in Section 2.2 of collecting mass at the true parameter as the sample size increases, while, in the case of $\sigma^2$ for both models and $c$ for the exponential model, not dissipating much as the taper range decreases.

Tables S-3 and S-4 in the supplementary materials tell a similar story: for the models considered here, decreases in statistical efficiency due to tapering are acceptable when weighed against the increases in computational efficiency, and the asymptotic variances based on the sandwich matrix provide good estimates of sampling variability, even at modest sample sizes.

3.2.1 Comparing Confidence Intervals. An additional simulation was conducted to test the accuracy of confidence intervals constructed from MCMC samples, as described in Section 3.1.3. About 1000 datasets were constructed on a perturbed grid, with n = 1600, from the exponential and Matérn $\nu = 3/2$ covariance models. In both cases, the range parameter $c$ was chosen to give an effective range of 4.75 for the associated process. We reduced the effective range in this simulation because the previous effective range of 15 corresponded to $c$ parameters that lie very close to the boundary, making MCMC unreliable enough that

Table 3. Coverage rates based on draws from the quasi-posterior for nominal 95% credible intervals for the exponential model, $c = \sigma^2\alpha = 1.58$, and the Matérn $\nu = 3/2$ model, $c = \sigma^2\alpha^3 = 1$ (both with effective range ≈ 4.75)

Model                  Type          $\sigma^2$   $c$
Exponential            Naive         0.697        1.000
Exponential            Asymptotic    0.946        0.947
Exponential            MCMC          0.950        0.956
Matérn $\nu = 3/2$     Naive         0.623        1.000
Matérn $\nu = 3/2$     Asymptotic    0.951        0.948
Matérn $\nu = 3/2$     MCMC          0.954        0.952

it would have necessitated human inspection to ensure convergence, a requirement that is infeasible for such a large simulation.

For each dataset, 95% confidence intervals were constructed in three ways: as the 0.025 and 0.975 quantiles of the MCMC sample, using Theorem 5 from Section 3.1.3 with $Q$ estimated by plugging $\hat{\theta}_{QB}$ into (S-9), and using Theorem 5 with $Q$ estimated as the inverse of the sample covariance of the MCMC sample. We refer to these three methods, respectively, as the naive, asymptotic, and MCMC estimators.

From Table 3, we see that the naive intervals substantially undercover $\sigma^2$ while catastrophically overcovering $c$ for both covariance models. The asymptotic and MCMC intervals both do extremely well, covering at almost exactly their nominal 95% rate. The MCMC method is easier to compute, which recommends it slightly over the asymptotic method.
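
A rough scalar calculation suggests why naive quantile intervals can miss in either direction. Section 3.1.1 gives the quasi-posterior an asymptotic variance of $Q_n^{-1}$, whereas the estimator's sampling variance is the sandwich quantity, assumed here to take the form $Q_n^{-1} P_n Q_n^{-1}$. The sketch below treats a single parameter with scalar $P$ and $Q$, which is a simplifying assumption: the simulation in the article has two parameters, so the numbers below are not meant to reproduce Table 3. The function name `naive_coverage` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def naive_coverage(p, q, level=0.95):
    """Approximate coverage of a quasi-posterior quantile interval, scalar case.

    Assumes the quasi-posterior is roughly N(estimate, 1/q) while the estimator's
    sampling variance is p / q**2, so the interval half-width is z / sqrt(q).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    width_ratio = sqrt(q / p)  # quasi-posterior sd divided by sampling sd
    return 2.0 * NormalDist().cdf(z * width_ratio) - 1.0


coverage_when_equal = naive_coverage(p=1.0, q=1.0)  # nominal when Q = P
coverage_too_narrow = naive_coverage(p=4.0, q=1.0)  # quasi-posterior too tight
coverage_too_wide = naive_coverage(p=1.0, q=4.0)    # quasi-posterior too diffuse
```

Under these assumptions the naive interval attains its nominal level only when $Q = P$; it undercovers when $Q < P$ and overcovers when $Q > P$.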

4. DISCUSSION

Covariance tapering provides a way to estimate parameters of stationary Gaussian random fields from very large datasets. The method provides huge gains in computational efficiency by paying a relatively small price in statistical efficiency, and extends the tractable size of problems many fold relative to standard likelihood estimation tools. We have emphasized computational efficiencies induced by using sparse matrix computations, but the sparse representation also allows for large improvements in storage efficiency, enabling one to work with datasets for which dense covariance matrices cannot even be stored in memory.

In our simulation experiments, we use a taper function (1.2) from Wendland (1995), which was chosen somewhat arbitrarily. The form of the taper function does not play a meaningful role in the asymptotic theory developed here, so little guidance on choosing between available compactly supported correlation functions is offered. The Wendland (1995, 1998) family of functions seems to be a safe choice, as they are fast to compute, and are supported by the infill asymptotic theory of Kaufman, Schervish, and Nychka (2008).
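
The storage argument can be illustrated by counting how many entries of a tapered covariance matrix are nonzero. The sketch below assumes a Wendland-type taper of the form $(1 - h/L)_+^4 (1 + 4h/L)$ and an exponential covariance $\sigma^2 \exp(-\alpha h)$; the article's equation (1.2) and covariance parameterization are defined outside this section, so both forms are assumptions. It also uses a regular grid rather than the perturbed grid of the simulations, and builds a dense matrix only for clarity.

```python
from math import exp, hypot


def wendland_taper(h, taper_range):
    """Assumed Wendland-type taper: (1 - h/L)_+^4 * (1 + 4h/L); zero for h >= L."""
    ratio = h / taper_range
    if ratio >= 1.0:
        return 0.0
    return (1.0 - ratio) ** 4 * (1.0 + 4.0 * ratio)


def exponential_cov(h, sigma2=1.0, alpha=0.2):
    """Assumed exponential covariance sigma2 * exp(-alpha * h)."""
    return sigma2 * exp(-alpha * h)


def tapered_cov_matrix(points, taper_range):
    """Elementwise product of the covariance and taper matrices."""
    matrix = []
    for (x1, y1) in points:
        row = []
        for (x2, y2) in points:
            h = hypot(x1 - x2, y1 - y2)
            row.append(exponential_cov(h) * wendland_taper(h, taper_range))
        matrix.append(row)
    return matrix


def nonzero_fraction(matrix):
    """Share of matrix entries that are not exactly zero."""
    total = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for value in row if value != 0.0)
    return nonzero / total


# Regular 10 x 10 unit grid (a simplification of the perturbed grid in the article).
points = [(float(i), float(j)) for i in range(10) for j in range(10)]
fractions = {L: nonzero_fraction(tapered_cov_matrix(points, L)) for L in (3, 5, 15)}
```

On this small grid the taper range L = 15 exceeds the largest pairwise distance, so no entries are zeroed, while L = 3 and L = 5 leave progressively fewer nonzero entries. A practical implementation would store only the nonzero entries in a sparse matrix format.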

We show that the asymptotic theory for tapering provides a way to construct sensible confidence intervals. Furthermore, we provide a formal framework for incorporating tapered likelihoods into an approximate Bayesian inferential and computational engine. This quasi-Bayesian theory elucidates the pitfalls, first noted by Kaufman (2006), of treating the tapered

likelihood as though it were a likelihood. Indeed, the quasi-posterior only approximates the true posterior distribution well when the expected Hessian $Q \approx P$, the Fisher information matrix. Furthermore, from a frequentist standpoint, we would not want the quasi-posterior to approximate the true posterior when $Q$ is very different from $P$ (as occurs under aggressive tapering), for if it did, the quasi-posterior would not accurately reflect the uncertainty of any point estimator $\hat{\theta}_{QB}$.

The theoretical results presented here are only possible in the expanding-domain asymptotic regime, which others have argued is less appropriate than the infill regime (Zhang 2004; Zhang and Zimmerman 2005). In the context of parameter estimation, we have suggested the converse in Section 1, and demonstrated, through simulation, the accuracy of calculations based on expanding-domain asymptotics in a scenario where the domain is not much larger than the range of the process. In practice, we suspect that it would not be difficult to find datasets for which the characteristics of parameter estimation are better approximated by one or the other type of asymptotics.

SUPPLEMENTARY MATERIALS

R code: The code needed to replicate the simulations in this article is provided in the supplementary files. Please read the file README contained in the .gz file for details. (taper supp.tar.gz gzip archive)

Appendix: Proofs, detailed derivations, and a data analysis example are given in the Appendix. (taper append.pdf)

ACKNOWLEDGMENTS

This work was supported by NSF grants ITS 0612031 and DMS-0805975, and NIH grant R37 CA057030. The authors are grateful for the helpful comments of three anonymous reviewers.

[Received November 2010. Revised March 2011.]

REFERENCES

Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008), “Gaussian Predictive Process Models for Large Spatial Data Sets,” Journal of the Royal Statistical Society, Series B, 70, 825–848. [434]

Berger, J. O. (2000), “Bayesian Analysis: A Look at Today and Thoughts of Tomorrow,” Journal of the American Statistical Association, 95, 1269–1276. [434]

Bhapkar, V. P. (1972), “On a Measure of Efficiency of an Estimating Equation,” Sankhya, Series A, 34, 467–472. [436,439]

Chernozhukov, V., and Hong, H. (2003), “An MCMC Approach to Classical Estimation,” Journal of Econometrics, 115, 293–346. [446,446,447]

Cressie, N. A. C. (1991), Statistics for Spatial Data, Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics, New York: Wiley. [434]

Cressie, N., and Johannesson, G. (2008), “Fixed Rank Kriging for Very Large Spatial Data Sets,” Journal of the Royal Statistical Society, Series B, 70, 209–226. [434]

Curriero, F. C., and Lele, S. (1999), “A Composite Likelihood Approach to Semivariogram Estimation,” Journal of Agricultural, Biological, and Environmental Statistics, 4, 9–28. [434]

Du, J., Zhang, H., and Mandrekar, V. (2009), “Fixed-Domain Asymptotic Properties of Tapered Maximum Likelihood Estimators,” The Annals of Statistics, 37, 3330–3361. [433,434,436]

Durbin, J. (1960), “Estimation of Parameters in Time-Series Regression Models,” Journal of the Royal Statistical Society, Series B, 22, 139–153. [436,439]

Ferreira, P. E. (1982), “Multiparametric Estimating Equations,” Annals of the Institute of Statistical Mathematics, 34, 423–431. [436,439]

Finley, A., Sang, H., Banerjee, S., and Gelfand, A. (2009), “Improving the Performance of Predictive Process Modeling for Large Datasets,” Computational Statistics & Data Analysis, 53, 2873–2884. [434]

Fuentes, M. (2007), “Approximate Likelihood for Large Irregularly Spaced Spatial Data,” Journal of the American Statistical Association, 102, 321–331. [434]

Furrer, R., and Bengtsson, T. (2007), “Estimation of High-Dimensional Prior and Posterior Covariance Matrices in Kalman Filter Variants,” Journal of Multivariate Analysis, 98, 227–255. [433]

Furrer, R., Genton, M. G., and Nychka, D. (2006), “Covariance Tapering for Interpolation of Large Spatial Datasets,” Journal of Computational and Graphical Statistics, 15, 502–523. [433,441]

Furrer, R., and Sain, S. (2009), “Spatial Model Fitting for Large Datasets With Applications to Climate and Microarray Problems,” Statistics and Computing, 19, 113–128. [433]

Godambe, V. P., and Heyde, C. C. (1987), “Quasi-Likelihood and Optimal Estimation,” International Statistical Review, 55, 231–244. [436,439]

Heagerty, P. J., and Lele, S. R. (1998), “A Composite Likelihood Approach to Binary Spatial Data,” Journal of the American Statistical Association, 93, 1099–1111. [434]

Heyde, C. C. (1997), Quasi-Likelihood and Its Application: A General Approach to Optimal Parameter Estimation, Springer Series in Statistics, New York: Springer-Verlag. [436,439]

Horn, R. A., and Johnson, C. R. (1991), Topics in Matrix Analysis, Cambridge: Cambridge University Press. [436,437]

Kauermann, G., and Carroll, R. J. (2001), “A Note on the Efficiency of Sandwich Covariance Matrix Estimation,” Journal of the American Statistical Association, 96, 1387–1396. [436]

Kaufman, C. (2006), “Covariance Tapering for Likelihood-Based Estimation in Large Spatial Datasets,” Ph.D. dissertation, Carnegie Mellon University, Pittsburgh, PA. [433,434,436,436,441,446,449]

Kaufman, C., Schervish, M., and Nychka, D. (2008), “Covariance Tapering for Likelihood-Based Estimation in Large Spatial Datasets,” Journal of the American Statistical Association, 103, 1545–1569. [433,434,436,436,441,443,449]

Mardia, K. V., and Marshall, R. J. (1984), “Maximum Likelihood Estimation of Models for Residual Covariance in Spatial Regression,” Biometrika, 71, 135–146. [435,436]

Matsuda, Y., and Yajima, Y. (2009), “Fourier Analysis of Irregularly Spaced Data on $\mathbb{R}^d$,” Journal of the Royal Statistical Society, Series B, 71, 191–217. [434]

Morton, R. (1981), “Efficiency of Estimating Equations and the Use of Pivots,” Biometrika, 68, 227–233. [436,439]

Stein, M. L. (1999), Interpolation of Spatial Data: Some Theory for Kriging, Springer Series in Statistics, New York: Springer-Verlag. [434]

Stein, M. L., Chi, Z., and Welty, L. J. (2004), “Approximating Likelihoods for Large Spatial Data Sets,” Journal of the Royal Statistical Society, Series B, 66, 275–296. [434,441]

Vecchia, A. V. (1988), “Estimation and Model Identification for Continuous Spatial Processes,” Journal of the Royal Statistical Society, Series B, 50, 297–312. [434]

Wendland, H. (1995), “Piecewise Polynomial, Positive Definite and Compactly Supported Radial Functions of Minimal Degree,” Advances in Computational Mathematics, 4, 389–396. [435,441,448,449]

Wendland, H. (1998), “Error Estimates for Interpolation by Compactly Supported Radial Basis Functions of Minimal Degree,” Journal of Approximation Theory, 93, 258–272. [435,449]

Whittle, P. (1954), “On Stationary Processes in the Plane,” Biometrika, 41, 434–449. [434]

Zhang, H. (2004), “Inconsistent Estimation and Asymptotically Equal Interpolations in Model-Based Geostatistics,” Journal of the American Statistical Association, 99, 250–261. [435,436,441,450]

Zhang, H., and Du, J. (2008), “Covariance Tapering in Spatial Statistics,” in Positive Definite Functions: From Schoenberg to Space-Time Challenges, eds. J. Mateu and E. Porcu, Spain: Graficas Castañ, s.l. [433,434,436]

Zhang, H., and Zimmerman, D. L. (2005), “Towards Reconciling Two Asymptotic Frameworks in Spatial Statistics,” Biometrika, 92, 921–936. [435,450]

Zimmerman, D. L. (1989), “Computationally Exploitable Structure of Covariance Matrices and Generalized Covariance Matrices in Spatial Models,” Journal of Statistical Computation and Simulation, 32, 1–15. [434]
