Bivariate spatial process modeling for constructing indicator or intensity weighted spatial CDFs

Margaret SHORT, Bradley P. CARLIN, and Alan E. GELFAND

A spatial cumulative distribution function (SCDF) gives the proportion of a spatial domain $D$ having the value of some response variable less than a particular level $w$. This article provides a fully hierarchical approach to SCDF modeling, using a Bayesian framework implemented via Markov chain Monte Carlo (MCMC) methods. The approach generalizes the customary SCDF to accommodate density or indicator weighting. Bivariate spatial processes emerge as a natural approach for framing such a generalization. Indicator weighting leads to conditional SCDFs, useful in studying, for example, adjusted exposure to one pollutant given a specified level of exposure to another. Intensity weighted (or population density weighted) SCDFs are particularly natural in assessments of environmental justice, where it is important to determine if a particular sociodemographic group is being excessively exposed to harmful levels of certain pollutants. MCMC methods (combined with a convenient Kronecker structure) enable straightforward estimates or approximate estimates of bivariate, conditional, and weighted SCDFs. We illustrate our methods with two air pollution datasets, one recording both nitric oxide (NO) and nitrogen dioxide (NO2) ambient levels at 67 monitoring sites in central and southern California, and the other concerning ozone exposure and race in Atlanta, GA.

Key Words: Bayesian methods; Change of support problem; Environmental justice; Geographic Information System (GIS); Kriging; Markov chain Monte Carlo methods.

1. INTRODUCTION

1.1 SPATIAL CDFS

Suppose that $X(s)$ is the log-ozone concentration at location $s$ within some study region. Thinking of $X(s)$ as a spatial process for all $s$ in some spatial domain $D$, we might wish to find the proportion of area in $D$ that has ozone concentration below some level $w$ (say, a level above which exposure is considered to be unhealthful). This proportion is a random variable

$$F(w) = \Pr[s \in D : X(s) \le w] = \frac{1}{|D|} \int_D Z_w(s)\, ds, \qquad (1.1)$$

where $|D|$ is the area of $D$, and $Z_w(s) = 1$ if $X(s) \le w$, and 0 otherwise. Thinking of (1.1) as a random function of $w$, we see that it increases from 0 to 1, and is right-continuous. Thus, while $F(w)$ is not the cumulative distribution function (CDF) of $X$ at $s$ (which would be given by $\Pr[X(s) \le x]$, and is not random), it does have all the properties of a CDF, and so is referred to as the spatial cumulative distribution function, or SCDF. For a constant mean stationary process, all $X(s)$ have the same marginal distribution, whence $E[F(w)] = \Pr(X(s) \le w)$. Overton (1989) introduced the idea of an SCDF, and used it to analyze data from the National Surface Water Surveys. Lahiri, Kaiser, Cressie, and Hsu (1999) developed a subsampling method that provides (among other things) large-sample prediction bands for the SCDF.

Margaret Short is Postdoctoral Researcher, Statistical Sciences Group, D-1, Los Alamos National Laboratory, P.O. Box 1663, MS F600, Los Alamos, NM 87545 (E-mail: [email protected]). Bradley P. Carlin is Mayo Professor of Public Health, Division of Biostatistics, School of Public Health at the University of Minnesota, Minneapolis, MN 55455 (E-mail: [email protected]). Alan E. Gelfand is J. B. Duke Professor of Statistics and Decision Sciences, Institute of Statistics and Decision Sciences at Duke University, Durham, NC 27708 (E-mail: [email protected]). © 2005 American Statistical Association and the International Biometric Society. Journal of Agricultural, Biological, and Environmental Statistics, Volume 10, Number 3, Pages 259-275. DOI: 10.1198/108571105X58568

The empirical SCDF based upon data $\mathbf{X}_s = (X(s_1), \ldots, X(s_I))$ at $w$ is the proportion of the $X(s_i)$ which take values less than or equal to $w$. Large sample investigation of the behavior of the empirical SCDF requires care to define the appropriate asymptotics; see Lahiri et al. (1999) and Zhu, Lahiri, and Cressie (2002) for details. In a Bayesian framework, if $X(s)$ is assumed to be a Gaussian process, the joint distribution of $\mathbf{X}_s$ is multivariate normal, and the predictive distribution of $F(w)$ given $\mathbf{X}_s$ can be obtained.

Though (1.1) can be studied analytically, it is difficult to work with in practice. However, approximation of (1.1) via Monte Carlo integration is natural, that is, replacing $F(w)$ by

$$\hat{F}(w) = \frac{1}{L} \sum_{\ell=1}^{L} Z_w(\tilde{s}_\ell), \qquad (1.2)$$

where the $\tilde{s}_\ell$ are chosen randomly in $D$, and $Z_w(\tilde{s}_\ell) = 1$ if $X(\tilde{s}_\ell) \le w$, and 0 otherwise. Suppose we seek a realization of $\hat{F}(w)$ from the predictive distribution of $\hat{F}(w)$ given $\mathbf{X}_s$. Gelfand, Zhu, and Carlin (2001) showed how to sample from the predictive distribution $p(\mathbf{X}_{\tilde{s}} \mid \mathbf{X}_s)$ for $\mathbf{X}_{\tilde{s}}$ arising from new locations $\tilde{s} = (\tilde{s}_1, \ldots, \tilde{s}_L)$.

The predictive distribution of (1.1), approximated as the predictive distribution of (1.2), can be sampled at a given $w$ by obtaining samples $X^{(g)}(\tilde{s}_\ell)$, $g = 1, \ldots, G$, using such kriging, hence $Z^{(g)}_w(\tilde{s}_\ell)$, and then calculating $\hat{F}^{(g)}(w)$ using (1.2). The $\hat{F}^{(g)}(w)$ are approximately samples from $p(F(w) \mid \mathbf{X}_s)$. Interest is in the entire function $F(w)$, which is most easily obtained, up to an interpolation, using a grid of $w$ values $\{w_1 < \cdots < w_k < \cdots < w_K\}$, whence the $\{X^{(g)}(\tilde{s}_\ell)\}$ samples give a realization at grid point $w_k$,

$$\hat{F}^{(g)}(w_k) = \frac{1}{L} \sum_{\ell=1}^{L} Z^{(g)}_{w_k}(\tilde{s}_\ell), \qquad (1.3)$$

where now $Z^{(g)}_{w_k}(\tilde{s}_\ell) = 1$ if $X^{(g)}(\tilde{s}_\ell) \le w_k$, and 0 otherwise. Handcock (1999) describes a similar Monte Carlo Bayesian approach to estimating SCDFs in his discussion of Lahiri et al. (1999).


Expression (1.3) suggests placing all of our $\hat{F}^{(g)}(w_k)$ values in a $K \times G$ matrix for easy summarization. For example, a histogram of all the $\hat{F}^{(g)}(w_k)$ in a particular row (i.e., for a given grid point $w_k$) provides an estimate of the predictive distribution of $F(w_k)$. On the other hand, each column (i.e., for a given Gibbs draw $g$) provides (again up to, say, linear interpolation) an approximate draw from the predictive distribution of the SCDF. Hence, averaging these columns provides, with interpolation, essentially the posterior predictive mean for $F$ and can be taken as an estimated SCDF. But also, each draw from the predictive distribution of the SCDF can be inverted to obtain a realization for any quantile of interest (e.g., the median exposure $d_{.50}$). A histogram of these inverted values in turn provides an estimate of the posterior distribution of this quantile (in this case, $p(d_{.50} \mid \mathbf{X}_s)$).
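The bookkeeping just described can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names `scdf_matrix` and `scdf_quantile` are hypothetical, the kriged predictive draws are replaced by standard normal placeholders, and each draw is inverted with a grid-based generalized inverse rather than linear interpolation.

```python
import numpy as np

def scdf_matrix(draws, w_grid):
    """Return the K x G matrix whose (k, g) entry is F-hat^(g)(w_k) as in (1.3).

    draws  : array (G, L); row g holds X^(g) at the L new locations
    w_grid : increasing array of K thresholds
    """
    draws = np.asarray(draws, dtype=float)
    w_grid = np.asarray(w_grid, dtype=float)
    # Indicators Z^(g)_{w_k}(s_l) with shape (K, G, L), averaged over the L locations.
    indicators = draws[None, :, :] <= w_grid[:, None, None]
    return indicators.mean(axis=2)

def scdf_quantile(F_col, w_grid, p=0.5):
    """Smallest grid value w_k with F(w_k) >= p for one SCDF draw (one column)."""
    k = int(np.searchsorted(F_col, p, side="left"))
    return float(w_grid[min(k, len(w_grid) - 1)])

# Placeholder predictive draws (assumption: stand-ins for kriged samples).
rng = np.random.default_rng(0)
G, L = 200, 500
draws = rng.normal(size=(G, L))
w_grid = np.linspace(-3.0, 3.0, 61)

F = scdf_matrix(draws, w_grid)      # K x G matrix
F_mean = F.mean(axis=1)             # average of the columns: estimated SCDF
medians = np.array([scdf_quantile(F[:, g], w_grid) for g in range(G)])
```

A histogram of any row of `F` then approximates the predictive distribution of $F(w_k)$, while a histogram of `medians` approximates the posterior distribution of the median exposure.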

In many applications, the definition in (1.1) will not be completely satisfying. Consider, say, a pollution surface, recalling that $F(w)$ is the proportion of $D$ for which the surface lies below $w$. To interpret this proportion in terms of exposure, we would be assuming that people are uniformly distributed beneath the surface, an assumption that is rarely appropriate. This concern, as well as broader objectives, leads us to extend (1.1) to a more general version that enables indicator or density weighting, modeled through a bivariate spatial process specification.

1.2 MOTIVATING DATASETS

Our interest in this methodology is motivated by two environmental datasets. The first considers the spatial data setting of Figure 1, recently presented and analyzed by Gelfand, Schmidt, Banerjee, and Sirmans (2004). These are the locations of several air pollutant monitoring sites in central and southern California, all of which measure ozone, carbon monoxide, nitric oxide (NO), and nitrogen dioxide (NO2). For a given day, suppose we wish to compute an SCDF for the log of the daily median NO exposure adjusted for the log of the daily median NO2 level (since the health effects of exposure to high levels of one pollutant may be exacerbated by further exposure to high levels of the other). Here the data are all point level, but now the problem is one of bivariate kriging (or co-kriging) for both NO and NO2 in a computationally demanding setting (say, to the $L = 500$ randomly selected points shown as small dots in Figure 1). We must also resolve the misalignment in the data if NO or NO2 values are missing at some of the source sites.

Our second dataset considers the case of an air pollution variable measured at points, with a demographic covariate measured only as a sum or average over blocks (in this case, U.S. postal zip codes). Tolbert et al. (2000) reported on a collection of ambient ozone levels in the Atlanta, GA, metropolitan area. In this dataset, ozone measurements $X_{itr}$ are available at between 8 and 10 fixed monitoring sites $i$ for day $t$ of year $r$, where $t = 1, \ldots, 92$ (the summer days from June 1 through August 31) and $r = 1, 2, 3$, corresponding to years 1993, 1994, and 1995. Figure 2 shows the one-hour daily peak ozone measurements (in parts per million) at the 10 monitoring sites for a particular day (July 15, 1995). Also shown are the boundaries of 173 zip codes in the Atlanta metropolitan area, with the 36 zips falling within the city of Atlanta encircled by the darker boundary on the map. A weighted SCDF that uses the racial makeups of these city zips as the weights could uncover inequities in ozone exposure. This requires generalizing our SCDF simulation in (1.3) to accommodate covariate weighting in the presence of misalignment between the response variable and the covariate.

Figure 1. Locations of NO and NO2 monitoring sites, California air quality data; 500 randomly selected target locations are also shown as dots.

Section 2 presents the basic theory underlying our indicator and intensity weighted SCDFs using the theory of bivariate process models. Section 3 then moves on to a more complex model that can allow for weighting by covariates arising in spatially misaligned settings like that of Figure 2. Section 4 illustrates our methods with the two air pollution datasets. Finally, Section 5 summarizes our findings and offers directions for future research in this area.

Figure 2. Zip code boundaries in the Atlanta metropolitan area and one-hour peak ozone levels (ppm) at the 10 monitoring sites for July 15, 1995.

2. BASIC THEORY

Suppose that $\{\mathbf{Y}(s) = (U(s), V(s)),\ s \in D\}$ is a realization from a bivariate spatial process over $D$. In what follows we work with a bivariate Gaussian process with constant mean, that is, $E(U(s)) \equiv \mu_U$ and $E(V(s)) \equiv \mu_V$. We denote the cross-covariance function for $\mathbf{Y}(s)$ by $C(s, s') = \Sigma_{\mathbf{Y}(s), \mathbf{Y}(s')}$, that is,

$$C(s, s') = \begin{pmatrix} \mathrm{cov}(U(s), U(s')) & \mathrm{cov}(U(s), V(s')) \\ \mathrm{cov}(V(s), U(s')) & \mathrm{cov}(V(s), V(s')) \end{pmatrix}.$$

Valid choices of $C$ are not routine to write down, since they require that for any $n$ and any set of locations $\{s_1, \ldots, s_n\}$, the resulting $2n \times 2n$ covariance matrix $\Sigma_{\mathbf{Y}}$ for $\mathbf{Y} = (\mathbf{Y}(s_1), \ldots, \mathbf{Y}(s_n))$ is positive definite. A simple choice for $C$ is a separable form, that is,

$$C(s, s') = \rho(s, s'; \phi) \cdot T, \qquad (2.1)$$

where $T$ is positive definite and $\rho$ is a valid correlation function. In practice, $\rho$ is often taken to be isotropic (i.e., a function of only $d(s, s') = \|s - s'\|$, the Euclidean distance between $s$ and $s'$). Richer bivariate specifications than (2.1) can be built constructively using the linear model of coregionalization (Wackernagel 2003). Gelfand et al. (2004) provided general discussion, extension, and linkage to other approaches, along with elaboration of computational issues within an MCMC-Bayes framework. In the remainder of this article, we confine ourselves to separable specifications.


Generally, we consider stochastic integrals of the form $\frac{1}{|D|} \int_D q(U(s), V(s))\, ds$. If $D$ is a compact set, such integrals will exist under weak conditions on $q$ (see, e.g., Yaglom 1987, chap. 4). For example, the bivariate structure immediately allows for the definition of a bivariate SCDF. That is, by setting $q(U(s), V(s)) = I(U(s) \le w_u, V(s) \le w_v)$ we can define

$$F_{U,V}(w_u, w_v) = \frac{1}{|D|} \int_D I(U(s) \le w_u, V(s) \le w_v)\, ds, \qquad (2.2)$$

which gives $\Pr[s \in D : U(s) \le w_u, V(s) \le w_v]$, the proportion of the region having values below the given thresholds for, say, two pollutants. Next, a formal definition of a conditional SCDF might be

$$F_{U|V}(w_u \mid w_v) = \frac{\int_D I(U(s) \le w_u, V(s) \le w_v)\, ds}{\int_D I(V(s) \le w_v)\, ds} = \frac{\int_D I(U(s) \le w_u)\, I(V(s) \le w_v)\, ds}{\int_D I(V(s) \le w_v)\, ds}. \qquad (2.3)$$

This expression gives $\Pr[s \in D : U(s) \le w_u \mid V(s) \le w_v]$, the proportion of the region having second pollutant values below the threshold $w_v$ that also has first pollutant values below the threshold $w_u$. Note that we could easily alter (2.3) by changing the directions of either or both of its inequalities, if conditional statements involving high (instead of low) levels of either pollutant were of interest. Note also that (2.3) is of the form $\int_D r(s) I(U(s) \le w_u)\, ds \big/ \int_D r(s)\, ds$, where $r(s)$ is an indicator process. Hence we might refer to (2.3) as an indicator weighted SCDF for $U(s)$ over $D$.

It is evident that more general SCDFs can be created for $U(s)$ using more general forms $r(s) = h(V(s))$, yielding

$$F_r(w) = \frac{\int_D r(s) I(U(s) \le w)\, ds}{\int_D r(s)\, ds}. \qquad (2.4)$$

Suppose for instance that $r(s) = \exp(V(s))$ is interpreted as an intensity surface. That is, $\int_B r(s)\, ds$ provides, say, the expected number of individuals in areal unit $B$. Then (2.4) provides an intensity weighted SCDF for $U(s)$ over $D$ that we might also refer to as a population density weighted SCDF. Hence, $F_r(w)$ adjusts for the nonuniformity of population density, a potential criticism of (1.1) mentioned in Section 1.
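As a rough illustration of (2.3) and (2.4), the sketch below replaces the integrals over $D$ by sums over randomly placed locations and compares unweighted, intensity weighted, and indicator weighted curves. Everything here is assumed for illustration: the helper name `weighted_scdf` is hypothetical, and the pair $(U, V)$ is simulated as independent correlated normal draws (correlation .74) rather than as a kriged spatial surface.

```python
import numpy as np

def weighted_scdf(u, weights, w_grid):
    """Monte Carlo analogue of (2.4): sum_l r_l I(u_l <= w) / sum_l r_l."""
    u = np.asarray(u, dtype=float)
    r = np.asarray(weights, dtype=float)
    total = r.sum()
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    ind = u[None, :] <= np.asarray(w_grid, dtype=float)[:, None]   # (K, L)
    return (ind * r[None, :]).sum(axis=1) / total

# Simulated stand-ins for U and V at L locations (assumption, not real data).
rng = np.random.default_rng(1)
L = 2000
v = rng.normal(size=L)
u = 0.74 * v + np.sqrt(1.0 - 0.74**2) * rng.normal(size=L)
w_grid = np.linspace(-3.0, 3.0, 121)

F_plain = weighted_scdf(u, np.ones(L), w_grid)                      # r(s) = 1
F_intensity = weighted_scdf(u, np.exp(v), w_grid)                   # r(s) = exp(V(s))
F_indicator = weighted_scdf(u, (v <= -0.5).astype(float), w_grid)   # r(s) = I(V(s) <= w_v)
```

With positively associated $U$ and $V$, the intensity weighted curve sits to the right of the unweighted one, and the indicator weighted curve (conditioning on low $V$) sits to the left.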

If $\mathbf{Y} = (\mathbf{Y}(s_1), \ldots, \mathbf{Y}(s_n))$ is observed, inference for (2.4) is straightforward. In particular, if we collect the $\rho(s_i, s_{i'}; \phi)$ into an $n \times n$ matrix $H(\phi)$, under (2.1) we obtain $\Sigma_{\mathbf{Y}} = H(\phi) \otimes T$, where $\otimes$ denotes the Kronecker product. Banerjee and Gelfand (2002) clarified the computational benefit of this model structure: with an inverse Wishart prior for $T$ independent of the prior on $\phi$, the full conditional for $T$ also emerges as an inverse Wishart. Furthermore, the likelihood involves a $2 \times 2$ and an $n \times n$ matrix, rather than a $2n \times 2n$ matrix.
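The computational point can be checked numerically. The sketch below evaluates the Gaussian log-likelihood under $\Sigma_{\mathbf{Y}} = H(\phi) \otimes T$ twice: once using only the $n \times n$ and $2 \times 2$ factors, via the standard identities $|H \otimes T| = |H|^2 |T|^n$ and $(H \otimes T)^{-1} = H^{-1} \otimes T^{-1}$, and once with the dense $2n \times 2n$ matrix. Function names, site coordinates, and parameter values are illustrative assumptions; an exponential correlation is used for $\rho$.

```python
import numpy as np

def exp_corr(coords, phi):
    """Isotropic exponential correlation matrix H(phi) for the given sites."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    return np.exp(-phi * d)

def loglik_kron(Y, mu, H, T):
    """Log-likelihood using only the n x n and p x p factors (p = 2 here).

    Y : (n, p) array whose i-th row is (U(s_i), V(s_i)); mu : length-p mean.
    """
    n, p = Y.shape
    R = Y - mu
    _, logdet_H = np.linalg.slogdet(H)
    _, logdet_T = np.linalg.slogdet(T)
    logdet = p * logdet_H + n * logdet_T            # log|H kron T|
    # r'(H^-1 kron T^-1) r = trace(T^-1 R' H^-1 R) for site-major stacking r.
    quad = np.trace(np.linalg.solve(T, R.T @ np.linalg.solve(H, R)))
    return -0.5 * (n * p * np.log(2.0 * np.pi) + logdet + quad)

def loglik_dense(Y, mu, H, T):
    """Same quantity computed with the full 2n x 2n covariance, for comparison."""
    n, p = Y.shape
    r = (Y - mu).reshape(-1)                        # (U(s1), V(s1), U(s2), ...)
    S = np.kron(H, T)
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (n * p * np.log(2.0 * np.pi) + logdet + r @ np.linalg.solve(S, r))

rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 10.0, size=(30, 2))       # hypothetical site locations
H = exp_corr(coords, phi=1.5)
T = np.array([[1.0, 0.6], [0.6, 2.0]])
mu = np.array([0.0, 1.0])
Y = mu + rng.standard_normal((30, 2))               # placeholder data

ll_kron = loglik_kron(Y, mu, H, T)
ll_dense = loglik_dense(Y, mu, H, T)
```

The two values agree to numerical precision, but the first never forms or factors the $2n \times 2n$ matrix.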

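We remark that missing observations in one or both of the $\mathbf{Y}$ components can be handled in our framework. For instance, in the case of some locations where $U(s_i)$ is observed but $V(s_i)$ is missing, we could write $\mathbf{V} = (\mathbf{V}_{obs}, \mathbf{V}_{mis})$, and rearrange the response vector as $\mathbf{Y} = (\mathbf{U}, \mathbf{V})$, so that the likelihood becomes $\mathbf{Y} \mid \boldsymbol{\mu}, \phi, T \sim N_{2n}(\boldsymbol{\mu} \otimes \mathbf{1}_n,\ T \otimes H(\phi))$, where $\boldsymbol{\mu} = (\mu_U, \mu_V)$ and $\mathbf{1}_n$ is the $n$-vector of ones. Grouping the components of $\mathbf{Y}$ as $\mathbf{Y}_{obs} = (\mathbf{U}, \mathbf{V}_{obs})$ and $\mathbf{Y}_{mis} = \mathbf{V}_{mis}$ readily produces the required (Gaussian) full conditional for $\mathbf{Y}_{mis}$ via standard formulas.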

Returning to the complete data case, analogous to the univariate situation in Section 1, we require an MCMC algorithm to draw posterior samples $(\boldsymbol{\mu}^{(g)}, \phi^{(g)}, T^{(g)})$ from $p(\boldsymbol{\mu}, \phi, T \mid \mathbf{Y})$, followed by predictive samples $\mathbf{Y}^{(g)}_{\tilde{s}}$ from $p(\mathbf{Y}_{\tilde{s}} \mid \mathbf{Y}, \boldsymbol{\mu}^{(g)}, \phi^{(g)}, T^{(g)})$ for a collection of locations $\tilde{s} = (\tilde{s}_1, \ldots, \tilde{s}_L)$. But this latter distribution is again routinely available from conditional normal formulas. Introducing a suitable MCMC algorithm to obtain the former samples, we can provide full inference for $F_r(w)$ in a manner analogous to that described in Section 1.
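The conditional normal step can be sketched as follows for the separable model with an exponential correlation function. Under separability the conditional mean is $\boldsymbol{\mu} + H_{no} H_{oo}^{-1}(\mathbf{Y} - \boldsymbol{\mu})$ and the conditional covariance is $(H_{nn} - H_{no} H_{oo}^{-1} H_{on}) \otimes T$, so only $n \times n$, $L \times L$, and $2 \times 2$ matrices are needed. The function names, the small diagonal jitter, and all numerical values are assumptions made for illustration; in an actual sampler, `mu`, `phi`, and `T` would be the $g$-th posterior draw.

```python
import numpy as np

def exp_corr(a, b, phi):
    """Exponential correlations between two sets of sites a (m x 2) and b (k x 2)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return np.exp(-phi * d)

def predictive_draw(Y, s_obs, s_new, mu, phi, T, rng):
    """One draw of (U, V) at s_new given Y (n x 2) at s_obs under C = rho * T."""
    H_oo = exp_corr(s_obs, s_obs, phi)
    H_no = exp_corr(s_new, s_obs, phi)
    H_nn = exp_corr(s_new, s_new, phi)
    W = np.linalg.solve(H_oo, H_no.T).T             # H_no H_oo^{-1}, shape (L, n)
    mean = mu + W @ (Y - mu)                        # conditional mean, shape (L, 2)
    H_cond = H_nn - W @ H_no.T                      # spatial part of conditional cov
    # Symmetrize and add a tiny jitter for numerical stability (an assumption).
    H_cond = 0.5 * (H_cond + H_cond.T) + 1e-10 * np.eye(len(s_new))
    A = np.linalg.cholesky(H_cond)
    B = np.linalg.cholesky(T)
    Z = rng.standard_normal(mean.shape)
    return mean + A @ Z @ B.T                       # rows have covariance H_cond kron T

rng = np.random.default_rng(2)
s_obs = rng.uniform(0.0, 10.0, size=(30, 2))        # hypothetical monitoring sites
# First new site sits almost on top of an observed site; the rest are random.
s_new = np.vstack([s_obs[0] + 1e-4, rng.uniform(0.0, 10.0, size=(49, 2))])
mu = np.array([0.0, 1.0])
phi = 1.0
T = np.array([[1.0, 0.6], [0.6, 2.0]])
Y = mu + rng.standard_normal((30, 2))               # placeholder observations

Y_new = predictive_draw(Y, s_obs, s_new, mu, phi, T, rng)
```

As a sanity check, the draw at the new site placed next to `s_obs[0]` nearly reproduces the observed pair there, as kriging without a nugget should.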

3. A MISALIGNED DATA MODEL

In practice, in the case where $r(s) = h(V(s))$ is interpreted as an intensity surface, $r(s)$ will typically not be observed at point level. Instead, what would likely be observed are counts $n_j \sim \text{Poisson}(r_j |B_j|)$, where $r_j = \frac{1}{|B_j|} \int_{B_j} r(s)\, ds$ with $\{B_j\}$ a partition of $D$. Unfortunately, inference for $F_r(w)$ under this model is infeasible, as we now clarify. Suppose we use a Monte Carlo integration to approximate $F_r(w)$, that is,

$$\hat{F}_r(w) = \frac{\sum_\ell h(V(\tilde{s}_\ell))\, I(U(\tilde{s}_\ell) \le w)}{\sum_\ell h(V(\tilde{s}_\ell))}. \qquad (3.1)$$

Recalling that $i$ indexes the points at which $U(s)$ is observed, to sample from the predictive distribution of (3.1) we require samples from $p(\{U(\tilde{s}_\ell), V(\tilde{s}_\ell)\} \mid \{U(s_i)\}, \{n_j\})$. This posterior arises from the full Bayesian model

$$\prod_j p(n_j \mid \{r(\tilde{s}_\ell)\})\; p(\{U(\tilde{s}_\ell), V(\tilde{s}_\ell)\} \mid \{U(s_i)\})\; p(\{U(s_i)\}). \qquad (3.2)$$

In (3.2), the first term is a product of Poissons where, in fact, $n_j$ is really conditional on $S_j = \{V(\tilde{s}_\ell) : \tilde{s}_\ell \in B_j\}$, since $r_j \approx \frac{1}{L_j} \sum_{S_j} h(V(\tilde{s}_\ell))$, where $L_j$ is the number of $\tilde{s}_\ell$ falling in $B_j$. The second and third terms are multivariate normals arising from the bivariate Gaussian process model for $(U(s), V(s))$. (In (3.2) we have suppressed the Gaussian process parameters and their associated priors.) Simulation now would sacrifice the convenient Kronecker structure. In addition, the full conditional distributions for the $V(\tilde{s}_\ell)$ are now no longer Gaussian. For a suitably large $L$, run times become prohibitive.

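As a result, we turn to an approximate solution. Returning to $F_r(w)$ in (2.4), suppose we write

$$F_r(w) = \frac{\sum_j \int_{B_j} r(s) I(U(s) \le w)\, ds}{\sum_j \int_{B_j} r(s)\, ds} = \frac{\sum_j r_j |B_j| \left( \int_{B_j} r(s) I(U(s) \le w)\, ds \Big/ \int_{B_j} r(s)\, ds \right)}{\sum_j r_j |B_j|}. \qquad (3.3)$$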


If the $B_j$ are fairly small so that $r(s)$ is roughly constant over $B_j$, then (3.3) implies

$$F_r(w) \approx \frac{\sum_j r_j |B_j| \left( \int_{B_j} I(U(s) \le w)\, ds \Big/ |B_j| \right)}{\sum_j r_j |B_j|}. \qquad (3.4)$$

This approximation becomes arbitrarily accurate as the partition $\{B_j\}$ becomes finer as long as, say, $r(s)$ is continuous. However, it does not help with the computing burden, since the $r_j$'s will still introduce $\{V(\tilde{s}_\ell)\}$ and the integrals will bring in the $\{U(\tilde{s}_\ell)\}$. Hence, taking a further approximation step, we replace $r_j$ with its method of moments estimate, $n_j / |B_j|$, and now approximate $F_r(w)$ by

$$\hat{F}^*_r(w) = \frac{\sum_j n_j \left[ \frac{1}{L_j} \sum_{S_j} I(U(\tilde{s}_\ell) \le w) \right]}{\sum_j n_j}, \qquad (3.5)$$

where the hat and the star reflect the two levels of approximation. Inference for (3.5) is quite manageable, since we are back to a univariate process and full conditional distributions for the $U(\tilde{s}_\ell)$ are Gaussian. In fact, (3.5) is a weighted SCDF, weighted by the observed counts over the $B_j$'s. Usual theory suggests that the predictive mean for $\hat{F}^*_r(w)$ will not be too far from the predictive mean for $F_r(w)$. However, the variability associated with $\hat{F}^*_r(w)$ is anticipated to be smaller than the variability associated with $F_r(w)$, since the former does not incorporate the uncertainty in the $n_j$. In this sense, (3.5) is reminiscent of an empirical Bayes estimate.
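A small sketch of (3.5) may make the weighting concrete. For one predictive draw of $U$ at the new locations, it computes the within-block proportion of locations at or below each threshold and then averages those proportions with the block counts $n_j$ as weights. The function name `count_weighted_scdf`, the block labels, and the toy numbers are hypothetical.

```python
import numpy as np

def count_weighted_scdf(u, block, counts, w_grid):
    """Evaluate (3.5) on a grid of thresholds.

    u      : (L,) values of U at the new locations (one predictive draw)
    block  : (L,) integer index j of the block B_j containing each location
    counts : (J,) observed counts n_j
    w_grid : thresholds at which to evaluate the curve
    """
    u = np.asarray(u, dtype=float)
    block = np.asarray(block, dtype=int)
    counts = np.asarray(counts, dtype=float)
    J = len(counts)
    L_j = np.bincount(block, minlength=J)           # locations falling in each block
    if np.any((L_j == 0) & (counts > 0)):
        raise ValueError("a populated block contains no sampled locations")
    out = np.empty(len(w_grid))
    for k, w in enumerate(w_grid):
        hits = np.bincount(block, weights=(u <= w).astype(float), minlength=J)
        frac = np.divide(hits, L_j, out=np.zeros(J), where=L_j > 0)
        out[k] = np.dot(counts, frac) / counts.sum()
    return out

# Toy example: two blocks with two locations each and counts 3 and 1.
u = np.array([0.1, 0.2, 0.3, 0.4])
block = np.array([0, 0, 1, 1])
counts = np.array([3.0, 1.0])
w_grid = np.array([0.05, 0.25, 0.35, 0.5])
F_star = count_weighted_scdf(u, block, counts, w_grid)   # -> [0.0, 0.75, 0.875, 1.0]
```

At $w = 0.35$, for example, block 0 contributes a proportion of 1 and block 1 a proportion of 0.5, giving $(3 \cdot 1 + 1 \cdot 0.5)/4 = 0.875$.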

4. EXAMPLES

4.1 CALIFORNIA AIR QUALITY DATA

We first illustrate the bivariate Gaussian process approach of Section 2 using data collected by the California Air Resources Board, and available at www.arb.ca.gov/aqd/aqdcd/aqdcddld.htm. The particular subset we consider are the mean NO and NO2 values for July 6th, 1999, as observed at the 67 monitoring sites shown as crossed circles in Figure 1. Recall that in our Section 2 notation, $U$ corresponds to log(mean NO) while $V$ corresponds to log(mean NO2). Indicator weighted SCDFs based on these two variables are of interest since persons already at high NO2 risk may be especially vulnerable to elevated NO levels. Figure 3 shows interpolated perspective, image, and contour plots of the raw data. Association of the pollutant levels is apparent; in fact, the sample correlation coefficient over the 67 pairs is .74.

We fit the separable bivariate model (2.1) using the simple isotropic exponential spatial covariance structure $\rho(s_i, s_{i'}; \phi) = \exp(-\phi \|s_i - s_{i'}\|)$, so that the only correlation parameter is the scalar decay $\phi$. For prior distributions, we first assumed $T^{-1} \sim W((\nu R)^{-1}, \nu)$ where $\nu = 2$ and $R = 4 I_2$. This is a reasonably vague specification, both in its small degrees of freedom $\nu$ and in the relative size of $R$ (roughly the prior mean of $T$), since the entire range of the data (for both log-NO and log-NO2) is only about three units. Next, we assume $\phi \sim G(a, b)$, with the parameters chosen so that the effective spatial range is half the maximum diagonal distance $M$ in Figure 1 (i.e., $3/E(\phi) = .5M$), and the prior standard deviation of $\phi$ is one half of its mean. Finally, we let $\mu_U$ and $\mu_V$ have vague normal priors (mean 0, variance 1,000).

Figure 3. Interpolated perspective, image, and contour plots of the raw log-NO (first row) and log-NO2 (second row), California air quality data, July 6, 1999.
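The two conditions on the gamma prior pin down its hyperparameters, as the short sketch below shows. It assumes a shape-scale parameterization (mean $ab$, variance $ab^2$), which matches the gamma density written out in Section 4.2 but is not stated explicitly for this example; the function name and the value of $M$ are hypothetical, since the article does not report $M$.

```python
def gamma_prior_for_range(M, range_fraction=0.5, cv=0.5):
    """Shape and scale of a gamma prior on phi.

    Chosen so that 3 / E(phi) = range_fraction * M and sd(phi) = cv * E(phi).
    Assumes the shape-scale form: mean = shape * scale, sd = sqrt(shape) * scale.
    """
    mean_phi = 3.0 / (range_fraction * M)
    shape = 1.0 / cv**2             # sd / mean = 1 / sqrt(shape)
    scale = mean_phi / shape
    return shape, scale

# Hypothetical maximum diagonal distance M = 10 (in whatever units the map uses).
shape, scale = gamma_prior_for_range(10.0)
```

With a standard deviation equal to half the mean, the shape is always 4, and the scale is $1.5/M$.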

Our initial Gibbs algorithm sampled over $\boldsymbol{\mu}$, $T^{-1}$, and $\phi$. The first two of these may be sampled from closed form full conditionals (normal and inverse Wishart, respectively) while $\phi$ is readily sampled on the log scale using Metropolis random walk steps. We used three parallel chains to check convergence, followed by a production run of 5,000 samples from a single chain for posterior summarization. Histograms (not shown) of the posterior samples for the bivariate kriging model are generally well-behaved and consistent with the results in Figure 3.
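A random walk Metropolis update on the log scale can be sketched as below. This is a generic illustration, not the authors' sampler: the function name is hypothetical, the step size is arbitrary, and the target is a stand-in gamma density (shape 4, scale 0.15) rather than the actual full conditional of $\phi$, which would also involve the likelihood. The one detail worth noting is the Jacobian term $\log \phi$ that arises from proposing on $\log \phi$.

```python
import numpy as np

def metropolis_log_scale(log_post, phi0, n_iter, step, rng):
    """Random-walk Metropolis for a positive parameter, proposing on log(phi).

    log_post : function returning the unnormalized log density in phi
    step     : standard deviation of the normal increment on the log scale
    """
    phi = float(phi0)
    lp = log_post(phi) + np.log(phi)        # density of eta = log(phi): add log-Jacobian
    chain = np.empty(n_iter)
    for t in range(n_iter):
        prop = phi * np.exp(step * rng.standard_normal())
        lp_prop = log_post(prop) + np.log(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            phi, lp = prop, lp_prop         # accept the proposal
        chain[t] = phi
    return chain

# Stand-in target (assumption): gamma density with shape 4 and scale 0.15.
a_shape, b_scale = 4.0, 0.15

def log_target(phi):
    return (a_shape - 1.0) * np.log(phi) - phi / b_scale

rng = np.random.default_rng(3)
chain = metropolis_log_scale(log_target, phi0=1.0, n_iter=20000, step=0.8, rng=rng)
kept = chain[2000:]                         # discard an initial burn-in portion
```

Because proposals are multiplicative, every sampled value stays positive, and the retained draws should have mean close to the target mean of 0.6.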

Robustness to model and prior assumptions is always a concern in spatial data analysis. For example, a more natural choice than our exponential kriging model might be one from the Matérn class (see, e.g., Banerjee, Carlin, and Gelfand 2004, pp. 27-28). We compared our results with those from the Matérn for several choices of the smoothing parameter (one of which corresponds to the exponential model), but did not obtain dramatic differences in the SCDFs. We also checked our Wishart prior specification by redoing our calculations using $\nu = 1$ and 3, and also using $R = 3 I_2$. Again the resulting SCDF estimates were virtually indistinguishable from those shown in the following.

Figure 4 shows perspective plots of raw and kriged log-NO and log-NO2 surfaces, where the plots in the first column are the (interpolated) raw data (as in the first column of Figure 3), those in the second column are based on a single Gibbs sample, and those in the third column represent the average over 5,000 post-convergence Gibbs samples. The plots in this final column are generally consistent with those in the first, except that they exhibit the spatial smoothness we expect of our posterior means.

Figure 4. Perspective plots of kriged log-NO and log-NO2 surfaces, California air quality data. First column, raw data; second column, based on a single Gibbs sample; third column, average over 5,000 post-convergence Gibbs samples.

Figure 5 shows several SCDFs arising through samples from our bivariate kriging algorithm and evaluated over a 200-point equally spaced grid. First, the solid line shows the ordinary SCDF (1.2) for log-NO. Next, we computed the weighted SCDF (3.1) for $r(s) = h(V(s)) = \exp(V(s))$. Because $V$ is log-NO2 exposure, this amounts to weighting by NO2 itself. The result is shown as the dotted line in Figure 5. This line is shifted to the right from the unweighted version, indicating higher harmful exposure when the second (positively correlated) pollutant is accounted for.

Also shown as dashed lines in Figure 5 are several weighted SCDFs that result from using a particular indicator of whether log-NO2 remains below a certain threshold. These weighted SCDFs are thus also conditional SCDFs, as in Equation (2.3). Existing EPA guidelines and expertise could be used to inform the choice of clinically meaningful thresholds; here we simply demonstrate the procedure's behavior for a few illustrative thresholds. For example, when the threshold is set to $-3.0$ (a rather high value for this pollutant on the log scale), nearly all of the weights equal 1, and the weighted SCDF differs little from the unweighted SCDF. However, as this threshold moves lower (to $-4.0$ and $-5.0$), the weighted SCDF moves further to the left of its unweighted counterpart. This movement is understandable, since our indicator functions are decreasing functions of log-NO2; movement to the right could be obtained simply by reversing the indicator inequality.

Figure 5. Weighted SCDFs for the California air quality data: solid line, ordinary SCDF for log-NO; dotted line, weighted SCDF for log-NO using NO2 as the weight; dashed lines, weighted SCDF for log-NO using various indicator functions of log-NO2 as the weights.

Finally, Figure 6 shows the estimate of the bivariate SCDF (2.2) arising from samples using our bivariate kriging algorithm. This summary provides the proportion of the assessment domain having values (on the log scale) for both pollutants below the indicated thresholds.

4.2 ATLANTA OZONE AND RACE DATA

As mentioned earlier, we also wish to adjust the SCDF for covariates relevant to environmental justice investigations. In the setting of Figure 2, we seek to investigate whether exposure to unhealthfully high ozone levels is higher among certain racial minorities or other subpopulations of interest (children, pregnant women, etc.). Ozone measurements $X(s_i)$ will typically be available at specific point locations $s_i$, whereas demographic subgroup information will be available only at some block level (zip code, census tract, etc.) from the Census Bureau or other government sources.

As an illustration of the approach of Section 3, consider the top two ArcView maps in Figure 7. Figure 7(a) is a grayscale map of the percent of residents living in Atlanta city zip code $j$ who are black, for $j = 1, \ldots, 36$ (where three small zips with no data are indicated by an intermediate shade and hatching). Figure 7(b) gives the block-level posterior mean ozone values in ppm (exponentiated back from the log scale, on which we fit the model) using data from the 10 Atlanta-area monitoring stations for July 15, 1995, and an MCMC algorithm based on point-block imputation as in Gelfand et al. (2001). These two maps suggest some correlation between high ozone and high percent black, though the pattern is reversed in the southwestern part of the city; the block-level sample correlation between these two variables is .49.

Figure 6. Bivariate SCDF for the California air quality data.

Next, we detail estimation of the $r_j$ in this setting. Notice that the definition of $F_r(w)$ in (2.4) essentially consists of an expectation of an indicator function with respect to the density $r(s) \big/ \int_D r(s)\, ds$. As already mentioned, this allows us to change the scale of the SCDF to whatever is appropriate for the exposure under study. For example, suppose we let $n_j$ denote the number of black residents of zip $j$. Then $r_j = n_j / |B_j|$ is the observed number of blacks per unit area (black density), and the weighted SCDF $F_r(w)$ corresponds to the CDF of ozone exposure of black persons (rather than the land on which they reside). Figure 7 also shows the spatial distribution of observed racial density for blacks (panel (c)) and whites (panel (d)). A visual comparison of the maps of black density and ozone exposure (Figure 7(b)) suggests even higher correlation than was apparent between percent black (Figure 7(a)) and ozone exposure, and in fact the block-level sample correlation is also higher, at .62.

In our point-block imputation model, we followed Gelfand et al. (2001) in using the isotropic exponential spatial covariance function $C(s_i, s_{i'}; \boldsymbol{\theta}) = \sigma^2 \exp(-\phi \|s_i - s_{i'}\|)$, so that $\boldsymbol{\theta} = (\sigma^2, \phi)$. We placed a flat prior on $\mu = E(X(s))$, while for the priors on $\sigma^2$ and $\phi$ we assumed that $\sigma^2 \sim IG(a, b)$ and $\phi \sim G(c, d)$, where $IG$ and $G$ denote the inverse gamma and gamma distributions with probability density functions

$$p(\sigma^2 \mid a, b) = \frac{e^{-1/(b\sigma^2)}}{\Gamma(a)\, b^a\, (\sigma^2)^{a+1}} \quad \text{and} \quad p(\phi \mid c, d) = \frac{\phi^{c-1} e^{-\phi/d}}{\Gamma(c)\, d^c},$$

respectively. We chose $a = 3$, $b = .5$, $c = .001$, and $d = 100$, corresponding to fairly vague priors. (Switching to the even vaguer value $c = .00001$ had no effect on our fitted SCDFs.) We then fit this three-parameter model using an MCMC implementation, sampling $\mu$ and $\sigma^2$ via Gibbs steps and $\phi$ on the log scale via Metropolis random walk steps. We ran 21,000 iterations, discarded the first 1,000 as burn-in, and subsequently used every 10th sample, so that our summaries below are based on 2,000 post-convergence samples.

Figure 7. Summaries for July 15, 1995, Atlanta city zip codes: (a) percent black; (b) estimated posterior mean one-hour peak ozone readings; (c) observed black density; (d) observed white density.


Figure 8. Weighted (black and white) and unweighted SCDF estimates, and corresponding estimated posteriors for median exposure (ppm) based on July 15, 1995, one-hour peak ozone data, Atlanta zip codes: (a) SCDFs using racial densities as the weights, (b) posterior of unweighted median exposure, (c) posterior of difference between black and white median exposure.

To see if the correlation between either race covariate and ozone exposure is significant, we used equations (1.2) and (3.5) to obtain pointwise estimates for $\hat{F}(w)$, $\hat{F}^*_{r_1}(w)$, and $\hat{F}^*_{r_2}(w)$, which are the overall, white, and black SCDFs of ozone exposure. The plots (shown in Figure 8(a) using race density $r_j$'s for both whites and blacks) indicate that while blacks do experience somewhat higher exposure ($\hat{F}^*_{r_1}(w) \ge \hat{F}^*_{r_2}(w)$), the difference seems to be of little practical significance: the estimated horizontal difference between the black and white SCDFs is never more than .02, less than would be seen across the Atlanta region on a given day, or across days in the 10-monitor average. Next, inverting our 2,000 samples of the unweighted curve, we obtain the estimated posterior distribution of overall median exposure shown in Figure 8(b). Finally, Figure 8(c) histograms the sampled differences between black and white median exposure. Again, while the difference is in the anticipated direction (higher exposure for blacks), the discrepancy is not large, nor is it significantly different from 0; see the 95% Bayesian credible interval (BCI) in the figure subtitle.

The lack of both statistically and practically significant differences between the racial groups in Figure 8 raises the question of whether more dramatic black-white differences could ever be obtained in our context. To check this, we reran our analyses on two more "toy" datasets. In the first, we altered the racial percentages in each zip (but kept the ozone values and locations fixed) with an eye toward increasing exposure inequity. In the second, we instead kept the racial percentages the same, but altered 8 of the 10 observed ozone values, increasing those closest to the predominantly black neighborhoods by no more than 28%, and decreasing those closest to the predominantly white neighborhoods by no more than 25%. The results (shown in Figures 9(a) and (b) with the original SCDFs from Figure 8(a) overlaid) do in fact reveal greater separation, especially under the exaggerated ozone toy data (panel (b)). However, our changes do not quite deliver statistical significance for the corresponding estimated posterior black minus white median exposures shown in Figures 9(c) and (d). Apparently with only 10 data points arranged in this way (and using a rather simple exponential kriging model), differences much larger than those seen in Figure 8 are difficult to obtain.

Figure 9. Weighted (black and white) and unweighted SCDF estimates and corresponding estimated posteriors for black minus white median exposure, toy Atlanta datasets: (a) SCDFs arising under exaggerated zip-level racial differences (original SCDFs overlaid); (b) SCDFs arising under exaggerated ozone observations (original SCDFs overlaid); (c) black minus white median exposure, exaggerated zip-level racial differences; (d) black minus white median exposure, exaggerated ozone observations.

Although Figure 8(c) describes the difference in the black versus white median exposure, we might still wish to compare the two demographic groups over the entire range of the SCDF. In this regard one is naturally led to common CDF comparison statistics, such as $\Delta = \sup_w |\hat{F}^*_{r_1}(w) - \hat{F}^*_{r_2}(w)|$ or $\Delta = \int_w |\hat{F}^*_{r_1}(w) - \hat{F}^*_{r_2}(w)|\, dw$; in our case we might also consider $\Delta = \int_w \max(0, \hat{F}^*_{r_1}(w) - \hat{F}^*_{r_2}(w))\, dw$, corresponding to the one-sided alternative $\hat{F}^*_{r_1}(w) > \hat{F}^*_{r_2}(w)$. However, we are not in a standard two-sample setting, since both $\hat{F}^*_{r_1}(w)$ and $\hat{F}^*_{r_2}(w)$ arise from the same data $\mathbf{X}_s$.

We used our MCMC draws to estimate the posterior distribution of $\Delta$ under each of the three definitions above, obtaining posterior means of .317, .0115, and .0104, respectively. In order to see if these values are significantly larger than what we might observe by chance, we simply randomly permuted the observed race density pairs among the 36 zips, reestimated the posterior mean, and repeated this process 99 times. That is, our permutation test retains each zip-specific pair of racial variables as a unit, but permutes them among the 36 zips, resulting in 36! possible permutations. Figure 10 shows histograms of the resulting three sets of simulated $E(\Delta \mid \mathbf{X}_s)$ values, with dotted vertical lines marking the locations of the observed values given above. In all three cases the observed values are quite a bit larger than any of the simulated values, suggesting the racial differences in the SCDFs are statistically significant at the .01 level in the Monte Carlo permutation test sense (although as already mentioned, they do not appear to be of practical significance).

Figure 10. Histograms of simulated posterior means for $\Delta$ under three definitions: sup norm, $\Delta = \sup_w |\hat{F}^*_{r_1}(w) - \hat{F}^*_{r_2}(w)|$; L1 norm, $\Delta = \int_w |\hat{F}^*_{r_1}(w) - \hat{F}^*_{r_2}(w)|\, dw$; and one-sided L1 norm, $\Delta = \int_w \max(0, \hat{F}^*_{r_1}(w) - \hat{F}^*_{r_2}(w))\, dw$. Actual observed values, $E(\Delta \mid \mathbf{X}_s)$, shown as vertical dotted reference lines. Observed significance is less than .01 under all three definitions.
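The permutation scheme can be sketched as follows, under simplifying assumptions that should be kept in mind. Instead of re-estimating a posterior mean from MCMC output for every permutation, the sketch works with a single fixed matrix of block-level proportions (which could be thought of as a posterior summary), and it reports the Monte Carlo p-value with the common $(1 + \text{exceedances})/(B + 1)$ convention, which may differ from the authors' reporting. All function names and the synthetic ozone and count values are hypothetical.

```python
import numpy as np

def discrepancies(F1, F2, w_grid):
    """Sup norm, L1 norm, and one-sided L1 norm between two SCDF curves on a grid."""
    diff = F1 - F2
    dw = np.diff(w_grid)

    def trapezoid(y):
        return float(np.sum(0.5 * (y[1:] + y[:-1]) * dw))

    return (float(np.max(np.abs(diff))),
            trapezoid(np.abs(diff)),
            trapezoid(np.maximum(diff, 0.0)))

def weighted_curve(frac, counts):
    """Count-weighted SCDF as in (3.5); frac[k, j] is the proportion in block j at or below w_k."""
    return frac @ counts / counts.sum()

def permutation_pvalues(frac, n1, n2, w_grid, n_perm, rng):
    """Permute the (n1_j, n2_j) pairs among blocks and compare with the observed statistics."""
    obs = np.array(discrepancies(weighted_curve(frac, n1),
                                 weighted_curve(frac, n2), w_grid))
    exceed = np.zeros(3)
    for _ in range(n_perm):
        perm = rng.permutation(len(n1))          # same permutation keeps each pair intact
        sim = discrepancies(weighted_curve(frac, n1[perm]),
                            weighted_curve(frac, n2[perm]), w_grid)
        exceed += np.array(sim) >= obs
    return obs, (1.0 + exceed) / (n_perm + 1.0)

# Synthetic illustration with 36 blocks (all values are made up).
rng = np.random.default_rng(4)
J = 36
z = np.linspace(0.06, 0.12, J)                   # hypothetical block-level ozone (ppm)
w_grid = np.linspace(0.05, 0.13, 81)
frac = (z[None, :] <= w_grid[:, None]).astype(float)
n2 = np.linspace(100.0, 1000.0, J)               # group 2 concentrated in high-ozone blocks
n1 = n2[::-1].copy()                             # group 1 concentrated in low-ozone blocks
obs, pvals = permutation_pvalues(frac, n1, n2, w_grid, n_perm=99, rng=rng)
```

With 99 permutations the smallest attainable p-value under this convention is .01, which is reached when the observed statistic exceeds every permuted value.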

5. CONCLUDING REMARKS AND FUTURE DIRECTIONS

In this article we have provided hierarchical modeling to enable inference for spatial CDFs. Although our examples have been specific to air pollution, we envision a broad set of potential application areas. For example, Waller, Turnbull, Clark, and Nasca (1994) investigated the link between exposure to trichloroethylene (TCE) and childhood leukemia in a region of upstate New York. Our weighted SCDFs should prove useful in this cancer-related context, where both environmental justice and simultaneous exposure to multiple putative carcinogens are key issues.

Extension of the above formulas to the spatio-temporal case (mentioned but not pursued by Lahiri et al. 1999) in the spirit of Gelfand et al. (2001) should be quite feasible, and should greatly improve the accuracy of our results (since much more data can be brought to bear with relatively little increase in parameter burden). In particular, we envision a collection of temporally dependent weighted SCDFs $F_{r,t}(w)$, $t = 1, \ldots, T$, arising through the bivariate spatiotemporal process $(U(s, t), V(s, t))$. Adopting the assumption of space-time separability leads to a double (instead of single) Kronecker product in the covariance structure, and thus feasible computing. An alternative would be to have the bivariate spatial process evolve dynamically.
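To make the double Kronecker product concrete, the following LaTeX sketch writes out one separable form that would produce it. The notation here is an assumption introduced for illustration and is not taken from the article: $n$ is the number of sites, $\Lambda$ is a $2 \times 2$ positive definite matrix describing the association between $U$ and $V$ at a single location and time, and $H_S(\phi_S)$ ($n \times n$) and $H_T(\phi_T)$ ($T \times T$) are spatial and temporal correlation matrices. The observations are assumed to be stacked with time outermost, site next, and component innermost.

```latex
% Assumed separable cross-covariance (illustrative notation only):
\operatorname{Cov}\!\left[
  \begin{pmatrix} U(s,t) \\ V(s,t) \end{pmatrix},
  \begin{pmatrix} U(s',t') \\ V(s',t') \end{pmatrix}
\right]
  = \rho_S(s - s'; \phi_S)\, \rho_T(t - t'; \phi_T)\, \Lambda .

% Stacking over T times and n sites gives a 2nT x 2nT covariance matrix:
\Sigma = H_T(\phi_T) \otimes H_S(\phi_S) \otimes \Lambda .

% Standard Kronecker identities then avoid any 2nT x 2nT factorization:
\Sigma^{-1} = H_T^{-1} \otimes H_S^{-1} \otimes \Lambda^{-1},
\qquad
|\Sigma| = |H_T|^{2n}\, |H_S|^{2T}\, |\Lambda|^{nT} .
```

Under these assumptions, each likelihood evaluation requires inverting only a $T \times T$, an $n \times n$, and a $2 \times 2$ matrix, which is one way to read the claim of feasible computing.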


ACKNOWLEDGMENTS

This work forms part of the first author's Ph.D. dissertation at the University of Minnesota. The work of all three authors was supported in part by NSF/EPA grant SES 99-78238, while that of the second author was also supported in part by NIH grant 2-R01-ES07750, and that of the third author was also supported in part by NSF grant DMS 99-71206. The authors are grateful to James Mulholland and Alexandra Mello Schmidt for providing and permitting analysis of the datasets in this article, and to Dan Hagler, Matt Maurer, Li Zhu, and Sudipto Banerjee for significant assistance with the computational aspects of this work.

[Received February 2004. Revised March 2005.]

REFERENCES

Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2004), Hierarchical Modeling and Analysis for Spatial Data, Boca Raton, FL: Chapman & Hall.

Banerjee, S., and Gelfand, A. E. (2002), "Prediction, Interpolation and Regression for Spatially Misaligned Data," Sankhya, Ser. A, 64, 227–245.

Gelfand, A. E., Schmidt, A. M., Banerjee, S., and Sirmans, C. F. (2004), "Nonstationary Multivariate Process Modeling Through Spatially Varying Coregionalization" (with discussion), Test, 13, 1–50.

Gelfand, A. E., Zhu, L., and Carlin, B. P. (2001), "On the Change of Support Problem for Spatio-Temporal Data," Biostatistics, 2, 31–45.

Handcock, M. S. (1999), Comment on "Prediction of Spatial Cumulative Distribution Functions Using Subsampling," Journal of the American Statistical Association, 94, 100–102.

Lahiri, S. N., Kaiser, M. S., Cressie, N., and Hsu, N.-J. (1999), "Prediction of Spatial Cumulative Distribution Functions Using Subsampling" (with discussion), Journal of the American Statistical Association, 94, 86–110.

Overton, W. S. (1989), "Effects of Measurements and Other Extraneous Errors on Estimated Distribution Functions in the National Surface Water Surveys," Technical Report 129, Department of Statistics, Oregon State University.

Tolbert, P., Mulholland, J., MacIntosh, D., Xu, F., Daniels, D., Devine, O., Carlin, B. P., Klein, M., Dorley, J., Butler, A., Nordenberg, D., Frumkin, H., Ryan, P. B., and White, M. (2000), "Air Pollution and Pediatric Emergency Room Visits for Asthma in Atlanta," American Journal of Epidemiology, 151, 798–810.

Wackernagel, H. (2003), Multivariate Geostatistics: An Introduction with Applications (3rd ed.), New York: Springer-Verlag.

Waller, L. A., Turnbull, B. W., Clark, L. C., and Nasca, P. (1994), "Spatial Pattern Analyses to Detect Rare Disease Clusters," in Case Studies in Biometry, eds. N. Lange, L. Ryan, L. Billard, D. Brillinger, L. Conquest, and J. Greenhouse, New York: Wiley, pp. 3–23.

Yaglom, A. M. (1987), Correlation Theory of Stationary and Related Random Functions, New York: Springer-Verlag.

Zhu, J., Lahiri, S. N., and Cressie, N. (2002), "Asymptotic Inference for Spatial CDFs Over Time," Statistica Sinica, 12, 843–861.
