An Introduction to Model-based Geostatistics
Peter J Diggle
School of Health and Medicine, Lancaster University
and
Department of Biostatistics, Johns Hopkins University
September 2009
Outline
• What is geostatistics?
• What is model-based geostatistics?
• Two examples
– constructing an elevation surface from sparse data
– tropical disease prevalence mapping
Example: surface elevation data
[Figure: the elevation data plotted on axes X, Y and Z (Z ticks from 65 to 100)]
Geostatistics
• traditionally, a self-contained methodology for spatial prediction, developed at Ecole des Mines, Fontainebleau, France
• nowadays, that part of spatial statistics which is concerned with data obtained by spatially discrete sampling of a spatially continuous process
Kriging: find the linear combination of the data that best predicts the value of the surface at an arbitrary location x
Model-based Geostatistics
• the application of general principles of statistical modelling and inference to geostatistical problems
– formulate a statistical model for the data
– fit the model using likelihood-based methods
– use the fitted model to make predictions
Kriging: minimum mean square error prediction under Gaussian modelling assumptions
Gaussian geostatistics (simplest case)
Model
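• Stationary Gaussian process $\{S(x) : x \in \mathbb{R}^2\}$
$\mathrm{E}[S(x)] = \mu$, $\mathrm{Cov}\{S(x), S(x')\} = \sigma^2 \rho(\|x - x'\|)$
• Mutually independent $Y_i \mid S(\cdot) \sim \mathrm{N}(S(x_i), \tau^2)$
Point predictor: $\hat{S}(x) = \mathrm{E}[S(x) \mid Y]$
• linear in $Y = (Y_1, \ldots, Y_n)$
• interpolates $Y$ if $\tau^2 = 0$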
• called simple kriging in classical geostatistics
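For reference, under the model above the point predictor has a standard closed form. The symbols $r(x)$, $R$, $I$ and $\mathbf{1}$ are notation introduced here for illustration and do not appear on the slides; all parameters are treated as known.

```latex
\hat{S}(x) = \mu + \sigma^2\, r(x)^\top \left(\tau^2 I + \sigma^2 R\right)^{-1} (Y - \mu \mathbf{1}),
\qquad
r_i(x) = \rho(\|x - x_i\|), \quad R_{ij} = \rho(\|x_i - x_j\|)
\\[4pt]
\mathrm{Var}\{S(x) \mid Y\} = \sigma^2 - \sigma^4\, r(x)^\top \left(\tau^2 I + \sigma^2 R\right)^{-1} r(x)
```

The predictor is visibly linear in $Y$. When $\tau^2 = 0$ and $x = x_k$, the vector $r(x_k)$ is the $k$-th column of $R$, so (for non-singular $R$) the weights reduce to the $k$-th unit vector, giving $\hat{S}(x_k) = Y_k$ with zero prediction variance.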
Predictive distribution
• choose the target for prediction, $F(S)$, where $S = \{S(x) : x \in A\}$
• draw samples $S_i : i = 1, \ldots, N$ from $[S \mid Y]$
• then $F_i = F(S_i) : i = 1, \ldots, N$ is a sample from the required predictive distribution $[F(S) \mid Y]$
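The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the geoR implementation: the parameters are treated as known, the correlation function is taken to be exponential, the target F is the maximum of S over a handful of prediction locations, and every function name and data value below is hypothetical.

```python
import math
import random

def exp_corr(u, phi):
    """Exponential correlation rho(u) = exp(-u/phi) (assumed family)."""
    return math.exp(-u / phi)

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def chol_solve(low, b):
    """Solve (low low^T) z = b by forward, then backward, substitution."""
    n = len(b)
    w = [0.0] * n
    for i in range(n):
        w[i] = (b[i] - sum(low[i][k] * w[k] for k in range(i))) / low[i][i]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (w[i] - sum(low[k][i] * z[k] for k in range(i + 1, n))) / low[i][i]
    return z

def conditional_moments(x_obs, y, x_new, mu, sigma2, phi, tau2):
    """Mean vector and covariance matrix of S at x_new, given Y = y."""
    n, m = len(x_obs), len(x_new)
    # Var(Y) = sigma2 * R + tau2 * I
    var_y = [[sigma2 * exp_corr(math.dist(x_obs[i], x_obs[j]), phi)
              + (tau2 if i == j else 0.0) for j in range(n)] for i in range(n)]
    low = cholesky(var_y)
    alpha = chol_solve(low, [yi - mu for yi in y])
    # c[a][i] = Cov(S(x_new[a]), Y_i)
    c = [[sigma2 * exp_corr(math.dist(x_new[a], x_obs[i]), phi)
          for i in range(n)] for a in range(m)]
    mean = [mu + sum(c[a][i] * alpha[i] for i in range(n)) for a in range(m)]
    w = [chol_solve(low, c[a]) for a in range(m)]  # w[a] = Var(Y)^{-1} c[a]
    cov = [[sigma2 * exp_corr(math.dist(x_new[a], x_new[b]), phi)
            - sum(c[a][i] * w[b][i] for i in range(n))
            for b in range(m)] for a in range(m)]
    return mean, cov

def sample_target(mean, cov, target, n_samples, rng, jitter=1e-9):
    """Draw S_i from N(mean, cov) and return the values F_i = target(S_i)."""
    m = len(mean)
    # a tiny diagonal jitter guards the factorisation against rounding error
    low = cholesky([[cov[a][b] + (jitter if a == b else 0.0)
                     for b in range(m)] for a in range(m)])
    out = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(m)]
        s = [mean[a] + sum(low[a][k] * z[k] for k in range(a + 1)) for a in range(m)]
        out.append(target(s))
    return out

# Hypothetical data: four observation sites and three prediction sites
x_obs = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
y_obs = [1.2, 0.7, -0.3, 0.9]
x_new = [(0.5, 0.5), (1.0, 1.0), (1.5, 1.5)]
mean, cov = conditional_moments(x_obs, y_obs, x_new,
                                mu=0.5, sigma2=1.0, phi=1.0, tau2=0.1)
f_samples = sample_target(mean, cov, max, 1000, random.Random(1))
# Monte Carlo estimate of P(max of S over the three sites > 1 | Y)
prob_exceeds_1 = sum(f > 1.0 for f in f_samples) / len(f_samples)
```

Because F is applied to each simulated surface as a whole, the same draws serve any target (a maximum, an area above a threshold, an average) without further derivation, which is the practical appeal of working with the predictive distribution rather than point predictions alone.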
Interpolating the elevation surface
Under Gaussian modelling assumptions, we need to:
• identify a parametric family of correlation functions
• fit the model
• use the model for prediction
• identify a parametric family of correlation functions
The empirical variogram
$(x_i, Y_i) : i = 1, \ldots, n$; $u_{ij} = \|x_i - x_j\|$; $v_{ij} = \tfrac{1}{2}(y_i - y_j)^2$
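The quantities u_ij and v_ij can be computed directly from the data. The sketch below is a simplified illustration with hypothetical function names and toy values; its binning rule (fixed-width intervals starting at zero) is an assumption and is not claimed to match what geoR's variog does.

```python
import math

def variogram_cloud(coords, values):
    """Return the pairs (u_ij, v_ij) for all i < j."""
    cloud = []
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            u = math.dist(coords[i], coords[j])       # u_ij = ||x_i - x_j||
            v = 0.5 * (values[i] - values[j]) ** 2    # v_ij = (y_i - y_j)^2 / 2
            cloud.append((u, v))
    return cloud

def binned_variogram(cloud, width):
    """Average v_ij within distance bins [k*width, (k+1)*width).

    Returns a list of (bin midpoint, mean of v_ij, number of pairs).
    """
    sums, counts = {}, {}
    for u, v in cloud:
        k = int(u // width)
        sums[k] = sums.get(k, 0.0) + v
        counts[k] = counts.get(k, 0) + 1
    return [((k + 0.5) * width, sums[k] / counts[k], counts[k]) for k in sorted(sums)]

# Toy data: three locations with made-up measurements
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
values = [1.0, 3.0, 2.0]
cloud = variogram_cloud(coords, values)
bins = binned_variogram(cloud, width=0.4)
```

Averaging within bins is one simple way of smoothing the scatterplot of (u_ij, v_ij); the bin width trades resolution in u against the number of pairs contributing to each average.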
The theoretical variogram
$V(u) = \tfrac{1}{2}\mathrm{Var}\{Y(x) - Y(x - u)\} = \tau^2 + \sigma^2\{1 - \rho(u)\}$
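The second equality follows in a few lines from the Gaussian model. As a sketch, write $Y(x) = S(x) + Z(x)$, where $Z(x)$ denotes the independent $\mathrm{N}(0, \tau^2)$ measurement error (the symbol $Z$ is introduced here for illustration), and take $u > 0$:

```latex
\begin{aligned}
\mathrm{Var}\{Y(x) - Y(x-u)\}
  &= \mathrm{Var}\{S(x) - S(x-u)\} + \mathrm{Var}\{Z(x) - Z(x-u)\} \\
  &= \{2\sigma^2 - 2\sigma^2\rho(u)\} + 2\tau^2, \\
V(u) &= \tfrac{1}{2}\,\mathrm{Var}\{Y(x) - Y(x-u)\} = \tau^2 + \sigma^2\{1 - \rho(u)\}.
\end{aligned}
```

Read off a plot, the intercept of $V(u)$ as $u \to 0$ is therefore $\tau^2$ and, when $\rho(u) \to 0$ at large distances, the curve levels off at $\tau^2 + \sigma^2$.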
Exploratory analysis
$\mathrm{E}[v_{ij}] = V(u_{ij})$ ⇒ smoothed scatterplot of $(u_{ij}, v_{ij})$ identifies rough shape of $\rho(u)$ and initial estimates of model parameters
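geoR code:
# load geoR and the elevation data set
library(geoR)
data(elevation)
summary(elevation)
# empirical variogram of the raw data, with bins defined by uvec = 0, 0.2, ..., 5
vario <- variog(elevation, uvec = 0.2 * (0:25))
plot(vario)
?variog   # help page for variog
# empirical variogram after removing a first-order (linear) trend surface
vario2 <- variog(elevation, uvec = 0.2 * (0:25), trend = "1st")
plot(vario2)
# overlay the two variograms: raw data (black line), detrended (red line)
plot(vario$u, vario$v, type = "l", xlim = c(0, 5), ylim = c(0, 7000),
     xlab = "u", ylab = "V(u)")
lines(vario2$u, vario2$v, col = "red")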
• fit the model
1. Classical: compute maximum likelihood estimates $\hat{\theta}$
2. Bayesian: prior [θ] implies posterior [θ|Y ]
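geoR code for option 1:
# maximum likelihood fit of a Matern correlation model with kappa = 1;
# ini.cov.pars gives starting values for (sigma^2, phi)
mlfit <- likfit(elevation, ini.cov.pars = c(5000, 2.0),
                cov.model = "matern", kappa = 1)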
• use the model for prediction
1. Plug-in: $[S \mid Y; \hat{\theta}]$
2. Bayesian: $[S \mid Y] = \int [S \mid Y; \theta]\,[\theta \mid Y]\, d\theta$
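geoR code for option 1:
# corners of the square prediction region, one (x, y) pair per row
region <- matrix(c(0, 0, 6.4, 0, 6.4, 6.4, 0, 6.4), 4, 2, byrow = TRUE)
# regular prediction grid with spacing 0.2
grid <- pred_grid(region, by = 0.2)
# plug-in kriging with the fitted model, plus 100 simulations from [S|Y]
KC <- krige.control(obj.model = mlfit)
OC <- output.control(n.predictive = 100)
set.seed(24367)   # makes the simulations reproducible
predictions <- krige.conv(geodata = elevation, locations = grid,
                          borders = region, krige = KC, output = OC)
# map of the point predictions with the data locations overlaid
image(predictions)
points(elevation, add = TRUE)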
Tropical disease prevalence mapping
• “river blindness” – an endemic disease in wet tropics
• donation programme of mass treatment with ivermectin
• approximately 50 million people treated to date (target is 80 million by 2015)
• serious adverse reactions experienced by some patients highly co-infected with Loa loa parasites
• precautionary measures put in place before mass treatment in areas of high Loa loa prevalence
http://www.who.int/pbd/blindness/onchocerciasis/en/
Diggle et al., Annals of Tropical Medicine and Parasitology, 101, 499–509.
The Loa loa prediction problem
Ground-truth survey data
• random sample of subjects in each of a number of villages
• blood-samples test positive/negative for Loa loa
Environmental data (satellite images)
• measured on regular grid to cover region of interest
• elevation, green-ness of vegetation
Objectives
• predict local prevalence throughout study-region (Cameroon)
• compute local exceedance probabilities,
P(prevalence > 0.2|data)
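An exceedance probability of this kind is typically approximated by simulation: draw from the predictive distribution at a location and report the proportion of draws whose prevalence exceeds the threshold. The sketch below assumes the draws are available on the logit scale; the function names and the five draws are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def inv_logit(eta):
    """Map a value on the logit scale to a prevalence in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def exceedance_probability(eta_samples, threshold=0.2):
    """Proportion of predictive draws with prevalence above the threshold."""
    prevalences = [inv_logit(eta) for eta in eta_samples]
    return sum(p > threshold for p in prevalences) / len(prevalences)

# Hypothetical predictive draws at one location (logit scale)
eta_draws = [-3.0, -2.0, -1.0, 0.0, 1.0]
prob = exceedance_probability(eta_draws)
```

Repeating the calculation at every grid location gives a map of exceedance probabilities. In practice far more than five draws would be used, since the Monte Carlo error of the estimate shrinks only with the number of draws.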
Loa loa: a generalised linear model
• Latent spatial process
$S(x) \sim \mathrm{SGP}\{0, \sigma^2, \rho(u)\}$
$\rho(u) = \exp(-|u|/\phi)$
• Linear predictor
$d(x)$ = environmental variables at location $x$
$\eta(x) = d(x)'\beta + S(x)$
$p(x) = \exp\{\eta(x)\} / [1 + \exp\{\eta(x)\}]$
• Conditional distribution for positive proportion $Y_i/n_i$
$Y_i \mid S(\cdot) \sim \mathrm{Bin}\{n_i, p(x_i)\}$
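To make the three layers of the model concrete, the sketch below evaluates the conditional log-likelihood of the village counts given values of the latent process at the village locations. It is an illustration only: the function names, covariates, coefficients and counts are hypothetical, and it covers just the conditional piece given S; fitting the full model also requires integrating over the unobserved S, which this sketch does not attempt.

```python
import math

def inv_logit(eta):
    """p = exp(eta) / (1 + exp(eta)), written in an equivalent form."""
    return 1.0 / (1.0 + math.exp(-eta))

def binomial_loglik(y, n, eta):
    """Log-probability of y positives out of n when logit(p) = eta."""
    p = inv_logit(eta)
    return math.log(math.comb(n, y)) + y * math.log(p) + (n - y) * math.log(1.0 - p)

def conditional_loglik(counts, sizes, covariates, beta, s):
    """Sum over villages of log P(Y_i = y_i | S), with eta_i = d_i' beta + S(x_i)."""
    total = 0.0
    for y, n, d, s_i in zip(counts, sizes, covariates, s):
        eta = sum(d_j * b_j for d_j, b_j in zip(d, beta)) + s_i
        total += binomial_loglik(y, n, eta)
    return total

# Hypothetical example: two villages, an intercept plus one covariate
counts = [3, 10]                      # positives y_i
sizes = [20, 25]                      # subjects sampled n_i
covariates = [(1.0, 0.2), (1.0, 0.8)] # d(x_i), first entry is the intercept
beta = (-2.0, 1.5)                    # made-up regression coefficients
s_values = [0.1, -0.3]                # made-up values of S(x_i)
ll = conditional_loglik(counts, sizes, covariates, beta, s_values)
```

Setting every S(x_i) to zero recovers an ordinary logistic regression log-likelihood, so the latent process can be read as a spatially correlated adjustment to the regression on the logit scale.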
The modelling strategy
• use relationship between environmental variables and ground-truth prevalence to construct preliminary predictions via logistic regression
• use local deviations from regression model to estimate smooth residual spatial variation
• use fitted model for predictive inference
[Figure: logit prevalence vs elevation (elevation axis from 0 to 1500)]
[Figure: logit prevalence vs max NDVI (maximum greenness axis from 0.65 to 0.90)]
Comparing non-spatial and spatial predictions in Cameroon
Non-spatial
[Figure: observed prevalence (%) against predicted prevalence (%) 'without ground truth data']
Spatial
[Figure: observed prevalence (%) against predicted prevalence (%) 'with ground truth data']
Probabilistic prediction in Cameroon
Take-home message
• model-based approach:
– makes assumptions explicit
– makes choice of analysis strategy less subjective
– emphasises uncertainty
• exceedance probability maps are often more useful than point predictions and standard errors
• text-book linked to geoR software
Diggle, P.J. and Ribeiro, P.J. (2007). Model-based Geostatistics. New York: Springer.