Applied_geostats_intro



Page 1: Applied_geostats_intro

An introduction to applied geostatistics

Part 4 – Model validation; simulation

Overheads

D G Rossiter

Department of Earth Systems Analysis

International Institute for Geo-information Science & Earth Observation (ITC)

<http://www.itc.nl/personal/rossiter>

July 9, 2005

Page 2: Applied_geostats_intro

AN INTRODUCTION TO APPLIED GEOSTATISTICS 1

Model validation

With any predictive method, we would like to know how good it is. This is model validation.

• cf. model calibration, when we are building the model

The basic idea is to compare model predictions with reality. Two main methods:

1. Separate validation dataset

2. Cross-validation using calibration dataset

D G ROSSITER

Page 3: Applied_geostats_intro


Independent validation

Simple measures of validity:

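• Root mean squared error (RMSE) of the residuals: the actual vs. estimate (from the model) in the validation dataset; lower is better:

$$\mathrm{RMSE} = \left[ \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \right]^{1/2}$$

• Bias or mean error (ME): the difference between the actual and estimated means of the validation dataset; should be zero (0):

$$\mathrm{ME} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)$$

where $y_i$ is the actual value and $\hat{y}_i$ the model estimate at validation point $i$.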
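As a minimal sketch, both measures can be computed in a few lines of R. The vectors `actual` and `predicted` are made-up values for illustration only, and the residual is taken here as actual minus predicted.

```r
# Hypothetical validation data: observed values and model predictions
actual    <- c(2.1, 3.4, 1.8, 4.0, 2.9)
predicted <- c(2.4, 3.0, 2.0, 3.5, 3.1)

residual <- actual - predicted       # actual minus predicted

rmse <- sqrt(mean(residual^2))       # root mean squared error; lower is better
me   <- mean(residual)               # mean error (bias); ideally 0

print(c(RMSE = rmse, ME = me))
```

A small ME with a large RMSE means the errors roughly cancel on average but are individually large.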

Page 4: Applied_geostats_intro

Cross-validation

If we don’t have an independent data set to evaluate a model, we can use the same sample points that were used to estimate the model to validate that same model.

This seems a bit dubious, but with enough points, the effect of the removed point on the model (which was estimated using that point) is minor.

N.b. this is not legitimate for non-geostatistical models, because there is no theory of spatial correlation.

Page 5: Applied_geostats_intro

Cross–validation procedure

1. Compute experimental variogram with all sample points; model it

2. For each point

(a) Remove the point from the sample set

(b) Predict at that point using the other points and the modelled variogram

3. Summarize the deviations of the predictions from the actual values

Then models can be compared by their summary statistics, and also by looking at individual predictions of interest.
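The procedure above can be sketched in base R, solving the ordinary kriging system directly for each left-out point. Everything here is an assumption for illustration: the sample is synthetic, and the exponential variogram parameters (`nugget`, `psill`, `a`) are simply chosen, not fitted. It shows the logic of the loop, not how gstat implements it.

```r
set.seed(1)

# Hypothetical sample: 40 points in a 100 x 100 square
n  <- 40
xy <- cbind(runif(n, 0, 100), runif(n, 0, 100))
D  <- unname(as.matrix(dist(xy)))        # inter-point distances

# Step 1 stand-in: an assumed exponential variogram model (not fitted)
nugget <- 0.1; psill <- 0.9; a <- 20
gam <- function(h) ifelse(h > 0, nugget + psill * (1 - exp(-h / a)), 0)

# Synthetic observations whose covariance matches the assumed model
C <- (nugget + psill) - gam(D)
z <- as.vector(5 + t(chol(C)) %*% rnorm(n))

G <- gam(D)                              # semivariances between all pairs

# Step 2: leave each point out and predict it by ordinary kriging
loo <- function(i) {
  keep <- setdiff(seq_len(n), i)
  m    <- length(keep)
  A    <- rbind(cbind(G[keep, keep], 1), c(rep(1, m), 0))  # kriging system
  b    <- c(G[keep, i], 1)
  sol  <- as.vector(solve(A, b))
  lambda <- sol[1:m]                     # kriging weights (sum to 1)
  c(pred = sum(lambda * z[keep]),
    var  = sum(lambda * G[keep, i]) + sol[m + 1])          # kriging variance
}
cv <- as.data.frame(t(sapply(seq_len(n), loo)))
cv$residual <- z - cv$pred               # observed minus predicted

# Step 3: summarize the deviations
print(summary(cv$residual))
```

Note that the variogram model is held fixed while points are removed, which is exactly the simplification described in step 1.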

Page 6: Applied_geostats_intro

Summary statistics for cross–validation

• Root Mean Square Error (RMSE): lower is better; computed as for independent validation

• Bias or mean error (ME): should be 0; computed as for independent validation

• Mean Squared Deviation Ratio (MSDR) of residuals with kriging variance: should be 1

$$\mathrm{MSDR} = \frac{1}{N} \sum_{i=1}^{N} \frac{\{ z(\vec{x}_i) - \hat{z}(\vec{x}_i) \}^2}{\hat{\sigma}^2(\vec{x}_i)}$$
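A minimal sketch of the MSDR computation in R, using made-up residuals and kriging variances (the names `residual` and `krig_var` are illustrative). It also shows that the MSDR is the mean of the squared standardized residuals (z-scores).

```r
# Hypothetical cross-validation output: residuals (observed - predicted)
# and the kriging variance at each left-out point
residual <- c(0.98, 0.50, 0.42, -0.30, 0.08)
krig_var <- c(0.80, 0.78, 0.77, 0.81, 0.76)

msdr <- mean(residual^2 / krig_var)      # should be close to 1

# Equivalent view: mean of squared z-scores
zscore <- residual / sqrt(krig_var)
print(c(MSDR = msdr, mean_z2 = mean(zscore^2)))
```

An MSDR above 1 indicates that the kriging variances understate the actual squared errors; below 1, that they overstate them.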

Page 7: Applied_geostats_intro

Cross-validation in gstat

> # leave-one-out cross-validation
> kcv <- krige.cv(log(cadmium) ~ 1, ~ x + y, meuse, model = m2); str(kcv)
'data.frame': 155 obs. of 8 variables:
 $ x        : num 181072 181025 181165 181298 181307 ...
 $ y        : num 333611 333558 333537 333484 333330 ...
 $ var1.pred: num 1.482 1.649 1.452 1.259 0.952 ...
 $ var1.var : num 0.796 0.781 0.768 0.811 0.761 ...
 $ observed : Named num 2.460 2.152 1.872 0.956 1.030 ...
  ..- attr(*, "names")= chr "1" "2" "3" "4" ...
 $ residual : num 0.9774 0.5031 0.4195 -0.3033 0.0773 ...
 $ zscore   : num 1.0953 0.5692 0.4786 -0.3368 0.0886 ...
 $ fold     : int 1 2 3 4 5 6 7 8 9 10 ...
> truehist(kcv$residual)
> # some residuals are very large, show their locations
> bubble(kcv, z = "residual", fill = FALSE)
> # measures of goodness: ME = 0, MSE low, MSDR = 1
> mean(kcv$residual); mean(kcv$residual**2)
[1] 7.996923e-05
[1] 0.8644106
> mean((kcv$residual)**2 / kcv$var1.var)
[1] 1.129047

Page 8: Applied_geostats_intro

Residuals from cross-validation and their location

[Figure: histogram of kcv$residual; bubble plot of residual by location (x, y), with legend values −2.787, −0.384, 0.057, 0.565, 2.424]

Page 9: Applied_geostats_intro

Spatial simulation

Simulation is the process or result of representing what reality might look like, given a model.

In geostatistics, this reality is usually a spatial distribution (map).

Page 10: Applied_geostats_intro

What is stochastic simulation?

• “Simulation” is a general term for studying a system without physically implementing it.

• “Stochastic” simulation means that there is a random component to the simulation model: quantified uncertainty is included so that each simulation is different.

• Non-spatial example: planning the number and timing of clerks in a new branch bank; customer behaviour (arrival times, transaction length) is stochastic and represented by probability distributions.

• Reference for spatial simulation: Goovaerts, P., 1997. Geostatistics for natural resources evaluation. Applied Geostatistics Series. Oxford University Press, New York; Chapter 8.

Page 11: Applied_geostats_intro

Why spatial simulation?

• Recall: the theory of regionalized variables assumes that the values we observe come from some random process; in the simplest case, with one expected value (first-order stationarity) with a spatially-correlated error that is the same over the whole area (second-order stationarity).

• So we’d like to see “alternative realities”; that is, spatial patterns that, by this theory, could have occurred in some “parallel universe”.

• In addition, kriging maps are unrealistically smooth, especially in areas with low sampling density.

* Even if there is a high nugget effect in the variogram, this variability is not reflected in adjacent prediction points, since they are estimated from almost the same data.

Page 12: Applied_geostats_intro

When must simulation be used?

Goovaerts: “Smooth interpolated maps should not be used for applications sensitive to the presence of extreme values and their patterns of continuity.” (p. 370)

Example: ground water travel time depends on sequences of large or small values (“critical paths”), not just on individual values.

Page 13: Applied_geostats_intro

Local uncertainty vs. spatial uncertainty

• Recall: kriging prediction also provides a prediction error; this is the BLUP and its error for each prediction location separately.

• So, at each prediction location we obtain a probability distribution of the prediction, a measure of its uncertainty. This is fine for evaluating each prediction individually.

• But it is not valid for evaluating the set of predictions jointly! Errors are by definition spatially correlated (as shown by the fitted variogram model), so we can’t simulate the error in a field by simulating the error at each point separately.

• Spatial uncertainty is a representation of the error over the entire field of prediction locations at the same time.
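A small numerical sketch of why the distinction matters, under stated assumptions: a hypothetical transect of prediction locations whose errors follow an assumed exponential covariance (sill 1, range parameter 100 m). The variance of the error in the transect mean is computed twice, once pretending the point errors are independent and once using the full covariance.

```r
# Hypothetical transect of 50 prediction locations, 10 m apart
s <- seq(0, 490, by = 10)
n <- length(s)

# Assumed error covariance: exponential, sill 1, range parameter 100 m
C <- exp(-as.matrix(dist(s)) / 100)

# Variance of the error in the transect mean, two ways
var_independent <- sum(diag(C)) / n^2    # pretends errors are uncorrelated
var_correlated  <- sum(C) / n^2          # uses the full covariance

print(c(independent = var_independent, correlated = var_correlated))
```

Ignoring the correlation understates the uncertainty of any quantity that aggregates over the field, which is the case for most spatially-explicit models.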

Page 14: Applied_geostats_intro

Practical applications of spatial simulation

• If the distribution of the target variable(s) over the study area is to be used as input to a model, then the uncertainty is represented by a number of simulations.

• Procedure:

1. Simulate a “large” number of realizations of the spatial field

2. Run the model on each simulation

3. Summarize the output of the different model runs

• The statistics of the output give a direct measure of the uncertainty of the model in the light of the sample and the model of spatial variability.
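The three-step procedure can be sketched in base R as follows. All specifics are assumptions for illustration: the field is a hypothetical 1-D row of cells, realizations are drawn by Cholesky factorization of an assumed exponential covariance (not gstat's simulation algorithm), and the "model" is a stand-in that returns the longest run of consecutive cells above a threshold, in the spirit of a critical-path calculation.

```r
set.seed(7)

# Hypothetical 1-D field of 60 cells with an assumed exponential covariance
s <- 1:60
C <- exp(-as.matrix(dist(s)) / 10)
L <- t(chol(C))                          # lower-triangular factor, L %*% t(L) = C

# 1. Simulate a "large" number of realizations (one per column), mean 2
nsim <- 500
sims <- 2 + L %*% matrix(rnorm(length(s) * nsim), ncol = nsim)

# 2. Run the model on each realization; stand-in model: longest run of
#    consecutive cells above a threshold
longest_run <- function(v, thr = 2.5) {
  r <- rle(v > thr)
  if (any(r$values)) max(r$lengths[r$values]) else 0L
}
out <- apply(sims, 2, longest_run)

# 3. Summarize the model output across realizations
print(quantile(out, c(0.05, 0.5, 0.95)))
```

The spread of `out` across realizations is the uncertainty in the model output that can be attributed to uncertainty in the spatial field.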

Page 15: Applied_geostats_intro

Unconditional simulation

In unconditional simulation, we simulate the field with no reference to the actual sample, i.e. the data we have. (Each simulation is only one realisation, no more valid than any other.)

This is mainly to visualise a random field as modelled by a variogram, not for prediction.

Page 16: Applied_geostats_intro

What is preserved in unconditional simulation?

1. Mean over field

2. Covariance structure

Data points are not predicted exactly.

Page 17: Applied_geostats_intro

Unconditional simulation in gstat

The krige function allows a number of simulations, nsim, to be specified. For unconditional simulation, specify no data (data=NULL); instead use the dummy=TRUE option.

Since there are no data with which to estimate the mean, it must be specified as the beta parameter.

> x <- krige(log(cadmium) ~ 1, ~ x + y, data = NULL, newdata = meuse.grid,
+     model = m2, nmax = 20, nsim = 5, beta = mean(log(cadmium)), dummy = TRUE)
[using unconditional gaussian simulation]
> levelplot(z ~ x + y | name, map.to.lev(x, z = c(3:7)), aspect = mapasp(x),
+     main = "five unconditional realisations of a correlated Gaussian field")

Page 19: Applied_geostats_intro

Conditional simulation

This simulates the field, while respecting the sample. So the simulated maps look more like the best (kriging) prediction, but usually much more spatially variable (depending on the magnitude of the nugget).

These are inputs into spatially-explicit models, e.g. hydrology.

Page 20: Applied_geostats_intro

What is preserved in conditional simulation?

1. Mean over field

2. Covariance structure

3. Observed data (points are predicted exactly)
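To make the contrast with unconditional simulation concrete, here is a minimal base-R sketch on a hypothetical 1-D transect. The observations, the mean `mu`, and the exponential covariance are all assumed for illustration. The conditional realization is drawn from the multivariate normal distribution of the unobserved nodes given the data (equivalent to simple kriging with a known mean); this is a swapped-in direct method, not the sequential algorithm that gstat uses.

```r
set.seed(3)

# Hypothetical transect: 41 nodes; 5 of them carry observations
s     <- 0:40
obs   <- c(1, 11, 21, 31, 41)            # indices of observed nodes
rest  <- setdiff(seq_along(s), obs)
mu    <- 0.56                            # assumed known mean
z_obs <- c(1.2, 0.3, -0.4, 0.9, 1.5)     # made-up observations

# Assumed covariance model: exponential, sill 1, range parameter 8, no nugget
C <- exp(-as.matrix(dist(s)) / 8)

# Unconditional realization: honours mean and covariance, ignores the data
uncond <- as.vector(mu + t(chol(C)) %*% rnorm(length(s)))

# Conditional realization: Gaussian field given the observations
W      <- C[rest, obs] %*% solve(C[obs, obs])    # simple-kriging weights
m_cond <- mu + W %*% (z_obs - mu)                # conditional mean
C_cond <- C[rest, rest] - W %*% C[obs, rest]     # conditional covariance
cond       <- numeric(length(s))
cond[obs]  <- z_obs                              # data reproduced exactly
cond[rest] <- as.vector(m_cond + t(chol(C_cond)) %*% rnorm(length(rest)))

# Compare the values at the observed nodes
print(round(cbind(observed = z_obs, uncond = uncond[obs], cond = cond[obs]), 2))
```

Only the conditional realization matches the observations at the observed nodes; between them, its variance shrinks towards the data, as the diagonal of `C_cond` shows.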

Page 21: Applied_geostats_intro

Conditional simulation in gstat

Here the data must be named, so the dummy=TRUE option is not used. The beta parameter may be given (usually as estimated from the data); if not it is estimated by GLS.

> mean(log(cadmium))
[1] 0.5610659
> sims <- krige(log(cadmium) ~ 1, ~ x + y, meuse, meuse.grid,
+     model = m2, nmax = 64, nsim = 6, beta = mean(log(cadmium)))
[using conditional gaussian simulation]
> levelplot(z ~ x + y | name, map.to.lev(sims, z = c(3:8)), aspect = mapasp(sims),
+     main = "six conditional realisations of a correlated Gaussian field")

Page 23: Applied_geostats_intro

Indicator simulation

(See notes “Indicator Kriging”)

Indicator variables can also be simulated. Here the result is a 0/1 variable: indicator false/true. This is unlike IK, where the result is a probability of a 1 (indicator is true).

In gstat, the target value must be an indicator, and the indicator=TRUE argument must be included. The mean is estimated by the proportion of true indicators in the sample.

> threshold <- 4
> indicator <- (cadmium >= threshold)
> vi <- variogram(indicator ~ 1, ~ x + y, meuse)
> mi.f <- fit.variogram(vi, vgm(0.09, "Gau", 500, 0.11))
> sims <- krige(indicator ~ 1, ~ x + y, meuse, meuse.grid,
+     model = mi.f,
+     nsim = 6, indicator = TRUE,
+     nmax = 64, beta = sum(indicator) / length(indicator))
> levelplot(z ~ x + y | name, map.to.lev(sims, z = c(3:8)), aspect = mapasp(sims),
+     main = "Six conditional realisations of an indicator variable")
