Lyster MacCormack Data Spacing - Alberta Energy Regulator
Transcript of Lyster MacCormack Data Spacing - Alberta Energy Regulator
Data Spacing for Geological ModellingSteve LysterKelsey MacCormackAlberta Geological Survey402, Twin Atria Building4999-98 AvenueEdmonton, ABwww.ags.gov.ab.ca
One of the most frequently asked questions to geomodellers is, “How much data do you need?” Knowing how many picks are necessary, how much log analysis should be done, how many core samples must be collected is an open question. The answer is not straightforward due to changes in the variables being modelled, the geology under consideration, the scale of investigation, and the objective of the model.
Introduction
Synthetic Grids
Duvernay Formation Shale
References
Four synthetic grids of varying complexity were created to explore the effects of data density and sampling arrangements on modelling errors. Figure 7 shows the four grids. Each grid (1-4) consists of 10201 grid nodes that were sampled using multiple sampling schemes varying the number of data points (n=49, 81, 196, and 400) using 3 different sampling arrangements (regular, random, and clustered). These sampling schemes are meant to mimic the number of data available and sampling arrangements for a real project. Table 1 shows a summary of the sampling. Figure 8 shows the sample data extracted from grid 3.
Cardium Formation Tight Sandstone
MethodologyTwo different approaches were used to explore the relationship between data spacing/density and uncertainty. The unconditional simulation method of Wilde (2010) and Wilde and Deutsch (2013) was used to quantify the relationship between data spacing and uncertainty based on real data in the Duvernay and Cardium formations. This is a geostatistical simulation-based approach that resamples reference realizations at a variety of data spacings to quantify uncertainty in further realizations conditioned to the previously-simulated grids. This approach accounts for the univariate distribution and spatial structure of the data and is useful for situations where the true underlying variable is not exhaustively sampled.
In addition, synthetic grids were created to allow sampling at different spacings and for different sampling schemes. This approach has the advantage of using spatial variables that are defined at all locations, and so the models created using the artificial data can be compared to true values. The grids are modelled after real geologic features of varying complexity.
Three variables were used from the Duvernay Formation: net shale thickness, total organic carbon content, and porosity. These data come from well picks and log analyses used to assess the Duvernay in Rokosh et al. (2012); Figure 1 shows the locations of the data points. Figure 2 shows histograms of the data distributions. Net shale thickness is the most skewed and has the highest coefficient of variation, and TOC is the most symmetrical and has the lowest coefficient of variation.
Six data spacings were considered using the method of Wilde and Deutsch (2013): 800, 1600, 3200, 4800, 9600, and 19200 m. These correspond to data densities of 144, 36, 9, 4, 1, and 0.25 data per township. The spatial extent of the simulated grids was 134 km by 150 km, with about half of the cells within the grids being active (i.e., within the Duvernay extents). This is a large domain compared to the data spacing, meaning that local uncertainty is an important consideration in areas of higher or lower sampling density.
Rokosh, C.D., Lyster, S., Anderson, S.D.A., Beaton, A.P., Berhane, H., Brazzoni, T., Chen, D., Cheng, Y., Mack, T., Pana, C. and Pawlowicz, J.G. (2012): Summary of Alberta's shale- and siltstone-hosted hydrocarbon resource potential; Energy Resources Conservation Board, ERCB/AGS Open File Report 2012-06, 327 p.Wilde, B.J. (2010): Data Spacing and Uncertainty; M.Sc. thesis, University of Alberta, 103 p.Wilde, B.J. and Deutsch, C.V. (2013): Methodology for quantifying uncertainty versus data spacing applied to the oil sands; CIM Journal, vol. 4, no. 4, p. 211-219.
The porosity-thickness variable (Phi-H) was chosen as a proxy to represent the uncertainty in the Cardium Formation (D. Bechtel, unpublished data, 2013). Figure 4 shows the locations of the data points. Figure 5 shows a histogram of the data distribution. Seven data spacings were considered: 800, 1600, 3200, 4800, 6400, 8000, and 9600 m; corresponding to data densities of 144, 36, 9, 4, 2.25, 1.44, and 1 data per township. Figure 6 shows graphs of data spacing and data density vs. average local uncertainty. The spatial extent of the simulated grids was slightly larger than four townships in area, 21.2 km by 23.2 km. The domain is relatively small compared to the data spacing intervals considered, meaning the sampled univariate distribution had a significant impact on the results relative to the spatial distribution of data.
DiscussionThe examples presented here explore the relationship between data spacing (or density) and uncertainty. While it is very difficult to determine broadly applicable rules, some general guidelines and pitfalls are presented below.
Geological ComplexityThe Duvernay example is similar to the more complex grids (3 and 4) from the synthetic data, in that there is more complicated spatial structure and significant randomness that cannot be predicted, only accounted for. The Cardium example is similar to the simplest synthetic grid (1) in that there are more gradual changes and a clearer spatial structure. More complex geology requires more data to achieve the same level of confidence as simpler geology.
Diminishing ReturnsFrom the data density vs. uncertainty graphs (Figures 3, 6, and 10), it can be seen that the first several data per township provide the most value; that is, there are diminishing returns as more data are added. The more complex geological scenarios have a greater decrease in uncertainty within the first few wells per township. The simpler scenarios have a shallower curve, suggesting that the returns diminish quicker when the geology is well-behaved and more predictable as long as the underlying univariate distribution has been sufficiently sampled.
Data ClusteringThe clustered data sets from the synthetic example (Figure 8) show worse RMSE and bias over the entire modelled area than the other sampling schemes (Figures 9 and 10). In particular, additional sampling data from the more complex grids (3 and 4) adds little value when that data is clustered. If the important features (channels in this case) are not found already, clustering only reinforces existing data biases. Clustering adds risk to the modelling unless some other information, such as seismic, is available to guide the sampling. For simpler scenarios without the all-or-nothing channel features there is less risk in the clustering.
Domain SizeThe Duvernay data domain is quite large compared to the data spacing (Figure 1). This means that there are subareas within the domain that effectively have different data densities. The spatial uncertainty in this case is a very important factor, and sparser areas can be infilled to provide more value. There are also significant changes in the variables of interest over the domain that make some minimum level of sampling in all areas necessary to detect a trend. The Cardium domain is relatively small compared to the data spacing (Figure 4). In this case the univariate sampling distribution has a larger impact and the spatial uncertainty is less of a concern.
The sampling schemes resulted in 48 datasets that were brought into Petrel 2013. Each dataset was interpolated to the full spatial extent of the original synthetic grids. The grids nodes (n=10201) of all 48 surfaces were extracted and compared back to the original synthetic grids. This allowed us to assess the impact of data spacing and sampling distribution on the prediction accuracy of surfaces of variable complexity.
Root-mean-square-error (RMSE) was chosen as a representative statistic for the modelling results. The mean error was also used to assess the bias resulting from having limited data in different arrangements.
Figure 9 shows the cumulative histograms of errors for the modelled surfaces of grid 3. Similar histograms were made for all of the models, but are omitted for space. The regular sampling patterns have the narrowest and most symmetrical distributions of errors, with the random sampling results being only slightly worse. The clusted data sets produced the largest errors and most biased results, although there is a noticeable vertical portion of the distribution near zero error that represents the modelled cells near the data clusters.
Figure 10 shows the relationship between number of data, RMSE, and bias. In general more data produced better results (lower RMSE and mean error), although for grid 3 the clustered 81 data point set the sampling happened to miss several of the channels. Grid 4 is very complex and therefore the clustered sampling produced mixed results.
Figure 3 shows graphs of data spacing and data density vs. the average local coefficient of variation. The coefficient of variation was chosen to represent the local uncertainty because it implicitly corrects for the proportional effect, that is, higher variance in areas of higher data values.
0
10
20
30
40
50
grid1 Exhaustive Data
East
Nor
th
0 1010
101
grid3 Exhaustive Data
East
Nor
th
0 1010
101
grid2 Exhaustive Data
East
Nor
th
0
101
1010
grid4 Exhaustive Data
East
Nor
th
0
101
1010
Locations of 49 regular data
0 20 40 60 80 100
0
20
40
60
80
100
Locations of 49 random data
0 20 40 60 80 100
0
20
40
60
80
100
Locations of 49 clustered data
0 20 40 60 80 100
0
20
40
60
80
100
Locations of 81 regular data
0 20 40 60 80 100
0
20
40
60
80
100
Locations of 81 random data
0 20 40 60 80 100
0
20
40
60
80
100
Locations of 81 clustered data
0 20 40 60 80 100
0
20
40
60
80
100
Locations of 196 regular data
0 20 40 60 80 100
0
20
40
60
80
100
Locations of 196 random data
0 20 40 60 80 100
0
20
40
60
80
100
Locations of 196 clustered data
0 20 40 60 80 100
0
20
40
60
80
100
Locations of 400 regular datLocations of 400 regular data
0 2020 4040 6060 8080 100100
0
2020
4040
6060
8080
100100
Locations of 400 random data
0 20 40 60 80 100
0
20
40
60
80
100
Locations of 400 clustered data
0 20 40 60 80 100
0
20
40
60
80
100
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
-80 -60 -40 -20 0 20 40 60 80 100
Cum
ula
ve F
requ
ency
Error
Errors for Grid 3, Regular Data
400 data
196 data
81 data
49 data
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
-60 -40 -20 0 20 40 60 80 100
Cum
ula
ve F
requ
ency
Error
Errors for Grid 3, Random Data
400 data
196 data
81 data
49 data
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
-100 -50 0 50 100 150 200 250 300
Cum
ula
ve F
requ
ency
Error
Errors for Grid 3, Clustered Data
400 data
196 data
81 data
49 data
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0 5000 10000 15000 20000
Coeffi
cien
t of V
aria
on
Data Spacing (m)
Duvernay Data Spacing vs. Uncertainty
Net ShaleTOCPorosity
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0 20 40 60 80 100 120 140 160
Coeffi
cien
t of V
aria
on
Data Density (Data per Township)
Duvernay Data Density vs. Uncertainty
Net ShaleTOCPorosity
Peace RiverArch
Southern Leduc Shelf
Grosmont Carbonate
Platform
DeformedBelt
0 40 80 120 160 200 km
Duvernay FormationWells locations used for mapping
Duvernay Fm
Reef/Carbonate Platform (Switzer et al. 1994)
Urban areas
0 10 20 30 40 50 km
Pembina Cardium PoolCardium Phi-h data
Pembina Cardium Pool
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0 2000 4000 6000 8000 10000 12000
Coeffi
cien
t of V
aria
on
Data Spacing (m)
Pembina Cardium Data Spacing vs. Uncertainty
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0 20 40 60 80 100 120 140 160
Coeffi
cien
t of V
aria
on
Data Density (Data per Township)
Pembina Cardium Data Density vs. Uncertainty
Freq
uenc
y
Net Shale Thickness (m)
0.0 10.0 20.0 30.0 40.0 50.0 60.0
0.000
0.020
0.040
0.060
0.080
0.100
Duvernay Net ShaleNumber of Data 214
mean 15.16std. dev. 13.19
coef. of var 0.87
maximum 61.75upper quartile 20.70
median 10.96lower quartile 5.24
minimum 0.13
Freq
uenc
y
TOC Content (%)
0.00 1.00 2.00 3.00 4.00 5.00
0.000
0.020
0.040
0.060
0.080
0.100
Duvernay TOCNumber of Data 213
mean 2.55std. dev. 0.84
coef. of var 0.33
maximum 4.70upper quartile 3.08
median 2.53lower quartile 2.01
minimum 0.72
Freq
uenc
y
Porosity (Fraction)
0.000 0.050 0.100 0.150 0.200
0.000
0.040
0.080
0.120
Duvernay PorosityNumber of Data 144
mean 0.078std. dev. 0.035
coef. of var 0.447
maximum 0.193upper quartile 0.097
median 0.076lower quartile 0.051
minimum 0.014
Freq
uenc
y
PhiH
0.00 0.40 0.80 1.20
0.000
0.040
0.080
0.120
Pembina Cardium DataNumber of Data 91
mean 0.5480std. dev. 0.2708
coef. of var 0.4942
maximum 1.3700upper quartile 0.6700
median 0.5500lower quartile 0.3800
minimum 0.0000
-2
-1
0
1
2
3
4
5
6
7
8
0
2
4
6
8
10
12
14
0 100 200 300 400 500
Mean Error / Bias (Do
ed Lines)
RMSE
(Sol
id L
ines
)
Number of Data
Grid 1 Model Errors
-10
-5
0
5
10
15
20
25
0
10
20
30
40
50
60
70
0 100 200 300 400 500
Mean Error / Bias (Do
ed Lines)
RMSE
(Sol
id L
ines
)
Number of Data
Grid 2 Model Errors
-5
0
5
10
15
20
25
30
35
40
0
10
20
30
40
50
60
70
0 100 200 300 400 500
Mean Error / Bias (Do
ed Lines)
RMSE
(Sol
id L
ines
)
Number of Data
Grid 3 Model Errors
-10
-5
0
5
10
15
20
25
30
0
5
10
15
20
25
30
35
40
45
50
0 100 200 300 400 500
Mean Error / Bias (Do
ed Lines)
RMSE
(Sol
id L
ines
)
Number of Data
Grid 4 Model Errors
Data Points (Total # of data points)
Nominal Data Spacing (km)
Nominal Data Density (Wells per township)
49 12.3 0.6 81 9.6 1.0 196 6.2 2.4 400 4.3 4.9
Figure 3. Graphs of data spacing and density vs. average local uncertainty.
Figure 2. Histograms of Duvernay data.
Figure 1. Location map of Duvernay data.
Table 1. Data sampling schemes for synthetic grids.
Figure 4. Location map of Cardium data.
Figure 5. Histogram of Cardium data.
Figure 6. Graphs of data spacing and density vs. average local uncertainty.
Figure 7. Synthetic grids used in the example. Figure 8. The sample data extracted from grid 3, at four different data densities and three different arrangements. Figure 9. Cumulative histograms of themodelling errors for grid 3.
Figure 10. Number of data vs. RMSE and mean error (bias) .Blue is regular sampling; Red is random sampling; Green is clustered sampling.