Spatial data representation: an improvement of statistical dissemination for … · Spatial data...

Spatial data representation: an improvement of statistical dissemination for policy analysis

Edoardo Pizzoli1, Chiara Piccini2

1Istat, Istituto Nazionale di Statistica, e-mail: [email protected] 2Freelance professional, e-mail: [email protected]

Abstract

Spatialization of statistical information is often understood as the distribution of statistical data within administrative areas, each represented as homogeneous. Because of the low spatial resolution, spatial differences within the data often cannot be revealed. When point georeferenced data are available, geostatistical interpolation methods make it possible to produce a spatial representation of statistical information that is closer to reality and not necessarily tied to administrative boundaries. Maps obtained with these techniques not only enhance the dissemination of statistical information at territorial level, but also represent a valuable tool for policy makers. This paper presents the results of a study performed on economic data at NUTS3 level (Province) in Italy. Using GIS technology, it is possible to explore the relationship between the distribution of statistical data and some geographical features in the same region.

Keywords: Spatial distribution, Geostatistics, Rural development

Acknowledgements: This study was performed within the activities of the Work Package 9 of the SSH FP7 Project BLUE-ETS Grant agreement no. 244767, funded by the European Commission, DG Research and Innovation, Unit L Science, Economy and Society, Seventh Framework Programme, Theme 8, Socio-economic Sciences and Humanities.

1. Introduction

Statistical units as defined by international manuals for official statistics, such as Institutional and (local) kind-of-activity units, provide information on the unit itself at the time of interview (as an answer to a questionnaire) and on its components (individuals, establishments, land, etc.). Since these units are geographically identified at a specific time, the hypothesis tested in this work is that they carry information on the territory they belong to. Their individual characteristics are thus proxies of the territorial characteristics.
Some socio-economic features of geographical areas (well-being, wealth, competitiveness, etc.) are defined by theory but are not directly observable nor, as such, strictly measurable. For other characteristics, even if directly measurable, data are not available over the whole territory. All of them are related to the earth's surface, to the natural environment and to land cover: the underlying idea is that the natural environment significantly differentiates human opportunities and behaviour. Data available at the level of statistical units can be used to extend information on administrative areas whenever the studied phenomena show an empirical spatial correlation. If such a statistical relationship exists among a set of units with respect to a specific characteristic, a spatial analysis is possible. A map or cartogram is then used to represent the phenomena under investigation in space.

Geostatistical methods are usually applied to the analysis of natural phenomena. The main assumption is that point values are influenced by the values at nearby points (spatial autocorrelation), and that the phenomenon is continuously distributed over the territory. Natural phenomena often vary continuously over the territory (their variation is not completely random), but they can be measured only at a few specific points (samples). Several methods can then be used to represent their actual spatial distribution from sample point data.

With socio-economic variables the procedure is to some extent the opposite: variables measured at each point are discrete by nature. They can be processed by geostatistical methods assuming both that their values are spatially dependent and that the analysed phenomenon is continuously distributed over the territory. These assumptions are justified by the nature of the phenomenon itself: the basic hypothesis is that the analysed units can be treated as measurement points of phenomena distributed over the whole territory.
For socio-economic data, spatial autocorrelation is not always verified for the whole dataset, but it should be verified for a limited group of data (Demetrio and Guerreschi 2010). The basic objective is not an “accurate” or “exact” geostatistical analysis: the estimation does not provide an “exact” picture of how the phenomenon is distributed over the territory, but rather its basic tendency and its areas of strength and weakness, implicitly assuming that the phenomenon has a clear spatial dimension over the territory (Celata 2005). The graphical result is effective and improves on a thematic cartogram, which forces the data into administrative boundaries that are often arbitrarily set.

2. Materials and methods

2.1 Spatial dependence and geostatistics

Spatial data are data in which each value is associated with a location in space, with at least an implied connection between the location and the value. Spatial dependence describes how conditions tend to persist locally and how the surface can be divided into regions that exhibit substantial internal similarity (de Smith et al. 2011). It is the degree of relationship that exists between two or more spatial variables, such that when one changes, the other(s) also change. The change can be in the same direction (positive autocorrelation) or in the opposite direction (negative autocorrelation) (Cliff and Ord 1981). Spatial dependence leads to the spatial autocorrelation problem in statistics, since it violates the assumption of independence among observations made by standard statistical techniques. Spatial autocorrelation is defined as the variation of a property within a geo-space: characteristics at proximal locations appear to be correlated, either positively or negatively. Spatial autocorrelation is the subject matter of geostatistics.

Spatial statistics is a term commonly applied to the analysis of discrete objects (e.g. points, areas), and is defined by both its material (spatial datasets) and its methods. Three main topics can be identified in spatial statistics: i) point pattern analysis, corresponding to a location-specific view of the data; ii) lattice or regional analysis, corresponding to zonal models of space; iii) geostatistical modelling, applying to a continuous field view of the underlying dataset (Cressie 1993).

When representing statistical data on a map, one of the main problems to be solved is how to obtain information at unsampled locations from measured data. Spatial prediction models (algorithms) can be classified according to the amount of statistical analysis they include. In deterministic models, arbitrary or empirical model parameters are used (e.g. Inverse Distance Weighting, Radial Basis Function). In statistical models, the model parameters are estimated in an objective way, following probability theory (e.g. geostatistics). In general, deterministic prediction models are more primitive than statistical models and often sub-optimal; however, there are situations where they perform as well as statistical models or even better. No estimate of the model error is available for predictions with deterministic models, while predictions with statistical models are accompanied by an estimate of the prediction error. A drawback of statistical models is that the input dataset usually needs to satisfy strict statistical assumptions.
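As an illustration of a deterministic predictor, the following is a minimal sketch of Inverse Distance Weighting in Python with NumPy. It is not the implementation used in this study: the function name `idw_predict`, the power parameter of 2 and the four sample points are assumptions chosen only to show how the weights are formed.

```python
import numpy as np

def idw_predict(xy, z, targets, power=2.0):
    """Inverse Distance Weighting: weighted mean of samples, weights = 1 / d**power."""
    xy = np.asarray(xy, dtype=float)
    z = np.asarray(z, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Distance from every target to every sample point
    d = np.linalg.norm(targets[:, None, :] - xy[None, :, :], axis=2)
    out = np.empty(len(targets))
    for i, row in enumerate(d):
        hit = row == 0.0
        if hit.any():
            # Target coincides with a sample: return the observed value
            out[i] = z[hit].mean()
        else:
            w = 1.0 / row**power
            out[i] = np.sum(w * z) / np.sum(w)
    return out

# Hypothetical sample points at the corners of a 10 x 10 square
xy = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
z = np.array([100.0, 200.0, 300.0, 400.0])
idw_at_centre = idw_predict(xy, z, [[5.0, 5.0]])[0]
idw_at_sample = idw_predict(xy, z, [[10.0, 0.0]])[0]
```

The power parameter is chosen by the analyst rather than estimated from the data, and the function returns no measure of prediction error, which is the limitation of deterministic models noted above.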
Basic geostatistical analysis usually has the following steps: i) calculation of the semivariogram, a function that describes the differences (variance) between samples separated by varying distances; ii) estimation of the parameters of the semivariogram model; and iii) estimation of the surface using a kriging algorithm, an optimal interpolator that generates the best linear unbiased estimate at each location from the semivariogram model (Goovaerts 1997; Isaaks and Srivastava 1989). The most commonly used kriging algorithm is Ordinary Kriging (OK).

A validation of the estimation can be added. Cross validation is a technique used to assess how accurate an interpolation model is. One point is left out and the rest are used to predict a value at that location. The point is then added back into the dataset, and a different one is removed. This is done for all samples in the dataset and provides pairs of predicted and known values that can be compared to assess the model's performance. Results are usually summarized as the Mean Error and the Root Mean Squared Error (ME and RMSE).

A normal distribution of the data is usually a prerequisite for the application of geostatistics: OK may give unacceptable results if the data are severely non-normal. Indicator kriging (IK) is another geostatistical approach to geospatial modelling, which makes no assumption of normality and is essentially a non-parametric counterpart to OK. IK uses indicator (0 or 1) variables to generate probabilities that a critical value was exceeded or not at each location in the study area, and then proceeds in the same way as ordinary kriging (Journel 1983). Because the indicator variables are 0 or 1, the interpolations will be between 0 and 1, and predictions from indicator kriging can be interpreted as probabilities of the variable being in the class that is indicated by 1. If a threshold was used to create the indicator variable, the resulting interpolation map shows the probabilities of exceeding (or being below) the threshold. This makes it possible to produce probability maps and risk maps. For the theory of geostatistics, see for example Clark (1979), Isaaks and Srivastava (1989), Goovaerts (1997), Houlding (2000).

2.2 Dealing with outliers

As said before, OK is usually applied to data that show a normal distribution. A problem that should be addressed, especially when processing socio-economic data, is the possible presence of outliers in the dataset. Outliers may provide useful information about the magnitude of the phenomenon, but may also unduly influence the results of the analysis unless some adjustment is made. A possible solution is a logarithmic transformation of the variable, which brings the distribution closer to normal but flattens out the data, completely losing the information carried by outliers.

Winsorising (or Winsorization) is the transformation of statistics by limiting extreme values in the data to reduce the effect of possibly spurious outliers. Winsorization involves replacing a fixed number of extreme scores with the score that is closest to them in the tail of the distribution in which they occur. That is, the extreme values are moved toward the centre of the distribution. The data Winsorising procedure is as follows:
1. a priori identification of a threshold beyond which data are considered outliers; typically the threshold is set to a specified percentile of the data, to the mean plus standard deviation, or to the median plus the interquartile range;
2. proper Winsorization: data exceeding the upper threshold (or under the lower one) are replaced by the threshold value.

Note that Winsorising is not equivalent to simply excluding data (Trimming). In a trimmed dataset, the extreme values are discarded; in a Winsorized dataset, the extreme values are instead replaced by other values.
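The difference between the two treatments can be sketched in a few lines of Python with NumPy. This is an illustrative sketch only: the function names, the sample values and the use of the sample standard deviation (`ddof=1`) are assumptions, and the threshold is set here to the mean plus k standard deviations on the upper tail only.

```python
import numpy as np

def upper_threshold(values, k=2.0):
    """Threshold = mean + k * sample standard deviation (ddof=1 is an assumption)."""
    values = np.asarray(values, dtype=float)
    return values.mean() + k * values.std(ddof=1)

def winsorise_upper(values, k=2.0):
    """Replace values above the threshold with the threshold itself."""
    values = np.asarray(values, dtype=float)
    return np.minimum(values, upper_threshold(values, k))

def trim_upper(values, k=2.0):
    """Discard values above the threshold."""
    values = np.asarray(values, dtype=float)
    return values[values <= upper_threshold(values, k)]

# Hypothetical values with a single extreme observation
rn = np.array([20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 2000.0])
threshold = upper_threshold(rn)
winsorised = winsorise_upper(rn)
trimmed = trim_upper(rn)
```

In this example the winsorised array keeps all ten observations, with the extreme one pulled down to the threshold, while the trimmed array keeps only nine.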
Winsorization has the advantage of retaining all the observations while assigning a different weight to those in the distribution tails. This is particularly useful with a small dataset, because all samples and the associated information are still considered. The transformation can be applied to a single tail of the distribution or symmetrically to both of them. An asymmetric Winsorization introduces a certain bias in the estimation.

2.3 An application example to the Province of Palermo

For the analysis of rural development, data for the nine Provinces of the Sicily Region were available. Data came from ASIA-Agricoltura, the Business Register of Active Enterprises that includes all the enterprises with a primary or secondary activity in agriculture (year 2009). The process of representing data has the following steps:
1. specification of the territorial administrative level of interest;
2. normalization of the units’ addresses (data quality control);
3. assignment of geographical coordinates to units, starting from addresses;
4. correction of errors on coordinates;
5. mapping point location, visualizing their spatial distribution;
6. variographic analysis of data;
7. estimation in non-sampled points by means of the appropriate algorithm;
8. mapping estimated values over the territory;
9. cross-validation to test the accuracy of the estimation.

Normalization and georeferencing of the addresses are carried out by the EGON software, which analyses and decodes territorial data so that they can be mapped accurately. Starting from an address list, the software provides the geographical coordinates of the corresponding points. For various reasons, some addresses may not be recognized by the software: in this case, instead of the true coordinates, the software gives an approximation, namely the coordinates of the centroid of the census section or even of the Municipality. If these “wrong” records are few, they are located with Google Earth and the right coordinates are transcribed manually. If the wrong records are due to takeover errors and/or data entry errors, and the addresses cannot be identified anyway, they are discarded and not used in the subsequent processing.

Once the data were georeferenced, indicators of the presence, economic intensity and productivity of agro-industrial enterprises and farms were considered: number of employees, agricultural revenues, total revenues, revenues/number of employees, agricultural/total revenues, and number of employees in agricultural activities. The selected indicators are relevant for rural development, even if they are neither exclusive nor exhaustive of the phenomenon. Each Province was treated separately. As an example of several solutions for data processing and representation, results for the Province of Palermo are reported, considering the revenues/number of employees ratio (R/N).
The variographic analyses were performed using the software VarioWin 2.21 (Pannatier 1995); mapping and cross validation were performed using the Geostatistical Analyst extension of the software ArcGIS 9.2®.

3. Results and discussion

The frequency distribution of the variable R/N is non-normal and shows some outliers (Fig. 1a), which are not measurement errors. This is not an optimal situation for the application of geostatistical methods; the variable can be transformed in order to obtain a distribution that approaches normality. Three different transformations were performed: a logarithmic one, a winsorization using as threshold the mean value plus two times the standard deviation (which identifies just one outlier), and a trimming at the same threshold. In Fig. 1 the new frequency distributions are reported.

[Figure 1: four histograms of relative frequency. Panels: a) original dataset (R/N); b) logarithmic transformation (LogR/N); c) winsorized dataset (Wins_R/N); d) trimmed dataset (R/N).]

Figure 1 – Frequency distribution of the variable R/N in the Province of Palermo, original and transformed data

Considering just the frequency distribution, the logarithmic transformation gives the best result among the proposed ones. Winsorized and trimmed datasets show a better frequency distribution than the original dataset, but still far from normal. For spatial analysis and representation, however, other factors must be taken into account. As a first spatial approach, a deterministic interpolator, the Radial Basis Function (RBF), was applied to the original dataset to predict the R/N distribution over the territory; the resulting map is reported in Fig. 2.

Figure 2 – Map for the variable R/N in the Province of Palermo, spatialized by RBF

The presence of spatial autocorrelation and its spatial structure for the original variable was also verified by means of a variographic analysis. An omnidirectional semivariogram (assuming that there is no anisotropy, that is, the semivariogram shape does not change in different directions) was built, adopting a lag of 5100 m for the Province of Palermo, a choice based on the sampling density. In Fig. 3 the obtained empirical semivariogram is reported, together with the one for the logarithmic transformed variable.

a) R/N

b) log(R/N)

Figure 3 – Empirical semivariograms for the variables R/N and log(R/N) in the Province of Palermo, with adjusted models
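For readers who want to reproduce this step outside VarioWin, an omnidirectional empirical semivariogram can be sketched as follows. The code computes, for each lag class, half the mean squared difference between all pairs of points whose separation falls in that class. It is a simplified sketch under stated assumptions: lag classes are taken as consecutive intervals of width `lag` starting at zero (the software used in the study may define lag tolerances differently), and the coordinates and values are invented for the example.

```python
import numpy as np

def empirical_semivariogram(xy, z, lag, n_lags):
    """Omnidirectional semivariogram: gamma(h) = sum((z_i - z_j)**2) / (2 * N(h))."""
    xy = np.asarray(xy, dtype=float)
    z = np.asarray(z, dtype=float)
    # All unordered pairs of points
    i, j = np.triu_indices(len(z), k=1)
    dist = np.linalg.norm(xy[i] - xy[j], axis=1)
    sqdiff = (z[i] - z[j]) ** 2
    # Lag class b covers distances in [b * lag, (b + 1) * lag)
    bins = np.floor(dist / lag).astype(int)
    centres, gamma, counts = [], [], []
    for b in range(n_lags):
        mask = bins == b
        n = int(mask.sum())
        if n == 0:
            continue  # skip empty lag classes
        centres.append(dist[mask].mean())
        gamma.append(sqdiff[mask].sum() / (2.0 * n))
        counts.append(n)
    return np.array(centres), np.array(gamma), np.array(counts)

# Hypothetical transect of four points, 5000 m apart, with a 5100 m lag
xy = np.array([[0.0, 0.0], [5000.0, 0.0], [10000.0, 0.0], [15000.0, 0.0]])
z = np.array([1.0, 2.0, 4.0, 7.0])
centres, gamma, counts = empirical_semivariogram(xy, z, lag=5100.0, n_lags=3)
```

The number of pairs per lag class is returned alongside the semivariance because classes supported by few pairs give unstable estimates and are usually given less weight when a model is fitted.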

In both cases the spatial autocorrelation is weak, but a model can be fitted to the empirical semivariograms: Spherical for R/N and Exponential for log(R/N). The model parameters can then be used in the Ordinary Kriging (OK) algorithm to estimate maps. The maps for the non-transformed variable R/N and for log(R/N), with the associated estimation error maps, are reported in Fig. 4.
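To make the link between the fitted model and the estimate concrete, the sketch below solves the Ordinary Kriging system at a single location. It is not the ArcGIS implementation: the parameterisation of the models (`sill` as total sill, factor 3 for the practical range of the exponential model), the parameter values and the sample points are all assumptions made for illustration.

```python
import numpy as np

def spherical(h, nugget, sill, rng):
    """Spherical semivariogram; sill is the total sill, rng the range."""
    h = np.asarray(h, dtype=float)
    s = np.minimum(h / rng, 1.0)
    g = nugget + (sill - nugget) * (1.5 * s - 0.5 * s**3)
    return np.where(h == 0.0, 0.0, g)

def exponential(h, nugget, sill, rng):
    """Exponential semivariogram; rng is the practical range (assumed convention)."""
    h = np.asarray(h, dtype=float)
    g = nugget + (sill - nugget) * (1.0 - np.exp(-3.0 * h / rng))
    return np.where(h == 0.0, 0.0, g)

def ordinary_kriging(xy, z, target, model, **params):
    """Return (estimate, kriging variance, weights) at one target location."""
    xy = np.asarray(xy, dtype=float)
    z = np.asarray(z, dtype=float)
    target = np.asarray(target, dtype=float)
    n = len(z)
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
    # Kriging matrix: semivariances bordered by ones (unbiasedness constraint)
    a = np.ones((n + 1, n + 1))
    a[:n, :n] = model(d, **params)
    a[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = model(np.linalg.norm(xy - target, axis=1), **params)
    sol = np.linalg.solve(a, b)
    w, mu = sol[:n], sol[n]  # weights and Lagrange multiplier
    return float(w @ z), float(w @ b[:n] + mu), w

# Hypothetical samples and model parameters
xy = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
z = np.array([100.0, 200.0, 300.0, 400.0])
pars = dict(nugget=0.0, sill=1.0, rng=20.0)
est_centre, var_centre, w_centre = ordinary_kriging(xy, z, [5.0, 5.0], spherical, **pars)
est_sample, var_sample, _ = ordinary_kriging(xy, z, [10.0, 0.0], spherical, **pars)
```

The weights sum to one by construction, and the second returned value is the kriging variance that underlies a prediction error map. With a zero nugget the estimate at a sampled location reproduces the observed value.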

a) Estimated map (R/N)

b) Prediction error map (R/N)

c) Estimated map (logR/N)

d) Prediction error map (logR/N)

Figure 4 – Estimated maps for the variables R/N and log(R/N) in the Province of Palermo, and associated prediction error maps

Even though the frequency distribution of the data is not normal and the spatial autocorrelation is not strong, a comparison of the map in Fig. 4a with the one in Fig. 2 shows that the arrangement of the contour lines is very similar, and cross-validation gives a similar RMSE for the two maps (202.33 for the one estimated by RBF, and 202.06 for the one estimated by OK). OK is preferable, since it also provides the prediction error at each point of the territory. The estimation error is higher where measured points are fewer or missing.

Comparing the map in Fig. 4c with the former ones, the high values located in the central area have completely disappeared, and the contour lines are strongly flattened towards low values. Even though the prediction error is much lower than for the non-transformed variable, and the RMSE from cross validation is also lower (185.00), this map shows a flattened variable range with respect to the observed data (Fig. 1a). Thus, the map of Fig. 4a is still preferable for practical purposes.

As said before, a winsorization and a trimming were also performed on the original dataset, using the mean value plus two times the standard deviation as threshold. Two Hole Effect models were fitted to the empirical semivariograms of these transformed variables; the plots are not reported here, because the software used for calculation provides only the most common models (Spherical, Exponential, Gaussian). The map for winsorized R/N estimated by OK, and the associated estimation error map, are reported in Fig. 5.
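The ME and RMSE figures quoted from cross validation follow the leave-one-out scheme described in Section 2.1. A minimal sketch is given below; it accepts any predictor as a function, and a nearest-neighbour predictor is used only as a stand-in so that the example is self-contained. The function names and data are hypothetical and do not reproduce the values reported in this paper.

```python
import numpy as np

def nearest_neighbour(xy_train, z_train, target):
    """Stand-in predictor: value of the closest training point."""
    d = np.linalg.norm(xy_train - target, axis=1)
    return z_train[np.argmin(d)]

def leave_one_out(xy, z, predict):
    """Leave-one-out cross validation; errors are predicted minus observed."""
    xy = np.asarray(xy, dtype=float)
    z = np.asarray(z, dtype=float)
    errors = np.empty(len(z))
    for k in range(len(z)):
        keep = np.arange(len(z)) != k  # drop sample k, keep the rest
        errors[k] = predict(xy[keep], z[keep], xy[k]) - z[k]
    me = errors.mean()                    # Mean Error (bias)
    rmse = np.sqrt(np.mean(errors ** 2))  # Root Mean Squared Error
    return me, rmse, errors

# Hypothetical points along a line
xy = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [6.0, 0.0]])
z = np.array([10.0, 12.0, 15.0, 30.0])
me, rmse, errors = leave_one_out(xy, z, nearest_neighbour)
```

An ME close to zero indicates little systematic bias, while the RMSE is dominated by the largest individual errors, which is why a single outlier can inflate it considerably.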

a) Estimated map (winsorized dataset)

b) Prediction error map (winsorized dataset)

c) Estimated map (trimmed dataset)

d) Prediction error map (trimmed dataset)

Figure 5 – Estimated maps for winsorized R/N and trimmed R/N in the Province of Palermo, and associated prediction error maps

The map in Fig. 5a is quite similar to Fig. 2 and Fig. 4a with respect to the general arrangement of the contour lines, while the one in Fig. 5c is more similar to Fig. 4c. Since the value of the outlier has been lowered, the area with values belonging to the highest class is reduced. The estimation error is significantly lower than in the interpolation with the original dataset, and cross validation likewise gives a lower RMSE of 76.03. By contrast, for the map in Fig. 5c the deletion of the outlier causes the complete loss of the associated information, and even though cross validation gives the lowest RMSE (71.35), the estimation error is quite high everywhere. For our purposes, the best map could be the one in Fig. 5a.

A different solution for prediction is the application of IK. The variable R/N was first transformed into an indicator variable using as threshold the value 26, proposed by the software. In Fig. 6 the obtained empirical semivariogram for the indicator variable is reported.
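The indicator coding itself is a simple thresholding step, sketched below with invented R/N values. Only the threshold of 26 comes from the text; the function name and the data are assumptions. Values strictly above the threshold are coded as 1, all others as 0.

```python
import numpy as np

def indicator(values, threshold):
    """Code each value as 1.0 if it exceeds the threshold, else 0.0."""
    return (np.asarray(values, dtype=float) > threshold).astype(float)

# Hypothetical R/N values
rn = np.array([12.0, 26.0, 27.5, 40.0, 8.0, 310.0])
ind = indicator(rn, 26.0)
share_above = ind.mean()  # proportion of units exceeding the threshold
```

Because the coded variable only records whether the threshold is exceeded, an extreme value such as 310 has the same influence as any other value above 26, which is why the approach is insensitive to outliers.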

Figure 6 – Empirical semivariogram for the indicator variable with the adjusted model

The spatial autocorrelation is almost absent, but an Exponential model can be fitted to the empirical semivariogram, and its parameters were then used in the Kriging algorithm to estimate a probability map. The resulting map, indicating the probability of exceeding the threshold of 26 at each point, and the associated prediction error map, are reported in Fig. 7.

a) Probability map

b) Error map

Figure 7 – Map showing the probability of exceeding the threshold of 26 in the Province of Palermo, and associated prediction error map

The highest probability of exceeding the threshold is found in the area west of the city of Palermo, where a cluster of points is present. All these maps can be loaded into GIS software in order to compare them easily with other thematic maps of the same area, and to study the relationships between the distribution of statistical data and the local geography.

4. Conclusions

Estimated maps of socio-economic indicators, based on available micro-data at the level of statistical units, represent a clear enhancement in the dissemination of statistical information, and become a valuable tool for policy makers who are responsible for a specific administrative area. Visualizing statistical information on a map does not just add a further dimension (space) to the data: it improves socio-economic information by linking it to geographical characteristics, both below (small areas) and across (macro-areas) the target administrative areas. Moreover, the maps are useful to identify areas of policy intervention, to plan and evaluate actions, to perform simulations and to build future scenarios. The corresponding error map is a further improvement, showing the limitations and the reliability of the statistical information practically available at the time of the policy decision. The choice of the interpolation method depends essentially on the type of available data and, above all, on the objective of the analysis.

References

Celata, F. (2005). La geografia del sistema alberghiero romano e i sistemi ricettivi periurbani. Working Papers del Dipartimento di Studi Geoeconomici Linguistici Statistici Storici per l’Analisi Regionale, n. 29 (in Italian).

Clark, I. (1979). Practical geostatistics. Applied Science Publishers, London. Available at: URL=http://www.kriging.com/PG1979/ (Accessed January 2013).

Cliff, A., and Ord, J. (1981). Spatial processes: models and applications. Pion, London.

Cressie, N.A.C. (1993). Statistics for spatial data. John Wiley, New York.

Demetrio, V., and Guerreschi, P. (2010). Analisi geostatistiche dei fenomeni socio-economici: premesse metodologiche. GIS Day 2010 – I GIS nella didattica e nella ricerca, 1 dicembre 2010, Castello del Valentino, Torino (in Italian).

de Smith, M., Goodchild, M., and Longley, P. (2011). Geospatial Analysis: A Comprehensive Guide to Principles, Techniques and Software Tools. Available at: URL=http://www.spatialanalysisonline.com (Accessed December 2012).

Goovaerts, P. (1997). Geostatistics for natural resources evaluation. Oxford University Press, New York.

Houlding, S.W. (2000). Practical Geostatistics: Modeling and Spatial Analysis. Springer-Verlag, Berlin.

Isaaks, E.H., and Srivastava, R.M. (1989). An introduction to applied geostatistics. Oxford University Press, New York.

Journel, A. (1983). Nonparametric estimation of spatial distribution. Mathematical Geology, 15(3), 445-468.

Pannatier, Y. (1995). Software VarioWin 2.21. Institute of Mineralogy, University of Lausanne, Switzerland.
