ESPON 2013 Programme Workshop Managing Time Series and Estimating Missing Values 6 May 2010
www.StratAG.ie
Outlier Detection and the Estimation of Missing Values
Martin Charlton and Paul Harris
National Centre for Geocomputation, National University of Ireland Maynooth
Maynooth, Co Kildare, IRELAND
ESPON 2013 Programme Workshop: Managing Time Series and Estimating Missing Values
6 May 2010, Luxembourg
Outline
• Time Series
• ESPON DB data issues
• Detecting exceptional values
• Estimation of missing values
• Case study
1: Time Series
What is a time series?
• A variable which is measured sequentially in time at fixed sampling intervals is known as a time series
• The behaviour of such series can be modelled
• The main features of time series are trend and (sometimes) seasonal variation
• Observations which are close together in time tend to be correlated
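The last point can be checked numerically. A minimal sketch, assuming a hypothetical trending series (not data from the workshop), of the sample autocorrelation at a given lag:

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    # Total sum of squares about the mean (the lag-0 term).
    denom = sum((x - mean) ** 2 for x in series)
    # Cross-products between each value and the value `lag` steps later.
    num = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return num / denom


# Hypothetical series with a steady upward trend (illustration only).
trend_series = [100 + 5 * t for t in range(24)]
r1 = autocorrelation(trend_series, lag=1)  # neighbouring observations
r6 = autocorrelation(trend_series, lag=6)  # observations six steps apart
print(f"lag 1: {r1:.3f}, lag 6: {r6:.3f}")
```

For a series like this the lag-1 value is much larger than the lag-6 value, which is the sense in which nearby observations "tend to be correlated".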
[Figure: Air Passengers 1949-1960. Time plot of Passengers (1000's) against Time, 1950 to 1960.]
A time plot of the number of air passengers per month between January 1949 and December 1960 in the USA reveals a rising trend
There is also a seasonal pattern of travel within each year. More people travel in the summer than the winter.
[Figure: two panels. Left: aggregate(AP) against Time, 1950 to 1960. Right: boxplots by month, 1 to 12.]
Aggregating the series annually reveals the rising trend, and the boxplot shows that more people travel in the summer months.
Forecasting: 1
[Figure: Holt-Winters filtering. Observed / Fitted against Time, 1950 to 1960.]
There are many modelling and forecasting techniques.
Here we use the Holt-Winters procedure to model the behaviour of the series…
The fit is quite promising.
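As an illustration of what the procedure does, here is a minimal sketch of additive Holt-Winters smoothing. It is not the fit shown on the slide: the additive form, the smoothing parameters `alpha`, `beta`, `gamma`, the initialisation from the first two seasons, and the quarterly `demo` series are all assumptions made for the example.

```python
def holt_winters_additive(series, period, alpha=0.3, beta=0.1, gamma=0.1,
                          horizon=0):
    """Return one-step-ahead fitted values and `horizon` forecasts."""
    if len(series) < 2 * period:
        raise ValueError("need at least two full seasons to initialise")
    # Initial level and trend from the means of the first two seasons.
    first = sum(series[:period]) / period
    second = sum(series[period:2 * period]) / period
    level = first
    trend = (second - first) / period
    # Initial seasonal effects: deviations of the first season from its mean.
    seasonal = [series[i] - first for i in range(period)]

    fitted = []
    for t, y in enumerate(series):
        s = seasonal[t % period]
        fitted.append(level + trend + s)  # prediction made before seeing y
        prev_level = level
        level = alpha * (y - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % period] = gamma * (y - level) + (1 - gamma) * s

    n = len(series)
    forecasts = [level + (h + 1) * trend + seasonal[(n + h) % period]
                 for h in range(horizon)]
    return fitted, forecasts


# Hypothetical quarterly series: linear trend plus a repeating seasonal pattern.
pattern = [5, -5, 10, -10]
demo = [100 + 2 * t + pattern[t % 4] for t in range(24)]
fitted, forecasts = holt_winters_additive(demo, period=4, horizon=4)
print([round(f, 1) for f in forecasts])
```

A series whose seasonal swing grows with its level, as the air passenger series appears to, is usually handled with a multiplicative seasonal term or a log transform rather than the additive form sketched here.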
Forecasting: 2
[Figure: forecast plot; x-axis: Time, 1950 to 1965.]
And if the growth of US air traffic during the first 4 years of the 1960s follows the pattern of the previous 12…
the forecast is for some 800 thousand passengers per month by 1965 (the plotted series is in thousands of passengers per month).
Models
• There is a wide variety of different models, including
– Basic stochastic models (like Holt-Winters)
– Stationary models (AR, MA, ARMA)
– Non-stationary models (ARIMA, ARCH)
– Spectral analysis (based on the Fourier transform)
– Multivariate models (two or more series are involved)
2: ESPON DB Data Issues
Some typical data… household income
The NUTS2 regions in Austria are the Länder – here we have short time series concerning disposable income of private households from 1995 to 2007. Each series has only 13 elements
We might normalise these by the population to reach a comparable ‘per capita’ figure
Short series…
• We should be aware that there is an interaction between the amount of data available and what can be done with it
• Paas, Kusk, Schlitte and Võrk’s 2007 analysis of income convergence in selected countries of the EU using NUTS3 data had this to say:
George Box, 1976, Science and Statistics
• Models include not just the analytical tools that others might use, but also those which we use to examine the data for outliers and to estimate values
• ‘Wrong’ for Box includes models that fail to encapsulate the process under investigation
ESPON Tigers
• Long time series tend to be for large areal units, such as countries, or major administrative regions – the MAUP may well also be a tiger
• Smaller regions…
– shorter series
– incomplete series
– a long time period between elements (decennial censuses) in the case of very small units
3: Detecting Exceptional Values
Exceptional values
• Two types:
1. Logical errors (e.g. negative unemployment rate)
2. Statistical outliers (e.g. unusually high unemployment rate)
• Identification methods
1. Logical errors: mechanical (& statistical) techniques
2. Statistical outliers: statistical techniques
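Logical errors can be caught with purely mechanical rules. A minimal sketch, assuming each record holds a count of economically active population and a count of unemployed (the function name, the rules chosen, and the example counts are hypothetical):

```python
def logical_errors(active, unemployed):
    """Return a list of logical problems found in one region-year record."""
    problems = []
    if active <= 0:
        problems.append("economically active population must be positive")
    if unemployed < 0:
        problems.append("negative unemployed count")
    if active > 0 and unemployed > active:
        problems.append("unemployed exceeds economically active population")
    return problems


# Hypothetical records: (economically active, unemployed).
records = {"region A": (1000, 80), "region B": (1000, -5), "region C": (500, 600)}
report = {name: logical_errors(a, u) for name, (a, u) in records.items()}
for name, problems in report.items():
    print(name, problems or "ok")
```

A record that passes these rules can still be a statistical outlier; that is a separate question, which needs the statistical techniques.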
Types of outliers
Our approach
• There is no single ‘best’ detection technique, so…
1. Apply a selection of outlier detection methods, which are simple and robust
2. Flag an observation if it is a likely outlier according to each technique
3. Build up the weight of evidence for the likelihood of a value being statistically exceptional
4. Suggest what type of outlier it is likely to be
– aspatial, spatial, temporal, relationship, a mixture
5. Consult an expert on the data to decide on the appropriate course of action
Issues
• Temporal outliers
– The time series are often too short to apply a ‘standard’ technique reliably
– So... Parallel time series are treated as additional variables (there will be a high positive correlation between series from different years)
– Then... Apply an aspatial/spatial/relationship detection technique
– That is... We add the spatial component which is then treated either implicitly or explicitly
• Modifiable Areal Unit Problem (MAUP)
– Identify exceptional values at the finest spatial resolution
Weight of evidence
• If we apply a range of techniques, then we can build up the weight of evidence for the likelihood of an observation being exceptional
• Observations which are exceptional on most or all of the tests are those which we would select for further investigation
• Here’s an example showing three observations…
| Identification technique | Identification type | Flags across Obsn. 1, Obsn. 2, Obsn. 3 |
| --- | --- | --- |
| 1. Boxplot | Aspatial & univariate | Yes Yes |
| 2. Bagplot | Aspatial & bivariate; relationship | Yes |
| 3. Residuals from locally weighted mean & Hawkins test statistic | Spatial & univariate | Yes Yes |
| 4. Residuals from multiple linear regression* (requires modelling decisions) | Aspatial & multivariate; linear relationships | Yes |
| 5. Residuals from locally weighted regression* (requires modelling decisions) | Aspatial & multivariate; nonlinear relationships | Yes Yes |
| 6. Residuals from geographically weighted regression* (requires modelling decisions) | Spatial & multivariate; nonlinear relationships | Yes |
| 7. Basic & robust principal component analysis* (model-decision free) | Aspatial & multivariate; linear relationships | Yes |
| 8. Locally weighted principal component analysis* (model-decision free) | Aspatial & multivariate; nonlinear relationships | Yes |
| 9. Geographically weighted principal component analysis* (model-decision free) | Spatial & multivariate; nonlinear relationships | Yes Yes |

* Can have a spatial, univariate form if the coordinate data are used as variables
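The tallying step behind the weight-of-evidence idea can be sketched as follows. The method names, observation labels, flags, and the threshold of two methods are hypothetical and are not taken from the table:

```python
from collections import Counter


def weight_of_evidence(flags_by_method):
    """Count, for each observation, how many methods flagged it."""
    counts = Counter()
    for flagged in flags_by_method.values():
        counts.update(flagged)  # each method contributes at most one vote
    return counts


# Hypothetical flags: method name -> set of observations it marks as outlying.
flags_by_method = {
    "boxplot": {"obs1", "obs3"},
    "PCA": {"obs1"},
    "GWPCA": {"obs1", "obs2"},
}
evidence = weight_of_evidence(flags_by_method)
# Assumed rule: investigate anything flagged by at least two methods.
to_investigate = sorted(obs for obs, k in evidence.items() if k >= 2)
print(dict(evidence), to_investigate)
```

The threshold is a judgment call; the list it produces is a starting point for the expert consultation, not a verdict.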
4: Estimating Missing Data
Data estimation techniques
• There is an enormous range of possibilities
– Choice depends on
  • Data type, size, dimensionality, and properties
  • Objective – prediction or prediction uncertainty accuracy
  • Model complexity
– We can estimate missing values using...
  • Averaging
  • Regression (with or without autocorrelation, global and local)
  • Inverse distance weighting
  • Regression Kriging
  • Co-Kriging
  • Bayesian Markov Chain Monte Carlo methods
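As an example of one of the simpler options, here is a minimal sketch of inverse distance weighting. The coordinates, values, and the distance power of 2 are assumptions made for illustration:

```python
import math


def idw_estimate(target_xy, known, power=2.0):
    """Estimate a missing value at target_xy from (coordinate, value) pairs."""
    num = 0.0
    den = 0.0
    for (x, y), value in known:
        d = math.hypot(x - target_xy[0], y - target_xy[1])
        if d == 0.0:
            return value  # a known value at the same location wins outright
        w = 1.0 / d ** power  # nearer neighbours get larger weights
        num += w * value
        den += w
    return num / den


# Hypothetical neighbours: ((x, y) centroid, observed value).
neighbours = [((1.0, 0.0), 10.0), ((0.0, 2.0), 20.0)]
estimate = idw_estimate((0.0, 0.0), neighbours)
print(estimate)  # weights 1 and 0.25 give (10 + 5) / 1.25 = 12.0
```

Unlike the kriging variants, this gives no measure of prediction uncertainty, which matters if uncertainty is part of the objective.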
5: Case study
Identifying NUTS regions with exceptional time-series values
Unemployment at NUTS 2/3, 2000-2007
• A dataset for NUTS 2/3 regions was obtained from UMS-RIATE
• For each year there are counts of
– Economically active population
– Unemployed, economically active population
• Shapefile created from NUTS2/NUTS3 shapefiles in Mapkit
• Analysis undertaken in R
Eight ‘unemployment rate’ variables for 2000 to 2007
Rate = [Unemployed/Economically active]
790 x 8 observations at NUTS 2/3 level
Some island data removed
Data post-processing
• Logical input errors
– Original data checked
– There appear to be no logical errors, but there appear to be a few exceptional values
• Assessing outlier detection methods
– 320 values randomly picked (~5% of the data)
  • These are in 271 regions
– Values doubled and then randomly redistributed among the 320 positions in the data
– These observations are assumed to be outlying in some way (but we cannot guarantee this)
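The contamination step can be sketched as below. This is one reading of the procedure, not the authors' R code: the random seeds and the uniformly generated stand-in rates are assumptions, and only the 790 x 8 shape and the 320 doubled values come from the slides.

```python
import random


def contaminate(data, n_outliers, seed=0):
    """Double n_outliers randomly chosen cells and shuffle them among those cells.

    `data` is a list of rows (regions), each a list of yearly values.
    Returns the contaminated copy and the list of (row, column) positions.
    """
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(data)) for j in range(len(data[0]))]
    positions = rng.sample(cells, n_outliers)       # distinct cells
    doubled = [2 * data[i][j] for i, j in positions]
    rng.shuffle(doubled)                            # redistribute among positions
    out = [row[:] for row in data]
    for (i, j), value in zip(positions, doubled):
        out[i][j] = value
    return out, positions


# Stand-in data: 790 regions x 8 years of made-up unemployment rates.
gen = random.Random(1)
rates = [[gen.uniform(0.02, 0.25) for _ in range(8)] for _ in range(790)]
contaminated, positions = contaminate(rates, 320)
affected_regions = {i for i, _ in positions}
print(len(positions), "values in", len(affected_regions), "regions")
```

Because the doubled values are shuffled, a cell may receive a value that is not exceptional for its region, which is why the observations can only be assumed, not guaranteed, to be outlying.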
Effect of outliers?
Merely looking at some maps doesn’t help in easily identifying the regions with exceptional values
Interseries correlations
Those plots about the main diagonal are highly correlated.
The effect of the randomly introduced values is clearer on the more distant plots (these are also ‘distant’ in time)
Detection Techniques for comparison
• Simple time-series approach (TS) – outlined in FIR: we have used a simplified version
• Principal Components Analysis (PCA)
• Geographically Weighted Principal Components Analysis (GWPCA)
– The PCA based methods allow us to consider more than simply pairs of time series simultaneously
We’ll compare the various methods
Time Series method (TS)
• For each of the 790 regions, the index TS is calculated at each of the 8 time observations (using the 8-observation data set):
• $\mathrm{TS} = (\text{observation} - \text{mean})^2 / \text{variance}$
• Assuming Gaussian errors, a time observation is taken as outlying if TS > 3.84 (95% level)
• In this study, we simply find outliers according to boxplot statistics
• An indicator variable is then set at any region for which at least one time observation is outlying
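A minimal sketch of the index and the region-level indicator, using the 3.84 threshold rather than the boxplot statistics used in the study. The use of the sample variance (n - 1 denominator) and the example series are assumptions:

```python
import statistics


def ts_index(series):
    """TS = (observation - mean)^2 / variance for each time observation."""
    mean = statistics.mean(series)
    var = statistics.variance(series)  # sample variance; an assumption
    if var == 0:
        return [0.0] * len(series)     # a flat series has no outlying value
    return [(x - mean) ** 2 / var for x in series]


def region_is_flagged(series, threshold=3.84):
    """Indicator: at least one time observation exceeds the threshold."""
    return any(v > threshold for v in ts_index(series))


# Hypothetical 8-year unemployment rates (%) for two regions.
steady = [5.0, 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7]
spiked = [5.0, 5.2, 4.9, 5.1, 5.0, 12.0, 5.1, 4.8]
print(region_is_flagged(steady), region_is_flagged(spiked))
```

With only 8 observations per region the mean and variance are themselves pulled by the value being tested, which is one reason such a short-series index needs cautious interpretation.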
Principal Components Analysis (PCA)
• Principal Components Analysis is a technique which transforms m correlated variables into m new variables which are uncorrelated with one another
• All of the variance in the original m variables is retained during the transformation
• Values of the new variables are known as scores – we can use these for identifying exceptional values
Geographically Weighted PCA
• PCA is a global transformation, and it ignores the spatial arrangement of the NUTS regions
• With GWPCA we obtain local transformations by applying geographical weighting – this gives us a set of components for each NUTS region
• We can use the scores from these local transformations to identify exceptional values
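The geographical weighting can be sketched as a locally weighted covariance matrix, from which local components would then be extracted. The Gaussian kernel, the bandwidth, and the coordinates and values below are assumptions; the slides do not say which kernel or bandwidth was used.

```python
import math


def gaussian_weights(target, coords, bandwidth):
    """Kernel weight of every location relative to the target location."""
    return [math.exp(-0.5 * (math.dist(target, c) / bandwidth) ** 2)
            for c in coords]


def weighted_covariance(rows, weights):
    """Weighted means and weighted covariance matrix of the variables in rows."""
    total = sum(weights)
    m = len(rows[0])
    means = [sum(w * r[j] for w, r in zip(weights, rows)) / total
             for j in range(m)]
    cov = [[sum(w * (r[a] - means[a]) * (r[b] - means[b])
                for w, r in zip(weights, rows)) / total
            for b in range(m)] for a in range(m)]
    return means, cov


# Hypothetical region centroids and two yearly rates per region.
coords = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
rows = [[1.0, 2.0], [2.0, 4.0], [10.0, 1.0]]
weights = gaussian_weights(coords[0], coords, bandwidth=1.0)
local_means, local_cov = weighted_covariance(rows, weights)
print([round(w, 4) for w in weights])
```

Repeating this at every region, and taking the eigenvectors of each local covariance matrix, gives one set of components per region.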
PCA for the unemployment series
The series are highly correlated, so the first component accounts for the majority of the variance
Using PCA and GWPCA
• Examine the residual component data (those with small variances)
• Use boxplot statistics to define outlying values
• In this case, a significant result indicates one or more outlying time observations in a NUTS region
• GWPCA will also indicate a spatial ‘outlyingness’ in the data
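A minimal sketch of the global PCA version of these steps. It assumes that every component after the first counts as 'residual', finds the first component by power iteration, and applies a boxplot upper fence (Q3 + 1.5 x IQR); the data and the planted exceptional value are made up.

```python
import statistics


def residual_scores(rows, iterations=200):
    """Squared distance of each row from the first principal axis.

    This equals the sum of squared scores on all components after the first.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centred = [[r[j] - means[j] for j in range(m)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(m)]
           for a in range(m)]
    # Power iteration converges to the leading eigenvector of cov.
    v = [1.0] * m
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    scores = []
    for c in centred:
        pc1 = sum(ci * vi for ci, vi in zip(c, v))  # score on component 1
        scores.append(sum((ci - pc1 * vi) ** 2 for ci, vi in zip(c, v)))
    return scores


def boxplot_outliers(values):
    """Indices of values above the boxplot upper fence."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    upper = q3 + 1.5 * (q3 - q1)
    return [i for i, x in enumerate(values) if x > upper]


# Hypothetical rates: 20 regions x 4 years, highly correlated across years.
rates = [[5 + 0.5 * i + 0.1 * j for j in range(4)] for i in range(20)]
rates[7][2] *= 2  # plant one exceptional value in region 7
scores = residual_scores(rates)
flagged = boxplot_outliers(scores)
print(flagged)
```

Because the series are so highly correlated, almost all of the ordinary variation lies along the first component, so a single inconsistent year shows up clearly in what is left over.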
The various techniques are compared on the next slides
(a) TS method compared with PCA
The TS method appears to be less discriminating than the global PCA method
(b) TS compared with GWPCA
The GWPCA method would appear to be very discriminating in identifying potentially exceptional regions
(c) PCA compared with GWPCA
The global PCA is slightly less discriminating than the GWPCA
Results for the 271 randomised sites
• Sites not identified as outlying – 21.4%
• Outlying by at least one method – 78.6%
• Outlying by one method only – 55.3%
• Outlying by two methods – 18.8%
• Outlying by all three methods – 4.8%
• Identification by method:
– TS (75.6%)
– PCA (22.5%)
– GWPCA (8.8%)
• False positives at 519 un-affected sites:
– TS (29.5%)
– PCA (2.3%)
– GWPCA (1.3%)
• These results endorse the “weight of evidence” approach to the identification of exceptional values…
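Rates of this kind can be computed from sets of site identifiers. A minimal sketch, assuming the identification rate is taken over the contaminated sites and the false positive rate over the unaffected sites; the site numbers below are hypothetical and do not reproduce the figures above.

```python
def detection_rates(flagged, contaminated, all_sites):
    """Return (identification rate, false positive rate) for one method."""
    unaffected = all_sites - contaminated
    hits = flagged & contaminated           # contaminated sites that were flagged
    false_positives = flagged & unaffected  # unaffected sites that were flagged
    return len(hits) / len(contaminated), len(false_positives) / len(unaffected)


# Hypothetical example with 10 sites, 4 of them contaminated.
all_sites = set(range(10))
contaminated_sites = {0, 1, 2, 3}
flagged_sites = {0, 1, 5}
hit_rate, fp_rate = detection_rates(flagged_sites, contaminated_sites, all_sites)
print(f"identified {hit_rate:.1%}, false positives {fp_rate:.1%}")
```

Reading the two rates together shows the trade-off between methods: a method can identify many contaminated sites while also flagging many unaffected ones.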
Acknowledgements
• We are disappointed that Eyjafjallajökull decided to send some ash to Ireland
• We are deeply grateful to Claude for presenting this work – some of it is not easy
• We also acknowledge statistical advice from Professor Chris Brunsdon, Professor of Geographic Information at the University of Leicester
Thank You!