Deliverable 2.8: Outliers Gary Brown Office for National Statistics UK.
Deliverable 2.8: Outliers
Gary Brown
Office for National Statistics
UK
Outliers = Outlier detection and treatment aspects of combining data (survey/administrative), including options for various hierarchies
Overview
• Introduction
• Definitions
• Identification
• Treatment
• Recommendations
Introduction
• Deliverable 2.8 led by UK
– UK leader has worked in methodology for over 14 years
– Expert in Sample Design and Estimation for Business Surveys
– ... also expert in Small Area Estimation, Quality, Editing and Imputation, Time Series Analysis
• QA by Italy
Definitions
• Outliers
• Errors
• Outliers in survey data
• Outliers in administrative data
• Outliers in modelling
• ... two glossaries considered: ONS and OECD
Definitions – outliers
• OECD: “A data value that lies in the tail of the statistical distribution of a set of data values”
• ONS: “A correct response, usually an extreme value isolated from the bulk of the responses, or has a large sample weight that would have an undue influence on the estimate”
• Question 1: extreme (1) influential (2) both (3)
Definitions – errors
• Errors are incorrect values identified by edit rules
• OECD (edit rule): “A logical condition or a restriction which must be met if the data is to be considered correct”
• ONS (edit rule): “A rule designed to detect specific errors in data for potential subsequent correction”
• Errors are corrected before outliers are considered
• Question 2: outliers = errors (1) outliers ≠ errors (2)
Definitions – survey outliers
• In the survey context, an outlier is an unrepresentative value (influential)
• A unit sampled with probability 1/n is assumed to represent n-1 unsampled units in the population
• If the unit is unique, the assumption is invalid
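To illustrate why a unique unit invalidates this assumption, the sketch below computes a design-weighted total in which one sampled unit is unlike any other in the population. The figures and the function name `weighted_total` are invented for illustration, and the weight is assumed to be the inverse of the selection probability.

```python
def weighted_total(values, weights):
    """Design-weighted estimate of a population total."""
    return sum(v * w for v, w in zip(values, weights))

# Hypothetical sample: each unit was selected with probability 1/10,
# so each carries a weight of 10 (itself plus 9 unsampled units).
values = [12, 15, 11, 14, 500]  # the last unit is unique in the population
weights = [10, 10, 10, 10, 10]

# Treating the unique unit as if it represented 9 similar units.
naive = weighted_total(values, weights)

# If the unique unit really represents only itself, its weight should be 1.
adjusted = weighted_total(values, [10, 10, 10, 10, 1])

print(naive, adjusted)
```

The gap between the two totals comes entirely from the one unit whose weight assumes nine look-alikes that do not exist.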
Definitions – administrative outliers
• In the administrative context, an outlier is an atypical value (extreme)
• Administrative data represent a census, so each unit is treated as unique
• No assumptions
Definitions – modelling outliers
• In the modelling context, an outlier is an influential value
• ONS: “The amount of effect a particular point has on the parameters of a regression equation”
• Influence on processing and statistical modelling
Definitions – modelling outliers
• Processing – editing: “fail if > 60% of maximum over past 5 years”
• Processing – imputation: “uplift last return by average growth in domain”
• Statistical modelling
Identification – units
• A data warehouse stores data once for repeated use
• Each unit will have multiple values (variables/time periods), and whether any value is
– extreme depends on which other data are used
– influential depends on what process/model is estimated
• Given repeated use, it is impossible to know in advance how data domains will be defined or which models will be fitted
• Therefore every unit in a data warehouse is a potential outlier
• Question 3: yes (1) no (2) unsure (3)
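As a small illustration of why extremeness depends on which other data are used, the sketch below applies one rule to the same value in two differently defined domains. The rule (more than `k` times the domain median), the domain names and the figures are all hypothetical.

```python
from statistics import median

def is_extreme(value, domain_values, k=3.0):
    """Hypothetical rule: extreme if more than k times the domain median."""
    return value > k * median(domain_values)

turnover = 900  # one value held once in the warehouse

# Two possible domain definitions for the same unit.
small_retailers = [200, 250, 180, 220, 210]
all_retailers = [200, 250, 800, 950, 1200, 700, 1100]

print(is_extreme(turnover, small_retailers))  # compared with small retailers only
print(is_extreme(turnover, all_retailers))    # compared with all retailers
```

The value is flagged in the first domain and not in the second, so a single stored outlier label could not be right for both uses.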
Identification – uses
• Assuming all units are potential outliers
– identification becomes use dependent
– outliers are recorded as part of the metadata of an output
– outliers are not otherwise recorded in the data warehouse
• Expected data uses and examples of identification methods
– processing, e.g. comparing observed and expected edit failures
– updating the business register, e.g. comparing different sources
– survey (estimating variables & calibration weights), e.g. winsorisation & setting acceptable ranges
– survey/admin (modelling relationship & estimating survey), e.g. Cook’s distance & winsorisation
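For the survey/admin case, Cook's distance can be sketched for a simple linear regression of a survey variable on an administrative variable. This uses the standard textbook formula with an intercept and one slope; the function name and the data are invented for illustration, and a perfect fit (zero residual variance) is not handled.

```python
def cooks_distances(x, y):
    """Cook's distance for each point in the regression y = a + b*x."""
    n = len(x)
    p = 2  # number of fitted parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)  # residual variance
    leverages = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        (e ** 2 / (p * s2)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverages)
    ]

# Hypothetical administrative (x) and survey (y) values for six units.
admin = [10, 20, 30, 40, 50, 60]
survey = [11, 19, 32, 41, 49, 120]

distances = cooks_distances(admin, survey)
most_influential = distances.index(max(distances))
print(most_influential, [round(d, 2) for d in distances])
```

The unit with the largest distance is the one whose removal would change the fitted relationship most, which is the sense of "influential" used in the modelling definition.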
Treatment – units in uses
• Identified outliers need to be treated during use
– to prevent distortion
– by adjusting the weight of the unit to a proportion P, where 0 < P < 100%
– balancing reduced variance against increased bias (i.e. MSE)
• Expected data uses and examples of treatment methods
– processing, e.g. use medians rather than means
– updating the business register, e.g. delete one source
– survey (estimating variables & calibration weights), e.g. winsorisation & restrict to acceptable ranges
– survey/admin (modelling relationship & estimating survey), e.g. delete from modelling process & winsorisation
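As one example of treatment, the sketch below shows a one-sided winsorisation in which a value above a cutoff is pulled towards it, so that the unit counts in full for itself while the units it represents are valued at the cutoff. This particular form, the function name `winsorise`, the cutoff and the figures are assumptions for illustration; the slides do not specify which variant is intended.

```python
def winsorise(value, weight, cutoff):
    """One-sided winsorisation of a value with design weight >= 1.

    Values at or below the cutoff are unchanged. Above it, the treated value
    satisfies weight * treated = value + (weight - 1) * cutoff.
    """
    if value <= cutoff:
        return value
    return cutoff + (value - cutoff) / weight

# Hypothetical unit: value 500, weight 10, cutoff 50.
treated = winsorise(500, 10, 50)

# Share of the untreated weighted contribution that remains after treatment,
# comparable to adjusting the unit's weight to a proportion P.
effective_p = (10 * treated) / (10 * 500)

print(treated, effective_p)
```

Lowering the cutoff reduces variance further but adds more bias, which is the MSE trade-off noted above.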
Recommendations
1. Neither data units nor their entries in a data warehouse should be labelled as outliers
2. Identification and treatment of outliers should be specific to each instance in which the data are used
3. Metadata on outliers should only be included in a data warehouse alongside outputs
• Question 4: agree (1) disagree (2) discuss! (3)
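One way to picture recommendations 1 and 3 together is a structure in which warehouse entries carry no outlier flag and outlier information lives only in the metadata of each output. The class names, field names and values below are hypothetical and only sketch the idea.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class OutlierRecord:
    unit_id: str
    variable: str
    method: str     # how the outlier was identified for this output
    treatment: str  # how it was treated in this output

@dataclass
class Output:
    name: str
    estimate: float
    outliers: List[OutlierRecord] = field(default_factory=list)

# Warehouse entries hold values only, with no outlier label (recommendation 1).
warehouse = {"U001": {"turnover": 500.0}, "U002": {"turnover": 12.0}}

# Outlier metadata is attached to the output that used the data (recommendation 3).
retail_total = Output(name="retail turnover total", estimate=1020.0)
retail_total.outliers.append(
    OutlierRecord("U001", "turnover", "winsorisation cutoff", "winsorised")
)

print(retail_total.outliers[0].unit_id, sorted(warehouse["U001"]))
```

A second output built from the same warehouse entries would keep its own, possibly different, list of outliers.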