Evaluating data quality issues from an industrial data set
Gernot Liebchen (Gernot.Liebchen@Brunel.ac.uk), Bheki Twala, Mark Stephens, Martin Shepperd, Michelle Cartwright
What is it all about?
• Motivations
• Dataset – the origin & quality issues
• Noise & cleaning methods
• The Experiment
• Issues & conclusion
• Future Work
Motivations
• A previous investigation compared three noise-handling methods: robust algorithms (pruning), filtering, and polishing
• Predictive accuracy was highest with polishing, followed by pruning, and only then by filtering
• But suspicions were raised (at EASE)
Suspicions about previous investigation
• The dataset contained missing values, which were imputed (artificially created) while building the model (a decision tree)
• Polishing alters the data (what impact can that have?)
• The methods were evaluated using the predictions of another decision tree -> Can the findings be supported by a metrics specialist?
Why do we bother?
• Good quality data is important for good quality predictions and assessments
• How can we hope for good quality results if the quality of the input data is not good?
• The data is used for a variety of purposes, especially analysis and estimation support
The Dataset
• Given a large dataset provided by EDS
• The original dataset contains more than 10,000 cases with 22 attributes
• Contains information about software projects carried out since the beginning of the 1990s
• Some attributes are more administrative (e.g. Project Name, Project ID) and will not have any impact on software productivity
Suspicions
• The data might contain noise
• This was confirmed by a preliminary analysis of the data, which also indicated the existence of outliers
How could it occur? (in the case of the dataset)
• Input errors (some teams might be more meticulous than others); the person approving the data might not be as meticulous
• Misunderstood standards
• The input tool might not provide range checking (or only limited checking)
• A “Service Excellence” dashboard in headquarters
• Local management pressure
Suspicious Data Example
• A pair of near-duplicate records:

| Attribute | Record 1 | Record 2 |
| --- | --- | --- |
| Start Date | 01/08/2002 | 01/06/2002 |
| Finish Date | 24/02/2004 | 09/02/2004 |
| Name | *******Rel 24 | *******Rel 24 |
| FP Count | 1522 | 1522 |
| Effort | 38182.75 | 33461.5 |
| Country | IRELAND | UK |
| Industry Sector | Government | Government |
| Project Type | Enhance. | Enhance. |
| Etc. | | |

• But there were also examples with extremely high/low FP counts per hour (1 FP for 6916.25 hours; 52 FP in 4 hours; 1746 FP in 468 hours)
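A productivity sanity check like the one implied above can be sketched in a few lines. The record fields, function name, and the plausible FP-per-hour bounds below are illustrative assumptions, not values from the EDS dataset; only the three extreme examples come from the slide.

```python
# Hypothetical check: flag projects whose FP-per-hour productivity
# falls outside a plausible range (bounds are illustrative).
projects = [
    {"name": "A", "fp": 1,    "effort_hours": 6916.25},
    {"name": "B", "fp": 52,   "effort_hours": 4},
    {"name": "C", "fp": 1746, "effort_hours": 468},
]

def flag_unrealistic(p, low=0.01, high=1.0):
    """Return True if FP/hour is outside [low, high]."""
    rate = p["fp"] / p["effort_hours"]
    return rate < low or rate > high

suspects = [p["name"] for p in projects if flag_unrealistic(p)]
print(suspects)  # all three slide examples fall outside the assumed bounds
```

In practice the bounds would come from domain knowledge, which is exactly the point made later in the talk.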
What imperfections could occur?
• Noise – random errors
• Outliers – exceptional “true” cases
• Missing data
• From now on, noise and outliers will both be called noise, because both are unwanted
Noise detection can be
• Distance based (e.g. visualisation methods; Cook’s, Mahalanobis and Euclidean distance; distance clustering)
• Distribution based (e.g. neural networks, forward search algorithms and robust tree modelling)
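A minimal sketch of the distance-based idea, using Euclidean distance to the attribute-wise mean (one of the measures named above). The four toy cases and the decision to flag only the single furthest case are assumptions for illustration.

```python
import math

# Distance-based noise detection sketch: score each case by its Euclidean
# distance from the centroid, then treat the furthest case as a noise suspect.
cases = [(100, 900), (120, 1100), (110, 1000), (1746, 468)]  # toy (size, effort) pairs

n = len(cases)
centroid = tuple(sum(col) / n for col in zip(*cases))  # attribute-wise mean

def distance(case):
    return math.sqrt(sum((x - m) ** 2 for x, m in zip(case, centroid)))

ranked = sorted(cases, key=distance, reverse=True)
print(ranked[0])  # the most distant case is the prime suspect
```

Mahalanobis distance would additionally account for the covariance between attributes, which matters when size and effort are strongly correlated.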
What to do with noise?
• First, detection (we used decision trees, usually a pattern-detection tool in data mining, but here used to categorise the data: the tree is built on a training set and cases are tested in a test set)
• Three basic options for cleaning: polishing, filtering, pruning
Polishing/Filtering/Pruning
• Polishing – identifying the noise and correcting it
• Filtering – identifying the noise and eliminating it
• Pruning – avoiding overfitting (trying to ignore leverage effects); the instances which lead to overfitting can be seen as noise and are taken out
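The filtering and polishing options can be sketched as two small functions over a labelled dataset. The predictor, tolerance, and toy data below are assumptions for illustration (the talk used decision trees as the detector); pruning is different in kind, since it happens inside tree induction rather than on the data.

```python
# Sketch of two of the cleaning options, assuming some predictor `model(x)`
# and a tolerance for calling an instance noisy (all values illustrative).

def is_noisy(x, y, model, tol=0.5):
    return abs(model(x) - y) > tol

def filter_noise(data, model):
    """Filtering: identify noisy instances and drop them."""
    return [(x, y) for x, y in data if not is_noisy(x, y, model)]

def polish(data, model):
    """Polishing: identify noisy instances and replace the noisy value
    with the model's prediction."""
    return [(x, model(x)) if is_noisy(x, y, model) else (x, y)
            for x, y in data]

model = lambda x: 2 * x          # toy model: y should roughly equal 2*x
data = [(1, 2), (2, 4.1), (3, 9)]  # (3, 9) is noisy under that model

print(filter_noise(data, model))
print(polish(data, model))
```

Note how polishing keeps the case count constant while filtering shrinks the dataset, which is the trade-off the later slides discuss.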
What did we do? & How did we do it?
• Compared the results of filtering and pruning and discussed the implications of pruning
• Reduced the dataset to eliminate cases with missing values (avoiding missing-value imputation)
• Produced lists of “noisy” instances and their polished counterparts
• Passed them on to Mark (as metrics specialist)
Results
• Filtering produced a list of 226 cases from 436 (36% in noise list / 21% in cleaned set)
• Pruning produced a list of 191 from 436 (33% in noise list / 25% in cleaned set)
• Both lists were inspected, and both contain a large number of possible true cases as well as unrealistic cases (in terms of productivity)
Results 2
• By just inspecting historical data, it was not possible to judge which method performed better
• The decision tree as a noise detector does not detect unrealistic instances, only outliers in the dataset; this limitation can only be overcome with domain knowledge
So what about polishing?
• Polishing does not necessarily alter size or effort, so we are still left with unrealistic instances
• It makes them fit the regression model
• Is this acceptable from the point of view of the data owner? It depends on the application of the results. What if unrealistic cases impact the model?
Issues/Conclusions
• In order to build the models we had to categorise the dependent variable into 3 categories (<=1042, <=2985.5, >2985.5), BUT these categories appeared too coarse for our evaluation of the predictions
• If we know there are unrealistic cases, we should really take them out before we apply the cleaning methods (avoiding the inclusion of these cases when building the model)
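The three-way categorisation of the dependent variable uses the cut points quoted on the slide; the category labels below are illustrative, since the slide does not name them.

```python
# Discretise effort into the three categories from the slide
# (<=1042, <=2985.5, >2985.5); the labels are assumed names.

def effort_category(effort):
    if effort <= 1042:
        return "low"
    if effort <= 2985.5:
        return "medium"
    return "high"

print(effort_category(500), effort_category(2000), effort_category(38182.75))
```

With only three bins spanning everything from tiny projects to 38,000+ hours, it is easy to see why the categories proved too coarse for evaluating predictions.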
Where to go from here?
• Rerun the experiment without “unrealistic cases”
• Simulate a dataset with a model, induce noise and missing values, and evaluate the methods with knowledge of what the real underlying model is
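The proposed simulation study can be sketched as follows. The generating model, noise rate, noise magnitude, and cleaning threshold are all assumptions for illustration; the point is only that, because the generator is known, the cleaning methods can be scored against ground truth.

```python
import random

# Simulation-study sketch: generate data from a known model, inject noise,
# clean, then score the cleaning against the known truth.
random.seed(1)

true_model = lambda x: 3 * x + 10          # the known underlying model
data = [(x, true_model(x)) for x in range(50)]

# Inject additive noise into roughly 20% of the responses.
noisy = [(x, y + random.uniform(50, 100)) if random.random() < 0.2 else (x, y)
         for x, y in data]

# Filtering step: drop cases far from the (here, known) model.
cleaned = [(x, y) for x, y in noisy if abs(y - true_model(x)) < 25]

print(len(noisy), len(cleaned))
```

In a real study the detector would not see `true_model`; it would be estimated from the noisy data, and the knowledge of the generator would be used only for scoring.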
What was it all about?
• Motivations
• Dataset – the origin & quality issues
• Noise & cleaning methods
• The Experiment
• Issues & conclusion
• Future Work