Detection of Outliers - Linköping Universitystaffaidvi/courses/06/dm/Seminars2010/Outliers.pdf ·...
by Anton Auoja, Albert Backenhof & Mikael Dalkvist
Detection of Outliers
TNM033 - Data Mining
Holy Outliers, Batman!!
“An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs.” - Frank E. Grubbs
Holy Causes, Batman!!
Apparatus malfunction.
Fraudulent behavior.
Human error.
Natural deviations.
Contamination.
Holy Applications, Batman!!
Fraud Detection
Medicine
Public Health
Sports statistics
Detecting measurement errors
Holy WEKA, Batman!!
Interquartile Range
One Class Classifier
DBScan
Holy Common Methods, Batman!!
Statistical
Distance
Kernel
High Dimensional
Holy Statistical Methods, Batman!!
An outlier is an object with low probability with respect to the probability distribution model of the data.
Model Based.
Assume Gaussian distribution. Calculate the mean and standard deviation of the data. The probability of each object under the distribution can then be calculated.
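The model-based steps above can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: the function name `gaussian_outliers`, the cut-off `threshold`, and the sample data are all assumptions made for the example.

```python
import math
import statistics

def gaussian_outliers(values, threshold=3.0):
    """Flag values whose z-score under a fitted Gaussian exceeds `threshold`.

    Returns a list of (value, z_score, density) tuples.
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation
    flagged = []
    for x in values:
        z = (x - mean) / std
        # Gaussian probability density of x under N(mean, std^2)
        density = math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))
        if abs(z) > threshold:
            flagged.append((x, z, density))
    return flagged

# Hypothetical measurements with one suspicious value.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 25.0]

# In a small sample a single extreme value inflates the standard deviation,
# so a lower threshold than the usual 3 is used here.
for value, z, density in gaussian_outliers(data, threshold=2.5):
    print(f"value={value} z={z:.2f} density={density:.4f}")
```

Note that the mean and standard deviation are themselves pulled towards the outlier, which is one reason the more robust methods on the following slides exist.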
Holy Examples, Batman!!
Box Plots
Trimmed Means
Grubbs’ Test
Holy Box and Whisker Plots, Batman!!
Interquartile Range (IQR): Q3 - Q1
Lower Inner Fence: Q1 - 1.5*IQR
Upper Inner Fence: Q3 + 1.5*IQR
Lower Outer Fence: Q1 - 3*IQR
Upper Outer Fence: Q3 + 3*IQR
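The fence formulas above translate directly into code. The sketch below assumes Python's `statistics.quantiles` with the "inclusive" method for the quartiles (other quartile conventions give slightly different fences). The labels "mild" and "extreme" for points beyond the inner and outer fences are a naming choice for this example.

```python
import statistics

def box_plot_fences(values):
    """Compute the inner and outer fences from the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3.0 * iqr, q3 + 3.0 * iqr),
    }

def classify_outliers(values):
    """Split values into those beyond the inner fences and those beyond the outer fences."""
    fences = box_plot_fences(values)
    (inner_lo, inner_hi), (outer_lo, outer_hi) = fences["inner"], fences["outer"]
    mild, extreme = [], []
    for x in values:
        if x < outer_lo or x > outer_hi:
            extreme.append(x)
        elif x < inner_lo or x > inner_hi:
            mild.append(x)
    return {"mild": mild, "extreme": extreme}

# Hypothetical data set.
data = [2, 3, 4, 5, 6, 7, 8, 9, 18, 40]
print(box_plot_fences(data))
print(classify_outliers(data))
```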
Holy Trimmed Means, Batman!!
Delete a percentage of the extreme values.
Calculate mean.
Use new mean for comparison.
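A minimal sketch of these three steps follows. It assumes the trimming is symmetric (the same share is removed from both ends of the sorted data). The name `trimmed_mean`, the parameter `proportion`, and the data are illustrative only.

```python
import statistics

def trimmed_mean(values, proportion=0.1):
    """Mean after dropping `proportion` of the values from each end of the sorted data."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)

# Hypothetical measurements with one suspicious value.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 25.0]

plain = statistics.mean(data)
robust = trimmed_mean(data, proportion=0.1)
print(f"plain mean={plain:.2f} trimmed mean={robust:.2f}")

# Compare each value against the trimmed mean rather than the plain mean.
deviations = {x: abs(x - robust) for x in data}
print(max(deviations, key=deviations.get))
```

The trimmed mean stays close to the bulk of the data, so the suspicious value stands out more clearly than it does against the plain mean.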
Holy Test, Grubbs!!
Calculate the normal logarithm.
Sort data.
Calculate Z.
Compare Z to the critical Z value.
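The last three steps can be sketched as below, taking Z to be the largest absolute deviation from the mean divided by the sample standard deviation. The logarithm step is left to the caller. The critical value is passed in rather than computed, since it depends on the sample size and significance level and is normally looked up in a table. The 2.29 used here is the commonly tabulated value for n = 10 at the 5% level (two-sided) and should be checked against your own table.

```python
import statistics

def grubbs_statistic(values):
    """Return the most extreme value and its Z statistic."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation
    ordered = sorted(values)
    # After sorting, the candidate outlier is the smallest or the largest value.
    suspect = max((ordered[0], ordered[-1]), key=lambda x: abs(x - mean))
    return suspect, abs(suspect - mean) / std

def grubbs_test(values, critical_z):
    """Compare Z with a critical value supplied by the caller."""
    suspect, z = grubbs_statistic(values)
    return suspect, z, z > critical_z

# Hypothetical measurements with one suspicious value.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 25.0]

suspect, z, is_outlier = grubbs_test(data, critical_z=2.29)
print(f"suspect={suspect} Z={z:.3f} outlier={is_outlier}")
```

The test examines one value at a time. To look for further outliers, the flagged value would be removed and the test repeated with the critical value for the smaller sample.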
Holy Issues, Batman!!
Identifying distribution of data set.
The number of attributes
Mixtures of distributions
Holy Distance Based Methods, Batman!!
DB(p,D)
k-Nearest Neighbor
Local Distance Based
Holy DB(p,D), Knorr & Ng, Batman!!
An object o is an outlier if at least a fraction p of all objects in the database lie at a distance greater than D from o.
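A brute-force sketch of this definition, assuming Euclidean distance and counting the fraction over all n objects in the database (the object itself is at distance 0 and never counts as far away). The function name and the example points are made up for illustration.

```python
import math

def db_outliers(points, p, d):
    """Return the DB(p, D) outliers among `points` by checking every pair."""
    n = len(points)
    outliers = []
    for o in points:
        # Count the objects lying farther than d from o.
        far = sum(1 for q in points if math.dist(o, q) > d)
        if far >= p * n:
            outliers.append(o)
    return outliers

# Hypothetical 2-D data: a tight group and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
print(db_outliers(points, p=0.8, d=3.0))
```

Every pair of objects is compared, so the cost grows quadratically with the number of objects.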
Holy Distance to k-Nearest Neighbors, Batman!!
Outlier score.
Score each object [0,∞[ depending on the distance to its k-nearest neighbors.
Highly dependent on the choice of k.
Can be modified to use the mean of distances of a point to all its 1NN, 2NN, ..., kNN as an outlier score.
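Both scoring variants can be sketched as follows, assuming Euclidean distance and a brute-force neighbor search. `knn_score` uses the distance to the k-th nearest neighbor, and `mean_knn_score` averages the distances to the 1st through k-th nearest neighbors. All names and data are illustrative.

```python
import math

def knn_distances(points, index, k):
    """Sorted distances from points[index] to its k nearest neighbors."""
    dists = sorted(
        math.dist(points[index], q) for j, q in enumerate(points) if j != index
    )
    return dists[:k]

def knn_score(points, index, k):
    """Outlier score: distance to the k-th nearest neighbor."""
    return knn_distances(points, index, k)[-1]

def mean_knn_score(points, index, k):
    """Modified score: mean distance to the 1NN, 2NN, ..., kNN."""
    dists = knn_distances(points, index, k)
    return sum(dists) / len(dists)

# Hypothetical 2-D data: a tight group and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]

for i, point in enumerate(points):
    print(point, round(knn_score(points, i, k=2), 3), round(mean_knn_score(points, i, k=2), 3))
```

Running this with different values of k shows how strongly the ranking can depend on that choice.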
Holy Local distance-based algorithms, Batman!!
Determine the difference of an object from its nearest neighbors.
A threshold value is set.
All objects whose outlier factors exceed this value are considered to be outliers.
Local Outlier Factor (LOF).
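A compact, brute-force sketch of the Local Outlier Factor is given below. It follows the usual definition (k-distance, reachability distance, local reachability density) with two simplifying assumptions: each object gets exactly k neighbors even when distances tie, and there are no duplicate points (duplicates would make a density infinite). The threshold of 1.5 in the usage lines is an arbitrary example value.

```python
import math

def local_outlier_factors(points, k):
    """Return the LOF score of every point (values near 1 mean 'as dense as the neighbors')."""
    n = len(points)
    # neighbors[i]: (distance, index) pairs for the k nearest neighbors of point i
    neighbors = []
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbors.append(dists[:k])
    # k-distance: distance to the k-th nearest neighbor
    k_distance = [nb[-1][0] for nb in neighbors]

    def local_reachability_density(i):
        # reachability distance of i from neighbor j: max(k-distance(j), d(i, j))
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]
    # LOF: average ratio of the neighbors' densities to the point's own density
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]

# Hypothetical 2-D data: a tight group and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
factors = local_outlier_factors(points, k=2)

threshold = 1.5
outliers = [p for p, f in zip(points, factors) if f > threshold]
print(outliers)
```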
Holy Advantages, Batman!!
More general and easier to apply than statistical approaches
No probabilistic model needed
Can find local outliers
Unholy Disadvantages, Batman!!
Methods are typically O(n²)
Sensitive to choice of parameters
Dependent on pre-defined parameters
Can’t handle datasets with regions that have widely differing density
Holy Kernel Based Methods, Batman!!
[Figure: mapping from the original space X to the Hilbert (feature) space H]
Holy Implicitly, Batman!!
No additional memory or computation cost.
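One way to see why the mapping can stay implicit is to compare a kernel evaluation with the explicit feature-space dot product it replaces. The sketch below uses a degree-2 polynomial kernel on 2-D inputs, chosen purely for illustration since the slides do not name a kernel. The function names are likewise assumptions.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def explicit_feature_map(x):
    """Explicit map of a 2-D point into the 3-D feature space of the kernel below."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def polynomial_kernel(x, y):
    """Degree-2 polynomial kernel, evaluated entirely in the original space."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, 0.5)

explicit = dot(explicit_feature_map(x), explicit_feature_map(y))
implicit = polynomial_kernel(x, y)
print(explicit, implicit)
```

Both numbers agree (up to rounding), but the kernel version never builds the feature vectors.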
Holy High Dimensional, Batman!!
Curse of Dimensionality
One way is to create subspaces of the original space.
Another is Angle Based Outlier Degree.