An Unbiased Distance-based Outlier Detection Approach for
High Dimensional Data
DASFAA 2011
By Hoang Vu Nguyen, Vivekanand Gopalkrishnan and Ira Assent
Presented By Salman Ahmed Shaikh (D1)
Contents
• Introduction
• Subspace Outlier Detection Challenges
• Objectives of Research
• The Approach
– Subspace Outlier Score Function: FSout
– HighDOD Algorithm
• Empirical Results and Analysis
• Conclusion
Introduction
• An outlier is an observation that appears to deviate markedly from other members of the sample in which it occurs. [1]
• Popular techniques of outlier detection
– Distance based
– Density based
• Since these techniques take the full-dimensional space into account, their performance is impaired by noisy or irrelevant features.
• Recently, researchers have switched to subspace anomaly detection.
[Figure: plot with axes X and Y and labels N1, N2, o1, o2, o3 and "Anomalous Subsequence". o1, o2 and o3 are anomalous instances w.r.t. the data.]
Subspace Outlier Detection Challenges
• Unavoidable exploration of all subspaces to mine the full result set:
– As the monotonicity property does not hold in the case of outliers, one cannot apply apriori-like heuristics for mining outliers.
• Difficulty in devising an outlier notion:
– Full-dimensional outlier detection techniques suffer from dimensionality bias in subspaces.
– They assign higher outlier scores in high-dimensional subspaces than in lower-dimensional ones.
• Exposure to high false alarm rate:
– A binary decision on each data point (normal or outlier) in each subspace flags too many points as outliers.
– The solution is a ranking-based algorithm.
Objectives
• Build an efficient technique for mining outliers in subspaces, which should
– Avoid an expensive scan of all subspaces while still yielding high detection accuracy
– Ease the task of parameter setting
– Facilitate the design of pruning heuristics to speed up the detection process
– Provide a ranking of outliers across subspaces.
The Approach
• The authors state one property and give some definitions to explain their research approach.
• Non-monotonicity Property: Consider a data point p in the dataset DS. Even if p is not anomalous in subspace S of DS, it may be an outlier in some projection(s) of S. Even if p is a normal data point in all projections of S, it may be an outlier in S.
[Fig. (a): 2-D plot, both axes from 0 to 4, point A marked. A is an outlier in full space but not in subspace.]
[Fig. (b): 2-D plot, both axes from 0 to 4, point B marked. B is an outlier in subspace but not in full space.]
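The non-monotonicity property can be checked numerically on a toy example. The sketch below uses a hypothetical six-point dataset (not the data plotted on the slide) and the plain nearest-neighbor distance as the outlier score; the function name `nn_distance` is an illustrative choice.

```python
import math

def nn_distance(data, idx, subspace):
    """Euclidean distance from data[idx] to its nearest neighbour, using only
    the coordinates listed in `subspace`."""
    p = data[idx]
    return min(
        math.sqrt(sum((p[d] - q[d]) ** 2 for d in subspace))
        for j, q in enumerate(data) if j != idx
    )

# Hypothetical data: five points on the diagonal plus the point A = (0, 4).
data = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (0, 4)]
A = 5

# Nearest-neighbour distance of every point in the full 2-D space.
full = [nn_distance(data, i, (0, 1)) for i in range(len(data))]
print("full space:", [round(v, 2) for v in full])

# The same point A projected onto each single axis.
print("A on x-axis:", nn_distance(data, A, (0,)))
print("A on y-axis:", nn_distance(data, A, (1,)))
```

In this toy data, A is the most isolated point in the full space, yet in each one-dimensional projection it coincides with another point, so no pruning rule based on projections could have found it.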
(Subspace) Outlier Score Function
• Outlier Score Function: Fout, as given by Angiulli et al. for the full space [2]
The dissimilarity of a point p with respect to its k nearest neighbors is measured by its cumulative neighborhood distance, defined as the total distance from p to its k nearest neighbors in DS.
– To ensure that the non-monotonicity property is not violated, the authors redefine the outlier score function as below.
• Subspace Outlier Score Function: FSout
The dissimilarity of a point p with respect to its k nearest neighbors in a subspace S of dimensionality dim(S) is measured by its cumulative neighborhood distance, defined as the total distance from p to its k nearest neighbors in DS (projected onto S), normalized by dim(S).
– Here p^s is the projection of a data point p ∈ DS onto S.
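The formulas on this slide did not survive extraction. One reconstruction that agrees with the verbal definitions above is sketched below. The symbols kNN_p and kNN_p(S) (the k nearest neighbors of p in the full space and in S), the distance D, and in particular the exponent 1/l on dim(S) are assumptions. The exponent is inferred from the example on the next slide, where a 2-dimensional subspace with l = 2 is normalized by 2^(1/2), with l read as the order of the distance norm. Here p^s denotes the projection of p onto S.

```latex
F_{out}(p) = \sum_{m \in kNN_p} D(p, m)
\qquad
FS_{out}(p, S) = \frac{1}{\dim(S)^{1/l}} \sum_{m \in kNN_p(S)} D(p^s, m^s)
```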
FSout is Dimensionality Unbiased
• FSout assigns multiple outlier scores to each data point (one per subspace) and is dimensionality unbiased.
• Example: let k=1 and l=2
• In Fig.(a), A's outlier score in the 2-dimensional space is 1/2^(1/2), which is the largest across all subspaces.
• In Fig.(b), the outlier score of B when projected onto the subspace of the x-axis is 1, which is also the largest across all subspaces.
• Hence, FSout flags A and B as outliers.
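The effect of the normalization can be illustrated with a short sketch. It assumes that distances use the L_l norm and that the score is divided by dim(S)^(1/l), which matches the 1/2^(1/2) value in the example above; the function name `fs_out`, its `normalize` switch and the two-point dataset are hypothetical.

```python
import math

def fs_out(data, idx, subspace, k=1, l=2, normalize=True):
    """Total L_l distance from data[idx] to its k nearest neighbours in
    `subspace`, optionally divided by dim(S) ** (1 / l)."""
    p = data[idx]
    dists = sorted(
        sum(abs(p[d] - q[d]) ** l for d in subspace) ** (1.0 / l)
        for j, q in enumerate(data) if j != idx
    )
    total = sum(dists[:k])
    return total / len(subspace) ** (1.0 / l) if normalize else total

# Hypothetical data: two points that differ by 1 in every coordinate.
data = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
for S in [(0,), (0, 1), (0, 1, 2)]:
    raw = fs_out(data, 0, S, normalize=False)
    norm = fs_out(data, 0, S)
    print(len(S), round(raw, 3), round(norm, 3))

# A nearest neighbour at distance 1 in a 2-dimensional space, as for A in Fig.(a).
print(fs_out([(0.0, 0.0), (1.0, 0.0)], 0, (0, 1)))
```

Without normalization, the raw distance grows with the number of dimensions even though the per-coordinate gap is the same. With it, the score stays constant, which is the sense in which the score is dimensionality unbiased.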
Subspace Outlier Detection Problem
• Using FSout to score outliers in subspaces, the mining problem can now be redefined as:
Given two positive integers k and n, mine the top n distinct anomalies whose outlier scores (in any subspace) are largest.
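Read literally, the problem statement can be solved by brute force: score every point in every subspace, keep each point's best score, and return the n best points. The sketch below does exactly that and is only a baseline for intuition, not the HighDOD algorithm. The names `fs_out` and `top_n_outliers`, the `max_dim` cap, the dim(S)^(1/l) normalization and the dataset are assumptions.

```python
import heapq
from itertools import combinations

def fs_out(data, idx, subspace, k, l=2):
    """Total L_l distance to the k nearest neighbours in `subspace`,
    divided by dim(S) ** (1 / l) (assumed normalization)."""
    p = data[idx]
    dists = sorted(
        sum(abs(p[d] - q[d]) ** l for d in subspace) ** (1.0 / l)
        for j, q in enumerate(data) if j != idx
    )
    return sum(dists[:k]) / len(subspace) ** (1.0 / l)

def top_n_outliers(data, k, n, max_dim, l=2):
    """Return the n points with the largest score in any subspace of up to
    max_dim dimensions, as (score, point index, subspace) tuples."""
    best = {}  # point index -> (best score so far, subspace achieving it)
    for size in range(1, max_dim + 1):
        for S in combinations(range(len(data[0])), size):
            for i in range(len(data)):
                score = fs_out(data, i, S, k, l)
                if i not in best or score > best[i][0]:
                    best[i] = (score, S)
    return heapq.nlargest(n, ((s, i, S) for i, (s, S) in best.items()))

# Hypothetical data: a small cluster plus one point that is far away on axis 0 only.
data = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 0.1)]
print(top_n_outliers(data, k=1, n=1, max_dim=2))
```

Because each point is stored once with its best score, the result contains distinct anomalies, as the problem statement requires.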
HighDOD-Subspace Outlier Detection Algorithm
• HighDOD (High dimensional Distance based Outlier Detection) is
– A distance-based approach towards detecting outliers in very high-dimensional datasets.
– Unbiased w.r.t. the dimensionality of different subspaces.
– Capable of producing a ranking of outliers.
• HighDOD is composed of the following 3 algorithms
– OutlierDetection
– CandidateExtraction
– SubspaceMining
• Algorithm OutlierDetection examines subspaces of dimensionality up to some threshold m = O(logN), as suggested by Aggarwal and Yu [3] and Ailon and Chazelle [4]
Algorithm 1: OutlierDetection
• Carry out a bottom-up exploration of all subspaces up to a dimensionality of m = O(logN)
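The bottom-up exploration can be pictured as enumerating subspaces level by level, from single attributes up to m attributes. The sketch below shows only this enumeration. Reading m = O(logN) as ceil(log10 N) is an assumption (the slide does not fix the constant or the base), and the helper names are hypothetical.

```python
import math
from itertools import combinations

def max_subspace_dim(num_points, num_dims):
    # Assumption: interpret m = O(log N) as ceil(log10 N), capped at the
    # number of attributes and never below 1.
    return max(1, min(num_dims, math.ceil(math.log10(num_points))))

def bottom_up_subspaces(num_dims, m):
    """Yield attribute-index tuples of size 1, then 2, ..., up to m."""
    for size in range(1, m + 1):
        yield from combinations(range(num_dims), size)

m = max_subspace_dim(num_points=100, num_dims=4)
subspaces = list(bottom_up_subspaces(4, m))
print(m, len(subspaces))
print(subspaces)
```

Capping the dimensionality keeps the number of subspaces polynomial in the number of attributes instead of exponential.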
Algorithm 2: CandidateExtraction
• Estimate the data points' local densities using a kernel density estimator and choose the βn data points with the lowest estimates as potential candidates.
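A minimal sketch of the candidate-extraction idea follows. It assumes a Gaussian kernel with a fixed bandwidth and drops the kernel's normalizing constant, since only the ranking of the estimates matters; the slide does not say which kernel or bandwidth the authors use. The function name, the `bandwidth` parameter and the data are hypothetical, and `num_candidates` plays the role of βn.

```python
import math

def lowest_density_candidates(points, num_candidates, bandwidth=1.0):
    """Return the indices of the points with the lowest kernel density
    estimate (Gaussian kernel, constant factor omitted)."""
    def density(p):
        return sum(
            math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / (2 * bandwidth ** 2))
            for q in points
        ) / len(points)

    ranked = sorted(range(len(points)), key=lambda i: density(points[i]))
    return ranked[:num_candidates]

# Hypothetical data: four clustered points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(lowest_density_candidates(points, num_candidates=2))
```

Only the returned candidates then need an exact nearest-neighbor score, which is where the saving over scoring every point comes from.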
Algorithm 3: SubspaceMining
• This procedure is used to update the set of outliers TopOut with 2n candidate outliers extracted from a subspace S.
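The slide does not show the procedure itself, so the following is only one plausible reading of the update step: merge the scored candidates from the current subspace into TopOut, keep each point's best score, and retain the n highest-scoring points. The function name, the dictionary layout and the sample values are all hypothetical.

```python
def update_top_out(top_out, candidates, subspace, n):
    """Merge (point id, score) candidates from `subspace` into top_out,
    a dict of point id -> (score, subspace), and keep the n best points."""
    merged = dict(top_out)
    for point_id, score in candidates:
        # Keep only the best score seen for each point across subspaces.
        if point_id not in merged or score > merged[point_id][0]:
            merged[point_id] = (score, subspace)
    keep = sorted(merged.items(), key=lambda item: item[1][0], reverse=True)[:n]
    return dict(keep)

top_out = {1: (0.9, (0,)), 2: (0.5, (1,))}
candidates = [(2, 0.8), (3, 0.7)]
top_out = update_top_out(top_out, candidates, subspace=(0, 1), n=2)
print(top_out)
```

Storing the subspace next to the score keeps track of where each outlier was found, which is useful when the final ranking is reported.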
Empirical Results and Analysis
• The authors compared HighDOD with DenSamp, HighOut, PODM and LOF.
• Experiments were performed to compare detection accuracy and scalability.
• The Precision-Recall trade-off curve is used to evaluate the quality of an unordered set of retrieved items.
• Datasets
– 4 real data sets from the UCI Repository were used.
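For reference, precision and recall for a retrieved set of outliers can be computed as below, and a trade-off curve is obtained by varying how many top-ranked points are retrieved. The ranking and the ground-truth labels are made-up values for illustration, not results from the paper.

```python
def precision_recall(retrieved, true_outliers):
    """Precision and recall of an unordered set of retrieved point ids."""
    retrieved, true_outliers = set(retrieved), set(true_outliers)
    hits = len(retrieved & true_outliers)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(true_outliers) if true_outliers else 0.0
    return precision, recall

def pr_curve(ranking, true_outliers):
    """One (precision, recall) pair per cutoff n = 1 .. len(ranking)."""
    return [precision_recall(ranking[:n], true_outliers)
            for n in range(1, len(ranking) + 1)]

ranking = [7, 2, 9, 4]          # hypothetical detector output, best first
true_outliers = {7, 9, 5}       # hypothetical ground truth
for n, (p, r) in enumerate(pr_curve(ranking, true_outliers), start=1):
    print(n, round(p, 2), round(r, 2))
```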
Comparison of Detection Accuracy
Detection accuracy of HighDOD, DenSamp, HighOut, PODM and LOF
Comparison of Scalability
• Since PODM and LOF yield unsatisfactory accuracy, they are not included in this experiment.
• The scalability test is done with the CorelHistogram (CH) dataset, consisting of 68040 records in 32-dimensional space.
Scalability of HighDOD, DenSamp and HighOut
Conclusion
• The work proposes a new outlier detection technique that is dimensionality unbiased.
• It extends distance-based anomaly detection to subspace analysis.
• It facilitates the design of ranking-based algorithms.
• It introduces HighDOD, a ranking-based technique for subspace outlier mining.
References
[1] Wikipedia: Outlier. http://en.wikipedia.org/wiki/Outlier
[2] Angiulli, F., Pizzuti, C.: Outlier mining in large high-dimensional data sets. IEEE Trans. Knowl. Data Eng., 2005.
[3] Aggarwal, C.C., Yu, P.S.: An effective and efficient algorithm for high-dimensional outlier detection. VLDB Journal, 2005.
[4] Ailon, N., Chazelle, B.: Faster dimension reduction. Commun. ACM, 2010.