Post on 03-Jan-2016
RDF: A Density-based Outlier Detection Method Using Vertical Data Representation
Dongmei Ren, Baoying Wang, William Perrizo
North Dakota State University, U.S.A
Introduction: Related Work
Breunig et al. [6] first proposed a density-based approach to mining outliers over datasets with different densities.
Papadimitriou & Kitagawa [7] introduced the local correlation integral (LOCI) method, which is not efficient.
Contributions of this paper:
1. A relative density factor (RDF): RDF expresses the same amount of information as LOF (local outlier factor) [6] and MDEF (multi-granularity deviation factor) [7], but RDF is easier to compute.
2. An RDF-based outlier detection method: it efficiently prunes the data points which are deep in clusters and detects outliers only within the remaining small subset of the data.
3. A vertical data representation in P-trees, which further improves the efficiency of the method.
Definitions
Definition 1: Disk Neighborhood --- DiskNbr(x,r)
Given a point x and radius r, the disk neighborhood of x is defined as the set DiskNbr(x,r) = {x' ∈ X | d(x,x') ≤ r}, where d(x,x') is the distance between x and x'.
[Figure: direct and indirect disk neighbors of x]
Definition 2: Density of DiskNbr(x,r) --- Dens(x,r)
Dens(x,r) = |DiskNbr(x,r)| / r^dim, where dim is the number of dimensions.
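As a concrete illustration, Definitions 1 and 2 can be sketched in a few lines of Python. This is only a sketch under the assumption of Euclidean distance; the names `disk_nbr` and `dens` are ours, not from the paper.

```python
import math

def disk_nbr(x, points, r):
    """DiskNbr(x, r): all points within distance r of x (Definition 1)."""
    return [p for p in points if math.dist(x, p) <= r]

def dens(x, points, r):
    """Dens(x, r) = |DiskNbr(x, r)| / r^dim (Definition 2)."""
    return len(disk_nbr(x, points, r)) / r ** len(x)
```

Note that x always belongs to its own disk neighborhood, since d(x,x) = 0 ≤ r.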
Definitions (Continued)
Definition 3: Relative Density Factor (RDF) of point x with radius r --- RDF(x,r)
RDF(x,r) = AVG_{q ∈ DiskNbr(x,r)} Dens(q,r) / Dens(x,r)
RDF is used to measure outlierness. Outliers are points with high RDF values.
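Definition 3 can be sketched as follows (a self-contained, hypothetical implementation assuming Euclidean distance; function names are ours, and x counts as its own neighbor since d(x,x) = 0):

```python
import math

def disk_nbr(x, points, r):
    """All points within distance r of x."""
    return [p for p in points if math.dist(x, p) <= r]

def dens(x, points, r):
    """Neighborhood count divided by r^dim."""
    return len(disk_nbr(x, points, r)) / r ** len(x)

def rdf(x, points, r):
    """RDF(x, r): average density of x's disk neighbors over the density of x."""
    nbrs = disk_nbr(x, points, r)
    avg_nbr_dens = sum(dens(q, points, r) for q in nbrs) / len(nbrs)
    return avg_nbr_dens / dens(x, points, r)
```

A point on the sparse fringe of a cluster has neighbors denser than itself, so its RDF exceeds 1, while points deep in a cluster have RDF near or below 1.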
Special case: RDF between DiskNbr(x,r) and {DiskNbr(x,2r) - DiskNbr(x,r)}
RDF(x,r) = (|DiskNbr(x,2r)| - |DiskNbr(x,r)|) / (|DiskNbr(x,r)| × (2^dim - 1))
[Figure: direct neighbors within radius r and indirect neighbors within 2r of x]
The Proposed Outlier Detection Method
Given a dataset X, the proposed outlier detection method is processed by:
two procedures: "Prune Non-outliers" and "Find Outliers".
Our method efficiently prunes non-outliers (points deep in clusters) and then finds outliers over the remaining small subset of the data, which consists of points on cluster boundaries and genuine outliers.
[Figure: pruning non-outliers by expanding the neighborhood of a start point x through radii r, 2r, 4r, 6r]
Finding Outliers
Three possible distributions with regard to RDF:
(a) 1/(1+ε) ≤ RDF ≤ (1+ε): prune all neighbors and call the "Pruning Non-outliers" procedure;
(b) RDF < 1/(1+ε): prune all direct neighbors of x and calculate RDF for each indirect neighbor;
(c) RDF > (1+ε): x is an outlier; prune the indirect neighbors of x.
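The three cases can be captured by a small dispatch function (a sketch; the labels mirror cases (a)-(c) above, and the function name is ours):

```python
def classify(rdf_value, eps):
    """Map an RDF value to distribution case (a), (b), or (c)."""
    if rdf_value > 1 + eps:
        return "c"  # x is an outlier; prune its indirect neighbors
    if rdf_value < 1 / (1 + eps):
        return "b"  # prune direct neighbors, check each indirect neighbor
    return "a"      # density roughly constant: prune all neighbors
```

With ε = 0.8, for example, the "constant density" band (a) is roughly [0.56, 1.8].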
Finding Outliers using P-Trees
P-Tree based direct neighbors --- PDNx^r
For point x, let X = (x1, x2, ..., xn), or X = (x1,m-1 ... x1,0), (x2,m-1 ... x2,0), ..., (xn,m-1 ... xn,0), where xi,j is the jth bit value of the ith attribute. For the ith attribute,
PDNxi^r = P(x' > xi - r) AND P(x' ≤ xi + r).
For multiple attributes, PDNx^r = AND_{i=0..n-1} PDNxi^r, and |DiskNbr(x,r)| = rc(PDNx^r), where rc denotes the root count (the number of 1-bits).
P-Tree based indirect neighbors --- PINx^r
PINx^r = (OR_{q ∈ DiskNbr(x,r)} PDNq^r) AND (PDNx^r)'
Pruning is done by P-Tree ANDing according to the three distributions above:
(a), (c): PU = PU AND (PDNx^r)' AND (PINx^r)'
(b): PU = PU AND (PDNx^r)'
where PU is a P-tree representing the unprocessed data points.
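The per-attribute interval masks and their AND can be imitated with ordinary bitmaps, one bit per data point packed into a Python integer. This is only a sketch of the logic, not the compressed P-tree structure, and all names are ours; note that ANDing per-attribute intervals yields a rectangular (per-attribute) neighborhood.

```python
def attr_mask(values, lo, hi):
    """Bitmap of the points whose attribute value v satisfies lo < v <= hi,
    mimicking P(x' > xi - r) AND P(x' <= xi + r) for one attribute."""
    m = 0
    for idx, v in enumerate(values):
        if lo < v <= hi:
            m |= 1 << idx
    return m

def disk_nbr_count(x, columns, r):
    """|DiskNbr(x, r)| as the popcount (the 'rc') of the AND over all attributes."""
    mask = (1 << len(columns[0])) - 1  # start with every point present
    for xi, col in zip(x, columns):
        mask &= attr_mask(col, xi - r, xi + r)
    return bin(mask).count("1")
```

Here `columns` holds one list per attribute (the vertical layout): `columns[i][k]` is the ith attribute of the kth point.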
Pruning Non-outliers
1/(1+ε) ≤ RDF ≤ (1+ε) (density stays roughly constant): continue expanding the neighborhood by doubling the radius;
RDF < 1/(1+ε) (significant decrease in density): stop expanding, prune DiskNbr(x,kr), and call the "Finding Outliers" procedure;
RDF > (1+ε) (significant increase in density): stop expanding and call "Pruning Non-outliers" again.
Pruning is a neighborhood-expanding process: it calculates RDF between {DiskNbr(x,2kr) - DiskNbr(x,kr)} and DiskNbr(x,kr) and prunes based on the value of RDF, where k is an integer.
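The expansion loop can be sketched directly from the special-case RDF formula (Euclidean neighborhoods for simplicity; the function names and the doubling cap are our own assumptions):

```python
import math

def nbr_count(x, points, r):
    """|DiskNbr(x, r)| under Euclidean distance."""
    return sum(1 for p in points if math.dist(x, p) <= r)

def expand(x, points, r, eps, max_doublings=10):
    """Double the radius while RDF between the ring
    {DiskNbr(x, 2kr) - DiskNbr(x, kr)} and DiskNbr(x, kr) stays within
    [1/(1+eps), 1+eps]; report which procedure to call next."""
    dim = len(x)
    cur = r
    for _ in range(max_doublings):
        inner = nbr_count(x, points, cur)
        outer = nbr_count(x, points, 2 * cur)
        rdf = (outer - inner) / (inner * (2 ** dim - 1))
        if rdf < 1 / (1 + eps):
            return "find_outliers", cur      # density dropped sharply
        if rdf > 1 + eps:
            return "prune_nonoutliers", cur  # density jumped
        cur *= 2                             # density roughly constant
    return "expanded", cur
```

For a small isolated cluster, the first ring beyond the cluster is empty, so RDF falls below 1/(1+ε) and expansion stops immediately.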
Pruning Non-outliers Using P-Trees
We define ξ-neighbors: the neighbors with ξ bits of dissimilarity from x, e.g. ξ = 1, 2, ..., 8 if x is an 8-bit value.
For point x, let X = (x1, x2, ..., xn), or X = (x1,m ... x1,0), (x2,m ... x2,0), ..., (xn,m ... xn,0), where xi,j is the jth bit value of the ith attribute. For the ith attribute, the ξ-neighbors of x are calculated by
Pxi = AND_{j=ξ..m} Pxi,j, where Pxi,j = Pi,j if xi,j = 1 and Pxi,j = P'i,j if xi,j = 0.
Over all attributes, PXξ = AND_{i=0..n-1} Pxi.
The pruning is accomplished by PU = PU AND PX'ξ, where PX'ξ is the complement of PXξ.
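The bit-matching idea behind ξ-neighbors can be sketched with plain integers (a toy stand-in for the P-tree ANDs; `xi_neighbors` is our name): a point is a ξ-neighbor of x when, in every attribute, it agrees with x on all bits above the lowest ξ.

```python
def xi_neighbors(x, points, xi):
    """Points whose every attribute matches x's high-order bits,
    ignoring the xi lowest bits (xi bits of dissimilarity)."""
    return [p for p in points
            if all((pv >> xi) == (xv >> xi) for pv, xv in zip(p, x))]
```

Increasing ξ widens the neighborhood, which mirrors the radius doubling in the expansion process.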
RDF-based Outlier Detection Process
Algorithm: RDF-based Outlier Detection using P-Trees
Input: dataset X, radius r, distribution parameter ε.
Output: an outlier set Ols.
// PU --- unprocessed points represented by P-Trees
// |PU| --- number of points in PU
// PO --- outliers
// Build up P-Trees for dataset X
PU ← createP-Trees(X);
i ← 1;
WHILE |PU| > 0 DO
  x ← PU.first;  // pick an arbitrary point x
  PO ← FindOutliers(x, r, ε);
  i ← i + 1;
ENDWHILE
"Find Outliers" and "Prune Non-Outliers" Procedures

Algorithm: PruneNonOutliers
Input: point x, distribution parameter ε, dataset X
Output: pruned dataset PU
// Pi,j is the P-tree for the jth bit of the ith attribute of X
// PNx,ξ is the ξ-neighborhood of point x
// n is the number of attributes, m is the number of bits in each attribute
// P'i,j is the complement of Pi,j
FOR i = 1 TO n
  FOR j = 0 TO m-1
    IF xi,j = 1 THEN Pxi,j ← Pi,j ELSE Pxi,j ← P'i,j
  ENDFOR
ENDFOR
PU ← 1; PX ← 1; ξ ← 0;
DO
  FOR i = 1 TO n
    Pxi ← Pxi,m-1
    FOR j = m-2 DOWNTO ξ
      Pxi ← Pxi AND Pxi,j
    ENDFOR
    PX ← PX AND Pxi
  ENDFOR
  ξ ← ξ + 1;
  PNx,ξ ← PX;
  rdf ← (rc(PNx,ξ) - rc(PNx,ξ-1)) / (rc(PNx,ξ-1))^2
WHILE (1/(1+ε) ≤ rdf ≤ (1+ε))  // keep expanding while the density stays constant
FOR each q in {PN'x,ξ-1 AND PNx,ξ}  // points in the outermost ring
  IF rdf < 1/(1+ε) THEN
    PU ← PU AND PN'x,ξ-1;  // pruning
    FindOutliers(q, r, ε);
  ELSE IF rdf > (1+ε) THEN
    PruneNonOutliers(q, r, ε);
  ENDIF
ENDFOR
Algorithm: FindOutliers
Input: point x, radius r, distribution parameter ε
Output: pruned dataset PU
// PDN(x): direct neighbors of x
// PIN(x): indirect neighbors of x
// rdf is the relative density factor
PDN(x,r) ← P(X ≤ x+r) AND P(X > x-r);
sum ← 0;
FOR each point q in PDN(x,r)
  PN(q,r) ← P(X ≤ q+r) AND P(X > q-r);
  sum ← sum + |PN(q,r)|;
ENDFOR
rdf ← sum / |PDN(x)|^2;
CASE 1/(1+ε) ≤ rdf ≤ (1+ε):
  PU ← PU AND PDN'(x) AND PIN'(x);
  PruneNonOutliers(x, r, ε);
CASE rdf < 1/(1+ε):
  PU ← PU AND PDN'(x);
  FOR each point q in PIN(x): FindOutliers(q, r, ε); ENDFOR
CASE rdf > (1+ε):  // add point x into the outlier set Ols
  Ols ← Ols OR x;
  PU ← PU AND PIN'(x);
Experimental Study
NHL data set (1996); compared with LOF and aLOCI
LOF: the Local Outlier Factor method; aLOCI: the approximate Local Correlation Integral method
Run time comparison and scalability comparison:
Starting from 16,384 data points, RDF outperforms both methods in terms of scalability and speed.
Run Time Comparisons of LOF, aLOCI, RDF (run time in seconds)

Data Size | 256  | 1024 | 4096  | 16384  | 65536
LOF       | 0.23 | 1.92 | 38.79 | 103.19 | 1813.43
aLOCI     | 0.17 | 1.87 | 35.81 | 87.34  | 985.39
RDF       | 0.58 | 2.1  | 8.34  | 37.82  | 108.91
[Figure: Scalability Comparison of LOF, aLOCI, RDF --- run time (s) vs. data size, for the data sizes 256 to 65536 above]
References
1. V. Barnett, T. Lewis, "Outliers in Statistical Data", John Wiley & Sons.
2. E. M. Knorr, R. T. Ng, "A Unified Notion of Outliers: Properties and Computation", 3rd International Conference on Knowledge Discovery and Data Mining Proceedings, 1997, pp. 219-222.
3. E. M. Knorr, R. T. Ng, "Algorithms for Mining Distance-Based Outliers in Large Datasets", Very Large Data Bases Conference Proceedings, 1998, pp. 24-27.
4. E. M. Knorr, R. T. Ng, "Finding Intentional Knowledge of Distance-Based Outliers", Very Large Data Bases Conference Proceedings, 1999, pp. 211-222.
5. S. Ramaswamy, R. Rastogi, K. Shim, "Efficient Algorithms for Mining Outliers from Large Datasets", Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, ISSN: 0163-5808.
6. M. M. Breunig, H.-P. Kriegel, R. T. Ng, J. Sander, "LOF: Identifying Density-based Local Outliers", Proc. ACM SIGMOD 2000 Int. Conf. on Management of Data, Dallas, TX, 2000.
7. S. Papadimitriou, H. Kitagawa, P. B. Gibbons, C. Faloutsos, "LOCI: Fast Outlier Detection Using the Local Correlation Integral", 19th International Conference on Data Engineering, March 5-8, 2003, Bangalore, India.
8. A. K. Jain, M. N. Murty, P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, 31(3):264-323, 1999.
9. A. Arning, R. Agrawal, P. Raghavan, "A Linear Method for Deviation Detection in Large Databases", 2nd International Conference on Knowledge Discovery and Data Mining Proceedings, 1996, pp. 164-169.
10. S. Sarawagi, R. Agrawal, N. Megiddo, "Discovery-Driven Exploration of OLAP Data Cubes", EDBT'98.
11. Q. Ding, M. Khan, A. Roy, W. Perrizo, "The P-tree Algebra", Proceedings of the ACM SAC Symposium on Applied Computing, 2002.
12. W. Perrizo, "Peano Count Tree Technology", Technical Report NDSU-CSOR-TR-01-1, 2001.
13. M. Khan, Q. Ding, W. Perrizo, "k-Nearest Neighbor Classification on Spatial Data Streams Using P-Trees", Proc. of PAKDD 2002, Springer-Verlag LNAI 2776, 2002.
14. B. Wang, F. Pan, Y. Cui, W. Perrizo, "Efficient Quantitative Frequent Pattern Mining Using Predicate Trees", CAINE 2003.
15. F. Pan, B. Wang, Y. Zhang, D. Ren, X. Hu, W. Perrizo, "Efficient Density Clustering for Spatial Data", PKDD 2003.
Thank you!
Determination of Parameters

Determination of r:
Breunig et al. [6] show that choosing MinPts = 10-30 works well in general (the MinPts-neighborhood).
Choosing MinPts = 20, we get the average radius of the 20-neighborhood, r_average.
In our algorithm, r = r_average = 0.5.

Determination of ε:
The selection of ε is a tradeoff between accuracy and speed: the larger ε is, the faster the algorithm runs; the smaller ε is, the more accurate the results are.
We chose ε = 0.8 experimentally and obtained the same result (the same outliers) as Breunig's method, but much faster.
The results shown in the experimental part are based on ε = 0.8.