Post on 03-Jan-2016
RDF: A Density-based Outlier Detection Method Using Vertical Data Representation
Dongmei Ren, Baoying Wang, William Perrizo
North Dakota State University, U.S.A
Introduction: Related Work
Breunig et al. [6] first proposed a density-based approach to mining outliers over datasets with different densities.
Papadimitriou & Kitagawa [7] introduced the local correlation integral (LOCI) method, which is not efficient.
Contributions of this paper:
1. A relative density factor (RDF): RDF expresses the same amount of information as LOF (local outlier factor) [6] and MDEF (multi-granularity deviation factor) [7], but RDF is easier to compute.
2. An RDF-based outlier detection method: it efficiently prunes the data points which are deep in clusters and detects outliers only within the remaining small subset of the data.
3. A vertical data representation in P-trees, which further improves the efficiency of the method.
Definitions
Definition 1: Disk Neighborhood --- DiskNbr(x,r)
Given a point x and radius r, the disk neighborhood of x is defined as the set DiskNbr(x,r) = {x' ∈ X | d(x,x') ≤ r}, where d(x,x') is the distance between x and x'.
[Figure: direct and indirect disk neighbors of x]
Definition 2: Density of DiskNbr(x,r) --- Dens(x,r)
Dens(x,r) = |DiskNbr(x,r)| / r^dim, where dim is the number of dimensions.
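As a concrete illustration, Definitions 1 and 2 can be sketched in a few lines of Python. This is only a sketch under the assumption of Euclidean distance; the names `disk_nbr` and `dens` are ours, not from the paper.

```python
import math

def disk_nbr(x, points, r):
    """DiskNbr(x, r): all points within distance r of x (Definition 1)."""
    return [p for p in points if math.dist(x, p) <= r]

def dens(x, points, r):
    """Dens(x, r) = |DiskNbr(x, r)| / r^dim (Definition 2)."""
    return len(disk_nbr(x, points, r)) / r ** len(x)
```

Note that x always belongs to its own disk neighborhood, since d(x,x) = 0 ≤ r.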
Definitions (Continued)
Definition 3: Relative Density Factor (RDF) of point x with radius r --- RDF(x,r)
RDF(x,r) = AVG_{q ∈ DiskNbr(x,r)} Dens(q,r) / Dens(x,r)
RDF is used to measure outlierness. Outliers are points with high RDF values.
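Definition 3 can be sketched as follows (a self-contained, hypothetical implementation assuming Euclidean distance; function names are ours, and x counts as its own neighbor since d(x,x) = 0):

```python
import math

def disk_nbr(x, points, r):
    """All points within distance r of x."""
    return [p for p in points if math.dist(x, p) <= r]

def dens(x, points, r):
    """Neighborhood count divided by r^dim."""
    return len(disk_nbr(x, points, r)) / r ** len(x)

def rdf(x, points, r):
    """RDF(x, r): average density of x's disk neighbors over the density of x."""
    nbrs = disk_nbr(x, points, r)
    avg_nbr_dens = sum(dens(q, points, r) for q in nbrs) / len(nbrs)
    return avg_nbr_dens / dens(x, points, r)
```

A point on the sparse fringe of a cluster has neighbors denser than itself, so its RDF exceeds 1, while points deep in a cluster have RDF near or below 1.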
Special case: RDF between DiskNbr(x,r) and {DiskNbr(x,2r) - DiskNbr(x,r)}
RDF(x,r) = (|DiskNbr(x,2r)| - |DiskNbr(x,r)|) / (|DiskNbr(x,r)| × (2^dim - 1))
[Figure: direct neighbors within radius r and indirect neighbors within 2r of x]
The Proposed Outlier Detection Method
Given a dataset X, the proposed outlier detection method is processed by:
two procedures: "Prune Non-outliers" and "Find Outliers".
Our method efficiently prunes non-outliers (points deep in clusters) and then finds outliers over the remaining small subset of the data, which consists of points on cluster boundaries and genuine outliers.
[Figure: pruning non-outliers by expanding the neighborhood of a start point x through radii r, 2r, 4r, 6r]
Finding Outliers
Three possible distributions with regard to RDF:
(a) 1/(1+ε) ≤ RDF ≤ (1+ε): prune all neighbors and call the "Pruning Non-outliers" procedure;
(b) RDF < 1/(1+ε): prune all direct neighbors of x and calculate RDF for each indirect neighbor;
(c) RDF > (1+ε): x is an outlier; prune the indirect neighbors of x.
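The three cases can be captured by a small dispatch function (a sketch; the labels mirror cases (a)-(c) above, and the function name is ours):

```python
def classify(rdf_value, eps):
    """Map an RDF value to distribution case (a), (b), or (c)."""
    if rdf_value > 1 + eps:
        return "c"  # x is an outlier; prune its indirect neighbors
    if rdf_value < 1 / (1 + eps):
        return "b"  # prune direct neighbors, check each indirect neighbor
    return "a"      # density roughly constant: prune all neighbors
```

With ε = 0.8, for example, the "constant density" band (a) is roughly [0.56, 1.8].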
Finding Outliers using P-Trees
P-Tree based direct neighbors --- PDNx^r
For point x, let X = (x1, x2, ..., xn), or X = (x1,m-1 ... x1,0), (x2,m-1 ... x2,0), ..., (xn,m-1 ... xn,0), where xi,j is the jth bit value of the ith attribute. For the ith attribute,
PDNxi^r = P(x' > xi - r) AND P(x' ≤ xi + r).
For multiple attributes, PDNx^r = AND_{i=0..n-1} PDNxi^r, and |DiskNbr(x,r)| = rc(PDNx^r), where rc denotes the root count (the number of 1-bits).
P-Tree based indirect neighbors --- PINx^r
PINx^r = (OR_{q ∈ DiskNbr(x,r)} PDNq^r) AND (PDNx^r)'
Pruning is done by P-Tree ANDing according to the three distributions above:
(a), (c): PU = PU AND (PDNx^r)' AND (PINx^r)'
(b): PU = PU AND (PDNx^r)'
where PU is a P-tree representing the unprocessed data points.
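The per-attribute interval masks and their AND can be imitated with ordinary bitmaps, one bit per data point packed into a Python integer. This is only a sketch of the logic, not the compressed P-tree structure, and all names are ours; note that ANDing per-attribute intervals yields a rectangular (per-attribute) neighborhood.

```python
def attr_mask(values, lo, hi):
    """Bitmap of the points whose attribute value v satisfies lo < v <= hi,
    mimicking P(x' > xi - r) AND P(x' <= xi + r) for one attribute."""
    m = 0
    for idx, v in enumerate(values):
        if lo < v <= hi:
            m |= 1 << idx
    return m

def disk_nbr_count(x, columns, r):
    """|DiskNbr(x, r)| as the popcount (the 'rc') of the AND over all attributes."""
    mask = (1 << len(columns[0])) - 1  # start with every point present
    for xi, col in zip(x, columns):
        mask &= attr_mask(col, xi - r, xi + r)
    return bin(mask).count("1")
```

Here `columns` holds one list per attribute (the vertical layout): `columns[i][k]` is the ith attribute of the kth point.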
Pruning Non-outliers
1/(1+ε) ≤ RDF ≤ (1+ε) (density stays roughly constant): continue expanding the neighborhood by doubling the radius;
RDF < 1/(1+ε) (significant decrease in density): stop expanding, prune DiskNbr(x,kr), and call the "Finding Outliers" procedure;
RDF > (1+ε) (significant increase in density): stop expanding and call "Pruning Non-outliers" again.
Pruning is a neighborhood-expanding process: it calculates RDF between {DiskNbr(x,2kr) - DiskNbr(x,kr)} and DiskNbr(x,kr) and prunes based on the value of RDF, where k is an integer.
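The expansion loop can be sketched directly from the special-case RDF formula (Euclidean neighborhoods for simplicity; the function names and the doubling cap are our own assumptions):

```python
import math

def nbr_count(x, points, r):
    """|DiskNbr(x, r)| under Euclidean distance."""
    return sum(1 for p in points if math.dist(x, p) <= r)

def expand(x, points, r, eps, max_doublings=10):
    """Double the radius while RDF between the ring
    {DiskNbr(x, 2kr) - DiskNbr(x, kr)} and DiskNbr(x, kr) stays within
    [1/(1+eps), 1+eps]; report which procedure to call next."""
    dim = len(x)
    cur = r
    for _ in range(max_doublings):
        inner = nbr_count(x, points, cur)
        outer = nbr_count(x, points, 2 * cur)
        rdf = (outer - inner) / (inner * (2 ** dim - 1))
        if rdf < 1 / (1 + eps):
            return "find_outliers", cur      # density dropped sharply
        if rdf > 1 + eps:
            return "prune_nonoutliers", cur  # density jumped
        cur *= 2                             # density roughly constant
    return "expanded", cur
```

For a small isolated cluster, the first ring beyond the cluster is empty, so RDF falls below 1/(1+ε) and expansion stops immediately.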
Pruning Non-outliers Using P-Trees
We define ξ-neighbors: the neighbors with ξ bits of dissimilarity from x, e.g. ξ = 1, 2, ..., 8 if x is an 8-bit value.
For point x, let X = (x1, x2, ..., xn), or X = (x1,m ... x1,0), (x2,m ... x2,0), ..., (xn,m ... xn,0), where xi,j is the jth bit value of the ith attribute. For the ith attribute, the ξ-neighbors of x are calculated by
Pxi = AND_{j=ξ..m} Pxi,j, where Pxi,j = Pi,j if xi,j = 1 and Pxi,j = P'i,j if xi,j = 0.
Over all attributes, PXξ = AND_{i=0..n-1} Pxi.
The pruning is accomplished by PU = PU AND PX'ξ, where PX'ξ is the complement of PXξ.
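The bit-matching idea behind ξ-neighbors can be sketched with plain integers (a toy stand-in for the P-tree ANDs; `xi_neighbors` is our name): a point is a ξ-neighbor of x when, in every attribute, it agrees with x on all bits above the lowest ξ.

```python
def xi_neighbors(x, points, xi):
    """Points whose every attribute matches x's high-order bits,
    ignoring the xi lowest bits (xi bits of dissimilarity)."""
    return [p for p in points
            if all((pv >> xi) == (xv >> xi) for pv, xv in zip(p, x))]
```

Increasing ξ widens the neighborhood, which mirrors the radius doubling in the expansion process.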
RDF-based Outlier Detection Process
Algorithm: RDF-based Outlier Detection using P-Trees
Input: dataset X, radius r, distribution parameter ε.
Output: an outlier set Ols.
// PU --- unprocessed points represented by P-Trees
// |PU| --- number of points in PU
// PO --- outliers
// Build up P-Trees for dataset X
PU ← createP-Trees(X);
i ← 1;
WHILE |PU| > 0 DO
  x ← PU.first;  // pick an arbitrary point x
  PO ← FindOutliers(x, r, ε);
  i ← i + 1;
ENDWHILE
"Find Outliers" and "Prune Non-Outliers" Procedures

Algorithm: PruneNonOutliers
Input: point x, distribution parameter ε, dataset X
Output: pruned dataset PU
// Pi,j is the P-tree for the jth bit of the ith attribute of X
// PNx,ξ is the ξ-neighborhood of point x
// n is the number of attributes, m is the number of bits in each attribute
// P'i,j is the complement of Pi,j
FOR i = 1 TO n
  FOR j = 0 TO m-1
    IF xi,j = 1 THEN Pxi,j ← Pi,j ELSE Pxi,j ← P'i,j
  ENDFOR
ENDFOR
PU ← 1; PX ← 1; ξ ← 0;
DO
  FOR i = 1 TO n
    Pxi ← Pxi,m-1
    FOR j = m-2 DOWNTO ξ
      Pxi ← Pxi AND Pxi,j
    ENDFOR
    PX ← PX AND Pxi
  ENDFOR
  ξ ← ξ + 1;
  PNx,ξ ← PX;
  rdf ← (rc(PNx,ξ) - rc(PNx,ξ-1)) / (rc(PNx,ξ-1))^2
WHILE (1/(1+ε) ≤ rdf ≤ (1+ε))  // keep expanding while the density stays constant
FOR each q in {PN'x,ξ-1 AND PNx,ξ}  // points in the outermost ring
  IF rdf < 1/(1+ε) THEN
    PU ← PU AND PN'x,ξ-1;  // pruning
    FindOutliers(q, r, ε);
  ELSE IF rdf > (1+ε) THEN
    PruneNonOutliers(q, r, ε);
  ENDIF
ENDFOR
Algorithm: FindOutliers
Input: point x, radius r, distribution parameter ε
Output: pruned dataset PU
// PDN(x): direct neighbors of x
// PIN(x): indirect neighbors of x
// rdf is the relative density factor
PDN(x,r) ← P(X ≤ x+r) AND P(X > x-r);
sum ← 0;
FOR each point q in PDN(x,r)
  PN(q,r) ← P(X ≤ q+r) AND P(X > q-r);
  sum ← sum + |PN(q,r)|;
ENDFOR
rdf ← sum / |PDN(x)|^2;
CASE 1/(1+ε) ≤ rdf ≤ (1+ε):
  PU ← PU AND PDN'(x) AND PIN'(x);
  PruneNonOutliers(x, r, ε);
CASE rdf < 1/(1+ε):
  PU ← PU AND PDN'(x);
  FOR each point q in PIN(x): FindOutliers(q, r, ε); ENDFOR
CASE rdf > (1+ε):  // add point x into the outlier set Ols
  Ols ← Ols OR x;
  PU ← PU AND PIN'(x);
Experimental Study
NHL data set (1996); compared with LOF and aLOCI
LOF: the Local Outlier Factor method; aLOCI: the approximate Local Correlation Integral method
Run time comparison and scalability comparison:
Starting from 16,384 data points, RDF outperforms both methods in terms of scalability and speed.
Run Time Comparisons of LOF, aLOCI, RDF (run time in seconds)

Data Size | 256  | 1024 | 4096  | 16384  | 65536
LOF       | 0.23 | 1.92 | 38.79 | 103.19 | 1813.43
aLOCI     | 0.17 | 1.87 | 35.81 | 87.34  | 985.39
RDF       | 0.58 | 2.1  | 8.34  | 37.82  | 108.91
[Figure: Scalability Comparison of LOF, aLOCI, RDF --- run time (s) vs. data size, for the data sizes 256 to 65536 above]
References
1. V. Barnett, T. Lewis, "Outliers in Statistical Data", John Wiley & Sons.
2. E. M. Knorr, R. T. Ng, "A Unified Notion of Outliers: Properties and Computation", 3rd International Conference on Knowledge Discovery and Data Mining Proceedings, 1997, pp. 219-222.
3. E. M. Knorr, R. T. Ng, "Algorithms for Mining Distance-Based Outliers in Large Datasets", Very Large Data Bases Conference Proceedings, 1998, pp. 24-27.
4. E. M. Knorr, R. T. Ng, "Finding Intentional Knowledge of Distance-Based Outliers", Very Large Data Bases Conference Proceedings, 1999, pp. 211-222.
5. S. Ramaswamy, R. Rastogi, K. Shim, "Efficient Algorithms for Mining Outliers from Large Datasets", Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, ISSN: 0163-5808.
6. M. M. Breunig, H.-P. Kriegel, R. T. Ng, J. Sander, "LOF: Identifying Density-based Local Outliers", Proc. ACM SIGMOD 2000 Int. Conf. on Management of Data, Dallas, TX, 2000.
7. S. Papadimitriou, H. Kitagawa, P. B. Gibbons, C. Faloutsos, "LOCI: Fast Outlier Detection Using the Local Correlation Integral", 19th International Conference on Data Engineering, March 5-8, 2003, Bangalore, India.
8. A. K. Jain, M. N. Murty, P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, 31(3):264-323, 1999.
9. A. Arning, R. Agrawal, P. Raghavan, "A Linear Method for Deviation Detection in Large Databases", 2nd International Conference on Knowledge Discovery and Data Mining Proceedings, 1996, pp. 164-169.
10. S. Sarawagi, R. Agrawal, N. Megiddo, "Discovery-Driven Exploration of OLAP Data Cubes", EDBT'98.
11. Q. Ding, M. Khan, A. Roy, W. Perrizo, "The P-tree Algebra", Proceedings of the ACM SAC Symposium on Applied Computing, 2002.
12. W. Perrizo, "Peano Count Tree Technology", Technical Report NDSU-CSOR-TR-01-1, 2001.
13. M. Khan, Q. Ding, W. Perrizo, "k-Nearest Neighbor Classification on Spatial Data Streams Using P-Trees", Proc. of PAKDD 2002, Springer-Verlag LNAI 2776, 2002.
14. B. Wang, F. Pan, Y. Cui, W. Perrizo, "Efficient Quantitative Frequent Pattern Mining Using Predicate Trees", CAINE 2003.
15. F. Pan, B. Wang, Y. Zhang, D. Ren, X. Hu, W. Perrizo, "Efficient Density Clustering for Spatial Data", PKDD 2003.
Thank you!
Determination of Parameters

Determination of r:
Breunig et al. [6] show that choosing MinPts = 10-30 works well in general (the MinPts-neighborhood).
Choosing MinPts = 20, we get the average radius of the 20-neighborhood, r_average.
In our algorithm, r = r_average = 0.5.

Determination of ε:
The selection of ε is a tradeoff between accuracy and speed: the larger ε is, the faster the algorithm runs; the smaller ε is, the more accurate the results are.
We chose ε = 0.8 experimentally and obtained the same result (the same outliers) as Breunig's method, but much faster.
The results shown in the experimental part are based on ε = 0.8.