
ICST Transactions on Scalable Information Systems Research Article

Advancements of Outlier Detection: A Survey

Ji Zhang

Department of Mathematics and Computing, University of Southern Queensland, Australia

∗Corresponding author. Email: [email protected]

Abstract

Outlier detection is an important research problem in data mining that aims to discover useful abnormal and irregular patterns hidden in large datasets. In this paper, we present a survey of outlier detection techniques to reflect the recent advancements in this field. The survey will not only cover the traditional outlier detection methods for static and low-dimensional datasets but also review the more recent developments that deal with more complex outlier detection problems for dynamic/streaming and high-dimensional datasets.

Keywords: Data Mining, Outlier Detection, High-dimensional Datasets

Received on 29 December 2011; accepted on 18 February 2012; published on 04 February 2013

Copyright © 2013 Zhang, licensed to ICST. This is an open access article distributed under the terms of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/), which permits unlimited use, distribution and reproduction in any medium so long as the original work is properly cited.

doi:10.4108/trans.sis.2013.01-03.e2

1. Introduction

Outlier detection is an important research problem in data mining that aims to find objects that are considerably dissimilar, exceptional and inconsistent with respect to the majority of data in an input database [50]. Outlier detection, also known as anomaly detection in some of the literature, has become the enabling underlying technology for a wide range of practical applications in industry, business, security and engineering. For example, outlier detection can help identify suspicious fraudulent transactions for credit card companies. It can also be utilized to identify abnormal brain signals that may indicate the early development of brain cancers. Due to its inherent importance in various areas, considerable research effort in outlier detection has been expended in the past decade, and a number of outlier detection techniques using different mechanisms and algorithms have been proposed. This paper presents a comprehensive review of the major state-of-the-art outlier detection methods. We will cover the major categories of outlier detection approaches and critically evaluate their respective advantages and disadvantages.

In principle, an outlier detection technique can be considered as a mapping function f that can be expressed as f(p) → q, where q ∈ ℝ⁺. Given a data point p in the dataset, a corresponding outlier-ness score is generated by applying the mapping function f to quantitatively reflect the strength of the outlier-ness of p. Based on the mapping function f, there are typically two major tasks for the outlier detection problem to accomplish, which lead to two corresponding problem formulations: from the dataset under study, one may want to find either the top k outliers that have the highest outlier-ness scores, or all the outliers whose outlier-ness scores exceed a user-specified threshold.
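As a minimal illustration of these two formulations, the sketch below assumes an arbitrary scoring function f has already produced an outlier-ness score for each point; the function names and NumPy-based setup are illustrative rather than drawn from any particular method surveyed here.

```python
import numpy as np

def top_k_outliers(scores: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k points with the highest outlier-ness scores."""
    return np.argsort(scores)[::-1][:k]

def threshold_outliers(scores: np.ndarray, tau: float) -> np.ndarray:
    """Return the indices of all points whose outlier-ness score exceeds tau."""
    return np.flatnonzero(scores > tau)

# Example: scores produced by some mapping function f
scores = np.array([0.1, 0.3, 2.5, 0.2, 1.8])
print(top_k_outliers(scores, k=2))          # -> [2 4]
print(threshold_outliers(scores, tau=1.0))  # -> [2 4]
```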

The exact techniques or algorithms used in different outlier methods may vary significantly, depending largely on the characteristics of the datasets to be dealt with. The datasets could be static with a small number of attributes, where outlier detection is relatively easy. Nevertheless, the datasets could also be dynamic, such as data streams, and at the same time have a large number of attributes. Dealing with this kind of dataset is more complex by nature and requires special attention to the detection performance (including speed and accuracy) of the methods to be developed.

Given the abundance of research literature in the field of outlier detection, the scope of this survey will be clearly specified first in order to facilitate a systematic survey of the existing outlier detection methods. After that, we will start the survey with a review of the conventional outlier detection techniques that are primarily suitable for relatively low-dimensional static data, followed by some of the major recent advancements in outlier detection for high-dimensional static data and data streams.

2. Scope of This Survey

Before the review of outlier detection methods is presented, it is necessary to first explicitly specify the scope of this survey. There has been a great deal of research on detecting different kinds of outliers from various types of data, where the techniques that outlier detection methods utilize differ considerably. Most of the existing outlier detection methods detect the so-called point outliers from vector-like data sets; this is the focus of this review. Another common category of outliers that has been investigated is called collective outliers. Besides vector-like data, outliers can also be detected from other types of data such as sequences, trajectories and graphs. In the remainder of this section, we briefly discuss the different types of outliers.

First, outliers can be classified as point outliers and collective outliers based on the number of data instances involved in the concept of outliers.

• Point outliers. In a given set of data instances, an individual outlying instance is termed a point outlier. This is the simplest type of outlier and is the focus of the majority of existing outlier detection schemes [28]. A data point is detected as a point outlier because it displays outlier-ness in its own right, rather than together with other data points. In most cases, data are represented as vectors, as in relational databases, where each tuple contains a specific number of attributes. The principled method for detecting point outliers from vector-type data sets is to quantify, through some outlier-ness metric, the extent to which each single data point deviates from the other data in the data set.

• Collective outliers. A collective outlier represents a collection of data instances that is outlying with respect to the entire data set. An individual data instance in a collective outlier may not be an outlier by itself, but the joint occurrence as a collection is anomalous [28]. Usually, the data instances in a collective outlier are related to each other. A typical type of collective outlier is the sequence outlier, where the data are in the format of an ordered sequence.

Outliers can also be categorized into vector outliers, sequence outliers, trajectory outliers and graph outliers, etc., depending on the types of data from which outliers are detected.

• Vector outliers. Vector outliers are detected from vector-like representations of data, such as relational databases. The data are presented as tuples, and each tuple has a set of associated attributes. The data set can contain only numeric attributes, only categorical attributes, or both. Based on the number of attributes, the data set can be broadly classified as low-dimensional or high-dimensional, even though there is not a clear cutoff between these two types of data sets. As relational databases still represent the mainstream approach to data storage, vector outliers are the most common type of outliers we are dealing with.

• Sequence outliers. In many applications, data are presented as sequences. A good example of a sequence database is a computer system call log, where the computer commands executed, in a certain order, are stored. A sequence of commands in this log may look like the following: http-web, buffer-overflow, http-web, http-web, smtp-mail, ftp, http-web, ssh. An outlying sequence of commands may indicate malicious behavior that potentially compromises system security. In order to detect abnormal command sequences, normal command sequences are maintained, and those sequences that do not match any normal sequence are labeled sequence outliers. Sequence outliers are a form of collective outlier.

• Trajectory outliers. Recent improvements in satellites and tracking facilities have made it possible to collect a huge amount of trajectory data on moving objects. Examples include vehicle positioning data, hurricane tracking data, and animal movement data [65]. Unlike a vector or a sequence, a trajectory is typically represented by a set of key features for its movement, including the coordinates of the starting and ending points; the average, minimum, and maximum values of the directional vector; and the average, minimum, and maximum velocities. Based on this representation, a weighted-sum distance function can be defined to compute the difference between trajectories based on these key features [60]. A more recent work proposed a partition-and-detect framework for detecting trajectory outliers [65]. The idea of this method is to partition the whole trajectory into line segments and try to detect outlying line segments, rather than the whole trajectory. Trajectory outliers can be point outliers if we consider each single trajectory as the basic data unit in outlier detection. However, if the moving objects within the trajectory are considered, then an abnormal sequence of such moving objects (constituting a sub-trajectory) is a collective outlier.


• Graph outliers. Graph outliers represent those graph entities that are abnormal when compared with their peers. The graph entities that can become outliers include nodes, edges and sub-graphs. For example, Sun et al. investigate the detection of anomalous nodes in a bipartite graph [84][85]. Autopart detects outlier edges in a general graph [27]. Noble et al. study anomaly detection on a general graph with labeled nodes and try to identify abnormal substructures in the graph [72]. Graph outliers can be either point outliers (e.g., node and edge outliers) or collective outliers (e.g., sub-graph outliers).

Unless otherwise stated, all the outlier detection methods discussed in this review refer to methods for detecting point outliers from vector-like data sets.

3. Related Work

4. Outlier Detection Methods for Low Dimensional Data

The earlier research work in outlier detection mainly deals with static datasets with relatively low dimensions. The literature on this work can be broadly classified into four major categories based on the techniques used, i.e., statistical methods, distance-based methods, density-based methods and clustering-based methods.

4.1. Statistical Detection Methods

Statistical outlier detection methods [23, 47] rely on statistical approaches that assume a distribution or probability model to fit the given dataset. Under the distribution assumed to fit the dataset, the outliers are those points that do not agree with or conform to the underlying model of the data.

The statistical outlier detection methods can be broadly classified into two categories, i.e., parametric methods and non-parametric methods. The major difference between these two classes of methods is that parametric methods assume the underlying distribution of the given data and estimate the parameters of the distribution model from the given data [34], while non-parametric methods do not assume any knowledge of the distribution characteristics [31].

Statistical outlier detection methods (parametric and non-parametric) typically consist of two stages for detecting outliers, i.e., the training stage and the test stage.

• Training stage. The training stage mainly involves fitting a statistical model or building data profiles based on the given data. Statistical techniques can be performed in a supervised, semi-supervised, or unsupervised manner. Supervised techniques estimate the probability density for normal instances and outliers. Semi-supervised techniques estimate the probability density for either normal instances or outliers, depending on the availability of labels. Unsupervised techniques determine a statistical model or profile which fits all or the majority of the instances in the given data set;

• Test stage. Once the probabilistic model or profile is constructed, the next step is to determine if a given data instance is an outlier with respect to the model/profile or not. This involves computing the posterior probability of the test instance being generated by the constructed model, or its deviation from the constructed data profile. For example, we can find the distance of the data instance from the estimated mean and declare any point above a threshold to be an outlier [42].

Parametric Methods. Parametric statistical outlier detection methods explicitly assume the probabilistic or distribution model(s) for the given data set. Model parameters can be estimated using the training data based upon the distribution assumption. The major parametric outlier detection methods include Gaussian model-based and regression model-based methods.

A. Gaussian Models

Detecting outliers based on Gaussian distribution models has been intensively studied. The training stage typically performs estimation of the mean and variance (or standard deviation) of the Gaussian distribution using Maximum Likelihood Estimates (MLE). To ensure that the distribution assumed by human users is the optimal or close-to-optimal underlying distribution for the data, statistical discordancy tests are normally conducted in the test stage [23][16][18]. So far, over one hundred discordancy/outlier tests have been developed for different circumstances, depending on the parameters of the dataset (such as the assumed data distribution), the parameters of the distribution (such as mean and variance), and the expected number of outliers [50][58]. The rationale is that the small portion of points that have a small probability of occurrence in the population are identified as outliers. The commonly used outlier tests for normal distributions are the mean-variance test and the box-plot test [66][49][83][44]. In the mean-variance test for a Gaussian distribution N(µ, σ²), where the population has mean µ and variance σ², outliers can be considered to be points that lie 3 or more standard deviations (i.e., ≥ 3σ) away from the mean [41]. This test is general and can be applied to some other commonly used distributions, such as the Student t-distribution and the Poisson distribution, which feature a fatter tail and a longer right tail than a normal distribution, respectively. The box-plot test draws on the box plot to graphically depict the distribution of data using five major attributes, i.e., smallest non-outlier observation (min), lower quartile (Q1), median, upper quartile (Q3), and largest non-outlier observation (max). The quantity Q3 − Q1 is called the Inter-Quartile Range (IQR). The IQR provides a means to indicate the boundary beyond which the data will be labeled as outliers: a data instance will be labeled as an outlier if it is located more than 1.5 × IQR below Q1 or more than 1.5 × IQR above Q3.
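Both tests are simple enough to sketch directly. The snippet below is a minimal illustration assuming a univariate sample; the parameter estimates are plain MLE-style sample statistics, and the 3σ and 1.5 × IQR cutoffs follow the rules just described.

```python
import numpy as np

def mean_variance_outliers(x: np.ndarray, n_sigma: float = 3.0) -> np.ndarray:
    """Flag points lying n_sigma or more standard deviations from the mean."""
    mu, sigma = x.mean(), x.std()          # MLE estimates under a Gaussian model
    return np.abs(x - mu) >= n_sigma * sigma

def box_plot_outliers(x: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Flag points more than k*IQR below Q1 or more than k*IQR above Q3."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

x = np.append(np.random.normal(0, 1, 1000), [8.0, -9.0])  # two planted outliers
print(x[mean_variance_outliers(x)])
print(x[box_plot_outliers(x)])
```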

In some cases, a mixture of probabilistic models may be used if a single model is not sufficient for the purpose of data modeling. If labeled data are available, two separate models can be constructed, one for the normal data and another for the outliers. The membership probability of new instances can be quantified, and they are labeled as outliers if their membership probability under the outlier model is higher than that under the model of the normal data. The mixture of probabilistic models can also be applied to unlabeled data; that is, the whole training data are modeled using a mixture of models. A test instance is considered to be an outlier if it does not belong to any of the constructed models.

B. Regression Models

If the probabilistic model is unknown, regression can be employed for model construction. Regression analysis aims to find the dependence of one or more random variables Y on another variable or variables X. This involves examining the conditional probability distribution Y|X. Outlier detection using regression techniques is intensively applied to time-series data [4][2][39][1][64]. The training stage involves constructing a regression model that fits the data. The regression model can be either linear or non-linear, depending on the user's choice. The test stage evaluates each data instance against the regression model. More specifically, such a test involves comparing the actual instance value with its projected value produced by the regression model. A data point is labeled as an outlier if a remarkable deviation occurs between the actual value and its expected value produced by the regression model.

Generally speaking, there are two ways to use the data in the dataset for building the regression model for outlier detection, namely the reverse search and direct search methods. The reverse search method constructs the regression model using all available data, and then the data with the greatest error are considered as outliers and excluded from the model. The direct search approach constructs a model based on a portion of the data and then adds new data points incrementally after the preliminary model construction has finished. The model is then extended by adding the most fitting data, i.e., those objects in the rest of the population that have the least deviation from the model constructed thus far. The data added to the model in the last round, considered to be the least fitting data, are regarded as outliers.
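As a small illustration of the reverse-search idea, the sketch below fits a least-squares line to all the data and flags the points with the largest residuals; the linear model and the top-r cutoff are illustrative assumptions, not a prescription from the surveyed work.

```python
import numpy as np

def reverse_search_outliers(x: np.ndarray, y: np.ndarray, r: int) -> np.ndarray:
    """Fit a regression model on all data; the r worst-fitting points are outliers."""
    coeffs = np.polyfit(x, y, deg=1)              # simple linear regression
    residuals = np.abs(y - np.polyval(coeffs, x))  # deviation from the model
    return np.argsort(residuals)[::-1][:r]         # indices of largest deviations
```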

Non-parametric Methods. The outlier detection techniques in this category do not make any assumptions about the statistical distribution of the data. The most popular approaches for outlier detection in this category are histogram and Kernel density function methods.

A. Histograms

The most popular non-parametric statistical technique is to use histograms to maintain a profile of the data. Histogram techniques by nature are based on the frequency or counting of data.

The histogram-based outlier detection approach is typically applied when the data has a single feature. Mathematically, a histogram for a feature of the data consists of a number of disjoint bins (or buckets), and each data point is mapped into one (and only one) bin. Represented graphically by the histogram graph, the height of each bin corresponds to the number of observations that fall into it. Thus, if we let n be the total number of instances, k be the total number of bins and m_i be the number of data points in the ith bin (1 ≤ i ≤ k), the histogram satisfies the condition n = ∑_{i=1}^{k} m_i. The training stage involves building histograms based on the different values taken by that feature in the training data.

The histogram techniques typically define a measure between a new test instance and the histogram-based profile to determine if it is an outlier or not. The measure is defined based on how the histogram is constructed in the first place. Specifically, there are three possible ways to build the histogram:

1. The histogram can be constructed only from normal data. In this case, the histogram only represents the profile of normal data. The test stage evaluates whether the feature value of the test instance falls in any of the populated bins of the constructed histogram. If not, the test instance is labeled as an outlier [5][54][48];

2. The histogram can be constructed only from outliers. As such, the histogram captures the profile of outliers. A test instance that falls into one of the populated bins is labeled as an outlier [32]. Such techniques are particularly popular in the intrusion detection community [34][38][30] and in fraud detection [40];

3. The histogram can be constructed from a mixture of normal data and outliers. This is the typical case where the histogram is constructed. Since normal data typically dominate the whole data set, the histogram represents an approximated profile of normal data. The sparsity of a bin in the histogram can be defined as the ratio of the frequency of this bin against the average frequency of all the bins in the histogram. A bin is considered sparse if this ratio is lower than a user-specified threshold. All the data instances falling into sparse bins are labeled as outliers.

The first and second ways of constructing the histogram, as presented above, rely on the availability of labeled instances, while the third one does not.

For multivariate data, a common approach is to construct feature-wise histograms. In the test stage, the probability for each feature value of the test data is calculated and then aggregated to generate the so-called outlier score. A low probability value corresponds to a higher outlier score for that test instance. The aggregation of per-feature likelihoods for calculating the outlier score is typically done using the following equation:

Outlier_Score = ∑_{f ∈ F} w_f · (1 − p_f) / |F|

where w_f denotes the weight assigned to feature f, p_f denotes the probability for the value of feature f, and F denotes the set of features of the dataset. Such histogram-based aggregation techniques have been used in intrusion detection in system call data [35], fraud detection [40], damage detection in structures [67][70][71], network intrusion detection [90][91], web-based attack detection [63], Packet Header Anomaly Detection (PHAD), Application Layer Anomaly Detection (ALAD) [69], and NIDES (by SRI International) [5][12][79]. Also, a substantial amount of research has been done in the field of outlier detection for sequential data (primarily to detect intrusions in computer system call data) using histogram-based techniques. These techniques are fundamentally similar to the instance-based histogram approaches described above but are applied to sequential data to detect collective outliers.
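A small sketch of this feature-wise scheme follows, assuming equal feature weights and histograms built from (predominantly normal) training data; the bin count and the uniform weighting are illustrative choices rather than prescriptions from the surveyed papers.

```python
import numpy as np

def fit_histograms(train: np.ndarray, bins: int = 10):
    """Build one histogram per feature; return bin edges and per-bin probabilities."""
    edges, probs = [], []
    for j in range(train.shape[1]):
        counts, e = np.histogram(train[:, j], bins=bins)
        edges.append(e)
        probs.append(counts / counts.sum())
    return edges, probs

def outlier_score(x: np.ndarray, edges, probs) -> float:
    """Aggregate per-feature (1 - p_f) with equal weights w_f = 1."""
    score = 0.0
    for j, v in enumerate(x):
        i = np.searchsorted(edges[j], v, side="right") - 1
        if i == len(probs[j]) and v == edges[j][-1]:
            i -= 1                 # np.histogram puts the top edge in the last bin
        # Values outside the training range get p_f = 0 (maximum surprise).
        p = probs[j][i] if 0 <= i < len(probs[j]) else 0.0
        score += (1.0 - p)
    return score / len(x)
```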

Histogram-based detection methods are simple to implement and hence are quite popular in domains such as intrusion detection. But one key shortcoming of such techniques for multivariate data is that they are not able to capture the interactions between different attributes. An outlier might have attribute values that are individually very frequent, but whose combination is very rare. This shortcoming becomes more salient when the dimensionality of the data is high. A feature-wise histogram technique will not be able to detect such kinds of outliers. Another challenge for such techniques is that users need to determine an optimal size of the bins to construct the histogram.

B. Kernel Functions

Another popular non-parametric approach for outlier detection is Parzen window estimation, due to Parzen [76]. This involves using Kernel functions to approximate the actual density distribution. A new instance which lies in the low-probability area of this density is declared to be an outlier.

Formally, if x1, x2, . . . , xN are IID (independently and identically distributed) samples of a random variable x, then the Kernel density approximation of its probability density function (pdf) is

f_h(x) = (1 / Nh) ∑_{i=1}^{N} K((x − x_i) / h)

where K is the Kernel function and h is the bandwidth (smoothing parameter). Quite often, K is taken to be a standard Gaussian function with mean µ = 0 and variance σ² = 1:

K(x) = (1 / √(2π)) e^{−x²/2}
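As a rough illustration of this estimator, the sketch below computes the Gaussian-kernel density estimate directly from the formula above and flags low-density points; the bandwidth choice and the bottom-quantile density threshold are illustrative assumptions.

```python
import numpy as np

def kde(x_eval: np.ndarray, samples: np.ndarray, h: float) -> np.ndarray:
    """Gaussian-kernel density estimate f_h at each evaluation point."""
    u = (x_eval[:, None] - samples[None, :]) / h     # (n_eval, N) scaled diffs
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)     # standard Gaussian kernel
    return k.sum(axis=1) / (len(samples) * h)        # (1/Nh) * sum of kernels

samples = np.random.normal(0, 1, 500)
density = kde(samples, samples, h=0.3)
threshold = np.quantile(density, 0.01)   # bottom 1% of density flagged as outliers
print(samples[density <= threshold])
```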

Novelty detection using Kernel functions is presented in [17] for detecting novelties in oil flow data. A test instance is declared to be novel if it belongs to the low-density area of the learnt density function. A similar application of Parzen windows is proposed for network intrusion detection [29] and for mammographic image analysis [86]. A semi-supervised probabilistic approach is proposed to detect novelties in [31], where Kernel functions are used to estimate the probability distribution function (pdf) for the normal instances. Recently, Kernel functions have been used for outlier detection in sensor networks [80][25].

Kernel density estimation of the pdf is applicable to both univariate and multivariate data. However, the pdf estimation for multivariate data is much more computationally expensive than for univariate data. This renders Kernel density estimation methods rather inefficient for outlier detection in high-dimensional data.

Advantages and Disadvantages of Statistical Methods. Statistical outlier detection methods feature some advantages. They are mathematically justified, and if a probabilistic model is given, the methods are very efficient and it is possible to reveal the meaning of the outliers found [75]. In addition, the constructed model, often presented in a compact form, makes it possible to detect outliers without storing the original datasets, which are usually of large size.

However, statistical outlier detection methods, particularly the parametric methods, suffer from some key drawbacks. First, they are typically not applicable in a multi-dimensional scenario because most distribution models apply to the univariate feature space. Thus, they are unsuitable even for moderately multi-dimensional data sets. This greatly limits their applicability, as in most practical applications the data is multi-dimensional or even high-dimensional. In addition, a lack of prior knowledge regarding the underlying distribution of the dataset makes distribution-based methods difficult to use in practical applications. A single distribution may not model the entire data, because the data may originate from multiple distributions. Finally, the quality of the results cannot be guaranteed, because it is largely dependent on the distribution chosen to fit the data. It is not guaranteed that the data being examined fit the assumed distribution if there is no estimate of the distribution density based on the empirical data. Constructing such tests for hypothesis verification in complex combinations of distributions is a nontrivial task in any case. Even if the model is properly chosen, finding the values of the parameters requires complex procedures. From the above discussion, we can see that statistical methods are rather limited for large real-world databases, which typically have many different fields, where it is not easy to characterize the multivariate distribution of exemplars.

Non-parametric statistical methods, such as histogram and Kernel function methods, do not suffer from the distribution assumption problem of the parametric methods, and both can deal with data streams containing continuously arriving data. However, they are not appropriate for handling high-dimensional data. Histogram methods are effective for single-feature analysis, but they lose much of their effectiveness for multi- or high-dimensional data because they lack the ability to analyze multiple features simultaneously. This prevents them from detecting subspace outliers. Kernel function methods are likewise appropriate only for relatively low-dimensional data. When the dimensionality of the data is high, density estimation using Kernel functions becomes rather computationally expensive, making it inappropriate for handling high-dimensional data streams.

4.2. Distance-based Methods

There have already been a number of different ways of defining outliers from the perspective of distance-related metrics. Most existing metrics used in distance-based outlier detection techniques are defined based upon the concepts of local neighborhood or the k nearest neighbors (kNN) of the data points. The notion of distance-based outliers does not assume any underlying data distribution and generalizes many concepts from distribution-based methods. Moreover, distance-based methods scale better to multi-dimensional space and can be computed much more efficiently than statistical methods.

In distance-based methods, the distance between data points needs to be computed. We can use any of the L_p metrics, such as the Manhattan or Euclidean distance, for measuring the distance between a pair of points. Alternatively, for some other application domains where categorical data are present (e.g., text documents), non-metric distance functions can also be used, making the distance-based definition of outliers very general. Data normalization is normally carried out to normalize the different scales of data features before outlier detection is performed.

A. Local Neighborhood Methods

The first notion of distance-based outliers, called the DB(k, λ)-Outlier, is due to Knorr and Ng [58]. It is defined as follows: a point p in a data set is a DB(k, λ)-Outlier, with respect to the parameters k and λ, if no more than k points in the data set are at a distance λ or less (i.e., within the λ-neighborhood) from p. This definition of outliers is intuitively simple and straightforward. The major disadvantage of this method, however, is its sensitivity to the parameter λ, which is difficult to specify a priori. As we know, when the data dimensionality increases, it becomes increasingly difficult to specify an appropriate circular local neighborhood (delimited by λ) for the outlier-ness evaluation of each point, since most of the points are likely to lie in a thin shell about any point [19]. Thus, a λ that is too small will cause the algorithm to detect all points as outliers, whereas no point will be detected as an outlier if λ is chosen too large. In other words, one needs to choose an appropriate λ with a very high degree of accuracy in order to find a modest number of points that can be defined as outliers.

To facilitate the choice of parameter values, this first local neighborhood distance-based outlier definition was extended to the so-called DB(pct, dmin)-Outlier, which defines an object in a dataset as a DB(pct, dmin)-Outlier if at least pct% of the objects in the dataset have a distance larger than dmin from this object [59][60]. Similar to the DB(k, λ)-Outlier, this method essentially delimits the local neighborhood of data points using the parameter dmin and measures the outlier-ness of a data point based on the percentage, instead of the absolute number, of data points falling into this specified local neighborhood. As pointed out in [56] and [57], DB(pct, dmin) is quite general and is able to unify the existing statistical detection methods using discordancy tests for outlier detection. For example, DB(pct, dmin) unifies the definition of outliers using a normal distribution-based discordancy test with pct = 0.9988 and dmin = 0.13. The specification of pct is obviously more intuitive and easier than the specification of k in DB(k, λ)-Outliers [59]. However, the DB(pct, dmin)-Outlier suffers from a similar problem as the DB(k, λ)-Outlier in specifying the local neighborhood parameter dmin.

To efficiently calculate the number (or percentage) of data points falling into the local neighborhood of each point, three classes of algorithms have been presented, i.e., the nested-loop, index-based and cell-based algorithms. For ease of presentation, these three algorithms are discussed in the context of detecting DB(k, λ)-Outliers.

The nested-loop algorithm uses two nested loops to compute DB(k, λ)-Outliers. The outer loop considers each point in the dataset, while the inner loop computes, for each point in the outer loop, the number (or percentage) of points in the dataset falling into the specified λ-neighborhood. This algorithm has the advantage that it does not require any indexing structure to be constructed, which may be rather expensive most of the time, though it has a quadratic complexity with respect to the number of points in the dataset.
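A direct sketch of this nested-loop scheme is given below; it is a straightforward O(N²) rendering of the definition, with an early break once the neighbor budget k is exceeded (an obvious but inessential optimization).

```python
import numpy as np

def db_outliers(data: np.ndarray, k: int, lam: float) -> list[int]:
    """DB(k, lam)-Outliers: points with at most k other points within distance lam."""
    n = len(data)
    outliers = []
    for i in range(n):                      # outer loop: candidate point
        neighbors = 0
        for j in range(n):                  # inner loop: count lam-neighbors
            if i != j and np.linalg.norm(data[i] - data[j]) <= lam:
                neighbors += 1
                if neighbors > k:           # early exit: i cannot be an outlier
                    break
        if neighbors <= k:
            outliers.append(i)
    return outliers
```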

The index-based algorithm calculates the number of points belonging to the λ-neighborhood of each data point by intensively using a pre-constructed multi-dimensional index structure, such as the R*-tree [22], to facilitate kNN search. The complexity of the algorithm is approximately logarithmic with respect to the number of data points in the dataset. However, the construction of index structures is sometimes very expensive, and the quality of the constructed index structure is not easy to guarantee.

In the cell-based algorithm, the data space is partitioned into cells and all the data points are mapped into cells. By means of the cell size, which is known a priori, estimates of the pair-wise distances of data points are developed, whereby heuristics (pruning properties) are presented to achieve fast outlier detection. It is shown that three passes over the dataset are sufficient for constructing the desired partition. More precisely, the d-dimensional space is partitioned into cells with side length λ / (2√d). Thus, the distance between points in any two neighboring cells is guaranteed to be at most λ. As a result, if for a cell the total number of points in the cell and its neighbors is greater than k, then none of the points in the cell can be outliers. This property is used to eliminate the vast majority of points that cannot be outliers. Also, points belonging to cells that are more than 3 cells apart are more than a distance λ apart. As a result, if the number of points contained in all cells that are at most 3 cells away from a given cell is less than k, then all points in the cell are definitely outliers. Finally, for those points belonging to a cell that cannot be categorized as containing only outliers or only non-outliers, only points from neighboring cells that are at most 3 cells away need to be considered in order to determine whether or not they are outliers. Based on the above properties, the authors propose a three-pass algorithm for computing outliers in large databases. The time complexity of this cell-based algorithm is O(c^d + N), where c is a number that is inversely proportional to λ. This complexity is linear in the dataset size N but exponential in the number of dimensions d. As a result, due to the exponential growth in the number of cells as the number of dimensions increases, the cell-based algorithm starts to perform worse than the nested-loop algorithm for datasets with dimensionality of 4 or higher.
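The cell-level pruning can be sketched as follows. The snippet illustrates only the first pruning property (a cell plus its immediate neighbors holding more than k points rules out every point in that cell); the full three-pass algorithm with the 3-cell-away bound is omitted, and the helper names are illustrative.

```python
import numpy as np
from collections import Counter
from itertools import product

def prune_non_outlier_cells(data: np.ndarray, k: int, lam: float) -> set:
    """Cells whose points can all be ruled out as DB(k, lam)-Outlier candidates."""
    d = data.shape[1]
    side = lam / (2 * np.sqrt(d))              # cell side length lam / (2*sqrt(d))
    cells = Counter(tuple(c) for c in np.floor(data / side).astype(int))
    pruned = set()
    for cell, count in cells.items():
        # Count points in the cell and all immediately neighboring cells;
        # all such points are guaranteed to be within distance lam.
        total = sum(cells.get(tuple(np.add(cell, off)), 0)
                    for off in product([-1, 0, 1], repeat=d))
        if total > k:
            pruned.add(cell)
    return pruned
```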

In [36], a similar definition of outlier is proposed. It calculates the number of points falling within the w-radius of each data point and labels as outliers those points that have low neighborhood density. We consider this definition of outliers to be the same as that of the DB(k, λ)-Outlier, differing only in that this method does not present the threshold k explicitly in the definition. As the computation of the local density for each point is expensive, [36] proposes a clustering method for an efficient estimation. The basic idea of this approximation is to use the size of a cluster to approximate the local density of all the data in that cluster. It uses fixed-width clustering [36] for density estimation due to its good efficiency in dealing with large data sets.

B. kNN-distance Methods

There have also been a few distance-based outlier detection methods utilizing the k nearest neighbors (kNN) in measuring the outlier-ness of data points in the dataset. The first proposal uses the distance to the kth nearest neighbor of every point, denoted as D^k, to rank points so that outliers can be more efficiently discovered and ranked [81]. Based on the notion of D^k, the following definition of the D^k_n-Outlier is given: given k and n, a point is an outlier if the distance to its kth nearest neighbor is smaller than the corresponding value for no more than n − 1 other points. Essentially, this definition considers the top n objects having the highest D^k values in the dataset as outliers.

Similar to the computation of the DB(k, λ)-Outlier, three different algorithms, i.e., the nested-loop algorithm, the index-based algorithm, and the partition-based algorithm, are proposed to compute D^k for each data point efficiently.

The nested-loop algorithm for computing outliers simply computes, for each input point p, the distance D^k between p and its kth nearest neighbor. It then sorts the data and selects the top n points with the maximum D^k values. In order to compute D^k for each point, the algorithm scans the database once per point p. For a point p, a list of its k nearest points is maintained, and for each point q from the database which is considered, a check is made to see if the distance between p and q is smaller than the distance to the kth nearest neighbor found so far. If so, q is included in the list of the k nearest neighbors of p. Whenever the list contains more than k neighbors, the point that is furthest away from p is deleted from it. In this algorithm, since only one point is processed at a time, the database needs to be scanned N times, where N is the number of points in the database. The computational complexity is in the order of O(N²), which is rather expensive for large datasets. However, since we are only interested in the top n outliers, we can apply the following pruning optimization to early-stop the computation of D^k for a point p. Assume that during each step of the algorithm we store the top n outliers computed thus far, and let D^n_min be the minimum D^k among these top n outliers. If, during the computation of D^k for a new point p, we find that the value computed so far has fallen below D^n_min, we are guaranteed that point p cannot be an outlier, and it can be safely discarded. This is because the running estimate of D^k monotonically decreases as we examine more points, so p is guaranteed not to be one of the top n outliers.
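The sketch below renders this nested-loop computation with the pruning optimization, assuming Euclidean distance; a heap keeps the running k-nearest distances, and a point is abandoned as soon as its D^k upper bound drops below the current cutoff D^n_min.

```python
import heapq
import numpy as np

def top_n_dk_outliers(data: np.ndarray, k: int, n: int) -> list[tuple[float, int]]:
    """Return the n points with the largest D^k (distance to k-th nearest neighbor)."""
    top = []                                   # min-heap of (D^k, index), size <= n
    for i, p in enumerate(data):
        knn = []                               # max-heap (negated) of k smallest dists
        pruned = False
        for j, q in enumerate(data):
            if i == j:
                continue
            d = float(np.linalg.norm(p - q))
            if len(knn) < k:
                heapq.heappush(knn, -d)
            elif d < -knn[0]:
                heapq.heapreplace(knn, -d)
            # -knn[0] is an upper bound on D^k(p); prune once it falls below D^n_min.
            if len(knn) == k and len(top) == n and -knn[0] < top[0][0]:
                pruned = True
                break
        if not pruned:
            dk = -knn[0]
            if len(top) < n:
                heapq.heappush(top, (dk, i))
            elif dk > top[0][0]:
                heapq.heapreplace(top, (dk, i))
    return sorted(top, reverse=True)
```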

The index-based algorithm draws on an index structure such as the R*-tree [22] to speed up the computation. If we have all the points stored in a spatial index like the R*-tree, the following pruning optimization can be applied to reduce the number of distance computations. Suppose that we have computed D^k for point p by processing a portion of the input points. The value that we have is clearly an upper bound on the actual D^k of p. If the minimum distance between p and the Minimum Bounding Rectangle (MBR) of a node in the R*-tree exceeds this value at any time in the algorithm, then we can claim that none of the points in the sub-tree rooted under that node will be among the k nearest neighbors of p. This optimization enables us to prune entire sub-trees that do not contain points relevant to the kNN search for p.

The major idea underlying the partition-based algorithm is to first partition the data space, and then prune partitions as soon as it can be determined that they cannot contain outliers. The partition-based algorithm requires a pre-processing step in which the data space is split into cells, and data partitions, together with the Minimum Bounding Rectangles of the data partitions, are generated. Since n will typically be very small, this additional pre-processing step, performed at the granularity of partitions rather than points, is worthwhile, as it can eliminate a significant number of points as outlier candidates. The partition-based algorithm takes the following four steps (a sketch of the pruning in Step 3 follows the list):

1. First, a clustering algorithm, such as BIRCH, is used to cluster the data, and each cluster is treated as a separate partition;

2. For each partition P, the lower and upper bounds (denoted as P.lower and P.upper, respectively) on D^k for points in the partition are computed, such that for every point p ∈ P, we have P.lower ≤ D^k(p) ≤ P.upper;

3. The candidate partitions, i.e., the partitions containing points which are candidates for outliers, are identified. Suppose we can compute minDkDist, a lower bound on D^k for the n outliers detected so far. Then, if P.upper < minDkDist, none of the points in P can possibly be outliers, and they are safely pruned. Thus, only partitions P for which P.upper ≥ minDkDist are chosen as candidate partitions;

4. Finally, the outliers are computed from among the points in the candidate partitions obtained in Step 3. For each candidate partition P, let P.neighbors denote the neighboring partitions of P, which are all the partitions within distance P.upper from P. Points belonging to the neighboring partitions of P are the only points that need to be examined when computing D^k for each point in P.
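The partition-pruning test in Step 3 is small enough to sketch on its own. Below, partitions are assumed to carry precomputed lower/upper bounds on D^k; minDkDist is derived here from the n largest partition lower bounds (a simplification of the original bookkeeping), and a partition survives only if its upper bound reaches that cutoff. The Partition container and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Partition:
    ident: int
    lower: float   # lower bound on D^k over points in the partition
    upper: float   # upper bound on D^k over points in the partition

def candidate_partitions(parts: list[Partition], n: int) -> list[Partition]:
    """Keep only partitions that may contain one of the top-n D^k outliers."""
    # minDkDist: a lower bound on D^k of the n-th strongest outlier so far,
    # obtained here from the n largest partition lower bounds (simplified).
    lowers = sorted((p.lower for p in parts), reverse=True)
    min_dk_dist = lowers[n - 1] if len(lowers) >= n else 0.0
    return [p for p in parts if p.upper >= min_dk_dist]
```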

The D^k_n-Outlier is further extended by considering, for each point, the sum of the distances to its k nearest neighbors [10]. This extension is motivated by the fact that the definition of D^k merely considers the distance between an object and its kth nearest neighbor, entirely ignoring the distances between this object and its other k − 1 nearest neighbors. This drawback may cause D^k to fail to give an accurate measurement of the outlier-ness of data points in some cases. For a better understanding, we present an example, as shown in Figure 1, in which the same D^k value is assigned to points p1 and p2, two points with apparently rather different outlier-ness. The k − 1 nearest neighbors of p2 are populated much more densely around it than those of p1, so the outlier-ness of p2 is obviously lower than that of p1. Clearly, D^k is not robust enough in this example to accurately reveal the outlier-ness of the data points. By summing up the distances between the object and all of its k nearest neighbors, we obtain a more accurate measurement of the outlier-ness of the object, though this requires more computational effort in summing up the distances. This method is also used in [36] for anomaly detection.

The idea of the kNN-based distance metric can be extended to consider the k nearest dense regions. Recent methods in this direction are the Largest_Cluster method [61][98] and Grid-ODF [89], as discussed below.

Khoshgoftaar et al. propose a distance-based method for labeling wireless network traffic records in the data stream as either normal or intrusive [61][98]. Let d be the largest distance of an instance to the centroid of the largest cluster. Any instance or cluster that has a distance greater than αd (α ≥ 1) to the largest cluster is defined as an attack. This method is referred to as the Largest_Cluster method. It can also be used to detect outliers, taking the following steps (a sketch follows the list):


Figure 1. Points with the same D^k value but different outlier-ness

1. Find the largest cluster, i.e., the cluster with the largest number of instances, and label it as normal. Let c0 be the centroid of this cluster;

2. Sort the remaining clusters in ascending order based on the distance from their cluster centroids to c0;

3. Label as outliers all the instances that have a distance to c0 greater than αd, where α is a human-specified parameter;

4. Label all the other instances as normal.
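A compact sketch of these steps is given below, assuming cluster assignments are already available (e.g., from k-means); the per-cluster centroid computation, the reading of d as the largest within-cluster distance, and the distance-to-c0 test follow the steps above, with α left as the user parameter.

```python
import numpy as np

def largest_cluster_outliers(data: np.ndarray, labels: np.ndarray,
                             alpha: float) -> np.ndarray:
    """Flag instances farther than alpha*d from the largest cluster's centroid c0."""
    ids, counts = np.unique(labels, return_counts=True)
    largest = ids[np.argmax(counts)]            # step 1: largest cluster is normal
    c0 = data[labels == largest].mean(axis=0)   # its centroid c0
    dist = np.linalg.norm(data - c0, axis=1)
    d = dist[labels == largest].max()           # d: largest distance within cluster
    return dist > alpha * d                     # steps 3-4: outliers vs. normal
```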

When used for projected anomaly detection in high-dimensional data streams, this method suffers from the following limitations:

• First and most importantly, this method does not take into account the nature of outliers in high-dimensional data sets and is unable to explore subspaces to detect projected outliers;

• k-means clustering is used in this method as the backbone enabling technique for detecting intrusions. This poses difficulties for this method in dealing with data streams. k-means clustering requires iterative optimization of the clustering centroids to gradually achieve better clustering results. This optimization process involves multiple data scans, which is infeasible in the context of data streams;

• A strong assumption is made in this method that all the normal data will appear in a single cluster (i.e., the largest cluster), which is not properly substantiated in the paper. This assumption may be too rigid in some applications. It is possible that the normal data are distributed in two or more clusters that correspond to a few varying normal behaviors. For a simple instance, the network traffic volume is usually high during the daytime and becomes low late at night. Thus, the network traffic volume may display several clusters that represent behaviors exhibited at different times of the day. In such a case, the largest cluster is apparently not the only place where the normal cases reside;

• In this method, one needs to specify the parameter α. The method is rather sensitive to this parameter, whose best value is not at all obvious. First, the distance scale between data will be rather different in various subspaces; the distance between any pair of data points naturally increases when it is evaluated in a subspace of higher dimension, compared to a lower-dimensional subspace. Therefore, specifying an ad-hoc α value for each subspace evaluated is rather tedious and difficult. Second, α is also heavily affected by the number of clusters the clustering method produces, i.e., k. Intuitively, when the number of clusters k is small, d will become relatively large, so α should be set relatively small accordingly, and vice versa.

Recently, an extension of the notion of kNN from the k nearest objects to the k nearest dense regions, called Grid-ODF, was proposed [89]. This method employs the sum of the distances between each data point and its k nearest dense regions to rank data points. This enables the algorithm to measure the outlier-ness of data points from a more global perspective. Grid-ODF takes into account the mechanisms used in detecting both global and local outliers. In the local perspective, one examines the point's immediate neighborhood and considers it an outlier if its neighborhood density is low. The global observation considers the dense regions where the data points are densely populated in the data space. Specifically, the neighboring density of the point serves as a good indicator of its outlying degree from the local perspective. In the left sub-figure of Figure 2, two square boxes of equal size are used to delimit the neighborhoods of points p1 and p2. Because the neighboring density of p1 is less than that of p2, the outlying degree of p1 is larger than that of p2. On the other hand, the distance between a point and the dense regions reflects the similarity between this point and the dense regions. Intuitively, the larger this distance is, the more remarkably the point deviates from the main population of the data and the higher its outlying degree. In the right sub-figure of Figure 2, we can see a dense region and two outlying points, p1 and p2. Because the distance between p1 and the dense region is larger than that between p2 and the dense region, the outlying degree of p1 is larger than that of p2.

Based on the above observations, a new measurement of the outlying factor of data points, called the Outlying Degree Factor (ODF), is proposed to measure the outlier-ness of points from both the global and local perspectives. The ODF of a point p is defined as follows:

ODF(p) = k_DF(p) / NDF(p)

where k_DF(p) denotes the average distance between p and its k nearest dense cells and NDF(p) denotes the number of points falling into the cell to which p belongs.

In order to implement the computation of the ODF of points efficiently, a grid structure is used to partition the data space. The main idea of grid-based data space partitioning is to super-impose a multi-dimensional cube on the data space, with equal-volume cells. It is characterized by the following advantages. First, NDF(p) can be obtained instantly by simply counting the number of points falling into the cell to which p belongs, without the involvement of any indexing techniques. Secondly, the dense regions can be efficiently identified, so the computation of k_DF(p) can be very fast. Finally, based on the density of grid cells, we are able to select the top n outliers from a specified number of points viewed as outlier candidates, rather than from the whole dataset; the final top n outliers are selected from these candidates based on the ranking of their ODF values.

The number of outlier candidates is typically 9 or 10 times as large as the number of final outliers to be found (i.e., the top n), in order to provide a sufficiently large pool for outlier selection. Let us suppose that the size of the outlier candidate set is m · n, where m is a positive number provided by users. To generate the m · n outlier candidates, all the non-empty cells are sorted in ascending order of their densities, and the points in the first t cells of the sorted list, where t satisfies the following inequality, are selected as the m · n outlier candidates:

∑_{i=1}^{t−1} Den(C_i) ≤ m · n ≤ ∑_{i=1}^{t} Den(C_i)
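A rough sketch of the grid-based ODF computation appears below, assuming a simple density cutoff to decide which cells count as dense; the grid resolution, the dense-cell threshold, and the use of cell centers for the k_DF distance are all illustrative simplifications rather than the exact Grid-ODF procedure.

```python
import numpy as np

def odf_scores(data: np.ndarray, cells_per_dim: int, k: int,
               dense_frac: float = 0.05) -> np.ndarray:
    """ODF(p) = k_DF(p) / NDF(p) over a uniform grid (illustrative version)."""
    lo, hi = data.min(axis=0), data.max(axis=0)
    width = (hi - lo) / cells_per_dim
    idx = np.minimum(((data - lo) / width).astype(int), cells_per_dim - 1)
    cells, inverse, counts = np.unique(idx, axis=0, return_inverse=True,
                                       return_counts=True)
    ndf = counts[inverse]                          # NDF(p): density of p's own cell
    # Dense cells: those holding at least dense_frac of all points (assumption).
    dense = cells[counts >= dense_frac * len(data)]
    if len(dense) == 0:                            # fallback: treat all cells as dense
        dense = cells
    centers = lo + (dense + 0.5) * width           # represent dense cells by centers
    d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    k_df = np.sort(d, axis=1)[:, :k].mean(axis=1)  # k_DF(p): mean dist to k nearest
    return k_df / ndf                              # higher ODF -> stronger outlier
```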

The kNN-distance methods, which define as outliers the top n objects having the highest values of the corresponding outlier-ness metric, are advantageous over the local neighborhood methods in that they order the data points based on their relative ranking rather than on a distance cutoff. Since the value of n, the number of top outliers users are interested in, can be very small and is relatively independent of the underlying data set, it is easier for users to specify than the distance threshold λ.

C. Advantages and Disadvantages of Distance-based Methods

The major advantage of distance-based algorithms is that, unlike distribution-based methods, they are non-parametric and do not rely on any assumed distribution to fit the data. The distance-based definitions of outliers are fairly straightforward and easy to understand and implement.

Their major drawback is that most of them are not effective in high-dimensional space due to the curse of dimensionality, even though one can mechanically extend a distance metric, such as the Euclidean distance, to high-dimensional data. High-dimensional data in real applications are very noisy, and the abnormal deviations may be embedded in some lower-dimensional subspaces that cannot be observed in the full data space. The definitions of a local neighborhood, whether the circular neighborhood or the k nearest neighbors, do not make much sense in high-dimensional space. Since points tend to be equi-distant from each other as the number of dimensions goes up, the degrees of outlier-ness of all points become approximately identical, and no significant deviation or abnormality can be observed. Thus, none of the data points can be viewed as outliers if concepts of proximity are used to define them. In addition, neighborhood and kNN search in high-dimensional space is a non-trivial and expensive task. Straightforward algorithms, such as those based on nested loops, typically require O(N²) distance computations. This quadratic scaling means that it will be very difficult to mine outliers as we tackle increasingly larger data sets. This is a major problem for many real databases, which often contain millions of records. Thus, these approaches lack good scalability for large data sets. Finally, the existing distance-based methods are not able to deal with data streams, due to the difficulty of maintaining a data distribution in the local neighborhood or finding the kNN for the data in the stream.


Figure 2. Local and global perspectives of outlier-ness of p1 and p2

4.3. Density-based Methods

Density-based methods use more complex mechanisms to model the outlier-ness of data points than distance-based methods. They usually involve investigating not only the local density of the point being studied but also the local densities of its nearest neighbors. Thus, the outlier-ness metric of a data point is relative, in the sense that it is normally a ratio of the density of this point against the averaged densities of its nearest neighbors. Density-based methods feature a stronger modeling capability for outliers but require more expensive computation at the same time. This subsection discusses the major density-based methods: the LOF, COF, INFLO and MDEF methods.

A. LOF Method

The first major density-based formulation scheme of outliers was proposed in [21]; it is more robust than the distance-based outlier detection methods. An example is given in [21] (refer to Figure 3) showing the advantage of a density-based method over distance-based methods such as DB(k, λ)-Outlier. The dataset contains an outlier o, and C1 and C2 are two clusters with very different densities. The DB(k, λ)-Outlier method cannot distinguish o from the rest of the dataset no matter what values the parameters k and λ take, because the density of o's neighborhood is very close to that of the points in cluster C1. However, the density-based method proposed in [21] can handle this case successfully.

This density-based formulation quantifies the outlying degree of points using the Local Outlier Factor (LOF). Given the parameter MinPts, the LOF of a point p is defined as

$$LOF_{MinPts}(p) = \frac{\sum_{o \in N_{MinPts}(p)} \frac{lrd_{MinPts}(o)}{lrd_{MinPts}(p)}}{|N_{MinPts}(p)|}$$

where $|N_{MinPts}(p)|$ denotes the number of objects falling into the MinPts-neighborhood of p and $lrd_{MinPts}(p)$ denotes the local reachability density of point p, defined as the inverse of the average reachability distance based on the MinPts nearest neighbors of p, i.e.,

$$lrd_{MinPts}(p) = 1 \Big/ \left( \frac{\sum_{o \in N_{MinPts}(p)} reach\_dist_{MinPts}(p, o)}{|N_{MinPts}(p)|} \right)$$

Further, the reachability distance of point p is defined as

$$reach\_dist_{MinPts}(p, o) = \max(MinPts\_distance(o), \; dist(p, o))$$

Intuitively speaking, the LOF of an object reflects the contrast between its density and those of its neighborhood. The neighborhood is defined by the distance to the MinPts-th nearest neighbor. The local outlier factor is a mean value of the ratio of the density distribution estimate in the neighborhood of the object analyzed to the distribution densities of its neighbors [21]. The lower the density of p and/or the higher the densities of p's neighbors, the larger the value of LOF(p), which indicates that p has a higher degree of being an outlier. A similar outlier-ness metric to LOF, called OPTICS-OF, was proposed in [20].
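For illustration, here is a naive O(N^2) Python sketch of these formulas; as a simplification it assumes the MinPts-neighborhood contains exactly MinPts points (ignoring distance ties), and it is not the optimized algorithm of [21]:

```python
import numpy as np

def lof_scores(X, min_pts):
    """Naive LOF following the definitions above (O(N^2), illustration only)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    nn = np.argsort(D, axis=1)[:, 1:min_pts + 1]   # MinPts nearest neighbours of each point
    k_dist = D[np.arange(n), nn[:, -1]]            # MinPts-distance of each point
    # lrd(p) = 1 / mean_o reach_dist(p, o), with reach_dist(p, o) = max(k_dist(o), d(p, o))
    lrd = np.array([1.0 / np.maximum(k_dist[nn[p]], D[p, nn[p]]).mean() for p in range(n)])
    # LOF(p) = mean over p's neighbours o of lrd(o) / lrd(p)
    return np.array([(lrd[nn[p]] / lrd[p]).mean() for p in range(n)])

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [3.0, 3.0]])
print(lof_scores(X, min_pts=3))  # the isolated point gets a LOF well above 1
```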

Unfortunately, the LOF method requires the computation of LOF for all objects in the dataset, which is rather expensive because it requires a large number of kNN searches. The high cost of computing LOF for each data point p is caused by two factors. First, we have to find the MinPts-th nearest neighbor of p in order to specify its neighborhood, which resembles computing D^k in detecting D^k_n-Outliers. Secondly, after the MinPts-neighborhood of p has been determined, we have to further find the MinPts-neighborhood of each data point falling into the MinPts-neighborhood of p. This amounts to MinPts times the computational effort of computing D^k when detecting D^k_n-Outliers.

It is desirable to constrain the search to only the top n outliers instead of computing the LOF of every object in the database. To this end, an efficient micro-cluster-based local outlier mining algorithm was proposed in [52] to boost the efficiency of the LOF method.

LOF ranks points by only considering the neighborhood densities of the points; thus, it may miss potential outliers whose densities are close to those of their neighbors.


Figure 3. A sample dataset showing the advantage of LOF over DB(k, λ)-Outlier

Furthermore, the effectiveness of the LOF algorithm is rather sensitive to the choice of MinPts, the parameter used to specify the size of the local neighborhood.

B. COF Method

The LOF method suffers the drawback that it may miss potential outliers whose local neighborhood density is very close to that of their neighbors. To address this problem, Tang et al. proposed a Connectivity-based Outlier Factor (COF) scheme that improves the effectiveness of the LOF scheme when a pattern itself has a similar neighborhood density to an outlier [87]. In order to model the connectivity of a data point with respect to a group of its neighbors, a set-based nearest path (SBN-path) and, further, a set-based nearest trail (SBN-trail), originating from this data point, are defined. The SBN-trail starting from a point is considered to be the pattern presented by the neighbors of this point. Based on the SBN-trail, the cost of the trail, a weighted sum of the costs of all its constituting edges, is computed. The final outlier-ness metric, COF, of a point p with respect to its k-neighborhood is defined as

$$COF_k(p) = \frac{|N_k(p)| \cdot ac\_dist_{N_k(p)}(p)}{\sum_{o \in N_k(p)} ac\_dist_{N_k(o)}(o)}$$

where $ac\_dist_{N_k(p)}(p)$ is the average chaining distance from point p to the rest of its k nearest neighbors, computed as the weighted sum of the costs of the SBN-trail starting from p.

It has been shown in [87] that the COF method is able to detect outliers more effectively than the LOF method in some cases. However, COF requires more expensive computations than LOF, and its time complexity is in the order of O(N^2) for high-dimensional datasets.

C. INFLO Method

Even though LOF is able to accurately estimate the outlier-ness of data points in most cases, it fails to do so in some complicated situations. For instance, when outliers are located where the density distributions in the neighborhood differ significantly, LOF may produce a wrong estimation. An example where LOF fails to produce an accurate outlier-ness estimation is given in [53] and presented in Figure 4. In this example, point p is in fact part of a sparse cluster C2 that is near the dense cluster C1. Compared to objects q and r, p obviously displays less outlier-ness. However, if LOF is used in this case, p could be mistakenly regarded as having stronger outlier-ness than q and r.

The authors of [53] pointed out that this problem of LOF is due to an inaccurate specification of the space where LOF is applied. To solve it, an improved method, called INFLO, is proposed [53]. The idea of INFLO is that both the nearest neighbors (NNs) and the reverse nearest neighbors (RNNs) of a data point are taken into account in order to obtain a better estimation of the neighborhood's density distribution. The RNNs of an object p are those data points that have p as one of their k nearest neighbors.


By considering the symmetric neighborhood relationship of both NNs and RNNs, the space of an object influenced by other objects is well determined. This space is called the k-influence space of a data point, denoted $IS_k(p)$. The outlier-ness of a data point, called the INFLuenced Outlierness (INFLO), is then quantified. The INFLO of a data point p is defined as

$$INFLO_k(p) = \frac{den_{avg}(IS_k(p))}{den(p)}$$

INFLO is by nature very similar to LOF: with respect to a data point p, both are defined as the ratio of the average density of p's neighboring objects to p's own density. However, INFLO uses only the data points in the k-influence space of p for calculating the density ratio. Using INFLO, the density of the neighborhood is estimated more reasonably, and thus the outliers found are more meaningful.
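A rough Python sketch of this idea follows, assuming (as one common choice) that den(p) is taken as the inverse of p's k-distance and using a naive O(N^2) neighbor search; it illustrates the definition rather than the algorithm of [53]:

```python
import numpy as np

def inflo_scores(X, k):
    """Sketch of INFLO: density ratio over the k-influence space (kNN plus RNN)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    knn = [set(np.argsort(D[i])[1:k + 1]) for i in range(n)]
    rnn = [set() for _ in range(n)]
    for i in range(n):
        for j in knn[i]:
            rnn[j].add(i)                      # i is a reverse nearest neighbour of j
    den = 1.0 / np.sort(D, axis=1)[:, k]       # density ~ inverse of the k-distance
    scores = []
    for i in range(n):
        infl = knn[i] | rnn[i]                 # the k-influence space of point i
        scores.append(np.mean([den[j] for j in infl]) / den[i])
    return np.array(scores)
```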

D. MDEF Method

In [77], a new density-based outlier definition, called the Multi-granularity Deviation Factor (MDEF), is proposed. Intuitively, the MDEF at radius r for a point p_i is the relative deviation of its local neighborhood density from the average local neighborhood density in its r-neighborhood. Let n(p_i, αr) be the number of objects in the αr-neighborhood of p_i, and let n̂(p_i, r, α) be the average, over all objects p in the r-neighborhood of p_i, of n(p, αr). In the example given in Figure 5, we have n(p_i, αr) = 1 and n̂(p_i, r, α) = (1 + 6 + 5 + 1)/4 = 3.25.

The MDEF of p_i, given r and α, is defined as

$$MDEF(p_i, r, \alpha) = 1 - \frac{n(p_i, \alpha r)}{\hat{n}(p_i, r, \alpha)}$$

where α = 1/2. A number of different values are used for the sampling radius r, with the minimum and maximum values denoted by r_min and r_max. A point is flagged as an outlier if, for some r ∈ [r_min, r_max], its MDEF is sufficiently large.
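The definition translates directly into a naive counting sketch in Python (illustrative only; it is not the optimized algorithm of [77]):

```python
import numpy as np

def mdef(X, i, r, alpha=0.5):
    """Naive MDEF at radius r for point X[i], following the definition above."""
    def count(j, radius):                      # n(p_j, radius), the point itself included
        return int((np.linalg.norm(X - X[j], axis=1) <= radius).sum())
    r_neighbours = np.where(np.linalg.norm(X - X[i], axis=1) <= r)[0]
    n_i = count(i, alpha * r)                  # n(p_i, alpha*r)
    n_hat = np.mean([count(j, alpha * r) for j in r_neighbours])  # n^(p_i, r, alpha)
    return 1.0 - n_i / n_hat
```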

E. Advantages and Disadvantages of Density-based Methods

Density-based outlier detection methods are generally more effective than distance-based methods. However, to achieve this improved effectiveness, they are more complicated and computationally expensive: for each data object, they have to explore not only its local density but also those of its neighbors, and expensive kNN searches are required by all the existing methods in this category. Due to the inherent complexity and non-updatability of the outlier-ness measurements they use, LOF, COF, INFLO and MDEF cannot handle data streams efficiently.

4.4. Clustering-based Methods

The final category of outlier detection algorithms for relatively low-dimensional static data is clustering-based. Many data-mining algorithms in the literature find outliers as a by-product of clustering [6, 11, 13, 46, 101] and define outliers as points that do not lie in, or are located far apart from, any cluster. Thus, clustering techniques implicitly define outliers as the background noise of the clusters. There are numerous studies on clustering, and some of them are equipped with mechanisms to reduce the adverse effect of outliers, such as CLARANS [73], DBSCAN [37], BIRCH [101] and WaveCluster [82]. More recently, we have seen quite a few clustering techniques tailored towards subspace clustering for high-dimensional data, including CLIQUE [6] and HPStream [9].

Next, we review several major categories of clustering methods, together with an analysis of their advantages and disadvantages and of their applicability to the outlier detection problem for high-dimensional data streams.

A. Partitioning Clustering Methods

The partitioning clustering methods perform clustering by partitioning the dataset into a specific number of clusters. The number of clusters to be obtained, denoted by k, is specified by human users. These methods typically start with an initial partition of the dataset and then iteratively optimize an objective function until it reaches the optimum for the dataset. In the clustering process, either the center of a cluster (centroid-based methods) or the point located nearest to the cluster center (medoid-based methods) is used to represent a cluster. The representative partitioning clustering methods are PAM, CLARA, k-means and CLARANS.

PAM [62] uses a k-medoid method to identify the clusters. It selects k objects arbitrarily as medoids and swaps them with other objects until all k objects qualify as medoids. PAM compares an object with the entire dataset to find a medoid, so it has a slow processing time, with a complexity of O(k(N − k)^2), where N is the number of data points in the dataset and k is the number of clusters.

CLARA [62] tries to improve the efficiency of PAM. It draws a sample from the dataset and applies PAM to this sample, which is much smaller in size than the whole dataset.

k-means [68] initially chooses k data objects as seeds from the dataset; they can be chosen randomly or in such a way that the points are mutually farthest apart. It then examines each point in the dataset and assigns it to the cluster with the nearest centroid. The centroid's position is recalculated and updated the moment a point is added to the cluster, and this continues until all the points are grouped into the final clusters.


Figure 4. An example where LOF does not work

Figure 5. Definition of MDEF

The k-means algorithm is relatively scalable and efficient for processing large datasets because its computational complexity is O(nkt), where n is the total number of points, k is the number of clusters and t is the number of clustering iterations. However, because it uses a centroid to represent each cluster, k-means is unable to correctly produce clusters with large variations in size or with arbitrary shapes, and it is also very sensitive to noise and outliers in the dataset, since a small number of such data points can substantially affect the computation of the mean value the moment a new object is clustered.
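As an illustration, the following is a minimal batch (Lloyd-style) variant in Python; note that it recomputes centroids once per pass rather than after every single assignment as described above, and the random seeding is only one of several possible initializations:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal Lloyd-style k-means sketch: assign points, recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial seeds
    for _ in range(iters):
        # assign every point to its nearest centroid
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):       # converged: centroids stopped moving
            break
        centroids = new
    return labels, centroids
```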

CLARANS [73] is an improved k-medoid method based on randomized search. It begins with a random selection of k nodes and, in each of the following steps, compares each node to a specific number of its neighbors in order to find a local minimum. When one local minimum is found, CLARANS continues this process for another minimum, until a specific number of minima have been found. CLARANS has been experimentally shown to be more effective than both PAM and CLARA. However, the computational complexity of CLARANS is close to quadratic in the number of points [88], which is prohibitive for clustering large databases. Furthermore, the quality of its clustering results depends on the sampling method and, due to the nature of randomized search, is neither stable nor unique.

B. Hierarchical Clustering Methods

Hierarchical clustering methods essentially construct a hierarchical decomposition of the whole dataset. They can be further divided into two categories based on how this dendrogram is operated on to generate clusters, i.e., agglomerative methods and divisive methods. An agglomerative method begins with each point as a distinct cluster and merges the two closest clusters in each subsequent step until a stopping criterion is met.


A divisive method, on the contrary, begins with all the points in a single cluster and performs splits in each subsequent step until a stopping criterion is met. Agglomerative methods are more popular in practice. The representative hierarchical methods are MST clustering, CURE and CHAMELEON.

MST clustering [92] is a graph-based divisive clustering algorithm. Given n points, an MST (Minimum Spanning Tree) is a set of edges that connects all the points and has minimum total length. Deleting the edges with the largest lengths then generates a specified number of clusters. The overhead of MST clustering is dominated by the Euclidean MST construction, which takes O(n log n) time, so the MST algorithm can be used for scalable clustering. However, the MST algorithm only works well on clean datasets and is sensitive to outliers. The intervention of outliers, termed the "chaining effect" (a line of outliers between two distinct clusters will cause the two clusters to be marked as one), can seriously degrade the quality of the clustering results.

CURE [46] employs a novel hierarchical clustering algorithm in which each cluster is represented by a constant number of well-distributed points. A random sample drawn from the original dataset is first partitioned, and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. The multiple representative points of each cluster are picked to be as dispersed as possible and are shrunk towards the center using a pre-specified shrinking factor. At each step of the algorithm, the two clusters with the closest pair of representative points (belonging to different clusters) are merged. Using multiple points to represent a cluster enables CURE to capture the shape of clusters well, making it suitable for clusters with non-spherical shapes and wide variance in size, while the shrinking factor helps to dampen the adverse effect of outliers. Thus, CURE is more robust to outliers and identifies clusters of arbitrary shapes.

CHAMELEON [55] is a clustering technique that tries to overcome the limitation of existing agglomerative hierarchical clustering algorithms that the clustering is irreversible. It operates on a sparse graph in which nodes represent data points and weighted edges represent similarities among the data points. CHAMELEON first uses a graph-partitioning algorithm to cluster the data points into a large number of relatively small sub-clusters. It then employs an agglomerative hierarchical clustering algorithm to generate genuine clusters by progressively merging these sub-clusters. The key feature of CHAMELEON lies in its mechanism for determining the similarity between two sub-clusters during merging: its hierarchical algorithm takes into consideration both the inter-connectivity and the closeness of clusters. Therefore, CHAMELEON can dynamically adapt to the internal characteristics of the clusters being merged.

C. Density-based Clustering Methods

Density-based clustering algorithms consider normal clusters as dense regions of objects in the data space that are separated by regions of low density. Humans normally identify a cluster because it is a relatively dense region compared to its sparse neighborhood. The representative density-based clustering algorithms are DBSCAN and DENCLUE.

The key idea of DBSCAN [37] is that, for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points. DBSCAN introduces the notion of "density-reachable points" and performs clustering based on it. In DBSCAN, a cluster is a maximal set of density-reachable points w.r.t. the parameters Eps and MinPts, where Eps is the given radius and MinPts is the minimum number of points required to be in the Eps-neighborhood. Specifically, to discover clusters in the dataset, DBSCAN examines the Eps-neighborhood of each point. If the Eps-neighborhood of a point p contains more than MinPts points, a new cluster with p as a core object is created, and all the objects within this Eps-neighborhood are assigned to it. Each of these newly added points then goes through the same process to gradually grow the cluster. When the cluster can no longer grow, another core object is picked and another cluster is grown; the whole clustering process terminates when no new point can be added to any cluster. As the clusters discovered depend on the specification of the parameters, DBSCAN relies on the user's ability to select a good set of parameters. DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency [37], and it is also powerful in discovering clusters with arbitrary shapes. The drawbacks DBSCAN suffers are: (1) it is subject to the adverse "chaining effect"; (2) the two parameters used in DBSCAN, Eps and MinPts, cannot easily be decided in advance and require a tedious process of parameter tuning.
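Because DBSCAN labels the points that belong to no cluster as noise, it directly illustrates outlier detection as a by-product of clustering. Below is a small sketch using scikit-learn's implementation (where Eps and MinPts appear as the eps and min_samples parameters; the toy data are made up):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),   # a dense cluster around (0, 0)
               rng.normal(5, 0.3, (100, 2)),   # a second cluster around (5, 5)
               [[2.5, 2.5]]])                  # an isolated point between them

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
outliers = X[labels == -1]                     # DBSCAN marks noise with label -1
print(len(outliers), outliers)                 # the isolated point is reported as noise
```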

DENCLUE [51] performs clustering based on density distribution functions, a set of mathematical functions used to model the influence of each point within its neighborhood. The overall density of the data space is modeled as the sum of the influence functions of all data points, and the clusters are determined by density attractors, which are the local maxima of the overall density function. The advantages of DENCLUE are that it deals well with datasets containing large amounts of noise, and that it allows a compact description of clusters of arbitrary shape in high-dimensional datasets.


To facilitate the computation of the density function, DENCLUE makes use of a grid-like structure. Note that, even though it uses grids, DENCLUE is fundamentally different from grid-based clustering algorithms: grid-based clustering algorithms use the grid to summarize information about the data points in each cell, while DENCLUE uses the structure to effectively compute the sum of the influence functions at each data point.

D. Grid-based Clustering Methods

Grid-based clustering methods perform clustering based on a grid-like data structure, with the aim of enhancing the efficiency of clustering. They quantize the space into a finite number of cells that form a grid structure, on which all the clustering operations are performed. The main advantage of the approaches in this category is their fast processing time, which typically depends only on the number of cells in the quantized space rather than on the number of data objects. The representative grid-based clustering algorithms are STING, WaveCluster and DClust.

STING [88] divides the spatial area into rectangular cells and builds a hierarchical rectangular grid structure. It scans the dataset and computes the necessary statistical information, such as the mean, variance, minimum, maximum and type of distribution, of each cell. The hierarchical grid structure represents this statistical information at different resolutions at its different levels, and the information can be used to answer queries. The likelihood that a cell is relevant to a query at some confidence level is computed using the parameters of that cell; the likelihood can be defined as the proportion of objects in the cell satisfying the query condition. After the confidence interval is obtained, cells are labeled as relevant or irrelevant based on some confidence threshold. After examining the current layer, the clustering proceeds to the next layer and repeats the process, subsequently examining only the relevant cells instead of all cells. The process terminates when all the layers have been examined; in this way, all the regions (clusters) relevant to the query are found and returned.

WaveCluster [82] is a grid-based clustering algorithm based on the wavelet transform, a commonly used technique in signal processing. It transforms the multi-dimensional spatial data into a multi-dimensional signal, and it is able to identify dense regions in the transformed domain, which are the clusters to be found.

In DClust [95], the data space is partitioned into cells of equal size and the data points are mapped onto this grid structure. A number of representative points of the database are picked using a density criterion, and a Minimum Spanning Tree of these representative points, denoted the R-MST, is built. After the R-MST has been constructed, multi-resolution clustering can easily be achieved. Suppose a user wants to find k clusters. A graph search through the R-MST is initiated, from the largest-cost edge to the lowest-cost edge. As an edge is traversed, it is marked as deleted from the R-MST, and the number of partitions resulting from the deletion is computed; the process stops when the number of partitions reaches k. Any change in the value of k simply implies re-initiating this search-and-mark procedure on the R-MST. Once the R-MST has been divided into k partitions, this information is propagated to the original dataset so that each point is assigned to exactly one partition/cluster. DClust is equipped with robust outlier elimination mechanisms that identify and filter outliers during the various stages of the clustering process. First, DClust uses uniform random sampling to sample the large database, which is effective in ruling out the majority of outliers; hence the sample database obtained is reasonably clean. Second, DClust employs a grid structure to identify representative points, and grid cells whose density is less than a threshold are pruned; this pre-filtering step ensures that the R-MST constructed is an accurate reflection of the underlying cluster structure. Third, the clustering of representative points may cause a number of outliers in close vicinity to form a cluster; since the number of points in such outlier clusters is much smaller than in normal clusters, any small cluster of representative points is treated as an outlier cluster and eliminated. Finally, when the points in the dataset are labeled, some of them may be quite far from any representative point; DClust regards such points as outliers and filters them out of the final clustering results.

E. Advantages and Disadvantages of Clustering-based Methods

Detecting outliers by means of clustering analysis is quite intuitive and consistent with human perception of outliers. In addition, clustering is a well-established research area, and there are abundant clustering algorithms that users can choose from for performing clustering and then detecting outliers.

Nevertheless, many researchers argue that, strictly speaking, clustering algorithms should not be considered outlier detection methods, because their objective is only to group the objects in a dataset such that the clustering functions are optimized; the elimination of outliers merely serves to dampen their adverse effect on the final clustering result. This is in contrast to the various definitions of outliers used in outlier detection, which are more objective and independent of how the clusters in the input dataset are identified.


One of the major philosophies in designing new outlier detection approaches is to model and detect outliers directly, without clustering the data first. In addition, the notion of outliers in the context of clustering is essentially binary, without any quantitative indication of how outlying each object is; in many applications it is desirable that the outlier-ness of the outliers can be quantified and ranked.

5. Outlier Detection Methods for High-Dimensional Data

There are many applications in high-dimensional domains in which the data can contain dozens or even hundreds of dimensions. The outlier detection techniques reviewed in the preceding sections use various concepts of proximity to find outliers based on their relationship to the other points in the dataset. However, in high-dimensional space the data are sparse, and notions based on proximity lose most of their effectiveness. This is due to the curse of dimensionality, which renders high-dimensional data points roughly equi-distant from each other as dimensionality increases. Moreover, these techniques do not consider outliers embedded in subspaces and are not equipped with any mechanism for detecting them.

5.1. Methods for Detecting Outliers in High-dimensional Data

To address the challenges associated with high data dimensionality, two major categories of research work have been conducted. The first category of methods projects the high-dimensional data to lower-dimensional data: dimensionality reduction techniques, such as Principal Component Analysis (PCA), Independent Component Analysis (ICA) and Singular Value Decomposition (SVD), can be applied to the high-dimensional data before outlier detection is performed. Essentially, this category of methods performs feature selection and can be considered pre-processing for outlier detection. The second category of approaches is more promising yet challenging: these methods try to re-design the mechanisms that capture the proximity relationships between data points so that they remain accurate in high-dimensional space [14].

A. Sparse Cube Method. Aggarwal et al. conducted some pioneering work in high-dimensional outlier detection [15][14]. They proposed a new technique that finds outliers by observing the density distributions of projections of the data. This new definition considers a point to be an outlier if, in some lower-dimensional projection, it is located in a local region of abnormally low density. Therefore, the outliers in these lower-dimensional projections are detected by simply searching for the projections featuring low density.

To measure the sparsity of a lower-dimensional projection quantitatively, the authors proposed the so-called Sparsity Coefficient. Its computation involves a grid discretization of the data space and an assumption of normal distribution for the data in each cell of the hypercube. Each attribute of the data is divided into ϕ equi-depth ranges, so each range holds a fraction f = 1/ϕ of the data. A k-dimensional cube is then made of ranges from k different dimensions. Let N be the dataset size and n(D) denote the number of objects in a k-dimensional cube D. Under the assumption that the attributes are statistically independent, the Sparsity Coefficient S(D) of the cube D is defined as

$$S(D) = \frac{n(D) - N \cdot f^k}{\sqrt{N \cdot f^k \cdot (1 - f^k)}}$$
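The Sparsity Coefficient is straightforward to compute once the cube counts are known; a small sketch (with made-up numbers) follows:

```python
import math

def sparsity_coefficient(n_D, N, phi, k):
    """S(D) for a k-dimensional cube holding n_D of the N points, where each
    attribute is split into phi equi-depth ranges (so f = 1/phi)."""
    f_k = (1.0 / phi) ** k
    return (n_D - N * f_k) / math.sqrt(N * f_k * (1.0 - f_k))

# A 2-d cube expected to hold N * f^k = 10000 * (1/10)^2 = 100 points holds only 8:
print(sparsity_coefficient(n_D=8, N=10000, phi=10, k=2))  # strongly negative => sparse
```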

Since the Sparsity Coefficient has no closure properties, no fast subspace pruning can be performed, and the search for lower-dimensional projections becomes an NP-hard problem. Therefore, the authors employ an evolutionary algorithm to solve this problem efficiently. After the lower-dimensional projections have been found, a post-processing phase is required to map the projections back to the data points, i.e., to report all the data points contained in the abnormal projections found by the algorithm.

B. Example-based Method. Recently, an approach that uses outlier examples provided by users to detect outliers in high-dimensional space was proposed [96][97]. It adopts an "outlier examples → subspaces → outliers" manner of detection. Specifically, human users or domain experts first provide the system with a few initial outlier examples. The algorithm then finds the subspaces in which most of these examples exhibit significant outlier-ness, and finally other outliers are detected from the subspaces obtained in the previous step. This approach partitions the data space into equi-depth cells and employs the Sparsity Coefficient proposed in [14] to measure the outlier-ness of the outlier examples in each subspace of the lattice. Since it is untenable to exhaustively search the space lattice, the authors again propose evolutionary algorithms for the subspace search. The fitness of a subspace is the average Sparsity Coefficient of all the cubes in that subspace to which the outlier examples belong. All the objects contained in cubes that are at least as sparse as the cubes containing outlier examples in that subspace are reported as outliers.

However, this method is limited in that it is only able to find outliers in the subspaces where most of the given user examples are outlying significantly; it cannot detect outliers embedded in other subspaces. Its capability for effective outlier detection largely depends on the number of given examples and, more importantly, on how similar these examples are to the majority of outliers in the dataset.


Ideally, this set of user examples should be a good sample of all the outliers in the dataset. The method works poorly when the number of user examples is small and cannot provide enough clues as to where the majority of outliers in the dataset are. Providing such a good set of outlier examples is, however, a difficult task, for two reasons. First, it is not trivial to obtain a set of outlier examples for a high-dimensional dataset: due to the lack of visualization aids in high-dimensional data space, the initial outlier examples are not at all obvious unless they are detected by some other technique. Secondly, and more importantly, even when a set of outliers has been obtained, testing the representativeness of this outlier set is almost impossible. Given these two strong constraints, this approach is inadequate for detecting outliers in high-dimensional datasets: it misses the projected outliers that are not similar to the given outlier examples.

C. Outlier Detection in Subspaces. Since the outlier-ness of data points mainly appears significant in some subspaces of moderate dimensionality in high-dimensional space, and since the quality of the outliers detected varies across subspaces consisting of different combinations of dimensions, the authors in [24] employ an evolutionary algorithm for feature selection (finding optimal dimension subsets that represent the original dataset without losing information for the unsupervised learning tasks of outlier detection and clustering). This approach is a wrapper algorithm in which the dimension subsets are selected such that the quality of the outliers detected or the clusters generated is optimized. The originality of this work is to combine the evolutionary algorithm with a data visualization technique based on parallel coordinates, presenting evolution results interactively and allowing users to actively participate in the evolutionary search so as to achieve fast convergence of the algorithm.

D. Subspace Outlier Detection for Categorical Data. Das et al. study the problem of detecting anomalous records in categorical datasets [33]. They draw on a probabilistic approach to outlier detection: for each record in the dataset, the occurrence probabilities of different subsets of its attributes are investigated, and a record is labeled an outlier if the occurrence probability of the values of some of its attribute subsets is quite low. Specifically, the probability of two subsets of attributes a_t and b_t occurring together in a record, denoted by r(a_t, b_t), is quantified as

$$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)}$$
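This likelihood ratio is easy to estimate from record counts; the sketch below (with invented attribute names and toy records) flags an unusual co-occurrence by a ratio well below 1:

```python
def ratio(records, a, b):
    """Estimate r(a_t, b_t) = P(a_t, b_t) / (P(a_t) * P(b_t)) from raw counts.
    records: list of dicts; a, b: dicts of attribute -> value (the two subsets)."""
    N = len(records)
    def prob(cond):
        return sum(all(r[k] == v for k, v in cond.items()) for r in records) / N
    return prob({**a, **b}) / (prob(a) * prob(b))

# Toy data: colour and shape are strongly correlated, one record breaks the pattern.
data = ([{"colour": "red", "shape": "square"}] * 5 +
        [{"colour": "blue", "shape": "circle"}] * 5 +
        [{"colour": "red", "shape": "circle"}])
print(ratio(data, {"colour": "red"}, {"shape": "circle"}))  # ~0.31, a rare combination
```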

Due to the extremely large number of possible attribute subsets, only attribute subsets with a length not exceeding k are studied.

Because this method always evaluates pairs of attribute subsets, each of which contains at least one attribute, it misses the abnormality evaluation of 1-dimensional attribute subsets. In addition, due to the exponential growth of the number of attribute subsets w.r.t. k, the value of k is typically set small in this method. Hence, the method can only cover attribute subsets of a record no larger than 2k (it evaluates one pair of attribute subsets at a time), which limits its ability to detect records whose outlying attribute subsets are larger than 2k.

5.2. Outlying Subspace Detection for High-dimensional Data

All the outlier detection algorithms discussed so far, whether for low- or high-dimensional scenarios, invariably fall into the framework of detecting outliers in a specific data space, either the full space or a subspace. We term these methods "space → outliers" techniques. For instance, outliers are detected by first finding locally sparse subspaces [14], and the so-called Strongest/Weak Outliers are discovered by first finding the Strongest Outlying Spaces [59].

A new research problem, called outlying subspace detection for multiple or high-dimensional data, has been identified recently in [94][99][93]. The major task of outlying subspace detection is to find those subspaces (subsets of features) in which the data points of interest exhibit significant deviation from the rest of the population. The problem can be formulated as follows: given a data point or object, find the subspaces in which this data point is considerably dissimilar, exceptional or inconsistent with respect to the remaining points or objects. The points under study are called query points; they are usually the data that users are interested in or concerned with. As in [94][99], a distance threshold T is utilized to decide whether a data point deviates significantly from its neighboring points: a subspace s is called an outlying subspace of a data point p if OD_s(p) ≥ T, where OD is the outlier-ness measurement of p.
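As a concrete (if deliberately naive) illustration, the sketch below instantiates OD as the sum of distances from the query point to its k nearest neighbors within a subspace, the measure adopted by the HighDoD work discussed later, and enumerates all subspaces exhaustively; this exponential enumeration is exactly what the methods below try to avoid:

```python
import numpy as np
from itertools import combinations

def od(X, p, dims, k):
    """OD of query point X[p] in subspace `dims`: the sum of the distances to
    its k nearest neighbours in that subspace (a kNN-based outlying measure)."""
    d = np.linalg.norm(X[:, dims] - X[p, dims], axis=1)
    return np.sort(d)[1:k + 1].sum()           # skip d[p] = 0, the point itself

def outlying_subspaces(X, p, k, T):
    """Exhaustive search over all non-empty subspaces: exponential in the
    number of dimensions, shown only to make the problem statement concrete."""
    n_dims = X.shape[1]
    return [s for size in range(1, n_dims + 1)
            for s in combinations(range(n_dims), size)
            if od(X, p, list(s), k) >= T]
```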

Finding the correct subspaces in which outliers can be detected is informative and useful in many practical applications. For example, when designing a training program for an athlete, it is critical to identify the specific subspace(s) in which the athlete deviates from his or her teammates in daily training performance; knowing the specific weakness (subspace) allows a more targeted training program to be designed. In a medical system, it is useful for doctors to identify, from voluminous medical data, the subspaces in which a particular patient is found abnormal, so that a corresponding medical treatment can be provided in a timely manner.



The unique feature of the outlying subspace detection problem is that, instead of detecting outliers in specific subspaces as the classical outlier detection techniques do, it involves searching the space lattice for the subspaces in which the given data points exhibit abnormal deviations. Therefore, outlying subspace detection is called an "outlier → spaces" problem, to distinguish it from the classical outlier detection problem, which is labeled a "space → outliers" problem. It has been theoretically and experimentally shown in [94] that conventional outlier detection methods, whether for low- or high-dimensional data, cannot successfully cope with outlying subspace detection. The existing high-dimensional outlier detection techniques, i.e., those that find outliers in given subspaces, are theoretically applicable: one would detect outliers in all subspaces and search these subspaces to find the outlying subspaces of p, namely those subspaces whose outlier sets contain p. Obviously, the computational and space costs are both exponential in d, the number of dimensions of the data point, and such exhaustive space searching is prohibitively expensive in high-dimensional scenarios. In addition, these techniques usually return only the top n outliers in a given subspace, so it is impossible to check whether p is an outlier in the subspace if p is not in this top-n list. This analysis provides insight into the inherent difficulty of using existing high-dimensional outlier detection techniques to solve the new outlying subspace detection problem.

A. HighDoD

Zhang et al. proposed a dynamic subspace search algorithm, called HighDoD, to efficiently identify the outlying subspaces of given query data points [94][99]. The outlying measure, OD, is based on the sum of the distances between a point and its k nearest neighbors [10]; this measure is simple and independent of any underlying statistical and distribution characteristics of the data points. The following two heuristic pruning strategies, employing the upward and downward closure properties, are proposed to aid the search for outlying subspaces: if a point p is not an outlier in a subspace s, then it cannot be an outlier in any subspace that is a subset of s; if a point p is an outlier in a subspace s, then it will be an outlier in any subspace that is a superset of s. These two properties can be used to quickly identify the subspaces in which the point is definitely not an outlier or definitely is an outlier, and all such subspaces can be removed from further consideration in the later stages of the search. A fast dynamic subspace search algorithm with a sample-based learning process is then proposed: the learning process estimates the prior probabilities of upward and downward pruning in each layer of the space lattice. The Total Saving Factor (TSF) of each layer of subspaces in the lattice, which measures the potential saving in computation, is dynamically updated, and in each step the search is performed in the lattice layer with the highest TSF value.

However, HighDoD suffers from the following major limitations. First, HighDoD relies heavily on the closure (monotonicity) property of the outlying measurement OD to perform the fast bottom-up or top-down subspace pruning in the space lattice, which is the key technique HighDoD uses for speeding up the subspace search. Under the definition of OD, a subspace will always be more likely to be an outlying subspace than its subset subspaces, because the OD of data points naturally increases as the dimensionality of the subspaces under study goes up. Nevertheless, this may not be a very accurate measurement: the definition of a data point's outlier-ness makes more sense if it is related to the other points, meaning that the averaged level of the measurement for the other points in the same subspace should be taken into account so that the measurement becomes statistically significant. The design of a new search method is therefore desired. Secondly, HighDoD labels each subspace in a binary manner, as either an outlying subspace or a non-outlying one, and most subspaces are pruned away before their outlying measurements are actually evaluated. Thus, HighDoD cannot return a ranked list of the detected outlying subspaces, even though a ranked list would be more informative and useful than an unranked one in many cases. Finally, a user-defined cutoff is used to decide whether a subspace is outlying with respect to a query point. This parameter defines the "outlying front", the boundary between the outlying subspaces and the non-outlying ones. Unfortunately, its value cannot easily be specified, due to the lack of prior knowledge of the underlying distribution of the data points, which may be very complex in high-dimensional spaces.

B. SOF Method

In [100], a technique based on a genetic algorithm is proposed to solve the outlying subspace detection problem; it copes well with the drawbacks of the existing methods. A new metric, called the Subspace Outlying Factor (SOF), is developed for measuring the outlying degree of each data point in different subspaces.


Based on SOF, a new definition of outlying subspace, called SOF Outlying Subspaces, is proposed. Given an input dataset D and parameters n and k, a subspace s is a SOF Outlying Subspace for a given query data point p if there are no more than n − 1 other subspaces s′ such that SOF(s′, p) > SOF(s, p). This definition is equivalent to saying that the top n subspaces with the largest SOF values are considered the outlying subspaces. The parameters used in defining SOF Outlying Subspaces are easy to specify and require no prior knowledge of the data distribution of the dataset. A genetic algorithm (GA) based method is proposed for outlying subspace detection: the upward and downward closure properties are no longer required, and the detected outlying subspaces can be ranked based on their fitness-function values. The concepts of the lower and upper bounds of D^k, the distance between a given point and its k-th nearest neighbor, are also proposed; these bounds provide a quick approximation of the fitness of subspaces in the GA, yielding a significant performance boost, and a technique is proposed to compute them efficiently using the so-called kNN Look-up Table.

5.3. Clustering Algorithms for High-dimensional Data

We have witnessed some recent developments in clustering algorithms for high-dimensional data. As clustering provides a possible, even though not the best, means of detecting outliers, it is necessary to review these new developments. The representative methods for clustering high-dimensional data are CLIQUE and HPStream.

A. CLIQUE

CLIQUE [7] is a grid-based clustering method that discretizes the data space into non-overlapping rectangular units, obtained by partitioning every dimension into a specific number of intervals of equal length. A unit is dense if the fraction of the total data points contained in it exceeds a threshold, and clusters are defined as unions of connected dense units within a subspace. CLIQUE first identifies the subspaces that contain clusters, using a bottom-up algorithm that exploits the monotonicity of the clustering criterion with respect to dimensionality: if a k-dimensional unit is dense, then so are its projections in (k−1)-dimensional space. A candidate generation procedure iteratively determines the candidate k-dimensional units C_k after the (k−1)-dimensional dense units D_{k−1} have been determined, and a pass over the data identifies the dense units D_k among the candidates. A depth-first search algorithm is then used to identify clusters in each subspace: it starts with some unit u in D_k, assigns it the first cluster label, and finds all the units it is connected to; then, if there are still units in D_k that have not yet been visited, it picks one and repeats the procedure. CLIQUE automatically finds dense clusters in subspaces of a high-dimensional dataset. It produces identical results irrespective of the order in which the input data are presented, and it does not presume any specific mathematical form of data distribution. However, the accuracy of this clustering method may be degraded by its simplicity: the clusters obtained are all of rectangular shape, which is obviously not consistent with the shapes of natural clusters. In addition, the subspaces obtained depend on the choice of the density threshold; CLIQUE uses a single global density threshold for all subspaces, whose value is difficult to specify, especially for high-dimensional subspaces, due to the curse of dimensionality. Finally, the subspaces obtained are those in which dense units exist, which has nothing to do with the existence of outliers. As a result, CLIQUE is not suitable for detecting projected outliers.
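The Apriori-style candidate-generation step can be sketched as follows (a simplified illustration, with units encoded as frozensets of hypothetical (dimension, interval) pairs): a k-dimensional unit is a candidate only if every one of its (k−1)-dimensional projections was dense.

```python
from itertools import combinations

def candidate_units(dense_prev):
    """One CLIQUE-style candidate-generation step: join (k-1)-dim dense units
    into k-dim candidates whose (k-1)-dim projections are all dense."""
    dense_prev = set(dense_prev)           # frozensets of (dimension, interval) pairs
    candidates = set()
    for a in dense_prev:
        for b in dense_prev:
            u = a | b
            if (len(u) == len(a) + 1                       # joins on k-1 common pairs
                    and len({d for d, _ in u}) == len(u)   # one interval per dimension
                    and all(frozenset(c) in dense_prev
                            for c in combinations(u, len(a)))):
                candidates.add(frozenset(u))
    return candidates

# Two dense 1-dim units: interval 2 of dimension 0 and interval 5 of dimension 1.
d1 = {frozenset({(0, 2)}), frozenset({(1, 5)})}
print(candidate_units(d1))  # {frozenset({(0, 2), (1, 5)})}
```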

B. HPStream

In order to find clusters embedded in subspaces of high-dimensional data streams, a clustering method called HPStream was proposed [9]. HPStream introduces the concept of projected clustering to data streams, as significant, high-quality clusters exist only in some low-dimensional subspaces. The basic idea of HPStream is that it not only finds the clusters but also updates the set of dimensions associated with each cluster, in which a more compact cluster can be found. The number of clusters in HPStream is obtained initially through k-means clustering, and the initial set of dimensions associated with each of these k clusters is the full set of dimensions of the data stream. As more streaming data arrive, the set of dimensions of each cluster evolves so that the cluster becomes more compact, with a smaller radius.

HPStream is innovative in finding clusters embedded in subspaces of high-dimensional data streams. However, the number of subspaces it returns equals the number of clusters obtained, which is typically small. Consequently, if HPStream were applied to detect projected outliers, it would only be able to detect outliers in the returned subspaces and would miss significant portions of the outliers existing in subspaces it does not return. It is, of course, possible to increase the number of subspaces returned in order to improve the detection rate, but this implies a corresponding increase in the number of clusters; an unreasonably large number of clusters is not consistent with the formation of natural clusters and will therefore affect the detection accuracy for projected outliers.


6. Outlier Detection Methods for Data Streams

The final major category of outlier detection methods we discuss are those for handling data streams. We first discuss Incremental LOF, then the outlier detection methods for sensor networks that use kernel density functions. The incremental clustering methods that can handle continuously arriving data are covered at the end of this section.

A. Incremental LOF Method

Since the LOF method is not able to handle data streams, an incremental LOF algorithm, suited to detecting outliers in dynamic databases where data insertions and deletions occur frequently, was proposed in [78]. The incremental LOF algorithm provides detection performance equivalent to the iterated static LOF algorithm (applied after the insertion of each data record) while requiring significantly less computational time. In addition, the incremental LOF algorithm dynamically updates the profiles of data points, an appealing property, since data profiles may change over time. It is shown in [78] that the insertion of new data points and the deletion of obsolete points influence only a limited number of their nearest neighbors, so the insertion/deletion time complexity per data point does not depend on the total number of points N.

The advantage of Incremental LOF is that it can deal with data insertions and deletions efficiently. Nevertheless, it is not economical in space: its space complexity is of the order of the number of data points that have been inserted but not yet deleted. In other words, Incremental LOF has to maintain the entire data stream in order to deal with continuously arriving data, because it does not utilize any compact data summary or synopsis. This is clearly undesirable for data stream applications, which are typically subject to explicit space constraints.

B. Outlier Detection Methods for Sensor Networks

A few recent anomaly detection methods for data streams come mainly from the sensor networks domain, such as [80] and [25]. However, the major effort in these works is the development of outlier detection methods for distributed data streams; they do not deal with the problem of outlier detection in subspaces of high-dimensional data spaces. Palpanas et al. proposed one of the first outlier detection methods for distributed data streams in the context of sensor networks [80]. The authors classify the sensor nodes in the network into low-capacity and high-capacity nodes, through which a multi-resolution structure of the sensor network is created. The high-capacity nodes are equipped with relatively strong computational capabilities and can detect local outliers. A kernel density function is employed to model the local data distribution in one or more dimensions of the space. A point is detected as an outlier if the number of values that fall into its neighborhood (delimited by a sphere of radius r) is less than an application-specific threshold; the number of values in the neighborhood can be computed using the kernel density function. The authors of [25] likewise emphasize the design of distributed outlier detection methods; this work employs a number of commonly used outlier-ness metrics, such as the distance to the k-th nearest neighbor, the average distance to the k nearest neighbors, and the inverse of the number of neighbors within a specific distance. These metrics, however, are not applicable to data streams.
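A one-dimensional sketch of this kernel-density test is shown below, using SciPy's Gaussian KDE; the radius r and the neighbor threshold are application-specific, hypothetical values here:

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_outlier_flags(values, r, min_neighbours):
    """Flag x as an outlier when the KDE-estimated number of values within
    [x - r, x + r] falls below an application-specific threshold."""
    kde = gaussian_kde(values)
    n = len(values)
    est_counts = np.array([n * kde.integrate_box_1d(x - r, x + r) for x in values])
    return est_counts < min_neighbours

# Hypothetical sensor readings around 20.0, with one spurious reading of 35.0.
readings = np.concatenate([np.random.default_rng(1).normal(20.0, 0.5, 200), [35.0]])
print(readings[kde_outlier_flags(readings, r=1.0, min_neighbours=5)])  # -> [35.]
```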

C. Incremental Clustering Methods

Most clustering algorithms discussed earlier in this section assume a complete and static dataset to operate on. However, in many applications, such as data streams, new data become available continuously. With the aforementioned classical clustering algorithms, re-clustering from scratch to account for data updates is too costly and inefficient; it is highly desirable that the data be processed and clustered incrementally. The recent representative clustering algorithms with mechanisms for handling data updates are BIRCH*, STREAM and CluStream.

BIRCH* [45] is a framework for fast, scalable and incremental clustering algorithms. In the BIRCH* family of algorithms, objects are read from the database sequentially and inserted into incrementally evolving clusters, which are represented by generalized cluster features (CF*s), a condensed and summarized representation of the clusters. A new object read from the database is inserted into the closest cluster. BIRCH* organizes all clusters in an in-memory, height-balanced index tree, called the CF*-tree; for a new object, the search for an appropriate cluster then requires time logarithmic in the number of clusters, as opposed to a linear scan. CF*s are efficient for two reasons: (1) they occupy much less space than the naive representation; (2) the calculation of inter-cluster and intra-cluster measurements using the CF*s is much faster than calculations involving all the objects in the clusters. The purpose of the CF*-tree is to direct a new object to the cluster closest to it; non-leaf entries are used to guide new objects to the appropriate leaf clusters, whereas leaf entries represent the dynamically evolving clusters. However, the clustering of high-dimensional datasets has not been studied for BIRCH*. In addition, BIRCH* does not perform well when the clusters are not spherical in shape, because it relies on spherical summarization to produce the clusters.
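The cluster-feature idea can be sketched as follows, using the classic BIRCH triple (N, LS, SS) with a scalar sum of squares (BIRCH* generalizes these features); the class name is our own:

```python
import numpy as np

class ClusterFeature:
    """Sketch of a BIRCH-style cluster feature: (N, LS, SS) summarizes a cluster
    and can absorb new points incrementally without storing them."""
    def __init__(self, dim):
        self.n, self.ls, self.ss = 0, np.zeros(dim), 0.0
    def add(self, x):
        self.n += 1
        self.ls += x              # linear sum of the points
        self.ss += float(x @ x)   # sum of squared norms
    @property
    def centroid(self):
        return self.ls / self.n
    @property
    def radius(self):
        # root-mean-square distance of the points to the centroid,
        # derivable from (N, LS, SS) alone
        return float(np.sqrt(max(self.ss / self.n - self.centroid @ self.centroid, 0.0)))
```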


STREAM [74] considers the clustering of continuously arriving data and provides a clustering algorithm superior to the commonly used k-means algorithm. STREAM assumes that the data arrive in chunks X_1, X_2, ..., X_n, each of which fits into main memory. The streaming algorithm is as follows. For each chunk i, STREAM first assigns weights to the points in the chunk according to their appearance frequencies, ensuring that each point appears only once, and then clusters the chunk using a procedure called LOCALSEARCH. For each chunk, only k weighted cluster centers are retained and the whole chunk is discarded, in order to free memory for new chunks. Finally, LOCALSEARCH is applied to the weighted centers retained from X_1, X_2, ..., X_n to obtain a set of (weighted) centers for the entire stream.

In order to find clusters over different time horizons (such as the last month, last year or last decade), a new clustering method for data streams, called CluStream, is proposed in [8]. This approach provides the user with the flexibility to explore the nature of the evolution of the clusters over different time periods. In order to avoid bookkeeping the huge amount of information about the clustering results over different time horizons, CluStream divides the clustering process into an online micro-clustering component and an offline macro-clustering component. The micro-clustering phase mainly collects data statistics online for the clustering purpose. This process does not depend on any user input such as the time horizon or the required granularity of the clustering. The aim is to maintain the statistics at a sufficiently high level of granularity so that they can be effectively used by the offline components for horizon-specific macro-clustering as well as evolution analysis. The micro-clusters generated by the algorithm serve as an intermediate statistical representation which can be maintained efficiently even for a data stream of large volume. The macro-clustering process does not work on the original data stream, which may be very large in size. Instead, it uses the compactly stored summary statistics of the micro-clusters. Therefore, the macro-clustering phase is not subject to the one-pass constraint of data stream applications.
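A simplified rendering of the online micro-clustering step is sketched below. It is our illustration, not the authors' code: the real CluStream keeps per-dimension squared sums, stores micro-cluster snapshots in a pyramidal time frame, and merges the two closest micro-clusters rather than always evicting the stalest one; the names and the boundary_factor parameter here are hypothetical.

```python
import numpy as np

class MicroCluster:
    """CluStream micro-cluster: a BIRCH-style CF extended with temporal
    statistics (sum and squared sum of timestamps), so the offline phase
    can later recover statistics for a chosen time horizon.
    (Scalar squared sums are used here for brevity.)"""

    def __init__(self, x, t):
        self.n, self.ls, self.ss = 1, x.astype(float), float(np.dot(x, x))
        self.lst, self.sst = float(t), float(t * t)  # time sums

    def absorb(self, x, t):
        self.n += 1
        self.ls += x
        self.ss += float(np.dot(x, x))
        self.lst += t
        self.sst += float(t * t)

    def centroid(self):
        return self.ls / self.n

    def rms(self):
        c = self.centroid()
        return np.sqrt(max(self.ss / self.n - float(np.dot(c, c)), 0.0))

def online_update(micro_clusters, x, t, boundary_factor=2.0, q=100):
    """One online step (simplified): absorb the point into the nearest
    micro-cluster if it lies within a factor of that cluster's RMS
    boundary; otherwise start a new micro-cluster, evicting the one with
    the oldest average timestamp when the budget q is exceeded.
    Assumes `micro_clusters` is a non-empty list."""
    dists = [np.linalg.norm(x - mc.centroid()) for mc in micro_clusters]
    j = int(np.argmin(dists))
    if dists[j] <= boundary_factor * max(micro_clusters[j].rms(), 1e-9):
        micro_clusters[j].absorb(x, t)
    else:
        micro_clusters.append(MicroCluster(x, t))
        if len(micro_clusters) > q:
            stalest = min(range(len(micro_clusters)),
                          key=lambda i: micro_clusters[i].lst / micro_clusters[i].n)
            micro_clusters.pop(stalest)
```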

D. Advantages and Disadvantages of Outlier Detection for Data Streams

The methods discussed in this subsection can detect outliers from data streams. The incremental LOF method is able to deal with continuously arriving data, but it may face an explosion of space consumption. Moreover, the incremental LOF method is not able to find outliers in subspaces in an automatic manner. The outlier detection methods for sensor networks cannot find projected outliers either. Unlike the clustering methods that are only appropriate for static databases, BIRCH*, STREAM and CluStream go one step further and are able to handle continuously arriving data incrementally. Nevertheless, they are designed to use all the features of the data in detecting outliers and therefore have difficulty detecting projected outliers.

6.1. Summary

This section has presented a comprehensive survey of the major existing methods for detecting point outliers from vector-like datasets. Both the conventional outlier detection methods, which are mainly appropriate for relatively low-dimensional static databases, and the more recent methods that are able to deal with high-dimensional projected outliers or data stream applications have been discussed. To give a big picture of these methods, we present a summary in Figure 6. There, we evaluate each method against two criteria, namely whether it can detect projected outliers in a high-dimensional data space and whether it can handle data streams. The tick and cross symbols indicate respectively whether or not the corresponding method satisfies each evaluation criterion. From this summary, we can see that the conventional outlier detection methods cannot detect projected outliers embedded in different subspaces; they detect outliers only in the full data space or a given subspace. Amongst the methods that can detect projected outliers, only HPStream meets both criteria. However, being a clustering method, HPStream cannot provide satisfactory support for projected outlier detection from high-dimensional data streams.

7. Conclusions

In this paper, a comprehensive survey has been presented to review the existing methods for detecting point outliers from various kinds of vector-like datasets. The outlier detection techniques that are primarily suitable for relatively low-dimensional static data, which serve as the technical foundation for many of the methods proposed later, were reviewed first. We then reviewed some of the recent advancements in outlier detection for dealing with more complex high-dimensional static data and data streams.

It is important to be aware of the limitations of this survey. As clearly stated in Section 2, we focus only on point outlier detection methods for vector-like datasets due to the space limit. Also, outlier detection is a fast-developing field of research, and more new methods will quickly emerge in the foreseeable future. Driven by their emergence, it is believed that outlier detection techniques will play an increasingly important role in the various practical applications where they can be applied.

Figure 6. A summary of major existing outlier detection methods

References

[1] C. C. Aggarwal. On Abnormality Detection in Spuriously Populated Data Streams. SIAM International Conference on Data Mining (SDM'05), Newport Beach, CA, 2005.

[2] B. Abraham and G. E. P. Box. Bayesian analysis of someoutlier problems in time series. Biometrika 66, 2, 229-236, 1979.

[3] A. Arasu, B. Babcock, S. Babu, M. Datar, K. Ito, I. Nishizawa, J. Rosenstein and J. Widom. STREAM: The Stanford Stream Data Manager. SIGMOD'03, 2003.

[4] B. Abraham and A. Chuang. Outlier detection and timeseries modeling. Technometrics 31, 2, 241-248, 1989.

[5] D. Anderson, T. Frivold, A. Tamaru, and A. Valdes. Next-generation intrusion detection expert system (NIDES), software users manual, beta-update release. Technical Report, Computer Science Laboratory, SRI International, 1994.

[6] R. Agrawal, J. Gehrke, D. Gunopulos and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. of 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD'98), pp 94-105, 1998.

[7] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. In Proc. of ACM SIGMOD'99, Philadelphia, PA, USA, 1999.

[8] C. C. Aggarwal, J. Han, J. Wang, P. S. Yu. A Framework for Clustering Evolving Data Streams. In Proc. of 29th Very Large Data Bases (VLDB'03), pp 81-92, Berlin, Germany, 2003.

[9] C. C. Aggarwal, J. Han, J. Wang, P. S. Yu. A Framework for Projected Clustering of High Dimensional Data Streams. In Proc. of 30th Very Large Data Bases (VLDB'04), pp 852-863, Toronto, Canada, 2004.

[10] F. Angiulli and C. Pizzuti. Fast Outlier Detection in High Dimensional Spaces. In Proc. of 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'02), Helsinki, Finland, pp 15-26, 2002.

[11] C. C. Aggarwal, C. M. Procopiuc, J. L. Wolf, P. S. Yu andJ. S. Park. Fast algorithms for projected clustering. InProc. of 1999 ACM SIGMOD International Conference onManagement of Data (SIGMOD’99), pp 61-72, 1999.

[12] D. Anderson, A. Tamaru, and A. Valdes. Detecting unusual program behavior using the statistical components of NIDES. Technical Report, Computer Science Laboratory, SRI International, 1995.

[13] C. C. Aggarwal and P. Yu. Finding generalized projectedclusters in high dimensional spaces. In Proc. of 2000ACM SIGMOD International Conference on Managementof Data (SIGMOD’00), pp 70-81, 2000.

[14] C. C. Aggarwal and P. S. Yu. Outlier Detectionin High Dimensional Data. In Proc. of 2001 ACMSIGMOD International Conference on Management of Data(SIGMOD’01), Santa Barbara, California, USA, 2001.

[15] C. C. Aggarwal and P. S. Yu. An effective and efficient algorithm for high-dimensional outlier detection. VLDB Journal, 14: 211-221, 2005.

[16] V. Barnett. The ordering of multivariate data (withdiscussion). Journal of the Royal Statistical Society. SeriesA 139, 318-354, 1976.

[17] C. Bishop. Novelty detection and neural networkvalidation. In Proceedings of IEEE Vision, Image andSignal Processing, Vol. 141. 217-222, 1994.

[18] R. J. Beckman and R. D. Cook. Outliers. Technometrics 25,2, 119-149, 1983.

[19] K. Beyer, J. Goldstein, R. Ramakrishnan and U. Shaft. When is "nearest neighbor" meaningful? In Proc. of 7th International Conference on Database Theory (ICDT'99), pp 217-235, Jerusalem, Israel, 1999.

[20] M. M. Breunig, H-P Kriegel, R. T. Ng and J. Sander.OPTICS-OF: Identifying Local Outliers. PKDD’99, 262-270, 1999.

[21] M. M. Breunig, H-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying Density-Based Local Outliers. In Proc. of 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD'00), Dallas, Texas, pp 93-104, 2000.

[22] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. In Proc. of 1990 ACM SIGMOD International Conference on Management of Data (SIGMOD'90), pp 322-331, Atlantic City, NJ, 1990.

[23] V. Barnett and T. Lewis. Outliers in Statistical Data. JohnWiley, 3rd edition, 1994.

[24] L. Boudjeloud and F. Poulet. Visual Interactive Evolutionary Algorithm for High Dimensional Data Clustering and Outlier Detection. In Proc. of 9th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD'05), Hanoi, Vietnam, pp 426-431, 2005.

[25] J. Branch, B. Szymanski, C. Giannella, R. Wolff and H. Kargupta. In-Network Outlier Detection in Wireless Sensor Networks. In Proc. of 26th IEEE International Conference on Distributed Computing Systems (ICDCS'06), 2006.

[26] H. Cui. Online Outlier Detection Over Data Streams. Master thesis, Simon Fraser University, 2002.

[27] D. Chakrabarti. AutoPart: Parameter-free graph partitioning and outlier detection. In PKDD'04, pages 112-124, 2004.

[28] V. Chandola, A. Banerjee, and V. Kumar. Outlier Detection: A Survey. Technical Report TR 07-017, Department of Computer Science and Engineering, University of Minnesota, 2007.

[29] C. Chow and D. Y. Yeung. Parzen-window network intrusion detectors. In Proceedings of the 16th International Conference on Pattern Recognition, Vol. 4, Washington, DC, USA, 40385, 2002.

[30] D. E. Denning. An intrusion detection model. IEEE Transactions on Software Engineering 13, 2, 222-232, 1987.

[31] M. Desforges, P. Jacob, and J. Cooper. Applicationsof probability density estimation to the detection ofabnormal conditions in engineering. In Proceedings ofInstitute of Mechanical Engineers, Vol. 212. 687-703, 1998.

[32] D. Dasgupta and F. Nino. A comparison of negative andpositive selection algorithms in novel pattern detection.In Proceedings of the IEEE International Conference onSystems, Man, and Cybernetics, Vol. 1. Nashville, TN, 125-130, 2000.

[33] K. Das and J. G. Schneider: Detecting anomalous recordsin categorical datasets. KDD’07, 220-229, 2007.

[34] E. Eskin. Anomaly detection over noisy data usinglearned probability distributions. In Proceedings of theSeventeenth International Conference on Machine Learning(ICML). Morgan Kaufmann Publishers Inc., 2000.

[35] D. Endler. Intrusion detection: Applying machine learning to Solaris audit data. In Proceedings of the 14th Annual Computer Security Applications Conference, 268, 1998.

[36] E. Eskin, A. Arnold, M. Prerau, L. Portnoy and S. Stolfo.A Geometric Framework for Unsupervised AnomalyDetection: Detecting Intrusions in Unlabeled Data.Applications of Data Mining in Computer Security, 2002.

[37] M. Ester, H-P. Kriegel, J. Sander, and X. Xu. A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD'96), Portland, Oregon, USA, 1996.

[38] E. Eskin and S. Stolfo. Modeling system calls for intrusion detection using dynamic window sizes. In Proceedings of DARPA Information Survivability Conference and Exposition, 2001.

[39] A. J. Fox. Outliers in time series. Journal of the RoyalStatistical Society, Series B (Methodological) 34, 3, 350-363, 1972.

[40] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 53-62, 1999.

[41] D. Freedman, R. Pisani and R. Purves. Statistics, W. W.Norton, New York, 1978.

[42] F. Grubbs. Procedures for detecting outlying observations in samples. Technometrics 11, 1, 1-21, 1969.

[43] D. E. Goldberg. Genetic Algorithms in Search, Optimiza-tion, and Machine Learning. Addison-Wesley, Reading,Massachusetts, 1989.

[44] S. Guttormsson, R. M. II, and M. El-Sharkawi. Elliptical novelty grouping for on-line short-turn detection of excited running rotors. IEEE Transactions on Energy Conversion 14, 1, 1999.

[45] V. Ganti, R. Ramakrishnan, J. Gehrke, A. Powell, and J. French. Clustering Large Datasets in Arbitrary Metric Spaces. In Proc. of the 15th International Conference on Data Engineering (ICDE'99), Sydney, Australia, 1999.

[46] S. Guha, R. Rastogi, and K. Shim. CURE: An Efficient Clustering Algorithm for Large Databases. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD'98), Seattle, WA, USA, 1998.

[47] D. Hawkins. Identification of Outliers. Chapman and Hall,London, 1980.

[48] P. Helman and J. Bhangoo. A statistically basedsystem for prioritizing information exploration underuncertainty. In IEEE International Conference on Systems,Man, and Cybernetics, Vol. 27, 449-466, 1997.

[49] P. S. Horn, L. Feng, Y. Li, and A. J. Pesce. Effectof outliers and nonhealthy individuals on referenceinterval estimation. Clinical Chemistry 47, 12, 2137-2145,2001.

[50] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2000.

[51] A. Hinneburg and D. A. Keim. An Efficient Approach to Clustering in Large Multimedia Databases with Noise. KDD'98, 1998.

[52] W. Jin, A. K. H. Tung and J. Han. Mining Top-n Local Outliers in Large Databases. In Proc. of 7th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD'01), San Francisco, CA, pp 293-298, 2001.

[53] W. Jin, A. K. H. Tung, J. Han and W. Wang: RankingOutliers Using Symmetric Neighborhood Relationship.PAKDD’06, 577-593, 2006.

[54] H. S. Javitz and A. Valdes. The SRI IDES statisticalanomaly detector. In Proceedings of the 1991 IEEESymposium on Research in Security and Privacy, 1991.

[55] G. Karypis, E-H. Han, and V. Kumar. CHAMELEON:A Hierarchical Clustering Algorithm Using DynamicModeling. IEEE Computer, 32, Pages 68-75, 1999.

[56] E. M. Knorr and R. T. Ng. A unified approach for miningoutliers. CASCON’97, 11, 1997.

[57] E. M. Knorr and R. T. Ng. A Unified Notion of Outliers:Properties and Computation. KDD’97, 219-222, 1997.

[58] E. M. Knorr and R. T. Ng. Algorithms for Mining Distance-based Outliers in Large Datasets. In Proc. of 24th International Conference on Very Large Data Bases (VLDB'98), New York, NY, pp 392-403, 1998.

[59] E. M. Knorr and R. T. Ng. Finding Intensional Knowledge of Distance-based Outliers. In Proc. of 25th International Conference on Very Large Data Bases (VLDB'99), Edinburgh, Scotland, pp 211-222, 1999.

[60] E. M. Knorr, R. T. Ng and V. Tucakov. Distance-BasedOutliers: Algorithms and Applications. VLDB Journal,8(3-4): 237-253, 2000.

[61] T. M. Khoshgoftaar, S. V. Nath, and S. Zhong. Intrusion Detection in Wireless Networks using Clustering Techniques with Expert Analysis. In Proceedings of the Fourth International Conference on Machine Learning and Applications (ICMLA'05), Los Angeles, CA, USA, 2005.

[62] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990.

[63] C. Kruegel, T. Toth, and E. Kirda. Service specificanomaly detection for network intrusion detection. InProceedings of the 2002 ACM Symposium on Appliedcomputing, 201-208, 2002.

[64] X. Li and J. Han: Mining Approximate Top-K SubspaceAnomalies in Multi-Dimensional Time-Series Data.VLDB, 447-458, 2007.

[65] J. Lee, J. Han and X. Li. Trajectory Outlier Detection:A Partition-and-Detect Framework. ICDE’08, 140-149,2008.

[66] J. Laurikkala, M. Juhola, and E. Kentala. Informal identification of outliers in medical data. In Fifth International Workshop on Intelligent Data Analysis in Medicine and Pharmacology, 20-24, 2000.

[67] G. Manson. Identifying damage sensitive, environmentinsensitive features for damage detection. In Proceedingsof the IES Conference. Swansea, UK, 2002.

[68] J. MacQueen. Some methods for classification andanalysis of multivariate observations. In Proc. of 5thBerkeley Symp. Math. Statist, Prob., 1, pages 281-297,1967.

[69] M. V. Mahoney and P. K. Chan. Learning nonstationarymodels of normal network traffic for detecting novelattacks. In Proceedings of the 8th ACM SIGKDDInternational Conference on Knowledge Discovery and DataMining, 376-385, 2002.

[70] G. Manson, G. Pierce, and K. Worden. On the long-termstability of normal condition for damage detection in acomposite panel. In Proceedings of the 4th InternationalConference on Damage Assessment of Structures, Cardiff,UK, 2001.

[71] G. Manson, S. G. Pierce, K. Worden, T. Monnier, P.Guy, and K. Atherton. Long-term stability of normalcondition data for novelty detection. In Proceedings ofSmart Structures and Integrated Systems, 323-334, 2000.

[72] C. C. Noble and D. J. Cook. Graph-based anomalydetection. In KDD’03, pages 631-636, 2003.

[73] R. Ng and J. Han. Efficient and Effective ClusteringMethods for Spatial Data Mining. In proceedings of the20th VLDB Conference, pages 144-155, 1994.

[74] L. O’Callaghan, N. Mishra, A. Meyerson, S. Guha, and R.Motwani. Streaming-Data Algorithms For High-QualityClustering. In Proceedings of the 18th InternationalConference on Data Engineering (ICDE’02), San Jose,California, USA, 2002.

[75] M. I. Petrovskiy. Outlier Detection Algorithms in DataMining Systems. Programming and Computer Software,Vol. 29, No. 4, pp 228-237, 2003.

[76] E. Parzen. On the estimation of a probability densityfunction and mode. Annals of Mathematical Statistics 33,1065-1076, 1962.

[77] S. Papadimitriou, H. Kitagawa, P. B. Gibbons and C.Faloutsos: LOCI: Fast Outlier Detection Using the LocalCorrelation Integral. ICDE’03, 315, 2003.

[78] D. Pokrajac, A. Lazarevic, L. Latecki. Incremental Local Outlier Detection for Data Streams. IEEE Symposium on Computational Intelligence and Data Mining (CIDM'07), 504-515, Honolulu, Hawaii, USA, 2007.

[79] P. A. Porras and P. G. Neumann. EMERALD: Eventmonitoring enabling responses to anomalous live dis-turbances. In Proceedings of 20th NIST-NCSC NationalInformation Systems Security Conference, 353-365, 1997.

[80] T. Palpanas, D. Papadopoulos, V. Kalogeraki, D.Gunopulos. Distributed deviation detection in sensornetworks. SIGMOD Record 32(4): 77-82, 2003.

[81] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient Algorithms for Mining Outliers from Large Data Sets. In Proc. of 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD'00), Dallas, Texas, pp 427-438, 2000.

[82] G. Sheikholeslami, S. Chatterjee, and A. Zhang.WaveCluster: A Wavelet based Clustering Approach forSpatial Data in Very Large Database. VLDB Journal, vol.8(3-4), pages 289-304, 1999.

[83] H. E. Solberg and A. Lahti. Detection of outliers in reference distributions: Performance of Horn's algorithm. Clinical Chemistry 51, 12, 2326-2332, 2005.

[84] J. Sun, H. Qu, D. Chakrabarti and C. Faloutsos.Neighborhood Formation and Anomaly Detection inBipartite Graphs. ICDM’05, 418-425, 2005.

[85] J. Sun, H. Qu, D. Chakrabarti and C. Faloutsos.Relevance search and anomaly detection in bipartitegraphs. SIGKDD Explorations 7(2): 48-55, 2005.

[86] L. Tarassenko. Novelty detection for the identification ofmasses in mammograms. In Proceedings of the 4th IEEEInternational Conference on Artificial Neural Networks,Vol. 4. Cambridge, UK, 442-447, 1995.

[87] J. Tang, Z. Chen, A. Fu, and D. W. Cheung. EnhancingEffectiveness of Outlier Detections for Low DensityPatterns. In Proc. of 6th Pacific-Asia Conference onKnowledge Discovery and Data Mining (PAKDD’02),Taipei, Taiwan, 2002.

[88] W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. In Proceedings of 23rd VLDB Conference, pages 186-195, Athens, Greece, 1997.

[89] W. Wang, J. Zhang and H. Wang. Grid-ODF: DetectingOutliers Effectively and Efficiently in Large Multi-dimensional Databases. In Proc. of 2005 InternationalConference on Computational Intelligence and Security(CIS’05), pp 765-770, Xi’an, China, 2005.

[90] K. Yamanishi and J. I. Takeuchi. Discovering outlierfiltering rules from unlabeled data: combining asupervised learner with an unsupervised learner. InProceedings of the 7th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining, 389-394, 2001.

[91] K. Yamanishi, J. I. Takeuchi, G. Williams, and P. Milne. On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. Data Mining and Knowledge Discovery 8, 275-300, 2004.

[92] C. T. Zahn. Graph-theoretical Methods for Detecting and Describing Gestalt Clusters. IEEE Transactions on Computers, C-20, pages 68-86, 1971.

[93] J. Zhang, Q. Gao and H. Wang. Outlying SubspaceDetection in High dimensional Space. Encyclopedia ofDatabase Technologies and Applications (2nd Edition), IdeaGroup Publisher, 2009.

[94] J. Zhang and H. Wang. Detecting Outlying Subspacesfor High-dimensional Data: the New Task, Algorithmsand Performance. Knowledge and Information Systems: AnInternational Journal (KAIS), Springer-Verlag Publisher,2006.

[95] J. Zhang, W. Hsu and M. L. Lee. Clustering in DynamicSpatial Databases. Journal of Intelligent InformationSystems (JIIS) 24(1): 5-27, Kluwer Academic Publisher,2005.

[96] C. Zhu, H. Kitagawa and C. Faloutsos. Example-Based Robust Outlier Detection in High Dimensional Datasets. In Proc. of 2005 IEEE International Conference on Data Mining (ICDM'05), pp 829-832, 2005.

[97] C. Zhu, H. Kitagawa, S. Papadimitriou and C. Faloutsos.OBE: Outlier by Example. PAKDD’04, 222-234, 2004.

[98] S. Zhong, T. M. Khoshgoftaar, and S. V. Nath. Aclustering approach to wireless network intrusiondetection. In ICTAI, pages 190-196, 2005.

[99] J. Zhang, M. Lou, T. W. Ling and H. Wang. HOS-Miner: A System for Detecting Outlying Subspaces ofHigh-dimensional Data. In Proc. of 30th InternationalConference on Very Large Data Bases (VLDB’04), demo,pages 1265-1268, Toronto, Canada, 2004.

[100] J. Zhang, Q. Gao and H. Wang. A Novel Methodfor Detecting Outlying Subspaces in High-dimensionalDatabases Using Genetic Algorithm. 2006 IEEE Interna-tional Conference on Data Mining (ICDM’06), pages 731-740, Hong Kong, China, 2006.

[101] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH:An Efficient Data Clustering Method for Very LargeDatabases. In proceedings of the 1996 ACM InternationalConference on Management of Data (SIGMOD’96), pages103-114, Montreal, Canada, 1996.

[102] J. Zhang and H. Wang. Detecting Outlying Subspacesfor High-dimensional Data: the New Task, Algorithmsand Performance. Knowledge and Information Systems(KAIS), 333-355, 2006.

[103] J. Zhang, Q. Gao, H. Wang, Q. Liu, K. Xu. DetectingProjected Outliers in High-Dimensional Data Streams.DEXA 2009: 629-644, 2009.
