
Computing & Information Sciences, Kansas State University

CIS 732 / 830: Machine Learning / Advanced Topics in AI

Lecture 26 of 42

Monday, 31 March 2008

William H. Hsu

Department of Computing and Information Sciences, KSU

KSOL course pages: http://snurl.com/1ydii / http://snipurl.com/1y5ih

Course web site: http://www.kddresearch.org/Courses/Spring-2008/CIS732

Instructor home page: http://www.cis.ksu.edu/~bhsu

Reading:

Today: Sections 7.9 – 7.11, 2.6, Han & Kamber 2e

Wednesday: 8.1 – 8.2, Han & Kamber 2e

Outlier Detection

Outlier Detection

Lian Duan

Management Sciences, UIOWA

What are outliers?

Hawkins outlier: an outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.

A relative concept: it depends on the situation and on your angle. An example: suppose you are the US president. The common thing is to compare against history and against the majority.

Outlier Detection and Clustering

The two tasks are interwoven with each other: not all objects should belong to a certain cluster, and abnormal events might have temporal or spatial locality (e.g., body temperature).

Single-point outliers vs. cluster-based outliers.

Previous Work

DB(pct, dmin)-outlier [binary]: an object p is a DB(pct, dmin)-outlier if at least a fraction pct of the objects in D lie at distance greater than dmin from p.

Density-based local outlier [degree]: given LOFLB, the lowest acceptable bound on LOF, an object p in a dataset D is a density-based local outlier if LOF(p) > LOFLB.

Other statistical methods.
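To make the DB(pct, dmin)-outlier definition above concrete, here is a minimal sketch in plain NumPy (the function name and the toy data are illustrative, not from the original papers):

```python
import numpy as np

def is_db_outlier(p, D, pct, dmin):
    """DB(pct, dmin)-outlier test: True if at least a fraction `pct`
    of the objects in D lie at distance greater than `dmin` from p."""
    dists = np.linalg.norm(D - p, axis=1)   # distances from p to every object in D
    return np.mean(dists > dmin) >= pct

# Toy usage: a point far from a tight Gaussian blob is flagged as an outlier.
D = np.vstack([np.random.randn(100, 2), [[10.0, 10.0]]])
print(is_db_outlier(np.array([10.0, 10.0]), D, pct=0.95, dmin=3.0))   # True
```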

Local Outlier Factor

Local density: the inverse of the average distance from a point to its k nearest neighbors.

Local Outlier Factor (LOF): the ratio between the local density of p and the local densities of p's k-nearest neighbors. The LOF of each object depends on the density of the cluster relative to the object and on the distance between the object and the cluster.
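These simplified definitions translate almost directly into code. The sketch below follows the slide's simplified form (local density = inverse of the mean k-NN distance; LOF oriented so that isolated points get large values), not the full reachability-distance formulation of Breunig et al.; all names are mine:

```python
import numpy as np

def local_density_and_knn(X, k):
    """Local density of every point: inverse of the mean distance to its
    k nearest neighbors (dense pairwise distances, O(n^2) -- fine for a sketch)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]           # indices of the k nearest neighbors
    avg_dist = np.take_along_axis(d, nn, axis=1).mean(axis=1)
    return 1.0 / avg_dist, nn

def lof(X, k):
    """LOF(p): mean local density of p's k nearest neighbors, divided by p's own."""
    lrd, nn = local_density_and_knn(X, k)
    return lrd[nn].mean(axis=1) / lrd

# Points deep inside the cluster get LOF close to 1; the isolated point gets LOF >> 1.
X = np.vstack([np.random.randn(50, 2), [[8.0, 8.0]]])
print(lof(X, k=5)[-1])
```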

Illustration of LOF

An example:

LOF-Outlier vs. DB(pct,dmin)-Outlier

LDBSCAN = DBSCAN + LOF

DBSCAN: retrieve all points which are density-reachable from a given core point, with parameters (MinPts, ε).

Problem: How many are many?
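For reference, a compressed sketch of the plain DBSCAN expansion the slide summarizes (illustrative, not the authors' code); the "how many are many?" question is exactly the choice of MinPts and ε below, which LDBSCAN replaces with LOF-based tests:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: returns a cluster id per point, -1 for noise."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)   # pairwise distances
    labels = np.full(len(X), -1)
    cluster = 0
    for i in range(len(X)):
        if labels[i] != -1:
            continue
        seeds = list(np.where(d[i] <= eps)[0])
        if len(seeds) < min_pts:                # i is not a core point
            continue
        labels[i] = cluster
        while seeds:                            # expand everything density-reachable
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster
                neighbors = np.where(d[j] <= eps)[0]
                if len(neighbors) >= min_pts:   # j is itself a core point
                    seeds.extend(neighbors)
        cluster += 1
    return labels
```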

LDBSCAN (continued)

A relative notion of core points and similarity. Core points: LOF < LOFUB. Similarity: p ∈ N_MinPts(q) and LRD(q)/(1+pct) < LRD(p) < LRD(q)·(1+pct).
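These two relative tests can be written down directly; a sketch assuming the LOF and LRD arrays and the MinPts-neighborhood lists `nn` have already been computed (for instance with the LOF sketch earlier), with helper names of my choosing:

```python
def is_core_point(i, lof, lof_ub):
    """LDBSCAN core-point test from the slide: LOF(i) < LOFUB."""
    return lof[i] < lof_ub

def directly_reachable(p, q, nn, lrd, pct):
    """p is directly reachable from core point q if p lies in q's
    MinPts-neighborhood and their local densities agree within a factor (1 + pct)."""
    return p in nn[q] and lrd[q] / (1.0 + pct) < lrd[p] < lrd[q] * (1.0 + pct)
```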

LDBSCAN (continued)

The same clustering idea as DBSCAN.

Parameters: LOFUB, pct.

LDBSCAN (continued)

Advantage

Density-based vs. partitioning clustering: density-based clustering handles small clusters, clusters of arbitrary shape, and noise.

Advantage (continued)

LDBSCAN vs. DBSCAN: it is easier to select proper parameters, and LDBSCAN handles local-density problems.

Advantage (continued)

LDBSCAN vs. OPTICS: comet-like clusters; hierarchical structure.

Performance

Experimental setup: Pentium IV 2.4 GHz, 512 MB memory, Red Hat 9.0, JDK 1.4.2.

Algorithm steps: search the k nearest neighbors, O(n²) (or O(n log n) with an index); calculate LRDs and LOFs, O(n); clustering, O(n).

Its computational complexity is therefore equal to that of LOF.

Experiment

Wisconsin Breast Cancer Data: after data preprocessing, the resultant dataset has 327 (57.8%) benign records and 239 (42.2%) malignant records with nine attributes.

LDBSCAN discovers two clusters and five single-point outliers. Cluster A contains 296 benign records and 6 malignant records; its average local density is 0.743. Cluster B contains 26 benign records and 233 malignant records; its average local density is 0.167. The five single-point outliers have LOFs that fall into the range from 3 to 5.

Experiment (continued)

Boston Housing Data: after data preprocessing, the resultant dataset has 506 records with 14 attributes.

Clusters (id, size, average local density): (1, 82, 0.556); (2, 345, 0.528); (3, 26, 0.477); (4, 34, 0.266); (5, 9, 0.228); (6, 6, 0.127); plus 4 single-point outliers.

Cluster 5 vs. cluster 6 (deviating from cluster 1): 24.514 (higher per-capita crime rate) vs. 20.005.

284th record (deviating from cluster 4): LRD = 0.155, LOF = 1.468; 2nd attribute: higher proportion of residential land zoned for lots; 3rd attribute: lower proportion of non-retail business acres per town.

Appendix: Cluster-based Outliers

Definition 1 (Upper bound of the cluster-based outlier, UBCBO): Let C1, ..., Ck be the clusters of the database D discovered by LDBSCAN, ordered so that |C1| ≥ |C2| ≥ … ≥ |Ck|. Given a parameter α, the number of objects in the cluster Ci is the UBCBO if (|C1| + |C2| + … + |Ci-1|) ≥ |D|·α and (|C1| + |C2| + … + |Ci-2|) < |D|·α.

Definition 2 (Cluster-based outlier): Let C1, ..., Ck be the clusters of the database D discovered by LDBSCAN. Cluster-based outliers are the clusters whose number of objects is no more than the UBCBO.

Definition 3 (Cluster-based outlier factor): Let C1 be a cluster-based outlier and C2 be the nearest non-outlier cluster of C1. The cluster-based outlier factor of C1 is defined as

$$\mathrm{CBOF}(C_1) = |C_1| \cdot \mathrm{dist}(C_1, C_2) \cdot \frac{\sum_{p_i \in C_2} \mathrm{lrd}(p_i)}{|C_2|}$$
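Definition 3 is a one-liner once the cluster sizes, the cluster-to-cluster distance, and the lrd values of the nearest non-outlier cluster are available; a hedged sketch (the names and the toy numbers are placeholders, not from the paper):

```python
import numpy as np

def cbof(size_c1, dist_c1_c2, lrd_c2):
    """Cluster-based outlier factor of C1: |C1| times the distance between C1 and
    its nearest non-outlier cluster C2, times the average lrd of the points in C2."""
    return size_c1 * dist_c1_c2 * np.mean(lrd_c2)

# E.g. a 6-point outlier cluster at distance 2.5 from its nearest large cluster,
# whose members have lrd values around 0.5:
print(cbof(6, 2.5, [0.48, 0.51, 0.50, 0.52]))
```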

Experiment (continued)

Abnormal network throughput detection: network throughput has characteristics consistent with self-similarity.

Monitoring 300 nodes every 5 minutes yields 3,600 measurements per hour.

Single-point vs. cluster-based outliers: 30 vs. 3 alerts per hour; occasional fluctuations vs. abnormal events lasting over a period of time.

Conclusion

Outlier detection and clustering improve each other's accuracy. Cluster-based outlier detection is more meaningful.

ADVERTISING: LDBSCAN is good at both outlier detection and clustering. It finds clusters with arbitrary shape and different local density, detects both single-point outliers and cluster-based outliers, and reports a degree of outlierness.

Outlier Detection for High Dimensional Data

Chilly (Ruohan) Wu

2/26/2006

Basic Information of the Paper

Charu C. Aggarwal & Philip S. Yu, IBM T. J. Watson Research Center. ACM SIGMOD 2001, May 21-24, Santa Barbara, California, USA.

What's an outlier?

Outlier: a data point which is very different from the rest of the data based on some measure.

Hawkins' definition (generally accepted, formal): an outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.

Outliers often contain useful information about abnormal behavior of the system described by the data.

Applications: credit card fraud, network intrusion detection, financial applications, and marketing.

Existing methods of outlier detection

Distribution-based methods: data points are modeled using a stochastic distribution, and outliers are observations which deviate from the given distribution.

Distance-based methods: define outliers by using the full-dimensional distances of the points from one another.

Existing methods of outlier detection (cont)

Clustering-based methods: outliers are a side-product of clustering; outliers are points which do not lie in any cluster.

Density-based methods: based on the densities of local neighborhoods.

But these methods do not work quite as well when the dimensionality is high.

What is special in high dimensional space?

There are domains in which the data can have hundreds of dimensions. It is very difficult, and inaccurate, to estimate the multidimensional distribution of the data points.

The data is sparse in high dimensionality, so the concept of locality becomes difficult to define.

The actual distances between any pair of points become similar in high-dimensional space, so it is difficult to find outliers based on distance and meaningful clusters cannot be found: every point is an almost equally good outlier, and the notion of proximity fails to retain its meaningfulness.

Example: several 2-dimensional cross-sections of a high-dimensional data set

Desiderata for High Dimensional Outlier Detection Algorithms

Handle the sparsity problems of high dimensionality effectively.

Provide interpretability in terms of the reasoning which creates the abnormality.

Proper measures must be identified in order to account for the physical significance of the definition of an outlier in a k-dimensional subspace.

The algorithms should continue to be computationally efficient for very high dimensional problems.

The algorithms should provide importance to the local data behavior while determining whether a point is an outlier.

A distance-based threshold for an outlier in a k-dimensional subspace is not directly comparable to one in a (k + 1)-dimensional subspace.

Algorithms should be devised which avoid a combinatorial exploration of the search space.

Defining Outliers in Lower Dimensional Projections

The essential idea behind this technique: define outliers by examining those projections of the data which have abnormally low density.

Defining abnormal lower dimensional projections: an abnormal lower dimensional projection is one in which the density of the data is exceptionally lower than average.

Sparsity Coefficients

Each attribute of the data is divided into φ equi-depth ranges; thus, each range contains a fraction f = 1/φ of the records.

Assume there are N points in the database and, for the purpose of this calculation, that they are uniformly distributed. The probability of any given point falling in a particular k-dimensional cube is $f^k$, so the expected number and standard deviation of the points in a k-dimensional cube are $N \cdot f^k$ and $\sqrt{N \cdot f^k (1 - f^k)}$.

Let n(D) be the number of points in a k-dimensional cube D. The sparsity coefficient S(D) of cube D is

$$S(D) = \frac{n(D) - N \cdot f^k}{\sqrt{N \cdot f^k \cdot (1 - f^k)}}$$
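A minimal sketch of this computation (variable names are mine); a cube holding far fewer points than the N·f^k expected under uniformity gets a strongly negative coefficient:

```python
import math

def sparsity_coefficient(n_D, N, k, phi):
    """S(D) = (n(D) - N*f^k) / sqrt(N*f^k*(1 - f^k)), with f = 1/phi."""
    f_k = (1.0 / phi) ** k
    expected = N * f_k
    std = math.sqrt(N * f_k * (1.0 - f_k))
    return (n_D - expected) / std

# 10,000 points, 10 equi-depth ranges per attribute, a 2-d cube holding 40 points:
# the expected count is 100, so the cube is abnormally sparse.
print(sparsity_coefficient(n_D=40, N=10_000, k=2, phi=10))   # about -6.0
```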

Sparsity Coefficients

Only sparsity coefficients which are negative indicate cubes in which the presence of the points is significantly lower than expected.

In general, the uniform-distribution assumption is not true. However, the sparsity coefficient still provides an intuitive idea of the level of significance for a given projection.

$$S(D) = \frac{n(D) - N \cdot f^k}{\sqrt{N \cdot f^k \cdot (1 - f^k)}}$$

Brute-force Algorithm

d: the total dimensionality of the data.
k: the dimensionality of the projections used to determine outliers.
m: the number of projections to be determined.

The algorithm works by examining all possible sets of k-dimensional candidate projections (with corresponding grid ranges) and retaining the m projections which have the most negative sparsity coefficients.

Brute-force Algorithm

(Algorithm pseudocode figure; '<-' denotes concatenation.)
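A sketch of what such a brute-force search might look like, assuming equi-depth discretization into φ ranges per attribute as described earlier; the helper names and the bookkeeping are my own choices, and the cost is exponential in k, which is the point of the evolutionary alternative:

```python
import heapq
import itertools
import numpy as np

def equi_depth_bins(X, phi):
    """Map each attribute value to one of phi equi-depth ranges (0 .. phi-1)."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)
    return (ranks * phi) // len(X)

def brute_force_projections(X, k, phi, m):
    """Return the m (sparsity coefficient, dimensions, ranges) triples with the
    most negative sparsity coefficients over all k-dimensional grid cubes."""
    N, d = X.shape
    B = equi_depth_bins(X, phi)
    f_k = (1.0 / phi) ** k
    denom = np.sqrt(N * f_k * (1 - f_k))
    scored = []
    for dims in itertools.combinations(range(d), k):
        for ranges in itertools.product(range(phi), repeat=k):
            n_D = np.sum(np.all(B[:, dims] == ranges, axis=1))
            scored.append(((n_D - N * f_k) / denom, dims, ranges))
    return heapq.nsmallest(m, scored, key=lambda t: t[0])

# Toy usage on 8-dimensional data, looking for the 5 sparsest 2-d cubes:
X = np.random.rand(2000, 8)
print(brute_force_projections(X, k=2, phi=5, m=5))
```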

An overview of Evolutionary Search

The fundamental underlying idea: in nature, resources are scarce, and this leads to competition among the species. Consequently, all the species undergo a selection mechanism in which only the fittest survive. The fitter individuals then tend to mate with each other more often, resulting in still better individuals. At the same time, nature occasionally throws in a variant through the process of mutation, so as to ensure a sufficient amount of diversity among the species, and hence also a greater scope for improvement.

An overview of Evolutionary Search

Each feasible solution to the problem is defined as an individual. The feasible solution takes the form of a string, which is the genetic representation of the individual.

The process of converting feasible solutions of the problem into string representations is called coding. For example, a possible coding for a feasible solution to the traveling salesman problem could be a string containing a sequence of numbers representing the order in which the salesman visits the cities.

An overview of Evolutionary Search

The genetic material at each locus on the string is referred to as a gene, and the possible values that the gene can take on are its alleles.

The fitness of an individual is evaluated by the fitness function, which takes as its argument the string representation of the individual and returns a value indicating its fitness: the better the objective function value, the better the fitness value.

An overview of Evolutionary Search

As the process of evolution progresses, the individuals in the population become more and more genetically similar to each other. This phenomenon is referred to as convergence.

De Jong defined convergence of a gene as the stage at which 95% of the population has the same value for that gene.

The population is said to have converged when all genes have converged.

The Evolutionary Outlier Detection Algorithm

The grid range for the i-th dimension can take any of the values 1 through φ, or it can take on the value * ("don't care"), for a total of φ + 1 possible values per position.

Example (4-d, φ = 10): a possible solution string is *3*9. The fitness of the corresponding solution may be computed using the sparsity coefficient discussed earlier.
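One plausible way to encode such a solution string and score it, reusing the sparsity-coefficient idea from the earlier slides (the STAR sentinel, the pre-binned matrix B, and the function name are my own conventions):

```python
import numpy as np

STAR = "*"   # the "don't care" value for a position

def fitness(solution, B, phi):
    """Sparsity coefficient of the cube selected by `solution`, a length-d list
    whose entries are either a grid range in 1..phi or STAR.  `B` holds each
    point's grid range (1..phi) for every attribute.  More negative = fitter."""
    N, _ = B.shape
    dims = [i for i, v in enumerate(solution) if v != STAR]
    k = len(dims)
    f_k = (1.0 / phi) ** k
    n_D = np.sum(np.all(B[:, dims] == [solution[i] for i in dims], axis=1))
    return (n_D - N * f_k) / np.sqrt(N * f_k * (1 - f_k))

# The slide's 4-dimensional example with phi = 10: *3*9 fixes dimensions 2 and 4.
B = np.random.randint(1, 11, size=(1000, 4))
print(fitness([STAR, 3, STAR, 9], B, phi=10))
```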

The Evolutionary Outlier Detection Algorithm

The algorithm started with a population of p random solutions and iteratively applied the processes of selection, crossover, and mutation in order to perform a combination of hill climbing, solution recombination, and random search over the space of possible projections.

The process was continued until the population converged; the De Jong convergence criterion was used to determine the termination condition.

At each stage of the algorithm, the m best projection solutions found so far (those with the most negative sparsity coefficients) were kept track of.

At the end of the algorithm, these solutions were reported as the best projections in the data.
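Schematically, the loop described on this slide might be wired up as below; `select`, `crossover`, `mutate`, `fitness`, and `converged` stand for the operators detailed on the following slides and are passed in as callables (this is a sketch of the control flow, not the authors' implementation):

```python
import heapq

def evolutionary_outlier_search(initial_population, fitness, select, crossover,
                                mutate, m, max_iters=1000, converged=None):
    """Evolve the population while remembering the m projection strings with
    the most negative sparsity coefficients seen so far."""
    population = list(initial_population)
    best = []                                           # (fitness, string) pairs
    for _ in range(max_iters):
        scored = [(fitness(s), tuple(s)) for s in population]
        best = heapq.nsmallest(m, set(best) | set(scored), key=lambda t: t[0])
        if converged is not None and converged(population):
            break
        parents = select(population, scored)            # rank-based roulette wheel
        children = crossover(parents)                   # k-preserving recombination
        population = [mutate(child) for child in children]
    return [s for _, s in best]
```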

The Outlier Detection Algorithm

(Algorithm figure; S denotes the population of solutions in any iteration.)

The Selection Criterion for the Genetic Algorithm

Roulette wheel mechanism: the probability of sampling a string from the population was proportional to p - r(i), where p is the total number of strings and r(i) is the rank of the i-th string.

The strings are ordered in such a way that the strings with the most negative sparsity coefficients occur first.

Thus the most abnormally sparse solutions are likely to have a greater number of copies.
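A sketch of that rank-based roulette wheel (NumPy; the function name is mine): strings are ranked so that the most negative sparsity coefficient gets rank 1, and rank r(i) is sampled with probability proportional to p - r(i):

```python
import numpy as np

def roulette_select(strings, fitnesses, rng=None):
    """Sample len(strings) parents with probability proportional to p - r(i)."""
    rng = rng or np.random.default_rng()
    p = len(strings)
    order = np.argsort(fitnesses)               # most negative fitness first
    ranks = np.empty(p, dtype=int)
    ranks[order] = np.arange(1, p + 1)          # r(i) in 1..p
    weights = p - ranks                         # the worst-ranked string gets weight 0
    probs = weights / weights.sum()
    picks = rng.choice(p, size=p, p=probs)
    return [strings[i] for i in picks]
```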

The Selection Criterion for the Genetic Algorithm

The Crossover Algorithm

Unbiased two-point crossover: determine a point in the string at random, called the crossover point, and exchange the segments to the right of this point.

For example, consider the strings 3*2*1 and 1*33*. If the crossover is performed after the third position, the two resulting strings are 3*23* and 1*3*1; if the crossover occurred after the fourth position, the two resulting children would be 3*231 and 1*3**.

In general, since the evolutionary algorithm only searches for projections of a given dimensionality in a run, this kind of crossover mechanism often creates infeasible solutions (the second example above yields children that no longer fix exactly k positions).
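The naive crossover just described, reproducing the slide's example; note that nothing constrains the children to keep exactly k fixed positions, which is the infeasibility problem the optimized crossover on the next slides addresses:

```python
import random

def naive_crossover(a, b, point=None):
    """Exchange the segments to the right of a (random) crossover point."""
    if point is None:
        point = random.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

# The slide's example, crossing over after the third position:
print(naive_crossover("3*2*1", "1*33*", point=3))   # ('3*23*', '1*3*1')
```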

The Crossover Algorithm

It is desirable that the two children obtained after solution recombination also correspond to k-dimensional projections.

Classify the positions in the two parent strings:
Type I: both strings have a don't care.
Type II: neither string has a don't care. Assume there are k' <= k positions of this type.
Type III: exactly one string has a don't care. Each string has exactly k - k' such positions, and these positions are disjoint, so there are a total of 2(k - k') such positions.

Optimized Crossover

The aim is to create at least one child string from the two parent strings which is a fitter solution recombination than either parent, i.e., to find the best possible recombination from the two parents. There are a total of $2^{k'} \cdot \binom{2(k - k')}{k - k'}$ possibilities for the children.

Observe that k' is typically quite small. First, search the space of the $2^{k'}$ possibilities for the Type II positions for the best possible combination; then use a greedy algorithm to find a solution recombinant for the (k - k') Type III positions: always extend the string with the position which results in the most negative sparsity coefficient, and keep extending the string for an extra (k - k') positions until all k positions have been set (this gives the first child S).

The second child is created by always picking the positions from a different parent than the one from which the string S derives its positions.

The Crossover Algorithm

The Mutation Algorithm

Mutations are of two types.

Type 1 affects the positions which are *: let Q be the set of positions in the string which are *. Pick a position in the string which is not in Q and change it to *; at the same time, change a randomly picked position in Q to a number between 1 and φ.

Type 2 affects only the positions which are not *: the value of such a position is changed from one value between 1 and φ to another value between 1 and φ.

We perform mutations of Type 1 and Type 2 with probabilities P1 and P2 respectively; for the purposes of this implementation, P1 = P2.
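A sketch of the two mutation types under the same string encoding as before (STAR, φ, and the probabilities are as described above; the function name is mine):

```python
import random

STAR = "*"

def mutate(solution, phi, p1=0.1, p2=0.1):
    """Apply the slide's two mutation types to a list-encoded solution string.
    Type 1 (prob. p1): turn a fixed position into * and a * position into a
    value in 1..phi, so the projection stays k-dimensional.
    Type 2 (prob. p2): re-draw the value of a fixed position."""
    s = list(solution)
    if random.random() < p1:
        stars = [i for i, v in enumerate(s) if v == STAR]
        fixed = [i for i, v in enumerate(s) if v != STAR]
        if stars and fixed:
            s[random.choice(fixed)] = STAR
            s[random.choice(stars)] = random.randint(1, phi)
    if random.random() < p2:
        fixed = [i for i, v in enumerate(s) if v != STAR]
        if fixed:
            i = random.choice(fixed)
            s[i] = random.choice([v for v in range(1, phi + 1) if v != s[i]])
    return s

print(mutate([STAR, 3, STAR, 9], phi=10, p1=1.0, p2=1.0))
```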

The Mutation Algorithm

Choice of Projection Parameters

Choosing k and φ: each subcube represented by a k-dimensional projection contains an expected fraction 1/φ^k of the data. If k = 4, φ = 10, and N < 10,000, each cube will contain less than one point on average. φ should be picked high enough that there is a sufficient number of intervals on each dimension to correspond to a reasonable notion of locality.

We calculate the sparsity coefficient of a cube as above; an empty cube attains sparsity coefficient s when

$$k = \log_\phi\!\left(\frac{N}{s^2} + 1\right),$$

and a choice of sparsity coefficient s = -3 would result in a 99.9% level of significance that the given data cube contains fewer points than expected and is hence an abnormally sparse projection.

EMPIRICAL RESULTS

We tested the performance of the method using both the brute-force and the evolutionary technique

the brute-force technique required considerably more computational resources than the evolutionary search technique for high dimensional data sets.

In order to find k-dimensional projections of a d-dimensional problem, there are a total of $\binom{d}{k} \cdot \phi^k$ possibilities; d = 20, k = 4, φ = 10 results in 7×10^7 possibilities.

EMPIRICAL RESULTS