Optics ordering points to identify the clustering structure


Page 1: Optics ordering points to identify the clustering structure

(Paper Presentation)

OPTICS-Ordering Points To Identify The Clustering Structure

Presenters: Anu Singha, Asiya Naz, Rajesh Piryani

South Asian University

Page 2: Optics ordering points to identify the clustering structure

OUTLINE

Introduction

Definitions (Directly Density Reachable, Density Reachable, Density Connected, …)

OPTICS Algorithm

Example

Graphical Results

April 30,2012 2

Page 3: Optics ordering points to identify the clustering structure

CLUSTERING

Goal

Group objects into meaningful subclasses, either as part of an exploratory process to gain insight into the data or as a preprocessing step for other algorithms.

Clustering Strategies

Hierarchical

Partitioning

k-means

Density Based


Page 4: Optics ordering points to identify the clustering structure

DENSITY BASED CLUSTERING

Density-based clustering locates regions of high density that are separated from one another by regions of low density.

Density = number of points within a specified radius (Eps)


Page 5: Optics ordering points to identify the clustering structure

DENSITY BASED CLUSTERING

Flat Clustering

one level of clusters

e.g. density-based clustering algorithm DBSCAN [KDD 96]

Hierarchical Clustering

nested clusters

e.g. density-based clustering algorithm OPTICS [SIGMOD 99]

Page 6: Optics ordering points to identify the clustering structure

INTRODUCTION

DBSCAN can cluster objects given input parameters such as Eps (the maximum radius of a neighborhood) and MinPts (the minimum number of points required in the neighborhood of a core object).

However, it encumbers users with the responsibility of selecting parameter values that will lead to the discovery of acceptable clusters.

Such parameter settings are usually set empirically and are difficult to determine.

Moreover, real-world, high-dimensional data sets often have very skewed distributions, so their intrinsic clustering structure may not be well characterized by a single set of global density parameters.

Page 7: Optics ordering points to identify the clustering structure

INTRODUCTION

Density-based clusters are monotonic with respect to the neighborhood threshold.

In DBSCAN, for a fixed MinPts value and two neighborhood thresholds Eps1 < Eps2, a cluster C with respect to Eps1 and MinPts must be a subset of a cluster C' with respect to Eps2 and MinPts.

This means that if two objects are in a density-based cluster, they must also be in a cluster with a lower density requirement.

Different clusters may have very different densities

Clusters may be in hierarchies

To overcome the difficulty of using one set of global parameters in clustering analysis, a cluster analysis method called OPTICS was proposed.

Page 8: Optics ordering points to identify the clustering structure

OPTICS

In Figure 3, C1 and C2 are density-based clusters with respect to ε2 < ε1, and C is a density-based cluster with respect to ε1 that completely contains the sets C1 and C2.

For a constant MinPts value, density-based clusters with respect to a higher density (i.e. a lower value of ε) are completely contained in density-connected sets with respect to a lower density (i.e. a higher value of ε).

Page 9: Optics ordering points to identify the clustering structure

OPTICS

To produce a consistent result, a specific order must be obeyed in which objects are processed when expanding a cluster.

Always select the object which is density-reachable with respect to the lowest ε value, to guarantee that clusters with respect to higher density (i.e. smaller ε values) are finished first.

OPTICS works in principle like such an extended DBSCAN algorithm for an infinite number of distance parameters εi which are smaller than a "generating distance" ε (i.e. 0 ≤ εi ≤ ε).

The only difference is that we do not assign cluster memberships.

Instead, we store the order in which the objects are processed and the information which would be used by an extended DBSCAN algorithm to assign cluster memberships (if this were at all possible for an infinite number of parameters).

Page 10: Optics ordering points to identify the clustering structure

OPTICS

OPTICS does not explicitly produce a data set clustering.

It outputs a cluster ordering.

It is a linear list of all objects under analysis and represents the density-based clustering structure of the data.

Objects in a denser cluster are listed closer to each other in the cluster ordering.

The ordering is equivalent to the density-based clusterings obtained from a wide range of parameter settings.

Thus, OPTICS does not require the user to provide a specific density threshold.

The cluster ordering can be used to extract basic clustering information (e.g., cluster centers or arbitrary-shaped clusters), derive the intrinsic clustering structure, and provide a visualization of the clustering.

Page 11: Optics ordering points to identify the clustering structure

OPTICS (CONTINUED..)

To construct the different clusterings simultaneously, the objects are processed in a specific order.

This order selects an object that is density-reachable with respect to the lowest Eps value, so that clusters with higher density (lower Eps) will be finished first.

Based on this idea, OPTICS needs two important pieces of information per object:

Core Distance

Reachability Distance

OPTICS was presented by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander.

Page 12: Optics ordering points to identify the clustering structure

TERMINOLOGIES

ε-Neighborhood

All objects within a radius of ε from an object (epsilon-neighborhood).

Core objects

An object is a core object if its ε-neighborhood contains at least MinPts objects.

[Figure: the ε-neighborhood of p and the ε-neighborhood of q]

p is a core object (MinPts = 4)

q is not a core object
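The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function names, the brute-force scan, and the toy points are assumptions, and an object is counted as a member of its own ε-neighborhood.

```python
import math

def eps_neighborhood(points, center, eps):
    """Return all points within distance eps of center (center itself included)."""
    return [p for p in points if math.dist(p, center) <= eps]

def is_core_object(points, center, eps, min_pts):
    """A core object has at least min_pts objects in its eps-neighborhood."""
    return len(eps_neighborhood(points, center, eps)) >= min_pts

# Hypothetical toy data: a tight group near the origin plus one distant point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
dense = is_core_object(points, (0, 0), eps=1.5, min_pts=4)
sparse = is_core_object(points, (5, 5), eps=1.5, min_pts=4)
print(dense, sparse)
```

With these values the point in the dense group qualifies as a core object, while the isolated point does not.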

Page 13: Optics ordering points to identify the clustering structure

TERMINOLOGIES

Directly Density Reachable

An object q is directly density-reachable from object p if q is within the ε-Neighborhood of p and p is a core object

[Figure: the ε-neighborhoods of p and q]

q is directly density-reachable from p

p is not directly density-reachable from q (q is not a core object)
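The asymmetry of direct density-reachability can be checked with a small sketch. The helper names and the five toy points are assumptions chosen so that p is a core object and q is not.

```python
import math

def neighbors(points, center, eps):
    """Brute-force eps-neighborhood (the center counts as its own neighbor)."""
    return [x for x in points if math.dist(x, center) <= eps]

def directly_density_reachable(points, q, p, eps, min_pts):
    """True if q lies in the eps-neighborhood of p and p is a core object."""
    n_p = neighbors(points, p, eps)
    return q in n_p and len(n_p) >= min_pts

# Hypothetical layout: p has four objects within eps, q has only three.
points = [(0, 0), (1, 0), (0, 1), (-1, 0), (2, 0)]
p, q = (0, 0), (1, 0)
q_from_p = directly_density_reachable(points, q, p, eps=1.0, min_pts=4)
p_from_q = directly_density_reachable(points, p, q, eps=1.0, min_pts=4)
print(q_from_p, p_from_q)
```

Both objects lie in each other's ε-neighborhood, so the only thing that breaks the symmetry is the core-object condition.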

Page 14: Optics ordering points to identify the clustering structure

TERMINOLOGIES

Density Reachable

An object p is density-reachable from q w.r.t. ε and MinPts if there is a chain of objects p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi w.r.t. ε and MinPts for all 1 <= i < n.

[Figure: a chain of objects leading from p to q]

q is density-reachable from p

p is not density-reachable from q

Density-reachability is the transitive closure of direct density-reachability and is asymmetric.
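Because density-reachability is a transitive closure, it can be tested with a breadth-first search that only expands core objects. The sketch below is one way to do this under assumed names and toy data; it requires a chain of at least one step.

```python
import math
from collections import deque

def neighbors(points, center, eps):
    """Brute-force eps-neighborhood (the center counts as its own neighbor)."""
    return [x for x in points if math.dist(x, center) <= eps]

def density_reachable(points, target, source, eps, min_pts):
    """True if target can be reached from source through a chain of
    directly density-reachable steps."""
    seen = {source}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        n_cur = neighbors(points, current, eps)
        if len(n_cur) < min_pts:
            continue  # only core objects can extend the chain
        for nxt in n_cur:
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Hypothetical points on a line: the two end points are not core objects.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
forward = density_reachable(points, (4, 0), (1, 0), eps=1.0, min_pts=3)
backward = density_reachable(points, (1, 0), (4, 0), eps=1.0, min_pts=3)
print(forward, backward)
```

The border point (4, 0) is reachable from the core point (1, 0), but nothing is reachable from (4, 0) because it is not a core object.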

Page 15: Optics ordering points to identify the clustering structure

TERMINOLOGIES

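Definition: core-distance

\[
\text{core-dist}_{\varepsilon,\mathit{MinPts}}(o) =
\begin{cases}
\text{UNDEFINED} & \text{if } |\text{rangeQuery}(o, \varepsilon)| < \mathit{MinPts} \\
\mathit{MinPts}\text{-dist}(o) & \text{otherwise}
\end{cases}
\]

Definition: reachability-distance

\[
\text{reach-dist}_{\varepsilon,\mathit{MinPts}}(p, o) = \max\bigl(\text{core-dist}_{\varepsilon,\mathit{MinPts}}(o),\ \text{dist}(o, p)\bigr)
\]

[Figure: core-distance(o) and reachability-distance(p, o) for two objects p near o, with MinPts = 5]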
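The two distances can be computed directly from their definitions. The following is a sketch under stated assumptions: UNDEFINED is represented as infinity, an object counts as its own nearest neighbor, and the function names and sample points are made up for illustration.

```python
import math

UNDEFINED = math.inf  # assumption: infinity stands in for UNDEFINED

def core_distance(points, o, eps, min_pts):
    """Distance from o to its MinPts-th nearest neighbor within eps
    (o counts as its own neighbor), or UNDEFINED if o is not a core object."""
    dists = sorted(math.dist(o, x) for x in points if math.dist(o, x) <= eps)
    return dists[min_pts - 1] if len(dists) >= min_pts else UNDEFINED

def reachability_distance(points, p, o, eps, min_pts):
    """Reachability-distance of p with respect to o:
    max(core-distance(o), dist(o, p)); UNDEFINED if o is not a core object."""
    return max(core_distance(points, o, eps, min_pts), math.dist(o, p))

# Hypothetical data on a line, with one far-away point.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
cd = core_distance(points, (0, 0), eps=5, min_pts=3)
near = reachability_distance(points, (1, 0), (0, 0), eps=5, min_pts=3)
far = reachability_distance(points, (3, 0), (0, 0), eps=5, min_pts=3)
isolated = core_distance(points, (10, 0), eps=5, min_pts=3)
print(cd, near, far, isolated)
```

An object closer to o than the core-distance is "lifted" to the core-distance, while an object farther away keeps its actual distance.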

Page 16: Optics ordering points to identify the clustering structure

ABOUT OPTICS COMPUTATION

OPTICS computes an ordering of all objects in a given database, and it stores the core-distance and a suitable reachability-distance for each object in the database.

OPTICS maintains a list called OrderSeeds to generate the output ordering.

Objects in OrderSeeds are sorted by the reachability-distance from their respective closest core objects, that is, by the smallest reachability-distance of each object.
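One possible way to keep such a list sorted is a binary heap with a "lazy" decrease-key, where an improved reachability-distance is pushed as a new entry and outdated entries are skipped on removal. This is an assumption about the data structure, not something the slides prescribe. The numbers are the seedlist values from the worked example later in this deck.

```python
import heapq

class OrderSeeds:
    """Min-priority queue keyed by reachability-distance (lazy decrease-key)."""

    def __init__(self):
        self._heap = []   # entries (reachability, object label)
        self._best = {}   # label -> smallest reachability seen so far

    def update(self, obj, reach):
        """Insert obj, or lower its key if reach improves on the stored value."""
        if reach < self._best.get(obj, float("inf")):
            self._best[obj] = reach
            heapq.heappush(self._heap, (reach, obj))

    def empty(self):
        return not self._best

    def next(self):
        """Remove and return the object with the smallest reachability-distance."""
        while True:
            reach, obj = heapq.heappop(self._heap)
            if self._best.get(obj) == reach:  # otherwise the entry is stale
                del self._best[obj]
                return obj, reach

seeds = OrderSeeds()
for obj, reach in [("J", 20), ("K", 20), ("L", 31), ("C", 40), ("M", 40), ("R", 43)]:
    seeds.update(obj, reach)
first = seeds.next()  # ties are broken by label here
for obj, reach in [("L", 19), ("R", 21), ("M", 30), ("P", 31)]:
    seeds.update(obj, reach)
order = []
while not seeds.empty():
    order.append(seeds.next())
print(first, order)
```

After J is removed and the four updates are applied, the remaining objects come out in the same order as the seedlist shown on the slide that follows the processing of J.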

Page 17: Optics ordering points to identify the clustering structure

ABOUT OPTICS ALGORITHM

Begin with an arbitrary object from the input database as the current object, p.

OPTICS retrieves the ε-neighborhood of p, determines the core-distance, and sets the reachability-distance to undefined.

The current object, p, is then written to output.

If p is not a core object, OPTICS simply moves on to the next object in the OrderSeeds list (or the input database if OrderSeeds is empty).

Page 18: Optics ordering points to identify the clustering structure

ABOUT OPTICS ALGORITHM

If p is a core object, then for each object q in the ε-neighborhood of p, OPTICS updates its reachability-distance from p and inserts q into OrderSeeds if q has not yet been processed.

The iteration continues until the input is fully consumed and OrderSeeds is empty.

Page 19: Optics ordering points to identify the clustering structure

ALGORITHM

OPTICS (SetOfObjects, ε, MinPts, OrderedFile)
  OrderedFile.open();
  FOR i FROM 1 TO SetOfObjects.size DO
    Object := SetOfObjects.get(i);
    IF NOT Object.Processed THEN
      ExpandClusterOrder(SetOfObjects, Object, ε, MinPts, OrderedFile)
  OrderedFile.close();
END; // OPTICS

Page 20: Optics ordering points to identify the clustering structure

PROCEDURE FOR ExpandClusterOrder

ExpandClusterOrder(SetOfObjects, Object, ε, MinPts, OrderedFile);
  neighbors := SetOfObjects.neighbors(Object, ε);
  Object.Processed := TRUE;
  Object.reachability_distance := UNDEFINED;
  Object.setCoreDistance(neighbors, ε, MinPts);
  OrderedFile.write(Object);
  IF Object.core_distance <> UNDEFINED THEN
    OrderSeeds.update(neighbors, Object);
    WHILE NOT OrderSeeds.empty() DO
      currentObject := OrderSeeds.next();
      neighbors := SetOfObjects.neighbors(currentObject, ε);
      currentObject.Processed := TRUE;
      currentObject.setCoreDistance(neighbors, ε, MinPts);
      OrderedFile.write(currentObject);
      IF currentObject.core_distance <> UNDEFINED THEN
        OrderSeeds.update(neighbors, currentObject);
END; // ExpandClusterOrder

The object is simply written to the file OrderedFile with its core-distance and its current reachability-distance.

Page 21: Optics ordering points to identify the clustering structure

OrderSeeds::update()

OrderSeeds::update(neighbors, CenterObject);
  c_dist := CenterObject.core_distance;
  FORALL Object FROM neighbors DO
    IF NOT Object.Processed THEN
      new_r_dist := max(c_dist, CenterObject.dist(Object));
      IF Object.reachability_distance = UNDEFINED THEN
        Object.reachability_distance := new_r_dist;
        insert(Object, new_r_dist);
      ELSE // Object already in OrderSeeds
        IF new_r_dist < Object.reachability_distance THEN
          Object.reachability_distance := new_r_dist;
          decrease(Object, new_r_dist);
END; // OrderSeeds::update
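The three procedures (OPTICS, ExpandClusterOrder, OrderSeeds::update) can be combined into one compact Python sketch. It is an illustration under assumptions, not the authors' implementation: neighborhoods are found by a brute-force scan, infinity stands in for UNDEFINED, the result is returned as a list instead of being written to OrderedFile, and the decrease() operation is replaced by pushing a new heap entry and skipping stale ones.

```python
import heapq
import math

def optics(points, eps, min_pts):
    """Return [(index, reachability, core_distance)] in cluster order.
    math.inf stands in for UNDEFINED."""
    n = len(points)
    processed = [False] * n
    reach = [math.inf] * n
    ordering = []

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    def core_dist(i, nbrs):
        if len(nbrs) < min_pts:
            return math.inf
        return sorted(math.dist(points[i], points[j]) for j in nbrs)[min_pts - 1]

    def update(seeds, nbrs, center, c_dist):
        for j in nbrs:
            if not processed[j]:
                new_r = max(c_dist, math.dist(points[center], points[j]))
                if new_r < reach[j]:
                    reach[j] = new_r
                    heapq.heappush(seeds, (new_r, j))  # lazy decrease-key

    for start in range(n):
        if processed[start]:
            continue
        seeds = [(math.inf, start)]
        while seeds:
            _, i = heapq.heappop(seeds)
            if processed[i]:
                continue  # stale entry left over from an earlier update
            processed[i] = True
            nbrs = neighbors(i)
            c = core_dist(i, nbrs)
            ordering.append((i, reach[i], c))
            if c != math.inf:
                update(seeds, nbrs, i, c)
    return ordering

# Hypothetical data: two well-separated groups on a line.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
result = optics(points, eps=3, min_pts=2)
for index, r, c in result:
    print(index, r, c)
```

In the output, the first object of each group has an undefined reachability-distance, and every other object has a small one, which is the pattern a reachability plot makes visible.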

Page 22: Optics ordering points to identify the clustering structure

Having generated the augmented cluster-ordering of a database with respect to ε and MinPts, we can extract any density-based clustering from this order with respect to MinPts and a clustering-distance ε' ≤ ε by simply "scanning" the cluster-ordering and assigning cluster-memberships depending on the reachability-distance and the core-distance of the objects.

That such an extraction is possible demonstrates that the cluster-ordering of a data set actually contains the information about the intrinsic clustering structure of that data set (up to the generating distance ε).

Page 23: Optics ordering points to identify the clustering structure

ExtractDBSCAN-Clustering (ClusterOrderedObjs, ε’, MinPts)

ExtractDBSCAN-Clustering (ClusterOrderedObjs, ε', MinPts)
  // Precondition: ε' ≤ generating dist ε for ClusterOrderedObjs
  ClusterId := NOISE;
  FOR i FROM 1 TO ClusterOrderedObjs.size DO
    Object := ClusterOrderedObjs.get(i);
    IF Object.reachability_distance > ε' THEN
      // UNDEFINED > ε
      IF Object.core_distance ≤ ε' THEN
        ClusterId := nextId(ClusterId);
        Object.clusterId := ClusterId;
      ELSE
        Object.clusterId := NOISE;
    ELSE // Object.reachability_distance ≤ ε'
      Object.clusterId := ClusterId;
END; // ExtractDBSCAN-Clustering
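The extraction scan translates almost line for line into Python. In this sketch the ordering is a plain list of tuples, NOISE is encoded as -1, infinity stands in for UNDEFINED, and the sample ordering is invented for illustration; none of these choices come from the slides.

```python
import math

NOISE = -1  # assumption: cluster ids are 0, 1, 2, ... and noise is -1

def extract_dbscan(ordering, eps_prime):
    """ordering: (object, reachability, core_distance) tuples in cluster order,
    with math.inf standing in for UNDEFINED. Returns {object: cluster id}."""
    labels = {}
    cluster_id = NOISE
    for obj, reach, core in ordering:
        if reach > eps_prime:        # not reachable from the objects before it
            if core <= eps_prime:    # a core object starts a new cluster
                cluster_id += 1
                labels[obj] = cluster_id
            else:
                labels[obj] = NOISE
        else:                        # reachable: stays in the current cluster
            labels[obj] = cluster_id
    return labels

# Hypothetical cluster ordering with two dense groups and one outlier "d".
ordering = [
    ("a", math.inf, 1.0), ("b", 1.0, 1.0), ("c", 1.2, 1.5),
    ("d", math.inf, math.inf),
    ("e", 5.0, 1.0), ("f", 1.0, 1.0),
]
labels = extract_dbscan(ordering, eps_prime=2.0)
print(labels)
```

A single pass over the ordering is enough, so the same stored ordering can be rescanned with different ε' values without rerunning OPTICS.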

Page 24: Optics ordering points to identify the clustering structure

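OPTICS ALGORITHM EXAMPLE

• Example Database (2-dimensional, 16 points)

• ε = 44, MinPts = 3

• seedlist: (empty)

[Figure: the 16 points A, B, C, D, E, F, G, H, I, J, K, L, M, N, P, R and an empty reachability plot; vertical axis "reach", with ε = 44 marked]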

Page 25: Optics ordering points to identify the clustering structure

OPTICS ALGORITHM EXAMPLE

• Example Database (2-dimensional, 16 points)

• ε = 44, MinPts = 3

• Ordering so far: A

• seedlist: (B, 40) (I, 40)

[Figure: the 16 points with the core-distance of A marked, and the reachability plot; vertical axis "reach", with ε = 44 marked]

Page 26: Optics ordering points to identify the clustering structure

OPTICS ALGORITHM EXAMPLE

• Example Database (2-dimensional, 16 points)

• ε = 44, MinPts = 3

• Ordering so far: A B

• seedlist: (I, 40) (C, 40)

[Figure: the 16 points and the reachability plot; vertical axis "reach", with ε = 44 marked]

Page 27: Optics ordering points to identify the clustering structure

OPTICS ALGORITHM EXAMPLE

• Example Database (2-dimensional, 16 points)

• ε = 44, MinPts = 3

• Ordering so far: A B I

• seedlist: (J, 20) (K, 20) (L, 31) (C, 40) (M, 40) (R, 43)

[Figure: the 16 points and the reachability plot; vertical axis "reach", with ε = 44 marked]

Page 28: Optics ordering points to identify the clustering structure

OPTICS ALGORITHM EXAMPLE

• Example Database (2-dimensional, 16 points)

• ε = 44, MinPts = 3

• Ordering so far: A B I J

• seedlist: (L, 19) (K, 20) (R, 21) (M, 30) (P, 31) (C, 40)

[Figure: the 16 points and the reachability plot; vertical axis "reach", with ε = 44 marked]

Page 29: Optics ordering points to identify the clustering structure

OPTICS ALGORITHM EXAMPLE

• Example Database (2-dimensional, 16 points)

• ε = 44, MinPts = 3

• Ordering so far: A B I J L

• seedlist: (M, 18) (K, 18) (R, 20) (P, 21) (N, 35) (C, 40)

[Figure: the 16 points and the reachability plot; vertical axis "reach", with ε = 44 marked]

Page 30: Optics ordering points to identify the clustering structure

OPTICS ALGORITHM EXAMPLE

• Example Database (2-dimensional, 16 points)

• ε = 44, MinPts = 3

• Final ordering: A B I J L M K N R P C D F G E H

• seedlist: -

[Figure: the 16 points and the completed reachability plot; vertical axis "reach", with ε = 44 marked]


Page 32: Optics ordering points to identify the clustering structure

GRAPHICAL REPRESENTATION

A data set’s cluster ordering can be represented graphically.

It helps to visualize and understand the clustering structure in a data set.


Page 33: Optics ordering points to identify the clustering structure

GRAPHICAL REPRESENTATION

The figure shows the reachability plot for a simple 2-D data set, which presents a general overview of how the data are structured and clustered.

The data objects are plotted in the clustering order (horizontal axis) together with their respective reachability-distances (vertical axis).

The three Gaussian "bumps" in the plot reflect three clusters in the data set.
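A reachability plot is just a bar chart over the cluster ordering, so a rough text version can be produced without any plotting library. The sketch below is only an illustration: the function name, the bar width, and the three sample values are assumptions, and undefined reachability-distances are drawn at a cap value, as the example slides do with ε.

```python
def reachability_bars(ordering, cap, width=20):
    """Render one text bar per object in cluster order.
    ordering: (label, reachability) pairs; float('inf') means UNDEFINED
    and is drawn at the cap."""
    lines = []
    for label, reach in ordering:
        height = min(reach, cap)
        bar = "#" * round(width * height / cap)
        lines.append(f"{label:>2} {bar}")
    return lines

# Hypothetical ordering: an undefined start, a close object, a distant one.
lines = reachability_bars([("A", float("inf")), ("B", 10), ("C", 40)], cap=40)
print("\n".join(lines))
```

Each row corresponds to one position on the horizontal axis of the plot, and the bar length corresponds to the vertical axis.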

Page 34: Optics ordering points to identify the clustering structure

ALGORITHM PERFORMANCE

The authors performed an extensive performance test using different data sets and different parameter settings.

It turned out that the run-time of OPTICS was almost constantly 1.6 times the run-time of DBSCAN.

This is not surprising, since the run-time for OPTICS as well as for DBSCAN is heavily dominated by the run-time of the ε-neighborhood queries, which must be performed for each object in the database, i.e. the run-time for both algorithms is O(n * run-time of an ε-neighborhood query).

Page 35: Optics ordering points to identify the clustering structure

ALGORITHM PERFORMANCE

To retrieve the ε-neighborhood of an object o, a region query with the center o and the radius ε is used.

Without any index support, answering such a region query requires a scan through the whole database.

In this case, the run-time of OPTICS would be O(n²).

If a tree-based spatial index can be used, the run-time is reduced to O(n log n).

Page 36: Optics ordering points to identify the clustering structure

ALGORITHM PERFORMANCE

The height of such a tree-based index is O(log n) for a database of n objects in the worst case and, at least in low-dimensional spaces, a query with a "small" query region has to traverse only a limited number of paths.

Furthermore, if we have direct access to the ε-neighborhood, e.g. if the objects are organized in a grid, the run-time is further reduced to O(n), because in a grid the complexity of a single neighborhood query is O(1).
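The grid idea can be sketched as follows for 2-D points. The function names and the choice of a cell side equal to ε are assumptions made for illustration; with that cell size a query only has to look at the 3 x 3 block of cells around the query point. The query is constant-time only if the number of objects per cell is bounded.

```python
import math
from collections import defaultdict

def build_grid(points, cell):
    """Hash each 2-D point into a square cell of side length cell."""
    grid = defaultdict(list)
    for p in points:
        grid[(math.floor(p[0] / cell), math.floor(p[1] / cell))].append(p)
    return grid

def grid_neighbors(grid, center, eps, cell):
    """eps-neighborhood of center; assumes cell >= eps, so only the
    3 x 3 block of cells around center can contain neighbors."""
    cx, cy = math.floor(center[0] / cell), math.floor(center[1] / cell)
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for p in grid.get((cx + dx, cy + dy), []):
                if math.dist(p, center) <= eps:
                    found.append(p)
    return found

# Hypothetical points; the result is compared with a full scan.
points = [(0.5, 0.5), (1.2, 0.4), (3.0, 3.0), (0.9, 1.8), (-0.3, 0.2)]
eps = 1.0
grid = build_grid(points, cell=eps)
via_grid = sorted(grid_neighbors(grid, (0.5, 0.5), eps, cell=eps))
via_scan = sorted(p for p in points if math.dist(p, (0.5, 0.5)) <= eps)
print(via_grid == via_scan, via_grid)
```

The grid lookup returns the same neighborhood as the full scan while touching only the cells adjacent to the query point.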

Page 37: Optics ordering points to identify the clustering structure

CONCLUSION

OPTICS computes an augmented cluster-ordering of the database objects.

The main advantage of this approach, compared to the clustering algorithms proposed in the literature, is that it is not limited to one global parameter setting.

Instead, the augmented cluster-ordering contains information which is equivalent to the density-based clusterings corresponding to a broad range of parameter settings, and thus is a versatile basis for both automatic and interactive cluster analysis.

Page 38: Optics ordering points to identify the clustering structure

REFERENCES

[1] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander, "OPTICS: Ordering Points To Identify the Clustering Structure", Proc. ACM SIGMOD'99 Int. Conf. on Management of Data, Philadelphia PA, 1999.

[2] Han, Kamber, Pei, "Data Mining: Concepts and Techniques", Third Edition.

[3] Stefan Brecheisen, Hans-Peter Kriegel, Martin Pfeifle, "Efficient Density-Based Clustering of Complex Objects".

[4] Class lecture slides on density-based clustering (DBSCAN).

Page 39: Optics ordering points to identify the clustering structure

THANK YOU FOR YOUR CO-OPERATION


Page 40: Optics ordering points to identify the clustering structure

QUESTIONS??
