Density-Based Clustering of Uncertain Data (KDD 2005)

Authors: Hans-Peter Kriegel and Martin Pfeifle

Presenter: Chui Chun Kit (MPhil) [email protected] http://www.cs.hku.hk/~ckchui

Supervisor: Dr. Benjamin C.M. Kao

HKU Department of Computer Science

Database Research Seminar

18th May 2006

Presentation Outline

Introduction
  What is clustering?
  Density-based similarity measurement
  DBSCAN

Issues from mining certain data to uncertain data
  Why does data exhibit uncertainty?
  How to represent / model data uncertainty?
  How to represent the distance between two uncertain objects?
  Theoretical foundation of changing DBSCAN to FDBSCAN

FDBSCAN
  From DBSCAN to FDBSCAN
  Computational issues
  Experimental results

Conclusions

Introduction

What is Clustering?

Problem description: given a set of objects and a similarity measurement, discover groups of similar objects.

More precisely, find sets of objects whose intra-cluster similarity is high while the inter-cluster similarity is relatively low.

Different Clusters Discovered by Different Similarity Measurements

Distance-based, density-based, pattern-based, etc.

Density-based clustering

The main reason why we recognize the clusters is that within each cluster we have a typical density of objects which is considerably higher than outside of the cluster.

The clusters are separated by low object density regions (noise)

[Figure: objects plotted in the x-y plane. Any clusters?]

Density-based clustering can detect arbitrary cluster shapes

Key idea of density-based clustering

A density constraint decides which objects form clusters. Intuitively, for each object of a cluster, the neighborhood of a given radius has to contain at least a minimum number of objects, i.e. the density in the neighborhood has to exceed some threshold.

Objects that do not belong to any cluster are regarded as noise.

Previous Works on Density Based Clustering

DBSCAN: a density-based clustering algorithm that works on data with no uncertainty.

The uncertainty version of DBSCAN is presented later.

DBSCAN

Important definitions of DBSCAN: core object; directly density-reachable; density-reachable (skipped); density-connected (skipped).

For the sake of discussion, the last two definitions are skipped.

DBSCAN Definition 1: Core Object

Given the density constraint (µ and ε), an object o is defined as a core object iff there are µ or more objects within the ε-range of o.

In practice, we conduct a range search on object o with radius ε; if µ or more objects are returned, then o is a core object.
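The range-search test for a core object can be sketched in Python as follows. This is a minimal illustration rather than code from the talk: the function names `eps_neighbors` and `is_core_object`, the use of Euclidean distance, and the sample points are all assumptions. The query object itself is counted among the objects in its ε-range.

```python
import math

def eps_neighbors(points, o, eps):
    """Range search: indices of all points within distance eps of points[o].
    The query object itself is included in the result."""
    return [i for i, p in enumerate(points) if math.dist(points[o], p) <= eps]

def is_core_object(points, o, eps, mu):
    """o is a core object iff its eps-range holds at least mu objects."""
    return len(eps_neighbors(points, o, eps)) >= mu

# Hypothetical data: a centre point with four close neighbours and one far point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (0.0, -0.1), (2.0, 2.0)]
print(is_core_object(points, 0, eps=0.15, mu=5))  # centre point
print(is_core_object(points, 5, eps=0.15, mu=5))  # isolated point
```

A linear scan is used here for brevity; a spatial index would normally serve the range search on larger data sets.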

DBSCAN Definition 1: Core Object (Example, µ = 5)

Is o1 a core object? Since there are 5 objects within the ε-range of o1, o1 is a core object.

Since there are 5 objects within the ε-range of o2, o2 is a core object too.

[Figure: objects o1 and o2, each with its ε-range.]

DBSCAN Definition 2: Directly Density-Reachable

An object p is directly density-reachable from o if the following two conditions are satisfied:
1st condition: o is a core object.
2nd condition: d(p,o) ≤ ε.

DBSCAN Definition 2: Directly Density-Reachable (Example, µ = 5)

Question: Is o2 directly density-reachable from o1?

1st condition: Is o1 a core object? Since there are 5 objects within the ε-range of o1, o1 is a core object.

2nd condition: Is d(o2,o1) ≤ ε? Yes, o2 is within the ε-range of o1.

Thus, o2 is directly density-reachable from o1.

DBSCAN: How does it work?

Brief idea: search for clusters by checking the ε-neighborhood of each object in the database.

If a core object o is found, a new cluster containing o and its directly density-reachable objects is created.

DBSCAN then iteratively collects the objects that are directly density-reachable from the objects already in the cluster.
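The cluster-expansion idea can be sketched as below. This is a simplified illustration under stated assumptions, not the original DBSCAN implementation: the function name `dbscan`, the linear-scan range search, Euclidean distance, and the test data are all hypothetical.

```python
import math

NOISE = -1

def dbscan(points, eps, mu):
    """Return one cluster label per point; NOISE marks unclustered points."""
    def region(i):
        # Range search: everything within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)          # None = not visited yet
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < mu:            # i is not a core object
            labels[i] = NOISE              # may still become a border point later
            continue
        labels[i] = cluster                # start a new cluster at core object i
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:         # border point reached from a core object
                labels[j] = cluster
            if labels[j] is not None:      # already assigned, nothing to expand
                continue
            labels[j] = cluster
            reach = region(j)
            if len(reach) >= mu:           # j is a core object: keep expanding
                queue.extend(reach)
        cluster += 1
    return labels

# Hypothetical data: two dense groups and one isolated point.
group_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05)]
group_b = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1), (5.05, 5.05)]
labels = dbscan(group_a + group_b + [(10.0, 10.0)], eps=0.2, mu=4)
print(labels)
```

Only core objects expand the cluster; a non-core object that lies in the ε-range of a core object joins the cluster but contributes no further objects.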

DBSCAN (Example, µ = 5)

Arbitrarily pick a point, e.g. o1, and check whether it is a core object.

o1 is a core object, so a cluster is created with o1 and all objects that are directly density-reachable from o1.

DBSCAN continues to "expand" the cluster by adding objects that are directly density-reachable from cluster objects.

Since a1 is not a core object, a2 is NOT directly density-reachable from a1, so a2 is NOT added to the cluster.

If the current cluster does not expand any further, pick another point for the next iteration.

Eventually, clusters are formed. Objects that are not assigned to any cluster are regarded as noise.

From Certain Data to Uncertain Data

From certain to uncertain data: five major issues
  Why does data exhibit uncertainty?
  How to represent / model data uncertainty?
  How to represent the distance between two uncertain objects?
  What is a core object in uncertain data?
  What is directly density-reachable in uncertain data?

Why does data exhibit uncertainty?

In many modern application areas, e.g. the clustering of moving objects or sensor databases, only uncertain data is available.

For instance, in the area of mobile services, the objects continuously change their positions, so exact positional information is often not available.

In application areas such as clustering of distributed feature vectors, only approximate information is transmitted to a central server site, due to security concerns or limited bandwidth.

Uncertain Data (Example)

Somewhere in a tropical rain forest, the locations of a group of about 300 chimpanzees are tracked. An implanted device regularly reports the location of each chimpanzee.

However, the reported location is not precise; it only returns the area in which the chimpanzee is located. This area is called an uncertainty region.

Assume that the chimpanzee is equally likely to be at any location inside the uncertainty region.

Chimpanzee society is complicated; some young chimpanzees may gather to fight against the leader.

Zoologists are interested in studying the factors that affect the formation of different groups (clusters) inside the chimpanzee society.

One observation is that chimpanzees of the same group usually stay close together.

Assume that each chimpanzee belongs to one group only.

Density-based clustering can help to discover the chimpanzee groups (clusters).

[Figure: somewhere in the tropical rain forest, the uncertainty regions of 15 chimpanzees reported by the location tracking devices (one region per chimpanzee), plotted in the x-y plane with the clusters marked.]


Representing Uncertain Objects

Each uncertain object is represented by a probability density function (pdf).

[Figure: pdfs of 1-D objects, with probability plotted against value (e.g. temperature), and pdfs of 2-D objects, with probability plotted over the x-y plane.]

The probability that a 1-D object o has a value between a and b is the area under its pdf between a and b. Writing f_o for the pdf of o (a symbol chosen here for illustration):

P(a \le o \le b) = \int_a^b f_o(x) \, dx

Question: What is the distance between two uncertain objects o and o’?


How to represent the distance between uncertain objects?

  Distance density function pd(o,o’)
  Distance distribution function Pd(o,o’)(b)
  Distance expectation value Ed(o,o’): an aggregated value, which implies information loss

Distance Density Function pd(o,o’)

Express the distance between two objects by means of a probability density function.

Let d be a distance function, and let P(a ≤ d(o,o’) ≤ b) denote the probability that d(o,o’) is between a and b. A probability density function pd(o,o’) is called a distance density function if the following condition holds:

P(a \le d(o,o') \le b) = \int_a^b p_d(o,o')(x) \, dx


The probability density functions (pdfs) of the uncertain data items are considered independent.

The distance density function expresses the distance between two uncertain objects by means of a pdf.

[Figure: pdfs of two objects over value (e.g. temperature), and the resulting distance density function pd(o,o’), with probability plotted against the distance between o and o’. The curve is non-zero only between the minimum and the maximum possible distance between o and o’.]

From the distance density function, the probability that the distance between two uncertain objects is between a and b is given by the area under the curve between a and b:

P(a \le d(o,o') \le b) = \int_a^b p_d(o,o')(x) \, dx

The total area under the curve is 1.





Distance Distribution Function

Captures the probability that the distance between two uncertain objects is smaller than or equal to a value b.

This is useful in density-based clustering for expressing the probability that d(o’,o) ≤ b, which is the 2nd condition for directly density-reachable in DBSCAN.

When evaluating whether an object o’ is directly density-reachable from o, we may want to ask: what is the probability that o and o’ are close to each other, i.e. that the distance between o and o’ is smaller than or equal to b?

The distance distribution function Pd(o,o’)(b) is the answer.

[Figure: pdfs of the two objects and the distance density function pd(o,o’), plotted against the distance between o and o’, with the area to the left of b marked.]

The distance distribution function Pd(o,o’)(b) is equal to the integral of the distance density function pd(o,o’) from negative infinity to b:

P_d(o,o')(b) = \int_{-\infty}^{b} p_d(o,o')(x) \, dx
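When each object is available only as a set of sample points, the distance distribution function can be approximated without integration. The sketch below is an illustration under assumptions: the function name `distance_distribution`, Euclidean distance, and the sample values are hypothetical, and the samples of the two objects are taken to be independent draws from their pdfs.

```python
import math
from itertools import product

def distance_distribution(samples_o, samples_p, b):
    """Estimate P(d(o, o') <= b) as the fraction of sample pairs within distance b."""
    pairs = list(product(samples_o, samples_p))
    hits = sum(1 for x, y in pairs if math.dist(x, y) <= b)
    return hits / len(pairs)

# Hypothetical 2-D samples of two uncertain objects.
o = [(0.0, 0.0), (1.0, 0.0)]
p = [(0.0, 1.0), (3.0, 0.0)]
for b in (0.5, 1.5, 3.0):
    print(b, distance_distribution(o, p, b))
```

The estimate is non-decreasing in b and reaches 1 once b is at least the largest pairwise sample distance, mirroring the behavior of a cumulative distribution.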





Distance Expectation Value Ed(o,o’)

Represents the distance between two uncertain objects by one numerical value: the average distance between the two objects, aggregated from the distance density function.

E_d(o,o') = \int_{-\infty}^{\infty} x \, p_d(o,o')(x) \, dx

Advantage: since the distance between two uncertain objects is represented by a single value, traditional clustering algorithms such as DBSCAN work.

Disadvantage: information loss.
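The information loss can be made concrete with a small sample-based sketch. The names `expected_distance` and `fraction_within`, the Euclidean distance, and the sample points are assumptions chosen for illustration: two object pairs have the same expected distance, yet very different probabilities of lying within a given ε.

```python
import math
from itertools import product

def expected_distance(samples_o, samples_p):
    """Average distance over all sample pairs (estimate of the expectation value)."""
    dists = [math.dist(x, y) for x, y in product(samples_o, samples_p)]
    return sum(dists) / len(dists)

def fraction_within(samples_o, samples_p, eps):
    """Fraction of sample pairs whose distance is at most eps."""
    dists = [math.dist(x, y) for x, y in product(samples_o, samples_p)]
    return sum(1 for d in dists if d <= eps) / len(dists)

o = [(0.0, 0.0)]
p_fixed = [(2.0, 0.0)]               # always at distance 2 from o
p_spread = [(1.0, 0.0), (3.0, 0.0)]  # at distance 1 or 3 from o

print(expected_distance(o, p_fixed), expected_distance(o, p_spread))
print(fraction_within(o, p_fixed, 1.5), fraction_within(o, p_spread, 1.5))
```

An algorithm that sees only the expectation value cannot distinguish the two pairs, although with ε = 1.5 the second pair is within range half of the time and the first pair never is.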






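Theoretical Foundations I: Core Object Probability

Let P^{core}(o) denote the probability that an object o is a core object. The core object probability of an object o is given by the following formula:

P^{core}(o) = \sum_{A \subseteq D,\ |A| \ge \mu} \ \prod_{p \in A} P_d(p,o)(\varepsilon) \ \prod_{p \in D \setminus A} \bigl(1 - P_d(p,o)(\varepsilon)\bigr)

We start deriving this formula from the core object definition of DBSCAN.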

Theoretical Foundations I: Core Object Probability

In DBSCAN, an object o is a core object if the density constraint (µ and ε) is satisfied, i.e. there are µ or more objects p within the ε-range of o (d(p,o) ≤ ε).

The probability that an object o is a core object is therefore the probability that the density constraint is satisfied, i.e. the probability that there are µ or more objects p with d(p,o) ≤ ε.

Example (µ = 5): if ε is large enough that µ or more objects certainly lie within the ε-range of o, the core object probability of o is obviously 1.

If ε is small, so that for some object p sometimes d(p,o) ≤ ε and sometimes d(p,o) > ε, what is the core object probability of o?

Theoretical Foundations I: Core Object Probability

Consider each subset A of the database D whose cardinality is higher than or equal to µ.

For each such A, determine the probability that exactly the objects p of A have d(p,o) ≤ ε, and no other object in D\A does. This probability is the product of two parts:

First part: the probability that ALL objects p in A have d(p,o) ≤ ε, i.e. \prod_{p \in A} P_d(p,o)(\varepsilon).

Second part: the probability that NO object p in D\A has d(p,o) ≤ ε, i.e. \prod_{p \in D \setminus A} (1 - P_d(p,o)(\varepsilon)).

Recall that Pd(p,o)(ε) is the probability that the distance between the two uncertain objects p and o is smaller than or equal to ε.

Summing this probability over all such subsets A yields the core object probability.
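The subset sum can be evaluated literally for small inputs, as sketched below. The function names and the probability values are hypothetical. The second function, `at_least`, is not from the talk: it is an alternative dynamic-programming formulation that gives the same number when the per-object events are independent, and it is included only as a cross-check.

```python
from itertools import combinations
from math import prod

def core_probability_subsets(probs, mu):
    """Literal subset sum. probs[i] stands for P_d(p_i, o)(eps) of object p_i in D."""
    n = len(probs)
    total = 0.0
    for size in range(mu, n + 1):
        for A in combinations(range(n), size):
            inside = prod(probs[i] for i in A)                       # all of A within eps
            outside = prod(1.0 - probs[i] for i in range(n) if i not in A)  # none of D\A
            total += inside * outside
    return total

def at_least(probs, k):
    """P(at least k of the independent events occur), via dynamic programming."""
    dist = [1.0]  # dist[c] = probability that exactly c events occurred so far
    for q in probs:
        nxt = [0.0] * (len(dist) + 1)
        for c, w in enumerate(dist):
            nxt[c] += w * (1.0 - q)
            nxt[c + 1] += w * q
        dist = nxt
    return sum(dist[max(k, 0):])

probs = [1.0, 0.5, 0.5]  # hypothetical distance distribution values at eps
print(core_probability_subsets(probs, mu=2), at_least(probs, 2))
```

The literal version visits exponentially many subsets in the size of D, which is the cost the sampling approach later in the talk is meant to avoid.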






Theoretical Foundations II: Reachability Probability

Let P^{reach}(p,o) be the probability that p is directly density-reachable from o.

In DBSCAN, an object p is directly density-reachable from o if
1st condition: o is a core object.
2nd condition: d(p,o) ≤ ε.

A first attempt multiplies the core object probability P^{core}(o) of o by the probability of the 2nd condition:

P^{reach}(p,o) = P^{core}(o) \cdot P_d(p,o)(\varepsilon)   (incorrect)

Why is this incorrect? The two events depend on each other: the two conditions are NOT independent.

Example (µ = 3): [Figure: objects o, p and q with their pdfs and the ε-range of o.] In this case, the probability that o is a core object depends on the probability that d(p,o) ≤ ε, i.e. the 1st and 2nd conditions are NOT independent.

Two independent conditions

1st condition: consider the core object probability in D\p, with the density constraint µ relaxed by 1. The probability that at least µ-1 objects from D\p are located within the ε-range of o is

\sum_{A \subseteq D \setminus \{p\},\ |A| \ge \mu - 1} \ \prod_{q \in A} P_d(q,o)(\varepsilon) \ \prod_{q \in (D \setminus \{p\}) \setminus A} \bigl(1 - P_d(q,o)(\varepsilon)\bigr)

2nd condition: the probability that the distance between p and o is smaller than or equal to ε is P_d(p,o)(\varepsilon).

These two conditions are independent. Their product corresponds to the probability that at least µ objects from D are located in the ε-range of o and that p is one of them, which corresponds to the definition of directly density-reachable in DBSCAN:

P^{reach}(p,o) = \Bigl( \sum_{A \subseteq D \setminus \{p\},\ |A| \ge \mu - 1} \ \prod_{q \in A} P_d(q,o)(\varepsilon) \ \prod_{q \in (D \setminus \{p\}) \setminus A} \bigl(1 - P_d(q,o)(\varepsilon)\bigr) \Bigr) \cdot P_d(p,o)(\varepsilon)
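The difference between the naive product and the corrected one can be checked numerically. The sketch below uses hypothetical values for a µ = 3 setting with objects o, q and p; the helper `at_least` is a dynamic-programming stand-in for the subset sum and is an assumption of this example, not part of the talk.

```python
def at_least(probs, k):
    """P(at least k of the independent events occur), via dynamic programming."""
    dist = [1.0]  # dist[c] = probability that exactly c events occurred so far
    for q in probs:
        nxt = [0.0] * (len(dist) + 1)
        for c, w in enumerate(dist):
            nxt[c] += w * (1.0 - q)
            nxt[c + 1] += w * q
        dist = nxt
    return sum(dist[max(k, 0):])

def reachability_probability(dist_probs, p, mu):
    """dist_probs[x] stands for P_d(x, o)(eps) for every object x in D.
    First factor: at least mu-1 objects of D without p are within eps of o.
    Second factor: p itself is within eps of o."""
    others = [v for name, v in dist_probs.items() if name != p]
    return at_least(others, mu - 1) * dist_probs[p]

# Hypothetical values: o and q are certainly within eps of o, p only half of the time.
dist_probs = {"o": 1.0, "q": 1.0, "p": 0.5}
mu = 3
naive = at_least(list(dist_probs.values()), mu) * dist_probs["p"]
correct = reachability_probability(dist_probs, "p", mu)
print(naive, correct)
```

In this setting p is directly density-reachable from o exactly when p lies within ε, so the corrected value equals the distance probability of p, while the naive product counts that event twice.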

How does FDBSCAN work?

The traditional DBSCAN algorithm clusters a data set by always adding to the current cluster the objects that are directly density-reachable from the current query object o.

FDBSCAN works very similarly to the traditional approach.

For each uncertain object o:
  Check whether it is a core object.
  If yes, for each other object p:
    Check the reachability of p from o.
    If the reachability probability ≥ 0.5, p and o form a cluster.

There are O(|DB|^2) reachability probability computations.
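The loop can be sketched as follows, with the probability computations abstracted away. This is a simplified illustration and the control flow of the published algorithm may differ: the function name `fdbscan`, the callables `core_prob` and `reach_prob`, and the toy probabilities are all assumptions.

```python
def fdbscan(objects, core_prob, reach_prob, threshold=0.5):
    """Cluster uncertain objects given two callables:
    core_prob(o) -> core object probability of o,
    reach_prob(p, o) -> probability that p is directly density-reachable from o.
    Returns a dict object -> cluster id; None marks noise."""
    labels = {o: None for o in objects}
    cluster = 0
    for o in objects:
        if labels[o] is not None or core_prob(o) < threshold:
            continue
        labels[o] = cluster                 # o is treated as a core object
        queue = [o]
        while queue:
            cur = queue.pop()
            if core_prob(cur) < threshold:  # member of the cluster, but does not expand it
                continue
            for p in objects:
                if labels[p] is None and reach_prob(p, cur) >= threshold:
                    labels[p] = cluster
                    queue.append(p)
        cluster += 1
    return labels

# Hypothetical probabilities for four objects.
core = {"a": 0.9, "b": 0.6, "c": 0.2, "x": 0.1}
reach = {("b", "a"): 0.8, ("c", "b"): 0.7}
labels = fdbscan(list(core), core.get, lambda p, o: reach.get((p, o), 0.0))
print(labels)
```

The nested loop over objects makes the quadratic number of reachability probability evaluations visible.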

Computational Aspect I: Computing the Reachability Probability

The reachability probability depends on the core object probability, which in turn depends on the distance density function; each step involves an integration.

Direction 1: Avoid calculating the integration.

Sampling (Monte Carlo sampling): each uncertain object o is represented by a sequence of s sample points, i.e. <o1, o2, …, os>. The reachability probability is then computed based on the sample sequences.

How can it be done? (If time allows)

Direction 2: Reduce the number of reachability probability computations.

Some objects may be located very far away from o and obviously have no chance of being directly density-reachable from o.

Use MBRs (minimum bounding rectangles) to bound the object samples: for every object o, compute MBR(o), which bounds the sample points <o1, o2, …, os>. If MBR(p) is outside the ε-range of o, p can NOT be directly density-reachable from o.
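A pruning test of this kind can be sketched as below. The slides do not spell out the exact test, so this example assumes one reasonable variant: compare the minimum possible distance between MBR(o) and MBR(p) with ε. The function names `mbr`, `min_dist` and `can_prune` and the sample points are hypothetical.

```python
import math

def mbr(samples):
    """Minimum bounding rectangle of sample points: (lower corner, upper corner)."""
    dims = range(len(samples[0]))
    lo = tuple(min(s[d] for s in samples) for d in dims)
    hi = tuple(max(s[d] for s in samples) for d in dims)
    return lo, hi

def min_dist(box_a, box_b):
    """Smallest possible Euclidean distance between a point of box_a and a point of box_b."""
    (alo, ahi), (blo, bhi) = box_a, box_b
    # Per-dimension gap between the two intervals (0 if they overlap).
    gaps = [max(bl - ah, al - bh, 0.0) for al, ah, bl, bh in zip(alo, ahi, blo, bhi)]
    return math.hypot(*gaps)

def can_prune(samples_o, samples_p, eps):
    """True if no sample of p can lie within eps of any sample of o."""
    return min_dist(mbr(samples_o), mbr(samples_p)) > eps

o = [(0.0, 0.0), (1.0, 1.0)]
p = [(4.0, 0.0), (5.0, 1.0)]
print(min_dist(mbr(o), mbr(p)), can_prune(o, p, eps=2.0), can_prune(o, p, eps=3.0))
```

When the test succeeds, no sample pair needs to be inspected and the reachability probability of p from o is zero; when it fails, the sample sequences still have to be retrieved.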

Computational Aspect II: Computing the Core Object Probability (if time allows)

Two issues:
1st issue: there are many core object probability computations.
2nd issue: in each core object probability computation, we have to consider exponentially many (in |DB|) subsets A of DB.

1st Issue: Many Core Object Probability Computations

For each uncertain object o:
  Check the probability that o is a core object (core object probability ≥ 0.5).
  For each other object p:
    Check the reachability of p from o.
    If the reachability probability ≥ 0.5, p and o form a cluster.

The 1st condition of the reachability probability is itself a core object probability, and it has to be computed for all p in D.

2nd Issue: Exponentially Many Subsets to Consider for Each Core Object Value

Furthermore, the computation of each core object value has to consider exponentially many (in |DB|) subsets A of DB: all subsets A of D with cardinality greater than or equal to µ.

Sampling (Monte Carlo sampling): each uncertain object o is represented by a sequence of s sample points, i.e. <o1, o2, …, os>, and the probabilities are computed based on the sample sequences.

How can it be done?

Compute based on the sample sequences

s is the sample rate, and each object o has the sample sequence <o1, o2, …, os>.
Determine the core object probability based on s^2 meaningful samples.
oj is called the j-th instance of o.
Dj is the collection of the j-th instances of all objects in D.

E.g. s = 5:

a1, a2, a3, a4, a5
b1, b2, b3, b4, b5
c1, c2, c3, c4, c5
d1, d2, d3, d4, d5
e1, e2, e3, e4, e5

D1 = {a1, b1, c1, d1, e1}, D2 = {a2, b2, c2, d2, e2}, …

To compute the core object probability of o, create an s×s sample matrix M(o). M(o) keeps track of the information needed to deduce the core object probability of o; with some modification, it can also be used to deduce the reachability probability.

Each cell mi,j of M(o) contains the number of ε-neighbors of the object sample oi in the database instance Dj, where Dj consists of the j-th samples of all other objects (excluding oj).

Create sample matrix M(o) (Example: sample rate = 3, µ = 5)

[Figure: query object o with samples o1, o2, o3, and other objects a, b, c, d with their samples (a1, a2, a3, b1, b2, b3, ...); all object samples are bounded by MBRs.]

o is the query object, and all object samples are bounded by MBRs. The rows of M(o) correspond to the instances o1, o2, o3 of o, and the columns to the database instances 1, 2, 3.

Filling m1,1: how many ε-neighbors of o1 are there in D1?

Since o1 itself is also counted, m1,1 is initialized to 1.

By MBR pruning (min-max distance), we are sure that three objects contribute ε-neighbors of o1 in D1, so m1,1 becomes 4.

MBR(b) and MBR(a) cannot be pruned, so their sample sequences are retrieved. b1 and a1 are ε-neighbors of o1, so m1,1 becomes 6. Although b2 is also an ε-neighbor of o1, it is not counted because it is NOT in database instance 1.

6 is the final value: there are 6 ε-neighbors of object sample o1 in database instance D1.

The remaining cells are filled in the same way, one after another:

                 database instances
instances of o    1   2   3
      1           6   5   5
      2           6   4   5
      3           4   4   5

Now we have the sample matrix M(o).
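Building the sample matrix can be sketched directly from its definition. This is an illustration under assumptions: the function name `sample_matrix`, the dictionary layout, Euclidean distance and the 1-D toy samples are hypothetical, and the MBR pruning step is left out for brevity, so every sample is inspected.

```python
import math

def sample_matrix(o_samples, other_samples, eps):
    """m[i][j] = number of eps-neighbours of sample o_i in database instance D_j.
    other_samples maps each other object to its sample sequence (same length s)."""
    s = len(o_samples)
    m = [[1] * s for _ in range(s)]        # o_i itself is counted, so start at 1
    for i, oi in enumerate(o_samples):
        for j in range(s):
            for seq in other_samples.values():
                if math.dist(oi, seq[j]) <= eps:   # only the j-th sample belongs to D_j
                    m[i][j] += 1
    return m

# Hypothetical 1-D example with sample rate s = 2.
o_samples = [(0.0,), (10.0,)]
others = {"a": [(0.5,), (9.5,)], "b": [(0.2,), (0.3,)]}
M = sample_matrix(o_samples, others, eps=1.0)
print(M)
```

In a fuller version, objects whose MBR lies entirely inside or entirely outside the ε-range of a sample could be counted or skipped without retrieving their sample sequences.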

Compute based on the sample matrix M(o) (µ = 5)

Core object probability

1st step: count the number of elements in the sample matrix M(o) that contain values higher than or equal to µ.
2nd step: normalizing this count by s^2 yields the core object probability.

                 database instances
instances of o    1   2   3
      1           6   5   5
      2           6   4   5
      3           4   4   5

1st step: count = 6.

2nd step: core object probability of o = 6 / 9.

Since the core object probability is > 0.5, o is treated as a core object.
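The two steps amount to a count followed by a division, as the short sketch below shows on the matrix from the example. The function name `core_probability_from_matrix` is hypothetical.

```python
def core_probability_from_matrix(m, mu):
    """Fraction of cells of the s x s sample matrix whose value is at least mu."""
    s = len(m)
    count = sum(1 for row in m for v in row if v >= mu)   # 1st step: count
    return count / (s * s)                                # 2nd step: normalize by s^2

M_o = [[6, 5, 5],
       [6, 4, 5],
       [4, 4, 5]]
prob = core_probability_from_matrix(M_o, mu=5)
print(prob, prob >= 0.5)
```

Six of the nine cells reach µ = 5, which gives 6/9 and therefore a core object under the 0.5 threshold.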


Reachability probability from M(o)

Once o is treated as a core object, the reachability of each other object p from o is checked. The reachability probability is the product of two parts.

The first part (the core object probability of o in D\p with the density constraint relaxed to µ-1) can be derived from M(o):
1st step: decrease by 1 every value mi,j for which d(oi,pj) ≤ ε holds.
2nd step: count the number of elements in the sample matrix M(o) that contain values higher than or equal to µ-1.
3rd step: normalizing this count by s^2 yields the probability.

The second part (the probability that d(p,o) ≤ ε): count the number of events d(oi,pj) ≤ ε and normalize the count by s×s. The MBRs of the object samples can be used for pruning.

Computing the first part (reachability of a from o)

Conceptually, M(o) contains the ε-neighbor information in D; we want it to contain the information in D\a.

1st step: decrease m1,1 and m1,3 by 1, decrease m2,1 and m2,3 by 1, and decrease m3,3 by 1:

                 database instances
instances of o    1   2   3
      1           5   5   4
      2           5   4   4
      3           4   4   4

2nd step: count the elements with values higher than or equal to µ-1 = 5-1 = 4. All 9 cells qualify.

3rd step: the first part is therefore 9/9 = 1.

Computing the second part

1st step: count the number of events d(oi,aj) ≤ ε: count = 2 + 2 + 1 = 5.

2nd step: normalize the count by s^2: 5/9.

Reachability of a from o

The reachability probability of a from o is 1 × 5/9 = 5/9. Since 5/9 ≥ 0.5, a is directly density-reachable from o, and a and o form a cluster.

Reachability of other objects from o

The same procedure is applied to the other objects, starting again from the original M(o):

                 database instances
instances of o    1   2   3
      1           6   5   5
      2           6   4   5
      3           4   4   5
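The reachability computation for object a can be reproduced from the example's numbers. In this sketch the function name `reachability_from_matrix` and the boolean matrix `near_a` are assumptions; `near_a[i][j]` marks the sample pairs with d(oi,aj) ≤ ε, which in the example are the cells decreased in the 1st step.

```python
def reachability_from_matrix(m, near, mu):
    """Return (first part, second part, reachability probability).
    m is the s x s sample matrix of o; near[i][j] is True iff d(o_i, p_j) <= eps."""
    s = len(m)
    # 1st step: remove p's contribution from every cell where p_j was counted.
    reduced = [[m[i][j] - (1 if near[i][j] else 0) for j in range(s)] for i in range(s)]
    # 2nd and 3rd step: fraction of cells that still reach mu - 1.
    first = sum(1 for row in reduced for v in row if v >= mu - 1) / (s * s)
    # Second part: fraction of sample pairs with d(o_i, p_j) <= eps.
    second = sum(1 for row in near for hit in row if hit) / (s * s)
    return first, second, first * second

M_o = [[6, 5, 5],
       [6, 4, 5],
       [4, 4, 5]]
near_a = [[True, False, True],
          [True, False, True],
          [False, False, True]]
first, second, reach = reachability_from_matrix(M_o, near_a, mu=5)
print(first, second, reach)
```

The original matrix is left untouched, so the same M(o) can be reused for every other object p.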

Experimental Evaluation

Datasets

Artificial data set (ART): 1000 2-dimensional objects which are normally distributed in [0,1]. Each object is randomly surrounded by a box with a side length of p < 1 in each dimension (the data fuzziness). A uniform probability distribution is assumed within the box.

Engineering data set (PLANE): 5000 42-dimensional objects, normalized.

Algorithms compared

FDBSCAN.

EXPDBSCAN: represents the distance between two uncertain objects by a single distance expectation value Ed(o,o’) and uses the traditional DBSCAN algorithm to mine the data.

Implementation

Java 1.4, Windows platform, 730 MHz processor, 512 MB main memory, sample rate s = 5.

Experiment 1: Efficiency of FDBSCAN

Measure the runtimes of FDBSCAN and EXPDBSCAN on the ART dataset with p = 0.01 (little fuzziness in the dataset).

[Figure: runtime (s) of FDBSCAN and EXPDBSCAN.]

Open question: does EXPDBSCAN apply the same MBR pruning strategies as FDBSCAN?

Experiment 2: Effectiveness of FDBSCAN

Measures the relation between the quality of the cluster results and the data fuzziness, for FDBSCAN and EXPDBSCAN.

How to measure the quality of clusters? Treat the measure as a black box for the time being: a good clustering has a quality value close to 1, and a poor one a lower value.

FDBSCAN returns clusters with better quality than EXPDBSCAN for all data fuzziness values and numbers of dimensions, i.e. it is more effective.

On ART, EXPDBSCAN performs quite well, but for high-dimensional data its quality is much worse than that of FDBSCAN.

The quality of both EXPDBSCAN and FDBSCAN falls with high data fuzziness; however, the quality of FDBSCAN falls less than that of EXPDBSCAN.

Experiment 3: Accuracy of the Core Object Classification

How accurately do FDBSCAN and EXPDBSCAN classify core objects?

Precision and recall of core objects:
Precision shows how precise the reported set of core objects is: # reported real core objects / # core objects reported.
Recall shows the percentage of real core objects that are reported: # reported real core objects / total # of real core objects in D.
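The two measures can be written down as a short function. The name `precision_recall` and the object identifiers are hypothetical, and returning 0.0 for an empty set is a convention chosen for this sketch.

```python
def precision_recall(reported, real):
    """Precision and recall of a reported set of core objects against the real set."""
    reported, real = set(reported), set(real)
    hits = len(reported & real)                       # reported real core objects
    precision = hits / len(reported) if reported else 0.0
    recall = hits / len(real) if real else 0.0
    return precision, recall

reported = {"a", "b", "c", "d"}          # core objects returned by an algorithm
real = {"b", "c", "d", "e", "f", "g"}    # core objects in the ground truth
print(precision_recall(reported, real))
```

Here three of the four reported objects are real core objects, and three of the six real core objects are found.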

FDBSCAN has a higher precision and recall of core objects on the 2D ART dataset.

Very few real core objects are found by EXPDBSCAN; however, nearly all of the returned core objects are real core objects.

The precision and recall are not 100% because FDBSCAN uses a sampling approach for calculating the core object probability.

The precision and recall of FDBSCAN increase in high dimensions. Why?

EXPDBSCAN has a lower recall than FDBSCAN. Why?

Why does EXPDBSCAN suffer from a low recall? (Example, µ = 5)

[Figure: ten uncertain objects, numbered 1 to 10, each with a Gaussian probability density function; A and B are core point candidates, each drawn with its ε-range.]

Number of ε-neighbors of A = 5, so A is a core object.

Number of ε-neighbors of B = 4, so B is NOT a core object.

Conclusion

The paper demonstrated how density-based clustering can be carried out based on uncertain information.

It presented the theoretical foundations for density-based clustering of uncertain data.

FDBSCAN works on the fuzzy distance function directly instead of working on lossy aggregated information.

My comments

We also want to know:

The relationship between the sample rate and the execution time. A higher sample rate should give a more accurate result, but it generally trades off against execution time. What is the relationship between these two parameters?

Sample rate vs. cluster quality.

Sample rate vs. data dimensionality, which would be a reference for determining the sample rate based on the data characteristics.

Sample rate vs. fuzziness of data.

Something confusing:

Each uncertain object is represented by an MBR, and MBR(o) bounds the samples of o. This means that MBR(o) may not bound the whole uncertainty region of o. With high data fuzziness, MBR(o) may not precisely indicate the uncertainty region of the real object o.

We also want to know:

The reason for using a probability of 0.5 to decide whether an object is a core object is questionable. Why not treat this threshold as a parameter?

A higher value should lead to more false negative core objects; a lower value should lead to more false positive core objects.

The End

Thank you
