Density-Based Clustering of Uncertain Data (KDD2005)
Authors: Hans-Peter Kriegel and Martin Pfeifle
Presenter: Chui Chun Kit (MPhil) [email protected] http://www.cs.hku.hk/~ckchui
Supervisor: Dr. Benjamin C.M. Kao
HKU Department of Computer Science
Database Research Seminar
18th May 2006
Presentation Outline
Introduction
  What is clustering?
  Density-based similarity measurement
  DBSCAN
Issues from mining certain data to uncertain data
  Why does data exhibit uncertainty?
  How to represent / model data uncertainty?
  How to represent the distance between two uncertain objects?
  Theoretical foundation of changing DBSCAN to FDBSCAN
FDBSCAN
  From DBSCAN to FDBSCAN
  Computational issues
  Experimental results
Conclusions
Introduction
What is Clustering?
Problem description: given a set of objects and a similarity measurement, discover groups of similar objects.
More precisely, find sets of objects for which intra-cluster similarity is high while inter-cluster similarity is relatively low.
Different similarity measurements discover different clusters: distance-based, density-based, pattern-based, etc.
Density-based clustering
The main reason why we recognize the clusters is that within each cluster we have a typical density of objects which is considerably higher than outside of the cluster.
The clusters are separated by low object density regions (noise)
[Figure: scatter plot of objects in the x-y plane, titled "Any clusters?"]
Density-based clustering can detect arbitrary cluster shapes
Key idea of density-based clustering
Density constraint for objects to form clusters: intuitively, for each object of a cluster, the neighborhood of a given radius has to contain at least a minimum number of objects.
I.e., the density in the neighborhood has to exceed some threshold.
Objects that do not belong to any cluster are regarded as noise.
Previous Works on Density Based Clustering
DBSCAN: a density-based clustering algorithm that works on data with no uncertainty.
The uncertain-data version of DBSCAN is presented later.
DBSCAN
Two important definitions of DBSCAN: core objects and directly density-reachable.
Two further definitions, density reachable and density connected, are skipped for the sake of discussion.
DBSCAN Definition 1: Core Object
Given the density constraint (µ and ε), an object o is defined as a core object iff there are µ or more objects within the ε-range of o.
In practice, we conduct a range search on object o with radius ε; if µ or more objects are returned, then o is a core object.
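The range-search test for a core object can be sketched as follows. This is a minimal illustration, assuming objects are points given as coordinate tuples, Euclidean distance, and a linear scan in place of an index; the function names are hypothetical and o itself counts as one of its own neighbors.

```python
from math import dist


def eps_neighbors(points, o, eps):
    """Range search: all points within distance eps of o (o itself included)."""
    return [p for p in points if dist(p, o) <= eps]


def is_core_object(points, o, eps, mu):
    """o is a core object iff its eps-range contains mu or more objects."""
    return len(eps_neighbors(points, o, eps)) >= mu


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05), (2.0, 2.0)]
print(is_core_object(points, (0.05, 0.05), eps=0.2, mu=5))  # dense region
print(is_core_object(points, (2.0, 2.0), eps=0.2, mu=5))    # isolated point
```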
DBSCAN Definition 1: Core Object, example (µ = 5)
Is o1 a core object? Since there are 5 objects within the ε-range of o1, o1 is a core object.
Since there are 5 objects within the ε-range of o2, o2 is a core object too.
[Figure: objects o1 and o2, each with its ε-range]
DBSCAN Definition 2: Directly density-reachable
An object p is directly density-reachable from o if the following conditions are satisfied:
1st condition: o is a core object
2nd condition: d(p,o) ≤ ε
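The two conditions translate directly into a predicate. The sketch below assumes points as coordinate tuples and Euclidean distance; the function name is hypothetical. Note that the relation is not symmetric: p can be directly density-reachable from o without o being directly density-reachable from p.

```python
from math import dist


def is_directly_density_reachable(points, p, o, eps, mu):
    """True iff o is a core object (1st condition) and d(p, o) <= eps (2nd condition)."""
    neighbors_of_o = sum(1 for q in points if dist(q, o) <= eps)
    o_is_core = neighbors_of_o >= mu
    return o_is_core and dist(p, o) <= eps


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05), (0.3, 0.05)]
center, border = (0.05, 0.05), (0.3, 0.05)
print(is_directly_density_reachable(points, border, center, eps=0.25, mu=5))
print(is_directly_density_reachable(points, center, border, eps=0.25, mu=5))
```

With these points the center has six objects in its ε-range while the border point has only three, so the first call succeeds and the second fails.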
DBSCAN Definition 2: Directly density-reachable, example (µ = 5)
Question: Is o2 directly density-reachable from o1?
1st condition: Is o1 a core object? Yes, since there are 5 objects within the ε-range of o1.
2nd condition: Is d(o2,o1) ≤ ε? Yes, o2 is within the ε-range of o1.
Thus, o2 is directly density-reachable from o1.
[Figure: o1 with its ε-range, which contains o2]
DBSCAN: How does it work?
Brief idea: search for clusters by checking the ε-neighborhood of each object in the database.
If a core object o is found, a new cluster containing o and its directly density-reachable objects is created.
DBSCAN then iteratively collects the objects that are directly density-reachable from the objects in the cluster.
DBSCAN example (µ = 5)
Arbitrarily pick a point, e.g. o1, and check whether it is a core object.
o1 is a core object, so a cluster with o1 and all objects directly density-reachable from o1 is created.
DBSCAN continues to "expand" the cluster by adding objects which are directly density-reachable from cluster objects.
Since a1 is not a core object, a2 is NOT directly density-reachable from a1, so a2 is NOT added to the cluster.
If the current cluster does not expand any further, pick another point (e.g. o2) for the next iteration.
Eventually, clusters are formed. Objects that are not assigned to any cluster are regarded as noise.
[Figure: step-by-step expansion of a cluster from o1 through overlapping ε-ranges, with the objects a1, a2 and o2]
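The expansion procedure walked through above can be sketched in a few lines. This is an illustrative version, assuming points as coordinate tuples, Euclidean distance and a quadratic-time range search; the names `dbscan`, `region` and `NOISE` are hypothetical and not taken from the slides.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, mu):
    """Return one cluster id per point; NOISE marks objects in no cluster."""
    n = len(points)
    labels = [None] * n  # None = not visited yet

    def region(i):
        # Range search around point i (includes i itself).
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < mu:
            labels[i] = NOISE  # may later be claimed as a border object
            continue
        labels[i] = cluster_id  # i is a core object: start a new cluster
        frontier = list(neighbors)
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # reached from a core object
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region(j)
            if len(j_neighbors) >= mu:  # only core objects expand the cluster
                frontier.extend(j_neighbors)
        cluster_id += 1
    return labels


points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11), (50, 50)]
print(dbscan(points, eps=1.5, mu=3))
```

In this toy data set the two groups of three points become clusters 0 and 1, and the isolated point at (50, 50) is left as noise.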
From Certain Data to Uncertain Data
From certain to uncertain data: five major issues
Why does data exhibit uncertainty?
How to represent / model data uncertainty?
How to represent the distance between two uncertain objects?
What is a core object in uncertain data?
What is directly density-reachable in uncertain data?
Why does data exhibit uncertainty?
In many modern application areas, e.g. the clustering of moving objects or sensor databases, only uncertain data is available.
For instance, in the area of mobile services, the objects continuously change their positions so that exact positional information is often not available.
Why data exhibit uncertainty?
In application areas such as clustering of distributed feature vectors, due to security aspects or to limited bandwidth, only approximated information is transmitted to a central server site.
Uncertain Data (Example)
Somewhere in a tropical rain forest, the locations of a group of about 300 chimpanzees are tracked.
An implanted device reports the location of each chimpanzee regularly.
However, the reported location is not precise; the device only returns the area in which the chimpanzee is located.
This area is called an uncertainty region.
Assume that the chimpanzee is equally likely to be at any location inside the uncertainty region.
Uncertain Data (Example)
Chimpanzee society is complicated; some young chimpanzees may gather to fight against the leader.
Zoologists are interested in studying the factors that affect the formation of different groups (clusters) inside the chimpanzee society.
One observation is that chimpanzees of the same group usually stay close together.
Assume that each chimpanzee belongs to one group only.
Density-based clustering can help to discover the chimpanzee groups (clusters).
Uncertain Data (Example)
[Figure: somewhere in the tropical rain forest, the uncertainty regions of 15 chimpanzees reported by the location tracking devices, plotted in the x-y plane, with the clusters marked]
Representing Uncertain Objects
Uncertain objects are represented by probability density functions (pdfs).
[Figure: probability density functions of 1-D objects over a value axis (e.g. temperature), and of 2-D objects over the x-y plane]
The probability that an object o has a value between a and b is the area under its probability density function $p_o$ between a and b:
$$P(a \le o \le b) = \int_a^b p_o(x)\,dx$$
Question: What is the distance between two uncertain objects o and o’?
How to represent the distance between uncertain objects?
Distance density function pd(o,o’)
Distance distribution function Pd(o,o’)(b)
Distance expectation value Ed(o,o’) (an aggregated value, with information loss)
Distance Density Function pd(o,o’)
Express the distance between two objects by means of a probability density function.
Let d be a distance function, and let P(a ≤ d(o,o’) ≤ b) denote the probability that d(o,o’) is between a and b.
A probability density function pd(o,o’) is called a distance density function if the following condition holds:
$$P(a \le d(o,o') \le b) = \int_a^b p_d(o,o')(x)\,dx$$
Distance Density Function pd(o,o’)
The probability density functions (pdfs) of the uncertain data items are considered independent.
The distance density function expresses the distance between two uncertain objects by means of a pdf.
[Figure: the pdfs of o and o’ over a value axis (e.g. temperature), and the resulting distance density function pd(o,o’), plotted as probability against the distance between o and o’]
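Distance Density Function pd(o,o’)
From the distance density function, the probability that the distance between two uncertain objects is between a and b is given by the area under pd(o,o’) between a and b:
$$P(a \le d(o,o') \le b) = \int_a^b p_d(o,o')(x)\,dx$$
The total area under pd(o,o’), taken from the minimum possible distance to the maximum possible distance between o and o’, is 1.
[Figure: pd(o,o’) plotted as probability against the distance between o and o’, with the area between a and b marked]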
Distance Distribution Function
Captures the probability that the distance between two uncertain objects is smaller than or equal to a value b.
Useful in density-based clustering for expressing the probability that d(o’,o) ≤ b, which corresponds to the 2nd condition for directly density-reachable in DBSCAN.
Distance Distribution Function
In density-based clustering, when evaluating whether an object o’ is directly density-reachable from o, we may want to ask: what is the probability that o and o’ are close to each other, i.e. that the distance between o and o’ is smaller than or equal to b?
The distance distribution function Pd(o,o’)(b) is the answer.
[Figure: the probability density functions (pdfs) of o and o’]
Distance Distribution Function
The distance distribution function Pd(o,o’)(b) is equal to the integral of the distance density function pd(o,o’) from negative infinity to b:
$$P_d(o,o')(b) = \int_{-\infty}^{b} p_d(o,o')(x)\,dx$$
[Figure: the distance density function pd(o,o’), plotted as probability against the distance between o and o’, with the value b marked]
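When each uncertain object is available only as a finite set of sample points, the distance distribution function can be approximated without any integration. The sketch below assumes equally weighted, independent samples and Euclidean distance; `distance_distribution` is a hypothetical helper name.

```python
from math import dist


def distance_distribution(samples_o, samples_p, b):
    """Estimate P(d(o, p) <= b) as the fraction of sample pairs within distance b."""
    pairs = [(x, y) for x in samples_o for y in samples_p]
    within = sum(1 for x, y in pairs if dist(x, y) <= b)
    return within / len(pairs)


samples_o = [(0.0, 0.0), (1.0, 0.0)]
samples_p = [(2.0, 0.0), (3.0, 0.0)]
# The four pairwise distances are 2, 3, 1 and 2.
print(distance_distribution(samples_o, samples_p, b=2.0))  # 3 of 4 pairs
print(distance_distribution(samples_o, samples_p, b=0.5))  # no pair
```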
Distance Expectation Value Ed(o,o’)
Represents the distance between two uncertain objects by one numerical value: the average distance between the two objects, aggregated from the distance density function.
$$E_d(o,o') = \int_{-\infty}^{\infty} x \cdot p_d(o,o')(x)\,dx$$
Advantage: since the distance between two uncertain objects is represented by a single value, traditional clustering algorithms such as DBSCAN work.
Disadvantage: information loss.
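A small sample-based comparison illustrates the information loss. It assumes equally weighted, independent sample points and Euclidean distance; both function names are hypothetical.

```python
from math import dist


def expected_distance(samples_o, samples_p):
    """Average distance over all sample pairs (a single aggregated value)."""
    total = sum(dist(x, y) for x in samples_o for y in samples_p)
    return total / (len(samples_o) * len(samples_p))


def probability_within(samples_o, samples_p, eps):
    """Fraction of sample pairs whose distance is at most eps."""
    hits = sum(1 for x in samples_o for y in samples_p if dist(x, y) <= eps)
    return hits / (len(samples_o) * len(samples_p))


samples_o = [(0.0, 0.0), (1.0, 0.0)]
samples_p = [(2.0, 0.0), (3.0, 0.0)]
eps = 1.5
print(expected_distance(samples_o, samples_p))        # pairwise distances 2, 3, 1, 2
print(probability_within(samples_o, samples_p, eps))  # only the pair at distance 1
```

The expected distance of 2.0 is larger than ε = 1.5, so an algorithm using only the expectation treats the two objects as never being neighbors, although one quarter of the sample pairs are within ε.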
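Theoretical Foundations I: Core Object Probability
Let $P^{core}(o)$ denote the probability that an object o is a core object.
The core object probability of an object o is given by the following formula:
$$P^{core}(o) = \sum_{A \subseteq D,\ |A| \ge \mu} \ \prod_{p \in A} P_d(p,o)(\varepsilon) \prod_{p \in D \setminus A} \bigl(1 - P_d(p,o)(\varepsilon)\bigr)$$
We derive this formula starting from the core object definition of DBSCAN.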
Theoretical Foundations I: Core Object Probability
In DBSCAN, an object o is a core object if the density constraint (µ and ε) is satisfied, i.e. there are µ or more objects p within the ε-range of o (d(p,o) ≤ ε).
The probability that an object o is a core object is the probability that the density constraint is satisfied, i.e. the probability that there are µ or more objects p with d(p,o) ≤ ε.
Theoretical Foundations I: Core Object Probability
Example (µ = 5)
If ε is large enough (see figure), the core object probability of o is obviously 1.
If ε is small, what is the core object probability of o? Sometimes d(p,o) ≤ ε and sometimes d(p,o) > ε.
[Figure: probability density functions (pdfs) of o and a neighboring object p, with a large and a small ε-range around o]
Theoretical Foundations I: Core Object Probability
For each subset A of the database D with cardinality higher than or equal to µ, determine the probability that exactly the objects p of A satisfy d(p,o) ≤ ε, and no other object in D\A does:
$$\prod_{p \in A} P_d(p,o)(\varepsilon) \cdot \prod_{p \in D \setminus A} \bigl(1 - P_d(p,o)(\varepsilon)\bigr)$$
First part: the probability that ALL objects p in A satisfy d(p,o) ≤ ε.
Second part: the probability that NO object p in D\A satisfies d(p,o) ≤ ε.
Recall that Pd(p,o)(ε) is the probability that the distance between the two uncertain objects p and o is smaller than or equal to ε.
Summing these probabilities over all such subsets A gives the core object probability of o.
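The subset sum can be evaluated literally for a tiny database, which makes the two parts of each term visible. The sketch assumes that the values Pd(p,o)(ε) are already known and passed in as a list, one per object in D; the function name is hypothetical, and the enumeration is exponential in |D|, so it is for illustration only.

```python
from itertools import combinations
from math import prod


def core_object_probability(neighbor_probs, mu):
    """neighbor_probs[i] = P(d(p_i, o) <= eps) for each object p_i in D."""
    n = len(neighbor_probs)
    total = 0.0
    for size in range(mu, n + 1):  # every subset A with |A| >= mu
        for subset in combinations(range(n), size):
            inside = set(subset)
            # First part: all objects of A are within the eps-range of o.
            first_part = prod(neighbor_probs[i] for i in inside)
            # Second part: no object of D \ A is within the eps-range of o.
            second_part = prod(1.0 - neighbor_probs[i] for i in range(n) if i not in inside)
            total += first_part * second_part
    return total


# Three objects: one certainly in range, two with probability 0.5 each.
print(core_object_probability([1.0, 0.5, 0.5], mu=2))
```

For this input the result is 0.75: the density constraint fails only when both uncertain objects fall outside the ε-range, which happens with probability 0.25.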
Theoretical Foundations II: Reachability Probability
Let $P^{reach}(p,o)$ be the probability that p is reachable from o.
In DBSCAN, an object p is directly density-reachable from o if
1st condition: o is a core object
2nd condition: d(p,o) ≤ ε
Simply multiplying the core object probability of o by Pd(p,o)(ε) is incorrect. Why? The two events depend on each other: the two conditions are NOT independent.
Theoretical Foundations II: Reachability Probability
Example (µ = 3)
In this case, the probability that o is a core object depends on the probability that d(p,o) ≤ ε, i.e. the 1st and 2nd conditions are NOT independent, so the simple product is incorrect.
[Figure: probability density functions (pdfs) of the objects o, p and q, with the ε-range of o]
Theoretical Foundations II: Reachability Probability
Two independent conditions:
1st condition: consider the core object probability of o in D\p, and relax the density constraint µ by 1, i.e. the probability that at least µ-1 objects from D\p are located within the ε-range of o.
2nd condition: consider the probability that d(p,o) ≤ ε.
Their product corresponds to the probability that at least µ objects o’ from D satisfy d(o’,o) ≤ ε and that object p is one of them, which corresponds to the definition of directly density-reachable in DBSCAN.
[Figure: objects o, p and q, once without p for the 1st condition and once with p for the 2nd condition]
Theoretical Foundations II: Reachability Probability
The reachability probability is the product of two independent factors:
$$P^{reach}(p,o) = P^{core}_{D \setminus \{p\},\ \mu-1}(o) \cdot P_d(p,o)(\varepsilon)$$
First factor: the probability that at least µ-1 objects from D\p have a distance to o smaller than or equal to ε.
Second factor: the probability that the distance between p and o is smaller than or equal to ε.
Because the two conditions are independent, their product corresponds to the probability that at least µ objects from D are located in the ε-range of o and that p is one of them.
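The product of the two independent factors can be computed as below. As an assumption of this sketch (not something stated in the slides), the "at least µ-1 objects" factor is evaluated with a small dynamic program over the independent events instead of enumerating subsets; both approaches give the same value. The function names are hypothetical, and the values Pd(·,o)(ε) are assumed to be given.

```python
def prob_at_least(probs, k):
    """P(at least k of the independent events with the given probabilities occur)."""
    exactly = [1.0]  # exactly[j] = P(exactly j events occurred so far)
    for q in probs:
        nxt = [0.0] * (len(exactly) + 1)
        for j, w in enumerate(exactly):
            nxt[j] += w * (1.0 - q)  # this event does not occur
            nxt[j + 1] += w * q      # this event occurs
        exactly = nxt
    return sum(exactly[max(k, 0):])


def reachability_probability(neighbor_probs, p_index, mu):
    """neighbor_probs[i] = P(d(p_i, o) <= eps); p_index selects the object p."""
    others = [q for i, q in enumerate(neighbor_probs) if i != p_index]
    first = prob_at_least(others, mu - 1)  # core object probability in D \ p, constraint mu - 1
    second = neighbor_probs[p_index]       # P(d(p, o) <= eps)
    return first * second


probs = [1.0, 0.5, 0.5]  # index 0 plays the role of o itself
print(reachability_probability(probs, p_index=1, mu=2))
```

Here the result is 0.5, whereas naively multiplying the core object probability in D (0.75 for this input) by 0.5 would give 0.375, which shows the effect of the dependence between the two original conditions.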
How does FDBSCAN work?
The traditional DBSCAN algorithm clusters a data set by always adding objects to the current cluster which are directly density-reachable from the current query object o.
FDBSCAN works very similarly to the traditional approach.
For each uncertain object o:
  Check if it is a core object.
  If yes, for each other object p:
    Check the reachability of p from o.
    If the reachability probability ≥ 0.5, p and o form a cluster.
There are O(|DB|^2) reachability probability computations.
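The loop above can be sketched as follows, assuming the core object probabilities and the pairwise reachability probabilities have already been computed and are passed in as a list and a matrix. The cluster expansion mirrors DBSCAN; the function name, the input layout and the use of -1 for noise are assumptions of this sketch, not details given in the slides.

```python
def fdbscan(core_prob, reach_prob, threshold=0.5):
    """core_prob[o]: core object probability of o.
    reach_prob[o][p]: probability that p is directly density-reachable from o.
    Returns one cluster id per object; -1 marks noise."""
    n = len(core_prob)
    labels = [-1] * n
    cluster_id = 0
    for o in range(n):
        if labels[o] != -1 or core_prob[o] < threshold:
            continue
        labels[o] = cluster_id
        frontier = [o]
        while frontier:
            q = frontier.pop()
            if core_prob[q] < threshold:
                continue  # non-core objects do not expand the cluster
            for p in range(n):
                if labels[p] == -1 and reach_prob[q][p] >= threshold:
                    labels[p] = cluster_id
                    frontier.append(p)
        cluster_id += 1
    return labels


core_prob = [0.9, 0.8, 0.2, 0.1]
reach_prob = [
    [0.0, 0.7, 0.1, 0.0],
    [0.7, 0.0, 0.6, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(fdbscan(core_prob, reach_prob))
```

In this toy input, objects 0 and 1 are core objects, object 2 is pulled in through object 1, and object 3 stays noise.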
Computational Aspect I: Computing the reachability probability
[Figure: dependency chain: reachability probability → core object probability → distance density function, with the steps labelled "integration"]
Direction 1: Avoid calculating the integration
Sampling (Monte Carlo sampling): each uncertain object o is represented by a sequence of s sample points, i.e. <o1,o2,…,os>.
Compute the probabilities based on the sample sequences.
How can it be done? (If time allows)
Computational Aspect I: Computing the reachability probability
Direction 2: Reduce the number of reachability probability computations.
Some objects may be located very far away from o and obviously have no chance of being directly density-reachable from o.
Use MBRs (minimum bounding rectangles) to bound the object samples: for every object o, compute MBR(o) bounding the sample points <o1,o2,…,os>.
If MBR(p) is outside the ε-range of o, p CANNOT be directly density-reachable from o.
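One way to realise this pruning test is to compare the minimum distance between MBR(o) and MBR(p) with ε. The sketch below assumes axis-aligned rectangles, Euclidean distance, and that "outside the ε-range of o" means that this minimum distance exceeds ε; the function names are hypothetical.

```python
from math import sqrt


def mbr(samples):
    """Axis-aligned minimum bounding rectangle of the sample points: (lower, upper)."""
    dims = range(len(samples[0]))
    lower = tuple(min(s[k] for s in samples) for k in dims)
    upper = tuple(max(s[k] for s in samples) for k in dims)
    return lower, upper


def min_dist(box_a, box_b):
    """Smallest possible distance between a point in box_a and a point in box_b."""
    (lo_a, hi_a), (lo_b, hi_b) = box_a, box_b
    gaps = [max(lo_b[k] - hi_a[k], lo_a[k] - hi_b[k], 0.0) for k in range(len(lo_a))]
    return sqrt(sum(g * g for g in gaps))


def can_prune(samples_o, samples_p, eps):
    """True if no sample of p can lie within eps of any sample of o."""
    return min_dist(mbr(samples_o), mbr(samples_p)) > eps


samples_o = [(0.0, 0.0), (1.0, 1.0)]
samples_p = [(5.0, 0.0), (6.0, 1.0)]
print(can_prune(samples_o, samples_p, eps=2.0))  # boxes are 4 apart along x
print(can_prune(samples_o, samples_p, eps=4.5))
```

When `can_prune` returns True, the reachability probability of p from o is 0 and the sample sequences of p never need to be fetched.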
Computational Aspect II: Computing Core Object Probability (if time allows)
Two issues
1st issue: There are many core object probability computations.
2nd issue: In each core object probability computation, we have to consider exponentially many (in |DB|) subsets A of DB.
1st Issue: Many Core Object Probability Computations
For each uncertain object o:
  Check the probability that o is a core object (core object probability ≥ 0.5).
  For each other object p:
    Check the reachability of p from o.
    If the reachability probability ≥ 0.5, p and o form a cluster.
The 1st condition of the reachability probability is itself a core object probability, and it has to be computed for all p in D.
2nd Issue: Exponentially many subsets to consider for each core-object value
Furthermore, the computation of each core-object value has to consider exponentially many (in |DB|) subsets A of DB, namely all subsets A of D with cardinality greater than or equal to µ.
Sampling (Monte Carlo sampling): each uncertain object o is represented by a sequence of s sample points, i.e. <o1,o2,…,os>.
Compute the core-object probability based on the sample sequences.
How can it be done?
Computing based on the sample sequences
s is the sample rate; each object o is represented by <o1,o2,…,os>.
Determine the core-object probability based on s² meaningful samples.
oj is called the j-th instance of o.
Dj is the collection of the j-th instances of all objects in D.
E.g. s = 5:
a1, a2, a3, a4, a5
b1, b2, b3, b4, b5
c1, c2, c3, c4, c5
d1, d2, d3, d4, d5
D1 = {a1,b1,c1,d1,e1}
D2 = {a2,b2,c2,d2,e2}
…
Computing based on the sample sequences
To compute the core object probability of o, create an s×s sample matrix M(o).
M(o) keeps track of the information needed to deduce the core object probability of o; with some modification, it can also be used to deduce the reachability probability.
Each cell mi,j of M(o) contains the number of ε-neighbors of the object sample oi in the database instance Dj.
Dj consists of the j-th samples of all other objects (excluding oj).
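A direct, unpruned construction of M(o) can be sketched as follows. It assumes every object has exactly s sample points given as coordinate tuples, Euclidean distance, and that each sample oi counts itself as one ε-neighbor; `build_sample_matrix` is a hypothetical name, and no MBR pruning is applied here.

```python
from math import dist


def build_sample_matrix(samples_o, other_samples, eps):
    """samples_o: the s samples of o.
    other_samples: one list of s samples per other object in the database.
    Cell [i][j] counts the eps-neighbors of o_i in database instance D_j."""
    s = len(samples_o)
    matrix = [[1] * s for _ in range(s)]  # each o_i counts itself
    for i, o_i in enumerate(samples_o):
        for j in range(s):
            for samples_p in other_samples:
                if dist(o_i, samples_p[j]) <= eps:  # only the j-th sample belongs to D_j
                    matrix[i][j] += 1
    return matrix


samples_o = [(0.0, 0.0), (10.0, 0.0)]
samples_a = [(1.0, 0.0), (9.0, 0.0)]
samples_b = [(0.0, 1.0), (0.0, 2.0)]
print(build_sample_matrix(samples_o, [samples_a, samples_b], eps=1.5))
```

With these samples, o1 has a1 and b1 as ε-neighbors in D1 (count 3), while o2 has only a2 as an ε-neighbor in D2 (count 2).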
Create sample matrix M(o) (Example: sample rate = 3, µ = 5)
[Figure: query object o with samples o1, o2 and o3, and the objects a, b, c and d with their samples (a1, a2, a3, b1, b2, b3, …); all object samples are bounded by MBRs]
M(o) has one row per instance of o (1 to 3) and one column per database instance (1 to 3).
Filling m1,1: how many ε-neighbors does o1 have in D1?
Since o1 itself is also counted, the cell is initialized to 1.
By the min-max distance of their MBRs, three objects are certain to contain ε-neighbors of o1 in D1, which raises the count to 4.
MBR(a) and MBR(b) cannot be pruned, so their sample sequences are retrieved. a1 and b1 are ε-neighbors of o1, which raises the count to 6.
Although b2 is an ε-neighbor of o1, it is not counted because it is NOT in database instance 1.
6 is the final value: there are 6 ε-neighbors of object sample o1 in database instance D1.
The remaining cells are filled in the same way, row by row:

                  database instance
                  1    2    3
instance o1       6    5    5
instance o2       6    4    5
instance o3       4    4    5
Now we have the sample matrix M(o).
Computing the core object probability from the sample matrix M(o) (µ = 5)
Recall the algorithm: for each uncertain object o, check whether its core object probability is ≥ 0.5; if so, for each other object p, check the reachability of p from o, and if the reachability probability is ≥ 0.5, p and o form a cluster.
Core object probability:
1st step: Count the number of elements in the sample matrix M(o) which contain values higher than or equal to µ.
2nd step: Normalizing this count by s² yields the core object probability of o.
Computing the core object probability from the sample matrix M(o) (µ = 5)

                  database instance
                  1    2    3
instance o1       6    5    5
instance o2       6    4    5
instance o3       4    4    5

1st step: Count = 6
2nd step: Core-object probability of o = 6 / 9
Since the core object probability is > 0.5, o is treated as a core object.
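The two steps amount to counting cells and dividing by s². A minimal sketch, using the matrix from the example; the function name is hypothetical.

```python
def core_probability_from_matrix(matrix, mu):
    """Fraction of cells in the s x s sample matrix with a value >= mu."""
    s = len(matrix)
    count = sum(1 for row in matrix for cell in row if cell >= mu)  # 1st step
    return count / (s * s)                                          # 2nd step


M = [
    [6, 5, 5],
    [6, 4, 5],
    [4, 4, 5],
]
p_core = core_probability_from_matrix(M, mu=5)
print(p_core, p_core >= 0.5)  # 6 of the 9 cells reach mu = 5
```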
Computing the reachability probability from the sample matrix M(o) (µ = 5)
The reachability probability is the product of two parts.
The first part (the core object probability of o in D\p with the density constraint relaxed to µ-1) can be derived from M(o):
1st step: Decrease by 1 the values mi,j for which d(oi,pj) ≤ ε holds.
2nd step: Count the number of elements in the sample matrix M(o) which contain values higher than or equal to µ-1.
3rd step: Normalizing this number by s² yields the probability.
The second part (the probability that d(p,o) ≤ ε) can be computed with some pruning using the MBRs of the object samples.
Computing the first part (reachability of a from o)
Conceptually, M(o) contains the ε-neighbor information in D; we want it to contain the information in D\a.
[Figure: query object o with samples o1, o2 and o3, and the objects a, b, c and d with their samples]
1st step: Decrease by 1 the values mi,j for which d(oi,aj) ≤ ε holds. Here m1,1 and m1,3, m2,1 and m2,3, and m3,3 are decreased by 1:

                  before            after
                  1    2    3       1    2    3
instance o1       6    5    5       5    5    4
instance o2       6    4    5       5    4    4
instance o3       4    4    5       4    4    4

2nd step: Count the number of elements in the modified matrix which contain values higher than or equal to µ-1 = 4: all 9 cells qualify.
3rd step: Since all the cells are greater than or equal to 5-1 = 4, the first part probability is equal to 9/9 = 1.
Computing the second part (reachability of a from o)
Count the number of events d(oi,aj) ≤ ε and normalize the count by s×s. The MBRs of the object samples can be used for pruning.
1st step: Count the number of events d(oi,aj) ≤ ε: Count = 2 + 2 + 1 = 5.
2nd step: Normalize the count by s²: the second part is 5/9.
Reachability of a from o: the reachability probability is 1 × 5/9 = 5/9.
Since 5/9 ≥ 0.5, a is directly density-reachable from o, so a and o form a cluster.
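The whole computation for the example can be reproduced in a few lines. The sketch assumes the sample matrix and a boolean table recording for which (i, j) the event d(oi,aj) ≤ ε holds are both given; the function name and the `near` layout are hypothetical, and `Fraction` is used only to keep the ninths exact.

```python
from fractions import Fraction


def reachability_from_matrix(matrix, near, mu):
    """near[i][j] is True when d(o_i, p_j) <= eps.
    Returns (first part, second part, reachability probability)."""
    s = len(matrix)
    # Step 1: remove p's contribution so the counts refer to D \ p.
    reduced = [[matrix[i][j] - (1 if near[i][j] else 0) for j in range(s)] for i in range(s)]
    # Steps 2 and 3: fraction of cells that still reach mu - 1.
    first = Fraction(sum(1 for row in reduced for cell in row if cell >= mu - 1), s * s)
    # Second part: fraction of sample pairs (o_i, p_j) within eps.
    second = Fraction(sum(1 for row in near for flag in row if flag), s * s)
    return first, second, first * second


M = [[6, 5, 5], [6, 4, 5], [4, 4, 5]]
near_a = [
    [True, False, True],   # m1,1 and m1,3 are decreased
    [True, False, True],   # m2,1 and m2,3 are decreased
    [False, False, True],  # m3,3 is decreased
]
first, second, reach = reachability_from_matrix(M, near_a, mu=5)
print(first, second, reach)
```

The three values match the slides: the first part is 1, the second part is 5/9, and the reachability probability of a from o is 5/9.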
Reachability of other objects from o
[Figure: query object o with samples o1, o2 and o3, the objects a, b, c and d with their samples, and the sample matrix M(o) with rows (6 5 5), (6 4 5) and (4 4 5)]
Experimental Evaluation
Datasets
Artificial data set (ART): 1000 2-dimensional objects which are normally distributed in [0,1]. Each object is randomly surrounded by a box having a side length of p < 1 in each dimension (data fuzziness). A uniform probability distribution is assumed within the box.
Engineering data set (PLANE): 5000 42-dimensional objects, normalized.
Experimental Evaluation
Implementations compared:
FDBSCAN
EXPDBSCAN: represents the distance between two uncertain objects by a single distance expectation value Ed(o,o’), and uses the traditional DBSCAN algorithm to mine the data.
Environment: Java 1.4, Windows platform, 730 MHz processor, 512 MB main memory, sample rate s = 5.
Experiment 1: Efficiency of FDBSCAN
Measure the runtimes of FDBSCAN and EXPDBSCAN on the ART dataset with p = 0.01 (little fuzziness in the dataset).
[Figure: runtime (s) of FDBSCAN and EXPDBSCAN]
Open question: does EXPDBSCAN apply the same MBR pruning strategies as FDBSCAN?
Experiment 2: Effectiveness of FDBSCAN
Measures the relation between the quality of the cluster results and the data fuzziness, for FDBSCAN and EXPDBSCAN.
How to measure the quality of clusters? Treat it as a black box for the time being: a good clustering has a quality value close to 1, and vice versa.
FDBSCAN returns clusters with better quality than EXPDBSCAN for all levels of data fuzziness and numbers of dimensions, i.e. it is more effective.
On ART, EXPDBSCAN performs quite well, but for high-dimensional data its quality is much worse than that of the FDBSCAN approach.
The quality of both EXPDBSCAN and FDBSCAN falls with high data fuzziness; however, FDBSCAN degrades less than EXPDBSCAN.
Experiment 3: Accuracy of the core object classification
How accurately do FDBSCAN and EXPDBSCAN classify core objects?
Precision and recall of core objects:
Precision shows how precise the reported set of core objects is: # reported real core objects / # core objects reported.
Recall shows the percentage of real core objects reported: # reported real core objects / total # real core objects in D.
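The two measures can be computed from the set of reported core objects and the set of real core objects. A minimal sketch with made-up object ids; the function name and the convention of returning 0.0 for an empty set are assumptions.

```python
def precision_recall(reported, real):
    """reported: ids classified as core objects; real: ids of the true core objects."""
    reported, real = set(reported), set(real)
    hits = len(reported & real)  # reported real core objects
    precision = hits / len(reported) if reported else 0.0
    recall = hits / len(real) if real else 0.0
    return precision, recall


reported = {1, 2, 3, 4}
real = {2, 3, 4, 5, 6, 7}
print(precision_recall(reported, real))  # 3 hits: precision 3/4, recall 3/6
```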
Experiment 3: Accuracy of the core object classification
FDBSCAN has a higher precision and recall of core objects on the 2D ART dataset.
EXPDBSCAN finds very few of the real core objects; however, nearly all of the core objects it returns are real core objects.
The precision and recall of FDBSCAN are not 100% because FDBSCAN uses a sampling approach for calculating the core object probability.
The precision and recall of FDBSCAN increase in high dimensions. Why?
EXPDBSCAN has a lower recall than FDBSCAN. Why?
Why does EXPDBSCAN suffer from a low recall rate? (Example µ = 5)
[Figure: core point candidates A and B among the objects 1 to 10, each with a Gaussian probability density function, and the ε-ranges of A and B]
Number of ε-neighbors of A = 5: A is a core object.
Number of ε-neighbors of B = 4: B is NOT a core object.
Conclusion
Demonstrated how density-based clustering can be carried out based on uncertain information.
Presented the theoretical foundations for density-based clustering of uncertain data.
FDBSCAN works on the fuzzy distance function directly instead of working on lossy aggregated information.
My comments
We also want to know:
The relationship between the sample rate and the execution time. A higher sample rate should give a more accurate result, but generally at the cost of execution time. What is the relationship between these two parameters?
Sample rate vs cluster quality.
Sample rate vs data dimensionality, which would be a reference for determining the sample rate based on the data characteristics.
Sample rate vs fuzziness of data. Since each uncertain object is represented by an MBR, and MBR(o) bounds the samples of o, MBR(o) may not bound the whole uncertainty region of o. With high data fuzziness, MBR(o) may not precisely indicate the uncertainty region of the real object o.
Something confusing…
We also want to know: the reason for using a probability of 0.5 to determine core objects is questionable. Why not treat this as a parameter?
A higher value should produce more false negative core objects; a lower value should produce more false positive core objects.
The End
Thank you