MOSAIC: A Proximity Graph Approach for Agglomerative Clustering
Jiyeon Choo, Rachsuda Jiamthapthaksin, Chun-shen Chen, Ulvi Celepcikay, Christian Guisti, and Christoph F. Eick
Department of Computer Science, University of Houston
Organization
1. Motivation
   - Scope of the research: Region Discovery vs. Traditional Clustering
   - Clustering with Plug-In Fitness Functions
   - Shape-aware Clustering Algorithms
   - Ideas of MOSAIC
2. Background
3. The MOSAIC Algorithm
4. Experimental Evaluation
5. Related Work
6. Conclusion and Future Work
Ch. Eick et al.: MOSAIC…, DaWaK, Regensburg 2007
1.1 Motivation: Examples of Region Discovery
RD-Algorithm
Application 1: Hot-spot Discovery [EVDW06]
Application 2: Find Interesting Regions with respect to a Continuous Variable
Application 3: Find "representative" regions (Sampling)
Application 4: Regional Co-location Mining
Application 5: Regional Association Rule Mining [DEWY06]
Application 6: Regional Association Rule Scoping [EDYKN07]
Wells in Texas (green: safe well with respect to arsenic; red: unsafe well), clusterings shown for β = 1.01 and β = 1.04
Region Discovery Framework
The algorithms we currently investigate solve the following problem:
Given:
- A dataset O with a schema R
- A distance function d defined on instances of R
- A fitness function q(X) that evaluates a clustering X = {c1, …, ck} as follows:

  q(X) = Σc∈X reward(c) = Σc∈X interestingness(c) · size(c)^β, with β > 1

Objective: Find c1, …, ck ⊆ O such that:
1. ci ∩ cj = ∅ if i ≠ j
2. X = {c1, …, ck} maximizes q(X)
3. All clusters ci ∈ X are contiguous
4. c1 ∪ … ∪ ck ⊆ O
5. c1, …, ck are usually ranked based on the reward each cluster receives, and low-reward clusters are frequently not reported
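The fitness function above translates directly into code; a minimal sketch in Python, where the `unsafe_share` interestingness measure is a hypothetical stand-in for a real plug-in:

```python
# Minimal sketch of the region-discovery fitness from this slide:
# q(X) = sum over c in X of interestingness(c) * size(c)**beta, beta > 1.
# The interestingness measure is the plug-in part; `unsafe_share` below
# is a hypothetical example (fraction of unsafe wells in a region).

def q(X, interestingness, beta=1.01):
    """Fitness of a clustering X, given as a list of clusters (lists of objects)."""
    assert beta > 1  # beta > 1 makes merging equally interesting clusters pay off
    return sum(interestingness(c) * len(c) ** beta for c in X)

def unsafe_share(cluster):
    """Hypothetical interestingness: fraction of 'unsafe' wells in the region."""
    return sum(1 for o in cluster if o == "unsafe") / len(cluster)

region_a = ["unsafe", "unsafe", "safe"]   # mostly unsafe wells: rewarded
region_b = ["safe", "safe"]               # all safe: zero reward
print(q([region_a, region_b], unsafe_share, beta=1.04))
```

Because β > 1, the reward is superadditive in cluster size: merging two equally interesting regions increases q(X), which is what makes agglomeration worthwhile.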
1.2 Clustering with Plug-In Fitness Functions
Taxonomy of clustering algorithms by fitness function:
- No fitness function: DBSCAN, Hierarchical Clustering
- Fixed / implicit fitness function: K-Means, PAM, CHAMELEON
- Provides plug-in fitness function: MOSAIC
1.3 Shape-aware Clustering
• Shape is a significant characteristic in traditional clustering and region discovery
• Examples
Fig. 1: chain-like patterns in the Volcano dataset
Fig. 2: arbitrarily shaped regions of high (low) arsenic concentration among Texas wells
1.4 Ideas Underlying MOSAIC
• MOSAIC provides a generic framework that integrates representative-based clustering, agglomerative clustering, and proximity graphs, and which approximates arbitrary shape clusters using unions of small convex polygons
Fig. 6: An illustration of MOSAIC’s approach
(a) input (b) output
Talk Organization
1. Motivation
2. Background
   - Representative-based Clustering
   - Agglomerative Clustering
   - Proximity Graphs
3. The MOSAIC Algorithm
4. Experimental Evaluation
5. Related Work
6. Conclusion and Future Work
2.1 Representative-based Clustering
[Figure: four convex clusters, numbered 1–4, plotted over Attribute1 vs. Attribute2]
Objective: Find a set of objects OR such that the clustering X
obtained by using the objects in OR as representatives minimizes q(X).
Properties: cluster shapes are convex polygons
Popular algorithms: K-means, K-medoids, SCEC
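As a concrete instance of a representative-based algorithm, here is a minimal K-means sketch (random initialization and a fixed iteration count are simplifying assumptions); each cluster is the set of points nearest one representative, hence a convex Voronoi cell:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain K-means: the representatives are the cluster centroids.
    Returns (representatives, clusters), where cluster j holds the
    points closest to representative j (a convex Voronoi cell)."""
    rng = random.Random(seed)
    reps = rng.sample(points, k)           # naive initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                   # assignment step
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, reps[j])))
            clusters[j].append(p)
        reps = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else reps[j]
                for j, c in enumerate(clusters)]   # update step
    return reps, clusters
```

MOSAIC runs such an algorithm with a deliberately large k in its first step, so that the small convex cells can later be glued into arbitrarily shaped clusters.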
2.2 MOSAIC and Agglomerative Clustering
Advantages of MOSAIC over traditional agglomerative clustering:
- Wider search: considers all neighboring clusters
- Plug-in fitness function
- Clusters are always contiguous
- The expensive algorithm is only run for 20–1000 iterations
- Highly generic algorithm
2.3 Proximity Graphs
• How to identify neighboring clusters for representative-based clustering algorithms?
• Proximity graphs provide various definitions of “neighbour”
NNG ⊆ MST ⊆ RNG ⊆ GG ⊆ DT
NNG = Nearest Neighbour Graph
MST = Minimum Spanning Tree
RNG = Relative Neighbourhood Graph
GG = Gabriel Graph
DT = Delaunay Triangulation (neighbours of a 1NN-classifier)
Proximity Graphs: Delaunay
• The Delaunay Triangulation is the dual of the Voronoi diagram
• Three points are each other's neighbours if their circumscribing sphere contains no other points
• Complete: captures all neighbouring clusters
• Expensive to compute in high dimensions
Proximity Graphs: Gabriel
• The Gabriel graph is a subgraph of the Delaunay Triangulation (some decision boundaries might be missed)
• Points are neighbours only if their (diametral) sphere of influence is empty
• Can be computed more efficiently: O(k³)
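The diametral-sphere test translates directly into the O(k³) brute-force construction mentioned above; a minimal sketch (pure Python, Euclidean distance assumed):

```python
from itertools import combinations

def gabriel_edges(points):
    """Brute-force Gabriel graph over k points in O(k^3): edge (i, j)
    exists iff no third point lies inside the sphere whose diameter is
    the segment p_i p_j (p_m is inside iff d2(i,m) + d2(j,m) < d2(i,j))."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    k = len(points)
    edges = []
    for i, j in combinations(range(k), 2):
        dij = d2(points[i], points[j])
        if all(d2(points[i], points[m]) + d2(points[j], points[m]) >= dij
               for m in range(k) if m not in (i, j)):
            edges.append((i, j))
    return edges

# Point 1 sits inside the diametral sphere of points 0 and 2,
# so (0, 2) is not a Gabriel edge.
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
print(gabriel_edges(pts))  # [(0, 1), (1, 2)]
```

In MOSAIC the `points` are the cluster representatives, and the resulting edges become the initial merge-candidate relation.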
3. MOSAIC
Fig. 10: Gabriel graph for clusters generated by a representative-based clustering algorithm
MOSAIC Pseudo Code
1. Run a representative-based clustering algorithm to create a large number of clusters.
2. Read the representatives of the obtained clusters.
3. Create a merge-candidate relation using proximity graphs.
4. WHILE there are merge candidates (Ci, Cj) left
   BEGIN
     Merge the pair of merge candidates (Ci, Cj) that enhances the fitness function q the most into a new cluster C'
     Update merge candidates:
       ∀C: Merge-Candidate(C', C) ⇔ Merge-Candidate(Ci, C) ∨ Merge-Candidate(Cj, C)
   END
5. RETURN the best clustering X found.
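A minimal Python sketch of the merge loop (step 4); the clustering representation (dict of id → objects) and the toy fitness used in the example are assumptions, not the paper's implementation:

```python
def mosaic_merge(clusters, candidates, fitness):
    """Greedy agglomerative merging over a merge-candidate relation.

    clusters:   dict cluster_id -> list of objects
    candidates: set of frozenset({i, j}) pairs (e.g. Gabriel-graph neighbours)
    fitness:    plug-in q; callable(clusters dict) -> float, higher is better
    Returns the best clustering encountered over all iterations.
    """
    clusters = {k: list(v) for k, v in clusters.items()}
    candidates = set(candidates)
    best, best_q = dict(clusters), fitness(clusters)
    next_id = max(clusters) + 1
    while candidates:
        def merged_q(pair):                    # fitness after a hypothetical merge
            i, j = tuple(pair)
            trial = {k: v for k, v in clusters.items() if k not in pair}
            trial[next_id] = clusters[i] + clusters[j]
            return fitness(trial)
        pair = max(candidates, key=merged_q)   # best merge this iteration
        i, j = tuple(pair)
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        # Merge-Candidate(C', C) <=> Merge-Candidate(Ci, C) or Merge-Candidate(Cj, C)
        candidates = {frozenset({next_id}) | (c - pair) if c & pair else c
                      for c in candidates if c != pair and c - pair}
        q_now = fitness(clusters)
        if q_now > best_q:                     # remember the best clustering seen
            best_q, best = q_now, {k: list(v) for k, v in clusters.items()}
        next_id += 1
    return best

# Toy run: a fitness that simply prefers fewer clusters merges everything.
best = mosaic_merge({0: [1], 1: [2], 2: [3]},
                    {frozenset({0, 1}), frozenset({1, 2})},
                    lambda X: -len(X))
```

Note that the loop runs until no candidates remain and only afterwards returns the best clustering found, mirroring step 5 of the pseudocode.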
Complexity of MOSAIC
Let
- n be the number of objects in the dataset
- k be the number of clusters returned by the representative-based algorithm

Complexity of MOSAIC: O(k³ + k² · O(q(X)))
Remarks:
- The above formula assumes that fitness is computed from scratch whenever a new clustering is obtained
- Lower complexities can be obtained by incrementally reusing results of previous fitness computations
- Our current implementation assumes that only additive fitness functions are used
4. Experimental Evaluation for Traditional Clustering
• Compared MOSAIC with DBSCAN and K-means
• Used silhouette as q(X) when running MOSAIC; silhouette considers cohesion and separation (measured as the distance to the nearest cluster)
• Used the 9-Diamonds, Volcano, Diabetes, Ionosphere, and Vehicle datasets in the experimental evaluation
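The silhouette fitness mentioned above can be computed with a short routine; a minimal sketch (Euclidean distance, list-of-clusters input, and the singleton convention s = 0 are assumptions):

```python
from math import dist  # Euclidean distance (Python 3.8+)

def silhouette(clusters):
    """Mean silhouette coefficient of a clustering, usable as q(X).
    For each point: a = mean distance within its own cluster (cohesion),
    b = mean distance to the nearest other cluster (separation),
    s = (b - a) / max(a, b); the fitness is the mean s over all points."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for p in cluster:
            others = [r for r in cluster if r is not p]
            if not others:                 # singleton cluster: s = 0 by convention
                scores.append(0.0)
                continue
            a = sum(dist(p, r) for r in others) / len(others)
            b = min(sum(dist(p, r) for r in c) / len(c)
                    for cj, c in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated clusters score close to 1.
X = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
print(silhouette(X))
```

Computing the full silhouette from scratch on every merge is what makes the O(q(X)) factor in the complexity formula expensive in practice.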
Experimental Results
• Finding a good parameter setting for DBSCAN turned out to be problematic for the 9-Diamonds and Volcano spatial datasets.
• Neither DBSCAN nor MOSAIC was able to identify all chain-like patterns in the Volcano dataset.
• We compared MOSAIC and K-means on the high-dimensional Ionosphere, Diabetes, and Vehicle datasets, measuring cluster quality with silhouette. MOSAIC outperformed K-means on these datasets.
Volcano Dataset Result: MOSAIC
Volcano Dataset Result: DBSCAN
Open Issues: What is a Good Fitness Function for Traditional Clustering?
• The use of plug-in fitness functions within traditional clustering algorithms is not very common.
• Using existing cluster evaluation measures, such as cohesion, separation, and silhouette, as fitness functions does not lead to very good clusterings when confronted with arbitrarily shaped clusters [Choo07].
Question: Can we find better cluster evaluation measures or is finding good evaluation measures for traditional clustering a hopeless project?
5. Related Work
• CURE integrates a partitioning algorithm with an agglomerative hierarchical algorithm [GRS98].
• CHAMELEON [KHK99] provides a sophisticated two-phase clustering algorithm: a multilevel graph partitioning algorithm followed by agglomerative clustering on a sparse k-NN graph.
Related Work Continued
• Lin and Zhong [LC02 and ZG03] propose hybrid clustering algorithms that combine representative-based clustering and agglomerative clustering methods.
• Surdeanu et al. [STA05] propose a hybrid clustering approach that combines an agglomerative clustering algorithm with the Expectation-Maximization (EM) algorithm.
6. Conclusion
• A new clustering algorithm was introduced that approximates arbitrary shape clusters through unions of convex polygons
• The algorithm performs a wider search by considering “all” neighboring clusters as merge candidates. Gabriel graphs are used to determine neighboring clusters
• The algorithm is generic in that it can be used with any initial merge-candidate relation, any fitness function, and any representative-based clustering algorithm
• MOSAIC can also be seen as a generalization of agglomerative grid-based clustering algorithms.
• We mainly use MOSAIC in the region discovery project mentioned earlier.
Future Work: Learn fitness function based on feedback
Idea: employ machine learning techniques to learn a fitness function from the feedback of a domain expert.
- Pros:
  - Provides a more adaptive approach: the fitness function can be tailored to the domain expert's changing requirements.
  - The process of finding an appropriate fitness function is automatic.
- Cons:
  - Feature selection is non-trivial.
  - Learning the function is a difficult machine learning task.