CACTUS-Clustering Categorical Data Using Summaries
![Page 1: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/1.jpg)
CACTUS-Clustering Categorical Data Using Summaries
Advisor: Dr. Hsu
Graduate: Min-Hung Lin
IDSL seminar 2001/10/30
![Page 2: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/2.jpg)
Outline
Motivation
Objective
Related Work
Definitions
CACTUS
Performance Evaluation
Conclusions
Comments
![Page 3: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/3.jpg)
Motivation
Clustering with categorical attributes has received attention.
Previous algorithms do not give a formal description of the clusters.
Some of them need to post-process the output of the algorithm to identify the final clusters.
![Page 4: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/4.jpg)
Objective
Introduce a novel formalization of a cluster for categorical attributes.
Describe a fast summarization-based algorithm, CACTUS, that discovers clusters.
Evaluate the performance of CACTUS on synthetic and real datasets.
![Page 5: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/5.jpg)
Related Work
EM algorithm [Dempster et al., 1977]
  Iterative clustering technique
STIRR algorithm [Gibson et al., 1998]
  Iterative algorithm based on non-linear dynamical systems
ROCK algorithm [Guha et al., 1999]
  Hierarchical clustering algorithm
![Page 6: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/6.jpg)
DEF:Support
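The formula on this slide did not survive extraction. The following is a minimal sketch, assuming the usual reading of support for a pair of attribute values: the number of tuples in the dataset D in which both values occur together. The function name `pair_support` and the toy dataset are hypothetical and only for illustration.

```python
from collections import Counter

def pair_support(tuples, i, j):
    """For attributes i and j, count how many tuples contain each value pair."""
    return Counter((t[i], t[j]) for t in tuples)

# Toy dataset (illustrative only): each tuple is (A0, A1).
D = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "y")]
support = pair_support(D, 0, 1)
print(support[("a", "x")])  # 2
```

A `Counter` returns 0 for pairs that never co-occur, which matches the intuition that such pairs have no support.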
![Page 7: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/7.jpg)
DEF:Strongly Connected
![Page 8: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/8.jpg)
DEF:Strongly Connected(cont’d)
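The definition itself was an image and is missing here. As a hedged sketch, assume that two attribute values are strongly connected when their observed support exceeds a constant factor (called `alpha` below) times the support expected if the attributes were independent and their values uniformly distributed, that is |D| / (|D_i| × |D_j|). The function name, the parameter names, and the toy data are hypothetical.

```python
def strongly_connected(tuples, i, j, a_i, a_j, size_i, size_j, alpha):
    """Return True if (a_i, a_j) co-occur more often than alpha times the expected support.

    size_i and size_j are the domain sizes of attributes i and j (assumed known).
    """
    observed = sum(1 for t in tuples if t[i] == a_i and t[j] == a_j)
    expected = len(tuples) / (size_i * size_j)  # assumption: independence, uniform values
    return observed > alpha * expected

# Toy dataset (illustrative only): 10 tuples over two attributes with 2 values each.
D = [("a", "x")] * 6 + [("a", "y"), ("b", "x"), ("b", "y"), ("b", "y")]
strong_ax = strongly_connected(D, 0, 1, "a", "x", 2, 2, alpha=2.0)
strong_by = strongly_connected(D, 0, 1, "b", "y", 2, 2, alpha=2.0)
```

With these numbers the expected support is 2.5, so the threshold is 5: the pair ("a", "x") with 6 occurrences passes, while ("b", "y") with 2 does not.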
![Page 9: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/9.jpg)
Formal Definition of a Cluster
![Page 10: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/10.jpg)
Formal Definition of a Cluster (cont’d)
C_i is the cluster-projection of C on A_i.
C is called a sub-cluster if it satisfies conditions (1) and (3).
A cluster C over a subset S of all attributes is called a subspace cluster on S; if |S| = k then C is called a k-cluster.
![Page 11: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/11.jpg)
DEF:Similarity
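The formula for this definition is also missing from the transcript. One plausible reading, stated here as an assumption, is that the similarity of two values a1 and a2 of one attribute with respect to another attribute is the number of values of that other attribute that are strongly connected to both. The sketch below takes the set of strongly connected pairs as given; the names `similarity` and `strong_pairs` are hypothetical.

```python
def similarity(strong_pairs, a1, a2):
    """Count values x of the other attribute strongly connected to both a1 and a2.

    strong_pairs is a set of (value_of_A_i, value_of_A_j) pairs assumed to be
    strongly connected already.
    """
    partners_1 = {x for a, x in strong_pairs if a == a1}
    partners_2 = {x for a, x in strong_pairs if a == a2}
    return len(partners_1 & partners_2)

# Toy input (illustrative only).
strong_pairs = {("a", "x"), ("a", "y"), ("b", "y"), ("c", "z")}
sim_ab = similarity(strong_pairs, "a", "b")  # share only "y"
sim_ac = similarity(strong_pairs, "a", "c")  # share nothing
```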
![Page 12: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/12.jpg)
Inter-attribute Summaries
![Page 13: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/13.jpg)
Intra-attribute Summaries
![Page 14: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/14.jpg)
Experiments
![Page 15: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/15.jpg)
Result
STIRR fails to discover
  clusters consisting of overlapping cluster-projections on any attribute
  clusters where two or more clusters share the same cluster-projection
CACTUS correctly discovers all clusters
![Page 16: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/16.jpg)
CACTUS
Three-phase clustering algorithm
Summarization Phase
  Compute the summary information
Clustering Phase
  Discover a set of candidate clusters
Validation Phase
  Determine the actual set of clusters
![Page 17: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/17.jpg)
Summarization Phase
Inter-attribute Summaries
Intra-attribute Summaries
![Page 18: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/18.jpg)
Clustering Phase
Computing cluster-projections on attributes
Level-wise synthesis of clusters
![Page 19: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/19.jpg)
Computing Cluster-Projections on Attributes
Step 1: pairwise cluster-projection
Step 2: intersection
![Page 20: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/20.jpg)
Computing Cluster-Projections on Attributes (cont’d)
Cluster-projection
![Page 21: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/21.jpg)
Level-wise synthesis of clusters
![Page 22: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/22.jpg)
Level-wise synthesis of clusters (cont’d)
Generation procedure
![Page 23: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/23.jpg)
Level-wise synthesis of clusters (cont’d)
Candidate cluster
![Page 24: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/24.jpg)
Validation
Some of the candidate clusters may not have enough support, because some of the 2-clusters combined to form them may be due to different sets of tuples.
Check whether the support of each candidate cluster is greater than the threshold: α times the expected support of the cluster.
Only clusters whose support on D passes the threshold are retained.
![Page 25: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/25.jpg)
Validation Procedure
Set the supports of all candidate clusters to zero.
For each tuple t in D, increment the support of the candidate cluster to which t belongs.
At the end of the scan, delete all candidate clusters whose support is less than the threshold.
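The procedure above can be sketched as a single scan over the dataset. This is a minimal illustration under stated assumptions, not the authors' implementation: a candidate cluster is represented as a tuple of value sets (one per attribute), a tuple belongs to a cluster when each of its values lies in the corresponding set, and the expected support is taken as |D| times the product of |C_i| / |D_i| (independent attributes). The names `validate`, `domain_sizes`, and `alpha` are hypothetical, and a tuple is counted for every candidate that contains it.

```python
from math import prod

def validate(tuples, candidates, domain_sizes, alpha):
    """Keep candidate clusters whose support reaches alpha times their expected support."""
    support = [0] * len(candidates)               # step 1: all supports start at zero
    for t in tuples:                              # step 2: one scan of the dataset
        for k, cluster in enumerate(candidates):
            if all(v in values for v, values in zip(t, cluster)):
                support[k] += 1
    n = len(tuples)
    kept = []
    for cluster, s in zip(candidates, support):   # step 3: drop clusters below threshold
        expected = n * prod(len(c) / d for c, d in zip(cluster, domain_sizes))
        if s >= alpha * expected:
            kept.append(cluster)
    return kept

# Toy data (illustrative only): 10 tuples, two attributes with 2 values each.
D = [("a", "x")] * 6 + [("a", "y"), ("b", "x"), ("b", "y"), ("b", "y")]
candidates = [({"a"}, {"x"}), ({"b"}, {"x"})]
kept = validate(D, candidates, domain_sizes=[2, 2], alpha=2.0)
```

Here each candidate has an expected support of 2.5, so the threshold is 5; the first candidate (support 6) is retained and the second (support 1) is deleted.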
![Page 26: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/26.jpg)
Extensions
Large Attribute Value Domains
Clusters in Subspaces
![Page 27: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/27.jpg)
Performance Evaluation
Evaluation of CACTUS on synthetic and real datasets
Compared the performance of CACTUS with the performance of STIRR
![Page 28: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/28.jpg)
Synthetic Datasets
The test datasets were generated using the data generator developed by Gibson et al. (1 million tuples, 10 attributes, 100 attribute values for each attribute).
![Page 29: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/29.jpg)
Real Datasets
Two sets of bibliographic entries
  7766 entries are database-related
  30919 entries are theory-related
Four attributes: the first author, the second author, the conference, and the year.
Attribute domain sizes are {3418, 3529, 1631, 44}, {8043, 8190, 690, 42}, {10212, 10527, 2315, 52}.
![Page 30: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/30.jpg)
Real Datasets (cont’d)
Database-related
Theory-related
Mixture
![Page 31: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/31.jpg)
Results
CACTUS is very fast and scalable (only two scans of the dataset).
CACTUS outperforms STIRR by a factor between 3 and 10.
![Page 32: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/32.jpg)
Conclusions
Formalized the definition of a cluster for categorical attributes.
Introduced a fast summarization-based algorithm, CACTUS, for discovering such clusters in categorical data.
Evaluated the algorithm against both synthetic and real datasets.
![Page 33: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/33.jpg)
Future Work
Relax the cluster definition by allowing sets of attribute values that are “almost” strongly connected to each other.
Inter-attribute summaries can be incrementally maintained => derive an incremental clustering algorithm.
Rank the clusters based on a measure of interestingness.
![Page 34: CACTUS-Clustering Categorical Data Using Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062314/5681376f550346895d9f0930/html5/thumbnails/34.jpg)
Comments
Computing pairwise cluster-projections is an NP-complete problem.
The large number of candidate clusters is still a problem.