A Clustering Framework for Unbalanced Partitioning and Outlier Filtering on High Dimensional Datasets
Turgay Tugay Bilgin (1) and A. Yilmaz Camurcu (2)
(1) Department of Computer Engineering, Maltepe University
(2) Department of Computer and Control Education, Marmara University
Outline
Introduction
Relationship based clustering approach / framework
Visualization using CLUSION (CLUSter visualizatION)
Problems of the Framework
Graclus partitioning system
Our Proposed Framework
  Using Graclus: to create Micro-partition Space
  Outlier filtering on micro-partition space
  Using Graclus: to cluster ΔP Space
  Visualization of the results using CLUSION graphs
Experiments
Results
Introduction
Mining high dimensional datasets is an important problem for the data mining community.
Well-known problem: the curse of dimensionality.
Graph based methods such as METIS and CHACO perform best on high dimensional space.
However, these methods have 2 major problems:
They cannot perform outlier filtering.
They force clusters to be balanced.
Relationship based Clustering Approach
Strehl A. and Ghosh J. proposed a better approach for mining high dimensional datasets [1].
They focus on similarity space rather than feature space.
The graph partitioning tool METIS is used to perform balanced clustering (OPOSSUM).
They also provide a customized matrix visualization tool called CLUSION.
CLUSION is fast and simple, and it can operate on very high dimensional datasets.
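The move from feature space to similarity space can be sketched as follows. This is a minimal illustration that assumes cosine similarity; the slides do not say which similarity measure is used, and the function name `cosine_similarity_matrix` and the toy data are hypothetical.

```python
import math


def cosine_similarity_matrix(features):
    """Return the n x n matrix of pairwise cosine similarities.

    `features` is a list of n equal-length numeric vectors (one per sample).
    Cosine similarity is an assumption made for this sketch.
    """
    norms = [math.sqrt(sum(v * v for v in row)) for row in features]
    n = len(features)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            if norms[i] == 0.0 or norms[j] == 0.0:
                sim = 0.0  # an all-zero vector is treated as similar to nothing
            else:
                dot = sum(a * b for a, b in zip(features[i], features[j]))
                sim = dot / (norms[i] * norms[j])
            S[i][j] = S[j][i] = sim  # the matrix is symmetric
    return S


# Toy term-count vectors: the first two point in the same direction.
docs = [[2, 0, 1], [4, 0, 2], [0, 3, 0]]
S = cosine_similarity_matrix(docs)
```

Whatever the dimensionality of the feature vectors, the result is always an n x n matrix, which is what the graph partitioner then operates on.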
Relationship based Clustering Framework
[Figure: Data Sources → (Feature Extraction) → Feature Space → (Similarity computation) → Similarity Space → (OPOSSUM: Optimal partitioning of Similarity space using Metis) → Cluster Labels]
Visualization using CLUSION
Clusters appear as symmetrical dark squares along the main diagonal.
S is permuted with an n×n permutation matrix P.
[Figure: Similarity Matrix and λ index → CLUSION → Cluster Visualization]
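The reordering step behind a CLUSION-style plot can be sketched as below. It sorts the sample indices by cluster label and applies the same order to rows and columns, which is equivalent to multiplying S by a permutation matrix on both sides. The function name `permute_by_labels` and the toy matrix are hypothetical, and rendering the permuted matrix as an image is left out.

```python
def permute_by_labels(S, labels):
    """Reorder a similarity matrix so samples with equal labels are contiguous.

    Returns the index order and the permuted matrix. Python's sort is stable,
    so samples keep their original relative order inside each cluster.
    """
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    S_perm = [[S[i][j] for j in order] for i in order]
    return order, S_perm


# Samples 0 and 2 are similar and share a label; sample 1 is on its own.
S = [
    [1.0, 0.1, 0.9],
    [0.1, 1.0, 0.2],
    [0.9, 0.2, 1.0],
]
labels = [0, 1, 0]
order, S_perm = permute_by_labels(S, labels)
```

After the permutation the high similarities sit in a block on the diagonal, which is what shows up as a dark square in the plot.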
Problems of the Framework
Produces balanced clusters only:
It forces clusters to be of equal size. In some datasets this can be important, because it avoids trivial clusterings, but in most cases it can cause undesired results.
No outlier filtering:
Outliers can reduce the quality and the validity of the clusters, depending on the resolution and distribution of the dataset.
Graclus* partitioning system
Graclus* is a fast kernel based multilevel algorithm which involves coarsening, initial partitioning and refinement phases.
Unlike METIS, it does not force clusters to be of nearly equal size.
It uses a weighted form of the kernel based k-means approach.
The kernel k-means approach is extremely fast and gives high-quality partitions (*).
* Dhillon, I., Guan, Y., Kulis, B.: A Fast Kernel-based Multilevel Algorithm for Graph Clustering. Proceedings of the 11th ACM SIGKDD, Chicago, IL, August 21-24 (2005).
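To make the weighted kernel k-means idea concrete, here is a minimal sketch of the flat (single-level) iteration only. It is not Graclus itself: the multilevel coarsening, initial partitioning and refinement phases are omitted, and the function name, the toy kernel and the starting labels are hypothetical. The distance from point i to the weighted mean of cluster c is computed from kernel entries alone as K[i][i] - 2 * sum_j(w_j K[i][j]) / s + sum_{j,l}(w_j w_l K[j][l]) / s^2, where j and l range over the members of c and s is their total weight.

```python
def weighted_kernel_kmeans(K, weights, labels, k, max_iter=20):
    """Run flat weighted kernel k-means from an initial labelling.

    K is an n x n kernel (similarity) matrix, weights holds one positive
    weight per point, labels holds initial cluster ids in range(k).
    """
    n = len(K)
    labels = list(labels)
    for _ in range(max_iter):
        members = [[j for j in range(n) if labels[j] == c] for c in range(k)]
        new_labels = []
        for i in range(n):
            best_c, best_d = labels[i], float("inf")
            for c, idx in enumerate(members):
                if not idx:
                    continue  # skip empty clusters
                s = sum(weights[j] for j in idx)
                cross = sum(weights[j] * K[i][j] for j in idx) / s
                within = sum(
                    weights[j] * weights[l] * K[j][l] for j in idx for l in idx
                ) / (s * s)
                d = K[i][i] - 2.0 * cross + within
                if d < best_d:
                    best_c, best_d = c, d
            new_labels.append(best_c)
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
    return labels


# Linear kernel on four points of a line; two obvious groups of unequal spread.
x = [0.0, 0.1, 5.0, 5.1]
K = [[a * b for b in x] for a in x]
result = weighted_kernel_kmeans(K, [1.0] * 4, [0, 1, 0, 1], k=2)
```

Nothing in the update constrains cluster sizes, which is the property the slide contrasts with METIS.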
Our Proposed Framework
Three major improvements:
An intermediate space (P): We call it the “micro-partition space”. Graclus is used for creating unbalanced micro-partitions.
Outlier filtering on the P space (resulting in ΔP): Graclus creates micro-partitions of different sizes. Singletons in the P space are points that do not have enough neighbors; they can be filtered or marked as outliers.
Using Graclus for clustering the ΔP space: Graclus has two important roles in our framework. The first role is creating the micro-partition space. The second role is unbalanced clustering of the filtered space ΔP, which is denoted by Φ.
Our Proposed Framework
[Figure: creating micro-partitions (using Graclus) → Micro-partition space (P), which contains unbalanced tiny partitions → outlier filtering and (re)clustering (using Graclus) → ΔP space]
Using Graclus: to create Micro-partition Space
Use Graclus in similarity space to create tiny partitions (micro-partitions).
Notation: n = number of samples, k = number of micro-partitions in P space.
The relation between k and p should be: [1]
Micro-partitions can contain up to 4 objects, therefore: [2]
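One consequence of the size cap can be worked out by plain counting: if no micro-partition holds more than 4 objects, covering n samples needs at least ceil(n / 4) micro-partitions. The sketch below computes that lower bound. It is not claimed to reproduce formula [1] or [2], and the function name `min_micro_partitions` is hypothetical.

```python
def min_micro_partitions(n, max_size=4):
    """Smallest number of partitions that can cover n samples
    when each partition holds at most max_size of them."""
    return -(-n // max_size)  # integer ceiling division


# Sample counts of the two datasets used in the experiments.
bbc_lower_bound = min_micro_partitions(2225)
milliyet_lower_bound = min_micro_partitions(1455)
```

Since Graclus produces partitions of varying size, the number of micro-partitions actually requested would normally sit above this bound.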
Outlier filtering on micro-partition space illustration
Outlier filtering on micro-partition space
The outliers in P space (Po) are:
where To is the outlier threshold value.
Then, the ΔP space is:
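The filtering step can be sketched as follows, under one loudly stated assumption: a micro-partition counts as an outlier when its size is at most the threshold To, so that To = 1 removes exactly the singletons mentioned earlier. The exact comparison used in the talk may differ. The names `filter_outliers` and `submatrix` are hypothetical.

```python
from collections import Counter


def filter_outliers(labels, threshold=1):
    """Split sample indices into kept points and outliers.

    `labels` gives the micro-partition id of every sample. A sample is an
    outlier if its micro-partition has at most `threshold` members
    (assumed rule for this sketch).
    """
    sizes = Counter(labels)
    kept = [i for i, lab in enumerate(labels) if sizes[lab] > threshold]
    outliers = [i for i, lab in enumerate(labels) if sizes[lab] <= threshold]
    return kept, outliers


def submatrix(S, kept):
    """Restrict a similarity matrix to the kept samples (the filtered space)."""
    return [[S[i][j] for j in kept] for i in kept]


# Micro-partition ids for 7 samples: partitions 1 and 3 are singletons.
micro_labels = [0, 0, 1, 2, 2, 2, 3]
kept, outliers = filter_outliers(micro_labels, threshold=1)
```

The kept indices define the filtered similarity matrix that is passed to the second Graclus run.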
Using Graclus: to cluster ΔP Space
Graclus needs the number of partitions k. In formula [1], k refers to the number of micro-partitions. Here k refers to the number of clusters we desire.
We denote the former by k1 and the latter by k2.
Graclus performs clustering on the ΔP space and produces the λ index, which is defined as:
Visualization of the results using CLUSION graphs
CLUSION looks at the λ index, reorders the ΔP space so that points with the same cluster label are contiguous, then visualizes the resulting permuted ΔP′.
Two λ indices are produced during the clustering process:
λ1 is created while forming micro-partitions.
λ2 is created while clustering the ΔP space.
We use λ2 for CLUSION; λ1 is only used for forming micro-partitions.
Experiments: Datasets
We evaluated our proposed framework on two different real-world datasets.
1. 9636 terms from 2225 complete news articles from the BBC News web site. (2225 dimensional dataset, 5 natural clusters)
2. A collection of news articles from the Turkish newspaper “Milliyet”. It contains 6223 terms in Turkish from 1455 news articles. (1455 dimensional dataset, 3 natural clusters)
Experiments: Evaluated Frameworks
OPOSSUM: Strehl & Ghosh’s original METIS based framework.
S&G(Graclus): We replaced METIS with Graclus in Strehl & Ghosh’s framework to test the quality of the clusters produced by the Graclus algorithm.
P space+Graclus: Our proposed framework.
Experiments: Comparison Criteria
Purity
Entropy
Mutual Information
CLUSION graphics (visual identification, visual data mining)
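Two of these criteria can be sketched with their common textbook definitions. Purity is the fraction of samples that belong to the majority class of their cluster, and entropy is the size-weighted average of the class entropy inside each cluster, here in bits. The normalisation used in the talk may differ, and the function names and toy labels are hypothetical.

```python
import math
from collections import Counter


def _class_counts_per_cluster(true_labels, cluster_labels):
    """Map each cluster id to a Counter of the true classes it contains."""
    by_cluster = {}
    for truth, cluster in zip(true_labels, cluster_labels):
        by_cluster.setdefault(cluster, Counter())[truth] += 1
    return by_cluster


def purity(true_labels, cluster_labels):
    """Fraction of samples in the majority class of their cluster (higher is better)."""
    by_cluster = _class_counts_per_cluster(true_labels, cluster_labels)
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(true_labels)


def entropy(true_labels, cluster_labels):
    """Size-weighted mean class entropy of the clusters, in bits (lower is better)."""
    n = len(true_labels)
    by_cluster = _class_counts_per_cluster(true_labels, cluster_labels)
    total = 0.0
    for counts in by_cluster.values():
        size = sum(counts.values())
        h = -sum((v / size) * math.log2(v / size) for v in counts.values())
        total += (size / n) * h
    return total


truth = ["a", "a", "b", "b"]
perfect = [0, 0, 1, 1]   # every cluster holds one class
merged = [0, 0, 0, 0]    # a single cluster mixing both classes
```

On the toy data the perfect clustering gets the best value on both measures, while the merged clustering is penalised by both.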
Results: BBC Dataset
Results: BBC Dataset OPOSSUM
Results: BBC Dataset S&G(Graclus)
Results: BBC Dataset P space+Graclus
Results: Milliyet Dataset
Results: Milliyet Dataset OPOSSUM
Results: Milliyet Dataset S&G(Graclus)
Results: Milliyet Dataset P space+Graclus
Thank You!
Presenter: T. Tugay Bilgin