An Improved K-medoid Clustering Algorithm
By
Mohammad Imran Kabir:090911543
Birendra Singh Airy:100911335
Under the guidance of
Ms. Aparna Nayak, Assistant Professor, Dept. of ICT, MIT, Manipal
CONTENTS
• Introduction
• Literature Survey
• Design
• Implementation
• Results
• Conclusion
• References
INTRODUCTION
What exactly is K-medoid Clustering?
• K-medoid is a classical partitioning technique of clustering that partitions a data set of n objects into k clusters, with k known a priori.
• A medoid is the object of a cluster whose average dissimilarity to all the other objects in the cluster is minimal, i.e., it is the most centrally located point in the cluster.
• Actual objects are chosen to represent the clusters, one representative object per cluster. Each remaining object is assigned to the representative object to which it is most similar.
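As a minimal illustration of the definition above (using Euclidean distance as the dissimilarity measure, an assumption the slides do not fix), the medoid of a point set can be computed directly:

```python
import math

def medoid(points):
    """Return the point whose average distance to all points
    in the set is minimal, i.e. the most central actual point."""
    best, best_cost = None, float("inf")
    for p in points:
        # average dissimilarity of p to every object in the set
        cost = sum(math.dist(p, q) for q in points) / len(points)
        if cost < best_cost:
            best, best_cost = p, cost
    return best

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
print(medoid(pts))  # an actual data point, not a synthetic mean
```

Note that, unlike a k-means centroid, the result is always one of the input objects, so the outlier (5, 5) pulls the answer far less than it would pull a mean.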
LITERATURE SURVEY
• "Hybrid algorithm for K-medoid clustering of large datasets", published by Weiguo Sheng, Dept. of Inf. Syst. & Comput., Brunel Univ., London, UK, in June 2004, where a local search heuristic selects k medoids from the data set and tries to efficiently minimize the total dissimilarity within each cluster.
• "Parallelization of K-medoid clustering algorithm", implemented by W. Aljoby of Queen Arwa University, Yemen, in March 2013, where the K-medoid algorithm is divided into tasks that are mapped onto a multiprocessor system.
DESIGN
Class Diagram
Data Flow Diagram
Activity Diagram
IMPLEMENTATION
BASIC PAM
The most common realization of k-medoids clustering is the Partitioning Around Medoids (PAM) algorithm.
Input: k: the number of clusters; Dataset: a data set containing n objects.
Output: A set of k clusters.
1. Arbitrarily choose k objects in Dataset as the initial representative objects (seeds).
2. Repeat
3. Assign each remaining object to the cluster with the nearest representative object.
4. Randomly select a non-representative object, Orandom.
5. Compute the total cost, S, of swapping a representative object, Oj, with Orandom.
6. If S < 0, swap Oj with Orandom to form the new set of k representative objects.
7. Until no change.
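The PAM loop above can be sketched in Python. This is a minimal illustration, not the project's implementation: it uses Euclidean distance and, for determinism, sweeps over every candidate swap rather than sampling Orandom at random; the stopping condition ("until no change") is the same.

```python
import math
import random

def pam(points, k, seed=0):
    """Basic PAM sketch: keep swapping a medoid with a
    non-medoid while the total cost keeps decreasing."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)          # step 1: arbitrary seeds

    def total_cost(meds):
        # sum of distances from every object to its nearest medoid
        return sum(min(math.dist(p, m) for m in meds) for p in points)

    improved = True
    while improved:                          # repeat ... until no change
        improved = False
        for j in range(k):
            for o in points:
                if o in medoids:
                    continue                 # only non-representative objects
                candidate = medoids[:j] + [o] + medoids[j + 1:]
                # S < 0 means the swap lowers the total cost
                if total_cost(candidate) < total_cost(medoids):
                    medoids = candidate
                    improved = True

    # final assignment of each object to its nearest medoid
    clusters = {m: [] for m in medoids}
    for p in points:
        nearest = min(medoids, key=lambda m: math.dist(p, m))
        clusters[nearest].append(p)
    return medoids, clusters

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
meds, clusters = pam(pts, 2)
```

The exhaustive sweep costs O(k(n-k)) cost evaluations per pass, which is exactly the per-iteration expense the CF-tree improvement later tries to reduce.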
Distribution of Data
Cluster formation after initial medoid assumption
Cluster formation after swapping
IMPROVEMENT OF THE ALGORITHM
• The drawbacks of the traditional PAM algorithm are resolved by introducing a new, improved K-medoid clustering algorithm based on a CF-tree.
• This CF-tree algorithm operates on the clustering features of the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm.
Methodology
• We store all the training sample data in a CF-Tree (Clustering Feature Tree), then use the k-medoids method to cluster the CFs in the leaf nodes of the CF-Tree.
• Eventually, we obtain k clusters from the root of the CF-Tree.
• Input: k: the number of clusters; Dataset: a data set containing n objects; B: the maximum number of children for non-leaf nodes in the CF-Tree (set B = k); L: the maximum number of entries for leaf nodes in the CF-Tree; T: the threshold for the maximum diameter of an entry.
• Output: A set of k clusters.
1. Use the data points in Dataset to create a CF-Tree, tree*.
2. Arbitrarily choose k leaf nodes in tree* as the initial representative objects (seeds).
3. Repeat
4. Assign each remaining leaf node to the cluster {Oj} (j = 1, 2, ..., k) with the nearest representative object, based on formula (4).
ALGORITHM
5. The assignment results in updated CF values, which must be propagated back to the root of the CF-tree.
6. Recompute the radius of all nodes based on formula (3); if the radius of any node exceeds the threshold value T, one or several splits of the node may occur.
7. Randomly select a non-representative object in the leaf nodes, Orandom, with Orandom ≠ Oj.
8. Compute the total cost, S, of swapping a representative object, Oj, with Orandom.
9. If S < 0, swap Oj with Orandom to form the new set of k representative objects.
10. Until no change.
In this way we obtain k clusters from the root of tree*, because B = k.
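The clustering feature the leaf nodes store can be sketched as follows. This is a minimal illustration of the standard BIRCH triple CF = (N, LS, SS) and its radius, one common form of what formulas (3) and (4) refer to (the slides do not reproduce those formulas, so the exact expressions here are an assumption); the threshold check mirrors step 6.

```python
import math

class CF:
    """BIRCH clustering feature (N, LS, SS) for a set of points:
    N = count, LS = linear sum, SS = sum of squared norms."""
    def __init__(self, point):
        self.n = 1
        self.ls = list(point)
        self.ss = sum(x * x for x in point)

    def merge(self, other):
        # CFs are additive: merging entries just adds the triples
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self):
        return [x / self.n for x in self.ls]

    def radius(self):
        # avg distance of members to the centroid, from (N, LS, SS) alone
        sq = (self.ss - sum(x * x for x in self.ls) / self.n) / self.n
        return math.sqrt(max(sq, 0.0))

# absorb points into an entry only while the radius stays under the
# threshold T; a rejected point would trigger a node split (step 6)
T = 2.0
entry = CF((0.0, 0.0))
for p in [(1.0, 0.0), (0.0, 1.0), (9.0, 9.0)]:
    saved = (entry.n, entry.ls[:], entry.ss)
    entry.merge(CF(p))
    if entry.radius() > T:
        entry.n, entry.ls, entry.ss = saved  # reject: too far away
```

Because k-medoids then runs over these compact CF entries instead of raw points, each cost evaluation touches far fewer objects, which is where the speed-up on large datasets comes from.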
RESULTS
Operation on a Small Dataset
• When we tested the basic PAM algorithm on a small dataset of about 100 entries, the total time to cluster the points was around 2 milliseconds, and the after-swap cost was around 585, which indicates the optimum cost between the data points and the medoids.
• When we tested the improved CF-tree algorithm on the same small dataset, it took around the same time, i.e., about 2 milliseconds. There is no difference in the total computation time of cluster formation even after using the enhanced CF-tree algorithm.
Operation On A Large Dataset
• As we increase the size of the dataset to about 3000 entries, the time taken by the basic PAM algorithm is about 49 ms,
• while the time taken by the improved CF-tree algorithm is about 42 ms,
• which is significantly less than that of the basic PAM algorithm.
• Plotting the total number of data points against the total time taken by the two algorithms yields the following curve.
Conclusion
• The experimental results show that the CF-k-medoids algorithm reduces running time and increases result accuracy.
• CF-k-medoids would benefit even more from larger datasets.
References
• J. Han, M. Kamber. Data Mining: Concepts and Techniques, Beijing: China Machine Press, 2006.
• Rui Xu, D. Wunsch. "Survey of clustering algorithms", IEEE Transactions on Neural Networks, 2005, Vol. 16, No. 3, pp. 645-678.
• C. Ordonez. "Clustering binary data streams with K-means", Proceedings of DMKD'03, June 2003, pp. 12-19.
• A. K. Jain, M. N. Murty, P. J. Flynn. "Data clustering: a review", ACM Computing Surveys, 1999, Vol. 31, No. 3, pp. 264-323.
• S. Johnson. "Hierarchical clustering schemes", Psychometrika, 1967, Vol. 32, No. 3, pp. 241-254.
• D. Barbara. "Requirements for clustering data streams", ACM SIGKDD Explorations Newsletter, 2003, Vol. 3, No. 2, pp. 23-27.
• G. Karypis, E. H. Han, V. Kumar. "CHAMELEON: A hierarchical clustering algorithm using dynamic modeling", IEEE Computer, 1999, Vol. 32, No. 8, pp. 68-75.
THANK YOU