An Improved K-medoid Clustering Algo

The k-medoids algorithm is a clustering algorithm related to the k-means algorithm and the medoidshift algorithm.

Page 1: An Improved K-medoid Clustering Algo

An Improved K-medoid Clustering Algorithm

By

Mohammad Imran Kabir:090911543

Birendra Singh Airy:100911335

Under the guidance of

Ms Aparna Nayak Assistant Professor Dept of ICT MIT, Manipal

CONTENTS

• Introduction
• Literature Survey
• Design
• Implementation
• Results
• Conclusion
• References

INTRODUCTION

What exactly is K-medoid Clustering?

• K-medoid clustering is a classical partitioning technique that divides a data set of n objects into k clusters, where k is known a priori.

• A medoid is the object of a cluster whose average dissimilarity to all the other objects in the cluster is minimal, i.e. it is the most centrally located point of the cluster.

• Actual objects are chosen to represent the clusters, using one representative object per cluster. Each remaining object is clustered with the representative object to which it is the most similar.
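
The medoid definition above can be illustrated with a minimal sketch. It assumes Euclidean distance as the dissimilarity measure and a made-up list of 2-D points; the slides prescribe neither, and the function name `medoid` is hypothetical.

```python
import math

def medoid(points):
    """Return the point whose total distance to all points in the cluster is smallest.

    Minimizing the total is the same as minimizing the average, because the
    number of points is fixed. Euclidean distance is an assumption here.
    """
    def total_dissimilarity(candidate):
        return sum(math.dist(candidate, other) for other in points)
    return min(points, key=total_dissimilarity)

# Illustrative data only: three nearby points and one outlier.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.5, 2.0), (8.0, 8.0)]
print(medoid(cluster))  # → (1.5, 2.0)
```

Unlike the mean of the cluster, which would be (3.125, 3.0) here, the medoid is always one of the actual objects.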

LITERATURE SURVEY

• A hybrid algorithm for k-medoid clustering of large datasets, published by Weiguo Sheng (Dept. of Inf. Syst. Comput., Brunel Univ., London, UK) in June 2004, in which a local search heuristic selects k medoids from the data set and tries to efficiently minimize the total dissimilarity within each cluster.

• Parallelization of the k-medoid clustering algorithm, implemented by W. Aljoby (Queen Arwa University, Yemen) in March 2013, in which the k-medoid algorithm is divided into tasks that are mapped onto a multiprocessor system.

DESIGN

Class Diagram

Data Flow Diagram

Activity Diagram

IMPLEMENTATION

BASIC PAM

The most common realization of k-medoids clustering is the Partitioning Around Medoids (PAM) algorithm.

Input: k: the number of clusters; Dataset: a data set containing n objects.

Output: A set of k clusters.

1. Arbitrarily choose k objects in Dataset as the initial representative objects or seeds.
2. Repeat
3. Assign each remaining object to the cluster with the nearest representative object.
4. Randomly select a non-representative object, Orandom.
5. Compute the total cost, S, of swapping representative object Oj with Orandom.
6. If S < 0, then swap Oj with Orandom to form the new set of k representative objects.
7. Until no change.
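
The PAM steps can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: it assumes Euclidean distance, distinct points given as tuples, and it approximates "until no change" with a cap on consecutive unsuccessful swaps (the hypothetical parameter `max_failed_swaps`).

```python
import math
import random

def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def assign(points, medoids):
    """Map each medoid to the list of points that are nearest to it."""
    clusters = {m: [] for m in medoids}
    for p in points:
        nearest = min(medoids, key=lambda m: math.dist(p, m))
        clusters[nearest].append(p)
    return clusters

def pam(points, k, max_failed_swaps=200, seed=0):
    """Randomized-swap PAM sketch; assumes distinct, hashable points."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)            # arbitrary initial seeds
    cost = total_cost(points, medoids)
    failed = 0
    while failed < max_failed_swaps:           # stand-in for "until no change"
        non_medoids = [p for p in points if p not in medoids]
        if not non_medoids:
            break
        o_j = rng.choice(medoids)              # representative object Oj
        o_random = rng.choice(non_medoids)     # non-representative object Orandom
        candidate = [o_random if m == o_j else m for m in medoids]
        new_cost = total_cost(points, candidate)
        s = new_cost - cost                    # swap cost S
        if s < 0:                              # keep the swap only if it helps
            medoids, cost, failed = candidate, new_cost, 0
        else:
            failed += 1
    return assign(points, medoids), cost

# Illustrative data only: two well-separated groups of three points.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, cost = pam(data, k=2)
print(sorted(clusters), cost)
```

The assignment step is folded into `total_cost`, since the cost of a medoid set already implies assigning each object to its nearest medoid.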

Distribution of Data

Cluster formation after initial medoid assumption

Cluster formation after swapping

IMPROVEMENT OF THE ALGORITHM

• The aim is to resolve the drawbacks of the traditional PAM algorithm by introducing an improved K-medoid clustering algorithm based on the CF-tree.

• This CF-tree algorithm operates on the clustering features of the BIRCH (Balanced Iterative Reducing And Clustering using Hierarchies) algorithm.
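
As a rough illustration of what a clustering feature holds, the sketch below uses the standard BIRCH triple (N, LS, SS): the number of points, their linear sum, and their sum of squared norms. The class and method names are hypothetical, and the radius shown is the usual BIRCH definition, which may differ in detail from the formulas used in the presenters' implementation.

```python
import math
from dataclasses import dataclass

@dataclass
class CF:
    n: int        # number of points summarized
    ls: tuple     # linear sum of the points, per dimension
    ss: float     # sum of squared norms of the points

    @classmethod
    def from_point(cls, p):
        return cls(1, tuple(p), float(sum(x * x for x in p)))

    def merge(self, other):
        """CFs are additive: merging two summaries just adds their fields."""
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        """Root-mean-square distance of the summarized points from the centroid."""
        c = self.centroid()
        variance = self.ss / self.n - sum(x * x for x in c)
        return math.sqrt(max(variance, 0.0))  # clamp tiny negative rounding errors

cf = CF.from_point((0.0, 0.0)).merge(CF.from_point((2.0, 0.0)))
print(cf.n, cf.centroid(), cf.radius())  # → 2 (1.0, 0.0) 1.0
```

Because the fields are additive, a parent node's CF can be maintained by summing the CFs of its children without revisiting the raw data.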

Methodology

• We store all the training sample data in a CF-Tree (Clustering Feature Tree), then use the k-medoids method to cluster the CF entries in the leaf nodes of the CF-Tree.

• Finally, the k clusters can be read from the root of the CF-Tree.

ALGORITHM

• Input: k: the number of clusters; Dataset: a data set containing n objects; B: maximum number of children for non-leaf nodes in the CF-Tree, set B = k; L: maximum number of entries for leaf nodes in the CF-Tree; T: the threshold for the maximum diameter of an entry.

• Output: A set of k clusters.

1. Use the data points in Dataset to create a CF-Tree, tree*.
2. Arbitrarily choose k leaf nodes in tree* as the initial representative objects or seeds.
3. Repeat
4. Assign each remaining leaf node to the cluster {Oj} (j = 1, 2, ..., k) with the nearest representative object, based on formula (4).
5. The assignment results in updated CF values, which have to be propagated back to the root of the CF-tree.
6. Recompute the radius of all nodes based on formula (3); if the radius of any node is more than the threshold value T, one or several splits of the node can happen.
7. Randomly select a non-representative object in the leaf nodes, Orandom, with Orandom ≠ Oj.
8. Compute the total cost, S, of swapping representative object Oj with Orandom.
9. If S < 0, then swap Oj with Orandom to form the new set of k representative objects.
10. Until no change.

In this way we can get k clusters from the root of tree*, because B = k.
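
The overall idea (summarize the data into CF entries first, then run the swap search on the summaries instead of the raw points) can be sketched in a heavily simplified form. The code below is an assumption-laden illustration, not the algorithm above: a flat list of entries stands in for the leaf level of the CF-tree, there are no B or L limits, no node splits and no propagation to a root, the threshold is applied to the standard BIRCH radius, and Euclidean distance between entry centroids replaces formula (4). All function names are hypothetical.

```python
import math
import random

def summarize(points, threshold):
    """Greedily absorb each point into the nearest CF entry [n, ls, ss]
    if the merged radius stays within threshold; otherwise open a new entry."""
    entries = []
    for p in points:
        sq = sum(x * x for x in p)
        if entries:
            e = min(entries, key=lambda e: math.dist([x / e[0] for x in e[1]], p))
            n, ls, ss = e[0] + 1, [a + b for a, b in zip(e[1], p)], e[2] + sq
            radius = math.sqrt(max(ss / n - sum((x / n) ** 2 for x in ls), 0.0))
            if radius <= threshold:
                e[0], e[1], e[2] = n, ls, ss
                continue
        entries.append([1, list(p), sq])
    # Each summary is (number of points, centroid of the entry).
    return [(e[0], tuple(x / e[0] for x in e[1])) for e in entries]

def weighted_cost(summaries, medoids):
    """Distance of each entry centroid to its nearest medoid, weighted by entry size."""
    return sum(n * min(math.dist(c, m) for m in medoids) for n, c in summaries)

def cf_k_medoids(points, k, threshold, max_failed_swaps=200, seed=0):
    """Swap search over entry centroids; needs at least k distinct entries."""
    summaries = summarize(points, threshold)
    centroids = [c for _, c in summaries]
    rng = random.Random(seed)
    medoids = rng.sample(centroids, k)
    cost = weighted_cost(summaries, medoids)
    failed = 0
    while failed < max_failed_swaps and len(centroids) > k:
        o_j = rng.choice(medoids)
        o_random = rng.choice([c for c in centroids if c not in medoids])
        candidate = [o_random if m == o_j else m for m in medoids]
        new_cost = weighted_cost(summaries, candidate)
        if new_cost < cost:                    # S < 0: accept the swap
            medoids, cost, failed = candidate, new_cost, 0
        else:
            failed += 1
    return medoids, summaries

# Illustrative data only: two groups of four points each.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
medoids, summaries = cf_k_medoids(data, k=2, threshold=0.6)
print(len(summaries), medoids)
```

The swap search here touches only the entry summaries, so its cost per iteration depends on the number of entries rather than on the number of raw data points.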

RESULTS

Operation On A Small Dataset

• When we tested the basic PAM algorithm on a small dataset of about 100 entries, the total time to cluster the points was about 2 milliseconds and the cost after swapping was about 585, the best (lowest) total cost found between the data points and the medoids.

• When we tested the improved CF-tree algorithm on a small dataset, it took around the same time, i.e. about 2 milliseconds. On a dataset of this size, the enhanced CF-tree algorithm makes no difference to the total computation time of cluster formation.

Operation On A Large Dataset

• When the size of the dataset is increased to about 3000 entries, the basic PAM algorithm takes about 49 ms.

• The improved CF-tree algorithm takes about 42 ms on the same dataset, which is about 7 ms (roughly 14%) less than the basic PAM algorithm.
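
The slides do not say how the running times were measured. One common approach, sketched here under that caveat, is to take the best of several wall-clock runs with `time.perf_counter`; the helper `time_ms` is hypothetical, and the built-in `sorted` is only a stand-in for a clustering call.

```python
import time

def time_ms(fn, *args, repeats=5, **kwargs):
    """Return the best wall-clock time of fn(*args, **kwargs) in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best

# Stand-in workload; replace sorted with the clustering function under test.
elapsed = time_ms(sorted, list(range(3000, 0, -1)))
print(f"{elapsed:.3f} ms")
```

Taking the minimum over repeats reduces the influence of unrelated system activity, which matters when the measured times are only a few milliseconds.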

• Plotting the total number of data points against the total time taken by the two algorithms gives the following curve.

CONCLUSION

• The experimental results show that the CFk-medoids algorithm reduces running time and increases the accuracy of the result.

• The benefit of CFk-medoids is expected to be greater on larger datasets.

REFERENCES

• J. Han, M. Kamber. Data Mining: Concepts and Techniques, Beijing: China Machine Press, 2006.

• Rui Xu, D. Wunsch. “Survey of clustering algorithms”, IEEE Transactions on Neural Networks, 2005, Vol. 16, No. 3, pp. 645-678.

• C. Ordonez. “Clustering binary data streams with K-means”, Proceedings of DMKD’03, June 2003, Vol. 13, pp. 12-19.

• A. K. Jain, M. N. Murty, P. J. Flynn. “Data clustering: a review”, ACM Computing Surveys, 1999, Vol. 31, No. 3, pp. 264-323.

• Stephen Johnson. “Hierarchical clustering schemes”, Psychometrika, 1967, Vol. 32, No. 3, pp. 241-254.

• D. Barbara. “Requirements for clustering data streams”, ACM SIGKDD Explorations Newsletter, 2003, Vol. 3, No. 2, pp. 23-27.

• G. Karypis, E. H. Han, V. Kumar. “CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling”, IEEE Computer, 1999, Vol. 32, No. 8, pp. 68-75.

THANK YOU