
An Improved K-medoid Clustering Algorithm

By

Mohammad Imran Kabir:090911543

Birendra Singh Airy:100911335

Under the guidance of

Ms. Aparna Nayak, Assistant Professor, Dept. of ICT, MIT, Manipal

CONTENTS

• Introduction
• Literature Survey
• Design
• Implementation
• Results
• Conclusion
• References

INTRODUCTION

What exactly is K-medoid Clustering?

• K-medoid is a classical partitioning technique of clustering that partitions a data set of n objects into k clusters, with k known a priori.

• A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e., it is the most centrally located point of the cluster.

• Actual objects are chosen to represent the clusters, using one representative object per cluster. Each remaining object is clustered with the representative object to which it is the most similar.
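A minimal Python sketch of this definition (an illustration only, not part of the project code; Euclidean distance is assumed as the dissimilarity):

import numpy as np

def medoid(points: np.ndarray) -> int:
    """Return the index of the medoid: the point whose average
    distance to all other points in the cluster is minimal."""
    # Pairwise Euclidean distance matrix (n x n).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # The medoid minimizes the mean distance to the other points.
    return int(dists.mean(axis=1).argmin())

# Unlike a k-means centroid, the medoid is always an actual data point.
pts = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [8.0, 8.0], [2.0, 2.0]])
print(medoid(pts))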

LITERATURE SURVEY

• Hybrid algorithm for K-medoid clustering of large datasets, published by Weiguo Sheng (Dept. of Information Systems and Computing, Brunel University, London, UK) in June 2004, where a local search heuristic selects the k medoids from the data set and tries to efficiently minimize the total dissimilarity within each cluster.

• Parallelization of the K-medoid clustering algorithm, implemented by W. Aljoby (Queen Arwa University, Yemen) in March 2013, where the K-medoid algorithm is divided into tasks that are mapped onto a multiprocessor system.

DESIGN

Class Diagram

Data Flow Diagram

Activity Diagram

IMPLEMENTATION

BASIC PAM

The most common realization of k-medoids clustering is the Partitioning Around Medoids (PAM) algorithm.

Input: k: the number of clusters; Dataset: a data set containing n objects.
Output: A set of k clusters.

1. Arbitrarily choose k objects in Dataset as the initial representative objects (seeds).
2. Repeat:
3. Assign each remaining object to the cluster with the nearest representative object.
4. Randomly select a non-representative object, Orandom.
5. Compute the total cost, S, of swapping a representative object, Oj, with Orandom.
6. If S < 0, then swap Oj with Orandom to form the new set of k representative objects.
7. Until no change.
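A compact Python sketch of the PAM loop above (a hedged illustration, not the project's implementation; it exhaustively tries swaps rather than sampling Orandom, and assumes Euclidean dissimilarity):

import random
import numpy as np

def pam(data: np.ndarray, k: int, max_iter: int = 100, seed: int = 0):
    """Basic Partitioning Around Medoids on an (n, d) data array."""
    rng = random.Random(seed)
    n = len(data)
    # Pairwise dissimilarity matrix (Euclidean here; any metric works).
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    medoids = rng.sample(range(n), k)              # step 1: arbitrary seeds

    def total_cost(meds):
        # Each object contributes its distance to its nearest medoid.
        return dist[:, meds].min(axis=1).sum()

    cost = total_cost(medoids)
    for _ in range(max_iter):                      # step 2: repeat
        improved = False
        for j in range(k):                         # candidate representative Oj
            for o in range(n):                     # candidate non-representative
                if o in medoids:
                    continue
                candidate = medoids[:j] + [o] + medoids[j + 1:]
                new_cost = total_cost(candidate)   # step 5: swap cost S
                if new_cost < cost:                # step 6: S < 0, keep the swap
                    medoids, cost = candidate, new_cost
                    improved = True
        if not improved:                           # step 7: until no change
            break
    labels = dist[:, medoids].argmin(axis=1)       # step 3: final assignment
    return medoids, labels, cost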

Distribution of Data

Cluster formation after initial medoid assumption

Cluster formation after swapping

IMPROVEMENT OF THE ALGORITHM

• To resolve the drawbacks of the traditional PAM algorithm, we introduce a new improved K-medoid clustering algorithm based on a CF-tree.

• This CF-tree algorithm operates on the clustering features of the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm.

Methodology

• We preserve all the training sample data in a CF-Tree (Clustering Feature Tree), then use the k-medoids method to cluster the CFs in the leaf nodes of the CF-Tree.

• Eventually, we can get k clusters from the root of the CF-Tree.
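As a rough end-to-end illustration of this methodology (an assumption-laden sketch, not the project's code), scikit-learn's Birch can stand in for the CF-Tree construction, and the pam() sketch from the PAM slide can then cluster the much smaller set of leaf-entry centroids:

import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 2))          # synthetic data, 3000 points

# Build the CF-tree; threshold plays the role of T and
# branching_factor roughly corresponds to B/L in the slides.
cf_tree = Birch(threshold=0.5, branching_factor=50, n_clusters=None).fit(X)
leaf_centroids = cf_tree.subcluster_centers_

# Run k-medoids (PAM) on the leaf-entry centroids instead of all points.
k = 3
medoid_idx, labels, cost = pam(leaf_centroids, k)
print(len(X), "points reduced to", len(leaf_centroids), "leaf entries ->", k, "clusters")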

ALGORITHM

• Input: k: the number of clusters; Dataset: a data set containing n objects; B: the maximum number of children for non-leaf nodes in the CF-Tree (set B = k); L: the maximum number of entries for leaf nodes in the CF-Tree; T: the threshold for the maximum diameter of an entry.

• Output: A set of k clusters.

1. Use the data points in Dataset to create a CF-Tree, tree*.
2. Arbitrarily choose k leaf nodes in tree* as the initial representative objects (seeds).
3. Repeat:
4. Assign each remaining leaf node to the cluster {Oj} (j = 1, 2, ..., k) with the nearest representative object, based on formula (4).
5. The assignment results in updated CF values, which have to be propagated back to the root of the CF-tree.
6. Recompute the radius of all nodes based on formula (3); if the radius of any node exceeds the threshold T, one or more splits of the node may happen.
7. Randomly select a non-representative object in the leaf nodes, Orandom, with Orandom ≠ Oj.
8. Compute the total cost, S, of swapping the representative object, Oj, with Orandom.
9. If S < 0, then swap Oj with Orandom to form the new set of k representative objects.
10. Until no change.

In this way we can get k clusters from the root of tree*, because B = k.
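Formulas (3) and (4) are not reproduced in this transcript, but the clustering feature itself is standard BIRCH machinery. A small Python sketch of a CF entry, with the usual centroid and radius derived from it (an illustration under that assumption, not the project's code):

import numpy as np

class CF:
    """BIRCH clustering feature CF = (N, LS, SS): the point count,
    the linear sum of the points and the sum of their squared norms."""
    def __init__(self, dim: int):
        self.n = 0
        self.ls = np.zeros(dim)   # linear sum of the points
        self.ss = 0.0             # sum of squared norms of the points

    def add(self, x: np.ndarray) -> None:
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def centroid(self) -> np.ndarray:
        return self.ls / self.n

    def radius(self) -> float:
        # Average distance of member points from the centroid:
        # R = sqrt(SS/N - ||LS/N||^2).
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - float(c @ c), 0.0)))

Because two CF entries merge by simply adding their components, updated CFs can be propagated back to the root (step 5), and a leaf entry whose radius exceeds the threshold T triggers the split described in step 6.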

RESULTS

Operation On A Small Dataset

• When we tested the basic PAM algorithm on a small dataset of about 100 entries, the total time to cluster the points was around 2 milliseconds, and the after-swap cost was around 585, which corresponds to the optimum total dissimilarity between the data points and their medoids.

• When we tested the improved CF-tree algorithm on the same small dataset, it took around the same time, i.e., about 2 milliseconds. As you can see, there is no difference in the total computation time of cluster formation even after using the enhanced CF-tree algorithm.

Operation On A Large Dataset

• As we increase the size of the dataset to about 3000 entries, the time taken by the basic PAM algorithm is about 49 ms.

• The time taken by the improved CF-tree algorithm is about 42 ms, which is significantly less than that of the basic PAM algorithm.

• Plotting the total number of data points against the total time taken by the two algorithms gives the curve shown on the slide.

Conclusion

• The experimental results show that the CF-k-medoids algorithm reduces running time and improves result accuracy.

• The advantage of CF-k-medoids grows as the dataset becomes larger.

References

• J. Han, M. Kamber. Data Mining: Concepts and Techniques, Beijing: China Machine Press, 2006.

• Rui Xu, D. Wunsch. "Survey of clustering algorithms", IEEE Transactions on Neural Networks, 2005, Vol. 16, No. 3, pp. 645-678.

• Ordonez C. "Clustering binary data streams with K-means", Proceedings of DMKD'03, June 2003, Vol. 13, pp. 12-19.

• A. K. Jain, M. N. Murty, P. J. Flynn. "Data clustering: a review", ACM Computing Surveys, 1999, Vol. 31, No. 3, pp. 264-323.

• Stephen Johnson. "Hierarchical clustering schemes", Psychometrika, 1967, Vol. 32, No. 3, pp. 241-254.

• Barbara D. "Requirements for clustering data streams", ACM SIGKDD Explorations Newsletter, 2003, Vol. 3, No. 2, pp. 23-27.

• Karypis G, Han E H, Kumar V. "CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling", IEEE Computer, 1999, Vol. 32, No. 8, pp. 68-75.

THANK YOU