Clustering
Presented by: Manasi C. Kadam, Sharmishtha P. Alwekar, Ganesh H. Satpute, Deepak D. Ambegaonkar, Rajesh V. Dulhani
Under the guidance of: Prof. G. A. Patil, Mr. Varad Meru
Introduction
Clustering
K-means clustering algorithm
Canopy clustering algorithm
Complexity Evaluation
Conclusion
Future Enhancement
References
Agenda
Maintaining large volumes of data is a tedious task
Types of data
1. Structured
2. Unstructured
Introduction
Extracting information out of data
Two types
1. Exploratory or descriptive
2. Confirmative or inferential
Introduction to Data analysis
Goal is to discover the natural grouping(s) among
objects
Given n objects, find K groups based on a measure of “similarity”
Organizing data into clusters such that there is
• high intra-cluster similarity
• low inter-cluster similarity
Ideal cluster - set of points that is compact and isolated
Ex. K-means algorithm, k-medoids, etc.
Clustering (aka Unsupervised Learning)
Clusters can differ in size, shape and density
Presence of noise
A cluster is a subjective entity
Automation
Problems in clustering
Types of Clustering Algorithm
1. Hierarchical
2. Partitional
Hierarchical – recursively finds nested clusters
Types
1. Agglomerative
2. Divisive
Partitional - finds all the clusters simultaneously
Ex. K-means
Clustering Algorithms
K-means algorithm
Goal of K-means is to minimize the sum of the
squared error over all K clusters
K-means Algorithm (contd.)
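The sum-of-squared-error goal stated above is usually written as follows. The notation is an assumption added here, not taken from the slides: C_k is the set of points assigned to cluster k, and \mu_k is the mean of those points.

```latex
J(C) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^{2}
```

K-means looks for the assignment of points to the K clusters that makes J as small as possible.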
Flowchart
Class Diagram of K-means
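The flowchart itself did not survive in this transcript. As a rough stand-in, here is a minimal pure-Python sketch of the standard K-means loop (assign each point to its nearest centroid, recompute centroids, repeat until assignments stop changing). The function names, the seeded random initialization, and the toy dataset are assumptions for illustration, not the presenters' implementation.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means; returns (centroids, assignment, sum of squared error)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct points drawn at random.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the loop has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))
    return centroids, assignment, sse


# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignment, sse = kmeans(points, k=2)
print(assignment, round(sse, 3))
```

The returned `sse` is the quantity K-means tries to minimize, so it can be used to compare runs.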
The most critical choice is K
Typically the algorithm is run for various values of K and the most appropriate output is selected
Different initializations can lead to different outputs
Parameters for K-means
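Both points above (trying several values of K, and sensitivity to initialization) can be illustrated with a small sketch: run K-means from several random starts for each K and keep the lowest sum of squared error. The helper name `kmeans_sse`, the seed range, and the toy data are assumptions; deciding which K is "most appropriate" from the printed values remains a judgment call.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_sse(points, k, seed, max_iter=100):
    """Run K-means once from a seeded random start and return the final SSE."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # different seeds give different starts
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))


# Hypothetical toy data with two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

best_sse = {}
for k in range(1, 5):
    # Keep the best of 10 differently initialized runs for this K.
    best_sse[k] = min(kmeans_sse(points, k, seed) for seed in range(10))
    print(k, round(best_sse[k], 3))
```

On data like this the error drops sharply from K = 1 to K = 2 and only slowly afterwards, which is one common heuristic for picking K.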
Traditional clustering algorithms work well when the
dataset has any one of the following properties:
Large number of clusters
High feature dimensionality
Large number of data points
When a dataset has all three properties at once, computation becomes expensive.
This calls for a new technique: canopy clustering
Canopy Clustering
Performs clustering in two stages
1. Rough and quick stage
2. Rigorous stage
Canopy Clustering (contd.)
Rough and quick stage
Uses an extremely inexpensive distance measure
Divides the data into overlapping subsets called “canopies”
Rigorous stage
Uses a rigorous and expensive distance metric
Exact distances are computed only between points that share a canopy
Canopy Clustering (contd.)
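The rough stage can be sketched as below, following the two-threshold procedure described in the McCallum et al. reference: pick a point, put every point within a loose threshold into its canopy, and drop from the candidate list every point within a tight threshold. The names `t1` and `t2`, the use of Manhattan distance as the cheap measure, and the toy data are assumptions for illustration.

```python
from itertools import combinations


def cheap_distance(a, b):
    """Manhattan distance, used here as a stand-in for an inexpensive measure."""
    return sum(abs(x - y) for x, y in zip(a, b))


def build_canopies(points, t1, t2):
    """Return canopies as lists of point indices; requires t1 > t2."""
    if t1 <= t2:
        raise ValueError("loose threshold t1 must exceed tight threshold t2")
    remaining = list(range(len(points)))  # candidate canopy centers
    canopies = []
    while remaining:
        center = remaining[0]
        # Every point within t1 of the center joins this canopy (overlap allowed).
        canopies.append([
            i for i in range(len(points))
            if cheap_distance(points[center], points[i]) <= t1
        ])
        # Points within t2 of the center can no longer start a canopy.
        remaining = [
            i for i in remaining
            if i != center and cheap_distance(points[center], points[i]) > t2
        ]
    return canopies


# Hypothetical data: two groups plus one point (index 3) lying between them.
points = [(1, 1), (2, 1), (1, 2), (4, 4), (8, 8), (7, 8), (8, 7)]
canopies = build_canopies(points, t1=9, t2=6.5)
print(canopies)

# The rigorous stage would compare only pairs that share a canopy.
pairs_in_canopies = {p for c in canopies for p in combinations(c, 2)}
all_pairs = len(points) * (len(points) - 1) // 2
print(len(pairs_in_canopies), "of", all_pairs, "pairs need an exact distance")
```

In this example the middle point falls into both canopies, and the expensive metric is needed for fewer pairs than a full all-pairs comparison.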
Flowchart of Canopy Clustering
Source: Ref [2]
[Figure: Output of K-means in Mathematica on the same dataset]
[Figure: Output of K-means in R on the same dataset]
[Figure: Output of K-means in Microsoft Excel on the same dataset]
[Figure: Output of canopy clustering in Excel on the same dataset]
Complexity of K-means is O(nk) distance computations per iteration, where n is the number
of objects and k is the number of centroids
Canopy-based K-means reduces this to O(nkf²/c)
c is the number of canopies
f is the average number of canopies that each data point falls into
As f is a small number and c is comparatively large, the complexity is reduced
Complexity
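To make the two expressions concrete, the short sketch below plugs in some made-up values. The numbers for n, k, f and c are purely illustrative assumptions, not measurements from the presenters' experiments.

```python
def plain_kmeans_cost(n, k):
    """Distance computations per iteration for plain K-means: n * k."""
    return n * k


def canopy_kmeans_cost(n, k, f, c):
    """Distance computations per iteration with canopies: n * k * f^2 / c."""
    return n * k * f ** 2 / c


# Hypothetical values: one million points, 1000 centroids,
# each point in 2 canopies on average, 100 canopies in total.
n, k, f, c = 1_000_000, 1_000, 2, 100

plain = plain_kmeans_cost(n, k)
with_canopies = canopy_kmeans_cost(n, k, f, c)
reduction = plain / with_canopies  # equals c / f^2

print(plain, with_canopies, reduction)
```

The reduction factor is c / f², so it grows with the number of canopies and shrinks quickly if points fall into many canopies at once.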
Implemented the K-means algorithm
Verified the results against Mathematica and R
Implemented canopy clustering
Verified the results in Excel
Conclusion
Learning Hadoop and MapReduce
Parallelizing K-means with MapReduce and comparing the implementations
Running all the K-means implementations on a standard dataset
Future Enhancement
[1] Anil K. Jain, “Data Clustering: 50 Years Beyond K-Means”
[2] Andrew McCallum et al., “Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching”
References
Thank You