Clustering

Presented by: Manasi C. Kadam, Sharmishtha P. Alwekar, Ganesh H. Satpute, Deepak D. Ambegaonkar, Rajesh V. Dulhani

Under the guidance of: Prof. G. A. Patil, Mr. Varad Meru

Introduction

Clustering

K-means clustering algorithm

Canopy clustering algorithm

Complexity Evaluation

Conclusion

Future Enhancement

References

Agenda

Maintaining large data is a tedious task

Types of data

1. Structured

2. Unstructured

Introduction

Extracting information out of data

Two types

1. Exploratory or descriptive

2. Confirmative or inferential

Introduction to Data analysis

Goal is to discover the natural grouping(s) among objects

Given n objects, find K groups based on a measure of “similarity”

Organizing data into clusters such that there is

• high intra-cluster similarity

• low inter-cluster similarity

Ideal cluster: a set of points that is compact and isolated

Ex. K-means algorithm, k-medoids, etc.

Clustering (a.k.a. Unsupervised Learning)
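The two similarity criteria above can be checked numerically. The following is a minimal sketch, assuming Euclidean distance as the (dis)similarity measure and a hypothetical toy dataset; the function names are illustrative and not part of the presentation.

```python
from itertools import combinations
from math import dist

def mean_intra_distance(clusters):
    """Average distance between pairs of points in the same cluster."""
    pairs = [dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(pairs) / len(pairs)

def mean_inter_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    pairs = [dist(p, q)
             for a, b in combinations(clusters, 2)
             for p in a for q in b]
    return sum(pairs) / len(pairs)

# Hypothetical toy data: two compact, well-separated groups.
clusters = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
]
intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
print(f"mean intra-cluster distance: {intra:.2f}")
print(f"mean inter-cluster distance: {inter:.2f}")
```

A good clustering gives a small intra-cluster distance (high similarity within a cluster) and a large inter-cluster distance (low similarity between clusters).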

Clusters can differ in size, shape, and density

Presence of noise

Cluster is a subjective entity

Automation

Problems in clustering

Types of Clustering Algorithm

1. Hierarchical

2. Partitional

Hierarchical – recursively finds nested clusters

Types

1. Agglomerative

2. Divisive

Partitional - finds all the clusters simultaneously

ex. K-means

Clustering Algorithm
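To illustrate how an agglomerative hierarchical method builds nested clusters, here is a minimal sketch. It assumes single linkage (distance between the closest pair of points) and Euclidean distance, neither of which is specified in the slides; the function names and data are hypothetical.

```python
from math import dist

def single_linkage(a, b):
    """Distance between two clusters: closest pair of points (assumed linkage)."""
    return min(dist(p, q) for p in a for q in b)

def agglomerative(points, k):
    """Start with one cluster per point and merge the two closest until k remain."""
    clusters = [[p] for p in points]
    history = [[list(c) for c in clusters]]  # snapshot after every merge
    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        history.append([list(c) for c in clusters])
    return clusters, history

# Hypothetical toy data.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
final, history = agglomerative(points, k=2)
for level in history:
    print(level)
```

Each level in `history` is a coarsening of the previous one, which is the nesting that distinguishes hierarchical from partitional methods. A divisive method would work in the opposite direction, starting from one cluster and splitting.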

K-means algorithm

Goal of K-means is to minimize the sum of the squared error over all K clusters

K-means Algorithm (contd.)
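The flowchart for this slide is not reproduced in the transcript, so the following is only a sketch of the standard K-means iteration (assign each point to the nearest centroid, then recompute centroids), not the presenters' implementation. It assumes Euclidean distance and initial centroids drawn at random from the data; all names are hypothetical.

```python
import random
from math import dist

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(points) for xs in zip(*points))

def assign(points, centroids):
    """Group points by the index of their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def sse(clusters, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(dist(p, c) ** 2 for pts, c in zip(clusters, centroids) for p in pts)

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random data points
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        # Keep the old centroid if a cluster ends up empty.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:  # converged: centroids no longer move
            break
        centroids = updated
    return centroids, assign(points, centroids)

# Hypothetical toy data: two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(sse(clusters, centroids))
```

The `sse` function computes the quantity that K-means tries to minimize: the squared error summed over all K clusters.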

Flowchart

Class Diagram of K-means

Most critical choice is K

Typically, the algorithm is run for various values of K and the most appropriate output is selected

Different initializations can lead to different outputs

Parameters for K-means
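The effect of initialization and of K can be seen on a small example. This is a minimal sketch, assuming Euclidean distance and explicitly chosen starting centroids so that the run is deterministic; the data and the `lloyd` helper are hypothetical.

```python
from math import dist

def lloyd(points, centroids, max_iter=100):
    """Run K-means from the given initial centroids; return (centroids, SSE)."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            clusters[i].append(p)
        updated = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    error = sum(min(dist(p, c) for c in centroids) ** 2 for p in points)
    return centroids, error

# Four corners of a wide rectangle (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Same K, two different initializations.
_, error_a = lloyd(points, [(0.0, 0.5), (4.0, 0.5)])  # splits left / right
_, error_b = lloyd(points, [(2.0, 0.0), (2.0, 1.0)])  # splits bottom / top
print(error_a, error_b)

# Different values of K, each with a hand-picked start.
starts = {1: [(2.0, 0.5)], 2: [(0.0, 0.5), (4.0, 0.5)], 4: list(points)}
errors_by_k = {k: lloyd(points, init)[1] for k, init in starts.items()}
print(errors_by_k)
```

Both initializations converge, but the second one stops at a much worse solution. The squared error also keeps shrinking as K grows, so it cannot be used alone to pick K; some judgement about the "most appropriate output" is still needed.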

Traditional clustering algorithms work well when the dataset has any one of the following properties:

Large number of clusters

High feature dimensionality

Large number of data points

When a dataset has all three properties at once, computation becomes expensive.

This calls for a new technique: canopy clustering

Canopy Clustering

Performs clustering in two stages

1. Rough and quick stage

2. Rigorous stage

Canopy Clustering (contd.)

Rough and quick stage

Uses an extremely inexpensive method

Divides the data into overlapping subsets called “canopies”

Rigorous stage

Uses a rigorous and expensive metric

Clustering is applied only within each canopy

Canopy Clustering (contd.)
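The rough and quick stage can be sketched as follows. The slides do not name the cheap metric or the thresholds, so this example assumes Manhattan distance as the inexpensive measure and two thresholds, a loose `t1` and a tight `t2` with `t1 > t2`, in the style of the canopy method of Ref [2]. All names and data are hypothetical.

```python
from itertools import combinations

def cheap_distance(p, q):
    """Inexpensive approximate metric (assumption: Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def build_canopies(points, t1, t2, distance=cheap_distance):
    """Stage 1: group points into possibly overlapping canopies."""
    if not 0 <= t2 < t1:
        raise ValueError("thresholds must satisfy 0 <= t2 < t1")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining[0]
        # Every point within the loose threshold joins this canopy.
        canopy = [p for p in points if distance(center, p) <= t1]
        canopies.append((center, canopy))
        # Points within the tight threshold can no longer start a new canopy.
        remaining = [p for p in remaining if distance(center, p) > t2]
    return canopies

def candidate_pairs(canopies):
    """Pairs of points that share a canopy; only these need the expensive metric."""
    pairs = set()
    for _, canopy in canopies:
        pairs.update(combinations(canopy, 2))
    return pairs

# Hypothetical toy data on a line.
points = [(0, 0), (1, 0), (3, 0), (6, 0), (10, 0), (11, 0)]
canopies = build_canopies(points, t1=4, t2=2)
for center, members in canopies:
    print(center, members)

all_pairs = len(points) * (len(points) - 1) // 2
print(len(candidate_pairs(canopies)), "of", all_pairs, "pairs need the rigorous metric")
```

In the rigorous stage, the expensive metric is evaluated only for points that share a canopy, which is where the saving comes from. On such a small dataset the saving is modest; it grows as the canopies become small relative to the whole dataset.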

Flowchart of Canopy Clustering

Source: Ref [2]

Output of K-means on Mathematica on Same Dataset

Output of K-means on R on Same Dataset

Output of K-means on Microsoft Excel on Same Dataset

Output of canopy on Excel on Same Dataset

Complexity of K-means is O(nk), where n is the number of objects and k is the number of centroids

Canopy-based K-means reduces this to O(nkf²/c), where

c is the number of canopies

f is the average number of canopies that each data point falls into

As f is a very small number and c is comparatively large, the complexity is reduced

Complexity
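A quick calculation shows the size of the reduction. The values of n, k, c, and f below are hypothetical and chosen only to illustrate the two formulas; constant factors are ignored.

```python
def kmeans_cost(n, k):
    """Distance computations per iteration for plain K-means: O(nk)."""
    return n * k

def canopy_kmeans_cost(n, k, f, c):
    """Distance computations per iteration with canopies: O(nk f^2 / c)."""
    return n * k * f ** 2 / c

# Hypothetical values, for illustration only.
n, k = 1_000_000, 1_000   # objects, centroids
c, f = 1_000, 2           # canopies, average canopies per point

plain = kmeans_cost(n, k)
with_canopies = canopy_kmeans_cost(n, k, f, c)
speedup = plain / with_canopies
print(plain, with_canopies, speedup)
```

The ratio of the two costs is c/f², so the saving depends on having many canopies while keeping the overlap between them small.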

Implemented K-means Algorithm

Verified results on Mathematica and R

Implemented Canopy Clustering

Verified results on Excel

Conclusion

Learning Hadoop and MapReduce

Parallelizing K-means based on MapReduce and comparing the implementations

Running all the K-means implementations on a standard dataset

Future Enhancement

[1] Anil K. Jain, “Data Clustering: 50 Years Beyond K-Means”

[2] Andrew McCallum et al., “Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching”

References

Thank You
