Clustering

Presented by: Manasi C. Kadam, Sharmishtha P. Alwekar, Ganesh H. Satpute, Deepak D. Ambegaonkar, Rajesh V. Dulhani. Under the guidance of: Prof. G. A. Patil, Mr. Varad Meru

Transcript of Clustering

Page 1: Clustering

Page 2: Clustering

Introduction

Clustering

K-means clustering algorithm

Canopy clustering algorithm

Complexity Evaluation

Conclusion

Future Enhancement

References

Agenda

Page 3: Clustering

Maintaining large data is a tedious task

Types

1. Structured

2. Unstructured

Introduction

Page 4: Clustering

Extracting information out of data

Two types

1. Exploratory or descriptive

2. Confirmative or inferential

Introduction to Data analysis

Page 5: Clustering

Goal is to discover the natural grouping(s) among objects

Given n objects, find K groups based on a measure of “similarity”

Organizing data into clusters such that there is

• high intra-cluster similarity

• low inter-cluster similarity

Ideal cluster - set of points that is compact and isolated

Ex. K-means, K-medoids, etc.

Clustering (aka Unsupervised Learning)

Page 6: Clustering
Page 7: Clustering

Clusters can differ in size, shape, and density

Presence of noise

Cluster is a subjective entity

Automation

Problems in clustering

Page 8: Clustering

Types of Clustering Algorithm

1. Hierarchical

2. Partitional

Hierarchical – recursively finds nested clusters

Types

1. Agglomerative

2. Divisive

Partitional – finds all the clusters simultaneously

Ex. K-means

Clustering Algorithm
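
To make "recursively finds nested clusters" concrete, here is a minimal sketch of the agglomerative (bottom-up) variant. The slides do not say which linkage rule was intended, so single linkage (closest pair of points) is an assumption here, and the function names are illustrative only.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def agglomerative(points):
    """Bottom-up clustering sketch (single linkage is an assumption).

    Returns one clustering per level, from all singletons down to one cluster.
    Each cluster is a sorted list of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]
    levels = [[c[:] for c in clusters]]
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(sq_dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
        levels.append([sorted(c) for c in clusters])
    return levels


points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
levels = agglomerative(points)
for level in levels:
    print(len(level), "clusters:", sorted(level))
```

Each level is nested inside the next one, which is what separates a hierarchical method from a partitional one such as K-means. A divisive method would produce the same kind of hierarchy from the top down.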

Page 9: Clustering

K-means algorithm

Page 10: Clustering

Goal of K-means is to minimize the sum of squared errors over all K clusters

K-means Algorithm (contd.)
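
The objective stated on this slide can be written out as follows. The symbols are assumptions introduced for illustration, not taken from the slides: c_k is the k-th cluster, \mu_k its mean, and x_i a data point.

```latex
J(C) = \sum_{k=1}^{K} \sum_{x_i \in c_k} \lVert x_i - \mu_k \rVert^{2},
\qquad
\mu_k = \frac{1}{\lvert c_k \rvert} \sum_{x_i \in c_k} x_i
```

K-means alternates between assigning each point to its nearest mean and recomputing the means. Neither step can increase J, but the procedure may stop at a local minimum.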

Page 11: Clustering

Flowchart

Page 12: Clustering

Class Diagram of K-means

Page 13: Clustering

Most critical choice is K

Typically the algorithm is run for various values of K and the most appropriate output is selected

Different initializations can lead to different outputs

Parameters for K-means
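
The two points on this slide (trying several values of K, and sensitivity to initialization) can be sketched as below. This is a plain-Python illustration under stated assumptions, not the presenters' implementation: the function names, the toy dataset, the seeding scheme, and the choice to keep the lowest-error run among restarts are all hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed, max_iter=100):
    """Basic K-means sketch; returns (centroids, labels, sse)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k distinct data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, sse


def best_of_restarts(points, k, n_restarts=10):
    """Run K-means from several seeds and keep the run with the lowest SSE."""
    return min((kmeans(points, k, seed) for seed in range(n_restarts)),
               key=lambda run: run[2])


def sse_by_k(points, k_values, n_restarts=10):
    """Best SSE found for each candidate K."""
    return {k: best_of_restarts(points, k, n_restarts)[2] for k in k_values}


# Hypothetical toy data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
curve = sse_by_k(points, [1, 2, 3])
for k, sse in curve.items():
    print(k, round(sse, 3))
```

On data like this the error drops sharply from K=1 to K=2 and only slightly after that. A bend of that kind is one common heuristic for picking K, although the slides do not say which selection criterion was used.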

Page 14: Clustering

Traditional clustering algorithms work well when the dataset has any one of the following properties:

Large number of clusters

High feature dimensionality

Large number of data points

When the dataset has all three properties at once, computation becomes expensive.

This necessitates a new technique: canopy clustering

Canopy Clustering

Page 15: Clustering

Performs clustering in two stages

1. Rough and quick stage

2. Rigorous stage

Canopy Clustering (contd.)

Page 16: Clustering

Rough and quick stage

Uses an extremely inexpensive distance measure

Divides the data into overlapping subsets called “canopies”

Rigorous stage

Uses a rigorous and expensive metric

Clustering is applied only within each canopy

Canopy Clustering (contd.)
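
The rough stage can be sketched as follows. The two thresholds T1 (loose) and T2 (tight), with T1 > T2, come from the canopy method described by McCallum et al. and are not named on the slides. The Manhattan distance used as the "inexpensive" measure, the function names, and the toy data are assumptions for illustration.

```python
def cheap_distance(a, b):
    """Hypothetical inexpensive measure: Manhattan distance."""
    return sum(abs(x - y) for x, y in zip(a, b))


def build_canopies(points, t1, t2):
    """Rough stage: returns a list of (center_index, member_indices).

    t1 is the loose threshold (canopy membership); t2 is the tight threshold
    (points this close to a center can no longer become centers themselves).
    """
    if not t1 > t2 >= 0:
        raise ValueError("expected t1 > t2 >= 0")
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining[0]
        # Every point within t1 joins this canopy, so canopies may overlap.
        members = [i for i in range(len(points))
                   if cheap_distance(points[center], points[i]) <= t1]
        canopies.append((center, members))
        # Points within t2 of the center are dropped from the candidate list.
        remaining = [i for i in remaining
                     if cheap_distance(points[center], points[i]) > t2]
    return canopies


def canopy_membership(canopies, n_points):
    """For each point, the set of canopy ids it belongs to."""
    member_of = [set() for _ in range(n_points)]
    for canopy_id, (_, members) in enumerate(canopies):
        for i in members:
            member_of[i].add(canopy_id)
    return member_of


points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0), (9.0, 0.0)]
canopies = build_canopies(points, t1=4.5, t2=1.5)
member_of = canopy_membership(canopies, len(points))
print(canopies)
```

In the rigorous stage, the expensive metric would be evaluated only for pairs whose canopy sets intersect (`member_of[i] & member_of[j]`). Pairs with no canopy in common are treated as too far apart to be worth measuring.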

Page 17: Clustering

Flowchart of Canopy Clustering

Page 18: Clustering

Source: Ref [2]

Page 19: Clustering

Output of K-means on Mathematica on Same Dataset

Page 20: Clustering

Output of K-means on R on Same Dataset

Page 21: Clustering
Page 22: Clustering

Output of K-means on Microsoft Excel on Same Dataset

Page 23: Clustering
Page 24: Clustering

Output of canopy on Excel on Same Dataset

Page 25: Clustering

Complexity of K-means is O(nk) per iteration, where n is the number of objects and k is the number of centroids

With canopies, the complexity of K-means becomes O(nkf²/c)

c is the number of canopies

f is the average number of canopies that each data point falls into

As f is a very small number and c is comparatively large, the complexity is reduced

Complexity
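
A quick back-of-the-envelope comparison of the two expressions above. The values of n, k, f, and c are hypothetical, chosen only to show the arithmetic. Constant factors and the cost of building the canopies are ignored.

```python
def kmeans_cost(n, k):
    """Distance computations per iteration for plain K-means: n * k."""
    return n * k


def canopy_kmeans_cost(n, k, f, c):
    """Per-iteration estimate with canopies: n * k * f^2 / c."""
    return n * k * f ** 2 / c


# Hypothetical sizes, not measurements from the presentation.
n, k, f, c = 1_000_000, 1_000, 2, 500

plain = kmeans_cost(n, k)
with_canopies = canopy_kmeans_cost(n, k, f, c)
speedup = plain / with_canopies

print(plain, with_canopies, speedup)
```

With these numbers the factor f²/c is 4/500, so the estimated work per iteration falls by a factor of 125.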

Page 26: Clustering

Implemented K-means Algorithm

Verified Result on Mathematica, R

Implemented Canopy Clustering

Verified Result on Excel

Conclusion

Page 27: Clustering

Learning Hadoop and MapReduce

Parallelizing K-means based on MapReduce and comparing the implementations

Running all the K-means implementations on a standard dataset

Future Enhancement

Page 28: Clustering

[1] Anil K. Jain, “Data Clustering: 50 Years Beyond K-Means”

[2] Andrew McCallum et al., “Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching”

References

Page 29: Clustering

Thank You