Clustering Algorithms
Vaibhav Selot (04305813)   Md Shabeeruddin (04305901)
Roadmap
» What is Clustering?
» Motivation for Clustering Algorithms.
» Components of a Clustering Task.
» Issues in Clustering Algorithms.
» Clustering Algorithms.
» Conclusion.
What is Clustering?
Clustering is the unsupervised classification of patterns based on similarity. A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.
Motivation
• Supervised classification vs. Unsupervised classification

  Supervised:
  1. Provided with preclassified (labeled) patterns.
  2. The problem is to correctly classify a new pattern.

  Unsupervised:
  1. No such supervision in this classification.
  2. The problem is to group a given collection of unlabeled patterns into meaningful clusters.

• Exploratory data analysis applications such as Data Mining, Face Recognition, Signature Matching, Fingerprint, Image Segmentation, and Knowledge Acquisition have little prior knowledge about the data.
• Clustering can be used for exploring interrelationships among the data points in these applications.
Notations
Pattern X = (x1, . . . , xd)
where xi is a feature or attribute and d is the dimensionality of the pattern space.
Pattern set P = {X1, . . . , Xn}
P is an n x d pattern matrix.
Components of Clustering Tasks
● Pattern representation and feature selection/extraction.
● Similarity measure.
● Clustering algorithm.
● Data abstraction.
● Assessment of output.
Pattern Representation and Feature Selection/Extraction
● Patterns are represented by n-dimensional feature vectors.
● The most discriminating features are selected.
● New features are computed using feature extraction techniques.
● To reduce the dimensionality of the problem space, only a subset of features is selected for the clustering algorithm.
● Feature selection/extraction needs a good deal of domain knowledge.
Similarity Measures
A similarity measure is a function that takes two sets of data items as input and returns a measure of the similarity between them as output.
The conventional similarity measure is distance:
1) Minkowski distance
2) Manhattan distance
3) Euclidean distance
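The three conventional distances above can be illustrated with a short sketch. This is our own illustration, not part of the slides; the function name `minkowski` is an assumption, and real-valued feature vectors are assumed:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p between two feature vectors.
    p = 1 gives the Manhattan distance, p = 2 the Euclidean distance."""
    return float((np.abs(np.asarray(x) - np.asarray(y)) ** p).sum() ** (1 / p))

# Manhattan: |3| + |4| = 7; Euclidean: sqrt(9 + 16) = 5
print(minkowski([0, 0], [3, 4], 1))  # 7.0
print(minkowski([0, 0], [3, 4], 2))  # 5.0
```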
Similarity Measures
4) Mutual Neighbor Distance (MND)
s(Xi, Xj) = f(Xi, Xj, ξ)
where ξ is the context.
MND(Xi, Xj) = NN(Xi, Xj) + NN(Xj, Xi)
where NN(Xi, Xj) is the neighbor number of Xj with respect to Xi.
In fig. 1: NN(A,B) = NN(B,A) = 1; NN(B,C) = 1, NN(C,B) = 2.
So MND(A,B) = 2 and MND(B,C) = 3.
In fig. 2: MND(A,B) = 5, MND(B,C) = 3.
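As a rough illustration of the MND computation (our own sketch: Euclidean distance is assumed, the neighbor number is the rank of one point in the other's sorted neighbor list, and the three coordinates are invented, not taken from fig. 1 or fig. 2):

```python
import numpy as np

def neighbor_number(points, i, j):
    """Rank of point j among point i's neighbors, sorted by Euclidean
    distance; the nearest neighbor of i has rank 1 (i itself has rank 0)."""
    d = np.linalg.norm(points - points[i], axis=1)
    order = np.argsort(d)                    # order[0] is i itself (distance 0)
    return int(np.where(order == j)[0][0])   # position of j in the ranking

def mnd(points, i, j):
    """Mutual Neighbor Distance: NN(Xi, Xj) + NN(Xj, Xi)."""
    return neighbor_number(points, i, j) + neighbor_number(points, j, i)

pts = np.array([[0.0, 0.0], [1.0, 0.0], [2.5, 0.0]])  # three points A, B, C on a line
print(mnd(pts, 0, 1))  # 2  (A and B are each other's nearest neighbors)
print(mnd(pts, 1, 2))  # 3
```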
Similarity Measures
5) Conceptual Measure
s(Xi, Xj) = f(Xi, Xj, ξ, ζ)
where ξ is the context and ζ is a set of predefined concepts.
Clustering Algorithm
• The clustering algorithm uses a similarity measure.
• The choice of clustering algorithm depends on the desired properties of the final clustering result.
• Time and space complexity also affect this choice.
Data Abstraction & Assessment of Output
• Each cluster resulting from clustering is compactly described in terms of representative patterns such as centroids.
• Assessment of the clustering output is based on a specific criterion of optimality; the criterion selected depends on the domain.
Issues in Clustering Techniques
1. Agglomerative vs. divisive
2. Monothetic vs. polythetic
3. Hard vs. fuzzy
4. Deterministic vs. stochastic
5. Incremental vs. non-incremental
Taxonomy of Clustering Techniques
Hierarchical Clustering
Produces a nested series of partitions based on a criterion for merging/splitting clusters by similarity.
Algorithm:
1. Start with each data item in a distinct cluster.
2. Merge the two clusters with minimum distance.
3. Repeat step 2 until the clustering criterion is satisfied.
Clustering criterion: the number of desired clusters, or a threshold on distance.
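The merge loop above can be sketched as follows. This is our own naive O(n³) illustration, assuming Euclidean distance, the single-link inter-cluster distance, and "number of desired clusters" as the stopping criterion:

```python
import numpy as np

def single_link_clustering(points, k):
    """Agglomerative clustering: start with one cluster per item, then
    repeatedly merge the two closest clusters until k clusters remain."""
    clusters = [[i] for i in range(len(points))]   # step 1: singleton clusters

    def dist(a, b):
        # single-link distance: minimum pairwise distance between members
        return min(np.linalg.norm(points[i] - points[j]) for i in a for j in b)

    while len(clusters) > k:                       # step 3: repeat until criterion met
        # step 2: find and merge the two clusters with minimum distance
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: dist(clusters[pair[0]], clusters[pair[1]]))
        clusters[a] += clusters.pop(b)
    return clusters

pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
print(single_link_clustering(pts, 2))  # [[0, 1], [2, 3]]
```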
Hierarchical Clustering
Distance Measurement
Single-link clustering: minimum of the pairwise distances between the data items of the two clusters.
Complete-link clustering: maximum of the pairwise distances between the data items of the two clusters.
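The two inter-cluster distance measurements can be written directly (our own sketch: clusters are assumed to be numpy arrays of points, and the item distance is assumed Euclidean):

```python
import numpy as np

def pairwise(a, b):
    """Matrix of Euclidean distances between every item of a and every item of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def single_link(a, b):
    """Single-link: minimum of the pairwise distances."""
    return float(pairwise(a, b).min())

def complete_link(a, b):
    """Complete-link: maximum of the pairwise distances."""
    return float(pairwise(a, b).max())

c1 = np.array([[0.0, 0.0], [0.0, 1.0]])
c2 = np.array([[0.0, 3.0], [0.0, 5.0]])
print(single_link(c1, c2))    # 2.0  (closest pair: (0,1) and (0,3))
print(complete_link(c1, c2))  # 5.0  (farthest pair: (0,0) and (0,5))
```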
Hierarchical Clustering
Dendrogram
Hierarchical Clustering
Disadvantage of single-link: the chaining effect.
[Figure: single-link vs. complete-link clustering results]
Disadvantage of Hierarchical Clustering
For large data sets, constructing the dendrogram is computationally expensive.
Partitional Clustering
Identifies the partition that optimizes a criterion function. A commonly used criterion is the squared error:

e^2(P, L) = Σ_{j=1..K} Σ_{i=1..n_j} || X_i^(j) - C_j ||^2

where X_i^(j) is the i-th pattern belonging to the j-th cluster, C_j is the centroid of the j-th cluster, n_j is the number of patterns in the j-th cluster, K is the number of clusters, P is the pattern set, and L is the set of clusters.
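The squared-error criterion can be computed as a minimal sketch (ours, not from the slides; `labels[i]` is assumed to hold the cluster index of pattern i):

```python
import numpy as np

def squared_error(points, labels, k):
    """Squared-error criterion e^2(P, L): for each cluster, sum the squared
    Euclidean distances from its member patterns to the cluster centroid."""
    total = 0.0
    for j in range(k):
        members = points[labels == j]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total

pts = np.array([[0.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
lbl = np.array([0, 0, 1])
# cluster 0 has centroid (0, 1), error 1 + 1 = 2; cluster 1 has error 0
print(squared_error(pts, lbl, 2))  # 2.0
```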
kmeans Clustering Algorithm
1. Start with k clusters, initializing each cluster center with a randomly chosen pattern.
2. Assign each pattern to the closest cluster center.
3. Recompute the cluster centers using the current cluster memberships. If a convergence criterion is not met, go to step 2.
Typical convergence criteria: no (or minimal) reassignment of patterns to new clusters, or a minimal decrease in squared error.
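The three steps above can be sketched as a plain kmeans loop. This is our own illustration, assuming Euclidean distance and "no reassignment" as the convergence criterion; all names are ours:

```python
import numpy as np

def kmeans(points, k, seed=0):
    """Plain kmeans: random initial centers, then alternate assignment and
    center recomputation until no pattern changes cluster."""
    rng = np.random.default_rng(seed)
    # step 1: initialize each center with a randomly chosen pattern
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    while True:
        # step 2: assign each pattern to the closest cluster center
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # convergence: no reassignment
            return centers, labels
        labels = new_labels
        # step 3: recompute the centers from the current memberships
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)

pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers, labels = kmeans(pts, 2)
```

On this toy data the two tight pairs always end up in separate clusters, whichever patterns the random initialization picks.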
kmeans Clustering Algorithm
Example
[Figure: kmeans iterations on sample data]
Disadvantages:
1. The number of clusters must be known in advance.
2. The algorithm is sensitive to the selection of the initial partition.
3. Not suitable for discovering clusters with nonconvex shapes.
kmeans Clustering Algorithm
1. Will the algorithm terminate?
Yes. In each step, we shift a data item Xi from cluster Cm to Cn only when || Xi - Cn || < || Xi - Cm ||, so the squared error strictly decreases. Since there are only finitely many partitions, the algorithm must terminate.
2. Will it find an optimal clustering?
Not necessarily: it converges to a local minimum of the squared error, which depends on the initial partition.
kmeans Clustering Algorithm
3. Why choose the centroid when calculating the sum-of-squared error?
At a minimum of the error, the partial derivative of the error with respect to the center location must be zero, and the centroid is the point that satisfies this condition.
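The statement above can be verified directly: setting the gradient of the within-cluster squared error with respect to the center to zero yields the centroid.

```latex
\frac{\partial}{\partial C_j}\sum_{i=1}^{n_j}\left\| X_i^{(j)} - C_j \right\|^2
  = -2\sum_{i=1}^{n_j}\left( X_i^{(j)} - C_j \right) = 0
\quad\Longrightarrow\quad
C_j = \frac{1}{n_j}\sum_{i=1}^{n_j} X_i^{(j)}
```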
Time Complexities
Single-link algorithm: O(n²).
Complete-link algorithm: O(n³); n merging steps, each requiring O(n²) comparisons.
kmeans algorithm: O(n); each iteration requires O(n) operations, and a constant number of iterations is assumed.
Conclusions
The kmeans algorithm is very efficient in terms of computational time, but it is very sensitive to the initial choice of centers and is less versatile. Hierarchical algorithms, on the other hand, are more versatile but computationally expensive. In practice, kmeans and its variants are used for large data sets in preference to hierarchical algorithms.
References
1. A.K. Jain, M.N. Murty, and P.J. Flynn. Data Clustering: A Review. ACM Computing Surveys, 1999.
2. Andrew W. Moore. K-means and Hierarchical Clustering. www.cs.cmu.edu/~awm/tutorials.
3. Hassan. Clustering Algorithms. http://www.carleton.ca/~hmasum/clustering.html
Proposal for Project
Observing the performance of various clustering algorithms on sample data sets.