Clustering  Algorithms Vaibhav Selot  (04305813) Md  Shabeeruddin (04305901)


Page 1: Vaibhav Selot (04305813) Md Shabeeruddin (04305901)cs621-2011/cs621-2007/old/old_site/seminar_slides/...Pattern Representation and Feature Selection/Extraction Patterns are represented

Clustering  Algorithms

Vaibhav Selot (04305813), Md Shabeeruddin (04305901)

Page 2

Roadmap

» What is Clustering?

» Motivation for Clustering Algorithms.

» Components of Clustering task.

» Issues in Clustering Algorithms.

» Clustering Algorithms.

» Conclusion.

Page 3

What is Clustering?  

Clustering is the unsupervised classification of patterns based on similarity. A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.

Page 4

Motivation

• Supervised classification vs. unsupervised classification

Supervised classification:

1. Pre-classified (labeled) patterns are provided.

2. The problem is to correctly classify a new pattern.

Unsupervised classification:

1. No such supervision is available.

2. The problem is to group a given collection of unlabeled patterns into meaningful clusters.

• Exploratory data analysis applications such as data mining, face recognition, signature matching, fingerprint matching, image segmentation, and knowledge acquisition have little prior information about the data.

• Clustering can be used to explore the interrelationships among the data points in these applications.

Page 5

Notations

Pattern X = (x_1, ..., x_d)

where x_i = a feature (attribute) of the pattern

d = dimensionality of the pattern space

Pattern set P = {X_1, ..., X_n}

P is an n x d pattern matrix.

Page 6

Components of Clustering Tasks

● Pattern representation and feature selection/extraction.

● Similarity measure.

● Clustering algorithm.

● Data abstraction.

● Assessment of output.

Page 7

Pattern  Representation and Feature Selection/Extraction

● Patterns are represented by d-dimensional feature vectors.

● The most discriminating features are selected.

● New features are computed using feature extraction techniques.

● To reduce the dimensionality of the problem space, only a subset of the features is selected for the clustering algorithm.

● Feature selection/extraction requires a good deal of domain knowledge.

Page 8

Similarity Measures

A similarity measure is a function that takes two data items (or two sets of data items) as input and returns a measure of the similarity between them.

The conventional similarity measure is distance: the smaller the distance, the more similar the items.

1) Minkowski distance

2) Manhattan distance

3) Euclidean distance
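The three distances listed for this slide are closely related: the Minkowski distance of order p is (sum over k of |x_k - y_k|^p)^(1/p), and the Manhattan and Euclidean distances are its special cases p = 1 and p = 2. A minimal sketch follows; the function names are illustrative and not taken from the slides.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length feature vectors."""
    if len(x) != len(y):
        raise ValueError("patterns must have the same dimensionality")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def manhattan_distance(x, y):
    """Special case p = 1: sum of absolute coordinate differences."""
    return minkowski_distance(x, y, 1)


def euclidean_distance(x, y):
    """Special case p = 2: straight-line distance."""
    return minkowski_distance(x, y, 2)


print(manhattan_distance((0, 0), (3, 4)))  # 7.0
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```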

Page 9

Similarity Measures

4) Mutual Neighbor Distance (MND)

s(X_i, X_j) = f(X_i, X_j, ξ)

where ξ is the context (the set of surrounding points).

MND(X_i, X_j) = NN(X_i, X_j) + NN(X_j, X_i)

where NN(X_i, X_j) is the neighbor number of X_j with respect to X_i.

In fig. 1: NN(A,B) = NN(B,A) = 1, NN(B,C) = 1, NN(C,B) = 2, so MND(A,B) = 2 and MND(B,C) = 3.

In fig. 2: MND(A,B) = 5 and MND(B,C) = 3.
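The mutual neighbor distance depends on the surrounding points, which a short computation can make concrete. The sketch below ranks neighbors by Euclidean distance. The figures are not part of this transcript, so the coordinates and the extra points D, E, and F are hypothetical, chosen only so that the MND totals match the values quoted for the two figures.

```python
import math


def neighbor_number(points, i, j):
    """NN(X_i, X_j): rank of point j among the neighbors of point i (1 = nearest)."""
    others = sorted((k for k in points if k != i),
                    key=lambda k: math.dist(points[i], points[k]))
    return others.index(j) + 1


def mnd(points, i, j):
    """Mutual neighbor distance: NN(X_i, X_j) + NN(X_j, X_i)."""
    return neighbor_number(points, i, j) + neighbor_number(points, j, i)


# Hypothetical layout for the first figure: three points on a line.
fig1 = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (2.2, 0.0)}
# Hypothetical layout for the second figure: three extra points placed near A.
fig2 = dict(fig1, D=(-0.5, 0.0), E=(-0.3, 0.5), F=(-0.3, -0.5))

print(mnd(fig1, "A", "B"), mnd(fig1, "B", "C"))  # 2 3
print(mnd(fig2, "A", "B"), mnd(fig2, "B", "C"))  # 5 3
```

Adding points near A changes MND(A,B) even though A and B themselves have not moved, which is the sense in which the measure takes the context into account.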

Page 10

Similarity Measures

5) Conceptual Measure

s(X_i, X_j) = f(X_i, X_j, ξ, ζ)

where

ξ is the context

ζ is a set of predefined concepts

Page 11

Clustering Algorithm

• A clustering algorithm uses a similarity measure to group patterns.

• The choice of clustering algorithm depends on the desired properties of the final clustering result.

• Time and space complexity also affect this choice.

Page 12

Data Abstraction & Assessment of Output

• Each cluster produced by the algorithm is described compactly in terms of representative patterns such as centroids.

• Assessment of the clustering output is based on a specific criterion of optimality. The criterion selected depends on the domain.

Page 13

Issues in Clustering Techniques

1. Agglomerative vs. divisive

2. Monothetic vs. polythetic

3. Hard vs. fuzzy

4. Deterministic vs. stochastic

5. Incremental vs. non-incremental

Page 14

Taxonomy of Clustering Techniques

Page 15

Hierarchical Clustering

Produces a nested series of partitions, using a similarity-based criterion for merging or splitting clusters.

Algorithm (agglomerative)

1. Start with each data item in a distinct cluster.

2. Merge the two clusters with the minimum distance between them.

3. Repeat step 2 until the clustering criterion is satisfied.

Clustering criterion

• Number of desired clusters.

• Threshold on distance.
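The merging loop and its two stopping criteria can be sketched in a few lines. This is an illustrative, unoptimized version rather than the authors' implementation: the parameter names `num_clusters`, `max_distance`, and `linkage` are assumptions, and the distance between two clusters is taken as `linkage` applied to all pairwise Euclidean distances (`min` by default).

```python
import math


def agglomerative(points, num_clusters=1, max_distance=None, linkage=min):
    """Merge the two closest clusters until a stopping criterion is met."""
    clusters = [[p] for p in points]          # step 1: one cluster per item
    while len(clusters) > num_clusters:       # criterion: desired number of clusters
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(math.dist(p, q)
                            for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if max_distance is not None and d > max_distance:
            break                             # criterion: threshold on distance
        clusters[a] = clusters[a] + clusters[b]   # step 2: merge the closest pair
        del clusters[b]
    return clusters


points = [(0, 0), (0, 1), (5, 0), (5, 1), (9, 0)]
print(agglomerative(points, num_clusters=2))
# [[(0, 0), (0, 1)], [(5, 0), (5, 1), (9, 0)]]
print(agglomerative(points, max_distance=2))
# [[(0, 0), (0, 1)], [(5, 0), (5, 1)], [(9, 0)]]
```

Recording the merge distance at each step, instead of stopping early, would give the information needed to draw the full nested series of partitions.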

Page 16

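Hierarchical Clustering

Distance measurement between clusters

Single-link clustering

The distance between two clusters is the minimum of all pairwise distances between the data items of the two clusters.

Complete-link clustering

The distance between two clusters is the maximum of all pairwise distances between the data items of the two clusters.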
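The two cluster-to-cluster distances differ only in whether the minimum or the maximum pairwise distance is taken. A small sketch, assuming Euclidean distance between items and using made-up sample clusters:

```python
import math
from itertools import product


def single_link(c1, c2):
    """Minimum pairwise distance between items of the two clusters."""
    return min(math.dist(p, q) for p, q in product(c1, c2))


def complete_link(c1, c2):
    """Maximum pairwise distance between items of the two clusters."""
    return max(math.dist(p, q) for p, q in product(c1, c2))


left = [(0.0, 0.0), (1.0, 0.0)]
right = [(3.0, 0.0), (6.0, 0.0)]
print(single_link(left, right))    # 2.0
print(complete_link(left, right))  # 6.0
```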

Page 17

Dendrogram

Hierarchical Clustering

Page 18

Hierarchical Clustering

Disadvantage of single-link clustering: the chaining effect

[Figure: single-link and complete-link clustering results]

Page 19

Disadvantage of Hierarchical Clustering

For large data sets, constructing the dendrogram is computationally expensive.

Partitional Clustering

Identifies the partition that optimizes a criterion function.

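Criterion function (squared error)

$$e^2(P, L) = \sum_{j=1}^{K} \sum_{i=1}^{n_j} \left\lVert X_i^{(j)} - C_j \right\rVert^2$$

where X_i^{(j)} is the i-th pattern belonging to the j-th cluster,

C_j is the centroid of the j-th cluster,

n_j is the number of patterns in the j-th cluster,

K is the number of clusters,

P is the pattern set, and

L is the set of clusters.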
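As a worked illustration of the criterion, the sketch below computes the squared error of a given partition: for every cluster, it sums the squared Euclidean distances from each pattern to the cluster centroid. It assumes the criterion is the usual sum of squared errors; the function names and the sample data are illustrative.

```python
def centroid(cluster):
    """Coordinate-wise mean of the patterns in one cluster."""
    dims = len(cluster[0])
    return tuple(sum(x[k] for x in cluster) / len(cluster) for k in range(dims))


def squared_error(clusters):
    """Sum over clusters of squared distances from each pattern to its centroid."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        for x in cluster:
            total += sum((xk - ck) ** 2 for xk, ck in zip(x, c))
    return total


clusters = [[(0.0, 0.0), (0.0, 2.0)], [(4.0, 0.0), (6.0, 0.0)]]
print(squared_error(clusters))  # 4.0
```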

Page 20

Partitional Clustering

k-means Clustering Algorithm

1. Start with k clusters and initialize each cluster center with a randomly chosen pattern.

2. Assign each pattern to the closest cluster center.

3. Recompute the cluster centers using the current cluster memberships. If the convergence criterion is not met, go to step 2.

Typical convergence criteria are: no (or minimal) reassignment of patterns to new clusters, or minimal decrease in squared error.
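The three steps map directly onto a short loop. The following is a minimal sketch, not a reference implementation: it uses Euclidean distance, stops when no pattern is reassigned, and leaves a center unchanged if its cluster becomes empty. The `initial_centers`, `seed`, and `max_iter` parameters are assumptions added for reproducibility.

```python
import math
import random


def kmeans(patterns, k, initial_centers=None, seed=0, max_iter=100):
    """Return (centers, assignment), where assignment[i] is the cluster of patterns[i]."""
    # Step 1: pick k patterns at random as the initial cluster centers.
    if initial_centers is None:
        centers = random.Random(seed).sample(patterns, k)
    else:
        centers = list(initial_centers)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each pattern to the closest cluster center.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(x, centers[j])) for x in patterns
        ]
        if new_assignment == assignment:
            break  # convergence: no pattern changed cluster
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its current members.
        for j in range(k):
            members = [x for x, a in zip(patterns, assignment) if a == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


patterns = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignment = kmeans(patterns, k=2)
print(centers, assignment)
```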

Page 21

k-means Clustering Algorithm

[Figure: example of k-means clustering]

Disadvantages

1. The number of clusters must be known in advance.

2. The algorithm is sensitive to the selection of the initial partition.

3. It is not suitable for discovering clusters with non-convex shapes.
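The sensitivity to the initial partition can be seen on a tiny example. The sketch below runs the assign/recompute loop from two different sets of initial centers on the corners of a 4 x 1 rectangle. The data and the helper name `lloyd` are made up for illustration; both runs converge, but to partitions with very different squared error.

```python
import math


def lloyd(patterns, centers, max_iter=100):
    """Run the assign/recompute loop from the given centers; return (assignment, sse)."""
    centers = list(centers)
    assignment = None
    for _ in range(max_iter):
        new = [min(range(len(centers)), key=lambda j: math.dist(x, centers[j]))
               for x in patterns]
        if new == assignment:
            break  # no reassignment: converged
        assignment = new
        for j in range(len(centers)):
            members = [x for x, a in zip(patterns, assignment) if a == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    sse = sum(math.dist(x, centers[a]) ** 2 for x, a in zip(patterns, assignment))
    return assignment, sse


corners = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
print(lloyd(corners, [(0.0, 0.0), (4.0, 0.0)]))  # ([0, 0, 1, 1], 1.0)
print(lloyd(corners, [(0.0, 0.0), (0.0, 1.0)]))  # ([0, 1, 0, 1], 16.0)
```

A common mitigation, not discussed on the slide, is to run the algorithm from several initializations and keep the result with the lowest squared error.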

Page 22

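k-means Clustering Algorithm

1. Will the algorithm terminate?

In each step, we move data item X_i from cluster C_m to cluster C_n only when

|X_i - C_n| < |X_i - C_m|

so each reassignment decreases the squared error, and recomputing the centroids never increases it. Because there are only finitely many ways to partition n patterns into k clusters, no partition can repeat, and the algorithm must terminate.

2. Will it find an optimal clustering?

Not necessarily. The algorithm converges to a local minimum of the squared error, and which one it reaches depends on the initial partition.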

Page 23

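k-means Clustering Algorithm

3. Why choose the centroid when calculating the sum of squared errors?

At a minimum, the partial derivative of the error with respect to each center location must be zero, and this condition is satisfied when each center is the centroid (mean) of the patterns in its cluster.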
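The zero-derivative condition can be worked out in two lines. The derivation below assumes the error is the sum of squared Euclidean distances from each pattern X_i^{(j)} to its cluster center C_j, with n_j patterns in cluster j and K clusters.

```latex
\frac{\partial}{\partial C_j} \sum_{j'=1}^{K} \sum_{i=1}^{n_{j'}}
    \left\lVert X_i^{(j')} - C_{j'} \right\rVert^2
  = -2 \sum_{i=1}^{n_j} \left( X_i^{(j)} - C_j \right) = 0
\quad\Longrightarrow\quad
C_j = \frac{1}{n_j} \sum_{i=1}^{n_j} X_i^{(j)}
```

Only the terms of cluster j depend on C_j, so the condition decouples per cluster, and the minimizing center is the mean of that cluster's patterns.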

Page 24

Time Complexities

Single-link algorithm: O(n^2)

Complete-link algorithm: O(n^3)

• n merging steps, each requiring O(n^2) comparisons

k-means algorithm: O(n)

• Each iteration requires O(n) time (for a fixed number of clusters and features)

• Constant number of iterations

Page 25

Conclusions

The k-means algorithm is very efficient in terms of computational time, but it is very sensitive to the initial choice of cluster centers and is less versatile.

Hierarchical algorithms, on the other hand, are more versatile but computationally expensive.

In practice, k-means and its variants are preferred over hierarchical algorithms for large data sets.


Page 27

Proposal for Project

Observing the performance of various clustering algorithms on sample data sets.