Clustering Algorithms
Vaibhav Selot (04305813)   Md Shabeeruddin (04305901)
Roadmap
» What is Clustering?
» Motivation for Clustering Algorithms.
» Components of a Clustering Task.
» Issues in Clustering Algorithms.
» Clustering Algorithms.
» Conclusion.
What is Clustering?
Clustering is the unsupervised classification of patterns based on similarity. A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.
Motivation
• Supervised classification vs. Unsupervised classification

  Supervised:
  1. Provided with preclassified (labeled) patterns.
  2. The problem is to correctly classify a new pattern.

  Unsupervised:
  1. No such supervision in this classification.
  2. The problem is to group a given collection of unlabeled patterns into meaningful clusters.

• Exploratory data analysis applications such as Data Mining, Face Recognition, Signature Matching, Fingerprint, Image Segmentation, and Knowledge Acquisition have little prior knowledge about the data.
• Clustering can be used for exploring interrelationships among the data points in these applications.
Notations
Pattern X = (x1, . . . , xd)
where xi is a feature or attribute and d is the dimensionality of the pattern space.
Pattern set P = {X1, . . . , Xn}
P is an n x d pattern matrix.
Components of Clustering Tasks
● Pattern representation and feature selection/extraction.
● Similarity measure.
● Clustering algorithm.
● Data abstraction.
● Assessment of output.
Pattern Representation and Feature Selection/Extraction
● Patterns are represented by n-dimensional feature vectors.
● The most discriminating features are selected.
● New features are computed using feature extraction techniques.
● To reduce the dimensionality of the problem space, only a subset of features is selected for the clustering algorithm.
● Feature selection/extraction needs a good deal of domain knowledge.
Similarity Measures
A similarity measure is a function that takes two sets of data items as input and returns a measure of the similarity between them as output.
The conventional similarity measure is distance:
1) Minkowski distance
2) Manhattan distance
3) Euclidean distance
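The three conventional distances above can be illustrated with a short sketch. This is our own illustration, not part of the slides; the function name `minkowski` is an assumption, and real-valued feature vectors are assumed:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p between two feature vectors.
    p = 1 gives the Manhattan distance, p = 2 the Euclidean distance."""
    return float((np.abs(np.asarray(x) - np.asarray(y)) ** p).sum() ** (1 / p))

# Manhattan: |3| + |4| = 7; Euclidean: sqrt(9 + 16) = 5
print(minkowski([0, 0], [3, 4], 1))  # 7.0
print(minkowski([0, 0], [3, 4], 2))  # 5.0
```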
Similarity Measures
4) Mutual Neighbor Distance (MND)
s(Xi, Xj) = f(Xi, Xj, ξ)
where ξ is the context.
MND(Xi, Xj) = NN(Xi, Xj) + NN(Xj, Xi)
where NN(Xi, Xj) is the neighbor number of Xj with respect to Xi.
In fig. 1: NN(A,B) = NN(B,A) = 1; NN(B,C) = 1, NN(C,B) = 2.
So MND(A,B) = 2 and MND(B,C) = 3.
In fig. 2: MND(A,B) = 5, MND(B,C) = 3.
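As a rough illustration of the MND computation (our own sketch: Euclidean distance is assumed, the neighbor number is the rank of one point in the other's sorted neighbor list, and the three coordinates are invented, not taken from fig. 1 or fig. 2):

```python
import numpy as np

def neighbor_number(points, i, j):
    """Rank of point j among point i's neighbors, sorted by Euclidean
    distance; the nearest neighbor of i has rank 1 (i itself has rank 0)."""
    d = np.linalg.norm(points - points[i], axis=1)
    order = np.argsort(d)                    # order[0] is i itself (distance 0)
    return int(np.where(order == j)[0][0])   # position of j in the ranking

def mnd(points, i, j):
    """Mutual Neighbor Distance: NN(Xi, Xj) + NN(Xj, Xi)."""
    return neighbor_number(points, i, j) + neighbor_number(points, j, i)

pts = np.array([[0.0, 0.0], [1.0, 0.0], [2.5, 0.0]])  # three points A, B, C on a line
print(mnd(pts, 0, 1))  # 2  (A and B are each other's nearest neighbors)
print(mnd(pts, 1, 2))  # 3
```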
Similarity Measures
5) Conceptual Measure
s(Xi, Xj) = f(Xi, Xj, ξ, ζ)
where ξ is the context and ζ is a set of predefined concepts.
Clustering Algorithm
• The clustering algorithm uses a similarity measure.
• The choice of clustering algorithm depends on the desired properties of the final clustering result.
• Time and space complexity also affect this choice.
Data Abstraction & Assessment of Output
• Each cluster resulting from clustering is compactly described in terms of representative patterns such as centroids.
• Assessment of the clustering output is based on a specific criterion of optimality; the criterion selected depends on the domain.
Issues in Clustering Techniques
1. Agglomerative vs. divisive
2. Monothetic vs. polythetic
3. Hard vs. fuzzy
4. Deterministic vs. stochastic
5. Incremental vs. non-incremental
Taxonomy of Clustering Techniques
Hierarchical Clustering
Produces a nested series of partitions based on a criterion for merging/splitting clusters by similarity.
Algorithm:
1. Start with each data item in a distinct cluster.
2. Merge the two clusters with minimum distance.
3. Repeat step 2 until the clustering criterion is satisfied.
Clustering criterion: the number of desired clusters, or a threshold on distance.
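The merge loop above can be sketched as follows. This is our own naive O(n³) illustration, assuming Euclidean distance, the single-link inter-cluster distance, and "number of desired clusters" as the stopping criterion:

```python
import numpy as np

def single_link_clustering(points, k):
    """Agglomerative clustering: start with one cluster per item, then
    repeatedly merge the two closest clusters until k clusters remain."""
    clusters = [[i] for i in range(len(points))]   # step 1: singleton clusters

    def dist(a, b):
        # single-link distance: minimum pairwise distance between members
        return min(np.linalg.norm(points[i] - points[j]) for i in a for j in b)

    while len(clusters) > k:                       # step 3: repeat until criterion met
        # step 2: find and merge the two clusters with minimum distance
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: dist(clusters[pair[0]], clusters[pair[1]]))
        clusters[a] += clusters.pop(b)
    return clusters

pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
print(single_link_clustering(pts, 2))  # [[0, 1], [2, 3]]
```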
Hierarchical Clustering
Distance Measurement
Single-link clustering: minimum of the pairwise distances between the data items of the two clusters.
Complete-link clustering: maximum of the pairwise distances between the data items of the two clusters.
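The two inter-cluster distance measurements can be written directly (our own sketch: clusters are assumed to be numpy arrays of points, and the item distance is assumed Euclidean):

```python
import numpy as np

def pairwise(a, b):
    """Matrix of Euclidean distances between every item of a and every item of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def single_link(a, b):
    """Single-link: minimum of the pairwise distances."""
    return float(pairwise(a, b).min())

def complete_link(a, b):
    """Complete-link: maximum of the pairwise distances."""
    return float(pairwise(a, b).max())

c1 = np.array([[0.0, 0.0], [0.0, 1.0]])
c2 = np.array([[0.0, 3.0], [0.0, 5.0]])
print(single_link(c1, c2))    # 2.0  (closest pair: (0,1) and (0,3))
print(complete_link(c1, c2))  # 5.0  (farthest pair: (0,0) and (0,5))
```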
Hierarchical Clustering
Dendrogram
Hierarchical Clustering
Disadvantage of single-link: the chaining effect.
[Figure: single-link vs. complete-link clustering results]
Disadvantage of Hierarchical Clustering
For large data sets, constructing the dendrogram is computationally expensive.
Partitional Clustering
Identifies the partition that optimizes a criterion function. A commonly used criterion is the squared error:

e^2(P, L) = Σ_{j=1..K} Σ_{i=1..n_j} || X_i^(j) - C_j ||^2

where X_i^(j) is the i-th pattern belonging to the j-th cluster, C_j is the centroid of the j-th cluster, n_j is the number of patterns in the j-th cluster, K is the number of clusters, P is the pattern set, and L is the set of clusters.
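The squared-error criterion can be computed as a minimal sketch (ours, not from the slides; `labels[i]` is assumed to hold the cluster index of pattern i):

```python
import numpy as np

def squared_error(points, labels, k):
    """Squared-error criterion e^2(P, L): for each cluster, sum the squared
    Euclidean distances from its member patterns to the cluster centroid."""
    total = 0.0
    for j in range(k):
        members = points[labels == j]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total

pts = np.array([[0.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
lbl = np.array([0, 0, 1])
# cluster 0 has centroid (0, 1), error 1 + 1 = 2; cluster 1 has error 0
print(squared_error(pts, lbl, 2))  # 2.0
```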
kmeans Clustering Algorithm
1. Start with k clusters, initializing each cluster center with a randomly chosen pattern.
2. Assign each pattern to the closest cluster center.
3. Recompute the cluster centers using the current cluster memberships. If a convergence criterion is not met, go to step 2.
Typical convergence criteria: no (or minimal) reassignment of patterns to new clusters, or a minimal decrease in squared error.
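The three steps above can be sketched as a plain kmeans loop. This is our own illustration, assuming Euclidean distance and "no reassignment" as the convergence criterion; all names are ours:

```python
import numpy as np

def kmeans(points, k, seed=0):
    """Plain kmeans: random initial centers, then alternate assignment and
    center recomputation until no pattern changes cluster."""
    rng = np.random.default_rng(seed)
    # step 1: initialize each center with a randomly chosen pattern
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    while True:
        # step 2: assign each pattern to the closest cluster center
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # convergence: no reassignment
            return centers, labels
        labels = new_labels
        # step 3: recompute the centers from the current memberships
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)

pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers, labels = kmeans(pts, 2)
```

On this toy data the two tight pairs always end up in separate clusters, whichever patterns the random initialization picks.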
kmeans Clustering Algorithm
Example
[Figure: kmeans iterations on sample data]
Disadvantages:
1. The number of clusters must be known in advance.
2. The algorithm is sensitive to the selection of the initial partition.
3. Not suitable for discovering clusters with nonconvex shapes.
kmeans Clustering Algorithm
1. Will the algorithm terminate?
Yes. In each step, we shift a data item Xi from cluster Cm to Cn only when || Xi - Cn || < || Xi - Cm ||, so the squared error strictly decreases. Since there are only finitely many partitions, the algorithm must terminate.
2. Will it find an optimal clustering?
Not necessarily: it converges to a local minimum of the squared error, which depends on the initial partition.
kmeans Clustering Algorithm
3. Why choose the centroid when calculating the sum-of-squared error?
At a minimum of the error, the partial derivative of the error with respect to the center location must be zero, and the centroid is the point that satisfies this condition.
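The statement above can be verified directly: setting the gradient of the within-cluster squared error with respect to the center to zero yields the centroid.

```latex
\frac{\partial}{\partial C_j}\sum_{i=1}^{n_j}\left\| X_i^{(j)} - C_j \right\|^2
  = -2\sum_{i=1}^{n_j}\left( X_i^{(j)} - C_j \right) = 0
\quad\Longrightarrow\quad
C_j = \frac{1}{n_j}\sum_{i=1}^{n_j} X_i^{(j)}
```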
Time Complexities
Single-link algorithm: O(n²).
Complete-link algorithm: O(n³); n merging steps, each requiring O(n²) comparisons.
kmeans algorithm: O(n); each iteration requires O(n) operations, and a constant number of iterations is assumed.
Conclusions
The kmeans algorithm is very efficient in terms of computational time, but it is very sensitive to the initial choice of centers and is less versatile. Hierarchical algorithms, on the other hand, are more versatile but computationally expensive. In practice, kmeans and its variants are used for large data sets in preference to hierarchical algorithms.
References
1. A.K. Jain, M.N. Murty, and P.J. Flynn. Data Clustering: A Review. ACM Computing Surveys, 1999.
2. Andrew W. Moore. K-means and Hierarchical Clustering. www.cs.cmu.edu/~awm/tutorials.
3. Hassan. Clustering Algorithms. http://www.carleton.ca/~hmasum/clustering.html
Proposal for Project
Observing the performance of various clustering algorithms on sample data sets.