Transcript of CSE4334/5334 Data Mining Clustering - University of Texas at Arlington
CSE4334/5334 Data Mining: Clustering
Chengkai Li University of Texas at Arlington Fall 2016
What is Cluster Analysis? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
Well-separated clusters; center-based clusters; contiguous clusters; density-based clusters; shared property or conceptual; described by an objective function.
Types of Clusters Well-Separated
Well-Separated Clusters
3 well-separated clusters
Types of Clusters Center-Based
Center-based
centroid/medoid
4 center-based clusters
Types of Clusters Contiguity-Based
Contiguous Cluster (Nearest neighbor or Transitive)
8 contiguous clusters
Types of Clusters Density-Based
Density-based
6 density-based clusters
Types of Clusters Conceptual Clusters
Shared Property or Conceptual Clusters
2 Overlapping Circles
Types of Clusters Objective Function
Clusters Defined by an Objective Function
Types of Clusters Objective Function …
Map the clustering problem to a different domain and solve a related problem in that domain
Characteristics of the Input Data Are Important
Type of proximity or density measure
Sparseness
Attribute type
Type of Data
Dimensionality Noise and Outliers Type of Distribution
Clustering Algorithms
K-means and its variants
Hierarchical clustering
K-means Clustering
o Partitional clustering approach
o Each cluster is associated with a centroid (center point)
o Each point is assigned to the cluster with the closest centroid
o Number of clusters, K, must be specified
o The basic algorithm is very simple
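The basic algorithm described above can be sketched in a few lines. This is an illustrative toy implementation (the function and variable names are mine, not from the slides), not production code:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Basic K-means on a list of coordinate tuples (illustrative sketch)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initial centroids chosen randomly
    clusters = []
    for _ in range(iters):
        # Assign each point to the cluster with the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if empty).
        new = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]
        if new == centroids:                   # converged: centroids stopped moving
            break
        centroids = new
    return centroids, clusters
```

Real implementations (e.g., scikit-learn's KMeans) add smarter initialization and vectorization on top of this same loop.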
K-means Clustering – Details
Initial centroids are often chosen randomly
The centroid m_i is (typically) the mean of the points in the cluster. 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc. K-means will converge for the common similarity measures mentioned above. Most of the convergence happens in the first few iterations.
Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes.
$m_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$
Two different K-means Clusterings
[Figure: the same set of original points clustered two ways: a sub-optimal clustering and the optimal clustering.]
Importance of Choosing Initial Centroids
[Figure: Iterations 1–6 of K-means from one choice of initial centroids.]
Importance of Choosing Initial Centroids
[Figure: Iterations 1–6 of K-means for the same data, shown step by step.]
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$

where $x$ is a data point in cluster $C_i$ and $m_i$ is the representative (centroid) of cluster $C_i$.
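Given a clustering and its centroids, the SSE can be computed directly; a minimal sketch (the names are illustrative):

```python
def sse(clusters, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for cl, m in zip(clusters, centroids):
        for p in cl:
            total += sum((a - b) ** 2 for a, b in zip(p, m))
    return total
```

Given two clusterings of the same data, the one with the smaller SSE is preferred.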
Importance of Choosing Initial Centroids …
[Figure: Iterations 1–5 of K-means from a different choice of initial centroids.]
Importance of Choosing Initial Centroids …
[Figure: Iterations 1–5 of K-means from another choice of initial centroids.]
Problems with Selecting Initial Points
If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small.
10 Clusters Example
[Figure: Iterations 1–4 of K-means on the 10-cluster data set.]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
[Figure: Iterations 1–4 of K-means on the 10-cluster data set.]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one.
[Figure: Iterations 1–4 of K-means on the 10-cluster data set.]
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one.
[Figure: Iterations 1–4 of K-means on the 10-cluster data set.]
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than K initial centroids and then select among these initial centroids
o Postprocessing
o Bisecting K-means
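The "multiple runs" remedy can be sketched as follows: run K-means from several random initializations and keep the clustering with the lowest SSE (all names here are illustrative):

```python
import random

def euclid2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_once(points, k, rng, iters=100):
    """One K-means run from a random initialization; returns (sse, centroids)."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda j: euclid2(p, centroids[j]))].append(p)
        centroids = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    err = sum(euclid2(p, centroids[j]) for j, cl in enumerate(clusters) for p in cl)
    return err, centroids

def kmeans_multistart(points, k, runs=10, seed=0):
    """Multiple runs with different random initial centroids; keep the lowest SSE."""
    rng = random.Random(seed)
    return min(kmeans_once(points, k, rng) for _ in range(runs))
```

This helps, but as the slide on selecting initial points notes, the probability of a good initialization shrinks as K grows, so multiple runs alone are not a complete fix.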
Handling Empty Clusters
The basic K-means algorithm can yield empty clusters.
Several strategies exist for handling them, e.g., replacing an empty cluster's centroid with the point that contributes most to SSE, or with a point from the cluster with the highest SSE.
Updating Centers Incrementally
In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid.
An alternative is to update the centroids after each assignment (incremental approach).
Pre-processing and Post-processing
Pre-processing
Post-processing
Bisecting K-means
Bisecting K-means algorithm
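A simplified sketch of the bisecting idea: keep splitting the cluster with the largest SSE using 2-means until K clusters remain. The textbook version tries several trial bisections per step and keeps the best; this sketch does a single trial, and all names are illustrative:

```python
import random

def euclid2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def cluster_sse(cl):
    m = tuple(sum(v) / len(cl) for v in zip(*cl))
    return sum(euclid2(p, m) for p in cl)

def two_means(points, rng, iters=50):
    """Split one cluster into two with plain 2-means (single trial)."""
    centroids = rng.sample(points, 2)
    halves = [points[:1], points[1:]]
    for _ in range(iters):
        halves = [[], []]
        for p in points:
            halves[0 if euclid2(p, centroids[0]) <= euclid2(p, centroids[1]) else 1].append(p)
        centroids = [tuple(sum(v) / len(h) for v in zip(*h)) if h else centroids[i]
                     for i, h in enumerate(halves)]
    if not halves[0] or not halves[1]:
        halves = [points[:1], points[1:]]      # degenerate split: fall back
    return halves

def bisecting_kmeans(points, k, seed=0):
    """Repeatedly bisect the cluster with the largest SSE until k clusters remain."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        splittable = [i for i, c in enumerate(clusters) if len(c) > 1]
        worst = max(splittable, key=lambda i: cluster_sse(clusters[i]))
        clusters.extend(two_means(clusters.pop(worst), rng))
    return clusters
```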
Bisecting K-means Example
Limitations of K-means
K-means has problems when clusters are of differing sizes, densities, or non-globular shapes.
K-means also has problems when the data contains outliers.
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters: find parts of clusters, which then need to be put together.
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram.
[Figure: dendrogram and the corresponding nested clusters for six points.]
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters: any desired number can be obtained by cutting the dendrogram at the proper level.
The resulting clusters may correspond to meaningful taxonomies.
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm: the more popular hierarchical clustering technique.
Basic algorithm is straightforward: compute the proximity matrix; let each data point be a cluster; then repeat (merge the two closest clusters, update the proximity matrix) until only a single cluster remains.
Key operation is the computation of the proximity of two clusters.
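The merge loop can be sketched directly. This naive version recomputes pairwise proximities at every step (so it is slow for large n); the linkage parameter decides how inter-cluster proximity is computed (min gives single link, max gives complete link). Names are illustrative:

```python
def agglomerative(points, dist, linkage=min):
    """Basic agglomerative clustering: start with singleton clusters and
    repeatedly merge the two closest, recording each merge (the dendrogram).
    linkage=min gives single link; linkage=max gives complete link."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest inter-cluster proximity.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))  # record the merge and its height
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```

The recorded merge heights are what a dendrogram plots on its vertical axis.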
Starting Situation
Start with clusters of individual points and a proximity matrix
[Figure: points p1–p5 and the initial proximity matrix.]
Intermediate Situation
After some merging steps we have some clusters
[Figure: clusters C1–C5 and their proximity matrix.]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
[Figure: clusters C1–C5 and their proximity matrix, with C2 and C5 about to be merged.]
After Merging
The question is: "How do we update the proximity matrix?"
[Figure: clusters after merging C2 and C5, with the proximity matrix rows and columns for C2 U C5 to be updated.]
How to Define Inter-Cluster Similarity
[Figure: points p1–p5 and their proximity matrix.]
Approaches: MIN, MAX, Group Average, Distance Between Centroids, and other methods driven by an objective function (Ward's Method uses squared error).
Cluster Similarity MIN or Single Link: similarity of two clusters is based on the two most similar (closest) points in the different clusters.
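The proximity definitions listed above differ only in how the distance between two clusters is computed; a sketch over clusters of coordinate tuples (names are illustrative):

```python
from math import dist  # Euclidean distance between two coordinate sequences

def single_link(c1, c2):        # MIN: closest pair across the two clusters
    return min(dist(p, q) for p in c1 for q in c2)

def complete_link(c1, c2):      # MAX: farthest pair across the two clusters
    return max(dist(p, q) for p in c1 for q in c2)

def group_average(c1, c2):      # mean of all pairwise distances
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def centroid_distance(c1, c2):  # distance between the two cluster centroids
    m1 = tuple(sum(v) / len(c1) for v in zip(*c1))
    m2 = tuple(sum(v) / len(c2) for v in zip(*c2))
    return dist(m1, m2)
```

Any of these can be passed to an agglomerative merge loop as its inter-cluster proximity.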
MST Divisive Hierarchical Clustering: use a minimum spanning tree (MST) for constructing the hierarchy of clusters.
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder". Then why do we want to evaluate them?
Clusters found in Random Data
[Figure: points drawn at random on the unit square ("Random Points"), and the clusters imposed on them by K-means, DBSCAN, and complete link.]
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices:
o Proximity matrix
o "Incidence" matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, and 0 otherwise.
Compute the correlation between the two matrices.
High correlation indicates that points that belong to the same cluster are close to each other.
Not a good measure for some density- or contiguity-based clusters.
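A minimal sketch of the correlation check, computed over the distinct point pairs, with distance as the proximity (so a good clustering gives a strongly negative correlation); names are illustrative:

```python
from math import dist

def validity_correlation(points, labels):
    """Pearson correlation between proximity (distance) and incidence
    (same-cluster = 1) entries over all distinct point pairs."""
    prox, inc = [], []
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            prox.append(dist(points[i], points[j]))
            inc.append(1.0 if labels[i] == labels[j] else 0.0)

    def pearson(xs, ys):
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    return pearson(prox, inc)
```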
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
[Figure: the two data sets on the unit square; Corr = -0.9235 and Corr = -0.5810.]
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
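A sketch of building the reordered similarity matrix; the similarity function 1/(1 + distance) is just one convenient choice, not prescribed by the slides, and the names are illustrative:

```python
from math import dist

def ordered_similarity_matrix(points, labels):
    """Reorder points by cluster label and return the pairwise similarity
    matrix; well-separated clusters show up as bright diagonal blocks."""
    order = sorted(range(len(points)), key=lambda i: labels[i])
    pts = [points[i] for i in order]
    # Similarity = 1 / (1 + distance): a simple choice that lies in (0, 1].
    return [[1.0 / (1.0 + dist(p, q)) for q in pts] for p in pts]
```

Plotting this matrix as a heat map reproduces the block-diagonal pictures on the following slides.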
[Figure: well-separated points on the unit square and their reordered similarity matrix, showing sharp diagonal blocks.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: random points on the unit square and the reordered similarity matrix for the clusters DBSCAN finds in them.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: random points on the unit square and the reordered similarity matrix for the clusters K-means finds in them.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: random points on the unit square and the reordered similarity matrix for the clusters complete link finds in them.]
Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN clusters (labeled 1–7) on a more complicated data set, with the corresponding reordered similarity matrix.]
Clusters in more complicated figures aren't well separated.
Internal Index: used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
[Figure: SSE vs. number of clusters K (K = 2 to 30) for the 10-cluster data set.]
Internal Measures SSE
SSE curve for a more complicated data set
[Figure: the more complicated data set with clusters labeled 1–7.]
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
[Figure: histogram of SSE values (roughly 0.016 to 0.034) for clusterings of random points on the unit square.]
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
[Figure: the two data sets on the unit square; Corr = -0.9235 and Corr = -0.5810.]
Cluster Cohesion: measures how closely related the objects in a cluster are.
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.
Example Squared Error
Internal Measures Cohesion and Separation
$\mathrm{WSS} = \sum_{i} \sum_{x \in C_i} (x - m_i)^2$

$\mathrm{BSS} = \sum_{i} |C_i| \, (m - m_i)^2$

where $m_i$ is the mean of cluster $C_i$ and $m$ is the overall mean of the data.
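Both quantities can be computed directly from a list of clusters; for squared Euclidean distance, WSS + BSS equals the total sum of squares of the data, so their total is constant across clusterings (names are illustrative):

```python
def mean(cl):
    return tuple(sum(v) / len(cl) for v in zip(*cl))

def sq(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def wss_bss(clusters):
    """Cohesion (WSS) and separation (BSS) for a list of clusters of points."""
    allpts = [p for cl in clusters for p in cl]
    m = mean(allpts)                      # overall mean of the data
    wss = sum(sq(p, mean(cl)) for cl in clusters for p in cl)
    bss = sum(len(cl) * sq(m, mean(cl)) for cl in clusters)
    return wss, bss
```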
Internal Measures Cohesion and Separation
Example SSE: data points 1, 2, 4, 5 on a line; overall mean m = 3; for K = 2, cluster means m1 = 1.5 and m2 = 4.5.

K = 1 cluster:
WSS = (1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2 = 10
BSS = 4 x (3 - 3)^2 = 0
Total = WSS + BSS = 10 + 0 = 10

K = 2 clusters:
WSS = (1 - 1.5)^2 + (2 - 1.5)^2 + (4 - 4.5)^2 + (5 - 4.5)^2 = 1
BSS = 2 x (3 - 1.5)^2 + 2 x (4.5 - 3)^2 = 9
Total = WSS + BSS = 1 + 9 = 10
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
[Figure: proximity-graph view of cohesion (edges within a cluster) and separation (edges between clusters).]
Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points as well as for clusters and clusterings.
For an individual point i: a = the average distance of i to the points in its own cluster, and b = the minimum (over the other clusters) of the average distance of i to the points in that cluster. The silhouette coefficient is s = (b - a) / max(a, b), equivalently s = 1 - a/b when a < b; typically between 0 and 1, and the closer to 1 the better.
Can calculate the average silhouette width for a cluster or a clustering.
Internal Measures Silhouette Coefficient
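A per-point sketch of the silhouette coefficient as defined above, using s = (b - a)/max(a, b); names are illustrative:

```python
from math import dist

def silhouette(points, labels):
    """Silhouette coefficient for each point: a = mean distance to its own
    cluster, b = smallest mean distance to any other cluster."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    out = []
    for i, (p, l) in enumerate(zip(points, labels)):
        own = [points[j] for j, lj in enumerate(labels) if lj == l and j != i]
        if not own:
            out.append(0.0)          # convention: silhouette of a singleton is 0
            continue
        a = sum(dist(p, q) for q in own) / len(own)
        b = min(sum(dist(p, q) for q in cl) / len(cl)
                for key, cl in clusters.items() if key != l)
        out.append((b - a) / max(a, b))
    return out
```

Averaging the returned values over a cluster (or over all points) gives the average silhouette width for that cluster (or the clustering).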
External Measures of Cluster Validity Entropy and Purity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
(Algorithms for Clustering Data, Jain and Dubes)
Final Comment on Cluster Validity
Similarity matrix (items I1–I5):

      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00
Discovered Clusters and Industry Groups:
1. Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN (Industry group: Technology1-DOWN)
2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN (Industry group: Technology2-DOWN)
3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN (Industry group: Financial-DOWN)
4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP (Industry group: Oil-UP)
What is Cluster Analysis Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
Well-separated clusters Center-based clusters Contiguous clusters Density-based clusters Property or Conceptual Described by an Objective Function
Types of Clusters Well-Separated
Well-Separated Clusters
3 well-separated clusters
Types of Clusters Center-Based
Center-based
centroidmedoid
4 center-based clusters
Types of Clusters Contiguity-Based
Contiguous Cluster (Nearest neighbor or Transitive)
8 contiguous clusters
Types of Clusters Density-Based
Density-based
6 density-based clusters
Types of Clusters Conceptual Clusters
Shared Property or Conceptual Clusters
2 Overlapping Circles
Types of Clusters Objective Function
Clusters Defined by an Objective Function
Types of Clusters Objective Function hellip
Map the clustering problem to a different domain and solve a related problem in that domain
Characteristics of the Input Data Are Important
Type of proximity or density measure
Sparseness
Attribute type
Type of Data
Dimensionality Noise and Outliers Type of Distribution
Clustering Algorithms
K-means and its variants
Hierarchical clustering
K-means Clustering
o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple
K-means Clustering ndash Details
Initial centroids are often chosen randomly
The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity
correlation etc K-means will converge for common similarity measures mentioned
above Most of the convergence happens in the first few iterations
Complexity is O( n K I d )
sumisin
=iCxi
i xC
m||
1
Two different K-means Clusterings
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Optimal Clustering
Original Points
Importance of Choosing Initial Centroids
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 6
Importance of Choosing Initial Centroids
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 6
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
sumsum= isin
=K
i Cxi
i
xmdistSSE1
2 )(
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
xy
Iteration 5
Problems with Selecting Initial Points
If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids
o Postprocessing o Bisecting K-means
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies
Updating Centers Incrementally
In the basic K-means algorithm centroids are updated after all points are assigned to a centroid
An alternative is to update the centroids after each assignment (incremental approach)
Pre-processing and Post-processing
Pre-processing
Post-processing
Bisecting K-means
Bisecting K-means algorithm
Bisecting K-means Example
Limitations of K-means
K-means has problems when clusters are of differing
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
1 3 2 5 4 60
005
01
015
02
1
2
3
4
5
6
1
23 4
5
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures: SSE
SSE curve for a more complicated data set: [SSE of clusters found using K-means, with clusters labeled 1–7.]
Framework for Cluster Validity
Need a framework to interpret any measure.
Statistics provide a framework for cluster validity.
For comparing the results of two different sets of cluster analyses, a framework is less necessary.
Statistical Framework for SSE
Example: [Histogram of SSE values for random data (Count versus SSE), together with a scatter plot of one random data set in the unit square.]
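One way to read a histogram like this: compare the SSE of your clustering against a reference distribution of SSE values from random data. A minimal sketch under stated assumptions, with a deliberately crude "clustering" (sort the points and split them in half) and made-up data:

```python
# Sketch: is an observed SSE unusually low compared with random data?
import random

def sse_two_halves(points):
    """SSE after a trivial 2-'clustering' of 1-D data: sort, split in half."""
    pts = sorted(points)
    half = len(pts) // 2
    total = 0.0
    for group in (pts[:half], pts[half:]):
        m = sum(group) / len(group)
        total += sum((p - m) ** 2 for p in group)
    return total

random.seed(0)
n = 20
# Reference distribution: SSE of uniform random data, many trials.
ref = [sse_two_halves([random.uniform(0, 1) for _ in range(n)])
       for _ in range(500)]

# Structured data: two tight clumps -> SSE far below the reference range.
clumped = [random.gauss(0.2, 0.01) for _ in range(10)] + \
          [random.gauss(0.8, 0.01) for _ in range(10)]
observed = sse_two_halves(clumped)
print(observed < min(ref))   # True: this low an SSE is unlikely by chance
```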
Statistical Framework for Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
[Two scatter plots of points in the unit square; Corr = -0.9235 for the first data set, Corr = -0.5810 for the second.]
Cluster Cohesion: measures how closely related the objects in a cluster are.
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.
Internal Measures: Cohesion and Separation
Example: Squared Error
Cohesion is measured by the within-cluster sum of squares: $WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2$
Separation is measured by the between-cluster sum of squares: $BSS = \sum_i |C_i| (m - m_i)^2$
where $m$ is the overall mean and $m_i$ is the centroid of cluster $C_i$.
Internal Measures: Cohesion and Separation
Example: SSE for the points 1, 2, 4, and 5, with overall mean m = 3 and cluster centroids m1 = 1.5 and m2 = 4.5.
K=1 cluster: WSS = (1−3)² + (2−3)² + (4−3)² + (5−3)² = 10; BSS = 4 × (3−3)² = 0; Total = 10 + 0 = 10.
K=2 clusters: WSS = (1−1.5)² + (2−1.5)² + (4−4.5)² + (5−4.5)² = 1; BSS = 2 × (3−1.5)² + 2 × (4.5−3)² = 9; Total = 1 + 9 = 10.
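The arithmetic above can be checked with a few lines of Python:

```python
# Verify the WSS/BSS example: the total WSS + BSS is constant across K.
def wss_bss(points, clusters):
    m = sum(points) / len(points)            # overall mean
    wss = bss = 0.0
    for c in clusters:
        mi = sum(c) / len(c)                 # cluster centroid
        wss += sum((x - mi) ** 2 for x in c)
        bss += len(c) * (m - mi) ** 2
    return wss, bss

pts = [1, 2, 4, 5]
print(wss_bss(pts, [pts]))              # K=1: (10.0, 0.0)
print(wss_bss(pts, [[1, 2], [4, 5]]))   # K=2: (1.0, 9.0)
# WSS + BSS = 10 in both cases.
```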
Internal Measures: Cohesion and Separation
A proximity-graph-based approach can also be used for cohesion and separation.
[Proximity graphs illustrating cohesion (edges within a cluster) and separation (edges between clusters).]
Silhouette Coefficient: combines ideas of both cohesion and separation, but for individual points as well as clusters and clusterings.
For an individual point i: let a be the average distance of i to the points in its own cluster, and b the minimum, over the other clusters, of the average distance of i to the points in that cluster; then s = (b − a) / max(a, b).
Can calculate the average silhouette width for a cluster or a clustering.
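A minimal stdlib-only sketch of the per-point computation (it assumes every point has at least one same-cluster neighbor):

```python
# Sketch: silhouette coefficient s = (b - a) / max(a, b) for each point.
import math

def silhouette(points, labels):
    scores = []
    for i, (p, li) in enumerate(zip(points, labels)):
        by_label = {}
        for j, (q, lj) in enumerate(zip(points, labels)):
            if j != i:
                by_label.setdefault(lj, []).append(math.dist(p, q))
        a = sum(by_label[li]) / len(by_label[li])         # own cluster
        b = min(sum(d) / len(d)                           # nearest other
                for l, d in by_label.items() if l != li)
        scores.append((b - a) / max(a, b))
    return scores

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(silhouette(pts, [0, 0, 1, 1]))   # all values > 0.85: crisp clusters
```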
External Measures of Cluster Validity Entropy and Purity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
— Algorithms for Clustering Data, Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function …
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering – Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids …
Importance of Choosing Initial Centroids …
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Ward's Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Types of clusters: well-separated clusters, center-based clusters, contiguous clusters, density-based clusters, shared property (conceptual), and clusters described by an objective function.
Types of Clusters Well-Separated
Well-Separated Clusters
3 well-separated clusters
Types of Clusters Center-Based
Center-based
centroid / medoid
4 center-based clusters
Types of Clusters Contiguity-Based
Contiguous Cluster (Nearest neighbor or Transitive)
8 contiguous clusters
Types of Clusters Density-Based
Density-based
6 density-based clusters
Types of Clusters Conceptual Clusters
Shared Property or Conceptual Clusters
2 Overlapping Circles
Types of Clusters Objective Function
Clusters Defined by an Objective Function
Types of Clusters Objective Function …
Map the clustering problem to a different domain and solve a related problem in that domain
Characteristics of the Input Data Are Important
Type of proximity or density measure
Sparseness
Attribute type
Type of Data
Dimensionality Noise and Outliers Type of Distribution
Clustering Algorithms
K-means and its variants
Hierarchical clustering
K-means Clustering
o Partitional clustering approach
o Each cluster is associated with a centroid (center point)
o Each point is assigned to the cluster with the closest centroid
o Number of clusters, K, must be specified
o The basic algorithm is very simple
K-means Clustering – Details
Initial centroids are often chosen randomly.
The centroid m_i is (typically) the mean of the points in the cluster.
'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
K-means will converge for the common similarity measures mentioned above.
Most of the convergence happens in the first few iterations.
Complexity is O(n K I d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes.
$m_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$
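The basic algorithm described above (assign each point to the closest centroid, recompute the centroids, repeat until nothing changes) might look like this in Python; the data set and seed below are made up for illustration:

```python
# Minimal K-means sketch: Euclidean distance, initial centroids sampled
# from the data, stop when the centroids no longer move.
import math
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Step 1: assign each point to the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        # Step 2: recompute each centroid as the mean of its points.
        new = [tuple(sum(c) / len(pts) for c in zip(*pts)) if pts
               else centroids[i]
               for i, pts in enumerate(clusters)]
        if new == centroids:     # converged: assignments can no longer change
            break
        centroids = new
    return centroids, clusters

data = [(1, 1), (1.5, 2), (8, 8), (9, 8.5)]
centroids, clusters = kmeans(data, 2)
print(sorted(centroids))         # [(1.25, 1.5), (8.5, 8.25)]
```

With only two tight clumps, every choice of initial centroids converges to one centroid per clump; with harder data the result depends on initialization, which is the point of the slides that follow.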
Two different K-means Clusterings
[Three scatter plots of the same points: the original points, a sub-optimal K-means clustering, and the optimal clustering.]
Importance of Choosing Initial Centroids
[Scatter plots of K-means iterations 1–6.]
Importance of Choosing Initial Centroids
[Scatter plots of K-means iterations 1–6, one iteration per plot.]
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$
where $x$ is a data point in cluster $C_i$ and $m_i$ is the centroid of cluster $C_i$.
Importance of Choosing Initial Centroids …
[Scatter plots of K-means iterations 1–5 for a different choice of initial centroids.]
Importance of Choosing Initial Centroids …
[Scatter plots of iterations 1–5, one iteration per plot.]
Problems with Selecting Initial Points
If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small.
10 Clusters Example
[Scatter plots of K-means iterations 1–4 on the ten-cluster data set.]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
[Scatter plots of K-means iterations 1–4 on the ten-cluster data set, one iteration per plot.]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one
[Scatter plots of K-means iterations 1–4.]
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one
[Scatter plots of K-means iterations 1–4, one iteration per plot.]
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids
o Postprocessing
o Bisecting K-means
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies
Updating Centers Incrementally
In the basic K-means algorithm centroids are updated after all points are assigned to a centroid
An alternative is to update the centroids after each assignment (incremental approach)
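The incremental approach relies on the running-mean identity m_new = m + (x − m)/n, so updating a centroid after a single assignment costs O(1) instead of a full recomputation; a one-dimensional sketch:

```python
# Sketch: O(1) incremental centroid update after adding one point.
def update_centroid(m, n, x):
    """Centroid m of n points, after adding point x (1-D for brevity)."""
    n += 1
    return m + (x - m) / n, n

m, n = 0.0, 0
for x in [1, 2, 4, 5]:
    m, n = update_centroid(m, n, x)
print(round(m, 6))   # 3.0, identical to the batch mean of the four points
```

Removing a point, or moving one between clusters, uses the same identity in reverse.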
Pre-processing and Post-processing
Pre-processing
Post-processing
Bisecting K-means
Bisecting K-means algorithm
Bisecting K-means Example
Limitations of K-means
K-means has problems when clusters are of differing sizes, densities, or non-globular shapes
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
[Example: six points (1–6) shown as nested clusters and as the corresponding dendrogram.]
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
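The basic loop can be sketched as follows, using MIN (single link) as the cluster proximity; this is a naive version that recomputes pairwise distances each pass, whereas a real implementation would maintain and update a proximity matrix:

```python
# Sketch: agglomerative clustering with single-link (MIN) proximity.
import math

def single_link(points, num_clusters):
    clusters = [[p] for p in points]       # start: one point per cluster
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(p, q)    # MIN: closest pair of points
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)     # merge the two closest clusters
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (5, 7)]
print(single_link(pts, 2))
# [[(0, 0), (0, 1)], [(5, 5), (5, 6), (5, 7)]]
```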
Starting Situation
Start with clusters of individual points and a proximity matrix
[Points p1–p5 and their proximity matrix.]
Intermediate Situation
After some merging steps we have some clusters
[Clusters C1–C5 and their proximity matrix.]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
[Clusters C1–C5 and their proximity matrix, with C2 and C5 highlighted.]
After Merging
The question is "How do we update the proximity matrix?"
[Clusters C1, C3, C4, and C2 ∪ C5, and the updated proximity matrix.]
How to Define Inter-Cluster Similarity
[Points p1–p5 and their proximity matrix.]
o MIN
o MAX
o Group Average
o Distance Between Centroids
o Other methods driven by an objective function (Ward's Method uses squared error)
Cluster Similarity: MIN or Single Link
Similarity of two clusters is based on the two most similar (closest) points in the different clusters.
MST: Divisive Hierarchical Clustering
Use MST (minimum spanning tree) for constructing a hierarchy of clusters.
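One hedged sketch of the idea (the data below is made up): build the MST over the points, then cut its longest edges; each surviving connected component is a cluster.

```python
# Sketch: MST-based divisive clustering (Prim's MST + edge cutting).
import math

def mst_clusters(points, num_clusters):
    n = len(points)
    # Prim's algorithm: grow the MST from point 0, recording each edge.
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        d, i, j = min((math.dist(points[i], points[j]), i, j)
                      for i in in_tree for j in range(n) if j not in in_tree)
        in_tree.add(j)
        edges.append((d, i, j))
    # Keep only the shortest n - num_clusters edges (cut the longest ones).
    edges.sort()
    keep = edges[:n - num_clusters]
    # Connected components over the kept edges (simple union-find).
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for _, i, j in keep:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(points[i])
    return list(groups.values())

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(mst_clusters(pts, 2))
# [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
```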
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"! Then why do we want to evaluate them?
Clusters found in Random Data
[Four scatter plots: the original random points, and the clusters found by K-means, DBSCAN, and complete link.]
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information. Use only the data.
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices: the proximity matrix and the incidence matrix.
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Similarity matrix for items I1–I5:

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
Discovered Clusters | Industry Group
1. Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN | Technology1-DOWN
2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN | Technology2-DOWN
3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN | Financial-DOWN
4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP | Oil-UP
What is not Cluster Analysis
Supervised classification
Simple segmentation
Results of a query
Graph partitioning
Notion of a Cluster can be Ambiguous
How many clusters? [The same points interpreted as two clusters, four clusters, or six clusters.]
Types of Clusterings A clustering is a set of clusters
Important distinction between hierarchical and partitional sets of clusters
Well-separated clusters Center-based clusters Contiguous clusters Density-based clusters Property or Conceptual Described by an Objective Function
Types of Clusters Well-Separated
Well-Separated Clusters
3 well-separated clusters
Types of Clusters Center-Based
Center-based
centroidmedoid
4 center-based clusters
Types of Clusters Contiguity-Based
Contiguous Cluster (Nearest neighbor or Transitive)
8 contiguous clusters
Types of Clusters Density-Based
Density-based
6 density-based clusters
Types of Clusters Conceptual Clusters
Shared Property or Conceptual Clusters
2 Overlapping Circles
Types of Clusters Objective Function
Clusters Defined by an Objective Function
Types of Clusters Objective Function hellip
Map the clustering problem to a different domain and solve a related problem in that domain
Characteristics of the Input Data Are Important
Type of proximity or density measure
Sparseness
Attribute type
Type of Data
Dimensionality Noise and Outliers Type of Distribution
Clustering Algorithms
K-means and its variants
Hierarchical clustering
K-means Clustering
o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple
K-means Clustering ndash Details
Initial centroids are often chosen randomly
The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity
correlation etc K-means will converge for common similarity measures mentioned
above Most of the convergence happens in the first few iterations
Complexity is O( n K I d )
sumisin
=iCxi
i xC
m||
1
Two different K-means Clusterings
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Optimal Clustering
Original Points
Importance of Choosing Initial Centroids
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 6
Importance of Choosing Initial Centroids
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 6
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
sumsum= isin
=K
i Cxi
i
xmdistSSE1
2 )(
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
xy
Iteration 5
Problems with Selecting Initial Points
If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids
o Postprocessing o Bisecting K-means
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies
Updating Centers Incrementally
In the basic K-means algorithm centroids are updated after all points are assigned to a centroid
An alternative is to update the centroids after each assignment (incremental approach)
Pre-processing and Post-processing
Pre-processing
Post-processing
Bisecting K-means
Bisecting K-means algorithm
Bisecting K-means Example
Limitations of K-means
K-means has problems when clusters are of differing
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
1 3 2 5 4 60
005
01
015
02
1
2
3
4
5
6
1
23 4
5
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
[Figure: clusters C1, C3, C4, and the merged cluster C2 U C5, with the rows and columns of the proximity matrix to be updated]
How to Define Inter-Cluster Similarity
[Figure: points p1-p5 and their proximity matrix, shown repeatedly while highlighting each of the definitions below]
MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function (Ward's Method uses squared error)
Cluster Similarity: MIN or Single Link
Similarity of two clusters is based on the two most similar (closest) points in the different clusters; i.e., it is determined by one pair of points, one link in the proximity graph.
MST: Divisive Hierarchical Clustering
Use a minimum spanning tree (MST) for constructing the hierarchy of clusters: build the MST over the proximity graph, then create new clusters by breaking the link corresponding to the largest distance.
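A sketch of the MST-based divisive idea, assuming a precomputed distance matrix: build the MST (Prim's algorithm here), then obtain k clusters by discarding the k-1 largest tree edges (this yields the same partitions as single link):

```python
def mst_divisive(dist, k):
    """Cluster by building an MST over `dist` (list of lists) and
    removing the k-1 largest MST edges; returns a component label per point."""
    n = len(dist)
    in_tree, edges = {0}, []
    while len(in_tree) < n:                       # Prim's algorithm
        u, v = min(((u, v) for u in in_tree for v in range(n)
                    if v not in in_tree), key=lambda e: dist[e[0]][e[1]])
        in_tree.add(v)
        edges.append((dist[u][v], u, v))
    edges.sort()
    keep = edges[: n - k]                         # drop the k-1 largest edges
    parent = list(range(n))                       # union-find over kept edges
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for _, u, v in keep:
        parent[find(u)] = find(v)
    return [find(i) for i in range(n)]
```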
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder!" Then why do we want to evaluate them?
Clusters found in Random Data
[Figure: 100 random points in the unit square, and the clusters found in them by K-means (three clusters), DBSCAN, and complete link]
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Measures of Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index: measures the extent to which cluster labels match externally supplied class labels (e.g., entropy).
o Internal Index: measures the goodness of a clustering structure without respect to external information (e.g., SSE).
o Relative Index: compares two different clusterings or clusters; often an external or internal index is used for this purpose.
Sometimes these are referred to as criteria instead of indices.
Measuring Cluster Validity Via Correlation
Two matrices:
o The proximity matrix.
o An 'incidence' matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, and 0 otherwise.
Compute the correlation between the two matrices; since both are symmetric, only n(n-1)/2 entries need to be compared.
High correlation indicates that points that belong to the same cluster are close to each other.
Not a good measure for some density- or contiguity-based clusters.
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
[Figure: a well-separated three-cluster data set and a random data set, each in the unit square]
Corr = -0.9235   Corr = -0.5810
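A sketch of this computation, assuming Euclidean distances and NumPy (the function name is illustrative); since both matrices are symmetric, only the upper-triangle entries are correlated:

```python
import numpy as np

def cluster_label_correlation(X, labels):
    """Correlation between the distance matrix of X and the cluster
    'incidence' matrix (entry 1 iff two points share a cluster)."""
    prox = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    labels = np.array(labels)
    incidence = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices(len(X), k=1)  # each pair counted once
    return np.corrcoef(prox[iu], incidence[iu])[0, 1]
```

Because proximity is measured as distance here, a good clustering gives a strongly negative correlation, matching the signs of the values above.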
Using Similarity Matrix for Cluster Validation
Order the similarity matrix with respect to cluster labels and inspect visually.
[Figure: a well-separated three-cluster data set and its 100 x 100 point-to-point similarity matrix sorted by cluster label; three bright diagonal blocks appear]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: sorted similarity matrices of the random data set under the cluster labels found by DBSCAN, K-means, and complete link; the diagonal blocks are much fuzzier than for the well-separated data]
Using Similarity Matrix for Cluster Validation
[Figure: a more complicated data set in which DBSCAN finds seven labeled clusters, shown with its sorted similarity matrix]
Clusters in more complicated figures aren't well separated.

Internal Measures: SSE
Internal Index: used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE).
Can also be used to estimate the number of clusters.
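Estimating the number of clusters this way is the familiar elbow heuristic: compute SSE for a range of K and look for the knee. A sketch with a bare-bones K-means (illustrative, random initialization; not the course's implementation):

```python
import numpy as np

def kmeans_sse(X, k, iters=20, seed=0):
    """SSE of a basic K-means run (illustrative sketch, random init)."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        labels = ((X[:, None, :] - c[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():  # leave empty centroids where they are
                c[j] = X[labels == j].mean(axis=0)
    return ((X - c[labels]) ** 2).sum()
```

On data with two tight groups, SSE drops sharply from K = 1 to K = 2 and only slowly after, producing the knee.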
[Figure: the ten-cluster example data set and its SSE vs. K curve for K = 2 to 30; SSE decreases with K, with a distinct knee at the natural number of clusters]

Internal Measures: SSE
SSE curve for a more complicated data set:
[Figure: the seven-cluster data set, with the SSE of the clusters found using K-means]
Framework for Cluster Validity
Need a framework to interpret any measure: if a measure of evaluation has the value, say, 10, is that good, fair, or poor?
Statistics provide a framework for cluster validity: the more 'atypical' a clustering result is, the more likely it represents valid structure in the data.
For comparing the results of two different sets of cluster analyses, a framework is less necessary.
Statistical Framework for SSE
Example: compare the SSE of a clustering of real data against the distribution of SSEs obtained by clustering many comparable sets of random data.
[Figure: histogram of SSE values (Count vs. SSE, roughly 0.016 to 0.034) from clusterings of random data, shown next to the example data set]
Statistical Framework for Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
[Figure: the well-separated three-cluster data set and the random data set]
Corr = -0.9235   Corr = -0.5810
Internal Measures: Cohesion and Separation
Cluster Cohesion: measures how closely related the objects in a cluster are. Example: the within-cluster sum of squares (WSS, i.e., SSE).
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters. Example: the between-cluster sum of squares (BSS).
Example: Squared Error

\mathrm{WSS} = \sum_i \sum_{x \in C_i} (x - m_i)^2

\mathrm{BSS} = \sum_i |C_i| \, (m - m_i)^2

where m_i is the centroid of cluster C_i, |C_i| is its size, and m is the overall mean of the data.
Internal Measures: Cohesion and Separation
Example: SSE for the points 1, 2, 4, 5 on a line, with cluster centroids m_1 = 1.5 (for {1, 2}) and m_2 = 4.5 (for {4, 5}), and overall mean m = 3.

K = 1 cluster:
WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
BSS = 4 x (3-3)^2 = 0
Total = 10 + 0 = 10

K = 2 clusters:
WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
BSS = 2 x (3-1.5)^2 + 2 x (4.5-3)^2 = 9
Total = 1 + 9 = 10
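The example can be checked numerically; note that WSS + BSS is constant (it always equals the total sum of squares), which is why both clusterings total 10:

```python
points = [1, 2, 4, 5]
m = sum(points) / len(points)                 # overall mean = 3
tss = sum((x - m) ** 2 for x in points)       # total sum of squares = 10

def wss_bss(clusters):
    wss = bss = 0.0
    for c in clusters:
        mi = sum(c) / len(c)                  # cluster centroid
        wss += sum((x - mi) ** 2 for x in c)  # cohesion
        bss += len(c) * (m - mi) ** 2         # separation
    return wss, bss

print(wss_bss([[1, 2, 4, 5]]))    # K=1: (10.0, 0.0)
print(wss_bss([[1, 2], [4, 5]]))  # K=2: (1.0, 9.0)
```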
Internal Measures: Cohesion and Separation
A proximity-graph-based approach can also be used for cohesion and separation: cohesion is the sum of the weights of all links within a cluster; separation is the sum of the weights of links between nodes in the cluster and nodes outside the cluster.
[Figure: a proximity graph illustrating cohesion (edges within a cluster) and separation (edges between clusters)]
Internal Measures: Silhouette Coefficient
Silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.
For an individual point i:
o Calculate a = average distance of i to the points in its own cluster.
o Calculate b = min (over the other clusters) of the average distance of i to the points in that cluster.
o The silhouette coefficient of i is s = (b - a) / max(a, b); it typically lies between 0 and 1, and the closer to 1 the better.
Can calculate the average silhouette width for a cluster or a clustering.
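A sketch of the per-point computation under the formula s(i) = (b - a) / max(a, b), assuming Euclidean distance and NumPy (the function name is illustrative):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient: for each point, a = average distance
    to its own cluster, b = smallest average distance to another cluster."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    n, s = len(X), []
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        a = d[i][own].mean() if own.any() else 0.0
        b = min(d[i][labels == c].mean() for c in set(labels) if c != labels[i])
        s.append((b - a) / max(a, b))
    return sum(s) / n
```

Values near 1 indicate points well inside compact, well-separated clusters; negative values suggest misassigned points.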
External Measures of Cluster Validity Entropy and Purity
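Both external measures compare cluster membership against known class labels. A hedged sketch, representing each cluster simply as the list of true class labels of its members:

```python
from collections import Counter
from math import log2

def purity(clusters):
    """Fraction of points covered by the majority class of their cluster."""
    n = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / n

def entropy(clusters):
    """Size-weighted average of the class-label entropy inside each cluster."""
    n = sum(len(c) for c in clusters)
    h = 0.0
    for c in clusters:
        probs = [v / len(c) for v in Counter(c).values()]
        h += len(c) / n * -sum(p * log2(p) for p in probs)
    return h

print(purity([["a", "a", "b"], ["b", "b"]]))  # -> 0.8
```

Higher purity and lower entropy both indicate clusters that agree with the external class labels.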
Final Comment on Cluster Validity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
(Algorithms for Clustering Data, Jain and Dubes)
Example similarity matrix for five items I1-I5 (used in the single link / complete link / group average examples):

      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00
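In an agglomerative pass over such a similarity matrix, the first merge is simply the most similar pair, regardless of linkage. A quick check, with the pairwise entries transcribed from the table:

```python
sim = {
    ("I1", "I2"): 0.90, ("I1", "I3"): 0.10, ("I1", "I4"): 0.65, ("I1", "I5"): 0.20,
    ("I2", "I3"): 0.70, ("I2", "I4"): 0.60, ("I2", "I5"): 0.50,
    ("I3", "I4"): 0.40, ("I3", "I5"): 0.30, ("I4", "I5"): 0.80,
}
# the first agglomerative merge is the most similar pair, whatever the linkage
first = max(sim, key=sim.get)
print(first, sim[first])  # -> ('I1', 'I2') 0.9
```

Only the later merges depend on whether MIN, MAX, or group average is used.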
Types of Clusterings A clustering is a set of clusters
Important distinction between hierarchical and partitional sets of clusters
Well-separated clusters Center-based clusters Contiguous clusters Density-based clusters Property or Conceptual Described by an Objective Function
Types of Clusters Well-Separated
Well-Separated Clusters
3 well-separated clusters
Types of Clusters Center-Based
Center-based
centroidmedoid
4 center-based clusters
Types of Clusters Contiguity-Based
Contiguous Cluster (Nearest neighbor or Transitive)
8 contiguous clusters
Types of Clusters Density-Based
Density-based
6 density-based clusters
Types of Clusters Conceptual Clusters
Shared Property or Conceptual Clusters
2 Overlapping Circles
Types of Clusters Objective Function
Clusters Defined by an Objective Function
Types of Clusters Objective Function hellip
Map the clustering problem to a different domain and solve a related problem in that domain
Characteristics of the Input Data Are Important
Type of proximity or density measure
Sparseness
Attribute type
Type of Data
Dimensionality Noise and Outliers Type of Distribution
Clustering Algorithms
K-means and its variants
Hierarchical clustering
K-means Clustering
o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple
K-means Clustering ndash Details
Initial centroids are often chosen randomly
The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity
correlation etc K-means will converge for common similarity measures mentioned
above Most of the convergence happens in the first few iterations
Complexity is O( n K I d )
sumisin
=iCxi
i xC
m||
1
Two different K-means Clusterings
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Optimal Clustering
Original Points
Importance of Choosing Initial Centroids
[Figure: snapshots of K-means over Iterations 1–6, starting from well-chosen initial centroids.]
Evaluating K-means Clusters
Most common measure is the Sum of Squared Error (SSE):

SSE = Σ_{i=1}^{K} Σ_{x ∈ C_i} dist²(m_i, x)

where x is a data point in cluster C_i and m_i is the representative (centroid) of cluster C_i.
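Given the output of a K-means run, SSE can be computed directly from this definition; a minimal illustrative helper using squared Euclidean distance:

```python
def sse(clusters, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for c_points, m in zip(clusters, centroids):
        for x in c_points:
            total += sum((xi - mi) ** 2 for xi, mi in zip(x, m))
    return total
```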
Importance of Choosing Initial Centroids …

[Figure: snapshots of K-means over Iterations 1–5, starting from poorly chosen initial centroids; the algorithm converges to a sub-optimal clustering.]
Problems with Selecting Initial Points
If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small.
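A quick back-of-envelope calculation makes the point. Under the illustrative simplification that the K clusters have equal size and that the number of points is much larger than K, the chance that K uniform random picks land in K different clusters is approximately K!/K^K:

```python
from math import factorial

def p_one_per_cluster(k):
    """Chance that k uniform random picks land in k different equal-sized clusters."""
    return factorial(k) / k ** k

p10 = p_one_per_cluster(10)   # already tiny for K = 10: about 0.00036
```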
10 Clusters Example
[Figure: K-means on a data set with 10 clusters arranged in pairs, shown over Iterations 1–4.]
Starting with two initial centroids in one cluster of each pair of clusters.
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one.

[Figure: K-means on the 10-cluster data set over Iterations 1–4 under this initialization.]
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids
o Postprocessing
o Bisecting K-means
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies: choose a replacement centroid, such as the point that contributes most to SSE, or a point from the cluster with the highest SSE; if there are several empty clusters, this can be repeated several times.
Updating Centers Incrementally
In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid.
An alternative is to update the centroids after each assignment (incremental approach).
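The incremental approach rests on the running-mean identity m' = m + (x − m)/(n + 1), so a centroid never has to be recomputed from scratch when a point is assigned to its cluster; a minimal sketch with illustrative names:

```python
def add_point(centroid, size, x):
    """Return the updated (centroid, size) after assigning point x to the cluster.

    Uses the running-mean identity m' = m + (x - m) / (n + 1)."""
    n = size + 1
    new_centroid = tuple(m + (xi - m) / n for m, xi in zip(centroid, x))
    return new_centroid, n
```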
Pre-processing and Post-processing
Pre-processing: normalize the data; eliminate outliers.
Post-processing: eliminate small clusters that may represent outliers; split 'loose' clusters (clusters with relatively high SSE); merge clusters that are 'close' and have relatively low SSE.
Bisecting K-means
Bisecting K-means algorithm: a variant of K-means that can produce a partitional or a hierarchical clustering by repeatedly selecting a cluster and splitting it in two with basic K-means.
Bisecting K-means Example
Limitations of K-means
K-means has problems when clusters are of differing sizes, densities, or non-globular shapes.
K-means also has problems when the data contains outliers.
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters: find parts of clusters, but the parts then need to be put together.
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
[Figure: six points and the dendrogram produced by hierarchical clustering, with merge heights between 0 and 0.2.]
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters: any desired number can be obtained by 'cutting' the dendrogram at the proper level.
The resulting clusters may correspond to meaningful taxonomies.
Hierarchical Clustering
Two main types of hierarchical clustering:
o Agglomerative: start with the points as individual clusters and, at each step, merge the closest pair of clusters.
o Divisive: start with one, all-inclusive cluster and, at each step, split a cluster until each cluster contains a single point.
Traditional hierarchical algorithms use a similarity or distance matrix.
Agglomerative Clustering Algorithm
More popular hierarchical clustering technique.
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
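The basic algorithm (compute the proximity matrix, repeatedly merge the two closest clusters, update proximities) can be sketched in plain Python. This illustrative O(n³) version uses MIN (single link) as the inter-cluster proximity, one of the options discussed below, and is far from the optimized implementations:

```python
def single_link_agglomerative(points, num_clusters):
    """Repeatedly merge the two closest clusters (MIN / single link)."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = [[p] for p in points]           # start: one cluster per point
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest single-link distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)         # merge cluster j into cluster i
    return clusters
```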
Starting Situation
Start with clusters of individual points and a proximity matrix
[Figure: points p1–p5 and their proximity matrix.]
Intermediate Situation
After some merging steps we have some clusters
[Figure: clusters C1–C5 and the current proximity matrix.]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
[Figure: clusters C1–C5 and the proximity matrix before merging C2 and C5.]
After Merging
The question is: "How do we update the proximity matrix?"
[Figure: the clusters after merging C2 and C5 into C2 ∪ C5, and the proximity matrix rows and columns to be updated.]
How to Define Inter-Cluster Similarity

[Figure: points p1–p5 and their proximity matrix.]

MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function: Ward's Method uses squared error
Cluster Similarity MIN or Single Link
Similarity of two clusters is based on the two most similar (closest) points in the different clusters.
MST Divisive Hierarchical Clustering
Use MST for constructing a hierarchy of clusters.
Cluster Validity
For supervised classification, we have a variety of measures to evaluate how good our model is.
For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"! Then why do we want to evaluate them?
Clusters found in Random Data
[Figure: random points in the unit square, and the clusters imposed on them by K-means, DBSCAN, and complete-link hierarchical clustering.]
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information. Use only the data.
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index: measures the extent to which cluster labels match externally supplied class labels (e.g., entropy).
o Internal Index: measures the goodness of a clustering structure without respect to external information (e.g., SSE).
o Relative Index: compares two different clusterings or clusters; often an external or internal index is used for this purpose.
Sometimes these are referred to as criteria instead of indices.
Measures of Cluster Validity
Two matrices:
o Proximity matrix.
o "Incidence" matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, and 0 otherwise.
Compute the correlation between the two matrices; since the matrices are symmetric, only n(n − 1)/2 entries need to be compared.
High correlation indicates that points that belong to the same cluster are close to each other.
Not a good measure for some density- or contiguity-based clusters.
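The procedure can be sketched directly from this description. Since same-cluster pairs (incidence 1) should have small distances, a good clustering shows up as a strongly negative correlation; an illustrative sketch over the n(n − 1)/2 distinct pairs:

```python
def validity_correlation(points, labels):
    """Pearson correlation between pairwise distances and same-cluster indicators.

    proximity entry = Euclidean distance; incidence entry = 1 if the pair of
    points shares a cluster label, else 0. Only distinct pairs are used."""
    n = len(points)
    xs, ys = [], []
    for i in range(n):
        for j in range(i + 1, n):
            d = sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
            xs.append(d)
            ys.append(1.0 if labels[i] == labels[j] else 0.0)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```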
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
[Figure: a well-separated data set (Corr = −0.9235) and a random data set (Corr = −0.5810).]
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
[Figure: a well-separated data set and its similarity matrix with points ordered by cluster label; the matrix shows a crisp block-diagonal structure.]
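The sorting step itself is simple; a sketch in which the similarity measure 1/(1 + distance) is just one illustrative choice:

```python
def sorted_similarity_matrix(points, labels):
    """Similarity matrix (1 / (1 + distance)) with rows/columns ordered by cluster label."""
    order = sorted(range(len(points)), key=lambda i: labels[i])
    def sim(i, j):
        d = sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
        return 1.0 / (1.0 + d)
    return [[sim(i, j) for j in order] for i in order]
```

With well-separated clusters, the reordered matrix shows high-similarity blocks on the diagonal and low values elsewhere.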
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: random points clustered by DBSCAN, and the corresponding sorted similarity matrix.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: random points clustered by K-means, and the corresponding sorted similarity matrix.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: random points clustered by complete link, and the corresponding sorted similarity matrix.]
Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN finds seven clusters in a more complicated data set; the corresponding sorted similarity matrix shows that the clusters are not well separated.]
Clusters in more complicated figures aren't well separated.
Internal Index: used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
[Figure: the 10-cluster data set and its SSE-versus-K curve for K from 2 to 30.]
Internal Measures SSE
SSE curve for a more complicated data set.

[Figure: the complicated data set with clusters 1–7; SSE of clusters found using K-means.]
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
[Figure: histogram of SSE values obtained by clustering random data (SSE ranging roughly from 0.016 to 0.034), alongside the original data set in the unit square.]
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
[Figure: the well-separated and random data sets; Corr = −0.9235 and Corr = −0.5810, respectively.]
Cluster Cohesion: measures how closely related the objects in a cluster are.
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.
Example: Squared Error

Cohesion is measured by the within-cluster sum of squares (WSS):
WSS = Σ_i Σ_{x ∈ C_i} (x − m_i)²
Separation is measured by the between-cluster sum of squares (BSS):
BSS = Σ_i |C_i| (m − m_i)²

Example: SSE for the points 1, 2, 4, 5, with overall mean m = 3.

K = 1 cluster:
WSS = (1 − 3)² + (2 − 3)² + (4 − 3)² + (5 − 3)² = 10
BSS = 4 × (3 − 3)² = 0
Total = 10 + 0 = 10

K = 2 clusters, {1, 2} with centroid m1 = 1.5 and {4, 5} with centroid m2 = 4.5:
WSS = (1 − 1.5)² + (2 − 1.5)² + (4 − 4.5)² + (5 − 4.5)² = 1
BSS = 2 × (3 − 1.5)² + 2 × (4.5 − 3)² = 9
Total = 1 + 9 = 10
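The worked example on this slide can be checked mechanically; a minimal illustrative helper for one-dimensional clusters:

```python
def wss_bss(clusters):
    """Within- and between-cluster sums of squares for 1-D clusters.

    clusters is a list of lists of numbers; returns (WSS, BSS)."""
    allpts = [x for c in clusters for x in c]
    m = sum(allpts) / len(allpts)              # overall mean
    wss = bss = 0.0
    for c in clusters:
        mi = sum(c) / len(c)                   # cluster centroid
        wss += sum((x - mi) ** 2 for x in c)
        bss += len(c) * (m - mi) ** 2
    return wss, bss
```

For the slide's data, WSS + BSS is 10 in both cases: the total is constant, and clustering only shifts error between the two terms.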
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
[Figure: a proximity graph in which cohesion corresponds to edges within a cluster and separation to edges between clusters.]
Silhouette coefficient combines the ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.
For an individual point i:
o Calculate a = average distance of i to the points in its own cluster.
o Calculate b = the minimum, over the other clusters, of the average distance of i to the points in that cluster.
o The silhouette coefficient is s = (b − a) / max(a, b); typically between 0 and 1, and the closer to 1 the better.
Can calculate the average silhouette width for a cluster or a clustering.
Internal Measures Silhouette Coefficient
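The definition translates directly into code; a minimal illustrative sketch for one point, using Euclidean distance:

```python
def silhouette(points, labels, i):
    """Silhouette coefficient s = (b - a) / max(a, b) for point i."""
    def dist(p, q):
        return sum((u - v) ** 2 for u, v in zip(p, q)) ** 0.5
    # a: average distance of point i to the other points in its own cluster.
    own = [p for j, p in enumerate(points) if labels[j] == labels[i] and j != i]
    a = sum(dist(points[i], p) for p in own) / len(own)
    # b: minimum over the other clusters of the average distance to that cluster.
    others = {}
    for j, p in enumerate(points):
        if labels[j] != labels[i]:
            others.setdefault(labels[j], []).append(p)
    b = min(sum(dist(points[i], p) for p in ps) / len(ps) for ps in others.values())
    return (b - a) / max(a, b)
```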
External Measures of Cluster Validity Entropy and Purity
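The slide gives no details here, but the two measures are standard: purity is the fraction of points whose class matches the majority class of their cluster, and entropy is the size-weighted average class entropy per cluster. An illustrative sketch:

```python
from math import log2
from collections import Counter

def purity(clusters):
    """clusters: list of lists of true class labels, one list per found cluster."""
    n = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / n

def entropy(clusters):
    """Size-weighted average entropy of the class distribution inside each cluster."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        h = -sum((k / len(c)) * log2(k / len(c)) for k in Counter(c).values())
        total += len(c) / n * h
    return total
```

Higher purity and lower entropy indicate clusters that align better with the external class labels.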
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis.
Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function …
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering – Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids …
Importance of Choosing Initial Centroids …
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Ward's Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Well-separated clusters Center-based clusters Contiguous clusters Density-based clusters Property or Conceptual Described by an Objective Function
Types of Clusters Well-Separated
Well-Separated Clusters
3 well-separated clusters
Types of Clusters Center-Based
Center-based
centroidmedoid
4 center-based clusters
Types of Clusters Contiguity-Based
Contiguous Cluster (Nearest neighbor or Transitive)
8 contiguous clusters
Types of Clusters Density-Based
Density-based
6 density-based clusters
Types of Clusters Conceptual Clusters
Shared Property or Conceptual Clusters
2 Overlapping Circles
Types of Clusters Objective Function
Clusters Defined by an Objective Function
Types of Clusters Objective Function hellip
Map the clustering problem to a different domain and solve a related problem in that domain
Characteristics of the Input Data Are Important
Type of proximity or density measure
Sparseness
Attribute type
Type of Data
Dimensionality Noise and Outliers Type of Distribution
Clustering Algorithms
K-means and its variants
Hierarchical clustering
K-means Clustering
o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple
K-means Clustering ndash Details
Initial centroids are often chosen randomly
The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity
correlation etc K-means will converge for common similarity measures mentioned
above Most of the convergence happens in the first few iterations
Complexity is O( n K I d )
sumisin
=iCxi
i xC
m||
1
Two different K-means Clusterings
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Optimal Clustering
Original Points
Importance of Choosing Initial Centroids
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 6
Importance of Choosing Initial Centroids
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 6
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
sumsum= isin
=K
i Cxi
i
xmdistSSE1
2 )(
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
xy
Iteration 5
Problems with Selecting Initial Points
If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids
o Postprocessing o Bisecting K-means
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies
Updating Centers Incrementally
In the basic K-means algorithm centroids are updated after all points are assigned to a centroid
An alternative is to update the centroids after each assignment (incremental approach)
Pre-processing and Post-processing
Pre-processing
Post-processing
Bisecting K-means
Bisecting K-means algorithm
Bisecting K-means Example
Limitations of K-means
K-means has problems when clusters are of differing
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
1 3 2 5 4 60
005
01
015
02
1
2
3
4
5
6
1
23 4
5
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Ward's Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
o Well-separated clusters
o Center-based clusters
o Contiguous clusters
o Density-based clusters
o Property or conceptual clusters
o Clusters described by an objective function
Types of Clusters Well-Separated
Well-Separated Clusters
3 well-separated clusters
Types of Clusters Center-Based
Center-based
centroid / medoid
4 center-based clusters
Types of Clusters Contiguity-Based
Contiguous Cluster (Nearest neighbor or Transitive)
8 contiguous clusters
Types of Clusters Density-Based
Density-based
6 density-based clusters
Types of Clusters Conceptual Clusters
Shared Property or Conceptual Clusters
2 Overlapping Circles
Types of Clusters Objective Function
Clusters Defined by an Objective Function
Types of Clusters Objective Function …
Map the clustering problem to a different domain and solve a related problem in that domain
Characteristics of the Input Data Are Important
Type of proximity or density measure
Sparseness
Attribute type
Type of Data
Dimensionality
Noise and Outliers
Type of Distribution
Clustering Algorithms
K-means and its variants
Hierarchical clustering
K-means Clustering
o Partitional clustering approach
o Each cluster is associated with a centroid (center point)
o Each point is assigned to the cluster with the closest centroid
o Number of clusters, K, must be specified
o The basic algorithm is very simple
K-means Clustering – Details
o Initial centroids are often chosen randomly. Clusters produced vary from one run to another.
o The centroid m_i is (typically) the mean of the points in the cluster.
o 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
o K-means will converge for the common similarity measures mentioned above.
o Most of the convergence happens in the first few iterations.
o Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes.
The centroid of cluster $C_i$ is
$m_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$
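The basic algorithm described by these bullets can be sketched directly; this is a minimal pure-Python illustration for small data sets (function and variable names are my own, and a real implementation would vectorize the distance computations):

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means (Lloyd's algorithm): pick K initial centroids at
    random, assign each point to the closest centroid, recompute each
    centroid as the mean of its points, and repeat until assignments
    stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # random initial centroids
    assignment = None
    for _ in range(max_iter):
        # Assignment step: closest centroid by Euclidean distance.
        new_assignment = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        if new_assignment == assignment:     # converged
            break
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its points.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:                      # skip empty clusters here
                centroids[i] = tuple(
                    sum(c) / len(members) for c in zip(*members)
                )
    return centroids, assignment
```

On two well-separated groups of points this converges in a handful of iterations, matching the slide's observation that most of the convergence happens early.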
Two different K-means Clusterings
[Figure: the same set of Original Points clustered two ways: an Optimal Clustering and a Sub-optimal Clustering]
Importance of Choosing Initial Centroids
[Figure: iterations 1-6 of K-means from one choice of initial centroids, converging to the optimal clustering]
Importance of Choosing Initial Centroids
[Figure: the same six iterations, shown step by step]
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathit{dist}^2(m_i, x)$
where $x$ is a data point in cluster $C_i$ and $m_i$ is the representative point (centroid) of cluster $C_i$.
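Given a list of points, the centroids, and the cluster assignment of each point (names here are illustrative), the measure is straightforward to compute:

```python
import math

def sse(points, centroids, assignment):
    """Sum of Squared Error: for each point, the squared Euclidean
    distance to the centroid of its assigned cluster, summed over
    all points."""
    return sum(
        math.dist(p, centroids[a]) ** 2
        for p, a in zip(points, assignment)
    )
```

Given two clusterings, one can choose the one with the smaller SSE.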
Importance of Choosing Initial Centroids hellip
[Figure: iterations 1-5 of K-means from a different choice of initial centroids, converging to a sub-optimal clustering]
Importance of Choosing Initial Centroids hellip
[Figure: the same five iterations, shown step by step]
Problems with Selecting Initial Points
If there are K 'real' clusters, then the chance of selecting one initial centroid from each cluster is small: if the clusters are the same size, the probability is K!/K^K, e.g., 10!/10^10 ≈ 0.00036 for K = 10.
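Assuming K equal-size clusters and K randomly chosen initial centroids, the chance that every cluster contributes exactly one centroid works out to K!/K^K, which collapses quickly as K grows (the function name is mine):

```python
from math import factorial

def prob_one_per_cluster(k):
    """P(each of K equal-size clusters gets exactly one of K random
    initial centroids) = (favorable selections) / (all selections)
    = K! / K^K."""
    return factorial(k) / k ** k
```

For K = 2 the chance is 0.5; for K = 10 it is already about 0.00036, so random initialization almost never seeds every cluster.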
10 Clusters Example
[Figure: iterations 1-4 of K-means on the 10-cluster data set]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
[Figure: iterations 1-4, shown step by step]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one
[Figure: iterations 1-4 of K-means on the 10-cluster data set]
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one
[Figure: iterations 1-4, shown step by step]
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids
o Postprocessing
o Bisecting K-means
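The simplest of these remedies, multiple runs, just reruns K-means from different random initializations and keeps the clustering with the lowest SSE. A self-contained sketch (kmeans_1d here is a deliberately tiny 1-D K-means defined only so the restart loop has something to call):

```python
import random

def kmeans_1d(xs, k, seed):
    """A deliberately tiny 1-D K-means used only to illustrate restarts."""
    rng = random.Random(seed)
    cents = rng.sample(xs, k)
    for _ in range(50):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda i: abs(x - cents[i]))].append(x)
        cents = [sum(g) / len(g) if g else c for g, c in zip(groups, cents)]
    assign = [min(range(k), key=lambda i: abs(x - cents[i])) for x in xs]
    return cents, assign

def best_of_n_runs(xs, k, n_runs=10):
    """Multiple runs: rerun K-means from different random initial
    centroids and keep the clustering with the lowest SSE."""
    def sse(cents, assign):
        return sum((x - cents[a]) ** 2 for x, a in zip(xs, assign))
    runs = [kmeans_1d(xs, k, seed) for seed in range(n_runs)]
    return min(runs, key=lambda r: sse(*r))
```

This helps, but as the probability computation above suggests, it is not guaranteed to find the optimal clustering, especially for large K.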
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies:
o Choose the point that contributes most to SSE as the new centroid.
o Choose a point from the cluster with the highest SSE.
o If there are several empty clusters, the above can be repeated several times.
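One common strategy, reseeding an empty cluster with the point that currently contributes most to SSE, can be sketched as follows (a hypothetical helper of my own, not code from the slides):

```python
import math

def reseed_empty(points, centroids, assignment):
    """Fix empty clusters: move each empty cluster's centroid to the
    point that currently contributes the most to SSE, i.e. the point
    farthest from its assigned centroid."""
    used = set(assignment)
    empty = [i for i in range(len(centroids)) if i not in used]
    for i in empty:
        # pick the worst-fit point and make it the new centroid
        j = max(range(len(points)),
                key=lambda j: math.dist(points[j], centroids[assignment[j]]))
        centroids[i] = points[j]
        assignment[j] = i
    return centroids, assignment
```

After reseeding, the normal assignment/update iterations continue.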
Updating Centers Incrementally
In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid.
An alternative is to update the centroids after each assignment (incremental approach): each assignment updates zero or two centroids. This is more expensive and introduces an order dependency, but it never produces an empty cluster.
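The incremental update itself is a one-liner per coordinate: when a point x joins a cluster whose centroid m is the mean of n points, the new mean is m + (x - m)/(n + 1). A small illustrative helper (names are mine):

```python
def add_point(centroid, count, x):
    """Incremental centroid update: when point x joins a cluster whose
    centroid is the mean of `count` points, the new mean is
    m' = m + (x - m) / (count + 1), applied coordinate-wise."""
    n = count + 1
    new_centroid = tuple(m + (xi - m) / n for m, xi in zip(centroid, x))
    return new_centroid, n
```

A symmetric formula (with n - 1) removes a point when it leaves a cluster.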
Pre-processing and Post-processing
Pre-processing
o Normalize the data.
o Eliminate outliers.
Post-processing
o Eliminate small clusters that may represent outliers.
o Split 'loose' clusters, i.e., clusters with relatively high SSE.
o Merge clusters that are 'close' and that have relatively low SSE.
Bisecting K-means
Bisecting K-means algorithm: a variant of K-means that can produce a partitional or a hierarchical clustering.
Bisecting K-means Example
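The idea can be sketched in 1-D: keep all points in one cluster, then repeatedly take the cluster with the largest SSE and split it with a basic 2-means pass until K clusters remain (a simplified sketch, not the slides' exact formulation, which would typically try several trial bisections and keep the best):

```python
import random

def two_means(xs, seed=0):
    """Split one cluster of 1-D values into two with a basic 2-means pass."""
    rng = random.Random(seed)
    c = rng.sample(xs, 2)
    for _ in range(50):
        left = [x for x in xs if abs(x - c[0]) <= abs(x - c[1])]
        right = [x for x in xs if abs(x - c[0]) > abs(x - c[1])]
        if left:
            c[0] = sum(left) / len(left)
        if right:
            c[1] = sum(right) / len(right)
    return [g for g in (left, right) if g]

def bisecting_kmeans(xs, k):
    """Bisecting K-means: start with all points in one cluster, then
    repeatedly bisect the cluster with the largest SSE until there
    are K clusters."""
    def sse(g):
        m = sum(g) / len(g)
        return sum((x - m) ** 2 for x in g)
    clusters = [list(xs)]
    while len(clusters) < k:
        worst = max(clusters, key=sse)       # cluster with largest SSE
        clusters.remove(worst)
        clusters.extend(two_means(worst))    # replace it by its two halves
    return clusters
```

Recording the sequence of splits also yields a hierarchical clustering.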
Limitations of K-means
K-means has problems when clusters are of differing sizes, densities, or non-globular shapes.
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters: find parts of clusters, then put them together.
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram, a tree-like diagram that records the sequences of merges or splits.
[Figure: a dendrogram over six points and the corresponding diagram of nested clusters]
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters: any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level.
They may correspond to meaningful taxonomies, e.g., in the biological sciences.
Hierarchical Clustering
Two main types of hierarchical clustering:
o Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.
o Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains an individual point (or there are k clusters).
Traditional hierarchical algorithms use a similarity or distance matrix, merging or splitting one cluster at a time.
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
Key operation is the computation of the proximity of two clusters; different approaches to defining the distance between clusters distinguish the different algorithms.
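The merge loop can be sketched naively in a few lines; this O(n^3)-ish version recomputes pairwise distances each round, which is fine for illustration (names and the `linkage` parameter are my own):

```python
import math
from itertools import combinations

def agglomerative(points, num_clusters, linkage=min):
    """Basic agglomerative scheme: start with singleton clusters and
    repeatedly merge the closest pair. `linkage` reduces the pairwise
    point distances between two clusters to one number
    (min = single link, max = complete link)."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        # find the closest pair of clusters under the chosen linkage
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(
                math.dist(a, b)
                for a in clusters[ij[0]] for b in clusters[ij[1]]
            ),
        )
        clusters[i].extend(clusters[j])   # merge j into i
        del clusters[j]                   # j > i, so this index is safe
    return clusters
```

Production implementations instead maintain and update the proximity matrix incrementally, which is where the different linkage update rules come in.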
Starting Situation
Start with clusters of individual points and a proximity matrix
[Figure: five singleton clusters p1-p5 and their proximity matrix]
Intermediate Situation
After some merging steps we have some clusters
[Figure: clusters C1-C5 after some merging steps, with the corresponding proximity matrix]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
[Figure: clusters C1-C5 and their proximity matrix, with C2 and C5 about to be merged]
After Merging
The question is "How do we update the proximity matrix?"
[Figure: clusters C1, C3, C4, and the merged cluster C2 U C5, with the updated proximity matrix]
How to Define Inter-Cluster Similarity
[Figure: clusters of points p1-p5 and their proximity matrix]
Ways to define inter-cluster similarity:
o MIN
o MAX
o Group Average
o Distance Between Centroids
o Other methods driven by an objective function (Ward's Method uses squared error)
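The first three definitions are easy to state as distance reductions over all cross-cluster point pairs (small illustrative helpers of my own; the slides phrase these in terms of similarity, which is just the mirror image of distance):

```python
import math

def single_link(c1, c2):
    """MIN: distance between the two closest points across the clusters."""
    return min(math.dist(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    """MAX: distance between the two farthest points across the clusters."""
    return max(math.dist(a, b) for a in c1 for b in c2)

def group_average(c1, c2):
    """Group average: mean of all pairwise cross-cluster distances."""
    return sum(math.dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))
```

Plugging a different reduction into the same merge loop yields the different hierarchical algorithms.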
Cluster Similarity: MIN or Single Link
Similarity of two clusters is based on the two most similar (closest) points in the different clusters.
MST Divisive Hierarchical Clustering
Use an MST (minimum spanning tree) for constructing a hierarchy of clusters.
Cluster Validity
For supervised classification, we have a variety of measures to evaluate how good our model is.
For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"! Then why do we want to evaluate them? To avoid finding patterns in noise, to compare clustering algorithms, to compare two sets of clusters, and to compare two clusters.
Clusters found in Random Data
[Figure: 100 random points in the unit square, and the clusters found in them by DBSCAN, K-means, and complete link]
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information. Use only the data.
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index: used to measure the extent to which cluster labels match externally supplied class labels (e.g., entropy).
o Internal Index: used to measure the goodness of a clustering structure without respect to external information (e.g., SSE).
o Relative Index: used to compare two different clusterings or clusters; often an external or internal index is used for this function.
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices:
o The proximity matrix.
o An 'incidence' matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, 0 otherwise.
Compute the correlation between the two matrices; since both are symmetric, only the n(n-1)/2 entries above the diagonal need to be considered.
High correlation (in magnitude) indicates that points that belong to the same cluster are close to each other.
Not a good measure for some density- or contiguity-based clusters.
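Over the n(n-1)/2 distinct pairs, this is an ordinary Pearson correlation between the pairwise distances and the 0/1 same-cluster indicator (a self-contained sketch; the function name is mine):

```python
import math
from itertools import combinations

def incidence_proximity_corr(points, assignment):
    """Correlation between the proximity matrix (pairwise distances)
    and the incidence matrix (1 if a pair shares a cluster, else 0),
    computed over the n(n-1)/2 distinct pairs."""
    pairs = list(combinations(range(len(points)), 2))
    prox = [math.dist(points[i], points[j]) for i, j in pairs]
    inc = [1.0 if assignment[i] == assignment[j] else 0.0 for i, j in pairs]

    def pearson(xs, ys):
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    return pearson(prox, inc)
```

Because same-cluster pairs should have small distances, a good clustering gives a strongly negative correlation between distance and incidence, which matches the negative values on the slides.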
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
[Figure: two data sets in the unit square, one with three well-separated clusters (Corr = -0.9235) and one of random points (Corr = -0.5810)]
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
[Figure: a data set with three well-separated clusters; its similarity matrix, sorted by cluster label, shows a crisp block-diagonal structure]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: similarity matrix, sorted by cluster label, for DBSCAN clusters found in random data]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: similarity matrix, sorted by cluster label, for K-means clusters found in random data]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: similarity matrix, sorted by cluster label, for complete link clusters found in random data]
Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN clusters 1-7 in a more complicated data set, and the corresponding similarity matrix]
Clusters in more complicated figures aren't well separated.
Internal Index: used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
[Figure: a data set with 10 well-separated clusters, and the SSE curve as K increases from 2 to 30; the sharp knee suggests the natural number of clusters]
Internal Measures SSE
SSE curve for a more complicated data set
[Figure: SSE of clusters found using K-means on the more complicated data set with clusters 1-7]
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses, a framework is less necessary; however, there is still the question of whether the difference between two index values is significant.
Framework for Cluster Validity
Example
Statistical Framework for SSE
[Figure: histogram of the SSE of three clusters found in many sets of random data points; the values range from roughly 0.016 to 0.034, so a much lower SSE on real data is unlikely to arise by chance]
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
[Figure: the same two data sets, highly clustered (Corr = -0.9235) and random (Corr = -0.5810)]
Cluster Cohesion: measures how closely related the objects in a cluster are.
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.
Example: Squared Error
Internal Measures Cohesion and Separation
$\mathrm{WSS} = \sum_{i} \sum_{x \in C_i} (x - m_i)^2$
$\mathrm{BSS} = \sum_{i} |C_i| \, (m - m_i)^2$
where $m_i$ is the mean of cluster $C_i$ and $m$ is the overall mean.
Internal Measures Cohesion and Separation
Example: SSE for the points 1, 2, 4, 5 on a line, with overall mean m = 3.
K = 1 cluster:
$\mathrm{WSS} = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10$
$\mathrm{BSS} = 4 \times (3-3)^2 = 0$
$\mathrm{Total} = 10 + 0 = 10$
K = 2 clusters $\{1, 2\}$ and $\{4, 5\}$, with means $m_1 = 1.5$ and $m_2 = 4.5$:
$\mathrm{WSS} = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1$
$\mathrm{BSS} = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9$
$\mathrm{Total} = 1 + 9 = 10$
Note that WSS + BSS stays constant.
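The arithmetic in this 1-D example (points 1, 2, 4, 5 with overall mean 3) can be checked mechanically with a small helper of my own:

```python
def wss_bss(xs, clusters):
    """WSS = sum of squared distances of points to their cluster mean;
    BSS = sum over clusters of |C_i| * (overall mean - cluster mean)^2."""
    m = sum(xs) / len(xs)                      # overall mean
    wss = sum((x - sum(c) / len(c)) ** 2 for c in clusters for x in c)
    bss = sum(len(c) * (m - sum(c) / len(c)) ** 2 for c in clusters)
    return wss, bss
```

For these points, K = 1 gives (WSS, BSS) = (10, 0) and K = 2 gives (1, 9); the total is 10 either way, illustrating that WSS + BSS is constant for a fixed data set.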
A proximity graph based approach can also be used for cohesion and separation: cluster cohesion is the sum of the weights of all links within a cluster, and cluster separation is the sum of the weights of links between nodes in the cluster and nodes outside the cluster.
Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.
For an individual point i:
o a = average distance of i to the other points in its own cluster.
o b = minimum, over the other clusters, of the average distance of i to the points in that cluster.
o s = (b - a) / max(a, b); typically between 0 and 1, and the closer to 1 the better.
Can calculate the average silhouette width for a cluster or a clustering.
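Per point, the computation is just two averages and a ratio (a minimal sketch; names are mine, and it assumes every cluster has at least two points):

```python
import math

def silhouette(i, points, assignment):
    """Silhouette coefficient of point i:
       a = average distance of i to the other points in its cluster,
       b = minimum, over the other clusters, of the average distance
           of i to the points in that cluster,
       s = (b - a) / max(a, b), which lies in [-1, 1]."""
    own = assignment[i]

    def avg_dist(cluster):
        ds = [math.dist(points[i], points[j])
              for j in range(len(points))
              if assignment[j] == cluster and j != i]
        return sum(ds) / len(ds)

    a = avg_dist(own)
    b = min(avg_dist(c) for c in set(assignment) if c != own)
    return (b - a) / max(a, b)
```

Averaging s over all points gives the average silhouette width of the clustering.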
External Measures of Cluster Validity Entropy and Purity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
– Algorithms for Clustering Data, Jain and Dubes
Final Comment on Cluster Validity
Similarity matrix for points I1-I5 (used in the inter-cluster similarity examples):
     I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
Other Distinctions Between Sets of Clusters
Exclusive versus non-exclusive
Fuzzy versus non-fuzzy
Partial versus complete
Heterogeneous versus homogeneous
Types of Clusters
Well-separated clusters Center-based clusters Contiguous clusters Density-based clusters Property or Conceptual Described by an Objective Function
Types of Clusters Well-Separated
Well-Separated Clusters
3 well-separated clusters
Types of Clusters Center-Based
Center-based
centroidmedoid
4 center-based clusters
Types of Clusters Contiguity-Based
Contiguous Cluster (Nearest neighbor or Transitive)
8 contiguous clusters
Types of Clusters Density-Based
Density-based
6 density-based clusters
Types of Clusters Conceptual Clusters
Shared Property or Conceptual Clusters
2 Overlapping Circles
Types of Clusters Objective Function
Clusters Defined by an Objective Function
Types of Clusters Objective Function hellip
Map the clustering problem to a different domain and solve a related problem in that domain
Characteristics of the Input Data Are Important
Type of proximity or density measure
Sparseness
Attribute type
Type of Data
Dimensionality Noise and Outliers Type of Distribution
Clustering Algorithms
K-means and its variants
Hierarchical clustering
K-means Clustering
o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple
K-means Clustering ndash Details
Initial centroids are often chosen randomly
The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity
correlation etc K-means will converge for common similarity measures mentioned
above Most of the convergence happens in the first few iterations
Complexity is O( n K I d )
sumisin
=iCxi
i xC
m||
1
Two different K-means Clusterings
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Optimal Clustering
Original Points
Importance of Choosing Initial Centroids
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 6
Importance of Choosing Initial Centroids
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 6
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
sumsum= isin
=K
i Cxi
i
xmdistSSE1
2 )(
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
xy
Iteration 5
Problems with Selecting Initial Points
If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids
o Postprocessing o Bisecting K-means
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies
Updating Centers Incrementally
In the basic K-means algorithm centroids are updated after all points are assigned to a centroid
An alternative is to update the centroids after each assignment (incremental approach)
Pre-processing and Post-processing
Pre-processing
Post-processing
Bisecting K-means
Bisecting K-means algorithm
Bisecting K-means Example
Limitations of K-means
K-means has problems when clusters are of differing
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
1 3 2 5 4 60
005
01
015
02
1
2
3
4
5
6
1
23 4
5
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures aren't well separated.
Internal Measures: SSE
Internal Index: used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE).
Can also be used to estimate the number of clusters.
[Figure: a data set with ten clusters, and its SSE curve as K ranges from 2 to 30]
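Estimating the number of clusters from SSE can be sketched as follows. This is a toy 1-D K-means of our own naming; exact values vary with the random initial centroids.

```python
import random

def kmeans_sse(points, k, iters=20, seed=0):
    """Run a basic K-means on 1-D points and return the final SSE (a sketch)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for x in points:
            clusters[min(range(k), key=lambda j: (x - centroids[j]) ** 2)].append(x)
        # recompute each centroid as the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return sum(min((x - c) ** 2 for c in centroids) for x in points)

# Three obvious groups: SSE falls steeply up to K = 3 and flattens afterwards
# (exact values depend on the random initial centroids).
data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]
for k in range(1, 5):
    print(k, round(kmeans_sse(data, k), 3))
```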
Internal Measures: SSE
SSE curve for a more complicated data set:
[Figure: clusters 1-7 in a more complicated data set]
SSE of clusters found using K-means.
Framework for Cluster Validity
Need a framework to interpret any measure.
Statistics provide a framework for cluster validity.
For comparing the results of two different sets of cluster analyses, a framework is less necessary.
Statistical Framework for SSE
Example:
[Figure: histogram of SSE values (count vs. SSE) for clusterings of random data, alongside a scatter plot of the original data set]
Statistical Framework for Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
[Figure: the two data sets (x vs. y); Corr = -0.9235 (left), Corr = -0.5810 (right)]
Internal Measures: Cohesion and Separation
Cluster Cohesion: measures how closely related the objects in a cluster are.
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.
Example: Squared Error
$WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2$
$BSS = \sum_i |C_i| \, (m - m_i)^2$
Internal Measures: Cohesion and Separation
Example: SSE
Data points 1, 2, 4, 5 on a line; overall mean m = 3; for K = 2 the cluster means are m1 = 1.5 and m2 = 4.5.

K = 2 clusters:
WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
BSS = 2 × (3-1.5)^2 + 2 × (4.5-3)^2 = 9
Total = 1 + 9 = 10

K = 1 cluster:
WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
BSS = 4 × (3-3)^2 = 0
Total = 10 + 0 = 10
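The example can be checked mechanically. A sketch for 1-D points (helper name ours); note that WSS + BSS stays constant for a fixed data set:

```python
def wss_bss(clusters):
    """WSS and BSS for 1-D clusters; their sum is constant for a data set."""
    points = [x for c in clusters for x in c]
    m = sum(points) / len(points)                       # overall mean
    wss = sum((x - sum(c) / len(c)) ** 2 for c in clusters for x in c)
    bss = sum(len(c) * (m - sum(c) / len(c)) ** 2 for c in clusters)
    return wss, bss

print(wss_bss([[1, 2, 4, 5]]))       # K = 1: (10.0, 0.0)
print(wss_bss([[1, 2], [4, 5]]))     # K = 2: (1.0, 9.0)
```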
Internal Measures: Cohesion and Separation
A proximity-graph-based approach can also be used for cohesion and separation.
[Figure: proximity graphs illustrating cohesion (edge weights within a cluster) and separation (edge weights between clusters)]
Internal Measures: Silhouette Coefficient
The silhouette coefficient combines ideas of both cohesion and separation, but for individual points as well as for clusters and clusterings.
For an individual point i: let a be the average distance of i to the points in its own cluster, and b the minimum, over the other clusters, of the average distance of i to the points in that cluster; then $s = (b - a) / \max(a, b)$.
Can calculate the average silhouette width for a cluster or a clustering.
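A minimal sketch of the computation (hypothetical helper, Euclidean distance assumed):

```python
import numpy as np

def silhouette(points, labels):
    """Silhouette coefficient s(i) = (b - a) / max(a, b) for each point.

    a = mean distance of i to the other points in its own cluster
    b = smallest mean distance of i to the points of any other cluster
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    diff = points[:, None, :] - points[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        a = d[i][same & (np.arange(len(points)) != i)].mean()
        b = min(d[i][labels == c].mean()
                for c in set(labels.tolist()) - {labels[i]})
        scores.append((b - a) / max(a, b))
    return scores

# Two tight clusters far apart: silhouettes near 1.
s = silhouette([(0, 0), (0, 1), (20, 0), (20, 1)], [0, 0, 1, 1])
```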
External Measures of Cluster Validity: Entropy and Purity
Final Comment on Cluster Validity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
- Algorithms for Clustering Data, Jain and Dubes
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function …
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering – Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids …
Importance of Choosing Initial Centroids …
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Ward's Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
K-means Clustering – Details
Initial centroids are often chosen randomly.
The centroid $m_i$ is (typically) the mean of the points in the cluster: $m_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$
'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
K-means will converge for the common similarity measures mentioned above; most of the convergence happens in the first few iterations.
Complexity is O(n K I d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes.
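The basic algorithm can be sketched as follows. This is a minimal 2-D version of our own naming; initialization, tie-breaking, and stopping rules are simplified.

```python
import random

def kmeans(points, k, iters=100, seed=1):
    """Basic K-means: assign each point to its nearest centroid, recompute means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: (p[0] - centroids[j][0]) ** 2
                                + (p[1] - centroids[j][1]) ** 2)
            clusters[j].append(p)
        new = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
               if c else centroids[j] for j, c in enumerate(clusters)]
        if new == centroids:                       # converged
            break
        centroids = new
    return centroids, clusters

pts = [(0, 0), (0.2, 0.1), (5, 5), (5.1, 4.9)]
centroids, clusters = kmeans(pts, 2)
```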
Two different K-means Clusterings
[Figure: the original points, together with an optimal clustering and a sub-optimal clustering of the same data set]
Importance of Choosing Initial Centroids
[Figure: iterations 1-6 of K-means from one choice of initial centroids]
Importance of Choosing Initial Centroids
[Figure: the same iterations 1-6 shown frame by frame]
Evaluating K-means Clusters
The most common measure is Sum of Squared Error (SSE):
$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)$
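Translated directly into code (helper name ours; Euclidean distance and 2-D points assumed):

```python
def sse(clusters, centroids):
    """SSE = sum over clusters i and points x in C_i of dist(m_i, x)^2."""
    return sum((x[0] - m[0]) ** 2 + (x[1] - m[1]) ** 2
               for c, m in zip(clusters, centroids) for x in c)

clusters = [[(0, 0), (0, 2)], [(5, 5), (7, 5)]]
centroids = [(0, 1), (6, 5)]      # each cluster's mean
print(sse(clusters, centroids))   # 1 + 1 + 1 + 1 = 4
```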
Importance of Choosing Initial Centroids …
[Figure: iterations 1-5 of K-means from another choice of initial centroids]
Importance of Choosing Initial Centroids …
[Figure: the same iterations 1-5 shown frame by frame]
Problems with Selecting Initial Points
If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small.
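Under the simplifying assumption (ours, for illustration) that the K clusters have equal size and the K initial centroids are drawn uniformly at random, this chance is K!/K^K:

```python
import math

def prob_one_per_cluster(k):
    """P(K uniform random picks hit K equal-size clusters once each) = K!/K^K."""
    return math.factorial(k) / k ** k

for k in (2, 5, 10):
    print(k, prob_one_per_cluster(k))
# K = 10 already gives 10!/10^10 = 0.00036: almost never a good start.
```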
10 Clusters Example
[Figure: iterations 1-4 of K-means on a data set with ten clusters]
Starting with two initial centroids in one cluster of each pair of clusters.
10 Clusters Example
[Figure: the same iterations 1-4 shown frame by frame]
Starting with two initial centroids in one cluster of each pair of clusters.
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one.
[Figure: iterations 1-4 of K-means on the ten-cluster data set]
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one.
[Figure: the same iterations 1-4 shown frame by frame]
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids
o Postprocessing
o Bisecting K-means
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies exist, e.g., replacing the empty centroid with the point that contributes most to SSE, or with a point from the cluster with the highest SSE.
Updating Centers Incrementally
In the basic K-means algorithm centroids are updated after all points are assigned to a centroid
An alternative is to update the centroids after each assignment (incremental approach)
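The incremental update relies on the running-mean identity new mean = old mean + (x - old mean)/n, so each step is O(1). A 1-D sketch with hypothetical names:

```python
def incremental_assign(centroids, counts, point):
    """Assign one point and immediately move its centroid toward it.

    new mean = old mean + (x - old mean) / n, an O(1) update.
    (1-D sketch; never produces an empty cluster, but the result becomes
    order-dependent, which is the price paid.)
    """
    j = min(range(len(centroids)), key=lambda j: (point - centroids[j]) ** 2)
    counts[j] += 1
    centroids[j] += (point - centroids[j]) / counts[j]
    return j

centroids, counts = [0.0, 10.0], [1, 1]
incremental_assign(centroids, counts, 2.0)   # centroid 0 moves from 0.0 to 1.0
```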
Pre-processing and Post-processing
Pre-processing
Post-processing
Bisecting K-means
Bisecting K-means algorithm
Bisecting K-means Example
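One common formulation of bisecting K-means, sketched in 1-D (our naming and simplifications; real implementations usually run several trial bisections and keep the split with the lowest SSE):

```python
import random

def two_means(points, seed=0):
    """Split a 1-D cluster in two with a few Lloyd iterations (a sketch).

    Assumes the cluster being split has at least two points.
    """
    rng = random.Random(seed)
    c = rng.sample(points, 2)
    for _ in range(10):
        halves = [[], []]
        for x in points:
            halves[0 if (x - c[0]) ** 2 <= (x - c[1]) ** 2 else 1].append(x)
        c = [sum(h) / len(h) if h else c[i] for i, h in enumerate(halves)]
    return [h for h in halves if h]

def bisecting_kmeans(points, k):
    """Start with one cluster; repeatedly bisect the cluster with highest SSE."""
    def cluster_sse(c):
        m = sum(c) / len(c)
        return sum((x - m) ** 2 for x in c)
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=cluster_sse)   # pick the cluster to split
        clusters.remove(worst)
        clusters.extend(two_means(worst))
    return clusters

cs = bisecting_kmeans([0.0, 0.1, 5.0, 5.1, 10.0, 10.1], 3)
```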
Limitations of K-means
K-means has problems when clusters are of differing sizes, densities, or non-globular shapes.
K-means has problems when the data contains outliers.
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree; can be visualized as a dendrogram.
[Figure: six points (1-6) shown as nested clusters and as a dendrogram with merge heights]
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering: agglomerative (start with the points as individual clusters and merge) and divisive (start with one all-inclusive cluster and split).
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm
The more popular hierarchical clustering technique.
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
[Diagram: points p1-p5 and their proximity matrix]
Intermediate Situation
After some merging steps we have some clusters
[Diagram: clusters C1-C5 and their proximity matrix]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
[Diagram: clusters C1-C5 before merging C2 and C5, with the proximity matrix]
After Merging
The question is: "How do we update the proximity matrix?"
[Diagram: clusters C1, C3, C4, and the merged cluster C2 ∪ C5, with the partially updated proximity matrix]
How to Define Inter-Cluster Similarity
[Diagram: two clusters of points p1-p5 and the proximity matrix]
MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function (Ward's Method uses squared error)
Cluster Similarity: MIN or Single Link
Similarity of two clusters is based on the two most similar (closest) points in the different clusters.
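A compact sketch of agglomerative clustering with the MIN (single-link) definition, in 1-D (our naming; a quadratic brute-force search over cluster pairs):

```python
def single_link(points, k):
    """Agglomerative clustering with MIN (single link): repeatedly merge the
    two clusters whose closest pair of points is closest, until k remain."""
    clusters = [[p] for p in points]                  # each point starts alone
    def d(a, b):
        return min(abs(x - y) for x in a for y in b)  # MIN inter-cluster distance
    while len(clusters) > k:
        pairs = [(d(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)                          # closest pair of clusters
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(single_link([0.0, 0.4, 0.5, 6.0, 6.2, 9.0], 3))
# → [[0.0, 0.4, 0.5], [6.0, 6.2], [9.0]]
```

Note how the point 0.0 joins {0.4, 0.5} through its nearest neighbor: single link handles non-elliptical shapes well, but the same chaining makes it sensitive to noise between clusters.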
MST: Divisive Hierarchical Clustering
Use a minimum spanning tree (MST) for constructing the hierarchy of clusters.
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is.
For cluster analysis the analogous question is: how to evaluate the "goodness" of the resulting clusters? But "clusters are in the eye of the beholder"! Then why do we want to evaluate them?
Clusters found in Random Data
[Figure: random points, and the clusters found in them by K-means, DBSCAN, and complete link]
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information; use only the data.
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Measures of Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices.
Two matrices: the proximity matrix, and an incidence matrix with one row and one column per point, whose entry is 1 if the pair of points belongs to the same cluster and 0 otherwise.
Types of Clusters Well-Separated
Well-Separated Clusters
3 well-separated clusters
Types of Clusters Center-Based
Center-based
centroidmedoid
4 center-based clusters
Types of Clusters Contiguity-Based
Contiguous Cluster (Nearest neighbor or Transitive)
8 contiguous clusters
Types of Clusters Density-Based
Density-based
6 density-based clusters
Types of Clusters Conceptual Clusters
Shared Property or Conceptual Clusters
2 Overlapping Circles
Types of Clusters Objective Function
Clusters Defined by an Objective Function
Types of Clusters Objective Function hellip
Map the clustering problem to a different domain and solve a related problem in that domain
Characteristics of the Input Data Are Important
Type of proximity or density measure
Sparseness
Attribute type
Type of Data
Dimensionality Noise and Outliers Type of Distribution
Clustering Algorithms
K-means and its variants
Hierarchical clustering
K-means Clustering
o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple
K-means Clustering ndash Details
Initial centroids are often chosen randomly
The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity
correlation etc K-means will converge for common similarity measures mentioned
above Most of the convergence happens in the first few iterations
Complexity is O( n K I d )
sumisin
=iCxi
i xC
m||
1
Two different K-means Clusterings
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Optimal Clustering
Original Points
Importance of Choosing Initial Centroids
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 6
Importance of Choosing Initial Centroids
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 6
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
sumsum= isin
=K
i Cxi
i
xmdistSSE1
2 )(
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
xy
Iteration 5
Problems with Selecting Initial Points
If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids
o Postprocessing o Bisecting K-means
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies
Updating Centers Incrementally
In the basic K-means algorithm centroids are updated after all points are assigned to a centroid
An alternative is to update the centroids after each assignment (incremental approach)
Pre-processing and Post-processing
Pre-processing
Post-processing
Bisecting K-means
Bisecting K-means algorithm
Bisecting K-means Example
Limitations of K-means
K-means has problems when clusters are of differing
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
1 3 2 5 4 60
005
01
015
02
1
2
3
4
5
6
1
23 4
5
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
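The four basic inter-cluster proximities listed above can be compared on a small example. A hedged sketch for two clusters of 1-D points (real toolkits such as scipy.cluster.hierarchy expose these as 'single', 'complete', 'average', and 'centroid' linkage):

```python
# The four basic inter-cluster proximities for two clusters of 1-D points.
def proximities(A, B):
    pairs = [abs(a - b) for a in A for b in B]   # all cross-cluster distances
    return {
        "MIN (single link)": min(pairs),
        "MAX (complete link)": max(pairs),
        "Group Average": sum(pairs) / len(pairs),
        "Centroid Distance": abs(sum(A) / len(A) - sum(B) / len(B)),
    }

p = proximities([1, 2], [4, 5])
# MIN = |2-4| = 2, MAX = |1-5| = 4,
# Group Average = (3+4+2+3)/4 = 3, Centroid Distance = |1.5 - 4.5| = 3
```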
Cluster Similarity: MIN or Single Link
Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering
Use a minimum spanning tree (MST) for constructing the hierarchy of clusters
Cluster Validity
For supervised classification, we have a variety of measures to evaluate how good our model is.
For cluster analysis, the analogous question is: how do we evaluate the "goodness" of the resulting clusters? But "clusters are in the eye of the beholder"! Then why do we want to evaluate them?
Clusters found in Random Data
[Figure: four scatter plots of the same random points in the unit square (x, y from 0 to 1): the raw Random Points, and the "clusters" found in them by K-means, DBSCAN, and Complete Link.]
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data)
4. Comparing the results of two different sets of cluster analyses to determine which is better
5. Determining the 'correct' number of clusters
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index: measures the extent to which cluster labels match externally supplied class labels (e.g., entropy, purity)
o Internal Index: measures the goodness of a clustering structure without respect to external information (e.g., SSE)
o Relative Index: compares two different clusterings or clusters
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices:
o Proximity matrix
o "Incidence" matrix: one row and one column per data point; an entry is 1 if the corresponding pair of points belongs to the same cluster, and 0 otherwise
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density- or contiguity-based clusters
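The correlation check can be sketched with numpy (an illustrative example on synthetic data; since distance rather than similarity is used as the proximity here, a good clustering shows up as a strongly *negative* correlation, which is why the Corr values reported on these slides are negative):

```python
# Correlation between the incidence matrix (1 if two points share a cluster)
# and the distance matrix, computed over the n(n-1)/2 distinct point pairs.
import numpy as np

points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])            # two tight, well-separated clusters

n = len(points)
iu = np.triu_indices(n, k=1)               # index pairs with i < j
distance = np.linalg.norm(points[:, None] - points[None, :], axis=-1)[iu]
incidence = (labels[:, None] == labels[None, :]).astype(float)[iu]
corr = np.corrcoef(incidence, distance)[0, 1]
# corr is close to -1: same-cluster pairs have small distances
```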
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
[Figure: two data sets in the unit square, one with well-separated clusters and one of uniformly random points.]
Corr = -0.9235 (clustered data)   Corr = -0.5810 (random data)
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
[Figure: scatter of points in the unit square with well-separated clusters, and the corresponding 100 x 100 point-by-point similarity matrix sorted by cluster label, showing bright blocks on the diagonal (similarity scale 0 to 1).]
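The reordering trick can be sketched with numpy. This is an illustrative sketch: the similarity function 1 / (1 + distance) and the synthetic two-cluster data are assumptions, not part of the slides.

```python
# Order a similarity matrix by cluster label: well-separated clusters show up
# as high-similarity blocks on the diagonal.
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal([0, 0], 0.1, (20, 2))       # cluster around (0, 0)
b = rng.normal([3, 3], 0.1, (20, 2))       # cluster around (3, 3)
points = np.vstack([a, b])
labels = np.array([0] * 20 + [1] * 20)

order = np.argsort(labels, kind="stable")  # group rows/columns by cluster
d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
sim = 1.0 / (1.0 + d)                      # similarity decays with distance
sim_sorted = sim[np.ix_(order, order)]

# block structure: within-cluster similarity far exceeds between-cluster
within = sim_sorted[:20, :20].mean()
between = sim_sorted[:20, 20:].mean()
```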
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: DBSCAN clusters found in random points, with the corresponding sorted similarity matrix (similarity scale 0 to 1); the diagonal blocks are not crisp.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: K-means clusters found in random points, with the corresponding sorted similarity matrix.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: Complete Link clusters found in random points, with the corresponding sorted similarity matrix.]
Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN clusters labeled 1 through 7 found in a more complicated data set, with the corresponding sorted similarity matrix over roughly 3000 points.]
Clusters in more complicated figures aren't well separated.
Internal Index: Used to measure the goodness of a clustering structure without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
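Estimating the number of clusters from SSE can be sketched end-to-end. The snippet below is a minimal, deterministic Lloyd's k-means on 1-D data (an illustration only; the initialization and data are assumptions):

```python
# SSE always decreases as K grows, but it drops sharply until K reaches the
# natural number of clusters and then flattens out: the "elbow".
def kmeans_sse(points, k, iters=20):
    # deterministic initialization: k roughly evenly spaced points
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: (p - centroids[c]) ** 2)
            clusters[j].append(p)
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return sum(min((p - c) ** 2 for c in centroids) for p in points)

# three well-separated 1-D clusters around 0, 10, and 20
data = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2, 20.0, 20.1, 20.2]
sse = {k: kmeans_sse(data, k) for k in (1, 2, 3, 4)}
# SSE falls steeply up to K = 3, then barely improves: the elbow is at K = 3
```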
Internal Measures SSE
[Figure: a data set containing ten clusters (x roughly 5 to 15, y roughly -6 to 6) and its SSE-versus-K curve for K from 2 to 30, with knees in the curve at the natural numbers of clusters.]
Internal Measures SSE
SSE curve for a more complicated data set
[Figure: the more complicated data set with clusters labeled 1 through 7, and the SSE of clusters found using K-means.]
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
[Figure: histogram of SSE values (Count versus SSE, roughly 0.016 to 0.034) obtained by clustering many sets of random points, next to one such set of random points in the unit square.]
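A tiny simulation in the same spirit, to ask whether an observed SSE is lower than random data would give. This is a crude stand-in (it "clusters" 1-D data by sorting and cutting into K equal groups, which is not K-means; data and names are illustrative assumptions):

```python
# Statistical framework sketch: compare the SSE of structured data against
# the SSE distribution obtained from random data of the same size.
import random

def grouped_sse(points, k):
    pts = sorted(points)
    step = len(pts) // k
    groups = [pts[i * step:(i + 1) * step] for i in range(k)]
    return sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

rng = random.Random(0)
# structured data: three tight 1-D clusters around 0, 10, 20
structured = [rng.gauss(c, 0.1) for c in (0, 10, 20) for _ in range(30)]
# reference distribution: SSE of 200 uniformly random data sets
random_sses = [grouped_sse([rng.uniform(0, 20) for _ in range(90)], 3)
               for _ in range(200)]
observed = grouped_sse(structured, 3)
# observed SSE falls far below the entire random-data SSE distribution,
# so the cluster structure is very unlikely to be a chance artifact
```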
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
[Figure: the same two data sets as before, well-separated clusters and uniformly random points in the unit square.]
Corr = -0.9235 (clustered data)   Corr = -0.5810 (random data)
Cluster Cohesion: Measures how closely related the objects in a cluster are
Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
Cohesion is measured by the within-cluster sum of squares: WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2
Separation is measured by the between-cluster sum of squares: BSS = \sum_i |C_i| (m - m_i)^2
where m_i is the centroid of cluster C_i, |C_i| is its size, and m is the overall mean.
Internal Measures Cohesion and Separation
Example: SSE for the points 1, 2, 4, 5 on a number line, with overall centroid m = 3.
K = 1 cluster:
WSS = (1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2 = 10
BSS = 4 × (3 - 3)^2 = 0
Total = 10 + 0 = 10
K = 2 clusters {1, 2} and {4, 5}, with centroids m1 = 1.5 and m2 = 4.5:
WSS = (1 - 1.5)^2 + (2 - 1.5)^2 + (4 - 4.5)^2 + (5 - 4.5)^2 = 1
BSS = 2 × (3 - 1.5)^2 + 2 × (4.5 - 3)^2 = 9
Total = 1 + 9 = 10
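The bookkeeping above can be verified in a few lines: for any partition of the same points, WSS + BSS equals the total SSE about the overall mean.

```python
# Verify WSS + BSS is constant across clusterings of the same points.
def wss_bss(points, clusters):
    m = sum(points) / len(points)                       # overall mean
    wss = sum((x - sum(c) / len(c)) ** 2 for c in clusters for x in c)
    bss = sum(len(c) * (m - sum(c) / len(c)) ** 2 for c in clusters)
    return wss, bss

pts = [1, 2, 4, 5]
w1, b1 = wss_bss(pts, [[1, 2, 4, 5]])      # K = 1: WSS = 10, BSS = 0
w2, b2 = wss_bss(pts, [[1, 2], [4, 5]])    # K = 2: WSS = 1,  BSS = 9
# w1 + b1 == w2 + b2 == 10, matching the slide's example
```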
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
[Figure: a proximity graph, with within-cluster edges illustrating cohesion and between-cluster edges illustrating separation.]
Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.
For an individual point i:
o Calculate a = average distance of i to the points in its cluster
o Calculate b = min (average distance of i to the points in another cluster)
o The silhouette coefficient of i is s = (b - a) / max(a, b)
Can calculate the average silhouette width for a cluster or a clustering.
Internal Measures Silhouette Coefficient
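The per-point computation is short enough to write out directly. A minimal sketch on 1-D points with distinct values (the clustering itself is assumed given):

```python
# Silhouette coefficient for one point: a = average distance to the other
# points in its own cluster, b = smallest average distance to the points of
# any other cluster, s = (b - a) / max(a, b). Values near 1 are good.
def silhouette(i, clusters):
    own = next(c for c in clusters if i in c)
    a = sum(abs(i - x) for x in own if x != i) / (len(own) - 1)
    b = min(sum(abs(i - x) for x in c) / len(c)
            for c in clusters if i not in c)
    return (b - a) / max(a, b)

clusters = [[1, 2], [8, 9]]
s = silhouette(1, clusters)
# a = 1, b = (7 + 8) / 2 = 7.5, so s = 6.5 / 7.5, about 0.867
```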
External Measures of Cluster Validity Entropy and Purity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
(Algorithms for Clustering Data, Jain and Dubes)
Final Comment on Cluster Validity
Similarity matrix for points I1-I5:

        I1     I2     I3     I4     I5
I1    1.00   0.90   0.10   0.65   0.20
I2    0.90   1.00   0.70   0.60   0.50
I3    0.10   0.70   1.00   0.40   0.30
I4    0.65   0.60   0.40   1.00   0.80
I5    0.20   0.50   0.30   0.80   1.00
Types of Clusters Center-Based
Center-based
centroidmedoid
4 center-based clusters
Types of Clusters Contiguity-Based
Contiguous Cluster (Nearest neighbor or Transitive)
8 contiguous clusters
Types of Clusters Density-Based
Density-based
6 density-based clusters
Types of Clusters Conceptual Clusters
Shared Property or Conceptual Clusters
2 Overlapping Circles
Types of Clusters Objective Function
Clusters Defined by an Objective Function
Types of Clusters Objective Function hellip
Map the clustering problem to a different domain and solve a related problem in that domain
Characteristics of the Input Data Are Important
Type of proximity or density measure
Sparseness
Attribute type
Type of Data
Dimensionality Noise and Outliers Type of Distribution
Clustering Algorithms
K-means and its variants
Hierarchical clustering
K-means Clustering
o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple
K-means Clustering ndash Details
Initial centroids are often chosen randomly
The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity
correlation etc K-means will converge for common similarity measures mentioned
above Most of the convergence happens in the first few iterations
Complexity is O( n K I d )
sumisin
=iCxi
i xC
m||
1
Two different K-means Clusterings
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Optimal Clustering
Original Points
Importance of Choosing Initial Centroids
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 6
Importance of Choosing Initial Centroids
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 6
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
sumsum= isin
=K
i Cxi
i
xmdistSSE1
2 )(
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
xy
Iteration 5
Problems with Selecting Initial Points
If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids
o Postprocessing o Bisecting K-means
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies
Updating Centers Incrementally
In the basic K-means algorithm centroids are updated after all points are assigned to a centroid
An alternative is to update the centroids after each assignment (incremental approach)
Pre-processing and Post-processing
Pre-processing
Post-processing
Bisecting K-means
Bisecting K-means algorithm
Bisecting K-means Example
Limitations of K-means
K-means has problems when clusters are of differing
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
1 3 2 5 4 60
005
01
015
02
1
2
3
4
5
6
1
23 4
5
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
Types of Clusters Contiguity-Based
Contiguous Cluster (Nearest neighbor or Transitive)
8 contiguous clusters
Types of Clusters Density-Based
Density-based
6 density-based clusters
Types of Clusters Conceptual Clusters
Shared Property or Conceptual Clusters
2 Overlapping Circles
Types of Clusters Objective Function
Clusters Defined by an Objective Function
Types of Clusters Objective Function hellip
Map the clustering problem to a different domain and solve a related problem in that domain
Characteristics of the Input Data Are Important
Type of proximity or density measure
Sparseness
Attribute type
Type of Data
Dimensionality Noise and Outliers Type of Distribution
Clustering Algorithms
K-means and its variants
Hierarchical clustering
K-means Clustering
o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple
K-means Clustering ndash Details
Initial centroids are often chosen randomly
The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity
correlation etc K-means will converge for common similarity measures mentioned
above Most of the convergence happens in the first few iterations
Complexity is O( n K I d )
sumisin
=iCxi
i xC
m||
1
Two different K-means Clusterings
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Optimal Clustering
Original Points
Importance of Choosing Initial Centroids
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 6
Importance of Choosing Initial Centroids
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 6
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
sumsum= isin
=K
i Cxi
i
xmdistSSE1
2 )(
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
xy
Iteration 5
Problems with Selecting Initial Points
If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small
10 Clusters Example

Starting with two initial centroids in one cluster of each pair of clusters.

[Figures: K-means on a data set with 10 clusters arranged in pairs (x in [0, 20], y in [-6, 8]); panels show Iterations 1 through 4.]
10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while others have only one.

[Figures: K-means on the 10-cluster data set (x in [0, 20], y in [-6, 8]); panels show Iterations 1 through 4.]
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids
o Postprocessing
o Bisecting K-means
Handling Empty Clusters

The basic K-means algorithm can yield empty clusters.

Several strategies exist, e.g., replacing the empty cluster's centroid with the point that contributes most to SSE, or with a point from the cluster with the highest SSE.
Updating Centers Incrementally
In the basic K-means algorithm centroids are updated after all points are assigned to a centroid
An alternative is to update the centroids after each assignment (incremental approach)
Pre-processing and Post-processing
Pre-processing
Post-processing
Bisecting K-means
Bisecting K-means algorithm
Bisecting K-means Example
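A sketch of the bisecting idea: keep splitting one cluster at a time with 2-means until K clusters remain. Here the cluster with the largest SSE is chosen for splitting, which is one common variant; the helper routines and the sample points are invented for the illustration:

```python
import math
import random

def two_means(pts, seed=0):
    """Split pts into two clusters with a few iterations of basic 2-means."""
    rng = random.Random(seed)
    m = rng.sample(pts, 2)                       # two random points as initial centroids
    halves = [[], []]
    for _ in range(50):
        halves = [[], []]
        for p in pts:                            # assign to the closer centroid
            halves[math.dist(p, m[0]) > math.dist(p, m[1])].append(p)
        m = [tuple(sum(c) / len(h) for c in zip(*h)) if h else m[i]
             for i, h in enumerate(halves)]      # recompute centroids
    return [h for h in halves if h]

def cluster_sse(pts):
    """SSE of one cluster around its own mean."""
    m = tuple(sum(c) / len(pts) for c in zip(*pts))
    return sum(math.dist(p, m) ** 2 for p in pts)

def bisecting_kmeans(points, k):
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=cluster_sse)   # split the cluster with largest SSE
        clusters.remove(worst)
        clusters.extend(two_means(worst))
    return clusters

pts = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.5, 5.0), (10.0, 0.0), (10.5, 0.0)]
out = bisecting_kmeans(pts, 3)
```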
Limitations of K-means

K-means has problems when clusters are of differing sizes, densities, or non-globular shapes.

K-means has problems when the data contains outliers.
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram.

[Figure: a six-point data set and its dendrogram; merge heights run from 0 to about 0.2.]
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering: agglomerative (start with the points as individual clusters and repeatedly merge) and divisive (start with one all-inclusive cluster and repeatedly split)
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm

More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
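The basic algorithm — start with singleton clusters, then repeatedly merge the two closest clusters until a target number remains — can be sketched directly. This version uses MIN (single link) as the inter-cluster proximity; the points are invented for the example:

```python
import math

def single_link_agglomerative(points, k):
    """Start with singleton clusters; repeatedly merge the two closest
    clusters (MIN / single link) until only k clusters remain."""
    clusters = [[p] for p in points]
    def d(a, b):                                  # single link: closest cross-cluster pair
        return min(math.dist(p, q) for p in a for q in b)
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: d(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)            # merge cluster j into cluster i
    return clusters

pts = [(0.0,), (0.4,), (1.0,), (5.0,), (5.3,), (9.0,)]
out = single_link_agglomerative(pts, 3)
```

In practice the proximity matrix would be stored and updated after each merge rather than recomputed, which is exactly the "key operation" referred to above.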
Starting Situation

Start with clusters of individual points and a proximity matrix.

[Figure: points p1–p5 and their proximity matrix, indexed by p1, …, p5.]
Intermediate Situation

After some merging steps we have some clusters.

[Figure: clusters C1–C5 and the proximity matrix indexed by C1, …, C5.]
Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

[Figure: clusters C1–C5 with C2 and C5 marked for merging, and the proximity matrix indexed by C1, …, C5.]
After Merging

The question is "How do we update the proximity matrix?"

[Figure: clusters C1, C3, C4, and C2 ∪ C5, and the proximity matrix with the merged row and column for C2 ∪ C5.]
How to Define Inter-Cluster Similarity

[Figure: points p1–p5 and their proximity matrix.]

MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function (Ward's Method uses squared error)
Cluster Similarity MIN or Single Link

Similarity of two clusters is based on the two most similar (closest) points in the different clusters.
MST Divisive Hierarchical Clustering

Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"! Then why do we want to evaluate them?
Clusters found in Random Data

[Figure: four panels on uniformly random points in the unit square — the Random Points, and the clusters found by K-means, DBSCAN, and Complete Link.]
Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information. Use only the data.
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Measures of Cluster Validity

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index: measures the extent to which cluster labels match externally supplied class labels
o Internal Index: measures the goodness of a clustering structure without respect to external information
o Relative Index: compares two different clusterings or clusters

Sometimes these are referred to as criteria instead of indices.
Two matrices:
o Proximity matrix
o 'Incidence' matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, 0 otherwise

Compute the correlation between the two matrices. Since the matrices are symmetric, only n(n−1)/2 entries need to be compared.

High correlation indicates that points that belong to the same cluster are close to each other.

Not a good measure for some density- or contiguity-based clusters.
Measuring Cluster Validity Via Correlation
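The idea can be made concrete: flatten the pairwise-distance ("proximity") values and the 0/1 same-cluster ("incidence") indicators, then compute the Pearson correlation between them (a plain-Python sketch on invented data; with distances rather than similarities, a good clustering shows up as a strongly negative correlation):

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def validity_correlation(points, labels):
    """Correlate pairwise distances with same-cluster indicators (upper triangle)."""
    dist, same = [], []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dist.append(math.dist(points[i], points[j]))
            same.append(1.0 if labels[i] == labels[j] else 0.0)
    return pearson(dist, same)

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = [0, 0, 0, 1, 1, 1]
r = validity_correlation(pts, labels)  # strongly negative for this tight clustering
```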
Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: two data sets in the unit square.]

Corr = -0.9235    Corr = -0.5810
Using Similarity Matrix for Cluster Validation

Order the similarity matrix with respect to cluster labels and inspect visually.

[Figure: a well-clustered data set in the unit square and its sorted similarity matrix over 100 points (similarity scale 0 to 1).]
Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: DBSCAN clusters on random data and the corresponding sorted similarity matrix over 100 points (similarity scale 0 to 1).]
Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: K-means clusters on random data and the corresponding sorted similarity matrix.]
Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: Complete Link clusters on random data and the corresponding sorted similarity matrix.]
Using Similarity Matrix for Cluster Validation

[Figure: a more complicated data set with seven DBSCAN clusters (labeled 1–7) and its sorted similarity matrix over roughly 3000 points.]

Clusters in more complicated figures aren't well separated.

Internal Measures SSE

Internal Index: Used to measure the goodness of a clustering structure without respect to external information.

SSE is good for comparing two clusterings or two clusters (average SSE).

Can also be used to estimate the number of clusters.
[Figure: SSE versus K for K from 2 to 30 (SSE from 0 to 10), alongside the corresponding data set (x in [5, 15], y in [-6, 6]).]
Internal Measures SSE

SSE curve for a more complicated data set:

[Figure: the seven-cluster data set (clusters labeled 1–7); SSE of clusters found using K-means.]
Framework for Cluster Validity

Need a framework to interpret any measure.
Statistics provide a framework for cluster validity.
For comparing the results of two different sets of cluster analyses, a framework is less necessary.

Example:
Statistical Framework for SSE

[Figure: a scatter data set in the unit square, and a histogram of SSE values (Count vs. SSE; SSE roughly 0.016–0.034).]
Statistical Framework for Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: two data sets in the unit square.]

Corr = -0.9235    Corr = -0.5810
Internal Measures Cohesion and Separation

Cluster Cohesion: Measures how closely related the objects in a cluster are. Example: within-cluster sum of squares (SSE):

WSS = Σ_i Σ_{x ∈ C_i} (x − m_i)²

Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters. Example: between-cluster sum of squares:

BSS = Σ_i |C_i| (m − m_i)²

where m is the overall mean of the data and m_i is the mean of cluster C_i.
Internal Measures Cohesion and Separation

Example: SSE for the points 1, 2, 4, 5 on a line (overall mean m = 3):

K = 1 cluster:
WSS = (1−3)² + (2−3)² + (4−3)² + (5−3)² = 10
BSS = 4 × (3−3)² = 0
Total = 10 + 0 = 10

K = 2 clusters {1, 2} and {4, 5} (means m1 = 1.5, m2 = 4.5):
WSS = (1−1.5)² + (2−1.5)² + (4−4.5)² + (5−4.5)² = 1
BSS = 2 × (3−1.5)² + 2 × (4.5−3)² = 9
Total = 1 + 9 = 10
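The arithmetic in the example above can be checked in a few lines; note that WSS + BSS is constant (the total sum of squares of the data), so minimizing WSS is equivalent to maximizing BSS (a plain-Python sketch using the same four points):

```python
def mean(xs):
    return sum(xs) / len(xs)

def wss_bss(clusters):
    """Within- and between-cluster sums of squares for 1-D points."""
    allpts = [x for c in clusters for x in c]
    m = mean(allpts)                                   # overall mean (here 3.0)
    wss = sum((x - mean(c)) ** 2 for c in clusters for x in c)
    bss = sum(len(c) * (mean(c) - m) ** 2 for c in clusters)
    return wss, bss

print(wss_bss([[1, 2, 4, 5]]))     # K=1: (10.0, 0.0)
print(wss_bss([[1, 2], [4, 5]]))   # K=2: (1.0, 9.0)
```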
Internal Measures Cohesion and Separation

A proximity graph based approach can also be used for cohesion and separation.

[Figure: proximity-graph illustration of cohesion (edges within a cluster) and separation (edges between clusters).]
Internal Measures Silhouette Coefficient

Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.

For an individual point i: let a be the average distance of i to the points in its own cluster, and b the minimum, over the other clusters, of the average distance of i to the points in that cluster; then s(i) = (b − a) / max(a, b).

Can calculate the average silhouette width for a cluster or a clustering.
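A minimal sketch of the per-point computation (the data are invented; a value of s close to 1 means the point sits well inside its cluster):

```python
import math

def silhouette(point, own_others, other_clusters):
    """s = (b - a) / max(a, b), where a is the average distance to the other
    points of the point's own cluster and b is the smallest average distance
    to the points of any other cluster."""
    a = sum(math.dist(point, q) for q in own_others) / len(own_others)
    b = min(sum(math.dist(point, q) for q in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)

own_others = [(0.2, 0.0), (0.0, 0.2)]                    # the point's cluster mates
other_clusters = [[(5.0, 5.0), (5.2, 5.0)], [(9.0, 0.0)]]
s = silhouette((0.0, 0.0), own_others, other_clusters)   # close to 1 here
```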
External Measures of Cluster Validity Entropy and Purity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

— Algorithms for Clustering Data, Jain and Dubes

Final Comment on Cluster Validity
Similarity matrix for five items I1–I5:

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
Complexity is O(n × K × I × d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes.

The centroid of cluster C_i is the mean of its points:

m_i = (1 / |C_i|) Σ_{x ∈ C_i} x
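Putting the pieces together, the basic algorithm can be sketched in plain Python (a minimal illustration, not an optimized implementation; the sample points are invented):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Basic K-means: assign each point to its closest centroid, then
    recompute each centroid as the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # initial centroids chosen randomly
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                         # assignment step
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        new = [tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centroids[i]
               for i, pts in enumerate(clusters)]
        if new == centroids:                     # centroids stopped moving
            break
        centroids = new                          # update step
    return centroids, clusters

pts = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]
centroids, clusters = kmeans(pts, k=2)
```

Because the outcome depends on the initial centroids, a common remedy (the "multiple runs" option on the Solutions to Initial Centroids Problem slide) is to run it several times with different seeds and keep the clustering with the lowest SSE.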
Two different K-means Clusterings

[Figure: the same data set (x in [-2, 2], y in [0, 3]) shown three times — the Original Points, an Optimal Clustering, and a Sub-optimal Clustering.]
K-means Clustering
o Partitional clustering approach
o Each cluster is associated with a centroid (center point)
o Each point is assigned to the cluster with the closest centroid
o Number of clusters, K, must be specified
o The basic algorithm is very simple

K-means Clustering – Details
Initial centroids are often chosen randomly; the clusters produced can vary from one run to the next.
The centroid m_i is (typically) the mean of the points in the cluster:

m_i = (1/|C_i|) Σ_{x ∈ C_i} x

'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc. K-means will converge for the common similarity measures mentioned above; most of the convergence happens in the first few iterations.
Complexity is O(n · K · I · d), where n is the number of points, K the number of clusters, I the number of iterations, and d the number of attributes.
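The basic algorithm above can be sketched in a few lines of NumPy. This is a minimal illustration under simplifying assumptions, not a reference implementation; the four 1-D points are a made-up toy set:

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Basic K-means: choose k random points as initial centroids, then
    alternate the assignment step and the centroid-update step."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of the points assigned to it
        # (keep the old centroid if a cluster happens to be empty).
        centroids = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                              else centroids[j] for j in range(k)])
    return labels, centroids

X = np.array([[1.0], [2.0], [4.0], [5.0]])
labels, centroids = kmeans(X, k=2)
print(sorted(centroids.ravel().tolist()))  # [1.5, 4.5]
```

On this tiny set, any pair of distinct initial points converges to the clusters {1, 2} and {4, 5}, which is why the result is stable here despite the random initialization.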
Two different K-means Clusterings
[Figure: one data set (Original Points) clustered two ways, an Optimal Clustering and a Sub-optimal Clustering.]
Importance of Choosing Initial Centroids
[Figure: K-means progress over Iterations 1–6 from one choice of initial centroids.]
Importance of Choosing Initial Centroids
[Figure: the same run shown panel by panel, Iterations 1–6.]
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE): for each point, the error is its distance to the nearest centroid; SSE is the total of these squared errors:

SSE = Σ_{i=1}^{K} Σ_{x ∈ C_i} dist²(m_i, x)

where x is a data point in cluster C_i and m_i is the centroid of C_i. Given two clusterings, we prefer the one with the smaller SSE.
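The definition translates directly into code. A small sketch on four made-up 1-D points with a fixed assignment:

```python
import numpy as np

def sse(X, labels, centroids):
    """SSE = sum over clusters C_i of squared distances to centroid m_i."""
    return sum(((X[labels == j] - m) ** 2).sum()
               for j, m in enumerate(centroids))

X = np.array([[1.0], [2.0], [4.0], [5.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.5], [4.5]])
print(sse(X, labels, centroids))  # 4 * 0.25 = 1.0
```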
Importance of Choosing Initial Centroids …
[Figure: K-means progress over Iterations 1–5 from a different choice of initial centroids.]
Importance of Choosing Initial Centroids …
[Figure: the same run shown panel by panel, Iterations 1–5.]
Problems with Selecting Initial Points
If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small, and it shrinks quickly as K grows. If the clusters are the same size, the probability is K!/K^K; for K = 10, that is 10!/10^10 ≈ 0.00036. Sometimes the initial centroids readjust themselves in the 'right' way, and sometimes they don't.
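Under the simplifying assumption that all K clusters have equal size and centroids are drawn uniformly at random, the chance of landing exactly one centroid per cluster is K!/K^K:

```python
from math import factorial

def prob_one_per_cluster(k):
    """P(k random centroids hit k equal-sized clusters exactly once each)."""
    return factorial(k) / k ** k

for k in (2, 5, 10):
    print(k, prob_one_per_cluster(k))  # k = 10 gives about 0.00036
```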
10 Clusters Example
[Figure: five pairs of clusters; K-means progress over Iterations 1–4.]
Starting with two initial centroids in one cluster of each pair of clusters.
10 Clusters Example
[Figure: the same run shown panel by panel, Iterations 1–4.]
Starting with two initial centroids in one cluster of each pair of clusters.
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one.
[Figure: K-means progress over Iterations 1–4.]
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one.
[Figure: the same run shown panel by panel, Iterations 1–4.]
Solutions to Initial Centroids Problem
o Multiple runs (helps, but probability is not on your side)
o Sample and use hierarchical clustering to determine initial centroids
o Select more than K initial centroids and then pick the most widely separated ones among them
o Postprocessing
o Bisecting K-means (not as susceptible to initialization problems)
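One way to realize "select more centroids, then keep the most widely separated ones" is a greedy farthest-point sweep. This is a hedged sketch of that idea, not a method prescribed by the slides; the two toy blobs are made up:

```python
import numpy as np

def farthest_first_centroids(X, k):
    """Greedy farthest-point initialization: each new centroid is the
    point with the largest distance to its nearest chosen centroid."""
    chosen = [0]  # start (arbitrarily) from the first point
    for _ in range(k - 1):
        # Distance from every point to its nearest already-chosen centroid.
        d = np.min(np.linalg.norm(X[:, None, :] - X[chosen][None, :, :], axis=2),
                   axis=1)
        chosen.append(int(d.argmax()))
    return X[chosen]

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(farthest_first_centroids(X, 2))  # one point from each blob
```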
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters.
Several strategies: choose the point that contributes most to SSE as a replacement centroid, or choose a point from the cluster with the highest SSE; if there are several empty clusters, this can be repeated several times.
Updating Centers Incrementally
In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid.
An alternative is to update the centroids after each assignment (incremental approach). Each assignment updates zero or two centroids; this is more expensive and introduces an order dependency, but it never produces an empty cluster.
Pre-processing and Post-processing
Pre-processing: normalize the data; eliminate outliers.
Post-processing: eliminate small clusters that may represent outliers; split 'loose' clusters (clusters with relatively high SSE); merge clusters that are 'close' and have relatively low SSE.
Bisecting K-means
Bisecting K-means algorithm: a variant of K-means that can produce a partitional or a hierarchical clustering. Start with one all-inclusive cluster; repeatedly pick a cluster (e.g., the one with the largest SSE), split it in two with basic K-means, and add the two sub-clusters to the list, until K clusters have been produced.
Bisecting K-means Example
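A compact sketch of the bisecting procedure, splitting the largest-SSE cluster at each step. The helper `kmeans2` and the toy 1-D data are made up for illustration:

```python
import numpy as np

def kmeans2(X, seed=0):
    """Split one cluster into two with plain 2-means (small helper)."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=2, replace=False)]
    for _ in range(20):
        lab = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(axis=1)
        c = np.array([X[lab == j].mean(axis=0) if (lab == j).any() else c[j]
                      for j in range(2)])
    return lab

def bisecting_kmeans(X, k):
    """Repeatedly bisect the cluster with the largest SSE until k clusters."""
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        sses = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        idx = clusters.pop(int(np.argmax(sses)))
        lab = kmeans2(X[idx])
        clusters += [idx[lab == 0], idx[lab == 1]]
    return clusters

X = np.array([[1.0], [2.0], [4.0], [5.0], [20.0], [21.0]])
parts = bisecting_kmeans(X, 3)
```

On this data the first bisection separates {20, 21} from the rest, and the second splits the remaining points into {1, 2} and {4, 5}.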
Limitations of K-means
K-means has problems when clusters are of differing sizes, differing densities, or non-globular shapes.
K-means has problems when the data contains outliers.
Limitations of K-means: Differing Sizes
[Figure: Original Points vs. K-means (3 Clusters).]
Limitations of K-means: Differing Density
[Figure: Original Points vs. K-means (3 Clusters).]
Limitations of K-means: Non-globular Shapes
[Figure: Original Points vs. K-means (2 Clusters).]
Overcoming K-means Limitations
[Figure: Original Points vs. K-means Clusters.]
One solution is to use many clusters: find parts of the natural clusters, but the pieces then need to be put together.
Overcoming K-means Limitations
[Figure: Original Points vs. K-means Clusters.]
Overcoming K-means Limitations
[Figure: Original Points vs. K-means Clusters.]
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram, a tree-like diagram that records the sequences of merges or splits.
[Figure: a dendrogram over six points and the corresponding nested clusters.]
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters: any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level.
They may correspond to meaningful taxonomies, for example taxonomies in the biological sciences.
Hierarchical Clustering
Two main types of hierarchical clustering: Agglomerative (start with the points as individual clusters and, at each step, merge the closest pair of clusters) and Divisive (start with one all-inclusive cluster and, at each step, split a cluster).
Traditional hierarchical algorithms use a similarity or distance matrix, and merge or split one cluster at a time.
Agglomerative Clustering Algorithm
More popular hierarchical clustering technique.
Basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat: merge the two closest clusters and update the proximity matrix, until only a single cluster remains
Key operation is the computation of the proximity of two clusters; the different approaches to defining it distinguish the different algorithms.
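The basic loop can be sketched directly, here with MIN (single-link) as the inter-cluster proximity. This is an O(n³)-ish illustration on made-up 1-D points, not an efficient implementation:

```python
import numpy as np

def single_link(X, k):
    """Agglomerative clustering with MIN (single-link) proximity:
    repeatedly merge the two clusters containing the closest pair of
    points, until k clusters remain."""
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # point distances
    while len(clusters) > k:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].min()  # MIN link
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)  # merge the closest pair of clusters
    return clusters

X = np.array([[0.0], [1.0], [5.0], [6.0]])
print(single_link(X, 2))  # [[0, 1], [2, 3]]
```

Swapping `.min()` for `.max()` or `.mean()` turns the same loop into complete-link (MAX) or group-average clustering.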
Starting Situation
Start with clusters of individual points and a proximity matrix.
[Figure: five points p1–p5 and their proximity matrix.]
Intermediate Situation
After some merging steps, we have some clusters.
[Figure: clusters C1–C5 and the corresponding proximity matrix.]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1–C5 before the merge, with the proximity matrix.]
After Merging
The question is "How do we update the proximity matrix?"
[Figure: clusters C1, C3, C4, and the merged C2 ∪ C5, with the partially updated proximity matrix.]
How to Define Inter-Cluster Similarity
[Figure: points p1–p5 grouped into two clusters, with their proximity matrix.]
Possible definitions of the similarity between two clusters:
o MIN
o MAX
o Group Average
o Distance Between Centroids
o Other methods driven by an objective function (Ward's Method uses squared error)
Cluster Similarity: MIN or Single Link
Similarity of two clusters is based on the two most similar (closest) points in the different clusters; it is determined by one pair of points, i.e., by one link in the proximity graph.
MST Divisive Hierarchical Clustering
Use MST (minimum spanning tree) for constructing a hierarchy of clusters: build the MST of the points, then repeatedly create new clusters by breaking the link with the largest distance.
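A sketch of the MST-based divisive idea: build the tree with Prim's algorithm, delete the k−1 longest edges, and read the clusters off as connected components (the 1-D toy data is made up):

```python
import numpy as np

def mst_clusters(X, k):
    """Divisive MST clustering: build a minimum spanning tree over the
    points, delete the k-1 longest edges, and return component labels."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    # Prim's algorithm: grow the tree one cheapest crossing edge at a time.
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        i, j = min(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: D[e])
        in_tree.add(j)
        edges.append((D[i, j], i, j))
    # Keep only the n-k shortest MST edges (i.e., break the k-1 longest).
    edges = sorted(edges)[: n - k]
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, i, j in edges:
        parent[find(i)] = find(j)  # union the endpoints of each kept edge
    return [find(i) for i in range(n)]

X = np.array([[0.0], [1.0], [5.0], [6.0]])
labels = mst_clusters(X, 2)  # the long 1-5 edge is broken
```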
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is: accuracy, precision, recall.
For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"! Then why do we want to evaluate them? To avoid finding patterns in noise, to compare clustering algorithms, to compare two sets of clusters, and to compare two clusters.
Clusters found in Random Data
[Figure: points distributed randomly in the unit square (Random Points), and the 'clusters' found in them by K-means, DBSCAN, and complete link.]
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Measures of Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index: measures the extent to which cluster labels match externally supplied class labels (e.g., entropy)
o Internal Index: measures the goodness of a clustering structure without respect to external information (e.g., SSE)
o Relative Index: compares two different clusterings or clusters; often an external or internal index is used for this purpose
Sometimes these are referred to as criteria instead of indices.
Measuring Cluster Validity Via Correlation
Two matrices:
o Proximity matrix
o 'Incidence' matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, and 0 otherwise
Compute the correlation between the two matrices. Since the matrices are symmetric, only the correlation between n(n − 1)/2 entries needs to be calculated.
High correlation indicates that points that belong to the same cluster are close to each other.
Not a good measure for some density- or contiguity-based clusters.
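A small sketch of the computation on made-up data, correlating the upper-triangle entries of a distance matrix with the incidence matrix (with distances rather than similarities, a good clustering yields a strongly negative correlation):

```python
import numpy as np

# Toy data: two tight 1-D clusters, with labels from some clustering.
X = np.array([[1.0], [2.0], [4.0], [5.0]])
labels = np.array([0, 0, 1, 1])

D = np.linalg.norm(X[:, None] - X[None, :], axis=2)       # proximity (distance) matrix
inc = (labels[:, None] == labels[None, :]).astype(float)  # incidence matrix
iu = np.triu_indices(len(X), k=1)  # matrices are symmetric: upper triangle only
corr = np.corrcoef(D[iu], inc[iu])[0, 1]
print(corr)  # strongly negative: same-cluster pairs have small distances
```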
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
[Figure: Corr = −0.9235 for data with three well-separated clusters; Corr = −0.5810 for random data.]
Using Similarity Matrix for Cluster Validation
Order the similarity matrix with respect to cluster labels and inspect visually.
[Figure: a well-clustered data set and its sorted 100 × 100 similarity matrix, showing a sharp block-diagonal structure.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: sorted similarity matrix for DBSCAN clusters found in random data.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: sorted similarity matrix for K-means clusters found in random data.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: sorted similarity matrix for complete-link clusters found in random data.]
Using Similarity Matrix for Cluster Validation
[Figure: sorted similarity matrix for the clusters (labeled 1–7) found by DBSCAN in a more complicated data set.]
Clusters in more complicated figures aren't well separated.

Internal Measures: SSE
Internal Index: used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE).
Can also be used to estimate the number of clusters.
[Figure: a data set with ten clusters, and SSE plotted against the number of clusters K from 2 to 30; knees in the curve suggest natural numbers of clusters.]
Internal Measures: SSE
SSE curve for a more complicated data set: SSE of clusters found using K-means.
[Figure: a data set with seven clusters (labeled 1–7) and its SSE curve.]
Framework for Cluster Validity
Need a framework to interpret any measure: for example, if our measure of evaluation has the value 10, is that good, fair, or poor?
Statistics provide a framework for cluster validity: the more 'atypical' a clustering result is, the more likely it represents valid structure in the data. We can compare the values of an index on random data or random clusterings to the value on our clustering result.
For comparing the results of two different sets of cluster analyses, a framework is less necessary, though there remains the question of whether the difference between two index values is significant.
Statistical Framework for SSE
Example: compare the SSE of three clusters found in one data set against the SSEs of three clusters found in random data.
[Figure: histogram of the SSE of three clusters found in many sets of random data points; the SSE values range from roughly 0.016 to 0.034, so a much lower SSE on the real data is unlikely to arise by chance.]
Statistical Framework for Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
[Figure: Corr = −0.9235 for data with three well-separated clusters; Corr = −0.5810 for random data.]
Internal Measures: Cohesion and Separation
Cluster Cohesion: measures how closely related the objects in a cluster are. Cohesion is measured by the within-cluster sum of squares (SSE):

WSS = Σ_i Σ_{x ∈ C_i} (x − m_i)²

Cluster Separation: measures how distinct or well-separated a cluster is from other clusters. Separation is measured by the between-cluster sum of squares:

BSS = Σ_i |C_i| (m − m_i)²

where |C_i| is the size of cluster i, m_i is its centroid, and m is the overall mean. (Example: Squared Error.)
Internal Measures: Cohesion and Separation
Example: SSE for the four points 1, 2, 4, 5 on a number line; the overall mean is m = 3.
[Number line: points at 1, 2, 4, 5; centroids m1 = 1.5, m2 = 4.5, overall mean m = 3.]

K = 1 cluster (centroid m = 3):
WSS = (1 − 3)² + (2 − 3)² + (4 − 3)² + (5 − 3)² = 10
BSS = 4 × (3 − 3)² = 0
Total = 10 + 0 = 10

K = 2 clusters ({1, 2} with centroid m1 = 1.5; {4, 5} with centroid m2 = 4.5):
WSS = (1 − 1.5)² + (2 − 1.5)² + (4 − 4.5)² + (5 − 4.5)² = 1
BSS = 2 × (3 − 1.5)² + 2 × (4.5 − 3)² = 9
Total = 1 + 9 = 10
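The arithmetic in the example, and the fact that WSS + BSS is constant for a given data set, can be checked directly in a few lines:

```python
import numpy as np

X = np.array([1.0, 2.0, 4.0, 5.0])
m = X.mean()  # overall mean = 3

def wss_bss(clusters):
    """Within- and between-cluster sums of squares for a partition of X."""
    wss = sum(((c - c.mean()) ** 2).sum() for c in clusters)
    bss = sum(len(c) * (m - c.mean()) ** 2 for c in clusters)
    return wss, bss

w1, b1 = wss_bss([X])            # K = 1
w2, b2 = wss_bss([X[:2], X[2:]]) # K = 2: {1, 2} and {4, 5}
print(float(w1), float(b1))  # 10.0 0.0
print(float(w2), float(b2))  # 1.0 9.0
# WSS + BSS = 10 in both cases: the total is independent of the partition.
```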
Internal Measures: Cohesion and Separation
A proximity-graph-based approach can also be used for cohesion and separation: cluster cohesion is the sum of the weights of all links within a cluster, and cluster separation is the sum of the weights of the links between nodes in the cluster and nodes outside the cluster.
Internal Measures: Silhouette Coefficient
Silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.
For an individual point i: let a = the average distance of i to the points in its own cluster, and b = the minimum, over the other clusters, of the average distance of i to the points in that cluster. The silhouette coefficient of i is then

s = (b − a) / max(a, b)

(typically between 0 and 1; the closer to 1, the better).
Can calculate the average silhouette width for a cluster or a clustering.
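The definition is easy to compute directly. A minimal sketch on made-up 1-D points:

```python
import numpy as np

def silhouette(X, labels, i):
    """Silhouette coefficient for point i: s = (b - a) / max(a, b), where
    a = mean distance to the other points in i's own cluster and
    b = smallest mean distance to the points of any other cluster."""
    d = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    own[i] = False  # exclude the point itself from a
    a = d[own].mean()
    b = min(d[labels == c].mean() for c in set(labels) - {labels[i]})
    return (b - a) / max(a, b)

X = np.array([[1.0], [2.0], [4.0], [5.0]])
labels = np.array([0, 0, 1, 1])
print(silhouette(X, labels, 0))  # a = 1, b = 3.5, so s = 2.5/3.5 ≈ 0.714
```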
External Measures of Cluster Validity: Entropy and Purity

Final Comment on Cluster Validity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
— Algorithms for Clustering Data, Jain and Dubes
Types of Clusters Objective Function
Clusters Defined by an Objective Function
Types of Clusters Objective Function hellip
Map the clustering problem to a different domain and solve a related problem in that domain
Characteristics of the Input Data Are Important
Type of proximity or density measure
Sparseness
Attribute type
Type of Data
Dimensionality Noise and Outliers Type of Distribution
Clustering Algorithms
K-means and its variants
Hierarchical clustering
K-means Clustering
o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple
K-means Clustering ndash Details
Initial centroids are often chosen randomly
The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity
correlation etc K-means will converge for common similarity measures mentioned
above Most of the convergence happens in the first few iterations
Complexity is O( n K I d )
sumisin
=iCxi
i xC
m||
1
Two different K-means Clusterings
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Optimal Clustering
Original Points
Importance of Choosing Initial Centroids
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 6
Importance of Choosing Initial Centroids
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 6
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
sumsum= isin
=K
i Cxi
i
xmdistSSE1
2 )(
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
xy
Iteration 5
Problems with Selecting Initial Points
If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids
o Postprocessing o Bisecting K-means
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies
Updating Centers Incrementally
In the basic K-means algorithm centroids are updated after all points are assigned to a centroid
An alternative is to update the centroids after each assignment (incremental approach)
Pre-processing and Post-processing
Pre-processing
Post-processing
Bisecting K-means
Bisecting K-means algorithm
Bisecting K-means Example
Limitations of K-means
K-means has problems when clusters are of differing
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
1 3 2 5 4 60
005
01
015
02
1
2
3
4
5
6
1
23 4
5
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function …
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering – Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids …
Importance of Choosing Initial Centroids …
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Ward's Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Similarity matrix for items I1–I5:
      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00
Complexity is O(n × K × I × d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes.
The centroid m_i is the mean of the points in the cluster: m_i = (1 / |C_i|) \sum_{x \in C_i} x
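The basic algorithm described above (random initial centroids, assign each point to its closest centroid, recompute centroids, repeat until nothing changes) can be written compactly. This is a minimal illustrative sketch for points given as tuples, not the lecture's reference implementation; the name `kmeans` is mine:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal K-means: K random points as initial centroids, then alternate
    assignment and update steps until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = [tuple(map(float, p)) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            best = min(range(k),
                       key=lambda j: sum((a - b) ** 2
                                         for a, b in zip(p, centroids[j])))
            clusters[best].append(p)
        # Update step: each centroid becomes the mean of its cluster's points.
        new = [tuple(sum(col) / len(members) for col in zip(*members)) if members
               else centroids[j] for j, members in enumerate(clusters)]
        if new == centroids:     # converged
            break
        centroids = new
    return centroids, clusters

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, groups = kmeans(data, 2)
print(sorted(cents))   # the means of the two obvious blobs
```

On this toy data set any initialization recovers the two blobs; the next slides show that on harder data the choice of initial centroids matters.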
Two different K-means Clusterings
[Figure: the same set of original points clustered two ways by K-means, a sub-optimal clustering and an optimal clustering (x-y scatter plots).]
Importance of Choosing Initial Centroids
[Figure: Iterations 1 through 6 of K-means from one choice of initial centroids (x-y scatter plots).]
Importance of Choosing Initial Centroids
[Figure: Iterations 1 through 6 of K-means for a choice of initial centroids (x-y scatter plots).]
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)
where x is a data point in cluster C_i and m_i is the representative point (centroid) of cluster C_i. Given two sets of clusters, we prefer the one with the smaller error; note that one easy way to reduce SSE is simply to increase K.
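The SSE above can be evaluated directly once clusters and centroids are known. A 1-D sketch (the helper name `sse` is mine):

```python
def sse(clusters, centroids):
    """Sum of squared distances of each point to its cluster's centroid."""
    return sum((x - m) ** 2
               for c, m in zip(clusters, centroids)
               for x in c)

# Two clusterings of the same points; the lower-SSE one is preferred.
print(sse([[1, 2], [4, 5]], [1.5, 4.5]))   # 1.0
print(sse([[1, 2, 4], [5]], [7/3, 5]))     # about 4.67
```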
Importance of Choosing Initial Centroids …
[Figure: Iterations 1 through 5 of K-means from another choice of initial centroids (x-y scatter plots).]
Importance of Choosing Initial Centroids …
[Figure: Iterations 1 through 5 of K-means for the same data (x-y scatter plots).]
Problems with Selecting Initial Points
If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small.
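To make "small" concrete: if the K clusters all have the same size n and the K initial centroids are drawn uniformly (ordered, with replacement, a common back-of-the-envelope approximation), the chance of getting exactly one centroid per cluster is K! · n^K / (Kn)^K = K! / K^K:

```python
from math import factorial

def p_one_per_cluster(k):
    """P(each of K same-size clusters contributes exactly one of K
    uniformly chosen initial centroids), approximated as K! / K^K."""
    return factorial(k) / k ** k

for k in (2, 5, 10):
    print(k, p_one_per_cluster(k))
# For K = 10 this is 10!/10^10, about 0.00036: a very unlikely event.
```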
10 Clusters Example
[Figure: Iterations 1 through 4 of K-means on the 10-cluster data set (x-y scatter plots).]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
[Figure: Iterations 1 through 4 on the 10-cluster data set, shown as separate panels.]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one
[Figure: Iterations 1 through 4 of K-means on the 10-cluster data set (x-y scatter plots).]
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one
[Figure: Iterations 1 through 4, shown as separate panels.]
Solutions to Initial Centroids Problem
o Multiple runs
o Sample the data and use hierarchical clustering to determine initial centroids
o Select more than K initial centroids, then choose the most widely separated among them
o Postprocessing
o Bisecting K-means
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies: choose the point that is farthest from any current centroid, or choose a replacement point from the cluster with the highest SSE; if there are several empty clusters, repeat.
Updating Centers Incrementally
In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid.
An alternative is to update the centroids after each assignment (incremental approach): each assignment updates zero or two centroids. This is more expensive and introduces an order dependency, but it never produces an empty cluster.
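The incremental update does not need to revisit a cluster's members. A sketch of the standard running-mean update (the name `add_point` is mine):

```python
def add_point(m, n, x):
    """Add point x to a cluster whose centroid is m over n points.
    Returns the new centroid and count in O(d) time."""
    return tuple(mi + (xi - mi) / (n + 1) for mi, xi in zip(m, x)), n + 1

m, n = (0.0, 0.0), 0
for x in [(1, 0), (3, 0), (2, 3)]:
    m, n = add_point(m, n, x)
print(m)   # (2.0, 1.0), the mean of the three points
```

The symmetric formula with n - 1 removes a point, which is what makes per-assignment updates cheap.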
Pre-processing and Post-processing
Pre-processing: normalize the data; eliminate outliers.
Post-processing: eliminate small clusters that may represent outliers; split 'loose' clusters (with relatively high SSE); merge clusters that are 'close' (with relatively low SSE).
Bisecting K-means
Bisecting K-means algorithm: a variant of K-means that can produce a partitional or a hierarchical clustering. Start with a single cluster; repeatedly pick a cluster and split it into two using basic K-means, until K clusters remain.
Bisecting K-means Example
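A sketch of the bisecting idea under simple assumptions (1-D points, the highest-SSE cluster is always the one chosen for splitting, plain 2-means for the split); all names are illustrative:

```python
import random

def two_means(pts, rng, iters=50):
    """Split one 1-D cluster into two with basic 2-means."""
    c = rng.sample(pts, 2)
    for _ in range(iters):
        halves = ([p for p in pts if abs(p - c[0]) <= abs(p - c[1])],
                  [p for p in pts if abs(p - c[0]) > abs(p - c[1])])
        new = [sum(h) / len(h) if h else c[i] for i, h in enumerate(halves)]
        if new == c:
            break
        c = new
    return halves

def sse(pts):
    m = sum(pts) / len(pts)
    return sum((x - m) ** 2 for x in pts)

def bisecting_kmeans(points, k, seed=0):
    clusters = [list(points)]
    rng = random.Random(seed)
    while len(clusters) < k:
        worst = max(clusters, key=sse)   # split the highest-SSE cluster
        clusters.remove(worst)
        clusters.extend(two_means(worst, rng))
    return clusters

print(sorted(map(sorted, bisecting_kmeans([1, 2, 3, 10, 11, 12, 30, 31], 3))))
```

On this well-separated example every initialization recovers the three groups {1,2,3}, {10,11,12}, {30,31}.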
Limitations of K-means
K-means has problems when clusters are of differing sizes, differing densities, or non-globular shapes.
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters: find parts of clusters, which then need to be put together.
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
[Figure: a dendrogram over six points and the corresponding nested-cluster diagram.]
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters: any desired number can be obtained by cutting the dendrogram at the proper level.
The resulting clusters may correspond to meaningful taxonomies (e.g., in the biological sciences).
Hierarchical Clustering
Two main types of hierarchical clustering: agglomerative (start with the points as individual clusters and repeatedly merge the closest pair) and divisive (start with one all-inclusive cluster and repeatedly split).
Traditional hierarchical algorithms use a similarity or distance matrix, merging or splitting one cluster at a time.
Agglomerative Clustering Algorithm: the more popular hierarchical clustering technique.
Basic algorithm is straightforward: 1) compute the proximity matrix; 2) let each data point be a cluster; 3) repeat: merge the two closest clusters and update the proximity matrix, until only a single cluster remains.
Key operation is the computation of the proximity of two clusters; the different approaches to defining inter-cluster proximity distinguish the different algorithms.
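The basic loop can be sketched naively, here with MIN (single-link) proximity; this is purely illustrative (it recomputes pairwise proximities each round rather than maintaining the matrix), and the names are mine:

```python
def single_link(a, b, dist):
    """MIN proximity: distance of the closest pair across two clusters."""
    return min(dist(p, q) for p in a for q in b)

def agglomerate(points, k, dist):
    clusters = [[p] for p in points]     # start: every point is a cluster
    while len(clusters) > k:
        # find the two closest clusters...
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]],
                                              clusters[ij[1]], dist))
        # ...and merge them
        clusters[i] += clusters.pop(j)
    return clusters

d = lambda p, q: abs(p - q)
print(sorted(agglomerate([1, 2, 3, 10, 11, 30], 3, d)))
```

Stopping at k = 1 instead and recording each merge would yield the dendrogram.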
Starting Situation
Start with clusters of individual points and a proximity matrix
[Diagram: five singleton clusters p1 through p5 with their proximity matrix.]
Intermediate Situation
After some merging steps we have some clusters
[Diagram: clusters C1 through C5 with their proximity matrix.]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
[Diagram: clusters C1 through C5; C2 and C5 are the closest pair.]
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
[Diagram: clusters C1, C3, C4, and the merged C2 U C5, with the updated proximity matrix.]
How to Define Inter-Cluster Similarity
[Diagram: points p1 through p5 with their proximity matrix; the slide is repeated with each candidate definition highlighted in turn.]
MIN; MAX; Group Average; Distance Between Centroids; other methods driven by an objective function (Ward's Method uses squared error)
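The four basic definitions can be compared side by side. A 1-D sketch with |p - q| as the point-to-point distance (function names are mine):

```python
d = lambda p, q: abs(p - q)

def d_min(a, b):   # MIN (single link): closest pair across the clusters
    return min(d(p, q) for p in a for q in b)

def d_max(a, b):   # MAX (complete link): farthest pair across the clusters
    return max(d(p, q) for p in a for q in b)

def d_avg(a, b):   # Group Average: mean over all cross-cluster pairs
    return sum(d(p, q) for p in a for q in b) / (len(a) * len(b))

def d_cent(a, b):  # Distance Between Centroids
    return d(sum(a) / len(a), sum(b) / len(b))

a, b = [1, 2], [5, 9]
print(d_min(a, b), d_max(a, b), d_avg(a, b), d_cent(a, b))   # 3 8 5.5 5.5
```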
Cluster Similarity: MIN or Single Link
Similarity of two clusters is based on the two most similar (closest) points in the different clusters.
MST Divisive Hierarchical Clustering: use a minimum spanning tree (MST) for constructing the hierarchy of clusters.
Cluster Validity
For supervised classification, we have a variety of measures to evaluate how good our model is (e.g., accuracy, precision, recall).
For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"! Then why do we want to evaluate them? To avoid finding patterns in noise, to compare clustering algorithms, and to compare two sets of clusters or two individual clusters.
Clusters found in Random Data
[Figure: the same random points (x-y scatter) shown four ways: Random Points, and the clusters found by K-means, DBSCAN, and Complete Link.]
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index: measures the extent to which cluster labels match externally supplied class labels (e.g., entropy)
o Internal Index: measures the goodness of a clustering structure without respect to external information (e.g., SSE)
o Relative Index: compares two different clusterings or clusters, often using an external or internal index
Sometimes these are referred to as criteria instead of indices.
Measures of Cluster Validity
Two matrices: the proximity matrix, and the 'incidence' matrix (one row and one column per data point; an entry is 1 if the associated pair of points belongs to the same cluster, 0 otherwise).
Compute the correlation between the two matrices; since both are symmetric, only the entries above the diagonal need to be compared.
High correlation (in magnitude) indicates that points that belong to the same cluster are close to each other.
Not a good measure for some density- or contiguity-based clusters.
Measuring Cluster Validity Via Correlation
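A sketch of the procedure (names are mine): build the incidence entries, pair them with the corresponding distance-matrix entries, and compute the Pearson correlation. Distances are used here, so good structure shows up as a strongly *negative* correlation, matching the negative Corr values on the slides:

```python
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def incidence_vs_distance(points, labels, dist):
    inc, dst = [], []
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):      # entries above the diagonal only
            inc.append(1.0 if labels[i] == labels[j] else 0.0)
            dst.append(dist(points[i], points[j]))
    return pearson(inc, dst)

d = lambda p, q: abs(p - q)
r = incidence_vs_distance([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1], d)
print(round(r, 3))   # strongly negative for well-separated clusters
```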
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
[Figure: the two data sets as x-y scatter plots.]
Corr = -0.9235   Corr = -0.5810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
[Figure: x-y scatter of a well-separated data set, next to its similarity matrix (points by points, similarity 0 to 1) sorted by cluster label, showing sharp diagonal blocks.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: similarity matrix sorted by DBSCAN cluster labels for the random data, next to the x-y scatter of the points.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: similarity matrix sorted by K-means cluster labels for the random data, next to the x-y scatter of the points.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: similarity matrix sorted by Complete Link cluster labels for the random data, next to the x-y scatter of the points.]
Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN clusters (numbered 1 through 7) on a more complicated data set, with the corresponding sorted similarity matrix.]
Clusters in more complicated figures aren't well separated.
Internal Index: used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
[Figure: SSE versus K (K = 2 to 30) for the 10-cluster data set, next to the data; knees in the curve suggest the natural numbers of clusters.]
Internal Measures SSE
SSE curve for a more complicated data set
[Figure: the data set with its clusters numbered 1 through 7.]
SSE of clusters found using K-means
Characteristics of the Input Data Are Important
Type of proximity or density measure
Sparseness
Attribute type
Type of Data
Dimensionality Noise and Outliers Type of Distribution
Clustering Algorithms
K-means and its variants
Hierarchical clustering
K-means Clustering
o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple
K-means Clustering ndash Details
Initial centroids are often chosen randomly
The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity
correlation etc K-means will converge for common similarity measures mentioned
above Most of the convergence happens in the first few iterations
Complexity is O( n K I d )
sumisin
=iCxi
i xC
m||
1
Two different K-means Clusterings
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Optimal Clustering
Original Points
Importance of Choosing Initial Centroids
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 6
Importance of Choosing Initial Centroids
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 6
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
sumsum= isin
=K
i Cxi
i
xmdistSSE1
2 )(
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
xy
Iteration 5
Problems with Selecting Initial Points
If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids
o Postprocessing o Bisecting K-means
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies
Updating Centers Incrementally
In the basic K-means algorithm centroids are updated after all points are assigned to a centroid
An alternative is to update the centroids after each assignment (incremental approach)
Pre-processing and Post-processing
Pre-processing
Post-processing
Bisecting K-means
Bisecting K-means algorithm
Bisecting K-means Example
Limitations of K-means
K-means has problems when clusters are of differing
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
1 3 2 5 4 60
005
01
015
02
1
2
3
4
5
6
1
23 4
5
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
[Figure: scatter plot of the K-means clustering of random points (x vs. y)]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Complete Link
[Figure: scatter plot of random points and the sorted similarity matrix for the Complete Link clustering (similarity scale 0 to 1)]
Using Similarity Matrix for Cluster Validation
DBSCAN
[Figure: DBSCAN clustering with clusters labeled 1 through 7, and its sorted similarity matrix (similarity scale 0 to 1)]
Clusters in more complicated figures aren't well separated.
Internal Index: used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
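A sketch of this "elbow" idea on made-up 1-D data (three point masses at 0, 10, 20), using a tiny k-means with deterministic quantile initialization rather than a library routine:

```python
import numpy as np

def kmeans_sse(x, k, iters=20):
    # Deterministic init: spread k centers across the quantiles of x.
    centers = np.quantile(x, np.linspace(0, 1, k))
    for _ in range(iters):
        assign = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        centers = np.array([x[assign == j].mean() if np.any(assign == j)
                            else centers[j] for j in range(k)])
    d = np.abs(x[:, None] - centers[None, :]).min(axis=1)
    return float((d ** 2).sum())

x = np.concatenate([np.full(10, 0.0), np.full(10, 10.0), np.full(10, 20.0)])
sse = {k: kmeans_sse(x, k) for k in (1, 2, 3)}   # SSE drops; elbow at k = 3
```

SSE keeps decreasing as K grows, but the sharp drop levels off at the natural number of clusters.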
Internal Measures SSE
[Figure: SSE plotted against K for K from 2 to 30, alongside the data set it was computed on]
Internal Measures SSE
SSE curve for a more complicated data set
[Figure: data set with clusters labeled 1 through 7]
SSE of clusters found using K-means
Need a framework to interpret any measure: if a measure of evaluation has the value 10, is that good, fair, or poor?
Statistics provide a framework for cluster validity: the more 'atypical' a clustering result is, the more likely it represents valid structure in the data.
For comparing the results of two different sets of cluster analyses, a framework is less necessary.
Framework for Cluster Validity
Example
Statistical Framework for SSE
[Figure: histogram of SSE values (count vs. SSE) for clusterings of random data, and a scatter plot of the data set under evaluation]
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
[Figure: the two data sets, plotted as x vs. y on the unit square]
Corr = -0.9235 Corr = -0.5810
Cluster Cohesion: measures how closely related the objects in a cluster are.
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.
Example Squared Error
Internal Measures Cohesion and Separation
Cohesion is measured by the within-cluster sum of squares (SSE):
WSS = sum_i sum_{x in C_i} (x - m_i)^2
Separation is measured by the between-cluster sum of squares:
BSS = sum_i |C_i| (m - m_i)^2
where m_i is the centroid of cluster C_i, |C_i| is its size, and m is the overall mean.
Internal Measures Cohesion and Separation
Example SSE
Data: points 1, 2, 4, 5 on a line; overall mean m = 3.

K = 1 cluster:
WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
BSS = 4 x (3-3)^2 = 0
Total = WSS + BSS = 10

K = 2 clusters ({1, 2} with m1 = 1.5 and {4, 5} with m2 = 4.5):
WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
BSS = 2 x (3-1.5)^2 + 2 x (4.5-3)^2 = 9
Total = WSS + BSS = 10
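The worked example can be checked numerically; note that the total WSS + BSS is constant across clusterings:

```python
# Points from the example above; m is the overall mean (= 3).
points = [1, 2, 4, 5]
m = sum(points) / len(points)

def wss_bss(clusters):
    wss = bss = 0.0
    for c in clusters:
        mi = sum(c) / len(c)                     # cluster centroid
        wss += sum((x - mi) ** 2 for x in c)     # cohesion
        bss += len(c) * (m - mi) ** 2            # separation
    return wss, bss

w1, b1 = wss_bss([[1, 2, 4, 5]])                 # K = 1: (10.0, 0.0)
w2, b2 = wss_bss([[1, 2], [4, 5]])               # K = 2: (1.0, 9.0)
```

Minimizing WSS is therefore equivalent to maximizing BSS.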
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
[Diagram: proximity graph view, with cohesion as the weight of within-cluster edges and separation as the weight of between-cluster edges]
Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.
For an individual point i:
- a = average distance of i to the points in its own cluster
- b = minimum, over the other clusters, of the average distance of i to the points in that cluster
- s(i) = (b - a) / max(a, b)
Can calculate the Average Silhouette width for a cluster or a clustering.
Internal Measures Silhouette Coefficient
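A sketch of s(i) on toy data (points and labels invented; distances Euclidean):

```python
import numpy as np

def silhouette(i, points, labels):
    d = np.linalg.norm(points - points[i], axis=1)
    own = labels == labels[i]
    own[i] = False                               # exclude the point itself
    a = d[own].mean()                            # avg distance within own cluster
    b = min(d[labels == c].mean()                # nearest other cluster
            for c in set(labels.tolist()) - {int(labels[i])})
    return float((b - a) / max(a, b))

pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
lab = np.array([0, 0, 1, 1])
s = silhouette(0, pts, lab)                      # close to 1: well placed
```

A point deep inside a well-separated cluster gets s near 1; a point that would fit another cluster better gets s near -1.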
External Measures of Cluster Validity Entropy and Purity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
- Algorithms for Clustering Data, Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function …
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering – Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids …
Importance of Choosing Initial Centroids …
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Ward's Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Example similarity matrix:

     I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
Clustering Algorithms
K-means and its variants
Hierarchical clustering
K-means Clustering
o Partitional clustering approach
o Each cluster is associated with a centroid (center point)
o Each point is assigned to the cluster with the closest centroid
o Number of clusters, K, must be specified
o The basic algorithm is very simple
K-means Clustering – Details
Initial centroids are often chosen randomly; clusters produced vary from one run to another.
The centroid m_i is (typically) the mean of the points in the cluster.
'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
K-means will converge for the common similarity measures mentioned above; most of the convergence happens in the first few iterations.
Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, d = number of attributes.

m_i = (1 / |C_i|) * sum_{x in C_i} x
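The basic algorithm can be sketched directly on 2-D toy blobs; for determinism this sketch initializes with points spread across the data, whereas random initialization is more typical:

```python
import numpy as np

def kmeans(points, k, iters=100):
    # Deterministic toy init: evenly spaced data points (random init is typical).
    centroids = points[np.linspace(0, len(points) - 1, k).astype(int)].astype(float)
    for _ in range(iters):
        # 1. Assign each point to the closest centroid (Euclidean distance).
        d = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        assign = d.argmin(axis=1)
        # 2. Recompute each centroid as the mean of its assigned points.
        new = np.array([points[assign == j].mean(axis=0)
                        if np.any(assign == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):          # no movement: converged
            break
        centroids = new
    return centroids, assign

pts = np.array([[0.0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
centroids, assign = kmeans(pts, 2)
```

On these two well-separated blobs the loop converges in a couple of iterations, matching the note above that most convergence happens early.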
Two different K-means Clusterings
[Figure: the Original Points, an Optimal Clustering, and a Sub-optimal Clustering of the same data produced by K-means from different initial centroids]
Importance of Choosing Initial Centroids
[Figure: Iterations 1 through 6 of K-means for one choice of initial centroids]
Importance of Choosing Initial Centroids
[Figure: Iterations 1 through 6 of K-means for a different choice of initial centroids]
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE): for each point, the error is the distance to the nearest cluster centroid.
SSE = sum_{i=1}^{K} sum_{x in C_i} dist^2(m_i, x)

where x is a data point in cluster C_i and m_i is the centroid of C_i.
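Given two clusterings of the same data, the one with the smaller SSE is preferred; a tiny 1-D illustration (values invented):

```python
def sse(clusters):
    total = 0.0
    for c in clusters:
        m = sum(c) / len(c)                      # cluster centroid
        total += sum((x - m) ** 2 for x in c)    # squared distances to centroid
    return total

good = sse([[1, 2], [8, 9]])                     # tight clusters: SSE = 1.0
bad = sse([[1, 8], [2, 9]])                      # mixed clusters: SSE = 49.0
```

The tight grouping wins by a wide margin, which is exactly how SSE is used to compare runs from different initial centroids.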
Importance of Choosing Initial Centroids hellip
[Figure: Iterations 1 through 5 of K-means for another choice of initial centroids]
Importance of Choosing Initial Centroids hellip
[Figure: Iterations 1 through 5 of K-means for yet another choice of initial centroids]
Problems with Selecting Initial Points
If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small, especially when K is large. For example, if the clusters are the same size, n, the chance is K! * n^K / (Kn)^K = K!/K^K.
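The equal-cluster-size case (an assumption made for the formula) can be checked directly:

```python
import math

def p_one_per_cluster(k):
    # Chance that k random initial centroids land one in each of k
    # equal-size clusters: k! * n^k / (k*n)^k = k!/k^k.
    return math.factorial(k) / k ** k

p2 = p_one_per_cluster(2)        # 0.5
p10 = p_one_per_cluster(10)      # about 0.00036: very unlikely
```

Even at K = 10 the chance is under 0.04%, which is why the slides that follow show initializations missing some clusters entirely.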
10 Clusters Example
[Figure: Iterations 1 through 4 of K-means on the ten-cluster data set]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
[Figure: Iterations 1 through 4 of K-means on the ten-cluster data set, for a different initialization]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one.
[Figure: Iterations 1 through 4 of K-means on the ten-cluster data set]
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one.
[Figure: Iterations 1 through 4 of K-means on the ten-cluster data set, for a different initialization]
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than K initial centroids and then select among these initial centroids
o Postprocessing
o Bisecting K-means
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies: choose the point that contributes most to SSE, or choose a replacement point from the cluster with the highest SSE; if there are several empty clusters, repeat.
Updating Centers Incrementally
In the basic K-means algorithm centroids are updated after all points are assigned to a centroid
An alternative is to update the centroids after each assignment (incremental approach)
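The incremental step can use an online-mean update, which matches the batch mean exactly (toy values chosen for illustration):

```python
def add_point(m, count, x):
    # Online mean: nudge the centroid toward x instead of recomputing it
    # from all assigned points.
    count += 1
    return m + (x - m) / count, count

m, n = 0.0, 0
for x in [1.0, 2.0, 4.0, 5.0]:
    m, n = add_point(m, n, x)
# m is now the batch mean of the four points
```

Updating after every assignment makes the result order-dependent but never leaves a centroid stranded with an empty cluster.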
Pre-processing and Post-processing
Pre-processing: normalize the data; eliminate outliers.
Post-processing: eliminate small clusters that may represent outliers; split 'loose' clusters (high SSE); merge clusters that are close (low SSE).
Bisecting K-means
Bisecting K-means algorithm: a variant of K-means that can produce a partitional or a hierarchical clustering by repeatedly splitting a chosen cluster in two with basic K-means.
Bisecting K-means Example
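A sketch of bisecting K-means on 1-D toy data: repeatedly split the cluster with the largest SSE using a simple two-center step (initialized at the cluster's min and max for determinism):

```python
import numpy as np

def sse(c):
    return float(((c - c.mean()) ** 2).sum())

def bisect(c, iters=20):
    lo, hi = float(c.min()), float(c.max())      # deterministic 2-means init
    for _ in range(iters):
        mask = np.abs(c - lo) <= np.abs(c - hi)
        left, right = c[mask], c[~mask]
        if len(left) == 0 or len(right) == 0:
            break
        lo, hi = float(left.mean()), float(right.mean())
    return left, right

def bisecting_kmeans(x, k):
    clusters = [x]
    while len(clusters) < k:
        # Split the cluster with the largest SSE.
        worst = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
        left, right = bisect(clusters.pop(worst))
        clusters += [left, right]
    return clusters

cs = bisecting_kmeans(np.array([0.0] * 5 + [10.0] * 5 + [20.0] * 5), 3)
```

The sequence of splits also defines a hierarchy, which is why bisecting K-means bridges the partitional and hierarchical views.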
Limitations of K-means
K-means has problems when clusters are of differing sizes, densities, or non-globular shapes.
K-means has problems when the data contains outliers.
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram: a tree-like diagram that records the sequence of merges or splits.
[Figure: six points (1 through 6) in the plane and the corresponding dendrogram, with merge heights between 0 and 0.2]
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters: any desired number can be obtained by cutting the dendrogram at the proper level.
They may correspond to meaningful taxonomies (e.g., in the biological sciences).
Hierarchical Clustering
Two main types of hierarchical clustering: agglomerative (start with the points as individual clusters and repeatedly merge the closest pair) and divisive (start with one all-inclusive cluster and repeatedly split).
Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.
Agglomerative Clustering Algorithm: the more popular hierarchical clustering technique.
Basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat: merge the two closest clusters and update the proximity matrix, until only a single cluster remains
Key operation is the computation of the proximity of two clusters.
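The loop above, sketched naively with MIN (single-link) proximity on made-up 1-D points; a real implementation would update the proximity matrix incrementally rather than rescan all pairs:

```python
def single_link(clusters, target_k):
    while len(clusters) > target_k:
        best = None
        for i in range(len(clusters)):           # scan all cluster pairs
            for j in range(i + 1, len(clusters)):
                # MIN linkage: distance between the closest pair of points.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

cs = single_link([[0.0], [0.5], [1.0], [9.0], [9.4], [10.0]], 2)
```

Stopping the merging at a target number of clusters corresponds to cutting the dendrogram at one level.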
Starting Situation
Start with clusters of individual points and a proximity matrix
[Diagram: points p1 through p5 as singleton clusters, with their proximity matrix]
Intermediate Situation
After some merging steps we have some clusters
[Diagram: clusters C1 through C5 and the current proximity matrix]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
[Diagram: clusters C1 through C5, with C2 and C5 as the closest pair, and the proximity matrix]
After Merging
The question is: "How do we update the proximity matrix?"
[Diagram: clusters C1, C3, C4 and the merged C2 U C5, with the proximity matrix to be updated]
How to Define Inter-Cluster Similarity
[Diagram: points p1 through p5 and their proximity matrix]
Candidate definitions of inter-cluster similarity:
o MIN
o MAX
o Group Average
o Distance Between Centroids
o Other methods driven by an objective function (Ward's Method uses squared error)
Cluster Similarity: MIN or Single Link
Similarity of two clusters is based on the two most similar (closest) points in the different clusters.
MST Divisive Hierarchical Clustering
Use MST for constructing hierarchy of clusters.
Cluster Validity
For supervised classification, we have a variety of measures to evaluate how good our model is.
For cluster analysis, the analogous question is how to evaluate the 'goodness' of the resulting clusters. But 'clusters are in the eye of the beholder'! Then why do we want to evaluate them?
Clusters found in Random Data
[Figure: four panels of points in the unit square: the original Random Points, and the clusterings found by K-means, DBSCAN, and Complete Link]
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
K-means Clustering
o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters K must be specified o The basic algorithm is very simple
K-means Clustering ndash Details
Initial centroids are often chosen randomly
The centroid mi is (typically) the mean of the points in the cluster lsquoClosenessrsquo is measured by Euclidean distance cosine similarity
correlation etc K-means will converge for common similarity measures mentioned
above Most of the convergence happens in the first few iterations
Complexity is O( n K I d )
sumisin
=iCxi
i xC
m||
1
Two different K-means Clusterings
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Optimal Clustering
Original Points
Importance of Choosing Initial Centroids
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 6
Importance of Choosing Initial Centroids
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 6
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
sumsum= isin
=K
i Cxi
i
xmdistSSE1
2 )(
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
xy
Iteration 5
Problems with Selecting Initial Points
If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids
o Postprocessing o Bisecting K-means
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies
Updating Centers Incrementally
In the basic K-means algorithm centroids are updated after all points are assigned to a centroid
An alternative is to update the centroids after each assignment (incremental approach)
Pre-processing and Post-processing
Pre-processing
Post-processing
Bisecting K-means
Bisecting K-means algorithm
Bisecting K-means Example
Limitations of K-means
K-means has problems when clusters are of differing
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
1 3 2 5 4 60
005
01
015
02
1
2
3
4
5
6
1
23 4
5
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
[four scatter plots of 100 points in the unit square (axes x and y): Random Points, and the clusterings found by K-means, DBSCAN, and Complete Link]
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data)
4. Comparing the results of two different sets of cluster analyses to determine which is better
5. Determining the 'correct' number of clusters
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices: the proximity matrix, and the "incidence" matrix (one row and one column per data point; an entry is 1 if the associated pair of points belongs to the same cluster, 0 otherwise)
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
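The correlation check just described can be sketched in a few lines. Here the proximity matrix is taken to be pairwise Euclidean distance, so a good clustering yields a strongly negative correlation, consistent with the negative Corr values reported on the slides; the helper names and the toy data set are illustrative:

```python
from math import dist, sqrt

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def validity_correlation(points, labels):
    """Correlation between pairwise distances and same-cluster indicators,
    taken over the n(n-1)/2 distinct pairs of points."""
    prox, inc = [], []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            prox.append(dist(points[i], points[j]))
            inc.append(1.0 if labels[i] == labels[j] else 0.0)
    return pearson(prox, inc)

# two tight, well-separated clusters -> strongly negative correlation
pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
labels = [0, 0, 0, 1, 1, 1]
print(validity_correlation(pts, labels))
```

Because proximity here is a distance (large = dissimilar) while the incidence entry is 1 for same-cluster pairs, strong structure shows up as correlation close to -1.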
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
[two scatter plots of points in the unit square]
Corr = -0.9235 (left), Corr = -0.5810 (right)
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
[scatter plot of well-separated clusters in the unit square, and the corresponding 100-point similarity matrix sorted by cluster labels, shown as a heatmap with color scale 0–1]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
DBSCAN
[similarity matrix (100 points, color scale 0–1) sorted by DBSCAN cluster labels, and the corresponding scatter plot of random points]
[similarity matrix (100 points, color scale 0–1) sorted by cluster labels]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
[scatter plot of random points in the unit square]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[scatter plot of random points and the similarity matrix sorted by cluster labels]
Complete Link
Using Similarity Matrix for Cluster Validation
DBSCAN
[similarity matrix (about 3000 points, color scale 0–1) for DBSCAN clusters 1–7 on a more complicated data set]
Clusters in more complicated figures aren't well separated.
Internal Index: used to measure the goodness of a clustering structure without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
[scatter plot of the data set and the corresponding SSE-versus-K curve for K = 2 to 30]
Internal Measures SSE
SSE curve for a more complicated data set
[scatter plot of the data set with clusters labeled 1–7]
SSE of clusters found using K-means
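To illustrate using SSE to estimate the number of clusters, the sketch below runs a simple K-means (Lloyd's algorithm with a deterministic farthest-first initialization, a simplification chosen for reproducibility rather than taken from the slides) for several values of K and reports the final SSE; the drop flattens once K reaches the natural number of clusters:

```python
from math import dist

def kmeans_sse(points, k, iters=50):
    """Lloyd's algorithm with deterministic farthest-first init; returns final SSE."""
    centers = [points[0]]
    while len(centers) < k:  # farthest-first: next center maximizes min distance
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    for _ in range(iters):
        # assignment step: each point goes to its closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist(p, centers[i]))].append(p)
        # update step: each center moves to the mean of its cluster
        centers = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return sum(min(dist(p, c) for c in centers) ** 2 for p in points)

# three tight, well-separated groups of three points each
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
for k in (1, 2, 3):
    print(k, round(kmeans_sse(pts, k), 2))
```

The SSE falls sharply until K = 3 (the true number of groups), after which further increases in K buy little.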
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
[histogram of SSE (Count versus SSE, roughly 0.016–0.034) for clusterings of random data, and a scatter plot of random points in the unit square]
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
[two scatter plots of points in the unit square]
Corr = -0.9235 (left), Corr = -0.5810 (right)
Cluster Cohesion: measures how closely related the objects in a cluster are
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters
Example: Squared Error
Internal Measures: Cohesion and Separation
Cohesion is measured by the within-cluster sum of squares: WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2
Separation is measured by the between-cluster sum of squares: BSS = \sum_i |C_i| (m - m_i)^2
where m is the overall mean of the data and m_i is the centroid of cluster C_i
Internal Measures: Cohesion and Separation
Example: SSE
[number line 1–5 with points 1, 2, 4, 5; overall centroid m = 3; cluster centroids m1 = 1.5 and m2 = 4.5]

K = 2 clusters {1, 2} and {4, 5}:
WSS = (1 - 1.5)^2 + (2 - 1.5)^2 + (4 - 4.5)^2 + (5 - 4.5)^2 = 1
BSS = 2 × (3 - 1.5)^2 + 2 × (4.5 - 3)^2 = 9
Total = 1 + 9 = 10

K = 1 cluster {1, 2, 4, 5}:
WSS = (1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2 = 10
BSS = 4 × (3 - 3)^2 = 0
Total = 10 + 0 = 10
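The worked example above can be checked mechanically. This short sketch computes WSS and BSS for the four points 1, 2, 4, 5 and verifies that WSS + BSS is constant (10) for both K = 1 and K = 2:

```python
def wss_bss(points, clusters):
    """Within- and between-cluster sums of squares for 1-D data."""
    m = sum(points) / len(points)          # overall mean
    wss = bss = 0.0
    for cl in clusters:
        mi = sum(cl) / len(cl)             # cluster centroid
        wss += sum((x - mi) ** 2 for x in cl)
        bss += len(cl) * (m - mi) ** 2
    return wss, bss

pts = [1, 2, 4, 5]
print(wss_bss(pts, [pts]))              # K=1: (10.0, 0.0)
print(wss_bss(pts, [[1, 2], [4, 5]]))   # K=2: (1.0, 9.0)
```

In both cases WSS + BSS = 10: the total sum of squares is fixed by the data, and clustering only shifts error between the two terms.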
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
[proximity graph: cohesion corresponds to edge weights within a cluster, separation to edge weights between clusters]
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i:
o a = average distance of i to the points in its own cluster
o b = minimum, over the other clusters, of the average distance of i to the points in that cluster
o silhouette coefficient s = (b - a) / max(a, b); typically between 0 and 1, and the closer to 1 the better
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures: Silhouette Coefficient
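A minimal sketch of the silhouette coefficient for one point, using the s = (b - a) / max(a, b) form (the function name and the toy 1-D data are illustrative):

```python
def silhouette(points, labels, i):
    """s(i) = (b - a) / max(a, b) for point i of a labeled 1-D data set:
    a = mean distance to i's own cluster, b = mean distance to the nearest
    other cluster."""
    mean_dist = lambda idxs: sum(abs(points[i] - points[j]) for j in idxs) / len(idxs)
    own = [j for j in range(len(points)) if labels[j] == labels[i] and j != i]
    a = mean_dist(own)
    b = min(mean_dist([j for j in range(len(points)) if labels[j] == lab])
            for lab in set(labels) if lab != labels[i])
    return (b - a) / max(a, b)

pts = [0, 1, 10, 11]
labs = [0, 0, 1, 1]
print(silhouette(pts, labs, 0))  # (10.5 - 1) / 10.5, about 0.905
```

For point 0, a = 1 (distance to its lone cluster-mate) and b = 10.5 (mean distance to the other cluster), so the point is well placed.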
External Measures of Cluster Validity Entropy and Purity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis.
Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function …
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering – Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids …
Importance of Choosing Initial Centroids …
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Ward's Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Similarity matrix for items I1–I5:

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
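Single-link (MIN) agglomerative clustering can be traced directly on the I1–I5 similarity matrix above. The sketch below (pure Python; function names are illustrative) repeatedly merges the pair of clusters whose closest members are most similar, giving the merge order I1+I2 (0.90), I4+I5 (0.80), I3 joining {I1, I2} (0.70), and a final merge at 0.65:

```python
# similarity matrix from the table above (upper triangle only)
sim = {
    ("I1", "I2"): 0.90, ("I1", "I3"): 0.10, ("I1", "I4"): 0.65, ("I1", "I5"): 0.20,
    ("I2", "I3"): 0.70, ("I2", "I4"): 0.60, ("I2", "I5"): 0.50,
    ("I3", "I4"): 0.40, ("I3", "I5"): 0.30,
    ("I4", "I5"): 0.80,
}

def s(a, b):
    """Symmetric similarity lookup."""
    return sim.get((a, b), sim.get((b, a)))

def single_link(items):
    """Agglomerative clustering with MIN (single link): repeatedly merge the
    pair of clusters whose closest members are most similar."""
    clusters = [frozenset([x]) for x in items]
    merges = []
    while len(clusters) > 1:
        pairs = [(a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]]
        a, b = max(pairs, key=lambda ab: max(s(x, y) for x in ab[0] for y in ab[1]))
        level = max(s(x, y) for x in a for y in b)   # single-link similarity
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((sorted(a | b), level))
    return merges

for cluster, level in single_link(["I1", "I2", "I3", "I4", "I5"]):
    print(cluster, "merged at similarity", level)
```

Recomputing cluster-to-cluster similarity as the maximum over member pairs is exactly the MIN/single-link rule from the slides.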
K-means Clustering – Details
Initial centroids are often chosen randomly
The centroid m_i is (typically) the mean of the points in the cluster:
m_i = \frac{1}{|C_i|} \sum_{x \in C_i} x
'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
K-means will converge for the common similarity measures mentioned above
Most of the convergence happens in the first few iterations
Complexity is O(n × K × I × d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes
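The two steps of the basic algorithm (assign points to the closest centroid, then move each centroid to the mean of its points) can be sketched for 1-D data; this is a minimal illustration, without the convergence test a real implementation would use:

```python
def kmeans(points, centroids, iters=10):
    """Basic K-means on 1-D data: assign each point to the closest centroid,
    then move each centroid to the mean of its points
    (m_i = (1/|C_i|) * sum of x in C_i)."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans([1, 2, 4, 5], centroids=[1.0, 5.0])
print(centroids)  # [1.5, 4.5]
print(clusters)   # [[1, 2], [4, 5]]
```

With these initial centroids the algorithm converges after a single iteration, consistent with the note that most of the convergence happens early.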
Two different K-means Clusterings
[three scatter plots of the same data: Original Points, Optimal Clustering, and Sub-optimal Clustering]
Importance of Choosing Initial Centroids
[six scatter plots showing the centroids and cluster assignments at Iterations 1–6]
Importance of Choosing Initial Centroids
[six scatter plots showing the centroids and cluster assignments at Iterations 1–6]
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)
where x is a data point in cluster C_i and m_i is the representative point (centroid) of cluster C_i
Importance of Choosing Initial Centroids …
[five scatter plots showing the centroids and cluster assignments at Iterations 1–5, for a different choice of initial centroids]
Importance of Choosing Initial Centroids …
[five scatter plots showing the centroids and cluster assignments at Iterations 1–5, for a different choice of initial centroids]
Problems with Selecting Initial Points
If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small
10 Clusters Example
[four scatter plots showing Iterations 1–4 of K-means on a data set of 10 clusters (five pairs)]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
[four scatter plots showing Iterations 1–4 of K-means on a data set of 10 clusters (five pairs)]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one
[four scatter plots showing Iterations 1–4 of K-means on a data set of 10 clusters (five pairs)]
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one
[four scatter plots showing Iterations 1–4 of K-means on a data set of 10 clusters (five pairs)]
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids
o Postprocessing
o Bisecting K-means
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies: choose the point that contributes most to SSE, or choose a point from the cluster with the highest SSE; if there are several empty clusters, this can be repeated
Updating Centers Incrementally
In the basic K-means algorithm centroids are updated after all points are assigned to a centroid
An alternative is to update the centroids after each assignment (incremental approach)
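The incremental approach rests on the fact that a cluster's mean can be updated in O(1) when a point is added, without revisiting the cluster's other points. A small sketch (the function name is illustrative):

```python
def incremental_mean(m, n, x):
    """Update the centroid m of a cluster of n points when point x is added;
    the new mean is computed without revisiting the old points."""
    return m + (x - m) / (n + 1)

# feeding points one at a time reproduces the batch mean of [1, 2, 4, 5]
m, n = 0.0, 0
for x in [1, 2, 4, 5]:
    m = incremental_mean(m, n, x)
    n += 1
print(round(m, 9))  # 3.0
```

The same identity (with a subtraction) handles a point leaving a cluster, which is what makes per-assignment centroid updates cheap.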
Pre-processing and Post-processing
Pre-processing: normalize the data; eliminate outliers
Post-processing: eliminate small clusters that may represent outliers; split 'loose' clusters (with relatively high SSE); merge clusters that are 'close' (with relatively low SSE)
Bisecting K-means
Bisecting K-means algorithm: a variant of K-means that can produce a partitional or a hierarchical clustering
Bisecting K-means Example
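A hedged sketch of the bisecting K-means idea, assuming the standard formulation (start with one cluster; repeatedly split the cluster with the largest SSE using 2-means); the deterministic min/max seeding of the split is a simplification for illustration, not part of the slides:

```python
def sse(cluster):
    """Sum of squared error of a 1-D cluster around its mean."""
    m = sum(cluster) / len(cluster)
    return sum((x - m) ** 2 for x in cluster)

def two_means(cluster, iters=10):
    """Split one cluster with K=2 means, seeded with the cluster's min and max."""
    c0, c1 = min(cluster), max(cluster)
    for _ in range(iters):
        a = [x for x in cluster if abs(x - c0) <= abs(x - c1)]
        b = [x for x in cluster if abs(x - c0) > abs(x - c1)]
        c0, c1 = sum(a) / len(a), sum(b) / len(b)
    return a, b

def bisecting_kmeans(points, k):
    """Repeatedly bisect the worst cluster until k clusters remain."""
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=sse)   # pick the cluster with the largest SSE
        clusters.remove(worst)
        clusters.extend(two_means(worst))
    return clusters

print(bisecting_kmeans([1, 2, 4, 5, 20, 21], 3))
```

The first bisection separates {20, 21} from the rest; the second splits the remaining high-SSE cluster into {1, 2} and {4, 5}. Recording the splits as a tree would yield the hierarchical variant.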
Limitations of K-means
K-means has problems when clusters are of differing sizes, densities, or non-globular shapes
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters: find parts of clusters, which then need to be put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
[example: six points (1–6) and the corresponding dendrogram, with merge heights between 0.05 and 0.2]
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters: any desired number can be obtained by 'cutting' the dendrogram at the proper level
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering:
o Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
o Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains an individual point (or there are k clusters)
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat: merge the two closest clusters and update the proximity matrix, until only a single cluster remains
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
[diagram: points p1–p5 and their proximity matrix]
Intermediate Situation
After some merging steps we have some clusters
[diagram: clusters C1–C5 and their proximity matrix]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
Two different K-means Clusterings
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Sub-optimal Clustering -2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Optimal Clustering
Original Points
Importance of Choosing Initial Centroids
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 6
Importance of Choosing Initial Centroids
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 6
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
sumsum= isin
=K
i Cxi
i
xmdistSSE1
2 )(
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
xy
Iteration 5
Problems with Selecting Initial Points
If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids
o Postprocessing o Bisecting K-means
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies
Updating Centers Incrementally
In the basic K-means algorithm centroids are updated after all points are assigned to a centroid
An alternative is to update the centroids after each assignment (incremental approach)
Pre-processing and Post-processing
Pre-processing
Post-processing
Bisecting K-means
Bisecting K-means algorithm
Bisecting K-means Example
Limitations of K-means
K-means has problems when clusters are of differing
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
1 3 2 5 4 60
005
01
015
02
1
2
3
4
5
6
1
23 4
5
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
[Figure: points p1-p5 and their proximity matrix, repeated with each definition highlighted]
o MIN
o MAX
o Group Average
o Distance Between Centroids
o Other methods driven by an objective function (Ward's Method uses squared error)
Cluster Similarity: MIN or Single Link
Similarity of two clusters is based on the two most similar (closest) points in the different clusters, i.e., it is determined by one pair of points.
MST Divisive Hierarchical Clustering
Use an MST (minimum spanning tree) for constructing the hierarchy of clusters: repeatedly break the remaining MST edge with the largest distance to split a cluster in two.
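A small sketch of the MST-based divisive idea, under illustrative assumptions (the `mst_clusters` helper and the five-point data set are made up): build a minimum spanning tree over the points, drop the k-1 largest edges, and read the surviving connected components off as clusters.

```python
def euclid(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def mst_clusters(points, k):
    n = len(points)
    # Prim's algorithm: grow the MST one vertex at a time.
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        w, u, v = min(
            (euclid(points[u], points[v]), u, v)
            for u in in_tree for v in range(n) if v not in in_tree
        )
        in_tree.add(v)
        edges.append((w, u, v))
    # Keeping only the n-k smallest MST edges leaves k connected components.
    edges.sort()
    keep = edges[: n - k]
    # Union-find over the kept edges labels the components.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, u, v in keep:
        parent[find(u)] = find(v)
    return [find(i) for i in range(n)]

pts = [(0, 0), (1, 0), (0, 1), (8, 8), (9, 8)]
labels = mst_clusters(pts, 2)
print(labels)  # first three points share one label, last two share another
```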
Cluster Validity

For supervised classification, we have a variety of measures to evaluate how good our model is.
For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder!" Then why do we want to evaluate them?
Clusters found in Random Data
[Figure: the same random points, and the clusters found in them by K-means, DBSCAN, and Complete Link]
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Measures of Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index: measures the extent to which cluster labels match externally supplied class labels
o Internal Index: measures the goodness of a clustering structure without respect to external information
o Relative Index: compares two different clusterings or clusters
Sometimes these are referred to as criteria instead of indices.
Measuring Cluster Validity Via Correlation
Two matrices:
o Proximity matrix
o "Incidence" matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, and 0 otherwise
Compute the correlation between the two matrices; since both are symmetric, only n(n-1)/2 entries need to be considered.
High correlation indicates that points that belong to the same cluster are close to each other.
Not a good measure for some density- or contiguity-based clusters.
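A minimal sketch of this correlation check (the `cluster_correlation` helper, the data, and the labels are made up for illustration): Pearson correlation between the upper-triangle entries of the proximity matrix and the incidence matrix. With distances as proximity, a good clustering gives a strongly negative correlation.

```python
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def cluster_correlation(points, labels):
    def d(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    prox, incid = [], []
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):  # symmetric matrices: upper triangle only
            prox.append(d(points[i], points[j]))
            incid.append(1 if labels[i] == labels[j] else 0)
    return pearson(prox, incid)

pts = [(0, 0), (0, 1), (9, 9), (9, 10)]
print(cluster_correlation(pts, [0, 0, 1, 1]))
# strongly negative: same-cluster pairs have small distances
```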
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
[Figure: well-separated data set, Corr = -0.9235; random data set, Corr = -0.5810]
Using Similarity Matrix for Cluster Validation
Order the similarity matrix with respect to cluster labels and inspect visually.
[Figure: well-separated points and their sorted similarity matrix, with crisp blocks along the diagonal]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: random points clustered by DBSCAN and the corresponding sorted similarity matrix]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: random points clustered by K-means and the corresponding sorted similarity matrix]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: random points clustered by Complete Link and the corresponding sorted similarity matrix]
Using Similarity Matrix for Cluster Validation
[Figure: a DBSCAN clustering with seven clusters and its sorted similarity matrix]
Clusters in more complicated figures aren't well separated.

Internal Measures: SSE
Internal Index: used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE).
Can also be used to estimate the number of clusters.
[Figure: the ten-cluster data set and SSE as a function of the number of clusters K, for K from 2 to 30]
Internal Measures: SSE
SSE curve for a more complicated data set.
[Figure: a seven-cluster data set and the SSE of clusters found using K-means]
Framework for Cluster Validity
Need a framework to interpret any measure: e.g., if our measure of evaluation has the value 10, is that good, fair, or poor?
Statistics provide a framework for cluster validity.
For comparing the results of two different sets of cluster analyses, a framework is less necessary.
Statistical Framework for SSE
Example: compare the SSE of the clustering found in the data against the distribution of SSE values obtained by clustering random data.
[Figure: histogram of SSE counts over many clusterings of random data, alongside the original data set]
Statistical Framework for Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
[Figure: well-separated data set, Corr = -0.9235; random data set, Corr = -0.5810]
Internal Measures: Cohesion and Separation
Cluster Cohesion: measures how closely related the objects in a cluster are.
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.
Example: Squared Error

$\mathrm{WSS} = \sum_i \sum_{x \in C_i} (x - m_i)^2$
$\mathrm{BSS} = \sum_i |C_i| \, (m - m_i)^2$

where $m$ is the overall mean of the data and $m_i$ is the mean of cluster $C_i$.
Internal Measures: Cohesion and Separation
Example: SSE for the data points 1, 2, 4, 5 with overall mean m = 3 (for K = 2, the cluster means are m1 = 1.5 and m2 = 4.5).

K = 1 cluster:
WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
BSS = 4 * (3-3)^2 = 0
Total = 10 + 0 = 10

K = 2 clusters:
WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
BSS = 2 * (3-1.5)^2 + 2 * (4.5-3)^2 = 9
Total = 1 + 9 = 10
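The worked example above can be verified in a few lines (the `wss_bss` helper is made up for this check; it takes a clustering as a list of 1-D clusters):

```python
def wss_bss(clusters):
    pts = [x for c in clusters for x in c]
    m = sum(pts) / len(pts)                      # overall mean
    means = [sum(c) / len(c) for c in clusters]  # per-cluster means m_i
    wss = sum((x - mi) ** 2 for c, mi in zip(clusters, means) for x in c)
    bss = sum(len(c) * (m - mi) ** 2 for c, mi in zip(clusters, means))
    return wss, bss

print(wss_bss([[1, 2, 4, 5]]))    # K=1: (10.0, 0.0)
print(wss_bss([[1, 2], [4, 5]]))  # K=2: (1.0, 9.0)
```

In both cases WSS + BSS = 10: the total is constant, and clustering trades between-cluster spread for within-cluster tightness.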
Internal Measures: Cohesion and Separation
A proximity graph based approach can also be used for cohesion and separation: cohesion is the sum of the weights of edges within a cluster; separation is the sum of the weights of edges between clusters.
[Figure: proximity graph with within-cluster (cohesion) and between-cluster (separation) edges]
Internal Measures: Silhouette Coefficient
Silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.
For an individual point i:
o Calculate a = average distance of i to the points in its cluster
o Calculate b = min (average distance of i to points in another cluster)
o The silhouette coefficient is s = (b - a) / max(a, b)
Can calculate the average silhouette width for a cluster or a clustering.
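The per-point definition translates directly to code. This is a sketch with made-up data and labels, not a reference implementation:

```python
def silhouette(i, points, labels):
    def d(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    # a: average distance from point i to the other points in its own cluster
    own = [d(points[i], p) for j, p in enumerate(points)
           if labels[j] == labels[i] and j != i]
    a = sum(own) / len(own)
    # b: minimum over other clusters of the average distance to that cluster
    b = min(
        sum(d(points[i], p) for j, p in enumerate(points) if labels[j] == c)
        / sum(1 for l in labels if l == c)
        for c in set(labels) if c != labels[i]
    )
    return (b - a) / max(a, b)

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
lbl = [0, 0, 1, 1]
print(round(silhouette(0, pts, lbl), 3))  # close to 1: point 0 is well placed
avg_width = sum(silhouette(i, pts, lbl) for i in range(4)) / 4  # clustering-level average
```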
External Measures of Cluster Validity: Entropy and Purity

Final Comment on Cluster Validity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage." (Algorithms for Clustering Data, Jain and Dubes)
Similarity matrix for five points I1-I5 (used in the hierarchical clustering examples):

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
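The single-link merge order on this similarity matrix can be traced in a few lines. A sketch, assuming the entries are similarities: with similarities, single link (MIN distance) merges the pair of clusters with the highest maximum cross-cluster similarity.

```python
from itertools import combinations

sim = {
    ("I1", "I2"): 0.90, ("I1", "I3"): 0.10, ("I1", "I4"): 0.65, ("I1", "I5"): 0.20,
    ("I2", "I3"): 0.70, ("I2", "I4"): 0.60, ("I2", "I5"): 0.50,
    ("I3", "I4"): 0.40, ("I3", "I5"): 0.30, ("I4", "I5"): 0.80,
}
def s(a, b):
    return sim.get((a, b)) or sim.get((b, a))

clusters = [{"I1"}, {"I2"}, {"I3"}, {"I4"}, {"I5"}]
merges = []
while len(clusters) > 1:
    # merge the pair of clusters with the highest single-link similarity
    ci, cj = max(combinations(clusters, 2),
                 key=lambda ab: max(s(p, q) for p in ab[0] for q in ab[1]))
    clusters.remove(ci); clusters.remove(cj)
    clusters.append(ci | cj)
    merges.append(sorted(ci | cj))
print(merges)
# [['I1', 'I2'], ['I4', 'I5'], ['I1', 'I2', 'I3'], ['I1', 'I2', 'I3', 'I4', 'I5']]
```

So {I1, I2} merge first (0.90), then {I4, I5} (0.80), then I3 joins {I1, I2} (0.70).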
Importance of Choosing Initial Centroids
[Figure: K-means iterations 1-6 from one choice of initial centroids]
Importance of Choosing Initial Centroids
[Figure: the same K-means iterations 1-6, shown panel by panel]
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE):

$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$

where $x$ is a data point in cluster $C_i$ and $m_i$ is the representative point (centroid) of cluster $C_i$.
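The SSE formula above, written as code on illustrative 2-D data (the `sse` helper takes a clustering as a list of clusters and computes each centroid $m_i$ itself):

```python
def sse(clusters):
    total = 0.0
    for c in clusters:
        mx = sum(x for x, _ in c) / len(c)  # centroid m_i, x-coordinate
        my = sum(y for _, y in c) / len(c)  # centroid m_i, y-coordinate
        total += sum((x - mx) ** 2 + (y - my) ** 2 for x, y in c)
    return total

print(sse([[(0, 0), (0, 2)], [(5, 5), (7, 5)]]))  # 2.0 + 2.0 = 4.0
```

Given two clusterings of the same data, the one with the smaller SSE is preferred.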
Importance of Choosing Initial Centroids …
[Figure: K-means iterations 1-5 from another choice of initial centroids]
Importance of Choosing Initial Centroids …
[Figure: the same K-means iterations 1-5, shown panel by panel]
Problems with Selecting Initial Points
If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small: with equal-sized clusters and random selection it is K!/K^K, which for K = 10 is about 0.00036.
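Making that chance concrete (assuming K equal-sized clusters and K initial centroids drawn uniformly at random from the data):

```python
from math import factorial

def p_one_per_cluster(k):
    # P(each of K random picks lands in a distinct one of K equal clusters)
    return factorial(k) / k ** k

for k in (2, 5, 10):
    print(k, p_one_per_cluster(k))
# for K = 10 this is 10!/10^10, roughly 0.00036
```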
10 Clusters Example
Starting with two initial centroids in one cluster of each pair of clusters.
[Figure: K-means iterations 1-4 on five pairs of clusters]
10 Clusters Example
Starting with two initial centroids in one cluster of each pair of clusters.
[Figure: the same K-means iterations 1-4, shown panel by panel]
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one.
[Figure: K-means iterations 1-4 on five pairs of clusters]
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one.
[Figure: the same K-means iterations 1-4, shown panel by panel]
Solutions to Initial Centroids Problem
o Multiple runs (helps, but probability is not on your side)
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids (e.g., the most widely separated ones)
o Postprocessing
o Bisecting K-means (not as susceptible to initialization issues)
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters.
Several strategies: choose the point that contributes most to SSE as a replacement centroid, or choose a point from the cluster with the highest SSE; if there are several empty clusters, repeat.
Updating Centers Incrementally
In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid.
An alternative is to update the centroids after each assignment (incremental approach): this is more expensive and introduces an order dependency, but it never produces an empty cluster.
Pre-processing and Post-processing
Pre-processing: normalize the data; eliminate outliers.
Post-processing: eliminate small clusters that may represent outliers; split 'loose' clusters (clusters with relatively high SSE); merge clusters that are 'close' and have relatively low SSE.
Bisecting K-means
Bisecting K-means algorithm: a variant of K-means that can produce a partitional or a hierarchical clustering by repeatedly picking a cluster and splitting it into two with basic K-means.

Bisecting K-means Example
[Figure: bisecting K-means splitting the data step by step]
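A hedged sketch of bisecting K-means (helper names and data are illustrative; the 2-means split uses a deterministic farthest-pair seeding and assumes both halves stay nonempty): repeatedly pick the cluster with the largest SSE and split it with K-means, K = 2, until K clusters remain.

```python
from itertools import combinations

def sq(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def two_means(c, iters=10):
    # split one cluster with basic K-means, K=2 (farthest pair as seeds)
    a, b = max(combinations(c, 2), key=lambda pq: sq(*pq))
    for _ in range(iters):
        g1 = [p for p in c if sq(p, a) <= sq(p, b)]
        g2 = [p for p in c if sq(p, a) > sq(p, b)]
        a = (sum(x for x, _ in g1) / len(g1), sum(y for _, y in g1) / len(g1))
        b = (sum(x for x, _ in g2) / len(g2), sum(y for _, y in g2) / len(g2))
    return g1, g2

def cluster_sse(c):
    mx = sum(x for x, _ in c) / len(c)
    my = sum(y for _, y in c) / len(c)
    return sum(sq(p, (mx, my)) for p in c)

def bisecting_kmeans(points, k):
    clusters = [points]
    while len(clusters) < k:
        worst = max(clusters, key=cluster_sse)  # pick the cluster to split
        clusters.remove(worst)
        clusters.extend(two_means(worst))
    return clusters

pts = [(0, 0), (1, 0), (10, 0), (11, 0), (5, 9), (5, 10)]
out = bisecting_kmeans(pts, 3)
print(sorted(map(sorted, out)))
# [[(0, 0), (1, 0)], [(5, 9), (5, 10)], [(10, 0), (11, 0)]]
```

Recording the sequence of splits would give the hierarchical clustering mentioned above.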
Limitations of K-means
K-means has problems when clusters are of differing
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
1 3 2 5 4 60
005
01
015
02
1
2
3
4
5
6
1
23 4
5
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
Importance of Choosing Initial Centroids
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 6
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
sumsum= isin
=K
i Cxi
i
xmdistSSE1
2 )(
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
xy
Iteration 5
Problems with Selecting Initial Points
If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids
o Postprocessing o Bisecting K-means
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies
Updating Centers Incrementally
In the basic K-means algorithm centroids are updated after all points are assigned to a centroid
An alternative is to update the centroids after each assignment (incremental approach)
Pre-processing and Post-processing
Pre-processing
Post-processing
Bisecting K-means
Bisecting K-means algorithm
Bisecting K-means Example
Limitations of K-means
K-means has problems when clusters are of differing
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
1 3 2 5 4 60
005
01
015
02
1
2
3
4
5
6
1
23 4
5
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity

[Figure: points p1 to p5 and their proximity matrix, with the similarity between two clusters highlighted]

MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function (Ward's Method uses squared error)
Cluster Similarity: MIN or Single Link. Similarity of two clusters is based on the two most similar (closest) points in the different clusters; i.e., it is determined by one pair of points, one link in the proximity graph
MST Divisive Hierarchical Clustering: use a minimum spanning tree (MST) of the points for constructing the hierarchy of clusters
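A small sketch of the MST-divisive idea, assuming 1-D points and Prim's algorithm for the tree; cutting the k-1 largest tree edges leaves k connected pieces, which are the clusters (equivalently, this reproduces single-link clusters):

```python
def mst_edges(points):
    """Prim's algorithm on 1-D points with |a - b| as the edge weight."""
    in_tree, rest, edges = {points[0]}, set(points[1:]), []
    while rest:
        w, a, b = min((abs(a - b), a, b) for a in in_tree for b in rest)
        in_tree.add(b); rest.remove(b); edges.append((w, a, b))
    return edges

def mst_divisive(points, k):
    """Cut the k-1 largest MST edges; the connected pieces are the clusters."""
    keep = sorted(mst_edges(points))[:len(points) - k]   # keep the small edges
    parent = {p: p for p in points}
    def find(p):
        while parent[p] != p: p = parent[p]
        return p
    for _, a, b in keep:                                 # union the kept edges
        parent[find(a)] = find(b)
    groups = {}
    for p in points:
        groups.setdefault(find(p), []).append(p)
    return sorted(sorted(g) for g in groups.values())

print(mst_divisive([1.0, 1.5, 2.0, 8.0, 8.5, 20.0], 3))
# [[1.0, 1.5, 2.0], [8.0, 8.5], [20.0]]
```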
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis, the analogous question is: how do we evaluate the "goodness" of the resulting clusters? But "clusters are in the eye of the beholder"! Then why do we want to evaluate them?
Clusters found in Random Data
[Figure: four panels over the unit square: 100 Random Points, and the clusters found in them by K-means, DBSCAN, and Complete Link]
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data)
4. Comparing the results of two different sets of cluster analyses to determine which is better
5. Determining the 'correct' number of clusters

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index: used to measure the extent to which cluster labels match externally supplied class labels (e.g., entropy)
o Internal Index: used to measure the goodness of a clustering structure without respect to external information (e.g., SSE)
o Relative Index: used to compare two different clusterings or clusters; often an external or internal index is used for this function

Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices:
o Proximity matrix
o 'Incidence' matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, 0 otherwise

Compute the correlation between the two matrices; since the matrices are symmetric, only n(n-1)/2 entries need to be compared

High correlation indicates that points that belong to the same cluster are close to each other

Not a good measure for some density- or contiguity-based clusters
Measuring Cluster Validity Via Correlation
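A toy illustration of the measure in pure Python, using 1-D points: the upper-triangle entries of the proximity (distance) matrix are paired with the matching incidence entries, and their Pearson correlation is computed (all data here is made up for illustration):

```python
import math

def correlation(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
labels = [0, 0, 0, 1, 1, 1]        # a clustering of the points

# Upper-triangle entries of the distance and incidence matrices.
dist, same = [], []
for i in range(len(points)):
    for j in range(i + 1, len(points)):
        dist.append(abs(points[i] - points[j]))
        same.append(1 if labels[i] == labels[j] else 0)

print(correlation(dist, same))  # strongly negative: same cluster => small distance
```

The correlation is negative because the proximity matrix here holds distances; with similarities it would be positive.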
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
[Figure: two data sets over the unit square: one with three well-separated clusters and one of random points]

Corr = -0.9235 (well-separated clusters); Corr = -0.5810 (random points)
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
[Figure: a data set with three well-separated clusters and its similarity matrix sorted by cluster label; three bright diagonal blocks are visible]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: DBSCAN clustering of the random points and its sorted similarity matrix; the block structure is much less crisp]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: K-means clustering of the random points and its sorted similarity matrix]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: complete-link clustering of the random points and its sorted similarity matrix]
Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN clusters (labeled 1 to 7) in a more complicated data set, and the corresponding sorted similarity matrix]
Clusters in more complicated figures aren't well separated

Internal Index: used to measure the goodness of a clustering structure without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
[Figure: SSE as K varies from 2 to 30 for the 10-cluster data set; the curve drops sharply at first and flattens once K reaches the number of natural clusters]
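The "estimate the number of clusters from the SSE curve" idea can be illustrated with a tiny 1-D K-means (a simplified sketch; in practice a library implementation with multiple random restarts would be used):

```python
def kmeans_1d(points, k, iters=50):
    """Plain K-means on 1-D points; centroids start spread over the range."""
    lo, hi = min(points), max(points)
    cents = [lo + (hi - lo) * i / (k - 1) for i in range(k)] if k > 1 else [lo]
    clusters = [points]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: abs(p - cents[i]))].append(p)
        cents = [sum(c) / len(c) if c else cents[i] for i, c in enumerate(clusters)]
    return cents, clusters

def total_sse(cents, clusters):
    return sum((p - c) ** 2 for c, cl in zip(cents, clusters) for p in cl)

data = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4, 10.0, 10.2, 10.4]  # 3 natural clusters
for k in (1, 2, 3, 4):
    cents, clusters = kmeans_1d(data, k)
    print(k, round(total_sse(cents, clusters), 3))
# SSE drops sharply until K reaches the 3 natural clusters, then levels off.
```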
Internal Measures SSE
SSE curve for a more complicated data set
[Figure: the complicated data set with its clusters labeled 1 to 7]
SSE of clusters found using K-means
Need a framework to interpret any measure: for example, if our measure of evaluation has the value 10, is that good, fair, or poor?

Statistics provide a framework for cluster validity: the more 'atypical' a clustering result is, the more likely it represents valid structure in the data; we can compare the value of an index on the actual data against its distribution over random data or random clusterings

For comparing the results of two different sets of cluster analyses, a framework is less necessary, but there is still the question of whether the difference between two index values is significant
Framework for Cluster Validity
Example: compare the SSE of the clustering of the real data against the distribution of SSE values obtained by clustering random data

Statistical Framework for SSE

[Figure: histogram of the SSE of three clusters found in 500 sets of random data points (SSE values roughly 0.016 to 0.034), next to the original data set with three well-separated clusters]
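A Monte-Carlo sketch of this framework in 1-D: compare the SSE of a 3-means clustering of structured data against the distribution of SSE values over many random data sets (all sizes and parameters here are illustrative, not the slide's):

```python
import random

def sse_of_3means(pts, iters=30):
    """SSE of a quick K-means (K = 3) run on 1-D points in [0, 1]."""
    cents = [0.2, 0.5, 0.8]                      # fixed spread-out init
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in cents]
        for p in pts:
            clusters[min(range(3), key=lambda i: abs(p - cents[i]))].append(p)
        cents = [sum(c) / len(c) if c else cents[i] for i, c in enumerate(clusters)]
    return sum((p - c) ** 2 for c, cl in zip(cents, clusters) for p in cl)

random.seed(1)
# Reference distribution: SSE of 3-means on 200 sets of 90 random points.
ref = sorted(sse_of_3means([random.random() for _ in range(90)]) for _ in range(200))

# Structured data: three tight clumps; its SSE falls far below the whole
# reference distribution, so the clustering is unlikely to be a fluke.
clumped = [random.gauss(m, 0.02) for m in (0.2, 0.5, 0.8) for _ in range(30)]
print(sse_of_3means(clumped) < ref[0])  # True: smaller than every random SSE
```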
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
[Figure: the same two data sets: one with three well-separated clusters and one of random points]

Corr = -0.9235 (well-separated clusters); Corr = -0.5810 (random points)
Cluster Cohesion: measures how closely related the objects in a cluster are

Cluster Separation: measures how distinct or well-separated a cluster is from other clusters

Example: Squared Error

Internal Measures Cohesion and Separation
Cohesion is measured by the within-cluster sum of squares (SSE):
WSS = Σ_i Σ_{x ∈ C_i} (x − m_i)²

Separation is measured by the between-cluster sum of squares:
BSS = Σ_i |C_i| (m − m_i)²

where m_i is the centroid of cluster C_i, m is the overall mean, and |C_i| is the size of cluster C_i
Internal Measures Cohesion and Separation
Example: SSE

Data points 1, 2, 4, 5 on a line; overall mean m = 3, cluster centroids m1 = 1.5 and m2 = 4.5

K=1 cluster:
WSS = (1−3)² + (2−3)² + (4−3)² + (5−3)² = 10
BSS = 4 × (3−3)² = 0
Total = 10 + 0 = 10

K=2 clusters ({1, 2} and {4, 5}):
WSS = (1−1.5)² + (2−1.5)² + (4−4.5)² + (5−4.5)² = 1
BSS = 2 × (3−1.5)² + 2 × (4.5−3)² = 9
Total = 1 + 9 = 10

WSS + BSS = Total is constant for a given data set
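The arithmetic above can be checked with a few lines of Python (a 1-D version of the WSS/BSS formulas):

```python
def wss_bss(clusters):
    """Within- and between-cluster sums of squares for 1-D clusters."""
    pts = [p for c in clusters for p in c]
    m = sum(pts) / len(pts)                         # overall mean
    wss = bss = 0.0
    for c in clusters:
        mi = sum(c) / len(c)                        # cluster centroid
        wss += sum((p - mi) ** 2 for p in c)
        bss += len(c) * (m - mi) ** 2
    return wss, bss

print(wss_bss([[1, 2, 4, 5]]))      # (10.0, 0.0)   K = 1
print(wss_bss([[1, 2], [4, 5]]))    # (1.0, 9.0)    K = 2
```

In both cases WSS + BSS = 10, the total sum of squares about the overall mean.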
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
[Figure: in the proximity graph view, cohesion is the sum of the edge weights within a cluster, and separation is the sum of the edge weights between clusters]
Silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings
For an individual point i:
a = average distance of i to the points in its own cluster
b = min, over the other clusters, of the average distance of i to the points in that cluster
s = (b − a) / max(a, b); typically between 0 and 1, and the closer to 1 the better

Can calculate the Average Silhouette width for a cluster or a clustering

Internal Measures Silhouette Coefficient
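A direct transcription of the silhouette formula for a single 1-D point (a sketch; a library routine would compute it for all points at once):

```python
def silhouette(point, own, others):
    """Silhouette coefficient s = (b - a) / max(a, b) for one point.

    own: the other points in this point's cluster; others: the remaining
    clusters, each a list of points.  1-D points for simplicity.
    """
    a = sum(abs(point - q) for q in own) / len(own)
    b = min(sum(abs(point - q) for q in c) / len(c) for c in others)
    return (b - a) / max(a, b)

# Point 1.0 in a tight cluster {0.8, 1.0, 1.2}, far from cluster {5.0, 5.4}:
s = silhouette(1.0, [1.2, 0.8], [[5.0, 5.4]])
print(round(s, 3))  # close to 1: the point sits well inside its own cluster
```

Averaging s over all points of a cluster (or of the whole clustering) gives the average silhouette width mentioned on the slide.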
External Measures of Cluster Validity Entropy and Purity
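The slide gives only the title; as a hedged illustration of what entropy and purity compute, assuming each cluster is represented by the list of the true class labels of its members:

```python
import math
from collections import Counter

def purity(clusters):
    """Weighted fraction of points belonging to their cluster's majority class."""
    n = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / n

def entropy(clusters):
    """Weighted average entropy of the class distribution within each cluster."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        probs = [cnt / len(c) for cnt in Counter(c).values()]
        total += len(c) / n * -sum(p * math.log2(p) for p in probs)
    return total

clusters = [['A', 'A', 'A', 'B'], ['B', 'B', 'C'], ['C', 'C', 'C']]
print(purity(clusters))   # 0.8: 8 of the 10 points are in their cluster's majority class
print(entropy(clusters))  # lower is better; 0 would mean perfectly pure clusters
```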
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

Algorithms for Clustering Data, Jain and Dubes
Final Comment on Cluster Validity
Similarity matrix for five items I1 to I5 (used in the single-link, complete-link, and group-average examples):

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
SSE = Σ_{i=1..K} Σ_{x ∈ C_i} dist²(m_i, x)

where x is a data point in cluster C_i and m_i is the representative point (centroid) of cluster C_i
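The SSE formula translates directly into code (a 1-D sketch):

```python
def sse(clusters, centroids):
    """Sum of squared error: for each point, the squared distance to its
    cluster's centroid, summed over all clusters (1-D for simplicity)."""
    return sum((x - m) ** 2 for m, c in zip(centroids, clusters) for x in c)

clusters = [[1.0, 2.0], [4.0, 5.0]]
centroids = [1.5, 4.5]              # the cluster means
print(sse(clusters, centroids))     # 4 * 0.25 = 1.0
```

Given two clusterings of the same data, the one with the smaller SSE is preferred; the centroids themselves are the means precisely because the mean minimizes this error.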
Importance of Choosing Initial Centroids ...
[Figure: iterations 1 through 5 of K-means from one choice of initial centroids]
Importance of Choosing Initial Centroids ...
[Figure: iterations 1 through 5 of K-means from a different choice of initial centroids]
Problems with Selecting Initial Points
If there are K 'real' clusters, then the chance of selecting one initial centroid from each cluster is small. If the clusters are the same size, n, the probability is K! n^K / (Kn)^K = K!/K^K; for example, for K = 10 the chance is 10!/10^10 = 0.00036
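The probability K!/K^K can be evaluated directly:

```python
import math

def prob_one_per_cluster(k):
    """P(each of K random initial centroids lands in a different one of K
    equal-size clusters) = K! / K**K."""
    return math.factorial(k) / k ** k

for k in (2, 5, 10):
    print(k, prob_one_per_cluster(k))
# The chance collapses quickly: for K = 10 it is already about 0.00036.
```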
10 Clusters Example
[Figure: iterations 1 through 4 of K-means on the 10-cluster data set]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
[Figure: iterations 1 through 4 of K-means on the 10-cluster data set]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one
[Figure: iterations 1 through 4 of K-means on the 10-cluster data set]
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one
[Figure: iterations 1 through 4 of K-means on the 10-cluster data set]
Solutions to Initial Centroids Problem
o Multiple runs: helps, but probability is not on your side
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids, e.g., the most widely separated ones
o Postprocessing
o Bisecting K-means: not as susceptible to initialization issues
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies: choose the point that contributes most to SSE, or choose a point from the cluster with the highest SSE; if there are several empty clusters, this can be repeated
Updating Centers Incrementally
In the basic K-means algorithm centroids are updated after all points are assigned to a centroid
An alternative is to update the centroids after each assignment (incremental approach)
Pre-processing and Post-processing
Pre-processing
Post-processing
Bisecting K-means
Bisecting K-means algorithm
Bisecting K-means Example
Limitations of K-means
K-means has problems when clusters are of differing
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
1 3 2 5 4 60
005
01
015
02
1
2
3
4
5
6
1
23 4
5
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 5
Importance of Choosing Initial Centroids hellip
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 1
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 2
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 3
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
x
y
Iteration 4
-2 -15 -1 -05 0 05 1 15 2
0
05
1
15
2
25
3
xy
Iteration 5
Problems with Selecting Initial Points
If there are K lsquorealrsquo clusters then the chance of selecting one centroid from each cluster is small
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids
o Postprocessing o Bisecting K-means
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies
Updating Centers Incrementally
In the basic K-means algorithm centroids are updated after all points are assigned to a centroid
An alternative is to update the centroids after each assignment (incremental approach)
Pre-processing and Post-processing
Pre-processing
Post-processing
Bisecting K-means
Bisecting K-means algorithm
Bisecting K-means Example
Limitations of K-means
K-means has problems when clusters are of differing
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
1 3 2 5 4 60
005
01
015
02
1
2
3
4
5
6
1
23 4
5
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
Cluster Similarity: MIN or Single Link

Similarity of two clusters is based on the two most similar (closest) points in the different clusters; it is determined by one pair of points, i.e., by one link in the proximity graph.
MST Divisive Hierarchical Clustering

Use a minimum spanning tree (MST) for constructing the hierarchy of clusters.
Cluster Validity
For supervised classification, we have a variety of measures to evaluate how good our model is.
For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"! Then why do we want to evaluate them? To avoid finding patterns in noise, to compare clustering algorithms, to compare two sets of clusters, and to compare two clusters.
Clusters found in Random Data
[Figure: the same set of random points on the unit square, shown unclustered (Random Points) and as clustered by K-means, DBSCAN, and Complete Link]
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index: used to measure the extent to which cluster labels match externally supplied class labels (e.g., entropy)
o Internal Index: used to measure the goodness of a clustering structure without respect to external information (e.g., Sum of Squared Error, SSE)
o Relative Index: used to compare two different clusterings or clusters; often an external or internal index is used for this function
Sometimes these are referred to as criteria instead of indices.
Measures of Cluster Validity
Two matrices:
o Proximity matrix
o "Incidence" matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, and 0 otherwise
Compute the correlation between the two matrices. Since the matrices are symmetric, only the correlation between n(n − 1)/2 entries needs to be calculated.
High correlation indicates that points that belong to the same cluster are close to each other.
Not a good measure for some density- or contiguity-based clusters.
Measuring Cluster Validity Via Correlation
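As a rough sketch (the names here are illustrative, not from the slides), the correlation between the incidence and proximity matrices can be computed over the n(n − 1)/2 upper-triangle entries:

```python
import math

def incidence_proximity_correlation(points, labels):
    """Pearson correlation between the incidence matrix (1 if a pair of points
    shares a cluster, else 0) and the proximity (distance) matrix,
    using only the upper triangle of each."""
    inc, prox = [], []
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            inc.append(1.0 if labels[i] == labels[j] else 0.0)
            prox.append(math.dist(points[i], points[j]))
    mi, mp = sum(inc) / len(inc), sum(prox) / len(prox)
    cov = sum((a - mi) * (b - mp) for a, b in zip(inc, prox))
    si = math.sqrt(sum((a - mi) ** 2 for a in inc))
    sp = math.sqrt(sum((b - mp) ** 2 for b in prox))
    return cov / (si * sp)
```

Because proximity here is a distance, a good clustering yields a strongly negative correlation (same cluster implies small distance), which matches the negative Corr values on the slides.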
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: two scatter plots on the unit square — a data set with well-separated clusters, and a set of random points]

Corr = −0.9235 (well-separated data), Corr = −0.5810 (random data)
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
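A minimal sketch of that reordering (illustrative names; the similarity matrix `sim` and the cluster `labels` are assumed given):

```python
def order_by_cluster(sim, labels):
    """Permute rows and columns of a similarity matrix so that points with the
    same cluster label are adjacent; well-separated clusters then show up as
    bright blocks along the diagonal."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    return [[sim[i][j] for j in order] for i in order]
```

Plotting the reordered matrix as a heat map gives the block-diagonal pictures on the following slides.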
[Figure: a scatter plot of well-separated clusters and its similarity matrix sorted by cluster label, showing sharp block-diagonal structure]
Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp:

[Figure: random points and their sorted similarity matrices under DBSCAN, K-means, and Complete Link — the diagonal blocks are far less distinct than for well-separated data]

[Figure: a more complicated data set, with seven clusters found by DBSCAN, and its sorted similarity matrix]
Clusters in more complicated figures aren't well separated.

Internal Index: used to measure the goodness of a clustering structure without respect to external information; SSE is an example.
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
[Figure: a data set and the SSE curve as the number of clusters K varies from 2 to 30]
Internal Measures SSE
SSE curve for a more complicated data set
[Figure: the more complicated data set, with seven clusters]
SSE of clusters found using K-means
Need a framework to interpret any measure: for example, if our measure of evaluation has the value 10, is that good, fair, or poor?
Statistics provide a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data.
For comparing the results of two different sets of cluster analyses, a framework is less necessary; however, there is the question of whether the difference between two index values is significant.
Framework for Cluster Validity
Example: compare an SSE of 0.005 for three clusters against the SSE of three clusters in random data.

Statistical Framework for SSE

[Figure: histogram of the SSE of three clusters over 500 sets of random data points, with SSE ranging roughly from 0.016 to 0.034, next to the original well-separated data set — an SSE of 0.005 would be highly atypical for random data]
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

Statistical Framework for Correlation

[Figure: the well-separated data set and the random data set]

Corr = −0.9235 (well-separated data), Corr = −0.5810 (random data)
Cluster Cohesion: measures how closely related the objects in a cluster are.
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.
Example: Squared Error
Internal Measures Cohesion and Separation
Cohesion is measured by the within-cluster sum of squares (WSS):

WSS = Σᵢ Σ_{x ∈ Cᵢ} (x − mᵢ)²

Separation is measured by the between-cluster sum of squares (BSS):

BSS = Σᵢ |Cᵢ| (m − mᵢ)²

where mᵢ is the centroid of cluster Cᵢ, |Cᵢ| is its size, and m is the overall mean.
Internal Measures Cohesion and Separation
Example: SSE, with data points 1, 2, 4, 5 on a line (overall mean m = 3)

K = 1 cluster:
WSS = (1 − 3)² + (2 − 3)² + (4 − 3)² + (5 − 3)² = 10
BSS = 4 × (3 − 3)² = 0
Total = 10 + 0 = 10

K = 2 clusters ({1, 2} with centroid m₁ = 1.5 and {4, 5} with centroid m₂ = 4.5):
WSS = (1 − 1.5)² + (2 − 1.5)² + (4 − 4.5)² + (5 − 4.5)² = 1
BSS = 2 × (3 − 1.5)² + 2 × (4.5 − 3)² = 9
Total = 1 + 9 = 10

Note that WSS + BSS stays constant.
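The arithmetic above can be checked with a small sketch (one-dimensional points; the helper names are ours):

```python
def wss_bss(points, labels):
    """Within-cluster (cohesion) and between-cluster (separation) sums of squares."""
    m = sum(points) / len(points)                 # overall mean
    groups = {}
    for p, l in zip(points, labels):
        groups.setdefault(l, []).append(p)
    centroids = {l: sum(c) / len(c) for l, c in groups.items()}
    wss = sum((p - centroids[l]) ** 2 for l, c in groups.items() for p in c)
    bss = sum(len(c) * (m - centroids[l]) ** 2 for l, c in groups.items())
    return wss, bss
```

For the points 1, 2, 4, 5 this gives (10.0, 0.0) with one cluster and (1.0, 9.0) with the clusters {1, 2} and {4, 5}; the total is 10 in both cases.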
A proximity-graph-based approach can also be used for cohesion and separation: cluster cohesion is the sum of the weights of all links within a cluster, and cluster separation is the sum of the weights between nodes in the cluster and nodes outside the cluster.

Internal Measures: Cohesion and Separation
Silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.
For an individual point i: let a = the average distance of i to the points in its own cluster, and b = the minimum, over the other clusters, of the average distance of i to the points in that cluster. The silhouette coefficient is then s = (b − a) / max(a, b); it typically lies between 0 and 1, and the closer to 1 the better.
Can calculate the average silhouette width for a cluster or a clustering.
Internal Measures Silhouette Coefficient
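A hedged sketch of the per-point silhouette coefficient (using the a/b definitions above; the names are illustrative, and point i's cluster is assumed to have at least two members):

```python
import math

def silhouette(points, labels, i):
    """s(i) = (b - a) / max(a, b): a = average distance of point i to its own
    cluster, b = smallest average distance to any other cluster."""
    others = [(p, l) for j, (p, l) in enumerate(zip(points, labels)) if j != i]
    own = [math.dist(points[i], p) for p, l in others if l == labels[i]]
    a = sum(own) / len(own)
    b = min(
        sum(math.dist(points[i], p) for p, l in others if l == lab) /
        sum(1 for _, l in others if l == lab)
        for lab in set(labels) if lab != labels[i])
    return (b - a) / max(a, b)
```

Averaging s(i) over a cluster or over all points gives the cluster-level and clustering-level silhouette widths mentioned above.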
External Measures of Cluster Validity: Entropy and Purity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
— Algorithms for Clustering Data, Jain and Dubes
Final Comment on Cluster Validity
Similarity matrix for five points I1–I5 (the example matrix used in the linkage illustrations):

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
Importance of Choosing Initial Centroids …

[Figure: five snapshots (iterations 1–5) of K-means on a two-dimensional data set, showing the centroids converging]
Problems with Selecting Initial Points

If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small, especially when K is large. If the clusters are the same size, n/K, the probability is roughly K!/K^K; for example, if K = 10, the chance is 10!/10¹⁰ = 0.00036.
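Under the idealized model of K equal-size clusters with initial centroids drawn independently and uniformly, this chance works out to K!/K^K:

```python
import math

def chance_one_per_cluster(k):
    """Probability that k uniformly drawn initial centroids land one per cluster,
    assuming k equal-size clusters: k! favorable assignments out of k**k."""
    return math.factorial(k) / k ** k
```

For K = 10 this is about 0.00036, so random initialization almost never starts with exactly one centroid per real cluster.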
10 Clusters Example
[Figure: K-means iterations 1–4 on the ten-cluster data set]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
[Figure: K-means iterations 1–4 on the ten-cluster data set]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one
[Figure: K-means iterations 1–4 on the ten-cluster data set]
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one
[Figure: K-means iterations 1–4 on the ten-cluster data set]
Solutions to Initial Centroids Problem
o Multiple runs: helps, but probability is not on your side
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids (e.g., the most widely separated ones)
o Postprocessing
o Bisecting K-means: not as susceptible to initialization issues
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters.
Several strategies for choosing a replacement centroid: pick the point that contributes most to SSE (the farthest-away point), or pick a point from the cluster with the highest SSE; if there are several empty clusters, this can be repeated several times.
Updating Centers Incrementally
In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid.
An alternative is to update the centroids after each assignment (incremental approach): each assignment updates zero or two centroids, which is more expensive and introduces an order dependency, but never produces an empty cluster.
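The incremental approach relies on the running-mean identity, so each update costs O(1); a tiny sketch (one-dimensional, names are ours):

```python
def add_point(centroid, count, x):
    """After assigning point x to a cluster whose centroid summarizes `count`
    points, update the mean in O(1) instead of recomputing it from scratch."""
    count += 1
    centroid += (x - centroid) / count   # running-mean update
    return centroid, count
```

Adding points 2.0 and then 4.0 to an empty cluster yields a centroid of 3.0, as a full recomputation would.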
Pre-processing and Post-processing
Pre-processing: normalize the data and eliminate outliers.
Post-processing: eliminate small clusters that may represent outliers, split "loose" clusters (clusters with relatively high SSE), and merge clusters that are "close" and have relatively low SSE.
Bisecting K-means
Bisecting K-means algorithm: a variant of K-means that can produce a partitional or a hierarchical clustering. Start with a single all-inclusive cluster; repeatedly pick a cluster (e.g., the one with the highest SSE), split it into two with basic 2-means, and keep the best resulting split; stop when the desired number of clusters is reached.
Bisecting K-means Example
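A hedged, one-dimensional sketch of bisecting K-means (splitting the cluster with the largest SSE via a basic 2-means seeded at the extremes; all names are illustrative):

```python
def cluster_sse(cluster):
    m = sum(cluster) / len(cluster)
    return sum((p - m) ** 2 for p in cluster)

def two_means(cluster, iters=10):
    """Basic 2-means on a list of 1-D points, seeded with the extreme values."""
    c1, c2 = min(cluster), max(cluster)
    for _ in range(iters):
        a = [p for p in cluster if abs(p - c1) <= abs(p - c2)]
        b = [p for p in cluster if abs(p - c1) > abs(p - c2)]
        if a: c1 = sum(a) / len(a)
        if b: c2 = sum(b) / len(b)
    return a, b

def bisecting_kmeans(points, k):
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=cluster_sse)   # split the least cohesive cluster
        clusters.remove(worst)
        clusters.extend(c for c in two_means(worst) if c)
    return clusters
```

On the points 1, 2, 4, 5, 20, 21 with k = 3, the far pair {20, 21} is separated first, and the remaining points are then split into {1, 2} and {4, 5}.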
Limitations of K-means
K-means has problems when clusters are of differing sizes, differing densities, or non-globular shapes.
K-means has problems when the data contains outliers.
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters: this finds parts of clusters, which then need to be put together.
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.

[Figure: a six-point clustering and its dendrogram]
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters: any desired number of clusters can be obtained by "cutting" the dendrogram at the proper level.
They may correspond to meaningful taxonomies, e.g., in the biological sciences (animal kingdom, phylogeny reconstruction, …).
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
[Figure: a data set with three well-separated clusters (Corr = -0.9235) and a data set of random points (Corr = -0.5810).]
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
[Figure: three well-separated clusters and their similarity matrix sorted by cluster label; the matrix shows three sharp, bright blocks along the diagonal (similarity scale 0 to 1).]
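The reordering step can be reproduced with a few lines of Python. This is a sketch under assumed names and a made-up 4x4 similarity matrix, not code from the slides: points with the same label are made adjacent, so good clusters show up as bright diagonal blocks.

```python
# Sketch (assumption): reorder a similarity matrix so that points with the
# same cluster label are adjacent.
def sort_by_labels(sim, labels):
    """Return (reordered matrix, reordered labels)."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    reordered = [[sim[i][j] for j in order] for i in order]
    return reordered, [labels[i] for i in order]

sim = [
    [1.0, 0.1, 0.9, 0.2],
    [0.1, 1.0, 0.2, 0.8],
    [0.9, 0.2, 1.0, 0.1],
    [0.2, 0.8, 0.1, 1.0],
]
labels = [0, 1, 0, 1]  # points 0 & 2 in one cluster, 1 & 3 in the other
reordered, ordered_labels = sort_by_labels(sim, labels)
# After reordering, the high similarities (0.9 and 0.8) sit in 2x2 blocks
# on the diagonal.
```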
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: random (x, y) data and the sorted similarity matrices of the clusterings produced by DBSCAN, K-means, and complete link; the diagonal blocks are much fuzzier than for well-separated clusters.]
Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN finds seven clusters in a more complicated data set; the corresponding sorted similarity matrix (about 3000 points) does not show sharply separated blocks.]
Clusters in more complicated figures aren't well separated.
Internal Index: used to measure the goodness of a clustering structure without respect to external information, e.g., SSE.
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
[Figure: a data set with ten natural clusters and its SSE-vs-K curve for K from 2 to 30; SSE drops steeply until K reaches the natural number of clusters and flattens afterwards.]
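The SSE-vs-K idea can be traced with a minimal K-means. This is an illustrative sketch, not the slides' code: a 1-D Lloyd's algorithm with a naive deterministic "first k points" initialization on an assumed toy data set with three natural groups.

```python
# Sketch (assumption): minimal 1-D K-means (Lloyd's algorithm) used to
# watch SSE fall as K grows -- the basis of the elbow heuristic.
def kmeans_1d(points, k, iters=100):
    cents = list(points[:k])                 # naive deterministic init
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:                     # assignment step
            j = min(range(k), key=lambda c: (x - cents[c]) ** 2)
            clusters[j].append(x)
        new = [sum(c) / len(c) if c else cents[j]   # update step
               for j, c in enumerate(clusters)]
        if new == cents:                     # converged
            break
        cents = new
    sse = sum((x - cents[j]) ** 2
              for j, c in enumerate(clusters) for x in c)
    return clusters, cents, sse

data = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0, 20.0, 21.0, 22.0]
sse_by_k = {k: kmeans_1d(data, k)[2] for k in (1, 2, 3)}
# SSE shrinks as K grows; at K = 3 the three natural groups are found.
```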
Internal Measures SSE
SSE curve for a more complicated data set
[Figure: SSE of the clusters found using K-means on a more complicated data set with seven labeled clusters.]
Need a framework to interpret any measure: for example, if our measure of evaluation has the value 10, is that good, fair, or poor?
Statistics provide a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data.
For comparing the results of two different sets of cluster analyses, a framework is less necessary; however, there is still the question of whether the difference between two index values is significant.
Framework for Cluster Validity
Example
Statistical Framework for SSE
[Figure: histogram of the SSE values obtained by clustering many sets of random points (SSE roughly 0.016 to 0.034), giving a null distribution against which the much lower SSE of the three well-separated clusters can be compared.]
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
[Figure: the same two data sets as before: well-separated clusters (Corr = -0.9235) and random points (Corr = -0.5810).]
Cluster Cohesion: measures how closely related the objects in a cluster are.
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.
Example: squared error.
Internal Measures: Cohesion and Separation
Cohesion is measured by the within-cluster sum of squares (WSS):
WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2
Separation is measured by the between-cluster sum of squares (BSS):
BSS = \sum_i |C_i| (m - m_i)^2
where m_i is the centroid of cluster C_i and m is the overall mean.
Internal Measures Cohesion and Separation
Example SSE
Data: the points 1, 2, 4, 5 on a line; overall mean m = 3; for K = 2, centroids m1 = 1.5 and m2 = 4.5.

K = 1 cluster:
WSS = (1 - 3)² + (2 - 3)² + (4 - 3)² + (5 - 3)² = 10
BSS = 4 × (3 - 3)² = 0
Total = 10 + 0 = 10

K = 2 clusters:
WSS = (1 - 1.5)² + (2 - 1.5)² + (4 - 4.5)² + (5 - 4.5)² = 1
BSS = 2 × (3 - 1.5)² + 2 × (4.5 - 3)² = 9
Total = 1 + 9 = 10
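The slide's arithmetic can be checked directly; the key observation is that WSS + BSS is constant for a given data set, and only the split between the two terms depends on the clustering. A small sketch (function name assumed):

```python
# Checking the worked example (points 1, 2, 4, 5): WSS + BSS is the same
# for any clustering of the same data.
def wss_bss(points, clusters):
    m = sum(points) / len(points)            # overall mean
    wss = bss = 0.0
    for c in clusters:
        mi = sum(c) / len(c)                 # cluster centroid
        wss += sum((x - mi) ** 2 for x in c)
        bss += len(c) * (m - mi) ** 2
    return wss, bss

points = [1.0, 2.0, 4.0, 5.0]
w1, b1 = wss_bss(points, [points])                   # K = 1: WSS=10, BSS=0
w2, b2 = wss_bss(points, [[1.0, 2.0], [4.0, 5.0]])   # K = 2: WSS=1,  BSS=9
```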
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
[Figure: in a proximity graph, cohesion is the sum of the weights of edges within a cluster, and separation is the sum of the weights of edges between clusters.]
Silhouette coefficient combines the ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.
For an individual point i:
o Calculate a = average distance of i to the points in its cluster.
o Calculate b = min (average distance of i to the points in another cluster).
o The silhouette coefficient of the point is s = (b - a) / max(a, b); it typically lies between 0 and 1, and the closer to 1 the better.
Can also calculate the average silhouette width for a cluster or a clustering.
Internal Measures: Silhouette Coefficient
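The per-point computation of a, b, and s = (b - a) / max(a, b) can be sketched as follows. This is an illustrative implementation over assumed 1-D points, not code from the slides:

```python
# Sketch (assumption): silhouette coefficient for each point of a 1-D
# data set, s = (b - a) / max(a, b).
def silhouette(points, labels):
    scores = []
    for i, x in enumerate(points):
        # a: average distance to the other points in i's own cluster
        own = [abs(x - y) for j, y in enumerate(points)
               if labels[j] == labels[i] and j != i]
        a = sum(own) / len(own)
        # b: minimum, over other clusters, of the average distance to them
        others = {}
        for j, y in enumerate(points):
            if labels[j] != labels[i]:
                others.setdefault(labels[j], []).append(abs(x - y))
        b = min(sum(d) / len(d) for d in others.values())
        scores.append((b - a) / max(a, b))
    return scores

# Two tight, well-separated clusters: every point scores close to 1.
scores = silhouette([1.0, 2.0, 9.0, 10.0], [0, 0, 1, 1])
avg = sum(scores) / len(scores)   # the average silhouette width
```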
External Measures of Cluster Validity Entropy and Purity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
(Algorithms for Clustering Data, Jain and Dubes)
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Similarity matrix used in the single-link (MIN) examples:

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
10 Clusters Example
[Figure: K-means iterations 1-4 on the ten-cluster data set.]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
[Figure: K-means iterations 1-4 on the ten-cluster data set.]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one.
[Figure: K-means iterations 1-4.]
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one.
[Figure: K-means iterations 1-4.]
Solutions to Initial Centroids Problem
o Multiple runs: helps, but probability is not on your side.
o Sample and use hierarchical clustering to determine initial centroids.
o Select more than k initial centroids and then select among these initial centroids (e.g., the most widely separated).
o Postprocessing.
o Bisecting K-means: not as susceptible to initialization issues.
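The "multiple runs" idea is simple to sketch: run K-means from several random initializations and keep the result with the lowest SSE. The code below is an assumption-laden illustration (1-D data, simple Lloyd's loop), not the slides' algorithm:

```python
# Sketch (assumption): K-means with multiple random restarts, keeping the
# clustering with the lowest SSE.
import random

def kmeans_1d(points, k, rng, iters=100):
    cents = rng.sample(points, k)            # random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda c: (x - cents[c]) ** 2)
            clusters[j].append(x)
        new = [sum(c) / len(c) if c else cents[j]
               for j, c in enumerate(clusters)]
        if new == cents:
            break
        cents = new
    sse = sum((x - cents[j]) ** 2
              for j, c in enumerate(clusters) for x in c)
    return clusters, sse

rng = random.Random(0)                       # seeded for reproducibility
data = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0, 20.0, 21.0, 22.0]
runs = [kmeans_1d(data, 3, rng) for _ in range(10)]
best_clusters, best_sse = min(runs, key=lambda r: r[1])
```

As the slide notes, with many true clusters the probability that random initialization covers them all is low, so restarts help but do not guarantee a good result.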
Handling Empty Clusters
The basic K-means algorithm can yield empty clusters.
Several strategies can handle this: replace the empty cluster's centroid with the point that contributes most to SSE, or with a point from the cluster with the highest SSE; if there are several empty clusters, repeat the process.
Updating Centers Incrementally
In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid.
An alternative is to update the centroids after each assignment (incremental approach): each assignment updates zero or two centroids; it is more expensive and introduces an order dependency, but it never produces an empty cluster.
Pre-processing and Post-processing
Pre-processing: normalize the data; eliminate outliers.
Post-processing: eliminate small clusters that may represent outliers; split "loose" clusters (clusters with relatively high SSE); merge clusters that are "close" and have relatively low SSE.
Bisecting K-means
Bisecting K-means algorithm: a variant of K-means that can produce a partitional or a hierarchical clustering. Start with a single cluster containing all points; repeatedly select a cluster and split it into two using basic 2-means, until K clusters have been produced.
Bisecting K-means Example
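A minimal bisecting K-means can be sketched as below. This is an illustrative implementation with assumed details (1-D data; the highest-SSE cluster is the one split; the inner 2-means is deterministically seeded with the cluster's min and max), not the exact algorithm from the slides:

```python
# Sketch (assumption): bisecting K-means -- repeatedly split the cluster
# with the largest SSE using plain 2-means.
def two_means(cluster, iters=100):
    cents = [min(cluster), max(cluster)]     # deterministic 2-means seed
    for _ in range(iters):
        halves = [[], []]
        for x in cluster:
            near = 0 if (x - cents[0]) ** 2 <= (x - cents[1]) ** 2 else 1
            halves[near].append(x)
        new = [sum(h) / len(h) if h else cents[j] for j, h in enumerate(halves)]
        if new == cents:
            break
        cents = new
    return halves

def sse(cluster):
    m = sum(cluster) / len(cluster)
    return sum((x - m) ** 2 for x in cluster)

def bisecting_kmeans(points, k):
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=sse)       # split the highest-SSE cluster
        clusters.remove(worst)
        clusters.extend(h for h in two_means(worst) if h)
    return clusters

clusters = bisecting_kmeans(
    [1.0, 2.0, 3.0, 8.0, 9.0, 10.0, 20.0, 21.0, 22.0], 3)
```

On this toy data the splits recover the three natural groups, which is why bisecting K-means is less sensitive to initialization than a single K-means run.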
Limitations of K-means
K-means has problems when clusters are of differing sizes, densities, or non-globular shapes.
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters: find parts of clusters, which then need to be put together.
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
[Figure: a six-point data set and the corresponding dendrogram, with merge heights between 0 and 0.2.]
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters: any desired number of clusters can be obtained by "cutting" the dendrogram at the proper level.
They may correspond to meaningful taxonomies, e.g., in the biological sciences (animal kingdom, phylogeny reconstruction).
Hierarchical Clustering
Two main types of hierarchical clustering:
o Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters, until only one cluster (or k clusters) remains.
o Divisive: start with one all-inclusive cluster; at each step, split a cluster, until each cluster contains a single point (or there are k clusters).
Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
The basic algorithm is straightforward:
1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. Repeat: merge the two closest clusters and update the proximity matrix, until only a single cluster remains.
The key operation is the computation of the proximity of two clusters; the different definitions of inter-cluster proximity distinguish the different algorithms.
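The loop above can be sketched directly. This is an illustrative O(n³) implementation over assumed 1-D points, using single-link (MIN) proximity and stopping at k clusters rather than one; it is not the slides' code:

```python
# Sketch (assumption): basic agglomerative clustering with single-link
# (MIN) proximity, merging the closest pair until k clusters remain.
def single_link_dist(a, b):
    """Distance between clusters = distance of their closest pair."""
    return min(abs(x - y) for x in a for y in b)

def agglomerative(points, k):
    clusters = [[x] for x in points]         # each point starts as a cluster
    while len(clusters) > k:
        # find the closest pair of clusters
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda p: single_link_dist(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]   # merge the pair
        del clusters[j]
    return clusters

clusters = agglomerative(
    [1.0, 2.0, 3.0, 8.0, 9.0, 10.0, 20.0, 21.0, 22.0], 3)
```

Recording the sequence of merges (instead of stopping at k) yields the dendrogram.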
Starting Situation
Start with clusters of individual points and a proximity matrix
[Figure: five points p1-p5 and their proximity matrix.]
Intermediate Situation
After some merging steps we have some clusters
[Figure: clusters C1-C5 and the current proximity matrix.]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
[Figure: clusters C1-C5; C2 and C5 are the closest pair.]
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
[Figure: the clusters after the merge, with C2 U C5 as a single cluster; the proximity matrix now has one row and one column for C2 U C5.]
How to Define Inter-Cluster Similarity
[Figure: two clusters of points and their proximity matrix.]
o MIN
o MAX
o Group Average
o Distance Between Centroids
o Other methods driven by an objective function (Ward's Method uses squared error)
Cluster Similarity: MIN or Single Link. The similarity of two clusters is based on the two most similar (closest) points in the different clusters, i.e., it is determined by one pair of points, one link in the proximity graph.
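Using the I1-I5 similarity matrix given earlier, single link can be computed in a few lines. Note that with *similarities* (rather than distances), the "closest" pair is the most similar one, so MIN linkage takes a maximum. The cluster choice {I1, I2} vs. {I3, I4, I5} below is an illustrative assumption:

```python
# Single-link similarity over the I1-I5 similarity matrix from the slides.
sim = {
    ("I1", "I2"): 0.90, ("I1", "I3"): 0.10, ("I1", "I4"): 0.65,
    ("I1", "I5"): 0.20, ("I2", "I3"): 0.70, ("I2", "I4"): 0.60,
    ("I2", "I5"): 0.50, ("I3", "I4"): 0.40, ("I3", "I5"): 0.30,
    ("I4", "I5"): 0.80,
}

def s(a, b):
    """Look up the (symmetric) similarity of two items."""
    return sim.get((a, b), sim.get((b, a), 1.0))  # 1.0 on the diagonal

def single_link_sim(c1, c2):
    """Similarity of two clusters = similarity of their most similar pair."""
    return max(s(a, b) for a in c1 for b in c2)

# e.g. sim({I1, I2}, {I3, I4, I5}) is driven by the (I2, I3) pair: 0.70
link = single_link_sim({"I1", "I2"}, {"I3", "I4", "I5"})
```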
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
10 Clusters Example
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids
o Postprocessing o Bisecting K-means
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies
Updating Centers Incrementally
In the basic K-means algorithm centroids are updated after all points are assigned to a centroid
An alternative is to update the centroids after each assignment (incremental approach)
Pre-processing and Post-processing
Pre-processing
Post-processing
Bisecting K-means
Bisecting K-means algorithm
Bisecting K-means Example
Limitations of K-means
K-means has problems when clusters are of differing
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
1 3 2 5 4 60
005
01
015
02
1
2
3
4
5
6
1
23 4
5
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Similarity matrix for the hierarchical clustering linkage examples (points I1–I5):

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one.
[Figure: K-means iterations 1–4 on the 10-cluster data set (x from 0 to 20, y from -6 to 8)]
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids
o Postprocessing
o Bisecting K-means
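The first remedy, multiple runs, can be sketched in Python (a minimal 1-D illustration, not the lecture's code; `kmeans` and `best_of_n_runs` are hypothetical names): run Lloyd's algorithm from several random initializations and keep the run with the lowest SSE.

```python
# Sketch of the "multiple runs" remedy for bad initial centroids:
# run K-means from several random starts, keep the lowest-SSE result.
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random distinct initial centroids
    for _ in range(iters):
        # Assignment step: each point goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p - centroids[c]) ** 2)
            clusters[i].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    sse = sum(min((p - m) ** 2 for m in centroids) for p in points)
    return centroids, sse

def best_of_n_runs(points, k, runs=10):
    # Keep the run with the smallest SSE.
    return min((kmeans(points, k, seed=s) for s in range(runs)),
               key=lambda result: result[1])

points = [1.0, 2.0, 4.0, 5.0, 10.0, 11.0]
centroids, sse = best_of_n_runs(points, k=3)
```

Multiple runs help, but with many true clusters the probability that random initialization places one centroid in each is small, which motivates the other remedies in the list.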
Handling Empty Clusters
The basic K-means algorithm can yield empty clusters.
Several strategies exist: replace the empty centroid with the point that contributes most to SSE, or with a point from the cluster that has the highest SSE; if there are several empty clusters, this can be repeated.
Updating Centers Incrementally
In the basic K-means algorithm, centroids are updated only after all points have been assigned to a centroid.
An alternative is to update the centroids after each individual assignment (incremental approach). This never yields an empty cluster, but it is more expensive and introduces an order dependency.
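The incremental update itself reduces to a running-mean formula; a minimal sketch (names and numbers are illustrative):

```python
# Sketch of incremental centroid updating: after each single point is
# (re)assigned, the affected centroid is recomputed immediately via a
# running mean instead of waiting for a full pass over the data.
def add_point(centroid, size, p):
    """Running-mean update when point p joins a cluster of `size` points."""
    return (centroid * size + p) / (size + 1), size + 1

c, n = 2.0, 3              # cluster currently holds 3 points with mean 2.0
c, n = add_point(c, n, 6.0)
# mean of the 4 points is now (2.0*3 + 6.0)/4 = 3.0
```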
Pre-processing and Post-processing
Pre-processing: normalize the data; eliminate outliers.
Post-processing: eliminate small clusters that may represent outliers; split "loose" clusters (clusters with relatively high SSE); merge clusters that are close and have relatively low SSE.
Bisecting K-means
Bisecting K-means algorithm: a variant of K-means that can produce a partitional or a hierarchical clustering. Start with one all-inclusive cluster; repeatedly pick a cluster and split it with 2-means until K clusters remain.
Bisecting K-means Example
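A compact sketch of the bisecting procedure, assuming 1-D data and a simple 2-means splitter (names and data are illustrative, not from the lecture):

```python
# Sketch of bisecting K-means: repeatedly pick the cluster with the
# largest SSE and split it in two with 2-means.
def two_means(points, iters=20):
    c = [min(points), max(points)]            # deterministic init for the split
    for _ in range(iters):
        groups = ([p for p in points if abs(p - c[0]) <= abs(p - c[1])],
                  [p for p in points if abs(p - c[0]) > abs(p - c[1])])
        c = [sum(g) / len(g) if g else c[i] for i, g in enumerate(groups)]
    return groups

def sse(group):
    m = sum(group) / len(group)
    return sum((p - m) ** 2 for p in group)

def bisecting_kmeans(points, k):
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=sse)        # split the cluster with largest SSE
        clusters.remove(worst)
        clusters.extend(g for g in two_means(worst) if g)
    return clusters

clusters = bisecting_kmeans([1.0, 2.0, 4.0, 5.0, 10.0, 11.0], k=3)
```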
Limitations of K-means
K-means has problems when clusters are of differing sizes, densities, or non-globular shapes.
K-means also has problems when the data contains outliers.
Limitations of K-means: Differing Sizes
[Figure: original points vs. K-means with 3 clusters]
Limitations of K-means: Differing Density
[Figure: original points vs. K-means with 3 clusters]
Limitations of K-means: Non-globular Shapes
[Figure: original points vs. K-means with 2 clusters]
Overcoming K-means Limitations
[Figure: original points vs. K-means clusters]
One solution is to use many clusters: find parts of clusters, then put them together.
Overcoming K-means Limitations
[Figure: original points vs. K-means clusters]
Overcoming K-means Limitations
[Figure: original points vs. K-means clusters]
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram.
[Figure: nested clusters over points 1–6 and the corresponding dendrogram, with merge heights from 0 to 0.2]
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters: any desired number can be obtained by cutting the dendrogram at the proper level.
The nested clusters may correspond to meaningful taxonomies (e.g., in the biological sciences).
Hierarchical Clustering
Two main types of hierarchical clustering: agglomerative (start with the points as individual clusters and merge) and divisive (start with one all-inclusive cluster and split).
Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.
Agglomerative Clustering Algorithm
The more popular hierarchical clustering technique. The basic algorithm is straightforward: compute the proximity matrix, let each data point be a cluster, then repeatedly merge the two closest clusters and update the proximity matrix until only a single cluster remains.
The key operation is the computation of the proximity of two clusters; different definitions of cluster proximity distinguish the different algorithms.
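The basic algorithm transcribes almost directly into code. The sketch below uses MIN (single link) as the cluster proximity for concreteness and a naive pairwise search; data and names are illustrative:

```python
# Sketch of the basic agglomerative algorithm: start with singleton
# clusters, repeatedly merge the two closest clusters under the MIN
# (single-link) proximity. Naive O(n^3); 1-D data for brevity.
def single_link(a, b):
    return min(abs(p - q) for p in a for q in b)

def agglomerate(points, target_k):
    clusters = [[p] for p in points]          # 1. each point is a cluster
    merges = []
    while len(clusters) > target_k:           # 2. repeat until target_k remain
        # find the closest pair of clusters under the MIN proximity
        (i, j) = min(((i, j) for i in range(len(clusters))
                             for j in range(i + 1, len(clusters))),
                     key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]   # 3. merge the two closest
        del clusters[j]                           # 4. update the cluster list
    return clusters, merges

clusters, merges = agglomerate([1.0, 2.0, 4.0, 5.0, 10.0], target_k=2)
```

Swapping `single_link` for a MAX or group-average proximity changes the linkage without touching the rest of the loop.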
Starting Situation
Start with clusters of individual points and a proximity matrix.
[Figure: points p1–p5 and their pairwise proximity matrix]
Intermediate Situation
After some merging steps, we have some clusters.
[Figure: clusters C1–C5 and their proximity matrix]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1–C5 with C2 and C5 about to merge]
After Merging
The question is: "How do we update the proximity matrix?"
[Figure: clusters C1, C3, C4, and C2 ∪ C5 with the partially updated proximity matrix]
How to Define Inter-Cluster Similarity
MIN; MAX; Group Average; Distance Between Centroids; other methods driven by an objective function (Ward's Method uses squared error).
[Figure: points p1–p5 with their proximity matrix, illustrating the similarity between two clusters]
Cluster Similarity: MIN or Single Link
The similarity of two clusters is based on the two most similar (closest) points in the different clusters; it is determined by one pair of points, i.e., by one link in the proximity graph.
MST Divisive Hierarchical Clustering
Use a minimum spanning tree (MST) to construct the hierarchy of clusters: build the MST over the points, then repeatedly split by removing the largest remaining edge.
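The MST-based divisive idea can be sketched as follows (illustrative 1-D data; `mst_edges` and `mst_clusters` are hypothetical helper names, not from the lecture):

```python
# Sketch of MST-based divisive clustering: build a minimum spanning tree
# over the points, delete the k-1 largest edges, and take the remaining
# connected components as the clusters.
def mst_edges(points):
    """Prim's algorithm over the complete graph with |p - q| edge weights."""
    in_tree, edges = {0}, []
    while len(in_tree) < len(points):
        w, i, j = min((abs(points[i] - points[j]), i, j)
                      for i in in_tree for j in range(len(points))
                      if j not in in_tree)
        in_tree.add(j)
        edges.append((w, i, j))
    return edges

def mst_clusters(points, k):
    # keep the n-k smallest MST edges, i.e., drop the k-1 largest
    edges = sorted(mst_edges(points))[:len(points) - k]
    # collect connected components of the remaining forest (union-find)
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for _, i, j in edges:
        parent[find(i)] = find(j)
    comps = {}
    for idx, p in enumerate(points):
        comps.setdefault(find(idx), []).append(p)
    return list(comps.values())

clusters = mst_clusters([1.0, 2.0, 4.0, 5.0, 10.0], k=2)
```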
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is.
For cluster analysis the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"! Then why do we want to evaluate them? To avoid finding patterns in noise, to compare clustering algorithms, to compare two sets of clusters, and to compare two clusters.
Clusters found in Random Data
[Figure: 100 random points in the unit square, and the three clusters found in them by K-means, DBSCAN, and complete-link hierarchical clustering — each algorithm imposes structure even on random data]
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information — use only the data.
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Measures of Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index: measures the extent to which cluster labels match externally supplied class labels (e.g., entropy).
o Internal Index: measures the goodness of a clustering structure without respect to external information (e.g., SSE).
o Relative Index: compares two different clusterings or clusters, often using an external or internal index.
Sometimes these are referred to as criteria instead of indices.
Measuring Cluster Validity Via Correlation
Two matrices: the proximity matrix, and an "incidence" matrix with one row and one column per data point, whose entry is 1 if the associated pair of points belongs to the same cluster and 0 otherwise.
Compute the correlation between the two matrices; since both are symmetric, only the entries above the diagonal need to be considered.
High correlation (in magnitude) indicates that points that belong to the same cluster are close to each other.
Not a good measure for some density- or contiguity-based clusters.
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
[Figure: well-separated clusters in the unit square, Corr = -0.9235; random points, Corr = -0.5810]
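The correlation computation itself is short; a numpy sketch (numpy assumed available; `validity_correlation` is a hypothetical name). With distance as the proximity, a good clustering gives a strongly negative correlation, matching the negative values on the slide:

```python
# Sketch of the incidence/proximity correlation check: correlate the
# pairwise distance matrix with the 0/1 same-cluster incidence matrix.
import numpy as np

def validity_correlation(X, labels):
    n = len(X)
    dist = np.abs(X[:, None] - X[None, :])               # proximity (distance) matrix
    incidence = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices(n, k=1)                         # use each pair once
    return np.corrcoef(dist[iu], incidence[iu])[0, 1]

X = np.array([1.0, 2.0, 4.0, 5.0, 10.0, 11.0])
labels = np.array([0, 0, 1, 1, 2, 2])
corr = validity_correlation(X, labels)                   # strongly negative
```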
Using Similarity Matrix for Cluster Validation
Order the similarity matrix with respect to cluster labels and inspect visually.
[Figure: well-separated clusters and their sorted 100×100 similarity matrix, showing three sharp diagonal blocks]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: sorted similarity matrices for DBSCAN, K-means, and complete-link clusterings of 100 random points — the diagonal block structure is much weaker than for well-separated clusters]
Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN clustering of a more complicated point set (clusters 1–7) and its sorted similarity matrix]
Clusters in more complicated figures aren't well separated.

Internal Measures: SSE
Internal Index: used to measure the goodness of a clustering structure without respect to external information. SSE is good for comparing two clusterings or two clusters (average SSE). It can also be used to estimate the number of clusters.
[Figure: the 10-cluster data set and its SSE curve for K = 2 to 30 — the knee in the curve suggests the number of clusters]
Internal Measures: SSE
SSE curve for a more complicated data set.
[Figure: point set with clusters 1–7 and the SSE of clusters found using K-means]
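Using SSE to estimate the number of clusters can be sketched as an elbow search (numpy assumed available; data, seeds, and names are illustrative, not from the lecture):

```python
# Sketch of the SSE elbow: run K-means for a range of K (best of a few
# random restarts each) and look for the knee in the SSE-vs-K curve.
import numpy as np

def kmeans_sse(X, k, iters=25, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.abs(X[:, None] - centroids[None, :])      # point-to-centroid distances
        assign = d.argmin(axis=1)
        centroids = np.array([X[assign == c].mean() if np.any(assign == c)
                              else centroids[c] for c in range(k)])
    return float((np.abs(X[:, None] - centroids[None, :]).min(axis=1) ** 2).sum())

X = np.array([1.0, 2.0, 4.0, 5.0, 10.0, 11.0])
curve = {k: min(kmeans_sse(X, k, seed=s) for s in range(5)) for k in range(1, 5)}
# the knee of the SSE-vs-K curve suggests the number of clusters
```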
Framework for Cluster Validity
Need a framework to interpret any measure: for example, if a measure of cluster validity has the value 10, is that good, fair, or poor?
Statistics provide a framework for cluster validity: the more "atypical" a clustering result is with respect to random data, the more likely it represents valid structure.
For comparing the results of two different sets of cluster analyses, a framework is less necessary, but there is still the question of whether the difference between the two is significant.
Statistical Framework for SSE
Example: compare the SSE of a clustering of real data against the distribution of SSE values obtained by clustering random data.
[Figure: histogram of SSE over clusterings of random points in the unit square (SSE roughly 0.016 to 0.034) — an observed SSE far below this range is unlikely to arise by chance]
Statistical Framework for Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
[Figure: well-separated clusters, Corr = -0.9235; random points, Corr = -0.5810]
Internal Measures: Cohesion and Separation
Cluster Cohesion: measures how closely related the objects in a cluster are. Example: the within-cluster sum of squares (SSE),
WSS = Σ_i Σ_{x ∈ C_i} (x − m_i)²
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters. Example: the between-cluster sum of squares,
BSS = Σ_i |C_i| (m − m_i)²
where m_i is the centroid of cluster C_i, |C_i| is its size, and m is the overall mean.
Internal Measures: Cohesion and Separation
Example: SSE for the points 1, 2, 4, 5 on a line, with overall mean m = 3.
K = 1 cluster:
WSS = (1 − 3)² + (2 − 3)² + (4 − 3)² + (5 − 3)² = 10
BSS = 4 × (3 − 3)² = 0
Total = 10 + 0 = 10
K = 2 clusters {1, 2} and {4, 5}, with centroids m1 = 1.5 and m2 = 4.5:
WSS = (1 − 1.5)² + (2 − 1.5)² + (4 − 4.5)² + (5 − 4.5)² = 1
BSS = 2 × (3 − 1.5)² + 2 × (4.5 − 3)² = 9
Total = 1 + 9 = 10
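The slide's arithmetic can be checked directly; a short sketch confirming that WSS + BSS stays constant regardless of K:

```python
# Verify the cohesion/separation example: for points 1, 2, 4, 5 the total
# WSS + BSS is the same (10) whether we use one cluster or two.
points = [1.0, 2.0, 4.0, 5.0]
m = sum(points) / len(points)                        # overall mean = 3

def wss_bss(clusters):
    wss = sum((p - sum(c) / len(c)) ** 2 for c in clusters for p in c)
    bss = sum(len(c) * (m - sum(c) / len(c)) ** 2 for c in clusters)
    return wss, bss

w1, b1 = wss_bss([points])                           # K = 1: WSS = 10, BSS = 0
w2, b2 = wss_bss([[1.0, 2.0], [4.0, 5.0]])           # K = 2: WSS = 1,  BSS = 9
```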
Internal Measures: Cohesion and Separation
A proximity-graph-based approach can also be used for cohesion and separation: cluster cohesion is the sum of the weights of all links within a cluster, and cluster separation is the sum of the weights of links between nodes in the cluster and nodes outside the cluster.
Internal Measures: Silhouette Coefficient
The silhouette coefficient combines the ideas of both cohesion and separation, but for individual points as well as for clusters and clusterings.
For an individual point i: let a be the average distance of i to the points in its own cluster, and b the minimum, over the other clusters, of the average distance of i to the points in that cluster. Then s = (b − a) / max(a, b), which typically lies between 0 and 1; the closer to 1 the better.
Can calculate the average silhouette width for a cluster or a clustering.
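A sketch of the per-point silhouette computation under the (b − a)/max(a, b) definition (illustrative 1-D data; `own` excludes the point itself):

```python
# Silhouette coefficient for one point: a = average distance to the other
# points in its own cluster, b = smallest average distance to any other
# cluster; s = (b - a) / max(a, b), in [-1, 1], closer to 1 is better.
def silhouette(p, own, others):
    a = sum(abs(p - q) for q in own) / len(own)
    b = min(sum(abs(p - q) for q in c) / len(c) for c in others)
    return (b - a) / max(a, b)

s = silhouette(1.0, own=[2.0], others=[[4.0, 5.0], [10.0, 11.0]])
# a = 1, b = 3.5  ->  s = 2.5 / 3.5
```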
External Measures of Cluster Validity: Entropy and Purity
Final Comment on Cluster Validity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
— Algorithms for Clustering Data, Jain and Dubes
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
10 Clusters Example
Starting with some pairs of clusters having three initial centroids while other have only one
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
yIteration 1
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 2
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 3
0 5 10 15 20
-6
-4
-2
0
2
4
6
8
x
y
Iteration 4
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids
o Postprocessing o Bisecting K-means
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies
Updating Centers Incrementally
In the basic K-means algorithm centroids are updated after all points are assigned to a centroid
An alternative is to update the centroids after each assignment (incremental approach)
Pre-processing and Post-processing
Pre-processing
Post-processing
Bisecting K-means
Bisecting K-means algorithm
Bisecting K-means Example
Limitations of K-means
K-means has problems when clusters are of differing
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
1 3 2 5 4 60
005
01
015
02
1
2
3
4
5
6
1
23 4
5
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
Solutions to Initial Centroids Problem
o Multiple runs
o Sample and use hierarchical clustering to determine initial centroids
o Select more than k initial centroids and then select among these initial centroids
o Postprocessing
o Bisecting K-means
Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies: choose the point that contributes most to SSE, or choose a point from the cluster with the highest SSE; if there are several empty clusters, this can be repeated several times
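One of these strategies can be sketched in a few lines. This is an illustrative helper, not code from the lecture: it picks, as the replacement centroid for an empty cluster, the point that currently contributes most to SSE.

```python
import numpy as np

def replacement_centroid(X, labels, centroids):
    """Pick a replacement centroid for an empty cluster: the point that
    currently contributes the most to SSE (farthest from its centroid)."""
    X = np.asarray(X, dtype=float)
    errors = np.linalg.norm(X - centroids[labels], axis=1)
    return X[np.argmax(errors)]
```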
Updating Centers Incrementally
In the basic K-means algorithm centroids are updated after all points are assigned to a centroid
An alternative is to update the centroids after each assignment (incremental approach)
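The incremental approach rests on the fact that a mean can be updated in O(1) when a point joins a cluster: the new mean is m + (x - m)/(n + 1). A minimal sketch (illustrative, not the lecture's code):

```python
def add_point(centroid, size, x):
    """O(1) incremental centroid update when point x joins a cluster of
    the given size: new mean = old mean + (x - old mean) / (size + 1)."""
    return centroid + (x - centroid) / (size + 1), size + 1

# Feeding points one at a time reproduces the batch mean.
c, n = 0.0, 0
for x in [1.0, 2.0, 3.0]:
    c, n = add_point(c, n, x)
```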
Pre-processing and Post-processing
Pre-processing: normalize the data; eliminate outliers
Post-processing: eliminate small clusters that may represent outliers; split "loose" clusters (clusters with relatively high SSE); merge clusters that are "close" and have relatively low SSE
Bisecting K-means
Bisecting K-means algorithm: a variant of K-means that can produce a partitional or a hierarchical clustering
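The idea can be sketched as follows: start with all points in one cluster, and repeatedly bisect the cluster with the largest SSE using 2-means. The tiny hand-rolled K-means here is for illustration only; in practice a library implementation would be used.

```python
import numpy as np

def kmeans(X, k, n_trials=5, n_iter=50, seed=0):
    """Tiny Lloyd's-iteration K-means; keeps the labels of the best of n_trials runs."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_trials):
        centroids = X[rng.choice(len(X), k, replace=False)]
        for _ in range(n_iter):
            labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
            centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        sse = ((X - centroids[labels]) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, labels)
    return best[1]

def bisecting_kmeans(X, k):
    """Start with one cluster; repeatedly bisect the cluster with the
    largest SSE using 2-means, until k clusters remain."""
    X = np.asarray(X, dtype=float)
    clusters = [np.arange(len(X))]          # each cluster is an array of point indices
    while len(clusters) < k:
        sses = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        idx = clusters.pop(int(np.argmax(sses)))
        labels = kmeans(X[idx], 2)
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters
```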
Bisecting K-means Example
Limitations of K-means
K-means has problems when clusters are of differing sizes, densities, or non-globular shapes
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters: find parts of clusters, but they then need to be put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram.
[Figure: six points (1-6) and the corresponding dendrogram, leaf order 1 3 2 5 4 6, with merge heights ranging from 0 to 0.2]
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters: any desired number of clusters can be obtained by cutting the dendrogram at the proper level
They may correspond to meaningful taxonomies (e.g., in the biological sciences)
Hierarchical Clustering
Two main types of hierarchical clustering: agglomerative (start with the points as individual clusters and, at each step, merge the closest pair of clusters) and divisive (start with one all-inclusive cluster and, at each step, split a cluster)
Traditional hierarchical algorithms use a similarity or distance matrix: merge or split one cluster at a time
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat: merge the two closest clusters and update the proximity matrix, until only a single cluster remains
Key operation is the computation of the proximity of two clusters
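The basic algorithm can be sketched directly. This naive version (illustrative, roughly O(n^3)) treats the linkage as a pluggable function over point-pair distances, so min gives single link (MIN) and max gives complete link (MAX):

```python
import numpy as np

def agglomerative(points, n_clusters, linkage=min):
    """Naive agglomerative clustering: start with singleton clusters and
    repeatedly merge the two closest, until n_clusters remain.
    linkage=min gives single link (MIN); linkage=max gives complete link (MAX)."""
    pts = np.asarray(points, dtype=float)
    if pts.ndim == 1:
        pts = pts[:, None]
    D = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)  # proximity matrix
    clusters = [[i] for i in range(len(pts))]
    while len(clusters) > n_clusters:
        pairs = [(linkage(D[i, j] for i in clusters[a] for j in clusters[b]), a, b)
                 for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)          # closest pair under the linkage rule
        clusters[a] += clusters.pop(b)
    return clusters
```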
Starting Situation
Start with clusters of individual points and a proximity matrix
[Figure: five points p1-p5, each its own cluster, with the p1-p5 proximity matrix]
Intermediate Situation
After some merging steps we have some clusters
[Figure: clusters C1-C5 and the corresponding C1-C5 proximity matrix]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
[Figure: clusters C1-C5 with C2 and C5 highlighted for merging, and the C1-C5 proximity matrix]
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
[Figure: clusters C1, C2 U C5, C3, C4 and the updated proximity matrix]
How to Define Inter-Cluster Similarity
[Figure: two clusters of points p1-p5 and their proximity matrix]
Options: MIN, MAX, Group Average, Distance Between Centroids, and other methods driven by an objective function (Ward's Method uses squared error)
Cluster Similarity: MIN or Single Link. Similarity of two clusters is based on the two most similar (closest) points in the different clusters.
MST Divisive Hierarchical Clustering: use an MST (minimum spanning tree) for constructing the hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"! Then why do we want to evaluate them?
Clusters found in Random Data
[Figure: 100 points distributed at random in the unit square ("Random Points"), and the clusters found in the same data by K-means, DBSCAN, and complete link]
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data)
4. Comparing the results of two different sets of cluster analyses to determine which is better
5. Determining the "correct" number of clusters
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index: measures the extent to which cluster labels match externally supplied class labels (e.g., entropy)
o Internal Index: measures the goodness of a clustering structure without respect to external information (e.g., SSE)
o Relative Index: compares two different clusterings or clusters
Sometimes these are referred to as criteria instead of indices.
Measures of Cluster Validity
Two matrices:
o the proximity matrix
o the "incidence" matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, 0 otherwise
Compute the correlation between the two matrices; since the matrices are symmetric, only n(n-1)/2 entries need to be considered.
High correlation indicates that points that belong to the same cluster are close to each other.
Not a good measure for some density- or contiguity-based clusters.
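A sketch of this computation, assuming Euclidean distance as the proximity (with distances rather than similarities, a good clustering yields a strongly negative correlation, matching the slide's negative Corr values):

```python
import numpy as np

def incidence_proximity_correlation(X, labels):
    """Correlation between the proximity (Euclidean distance) matrix and
    the incidence matrix (entry 1 if two points share a cluster, else 0),
    computed over the n(n-1)/2 upper-triangle entries."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    prox = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    incidence = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices(len(X), k=1)
    return float(np.corrcoef(prox[iu], incidence[iu])[0, 1])
```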
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
[Figure: the two data sets; Corr = -0.9235 (well-separated clusters) and Corr = -0.5810 (random points)]
Order the similarity matrix with respect to cluster labels and inspect visually
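A sketch of the reordering step (illustrative; here similarity is taken as 1 minus the distance scaled into [0, 1], an assumption rather than the lecture's definition):

```python
import numpy as np

def sorted_similarity_matrix(X, labels):
    """Reorder points by cluster label and return the reordered similarity
    matrix (similarity = 1 - distance / max distance), so that well-formed
    clusters appear as high-similarity blocks on the diagonal."""
    X = np.asarray(X, dtype=float)
    order = np.argsort(labels, kind="stable")
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    S = 1.0 - D / D.max()
    return S[np.ix_(order, order)]
```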
Using Similarity Matrix for Cluster Validation
[Figure: three well-separated clusters in the unit square and the corresponding similarity matrix, sorted by cluster label; high-similarity blocks appear along the diagonal]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: random points and the corresponding similarity matrix, sorted by DBSCAN cluster labels]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: random points and the corresponding similarity matrix, sorted by K-means cluster labels]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: random points and the corresponding similarity matrix, sorted by complete-link cluster labels]
Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN clusters (numbered 1-7) on a more complicated data set, with the corresponding sorted similarity matrix]
Clusters in more complicated figures aren't well separated.
Internal Index: used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
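Estimating the number of clusters from SSE can be sketched as follows (a tiny hand-rolled Lloyd's iteration, one run per K, for illustration only); plotting the curve, the "knee" where the decrease flattens suggests the natural number of clusters:

```python
import numpy as np

def sse_curve(X, ks, n_iter=50, seed=0):
    """SSE of a K-means clustering for each K in ks."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    curve = {}
    for k in ks:
        centroids = X[rng.choice(len(X), k, replace=False)]
        for _ in range(n_iter):
            labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
            centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        curve[k] = float(((X - centroids[labels]) ** 2).sum())
    return curve
```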
Internal Measures SSE
[Figure: a data set of well-separated clusters and its SSE curve as K varies from 2 to 30]
Internal Measures SSE
SSE curve for a more complicated data set
[Figure: the more complicated data set with clusters numbered 1-7, and the corresponding SSE curve]
SSE of clusters found using K-means
Need a framework to interpret any measure: for example, if our measure of evaluation has the value 10, is that good, fair, or poor?
Statistics provide a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data.
For comparing the results of two different sets of cluster analyses, a framework is less necessary; however, there is still the question of whether the difference between two index values is significant.
Framework for Cluster Validity
Example
Statistical Framework for SSE
[Figure: histogram of SSE values obtained by clustering random data (SSE roughly 0.016-0.034, counts up to 50), alongside the original clustered data set]
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
[Figure: the two data sets; Corr = -0.9235 (well-separated clusters) and Corr = -0.5810 (random points)]
Cluster Cohesion: measures how closely related the objects in a cluster are
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
Cohesion is measured by the within-cluster sum of squares (SSE):
WSS = Σ_i Σ_{x in C_i} (x - m_i)²
Separation is measured by the between-cluster sum of squares:
BSS = Σ_i |C_i| (m - m_i)²
where m is the overall mean and m_i is the centroid of cluster C_i.
Internal Measures Cohesion and Separation
Example SSE
[Figure: points 1, 2, 4, 5 on a line, with overall mean m = 3 and cluster centroids m1 = 1.5, m2 = 4.5]

K=1 cluster:
WSS = (1-3)² + (2-3)² + (4-3)² + (5-3)² = 10
BSS = 4 × (3-3)² = 0
Total = 10 + 0 = 10

K=2 clusters:
WSS = (1-1.5)² + (2-1.5)² + (4-4.5)² + (5-4.5)² = 1
BSS = 2 × (3-1.5)² + 2 × (4.5-3)² = 9
Total = 1 + 9 = 10
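The arithmetic above can be checked mechanically; note that WSS + BSS is constant (the total sum of squares) regardless of K. A small sketch using the slide's points 1, 2, 4, 5:

```python
def wss_bss(clusters):
    """Within-cluster (WSS) and between-cluster (BSS) sums of squares
    for 1-D data; clusters is a list of lists of points."""
    points = [x for c in clusters for x in c]
    m = sum(points) / len(points)            # overall mean
    wss = bss = 0.0
    for c in clusters:
        mi = sum(c) / len(c)                 # cluster centroid
        wss += sum((x - mi) ** 2 for x in c)
        bss += len(c) * (mi - m) ** 2
    return wss, bss

one = wss_bss([[1, 2, 4, 5]])        # K = 1
two = wss_bss([[1, 2], [4, 5]])      # K = 2
```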
A proximity graph-based approach can also be used for cohesion and separation: cluster cohesion is the sum of the weights of all links within a cluster; cluster separation is the sum of the weights between nodes in the cluster and nodes outside the cluster.
Internal Measures Cohesion and Separation
[Figure: proximity graph; edges within a cluster illustrate cohesion, edges between clusters illustrate separation]
Silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.
For an individual point i:
o calculate a = average distance of i to the points in its cluster
o calculate b = min (average distance of i to points in another cluster)
o the silhouette coefficient is s = 1 - a/b if a < b (or s = b/a - 1 if a >= b, not the usual case)
o typically between 0 and 1; the closer to 1 the better
Can calculate the Average Silhouette width for a cluster or a clustering
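A sketch of the per-point computation (a hand-rolled version for illustration; it uses the general form s = (b - a)/max(a, b), which reduces to 1 - a/b when a < b, and assumes every cluster has at least two points):

```python
import numpy as np

def silhouette(X, labels):
    """Per-point silhouette coefficients, s = (b - a) / max(a, b)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    n = len(X)
    s = np.zeros(n)
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, own].mean()                     # avg distance within own cluster
        b = min(D[i, labels == c].mean()         # closest other cluster
                for c in set(labels.tolist()) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s
```

Averaging the returned values over a cluster or over all points gives the average silhouette width mentioned above.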
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
(Algorithms for Clustering Data, Jain and Dubes)
Final Comment on Cluster Validity
Updating Centers Incrementally
In the basic K-means algorithm centroids are updated after all points are assigned to a centroid
An alternative is to update the centroids after each assignment (incremental approach)
Pre-processing and Post-processing
Pre-processing
Post-processing
Bisecting K-means
Bisecting K-means algorithm
Bisecting K-means Example
Limitations of K-means
K-means has problems when clusters are of differing
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
1 3 2 5 4 60
005
01
015
02
1
2
3
4
5
6
1
23 4
5
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: two scatter plots on the unit square: a data set with three well-separated clusters (Corr = −0.9235) and a random data set (Corr = −0.5810).]
Order the similarity matrix with respect to cluster labels and inspect visually.
Using Similarity Matrix for Cluster Validation

[Figure: scatter plot of three well-separated clusters on the unit square and the corresponding similarity matrix, sorted by cluster label; three bright blocks appear along the diagonal.]
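The reordering itself is simple: permute rows and columns so points with the same cluster label are adjacent. A small sketch (function name and toy matrix are illustrative):

```python
def sort_by_labels(sim, labels):
    """Reorder a similarity matrix so same-cluster points are adjacent."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    return [[sim[i][j] for j in order] for i in order]

sim = [[1.0, 0.2, 0.9],
       [0.2, 1.0, 0.1],
       [0.9, 0.1, 1.0]]
# Points 0 and 2 share cluster 0, so they are moved next to each other,
# pulling their high similarity (0.9) into a diagonal block.
print(sort_by_labels(sim, [0, 1, 0]))
```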
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.

[Figure: three pairs of plots showing random points clustered by DBSCAN, K-means, and Complete Link, each with its label-sorted similarity matrix; the diagonal blocks are much fainter than for well-separated data.]
Using Similarity Matrix for Cluster Validation

[Figure: DBSCAN clustering of a larger, more complicated data set (seven clusters, about 3,000 points) and its label-sorted similarity matrix.]

Clusters in more complicated figures aren't well separated.

Internal Index: used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures: SSE

[Figure: a data set with ten well-separated clusters, and the SSE-versus-K curve for K-means on it; the curve drops steeply and then flattens, with a knee near K = 10.]
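A rough sketch of using SSE to pick K, with a naive one-dimensional Lloyd's algorithm (pure Python; the data, seed, and single random initialization are illustrative — real code would use a library such as scikit-learn with multiple restarts):

```python
import random

def kmeans_sse(points, k, iters=100, seed=0):
    """SSE of one K-means (Lloyd's) run on 1-D data; naive single-init sketch."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # init: k distinct data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                   # assignment: nearest centroid
            clusters[min(range(k), key=lambda i: (p - centroids[i]) ** 2)].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]   # update step
                     for i, c in enumerate(clusters)]
    return sum(min((p - m) ** 2 for m in centroids) for p in points)

data = [1, 2, 3, 11, 12, 13, 21, 22, 23]   # three obvious 1-D clusters
for k in range(1, 6):
    print(k, kmeans_sse(data, k))          # SSE typically flattens after k = 3
```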
Internal Measures: SSE

[Figure: SSE curve for a more complicated data set, with the SSE of clusters (labeled 1–7) found using K-means.]
Need a framework to interpret any measure: for example, if our evaluation measure has the value 10, is that good, fair, or poor?
Statistics provide a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data.
For comparing the results of two different sets of cluster analyses, a framework is less necessary.
Framework for Cluster Validity
Statistical Framework for SSE
Example: compare the SSE of a proposed clustering against the distribution of SSEs obtained by clustering random data.

[Figure: histogram of the SSE of three K-means clusters over many random data sets (SSE values roughly 0.016 to 0.034), shown next to one such random data set on the unit square; an observed SSE well below this range is unlikely to come from random data.]
Statistical Framework for Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: the three-cluster data set (Corr = −0.9235) and the random data set (Corr = −0.5810) on the unit square.]
Internal Measures: Cohesion and Separation
Cluster Cohesion: measures how closely related the objects in a cluster are.
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.
Example: Squared Error.
Cohesion is measured by the within-cluster sum of squares (SSE):
    WSS = Σ_i Σ_{x ∈ C_i} (x − m_i)²
Separation is measured by the between-cluster sum of squares:
    BSS = Σ_i |C_i| (m − m_i)²
where C_i is cluster i, m_i is its centroid, |C_i| is its size, and m is the overall mean.
Internal Measures: Cohesion and Separation
Example: SSE for the points 1, 2, 4, 5 on a number line, with overall mean m = 3 and, for K = 2, centroids m1 = 1.5 and m2 = 4.5.

K = 1 cluster:
    WSS = (1 − 3)² + (2 − 3)² + (4 − 3)² + (5 − 3)² = 10
    BSS = 4 × (3 − 3)² = 0
    Total = 10 + 0 = 10

K = 2 clusters ({1, 2} and {4, 5}):
    WSS = (1 − 1.5)² + (2 − 1.5)² + (4 − 4.5)² + (5 − 4.5)² = 1
    BSS = 2 × (3 − 1.5)² + 2 × (4.5 − 3)² = 9
    Total = 1 + 9 = 10
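The identity WSS + BSS = constant (the total SSE about the overall mean) can be checked with a short pure-Python sketch of the formulas above (the function name is illustrative):

```python
def wss_bss(points, labels):
    """Within-cluster (WSS) and between-cluster (BSS) sums of squares, 1-D."""
    m = sum(points) / len(points)          # overall mean
    wss = bss = 0.0
    for c in set(labels):
        cluster = [p for p, l in zip(points, labels) if l == c]
        mi = sum(cluster) / len(cluster)   # centroid of cluster c
        wss += sum((p - mi) ** 2 for p in cluster)
        bss += len(cluster) * (m - mi) ** 2
    return wss, bss

pts = [1, 2, 4, 5]
print(wss_bss(pts, [0, 0, 0, 0]))  # K=1: (10.0, 0.0)
print(wss_bss(pts, [0, 0, 1, 1]))  # K=2: (1.0, 9.0)
```

In both cases WSS + BSS = 10, matching the worked example.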
A proximity-graph-based approach can also be used for cohesion and separation:
o Cluster cohesion is the sum of the weights of all links within a cluster.
o Cluster separation is the sum of the weights between nodes in the cluster and nodes outside the cluster.
Internal Measures: Cohesion and Separation
Internal Measures: Silhouette Coefficient
The silhouette coefficient combines ideas of both cohesion and separation, but for individual points as well as clusters and clusterings.
For an individual point i:
o Calculate a = average distance of i to the points in its own cluster.
o Calculate b = the minimum, over the other clusters, of the average distance of i to the points in that cluster.
o The silhouette coefficient is s = 1 − a/b if a < b (or s = b/a − 1 if a ≥ b, not the usual case); it typically lies between 0 and 1, and the closer to 1 the better.
Can calculate the average silhouette width for a cluster or a clustering.
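A minimal per-point sketch (pure Python, 1-D Euclidean distance; assumes the point's cluster has at least two members, and the toy data is illustrative):

```python
def silhouette(i, points, labels):
    """Silhouette of point i: s = (b - a) / max(a, b), which equals
    1 - a/b when a < b.  Assumes point i's cluster has >= 2 members."""
    def avg_dist(idxs):
        return sum(abs(points[i] - points[j]) for j in idxs) / len(idxs)

    own = [j for j, l in enumerate(labels) if l == labels[i] and j != i]
    a = avg_dist(own)   # cohesion: mean distance within own cluster
    b = min(avg_dist([j for j, l in enumerate(labels) if l == c])
            for c in set(labels) if c != labels[i])   # nearest other cluster
    return (b - a) / max(a, b)

pts, labs = [1, 2, 4, 5], [0, 0, 1, 1]
print(silhouette(0, pts, labs))  # a = 1, b = 3.5, so s = 2.5/3.5
```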
External Measures of Cluster Validity Entropy and Purity
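The two external measures named on this slide can be sketched as follows (pure Python; the toy labels are illustrative): purity sums, over clusters, the fraction of all points taken up by each cluster's majority class, and entropy is the size-weighted class entropy of each cluster.

```python
from collections import Counter
from math import log2

def purity_entropy(clusters, classes):
    """Overall purity and entropy of a clustering against known class labels.
    clusters/classes: parallel lists of cluster ids and true class ids."""
    n = len(classes)
    purity = entropy = 0.0
    for c in set(clusters):
        members = [cls for cl, cls in zip(clusters, classes) if cl == c]
        counts = Counter(members)
        purity += counts.most_common(1)[0][1] / n   # majority-class share
        h = -sum((k / len(members)) * log2(k / len(members))
                 for k in counts.values())          # class entropy in cluster
        entropy += (len(members) / n) * h           # weight by cluster size
    return purity, entropy

clusters = [0, 0, 0, 1, 1, 1]
classes  = ["a", "a", "b", "b", "b", "b"]
print(purity_entropy(clusters, classes))
```

Higher purity (closer to 1) and lower entropy (closer to 0) indicate clusters that align better with the external class labels.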
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
– Algorithms for Clustering Data, Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function …
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering – Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids …
Importance of Choosing Initial Centroids …
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Ward's Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Similarity matrix for five items I1–I5:

      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
Bisecting K-means
Bisecting K-means algorithm
Bisecting K-means Example
Limitations of K-means
K-means has problems when clusters are of differing
K-means has problems when the data contains outliers
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
1 3 2 5 4 60
005
01
015
02
1
2
3
4
5
6
1
23 4
5
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Similarity matrix for items I1–I5 (used in the hierarchical clustering examples):

      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00
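As a quick check on this matrix: the most similar pair, and hence the first merge any agglomerative scheme would make here, can be found mechanically (a small sketch, not part of the lecture code):

```python
import numpy as np

names = ["I1", "I2", "I3", "I4", "I5"]
S = np.array([
    [1.00, 0.90, 0.10, 0.65, 0.20],
    [0.90, 1.00, 0.70, 0.60, 0.50],
    [0.10, 0.70, 1.00, 0.40, 0.30],
    [0.65, 0.60, 0.40, 1.00, 0.80],
    [0.20, 0.50, 0.30, 0.80, 1.00],
])

# Ignore the diagonal, then take the most similar pair: the first merge
# that single link (or any agglomerative scheme) makes on this matrix.
off = S - np.eye(5)
i, j = np.unravel_index(off.argmax(), off.shape)
print(names[i], names[j], S[i, j])   # I1 I2 0.9
```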
Bisecting K-means Example
Limitations of K-means
K-means has problems when clusters are of differing sizes, densities, or non-globular shapes.
K-means also has problems when the data contains outliers.
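For reference, the K-means procedure these limitations refer to can be sketched in a few lines. This is a minimal NumPy version with random initial centroids (the function name and defaults are mine), not a reference implementation:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: assign each point to the nearest centroid,
    then recompute each centroid as the mean of its points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # Assignment step (Euclidean distance to every centroid).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step; keep an empty cluster's centroid where it was.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):   # converged
            break
        centroids = new
    return labels, centroids

# Two tight, well-separated blobs: the easy case for K-means.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
labels, centroids = kmeans(X, 2)
```

On data like the differing-size or non-globular examples, the same loop converges to the poor partitions shown on these slides.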
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters: find parts of clusters, but they need to be put together.
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree; can be visualized as a dendrogram.
[Figure: dendrogram over points 1–6, with merge heights between 0 and 0.2, next to the clustered points]
Strengths of Hierarchical Clustering
o Do not have to assume any particular number of clusters; any desired number can be obtained by cutting the dendrogram at the proper level
o They may correspond to meaningful taxonomies (e.g., in the biological sciences)
Hierarchical Clustering
Two main types of hierarchical clustering: agglomerative (start with the points as individual clusters and repeatedly merge the closest pair) and divisive (start with one all-inclusive cluster and repeatedly split).
Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward:
1 Compute the proximity matrix
2 Let each data point be a cluster
3 Repeat: merge the two closest clusters and update the proximity matrix
4 Until only a single cluster remains
Key operation is the computation of the proximity of two clusters
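The basic algorithm can be sketched end to end. This is a minimal single-link version over a plain distance matrix; the function name and the `linkage` parameter are my own, and no attempt is made at the bookkeeping a real implementation would use:

```python
def agglomerate(D, linkage=min):
    """Repeatedly merge the two closest clusters; single link by default.
    D is a symmetric distance matrix (list of lists); returns merge history."""
    clusters = [[i] for i in range(len(D))]   # each point starts as a cluster
    merges = []
    while len(clusters) > 1:                  # repeat until one cluster remains
        # Key operation: cluster proximity = linkage over all point pairs.
        a, b = min(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda ab: linkage(D[i][j]
                                          for i in clusters[ab[0]]
                                          for j in clusters[ab[1]]))
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]   # merge, then drop b:
        del clusters[b]                           # this "updates the matrix"
    return merges

D = [[0, 1, 9, 8],
     [1, 0, 8, 9],
     [9, 8, 0, 2],
     [8, 9, 2, 0]]
history = agglomerate(D)
print(history[0])   # ([0], [1]) -- the closest pair merges first
```

Passing `max` or a mean-based function as `linkage` turns the same loop into complete link or group average.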
Starting Situation
Start with clusters of individual points and a proximity matrix
[Figure: points p1–p5 and their proximity matrix]
Intermediate Situation
After some merging steps we have some clusters
[Figure: clusters C1–C5 and their proximity matrix]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
[Figure: clusters C1–C5 and their proximity matrix, with C2 and C5 about to be merged]
After Merging
The question is "How do we update the proximity matrix?"
[Figure: clusters C1, C3, C4 and the merged cluster C2 U C5; the proximity matrix rows and columns for C2 and C5 are replaced by a single row and column for C2 U C5]
How to Define Inter-Cluster Similarity
[Figure: two candidate clusters over points p1–p5 and their proximity matrix; the slide is repeated to highlight each criterion in turn]
o MIN
o MAX
o Group Average
o Distance Between Centroids
o Other methods driven by an objective function – Ward's Method uses squared error
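The four geometric criteria can be written as small functions. The names (`single_link`, etc.) are mine; this is only a sketch over NumPy point arrays:

```python
import numpy as np

def pair_dists(A, B):
    """All pairwise Euclidean distances between point sets A and B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_link(A, B):   return pair_dists(A, B).min()    # MIN
def complete_link(A, B): return pair_dists(A, B).max()    # MAX
def group_average(A, B): return pair_dists(A, B).mean()   # mean over all pairs
def centroid_dist(A, B): return np.linalg.norm(A.mean(0) - B.mean(0))

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])
print(single_link(A, B), complete_link(A, B))   # 3.0 6.0
```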
Cluster Similarity: MIN or Single Link. Similarity of two clusters is based on the two most similar (closest) points in the different clusters.
MST Divisive Hierarchical Clustering: use a minimum spanning tree (MST) to construct the hierarchy of clusters.
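A divisive sketch under the MST idea (my own minimal version): build the minimum spanning tree with a plain textbook Prim's algorithm, then split by removing the largest edge; repeating on each piece yields the full hierarchy.

```python
import numpy as np

def mst_edges(D):
    """Prim's algorithm on distance matrix D; returns the MST edges (i, j, d)."""
    n = len(D)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest edge from the tree to any point not yet in it.
        i, j = min(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: D[e[0]][e[1]])
        edges.append((i, j, D[i][j]))
        in_tree.add(j)
    return edges

# Divisive step: removing the largest MST edge splits the data in two.
D = np.array([[0, 1, 9, 8],
              [1, 0, 8, 9],
              [9, 8, 0, 2],
              [8, 9, 2, 0]])
edges = mst_edges(D)
cut = max(edges, key=lambda e: e[2])
print(cut[2])   # 8, the single inter-cluster edge
```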
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder." Then why do we want to evaluate them?
Clusters found in Random Data
[Figure: 100 random points in the unit square (Random Points) and the clusters that K-means, DBSCAN, and complete link each find in them]
1 Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information; use only the data
4 Comparing the results of two different sets of cluster analyses to determine which is better
5 Determining the 'correct' number of clusters
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index: used to measure the extent to which cluster labels match externally supplied class labels (e.g., entropy)
o Internal Index: used to measure the goodness of a clustering structure without respect to external information (e.g., SSE)
o Relative Index: used to compare two different clusterings or clusters
Sometimes these are referred to as criteria instead of indices.
Measures of Cluster Validity
Two matrices:
o Proximity matrix
o "Incidence" matrix: one row and one column per point; an entry is 1 if the associated pair of points belongs to the same cluster, 0 otherwise
Compute the correlation between the two matrices. Since the matrices are symmetric, only n(n-1)/2 entries need to be compared.
High correlation indicates that points that belong to the same cluster are close to each other.
Not a good measure for some density- or contiguity-based clusters.
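A minimal sketch of this procedure (the function name is mine; distances serve as the proximity matrix, so a good clustering gives a strongly negative correlation, matching the negative Corr values on the next slide):

```python
import numpy as np

def cluster_correlation(X, labels):
    """Correlation between the proximity (distance) matrix and the
    incidence matrix (entry 1 iff the pair is in the same cluster)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # proximity
    I = (labels[:, None] == labels[None, :]).astype(float)      # incidence
    iu = np.triu_indices(len(X), k=1)        # only the n(n-1)/2 distinct pairs
    return np.corrcoef(D[iu], I[iu])[0, 1]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(3, 0.2, (30, 2))])
labels = np.array([0] * 30 + [1] * 30)
print(cluster_correlation(X, labels))   # strongly negative for tight clusters
```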
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
[Figure: two data sets in the unit square; Corr = -0.9235 for one and Corr = -0.5810 for the other]
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
[Figure: well-clustered points and their similarity matrix sorted by cluster label, showing sharp block-diagonal structure]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: random points clustered by DBSCAN and the similarity matrix sorted by cluster label; the blocks are much less crisp]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: random points clustered by K-means and the similarity matrix sorted by cluster label]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: random points clustered by complete link and the similarity matrix sorted by cluster label]
Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN finds clusters 1–7 in a more complicated data set; the similarity matrix sorted by cluster label shows the corresponding blocks]
Clusters in more complicated figures aren't well separated.
Internal Index: used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE).
It can also be used to estimate the number of clusters.
Internal Measures SSE
[Figure: a data set with ten well-separated clusters and its SSE-versus-K curve; the SSE drops sharply as K increases toward the true number of clusters]
Internal Measures SSE
SSE curve for a more complicated data set
[Figure: SSE of the clusters 1–7 found using K-means in the more complicated data set]
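SSE itself is straightforward to compute, which is how curves like these are produced. A small 1-D sketch (names are mine) showing the sharp drop from K = 1 to K = 2 that gives the curve its knee:

```python
import numpy as np

def sse(X, labels, centroids):
    """Within-cluster sum of squared errors for a given clustering."""
    return sum(((X[labels == j] - c) ** 2).sum() for j, c in enumerate(centroids))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (40, 1)), rng.normal(5, 0.2, (40, 1))])

# K = 1: everything in one cluster around the overall mean.
sse1 = sse(X, np.zeros(80, dtype=int), [X.mean(axis=0)])
# K = 2: the natural split; SSE drops sharply, giving the knee in the curve.
labels2 = np.array([0] * 40 + [1] * 40)
sse2 = sse(X, labels2, [X[:40].mean(axis=0), X[40:].mean(axis=0)])
print(sse1 > 10 * sse2)   # True
```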
Need a framework to interpret any measure: for example, if our measure of evaluation has the value 10, is that good, fair, or poor?
Statistics provide a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data.
For comparing the results of two different sets of cluster analyses, a framework is less necessary.
Framework for Cluster Validity
Example
Statistical Framework for SSE
[Figure: histogram of SSE values (roughly 0.016 to 0.034) for clusterings of random data, alongside the three-cluster example data set]
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
[Figure: the same two data sets; Corr = -0.9235 and Corr = -0.5810]
Cluster Cohesion: measures how closely related the objects in a cluster are.
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.
Example Squared Error
Internal Measures Cohesion and Separation
Cohesion is measured by the within-cluster sum of squares (SSE):
WSS = sum_i sum_{x in C_i} (x - m_i)^2
Separation is measured by the between-cluster sum of squares:
BSS = sum_i |C_i| (m - m_i)^2
where m_i is the centroid of cluster C_i and m is the overall mean.
Internal Measures Cohesion and Separation
Example SSE
Points 1, 2, 4, 5 on a line; overall mean m = 3; for K = 2, centroids m1 = 1.5 and m2 = 4.5.

K = 1 cluster:
WSS = (1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2 = 10
BSS = 4 x (3 - 3)^2 = 0
Total = 10 + 0 = 10

K = 2 clusters:
WSS = (1 - 1.5)^2 + (2 - 1.5)^2 + (4 - 4.5)^2 + (5 - 4.5)^2 = 1
BSS = 2 x (3 - 1.5)^2 + 2 x (4.5 - 3)^2 = 9
Total = 1 + 9 = 10
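The arithmetic of this example can be checked directly:

```python
points = [1, 2, 4, 5]
m = sum(points) / len(points)                    # overall mean = 3.0

# K = 1 cluster:
wss1 = sum((x - m) ** 2 for x in points)         # 10.0
bss1 = len(points) * (m - m) ** 2                # 0.0

# K = 2 clusters {1, 2} and {4, 5}, centroids 1.5 and 4.5:
wss2 = sum((x - 1.5) ** 2 for x in [1, 2]) + sum((x - 4.5) ** 2 for x in [4, 5])
bss2 = 2 * (m - 1.5) ** 2 + 2 * (4.5 - m) ** 2   # 9.0

print(wss1 + bss1, wss2 + bss2)   # 10.0 10.0 -- WSS + BSS is constant
```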
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
[Figure: a proximity graph; cohesion is the sum of edge weights within a cluster, separation the sum of weights between clusters]
Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points as well as clusters and clusterings.
For an individual point i: a = average distance of i to the other points in its own cluster; b = minimum, over the other clusters, of the average distance of i to the points in that cluster. The silhouette coefficient is s = (b - a) / max(a, b); it typically lies between 0 and 1, and the closer to 1, the better.
Can calculate the average silhouette width for a cluster or a clustering.
Internal Measures Silhouette Coefficient
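A direct sketch of the per-point computation (the function name is mine):

```python
import numpy as np

def silhouette(i, X, labels):
    """Silhouette coefficient of point i: s = (b - a) / max(a, b)."""
    own = labels[i]
    same = (labels == own) & (np.arange(len(X)) != i)
    a = np.linalg.norm(X[same] - X[i], axis=1).mean()            # cohesion
    b = min(np.linalg.norm(X[labels == c] - X[i], axis=1).mean() # separation
            for c in set(labels) - {own})
    return (b - a) / max(a, b)

X = np.array([[0.0], [0.2], [5.0], [5.2]])
labels = np.array([0, 0, 1, 1])
print(silhouette(0, X, labels))   # close to 1: tight cluster, far neighbor
```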
External Measures of Cluster Validity Entropy and Purity
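The slide only names these measures; under their standard definitions (purity: fraction of points in each cluster's majority class; entropy: size-weighted class-label entropy per cluster), a sketch:

```python
from collections import Counter
from math import log2

def purity(clusters):
    """clusters: one list of true class labels per cluster.
    Purity = fraction of points falling in their cluster's majority class."""
    n = sum(len(c) for c in clusters)
    return sum(max(Counter(c).values()) for c in clusters) / n

def entropy(clusters):
    """Size-weighted average of each cluster's class-label entropy (bits)."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        for count in Counter(c).values():
            p = count / len(c)
            total -= len(c) / n * p * log2(p)
    return total

clusters = [["a", "a", "a", "b"], ["b", "b", "c", "c"]]
print(purity(clusters))   # (3 + 2) / 8 = 0.625
```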
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
(Algorithms for Clustering Data, Jain and Dubes)
Final Comment on Cluster Validity
Limitations of K-means Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
1 3 2 5 4 60
005
01
015
02
1
2
3
4
5
6
1
23 4
5
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: the random data set as clustered by DBSCAN, K-means, and complete link, with the corresponding sorted similarity matrices — the block-diagonal structure is much weaker than for well-separated data.]
Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN clustering of a more complicated data set (clusters labeled 1–7) and its sorted similarity matrix.]
Clusters in more complicated figures aren't well separated.
Internal Index: used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE).
Can also be used to estimate the number of clusters.
Internal Measures: SSE
[Figure: SSE as a function of the number of clusters K (K = 2 to 30), together with the data set it was computed on.]
Internal Measures: SSE
SSE curve for a more complicated data set:
[Figure: the more complicated data set (clusters labeled 1–7) and the SSE of clusters found using K-means.]
Need a framework to interpret any measure: for example, if our measure of evaluation has the value 10, is that good, fair, or poor?
Statistics provide a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data.
For comparing the results of two different sets of cluster analyses, a framework is less necessary; however, there is still the question of whether the difference between two index values is significant.
Framework for Cluster Validity
Example: compare the SSE of the actual data against the distribution of SSEs obtained by clustering many comparable sets of random data; an unusually low SSE suggests the structure is real.
Statistical Framework for SSE
[Figure: histogram of SSE values (roughly 0.016–0.034) over clusterings of random data, alongside the well-clustered data set being evaluated.]
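As a rough sketch of this idea, with a minimal Lloyd's-algorithm K-means (all data, names, and the deterministic farthest-first initialization are my own illustrative choices, not the slides'):

```python
import math
import random

def farthest_first(points, k):
    """Deterministic 'farthest point' initialization for the centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points,
            key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids

def kmeans_sse(points, k, iters=20):
    """Lloyd's algorithm; returns the SSE of the resulting clustering."""
    centroids = farthest_first(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[j].append(p)
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centroids[j]
            for j, cl in enumerate(clusters)]
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)

# SSE of K-means (K=3) on 20 sets of 60 random points in the unit square.
random_sses = []
for s in range(20):
    rng = random.Random(s)
    random_sses.append(kmeans_sse([(rng.random(), rng.random())
                                   for _ in range(60)], 3))

# The "actual" data: three tight, well-separated blobs of 20 points each.
rng = random.Random(42)
blobs = [(cx + rng.gauss(0, 0.02), cy + rng.gauss(0, 0.02))
         for cx, cy in [(0.2, 0.2), (0.8, 0.2), (0.5, 0.8)]
         for _ in range(20)]
observed = kmeans_sse(blobs, 3)

# The observed SSE falls far below every SSE seen on random data, so the
# cluster structure is unlikely to be an artifact of the algorithm.
print(observed, min(random_sses))
```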
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
Statistical Framework for Correlation
[Figure: left — a data set with clear cluster structure, Corr = −0.9235; right — random points, Corr = −0.5810.]
Cluster Cohesion: measures how closely related the objects in a cluster are.
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.
Example: squared error.
Internal Measures: Cohesion and Separation
Cohesion is measured by the within-cluster sum of squares: WSS = Σ_i Σ_{x in C_i} (x − m_i)²
Separation is measured by the between-cluster sum of squares: BSS = Σ_i |C_i| (m − m_i)²
where m is the overall mean, and m_i and |C_i| are the centroid and size of cluster C_i.
Internal Measures: Cohesion and Separation
Example: SSE for the one-dimensional points {1, 2, 4, 5} (overall mean m = 3).

K = 1 cluster:
WSS = (1−3)² + (2−3)² + (4−3)² + (5−3)² = 10
BSS = 4 × (3−3)² = 0
Total = WSS + BSS = 10

K = 2 clusters, {1, 2} and {4, 5} (centroids m1 = 1.5, m2 = 4.5):
WSS = (1−1.5)² + (2−1.5)² + (4−4.5)² + (5−4.5)² = 1
BSS = 2 × (3−1.5)² + 2 × (4.5−3)² = 9
Total = WSS + BSS = 10
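The arithmetic in the worked example can be checked with a small helper (a sketch; the function name is mine):

```python
def wss_bss(points, clusters):
    """WSS and BSS for a 1-D data set partitioned into `clusters`."""
    m = sum(points) / len(points)                       # overall mean
    centroid = lambda c: sum(c) / len(c)
    wss = sum((x - centroid(c)) ** 2 for c in clusters for x in c)
    bss = sum(len(c) * (m - centroid(c)) ** 2 for c in clusters)
    return wss, bss

pts = [1, 2, 4, 5]
print(wss_bss(pts, [pts]))             # K=1: (10.0, 0.0)
print(wss_bss(pts, [[1, 2], [4, 5]]))  # K=2: (1.0, 9.0)
# WSS + BSS = 10 in both cases: the total is constant for a fixed data set.
```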
A proximity-graph-based approach can also be used for cohesion and separation: cohesion is the sum of the weights of all links within a cluster, and separation is the sum of the weights of the links from nodes in the cluster to nodes outside it.
Internal Measures: Cohesion and Separation
Silhouette coefficient: combines the ideas of both cohesion and separation, for individual points as well as for clusters and clusterings.
For an individual point i:
- a = average distance of i to the points in its own cluster
- b = minimum, over the other clusters, of the average distance of i to the points in that cluster
- s = (b − a) / max(a, b), typically between 0 and 1; the closer to 1, the better.
Can calculate the average silhouette width for a cluster or a clustering.
Internal Measures: Silhouette Coefficient
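A direct transcription of the definition (the function name and toy data are mine):

```python
import math

def silhouette(i, points, labels):
    """Silhouette coefficient s = (b - a) / max(a, b) of point i."""
    def avg_dist(p, group):
        return sum(math.dist(p, q) for q in group) / len(group)
    own = [p for j, (p, l) in enumerate(zip(points, labels))
           if l == labels[i] and j != i]
    a = avg_dist(points[i], own)                  # cohesion term
    b = min(avg_dist(points[i],                   # separation term
                     [p for p, l in zip(points, labels) if l == other])
            for other in set(labels) if other != labels[i])
    return (b - a) / max(a, b)

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
labels = [0, 0, 0, 1, 1]
print(silhouette(0, pts, labels))  # close to 1: tight, well-separated point
```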
External Measures of Cluster Validity: Entropy and Purity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
— Algorithms for Clustering Data, Jain and Dubes
Final Comment on Cluster Validity
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function …
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering – Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids …
Importance of Choosing Initial Centroids …
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Ward's Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Example similarity matrix (ordered by cluster labels), as used in the similarity-matrix validation slides:

      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00
Limitations of K-means Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters: find parts of clusters, but they then need to be put together.
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram: a tree-like diagram that records the sequence of merges or splits.
[Figure: a dendrogram over six points and the corresponding nested-cluster diagram.]
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters: any desired number of clusters can be obtained by cutting the dendrogram at the proper level.
They may correspond to meaningful taxonomies, e.g., in the biological sciences (animal kingdom, phylogeny reconstruction, ...).
Hierarchical Clustering
Two main types of hierarchical clustering:
- Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters, until only one cluster (or k clusters) is left.
- Divisive: start with one, all-inclusive cluster; at each step, split a cluster, until each cluster contains an individual point (or there are k clusters).
Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.
Agglomerative Clustering Algorithm: the more popular hierarchical clustering technique.
Basic algorithm is straightforward:
1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. Repeat: merge the two closest clusters and update the proximity matrix, until only a single cluster remains.
The key operation is the computation of the proximity of two clusters; different approaches to defining the distance between clusters distinguish the different algorithms.
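The loop above can be sketched as follows, here with single-link (MIN) as the cluster proximity (toy data and names are illustrative):

```python
import math

def agglomerative(points, k, link=min):
    """Merge the closest pair of clusters until k clusters remain.
    `link` aggregates pairwise point distances:
    min = single link, max = complete link."""
    clusters = [[p] for p in points]      # each point starts as a cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = link(math.dist(p, q)
                         for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)    # merge the two closest clusters
    return clusters

pts = [(0, 0), (0.1, 0), (0.2, 0.1), (5, 5), (5.1, 5), (9, 0)]
for c in agglomerative(pts, 3):
    print(sorted(c))   # a trio, a pair, and a singleton outlier
```

Recomputing cluster proximities naively like this is O(n³) overall; real implementations keep and update a proximity matrix instead.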
Starting Situation
Start with clusters of individual points and a proximity matrix
[Figure: five points p1–p5, each its own cluster, with the 5×5 proximity matrix.]
Intermediate Situation
After some merging steps we have some clusters
[Figure: clusters C1–C5 and the current proximity matrix.]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
[Figure: clusters C1–C5; C2 and C5 are the two closest and will be merged.]
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
[Figure: clusters C1, C3, C4, and the merged cluster C2 ∪ C5, with the proximity matrix to be updated.]
How to Define Inter-Cluster Similarity
[Figure: two clusters of points p1–p5 and their proximity matrix.]
Candidate definitions:
- MIN
- MAX
- Group Average
- Distance Between Centroids
- Other methods driven by an objective function (Ward's method uses squared error)
Cluster Similarity: MIN or Single Link — the similarity of two clusters is based on the two most similar (closest) points in the different clusters; it is determined by one pair of points, i.e., by one link in the proximity graph.
MST Divisive Hierarchical Clustering: use a minimum spanning tree (MST) to construct the hierarchy of clusters.
Cluster Validity
For supervised classification, we have a variety of measures to evaluate how good our model is: accuracy, precision, recall.
For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"! Then why do we want to evaluate them? To avoid finding patterns in noise, to compare clustering algorithms, and to compare two sets of clusters or two clusters.
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
Limitations of K-means Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters Find parts of clusters but need to put together
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
1 3 2 5 4 60
005
01
015
02
1
2
3
4
5
6
1
23 4
5
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
Overcoming K-means Limitations
[Figure: original points vs. the K-means clusters found]
One solution is to use many clusters: K-means then finds pieces of the natural clusters, which need to be put back together afterwards.
Overcoming K-means Limitations
[Figure: original points vs. the K-means clusters found]
Overcoming K-means Limitations
[Figure: original points vs. the K-means clusters found]
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.
[Figure: six points (1-6) and the corresponding dendrogram, with merge heights between about 0.05 and 0.2]
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters: any desired number of clusters can be obtained by cutting the dendrogram at the proper level.
The clusters may correspond to meaningful taxonomies (e.g., biological taxonomies).
Hierarchical Clustering
Two main types of hierarchical clustering: agglomerative (start with the points as individual clusters and, at each step, merge the closest pair of clusters) and divisive (start with one all-inclusive cluster and, at each step, split a cluster).
Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.
Agglomerative Clustering Algorithm
The more popular hierarchical clustering technique. The basic algorithm is straightforward:
1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. Repeat: merge the two closest clusters and update the proximity matrix, until only a single cluster remains.
The key operation is the computation of the proximity of two clusters; different approaches to defining inter-cluster proximity distinguish the different algorithms.
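The merge loop above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation: it assumes Euclidean distance and single-link (MIN) proximity, and recomputes linkages naively instead of maintaining a proximity matrix.

```python
def euclidean(p, q):
    # Euclidean distance between two points given as tuples.
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_link(c1, c2):
    # MIN linkage: distance between the two closest points across clusters.
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerative(points, k, linkage=single_link):
    # Start with every point as its own cluster, merge until k remain.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]   # merge j into i
        del clusters[j]
    return clusters

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(agglomerative(pts, 2))
# [[(0.0, 0.0), (0.1, 0.0)], [(5.0, 5.0), (5.1, 5.0)]]
```

Stopping at k clusters rather than a single cluster is a convenience; recording every merge instead would give the full dendrogram.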
Starting Situation
Start with clusters of individual points and a proximity matrix.
[Figure: points p1-p5 and their p1-p5 proximity matrix]
Intermediate Situation
After some merging steps, we have some clusters.
[Figure: clusters C1-C5 and the C1-C5 proximity matrix]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1-C5 and the C1-C5 proximity matrix]
After Merging
The question is: "How do we update the proximity matrix?"
[Figure: clusters C1, C3, C4, and C2 U C5, with the updated proximity matrix]
How to Define Inter-Cluster Similarity
[Figure: points p1-p5 and their proximity matrix, with the similarity between two example clusters highlighted]
Options: MIN, MAX, Group Average, Distance Between Centroids, or other methods driven by an objective function (Ward's Method uses squared error).
Cluster Similarity MIN or Single Link
Similarity of two clusters is based on the two most similar (closest) points in the different clusters; i.e., it is determined by one pair of points, one from each cluster.
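The pointwise linkage definitions (MIN, MAX, group average) differ only in how they aggregate the pairwise distances between two clusters. A small illustrative sketch (function names are my own, not from the slides):

```python
def dist(p, q):
    # Euclidean distance between two points given as tuples.
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def link_min(c1, c2):
    # Single link (MIN): distance of the closest cross-cluster pair.
    return min(dist(p, q) for p in c1 for q in c2)

def link_max(c1, c2):
    # Complete link (MAX): distance of the farthest cross-cluster pair.
    return max(dist(p, q) for p in c1 for q in c2)

def link_avg(c1, c2):
    # Group average: mean distance over all cross-cluster pairs.
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

a = [(0.0,), (1.0,)]
b = [(3.0,), (7.0,)]
print(link_min(a, b), link_max(a, b), link_avg(a, b))  # 2.0 7.0 4.5
```

On the same pair of clusters the three linkages can disagree substantially, which is what produces the different merge orders in the later comparison slides.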
MST Divisive Hierarchical Clustering
Use an MST (minimum spanning tree) for constructing the hierarchy of clusters: build the MST over the points, then repeatedly split a cluster by breaking the edge with the largest distance (smallest similarity).
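As a rough sketch of the MST approach (my own illustration, assuming Euclidean distance): build the tree with Prim's algorithm, then obtain k clusters by discarding the k-1 largest edges and taking the connected components that remain.

```python
import math

def mst_edges(points):
    # Prim's algorithm: grow the tree one cheapest outgoing edge at a time.
    in_tree = {0}
    edges = []
    while len(in_tree) < len(points):
        d, i, j = min(
            (math.dist(points[i], points[j]), i, j)
            for i in in_tree for j in range(len(points)) if j not in in_tree
        )
        edges.append((d, i, j))
        in_tree.add(j)
    return edges

def mst_clusters(points, k):
    # Keep all but the k-1 longest MST edges, then group by connectivity.
    keep = sorted(mst_edges(points))[: len(points) - k]
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for _, i, j in keep:
        parent[find(i)] = find(j)
    groups = {}
    for idx in range(len(points)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())

pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
print(mst_clusters(pts, 2))  # [[0, 1], [2, 3]]
```

Cutting edges from longest to shortest reproduces the divisive hierarchy top-down; single-link agglomerative clustering produces the same hierarchy bottom-up.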
Cluster Validity
For supervised classification, we have a variety of measures to evaluate how good our model is, e.g., accuracy, precision, and recall.
For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder." Then why do we want to evaluate them? To avoid finding patterns in noise, to compare clustering algorithms, and to compare two sets of clusters or two individual clusters.
Clusters found in Random Data
[Figure: four panels over the same set of random points in the unit square: the raw random points, and the clusters found by K-means, DBSCAN, and complete-link hierarchical clustering. Each algorithm imposes apparent structure even on purely random data.]
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information; uses only the data.
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Measures of Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index: measures the extent to which cluster labels match externally supplied class labels.
o Internal Index: measures the goodness of a clustering structure without respect to external information.
o Relative Index: compares two different clusterings or clusters.
Sometimes these are referred to as criteria instead of indices.
Measuring Cluster Validity Via Correlation
Two matrices: the proximity matrix, and an "incidence" matrix with one row and one column per data point, where an entry is 1 if the associated pair of points belongs to the same cluster and 0 otherwise.
Compute the correlation between the two matrices; since both are symmetric, only the entries above the diagonal need to be compared.
High correlation (in magnitude) indicates that points that belong to the same cluster are close to each other.
Not a good measure for some density- or contiguity-based clusters.
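The correlation computation can be illustrated as follows. This is a hypothetical sketch, not code from the course: it flattens the upper triangles of the incidence and distance matrices and computes the Pearson correlation between them. Because it uses distance rather than similarity, a good clustering yields a strongly negative value, as in the Corr = -0.9235 example.

```python
import math

def pearson(xs, ys):
    # Plain Pearson correlation between two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def incidence_proximity_corr(points, labels):
    # Upper-triangle entries only: incidence (same cluster -> 1) vs. distance.
    inc, prox = [], []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            inc.append(1.0 if labels[i] == labels[j] else 0.0)
            prox.append(math.dist(points[i], points[j]))
    return pearson(inc, prox)

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(incidence_proximity_corr(pts, [0, 0, 1, 1]))  # close to -1
```

Shuffling the labels on the same points drives the correlation toward zero, which is the basis of the statistical framework discussed later.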
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
[Figure: left, three well-separated clusters, Corr = -0.9235; right, random points, Corr = -0.5810]
Using Similarity Matrix for Cluster Validation
Order the similarity matrix with respect to cluster labels and inspect visually.
[Figure: three well-separated clusters and the corresponding sorted 100x100 similarity matrix, which shows a crisp block-diagonal structure]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: random points clustered by DBSCAN and the corresponding sorted 100x100 similarity matrix; the block structure is much weaker]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: random points clustered by K-means and the corresponding sorted similarity matrix]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: random points clustered by complete link and the corresponding sorted 100x100 similarity matrix]
Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN finds seven clusters in a more complicated point set; the corresponding sorted 3000x3000 similarity matrix still shows block structure]
Clusters in more complicated figures aren't well separated.

Internal Measures SSE
Internal Index: used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE).
Can also be used to estimate the number of clusters.
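As a sketch of how SSE supports such comparisons (illustrative names; the one-dimensional data is chosen to match the worked cohesion/separation example later in the slides):

```python
def centroid(cluster):
    # Component-wise mean of the points in a cluster.
    n = len(cluster)
    return tuple(sum(c) / n for c in zip(*cluster))

def sse(clusters):
    # Sum of squared distances of every point to its cluster centroid.
    total = 0.0
    for c in clusters:
        m = centroid(c)
        total += sum(sum((x - mi) ** 2 for x, mi in zip(p, m)) for p in c)
    return total

data = [(1.0,), (2.0,), (4.0,), (5.0,)]
print(sse([data]))                                # K=1: 10.0
print(sse([[(1.0,), (2.0,)], [(4.0,), (5.0,)]]))  # K=2: 1.0
```

Evaluating SSE for a range of K values and looking for the knee in the resulting curve is the usual way to estimate the number of clusters.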
Internal Measures SSE
[Figure: a point set and the corresponding SSE-vs-K curve for K = 2 to 30; the knee in the curve suggests the natural number of clusters]
Internal Measures SSE
SSE curve for a more complicated data set: SSE of clusters found using K-means.
[Figure: the seven-cluster point set from the DBSCAN example and its SSE curve]
Framework for Cluster Validity
Need a framework to interpret any measure: for example, if our measure of evaluation has the value 10, is that good, fair, or poor?
Statistics provide a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data.
For comparing the results of two different sets of cluster analyses, a framework is less necessary; however, there is the question of whether the difference between two index values is significant.
Statistical Framework for SSE
Example: compare a clustering's SSE to the distribution of SSE values obtained by clustering many sets of random data points.
[Figure: histogram of SSE values (roughly 0.016 to 0.034) from repeated clusterings of random data, alongside the three-cluster data set]
Statistical Framework for Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
[Figure: left, three well-separated clusters, Corr = -0.9235; right, random points, Corr = -0.5810]
Internal Measures Cohesion and Separation
Cluster Cohesion: measures how closely related the objects in a cluster are. Example: the within-cluster sum of squares (SSE),
WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters. Example: the between-cluster sum of squares,
BSS = \sum_i |C_i| (m - m_i)^2
where m_i is the centroid of cluster C_i and m is the overall mean.
Internal Measures Cohesion and Separation
Example: SSE for the one-dimensional data set {1, 2, 4, 5}, with overall mean m = 3.

K = 1 cluster:
WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
BSS = 4 * (3-3)^2 = 0
Total = 10 + 0 = 10

K = 2 clusters, {1, 2} with centroid m1 = 1.5 and {4, 5} with centroid m2 = 4.5:
WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
BSS = 2 * (3-1.5)^2 + 2 * (4.5-3)^2 = 9
Total = 1 + 9 = 10

Note that WSS + BSS is constant: the total sum of squares does not depend on the clustering.
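The worked example above can be checked mechanically. A small sketch with illustrative names, specialized to one-dimensional data:

```python
def wss(clusters, means):
    # Within-cluster sum of squares: cohesion.
    return sum(sum((x - m) ** 2 for x in c) for c, m in zip(clusters, means))

def bss(clusters, means, overall):
    # Between-cluster sum of squares: separation, weighted by cluster size.
    return sum(len(c) * (m - overall) ** 2 for c, m in zip(clusters, means))

data = [1, 2, 4, 5]
m = sum(data) / len(data)  # overall mean: 3.0

# K = 1: one cluster whose centroid is the overall mean.
print(wss([data], [m]), bss([data], [m], m))              # 10.0 0.0

# K = 2: clusters {1, 2} and {4, 5} with centroids 1.5 and 4.5.
cl, means = [[1, 2], [4, 5]], [1.5, 4.5]
print(wss(cl, means), bss(cl, means, m))                  # 1.0 9.0
```

In both cases WSS + BSS = 10, confirming that the total sum of squares is constant.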
Internal Measures Cohesion and Separation
A proximity-graph-based approach can also be used for cohesion and separation: cohesion is the sum of the weights of all links within a cluster, and separation is the sum of the weights of links between nodes in the cluster and nodes outside it.
[Figure: proximity graph illustrating cohesion (edges within a cluster) and separation (edges between clusters)]
Internal Measures Silhouette Coefficient
The silhouette coefficient combines the ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.
For an individual point i:
  a = average distance of i to the points in its own cluster
  b = minimum, over the other clusters, of the average distance of i to the points in that cluster
  s = (b - a) / max(a, b)
The coefficient typically lies between 0 and 1; the closer to 1, the better.
Can calculate the average silhouette width for a cluster or a clustering.
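A minimal sketch of the per-point silhouette computation (illustrative names; assumes Euclidean distance and that the point itself is excluded from its own cluster's list):

```python
import math

def silhouette(point, own, others):
    # a: average distance to the other members of the point's own cluster.
    a = sum(math.dist(point, q) for q in own) / len(own)
    # b: lowest average distance to the members of any other cluster.
    b = min(sum(math.dist(point, q) for q in c) / len(c) for c in others)
    return (b - a) / max(a, b)

own = [(0.0, 1.0), (1.0, 0.0)]           # rest of the point's cluster
others = [[(10.0, 10.0), (10.0, 11.0)]]  # one competing cluster
print(silhouette((0.0, 0.0), own, others))  # ~0.93
```

Averaging this value over all points gives the silhouette width for a cluster or for a whole clustering.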
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
(Algorithms for Clustering Data, Jain and Dubes)
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function …
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering – Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids …
Importance of Choosing Initial Centroids …
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Ward's Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
Overcoming K-means Limitations
Original Points K-means Clusters
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
1 3 2 5 4 60
005
01
015
02
1
2
3
4
5
6
1
23 4
5
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
Overcoming K-means Limitations
Original Points K-means Clusters
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram
1 3 2 5 4 60
005
01
015
02
1
2
3
4
5
6
1
23 4
5
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation

Example: SSE for the points 1, 2, 4, 5 on a line, with overall mean m = 3 and cluster means m1 = 1.5 (for {1, 2}) and m2 = 4.5 (for {4, 5}).

K = 1 cluster:
  WSS = (1 − 3)² + (2 − 3)² + (4 − 3)² + (5 − 3)² = 10
  BSS = 4 × (3 − 3)² = 0
  Total = 10 + 0 = 10

K = 2 clusters:
  WSS = (1 − 1.5)² + (2 − 1.5)² + (4 − 4.5)² + (5 − 4.5)² = 1
  BSS = 2 × (3 − 1.5)² + 2 × (4.5 − 3)² = 9
  Total = 1 + 9 = 10
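The worked example can be checked directly. A small sketch with the slide's four points, showing that WSS + BSS stays constant as K changes:

```python
def wss(clusters):
    # within-cluster sum of squares: squared distances to each cluster's mean
    total = 0.0
    for pts in clusters:
        m_i = sum(pts) / len(pts)
        total += sum((x - m_i) ** 2 for x in pts)
    return total

def bss(clusters):
    # between-cluster sum of squares: |C_i| * (m - m_i)^2 summed over clusters
    all_pts = [x for pts in clusters for x in pts]
    m = sum(all_pts) / len(all_pts)
    return sum(len(pts) * (m - sum(pts) / len(pts)) ** 2 for pts in clusters)

k1 = [[1, 2, 4, 5]]          # K = 1
k2 = [[1, 2], [4, 5]]        # K = 2
print(wss(k1), bss(k1))      # 10.0 0.0
print(wss(k2), bss(k2))      # 1.0 9.0 (WSS + BSS = 10 in both cases)
```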
Internal Measures Cohesion and Separation

A proximity-graph-based approach can also be used for cohesion and separation.
[Figure: proximity graph illustrating cohesion (edges within a cluster) and separation (edges between clusters).]
Internal Measures Silhouette Coefficient

Silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.
For an individual point i:
  a = average distance of i to the points in its own cluster
  b = minimum, over the other clusters, of the average distance of i to the points in that cluster
  s = (b − a) / max(a, b)
The silhouette ranges from −1 to 1; the closer to 1, the better.
Can calculate the average silhouette width for a cluster or a clustering.
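The per-point definition above, as a minimal sketch; the clusters and the distance function here are made-up inputs:

```python
def silhouette(i, own, others, dist):
    # a: average distance from i to the other points in its own cluster
    # (assumes i's value appears once in its cluster; own must have > 1 point)
    a = sum(dist(i, q) for q in own if q != i) / (len(own) - 1)
    # b: smallest average distance from i to the points of any other cluster
    b = min(sum(dist(i, q) for q in c) / len(c) for c in others)
    return (b - a) / max(a, b)

dist = lambda p, q: abs(p - q)                 # 1-D Euclidean distance
s = silhouette(1.0, [1.0, 2.0], [[8.0, 9.0]], dist)
print(round(s, 3))
```

Here a = 1.0 and b = 7.5, so s = 6.5 / 7.5 ≈ 0.867: the point sits much closer to its own cluster than to the other one.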
External Measures of Cluster Validity Entropy and Purity

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
(Algorithms for Clustering Data, Jain and Dubes)
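As a sketch of the external measures named above: given each cluster's class-label counts, purity is the overall fraction of points that carry their cluster's majority label, and entropy measures how mixed a cluster's labels are (0 means pure). The label counts below are made-up:

```python
import math

def purity(cluster_counts):
    # cluster_counts: list of {class_label: count} dicts, one per cluster
    total = sum(sum(c.values()) for c in cluster_counts)
    return sum(max(c.values()) for c in cluster_counts) / total

def entropy(counts):
    # entropy of one cluster's label distribution, in bits
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)

clusters = [{"a": 9, "b": 1}, {"a": 2, "b": 8}]
print(purity(clusters))        # (9 + 8) / 20 = 0.85
print(round(entropy(clusters[0]), 3))
```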
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function …
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering – Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids …
Importance of Choosing Initial Centroids …
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Ward's Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Similarity matrix for five points I1–I5:

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram.
[Figure: six points (1-6) in the plane and the corresponding dendrogram; the vertical axis shows the distance at which clusters merge.]
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering: agglomerative (start with each point as an individual cluster and repeatedly merge the closest pair) and divisive (start with one all-inclusive cluster and repeatedly split).
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward: compute the proximity matrix, let each data point be a cluster, then repeatedly merge the two closest clusters and update the proximity matrix until only a single cluster remains.
Key operation is the computation of the proximity of two clusters
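The basic agglomerative loop can be sketched in a few lines. This version recomputes inter-cluster proximity naively with MIN (single link) on each pass rather than maintaining the proximity matrix incrementally, so it is O(n³) but easy to follow:

```python
def agglomerate(points, dist, k):
    # start with every point as its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # find the pair of clusters with the smallest single-link (MIN) distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))    # merge the closest pair
    return clusters

dist = lambda p, q: abs(p - q)
print(agglomerate([1, 2, 4, 5, 10], dist, 2))   # [[1, 2, 4, 5], [10]]
```

Stopping at k clusters is a convenience for the sketch; running the loop to a single cluster and recording each merge yields the full dendrogram.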
Starting Situation
Start with clusters of individual points and a proximity matrix
[Figure: five points p1-p5, each its own cluster, with the p1-p5 proximity matrix.]
Intermediate Situation
After some merging steps we have some clusters
[Figure: clusters C1-C5 and the corresponding proximity matrix.]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
[Figure: clusters C1-C5, with the closest pair C2 and C5 highlighted in the proximity matrix.]
After Merging
The question is: "How do we update the proximity matrix?"
[Figure: clusters C1, C2 ∪ C5, C3, C4 and the proximity matrix with a merged row and column for C2 ∪ C5.]
How to Define Inter-Cluster Similarity
MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function – Ward's Method uses squared error
[Figure: two clusters of points p1-p5 and their proximity matrix.]
Cluster Similarity MIN or Single Link
Similarity of two clusters is based on the two most similar (closest) points in the different clusters.
MST Divisive Hierarchical Clustering
Use a minimum spanning tree (MST) for constructing the hierarchy of clusters.
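A sketch of the MST-based divisive approach: build the minimum spanning tree over the complete proximity graph, then break its largest edges; each connected component that remains is a cluster. Prim's algorithm and the union-find used here are standard building blocks, not taken from the slides:

```python
def mst_clusters(points, dist, k):
    # Prim's algorithm over the complete graph on the points
    in_tree, edges = {0}, []
    while len(in_tree) < len(points):
        w, i, j = min((dist(points[i], points[j]), i, j)
                      for i in in_tree
                      for j in range(len(points)) if j not in in_tree)
        in_tree.add(j)
        edges.append((w, i, j))
    # drop the k-1 heaviest MST edges, then collect components via union-find
    edges.sort()
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for w, i, j in edges[:len(edges) - (k - 1)]:
        parent[find(i)] = find(j)
    groups = {}
    for idx, p in enumerate(points):
        groups.setdefault(find(idx), []).append(p)
    return list(groups.values())

d = lambda p, q: abs(p - q)
print(mst_clusters([1, 2, 4, 5, 10], d, 2))   # [[1, 2, 4, 5], [10]]
```

Breaking the heaviest MST edges first is equivalent to cutting the single-link dendrogram at the top, which is why MST divisive and single-link agglomerative produce the same flat clusterings.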
Cluster Validity
For supervised classification, we have a variety of measures to evaluate how good our model is.
For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"! Then why do we want to evaluate them?
Clusters found in Random Data
[Figure: four panels over the unit square: the random points themselves, and the clusters found in them by K-means, DBSCAN, and Complete Link.]
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4 we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index: used to measure the extent to which cluster labels match externally supplied class labels (e.g., entropy, purity)
o Internal Index: used to measure the goodness of a clustering structure without respect to external information (e.g., SSE)
o Relative Index: used to compare two different clusterings or clusters
Sometimes these are referred to as criteria instead of indices.
Measures of Cluster Validity
Two matrices: the proximity matrix, and an "incidence" matrix with one row and one column per data point, whose entry is 1 if the corresponding pair of points belongs to the same cluster and 0 otherwise.
Compute the correlation between the two matrices; since both are symmetric, only the n(n − 1)/2 entries above the diagonal need to be compared.
High correlation indicates that points that belong to the same cluster are close to each other.
Not a good measure for some density- or contiguity-based clusters.
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
[Figure: two data sets in the unit square, one clustered and one random.]
Corr = -0.9235 (clustered data)   Corr = -0.5810 (random data)
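The measure described above fits in a few lines of NumPy. Note the correlation comes out negative when the proximity matrix holds distances, since same-cluster pairs have incidence 1 and small distance; the toy data set below is an illustrative assumption:

```python
import numpy as np

def validity_correlation(X, labels):
    n = len(X)
    # proximity matrix: pairwise Euclidean distances
    prox = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # incidence matrix: 1 if the pair of points shares a cluster label
    incid = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices(n, k=1)          # the n(n-1)/2 entries above the diagonal
    return np.corrcoef(prox[iu], incid[iu])[0, 1]

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]])
labels = np.array([0, 0, 1, 1])
print(validity_correlation(X, labels))    # strongly negative (close to -1)
```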
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
[Figure: a data set with three well-separated clusters in the unit square, and its similarity matrix sorted by cluster label, showing a sharp block-diagonal structure (similarity scale 0-1).]
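The visual check above amounts to permuting the rows and columns of the similarity matrix so points with the same cluster label become adjacent; good clusterings then show bright diagonal blocks. A small NumPy sketch with a made-up 3-point matrix:

```python
import numpy as np

def sort_by_label(sim, labels):
    # reorder both axes of the similarity matrix by cluster label
    order = np.argsort(labels, kind="stable")
    return sim[np.ix_(order, order)]

sim = np.array([[1.0, 0.2, 0.9],
                [0.2, 1.0, 0.1],
                [0.9, 0.1, 1.0]])
labels = np.array([0, 1, 0])               # points 0 and 2 share a cluster
print(sort_by_label(sim, labels))          # the 0.9 pair moves into the top-left block
```

For real data the reordered matrix would be displayed with an image plot rather than printed.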
Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.
[Figure: DBSCAN clustering of random points and the corresponding sorted similarity matrix.]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.
[Figure: K-means clustering of random points and the corresponding sorted similarity matrix.]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.
[Figure: Complete Link clustering of random points and the corresponding sorted similarity matrix.]

Using Similarity Matrix for Cluster Validation

Clusters in more complicated figures aren't well separated.
[Figure: DBSCAN clustering of a more complicated data set (clusters labeled 1-7) and its similarity matrix.]
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
They may correspond to meaningful taxonomies
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
Hierarchical Clustering
Two main types of hierarchical clustering
Traditional hierarchical algorithms use a similarity or distance matrix
Agglomerative Clustering Algorithm More popular hierarchical clustering technique
Basic algorithm is straightforward
Key operation is the computation of the proximity of two clusters
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices:
o Proximity matrix
o "Incidence" matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, and 0 otherwise
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density- or contiguity-based clusters
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
[Figure: two scatter plots (x, y in 0–1) with their K-means clusterings; Corr = −0.9235 and Corr = −0.5810]
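One way to sketch this measure in pure Python (names are illustrative; here distance stands in for proximity, so a good clustering shows a strong *negative* correlation, matching the negative Corr values quoted above):

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def validity_correlation(points, labels):
    """Correlate pairwise distance with same-cluster incidence over
    all pairs: same-cluster pairs (incidence 1) should have small
    distances, giving a strongly negative correlation."""
    dist, incidence = [], []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dist.append(math.dist(points[i], points[j]))
            incidence.append(1.0 if labels[i] == labels[j] else 0.0)
    return pearson(dist, incidence)

pts = [(0, 0), (0, 1), (9, 0), (9, 1)]
print(validity_correlation(pts, [0, 0, 1, 1]))  # strongly negative
```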
Order the similarity matrix with respect to cluster labels and inspect visually
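The reordering itself can be sketched as follows (hypothetical helper; a good clustering then shows bright blocks on the diagonal of the heatmap):

```python
def sort_by_cluster(sim, labels):
    """Permute rows and columns of a similarity matrix so that
    points in the same cluster become adjacent."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    return [[sim[i][j] for j in order] for i in order]

sim = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.1],
    [0.9, 0.1, 1.0],
]
labels = [0, 1, 0]  # points 0 and 2 belong together
for row in sort_by_cluster(sim, labels):
    print(row)
# rows and columns are reordered to [0, 2, 1]
```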
Using Similarity Matrix for Cluster Validation
[Figure: scatter plot of three well-separated clusters and the similarity matrix sorted by cluster label, shown as a points × points heatmap with similarity scale 0–1]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: random points clustered by DBSCAN and the corresponding sorted similarity matrix heatmap]

Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: random points clustered by K-means and the corresponding sorted similarity matrix heatmap]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: random points clustered by complete link and the corresponding sorted similarity matrix heatmap]
Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN finds seven clusters (1–7) in a more complicated data set; the sorted similarity matrix over 3000 points is shown as a heatmap, similarity scale 0–1]
Clusters in more complicated figures aren't well separated

Internal Index: Used to measure the goodness of a clustering structure without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
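A minimal sketch of using SSE to pick K (illustrative helper with a tiny K-means; the deck's own data sets are not reproduced here):

```python
import math
import random

def kmeans_sse(points, k, iters=50, seed=0):
    """Tiny K-means returning the final SSE for a given K.
    Illustrative only -- a real run would use several restarts."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # assign each point to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: math.dist(p, centroids[c]))].append(p)
        # recompute centroids (keep the old one if a cluster went empty)
        centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return sum(math.dist(p, centroids[i]) ** 2
               for i, c in enumerate(clusters) for p in c)

# two well-separated blobs: SSE drops sharply from K=1 to K=2, then flattens
pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
sse = {k: kmeans_sse(pts, k) for k in (1, 2, 3)}
print(sse[2] < sse[1])
```

Plotting SSE against K and looking for the "knee" where the curve flattens is the usual way to estimate the number of clusters.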
Internal Measures SSE
[Figure: a data set containing 10 clusters, and the SSE vs. number of clusters K curve for K = 2 to 30]
Internal Measures SSE
SSE curve for a more complicated data set
[Figure: a more complicated data set with clusters labeled 1–7; SSE of clusters found using K-means]
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
[Figure: histogram of SSE values (0.016–0.034, counts up to 50) over clusterings of random data, next to a scatter plot of the clustered data]
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
[Figure: the same two data sets as before, with their K-means clusterings; Corr = −0.9235 and Corr = −0.5810]
Cluster Cohesion: Measures how closely related the objects in a cluster are
Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters
Example: Squared Error
Internal Measures: Cohesion and Separation
Cohesion is measured by the within-cluster sum of squares (WSS):
WSS = Σi Σx∈Ci (x − mi)²
Separation is measured by the between-cluster sum of squares (BSS):
BSS = Σi |Ci| (m − mi)²
where mi is the centroid of cluster Ci and m is the overall mean

Internal Measures: Cohesion and Separation
Example: SSE
[Figure: points 1, 2, 4, 5 on a number line, with cluster centroids m1 = 1.5 and m2 = 4.5 and overall mean m = 3]

K=2 clusters:
WSS = (1 − 1.5)² + (2 − 1.5)² + (4 − 4.5)² + (5 − 4.5)² = 1
BSS = 2 × (3 − 1.5)² + 2 × (4.5 − 3)² = 9
Total = 1 + 9 = 10

K=1 cluster:
WSS = (1 − 3)² + (2 − 3)² + (4 − 3)² + (5 − 3)² = 10
BSS = 4 × (3 − 3)² = 0
Total = 10 + 0 = 10
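The WSS/BSS bookkeeping can be checked with a small sketch (hypothetical helper; note that Total = WSS + BSS stays constant across clusterings of the same data):

```python
def wss_bss(clusters):
    """Cohesion (WSS) and separation (BSS) for clusters of 1-D points."""
    points = [x for c in clusters for x in c]
    m = sum(points) / len(points)  # overall mean
    wss = sum((x - sum(c) / len(c)) ** 2 for c in clusters for x in c)
    bss = sum(len(c) * (m - sum(c) / len(c)) ** 2 for c in clusters)
    return wss, bss

print(wss_bss([[1, 2, 4, 5]]))    # K=1: (10.0, 0.0)
print(wss_bss([[1, 2], [4, 5]]))  # K=2: (1.0, 9.0) -- total is still 10
```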
A proximity graph based approach can also be used for cohesion and separation
Internal Measures: Cohesion and Separation
[Figure: proximity graph illustrating cohesion (within-cluster links) and separation (between-cluster links)]
Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings
For an individual point i:
o a = average distance of i to the points in its own cluster
o b = the lowest average distance of i to the points in another cluster
o s = (b − a) / max(a, b)
Can calculate the average silhouette width for a cluster or a clustering
Internal Measures: Silhouette Coefficient
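The silhouette coefficient for a point i uses a (the average distance to its own cluster) and b (the lowest average distance to another cluster): s = (b − a) / max(a, b). A small pure-Python sketch (hypothetical helper name):

```python
import math

def silhouette(i, points, labels):
    """Silhouette coefficient for point i: s = (b - a) / max(a, b).
    Assumes i's cluster has at least two points."""
    def mean_dist(members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)
    own = [j for j, lab in enumerate(labels) if lab == labels[i] and j != i]
    a = mean_dist(own)  # cohesion: mean distance within i's cluster
    b = min(            # separation: closest other cluster on average
        mean_dist([j for j, lab in enumerate(labels) if lab == other])
        for other in set(labels) if other != labels[i]
    )
    return (b - a) / max(a, b)

pts = [(0, 0), (0, 1), (5, 0), (5, 1)]
labels = [0, 0, 1, 1]
print(silhouette(0, pts, labels))  # well placed: s is about 0.80
```

Values near 1 indicate a point deep inside its cluster; values near 0 or below suggest it sits between clusters or may be misassigned.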
External Measures of Cluster Validity Entropy and Purity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
— Algorithms for Clustering Data, Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function …
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering – Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids …
Importance of Choosing Initial Centroids …
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Ward's Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
[Table: similarity matrix for points I1–I5]
     I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00
Agglomerative Clustering Algorithm
More popular hierarchical clustering technique
Basic algorithm is straightforward: start with each point as its own cluster, compute the proximity matrix, then repeatedly merge the two closest clusters and update the proximity matrix until only one cluster remains
Key operation is the computation of the proximity of two clusters
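The basic agglomerative loop can be sketched as follows (illustrative code, not from the slides; single link by default, and the proximity "matrix" is recomputed on demand for simplicity):

```python
import math

def agglomerative(points, k, linkage=min):
    """Merge the closest pair of clusters until k remain.
    `linkage` aggregates pairwise point distances:
    min = single link, max = complete link."""
    clusters = [[p] for p in points]  # start: one cluster per point
    def proximity(a, b):
        return linkage(math.dist(p, q) for p in a for q in b)
    while len(clusters) > k:
        # key operation: find the closest pair of clusters
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: proximity(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge the pair
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (5, 0), (5, 1)]
print(agglomerative(pts, 2))  # [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]
```

A real implementation caches the proximity matrix and updates only the merged row and column, which is where the different linkage definitions (MIN, MAX, group average, Ward's) come in.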
Starting Situation
Start with clusters of individual points and a proximity matrix
[Figure: points p1–p5 and their pairwise proximity matrix]
Intermediate Situation
After some merging steps, we have some clusters
[Figure: clusters C1–C5 and the updated proximity matrix]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
[Figure: clusters C1–C5 with C2 and C5 highlighted, and the proximity matrix]
After Merging
The question is "How do we update the proximity matrix?"
[Figure: merged cluster C2 ∪ C5 alongside C1, C3, C4; the proximity matrix rows and columns for the merged cluster are unknown]
How to Define Inter-Cluster Similarity
[Figure: two clusters drawn from points p1–p5, with their proximity matrix]
o MIN
o MAX
o Group Average
o Distance Between Centroids
o Other methods driven by an objective function – Ward's Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
Starting Situation
Start with clusters of individual points and a proximity matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
Intermediate Situation
After some merging steps we have some clusters
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
Cluster Similarity: MIN or Single Link. The similarity of two clusters is based on the two most similar (closest) points in the different clusters; similarity is determined by one pair of points, i.e., by one link in the proximity graph.
MST: Divisive Hierarchical Clustering. Use the minimum spanning tree (MST) for constructing a hierarchy of clusters.
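The MST-based divisive idea can be illustrated with a small numpy sketch (not lecture code; `mst_divisive` is a hypothetical helper): build the minimum spanning tree of the points, then remove its longest edges so the remaining forest has the desired number of components:

```python
import numpy as np

def mst_divisive(X, n_clusters):
    """Divisive clustering via the minimum spanning tree: build the MST
    with Prim's algorithm, then cut its longest edges."""
    X = np.asarray(X, float)
    n = len(X)
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))   # pairwise distances
    in_tree = [0]
    edges = []
    while len(in_tree) < n:                   # Prim: grow tree one edge at a time
        out = [v for v in range(n) if v not in in_tree]
        sub = D[np.ix_(in_tree, out)]
        a, b = divmod(sub.argmin(), sub.shape[1])
        edges.append((in_tree[a], out[b], sub[a, b]))
        in_tree.append(out[b])
    edges.sort(key=lambda e: e[2])
    keep = edges[: n - n_clusters]            # drop the n_clusters - 1 longest edges
    labels = np.arange(n)                     # connected components of the forest
    for u, v, _ in keep:
        labels[labels == labels[v]] = labels[u]
    _, labels = np.unique(labels, return_inverse=True)   # relabel 0..k-1
    return labels

X = np.array([[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]])
lab = mst_divisive(X, 2)                      # first three points vs last three
```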
Cluster Validity
For supervised classification, we have a variety of measures to evaluate how good our model is.
For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder." Then why do we want to evaluate them?
Clusters found in Random Data
[Figure: random points in the unit square, shown four ways: the original Random Points, and the "clusters" found by K-means, DBSCAN, and Complete Link]
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index: used to measure the extent to which cluster labels match externally supplied class labels (e.g., entropy).
o Internal Index: used to measure the goodness of a clustering structure without respect to external information (e.g., SSE).
o Relative Index: used to compare two different clusterings or clusters; often an external or internal index is used for this purpose.
Sometimes these are referred to as criteria instead of indices.
Measures of Cluster Validity
Two matrices: the proximity matrix, and an "incidence" matrix with one row and one column per data point, where an entry is 1 if the associated pair of points belongs to the same cluster and 0 otherwise.
Compute the correlation between the two matrices; since both are symmetric, only the entries above the diagonal need to be compared.
High correlation indicates that points that belong to the same cluster are close to each other.
Not a good measure for some density- or contiguity-based clusters.
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
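As a sketch (illustrative numpy, not lecture code; `validity_correlation` is an assumed name), the correlation measure can be computed by flattening the upper triangles of the two matrices:

```python
import numpy as np

def validity_correlation(X, labels):
    """Correlation between the proximity (distance) matrix and the cluster
    incidence matrix, over the upper-triangular entries only."""
    X = np.asarray(X, float)
    prox = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))       # pairwise distances
    labels = np.asarray(labels)
    incid = (labels[:, None] == labels[None, :]).astype(float)  # 1 = same cluster
    iu = np.triu_indices(len(X), k=1)                           # each pair once
    return np.corrcoef(prox[iu], incid[iu])[0, 1]

# Tight, well-separated clusters: small distances coincide with incidence 1,
# so the correlation is strongly negative (distance is a dissimilarity).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels = [0] * 20 + [1] * 20
r = validity_correlation(X, labels)
```

With a similarity matrix instead of a distance matrix, a good clustering would give a strongly positive correlation.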
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
[Two scatter plots in the unit square: one with three well-separated clusters, one with random points]
Corr = -0.9235   Corr = -0.5810
Order the similarity matrix with respect to cluster labels and inspect visually.
Using Similarity Matrix for Cluster Validation
[Figure: scatter plot of three well-separated clusters, and the corresponding similarity matrix sorted by cluster label, showing sharp block-diagonal structure (similarity scale 0–1)]
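Reordering the similarity matrix is essentially an `argsort`; the following is an illustrative numpy sketch (function and variable names are assumptions, not lecture code):

```python
import numpy as np

def sorted_similarity(X, labels):
    """Similarity matrix with rows/columns grouped by cluster label;
    a good clustering shows up as bright diagonal blocks."""
    X = np.asarray(X, float)
    dist = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    sim = 1.0 - dist / dist.max()          # rescale distance into a [0, 1] similarity
    order = np.argsort(labels, kind="stable")   # group rows/columns by label
    return sim[np.ix_(order, order)]

rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [4.0, 4.0]])
labels = np.array([0, 1] * 10)             # points arrive in interleaved order
X = centers[labels] + rng.normal(0, 0.2, (20, 2))
S = sorted_similarity(X, labels)
# After sorting, the first 10 rows/columns form one cluster: within-block
# similarities are much higher than cross-block ones.
```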
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: DBSCAN on random points: sorted similarity matrix (similarity scale 0–1) and the corresponding scatter plot]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: K-means on random points: sorted similarity matrix and scatter plot]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: Complete Link on random points: scatter plot and sorted similarity matrix]
Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN clustering of a more complicated data set (clusters labeled 1–7) and its sorted similarity matrix]
Clusters in more complicated figures aren't well separated.
Internal Index: used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE). It can also be used to estimate the number of clusters.
Internal Measures SSE
[Figure: data set with ten natural clusters, and the SSE curve for K = 2 to 30; SSE drops sharply until K reaches the true number of clusters, then flattens]
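Estimating the number of clusters from the SSE curve (the "knee" in the plot above's style of curve) can be sketched as follows. This is illustrative numpy, not lecture code; the farthest-point initialization is an assumption chosen to make the small k-means deterministic:

```python
import numpy as np

def kmeans_sse(X, k, iters=40):
    """Small k-means returning the final SSE; farthest-point initialization
    keeps the result deterministic on well-separated data."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = ((X[:, None] - np.array(centers)[None]) ** 2).sum(-1).min(1)
        centers.append(X[d.argmax()])                 # farthest point so far
    centers = np.array(centers, float)
    for _ in range(iters):
        assign = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(0)   # recompute centroids
    return ((X[:, None] - centers[None]) ** 2).sum(-1).min(1).sum()

rng = np.random.default_rng(2)
# Three well-separated clusters of 30 points each.
X = np.vstack([rng.normal(c, 0.2, (30, 2)) for c in ((0, 0), (5, 0), (0, 5))])
sse = {k: kmeans_sse(X, k) for k in (1, 2, 3, 4, 5)}
# SSE falls steeply up to the true K (3 here) and only creeps down after it.
```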
Internal Measures SSE
SSE curve for a more complicated data set
[Figure: complicated data set with clusters labeled 1–7, and the SSE of clusters found using K-means]
Need a framework to interpret any measure: if our evaluation measure has the value, say, 10, is that good, fair, or poor?
Statistics provide a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data.
For comparing the results of two different sets of cluster analyses, a framework is less necessary.
Framework for Cluster Validity
Example: compare the SSE of the actual clustering against the distribution of SSE values obtained by clustering random data.
Statistical Framework for SSE
[Figure: histogram of SSE over many clusterings of random data (SSE roughly 0.016–0.034), alongside the scatter plot of the actual clustered data]
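The idea can be sketched by clustering many random data sets the same way and checking where the actual SSE falls relative to that reference distribution (illustrative numpy; the tight three-cluster data set and the farthest-point initialization are assumptions, not the slide's exact data):

```python
import numpy as np

def kmeans_sse(X, k, iters=40):
    """K-means SSE with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = ((X[:, None] - np.array(centers)[None]) ** 2).sum(-1).min(1)
        centers.append(X[d.argmax()])
    centers = np.array(centers, float)
    for _ in range(iters):
        assign = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(0)
    return ((X[:, None] - centers[None]) ** 2).sum(-1).min(1).sum()

rng = np.random.default_rng(3)
# "Actual" data: three tight clusters inside the unit square.
real = np.vstack([rng.normal(m, 0.05, (20, 2))
                  for m in ((0.2, 0.2), (0.8, 0.2), (0.5, 0.8))])
real_sse = kmeans_sse(real, 3)

# Reference distribution: SSE of 3-means on 60 uniformly random points,
# repeated over many random data sets.
random_sses = [kmeans_sse(rng.random((60, 2)), 3) for _ in range(50)]
# The actual SSE falls far below the random-data distribution, so the
# cluster structure is unlikely to be a chance artifact.
```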
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
Statistical Framework for Correlation
[Two scatter plots in the unit square: well-separated clusters and random points]
Corr = -0.9235   Corr = -0.5810
Cluster Cohesion: measures how closely related the objects in a cluster are.
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.
Example: Squared Error
Internal Measures: Cohesion and Separation
Cohesion is measured by the within-cluster sum of squares (SSE):
WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2
Separation is measured by the between-cluster sum of squares:
BSS = \sum_i |C_i| (m - m_i)^2
where |C_i| is the size of cluster i, m_i is its centroid, and m is the overall mean.
Internal Measures: Cohesion and Separation
Example: SSE for the points 1, 2, 4, 5 on a number line, with cluster centroids m1 = 1.5 and m2 = 4.5 and overall mean m = 3.
K=1 cluster:
WSS = (1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2 = 10
BSS = 4 × (3 - 3)^2 = 0
Total = 10 + 0 = 10
K=2 clusters ({1, 2} and {4, 5}):
WSS = (1 - 1.5)^2 + (2 - 1.5)^2 + (4 - 4.5)^2 + (5 - 4.5)^2 = 1
BSS = 2 × (3 - 1.5)^2 + 2 × (4.5 - 3)^2 = 9
Total = 1 + 9 = 10
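The WSS/BSS bookkeeping can be checked directly; the sketch below (illustrative numpy, `wss_bss` is a hypothetical helper) reproduces the slide's numbers and shows that WSS + BSS is constant:

```python
import numpy as np

def wss_bss(X, labels):
    """Within-cluster (cohesion) and between-cluster (separation) sums of
    squares; WSS + BSS is constant for a given data set."""
    X = np.asarray(X, float)
    labels = np.asarray(labels)
    m = X.mean(0)                                  # overall mean
    wss = bss = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        mi = pts.mean(0)                           # cluster centroid
        wss += ((pts - mi) ** 2).sum()
        bss += len(pts) * ((m - mi) ** 2).sum()
    return wss, bss

X = np.array([[1.0], [2.0], [4.0], [5.0]])         # the slide's 1-D points
k1 = wss_bss(X, [0, 0, 0, 0])                      # K=1: (10.0, 0.0)
k2 = wss_bss(X, [0, 0, 1, 1])                      # K=2: (1.0, 9.0)
```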
A proximity graph-based approach can also be used for cohesion and separation: cohesion is the sum of the weights of all links within a cluster, and separation is the sum of the weights of links between nodes in the cluster and nodes outside the cluster.
Internal Measures: Cohesion and Separation
[Figure: proximity graph illustrating cohesion (within-cluster links) and separation (between-cluster links)]
Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.
For an individual point i:
o a = average distance of i to the points in its own cluster
o b = minimum, over the other clusters, of the average distance of i to the points in that cluster
o s = (b - a) / max(a, b)
Can calculate the average silhouette width for a cluster or a clustering.
Internal Measures: Silhouette Coefficient
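A direct implementation of the per-point silhouette (illustrative numpy sketch, not lecture code):

```python
import numpy as np

def silhouette(X, labels):
    """Per-point silhouette s = (b - a) / max(a, b): a is the mean distance
    to the point's own cluster, b the smallest mean distance to any other."""
    X = np.asarray(X, float)
    labels = np.asarray(labels)
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    s = np.empty(len(X))
    for i in range(len(X)):
        own = (labels == labels[i])
        own[i] = False                         # exclude the point itself from a
        a = D[i, own].mean() if own.any() else 0.0
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.2, (15, 2)), rng.normal(5, 0.2, (15, 2))])
labels = np.array([0] * 15 + [1] * 15)
s = silhouette(X, labels)                      # near 1 for crisp clusters
```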
External Measures of Cluster Validity Entropy and Purity
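Entropy and purity compare cluster labels against known class labels, weighting each cluster's contribution by its size. A small illustrative sketch (`purity_entropy` is a hypothetical helper, not lecture code):

```python
import numpy as np

def purity_entropy(labels, classes):
    """External validity: size-weighted purity and entropy of a clustering
    against known class labels."""
    labels = np.asarray(labels)
    classes = np.asarray(classes)
    n = len(labels)
    purity = entropy = 0.0
    for c in np.unique(labels):
        members = classes[labels == c]
        _, counts = np.unique(members, return_counts=True)
        p = counts / counts.sum()              # class distribution in cluster c
        purity += counts.sum() / n * p.max()
        entropy += counts.sum() / n * -(p * np.log2(p)).sum()
    return purity, entropy

# A perfect clustering has purity 1 and entropy 0; mixing the classes
# lowers purity and raises entropy.
perfect = purity_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])   # (1.0, 0.0)
mixed = purity_entropy([0, 0, 0, 0], ["a", "a", "b", "b"])     # (0.5, 1.0)
```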
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art, accessible only to those true believers who have experience and great courage."
-- Algorithms for Clustering Data, Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function …
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering – Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids …
Importance of Choosing Initial Centroids …
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Ward's Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
C1
C4
C2 C5
C3
C2 C1
C1
C3
C5
C4
C2
C3 C4 C5
Proximity Matrix
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
After Merging
The question is ldquoHow do we update the proximity matrixrdquo
C1
C4
C2 U C5
C3
C2 U C5 C1
C1
C3
C4
C2 U C5
C3 C4
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Similarity
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
Proximity Matrix
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
How to Define Inter-Cluster Similarity
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5
Proximity Matrix
MIN MAX Group Average Distance Between Centroids Other methods driven by an objective
function ndash Wardrsquos Method uses squared error
times times
Cluster Similarity MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Similarity matrix for items I1-I5 (used in the hierarchical clustering example):

     I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
How to Define Inter-Cluster Similarity

Given points p1, p2, p3, p4, p5 and their proximity matrix, the similarity between two clusters can be defined by:
o MIN
o MAX
o Group Average
o Distance Between Centroids
o Other methods driven by an objective function (Ward's Method uses squared error)
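In code, these distance-based definitions differ only in how they aggregate the pairwise proximities between the two clusters. A minimal sketch (the 1-D point values below are invented for illustration; any metric works the same way):

```python
# Four ways to define inter-cluster distance from pairwise point distances.

def dist(p, q):
    return abs(p - q)  # Euclidean distance in 1-D

def d_min(c1, c2):  # MIN / single link: closest pair across the clusters
    return min(dist(p, q) for p in c1 for q in c2)

def d_max(c1, c2):  # MAX / complete link: farthest pair across the clusters
    return max(dist(p, q) for p in c1 for q in c2)

def d_avg(c1, c2):  # group average: mean over all cross-cluster pairs
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def d_centroid(c1, c2):  # distance between the two cluster centroids
    return dist(sum(c1) / len(c1), sum(c2) / len(c2))

a, b = [1.0, 2.0], [4.0, 5.0, 6.0]
print(d_min(a, b), d_max(a, b), d_avg(a, b), d_centroid(a, b))  # 2.0 5.0 3.5 3.5
```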
Cluster Similarity: MIN or Single Link. The similarity of two clusters is based on the two most similar (closest) points in the different clusters; it is determined by one pair of points, i.e., by one link in the proximity graph.
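Single link can be traced by hand on the 5-item similarity matrix shown earlier; a compact sketch that repeatedly merges the most similar pair of clusters:

```python
# Agglomerative single-link clustering on the 5-item similarity matrix
# from the slides (items I1..I5, indexed 0..4).
import itertools

sim = [
    [1.00, 0.90, 0.10, 0.65, 0.20],
    [0.90, 1.00, 0.70, 0.60, 0.50],
    [0.10, 0.70, 1.00, 0.40, 0.30],
    [0.65, 0.60, 0.40, 1.00, 0.80],
    [0.20, 0.50, 0.30, 0.80, 1.00],
]

def single_link(c1, c2):
    # Cluster similarity = similarity of the two most similar members.
    return max(sim[i][j] for i in c1 for j in c2)

clusters = [frozenset([i]) for i in range(5)]
merges = []
while len(clusters) > 1:
    # Merge the pair of clusters with the highest single-link similarity.
    a, b = max(itertools.combinations(clusters, 2),
               key=lambda pair: single_link(*pair))
    clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    merges.append((sorted(a | b), single_link(a, b)))

for members, s in merges:
    print(members, s)
```

The merge order it prints ({I1,I2} at 0.90, then {I4,I5} at 0.80, then I3 joins at 0.70, and a final merge at 0.65) is exactly the dendrogram of the single-link example.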
MST Divisive Hierarchical Clustering: use a minimum spanning tree (MST) for constructing the hierarchy of clusters.
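The divisive variant can be sketched directly: build an MST of the points, then repeatedly delete the largest remaining edge, each deletion splitting one cluster in two. The helper functions and 1-D data below are illustrative, not from the slides:

```python
# MST-based divisive hierarchical clustering (sketch, 1-D points).
def mst_edges(points):
    # Prim's algorithm: grow the tree one cheapest outgoing edge at a time.
    in_tree, edges = {0}, []
    while len(in_tree) < len(points):
        w, u, v = min(
            (abs(points[i] - points[j]), i, j)
            for i in in_tree for j in range(len(points)) if j not in in_tree
        )
        in_tree.add(v)
        edges.append((w, u, v))
    return edges

def mst_clusters(points, k):
    # Keep the n-k cheapest MST edges (i.e., cut the k-1 largest ones);
    # the resulting connected components are the clusters.
    keep = sorted(mst_edges(points))[:len(points) - k]
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for _, u, v in keep:
        parent[find(u)] = find(v)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())

pts = [1.0, 2.0, 3.0, 10.0, 11.0, 25.0]
print(mst_clusters(pts, 3))  # [[0, 1, 2], [3, 4], [5]]
```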
Cluster Validity
For supervised classification, we have a variety of measures to evaluate how good our model is.
For cluster analysis, the analogous question is: how do we evaluate the "goodness" of the resulting clusters? But "clusters are in the eye of the beholder"! Then why do we want to evaluate them?
Clusters found in Random Data
[Figure: four scatter plots over x, y in [0, 1]: the original Random Points, and the clusters found in that same random data by K-means, DBSCAN, and Complete Link.]
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information. Use only the data.
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index: used to measure the extent to which cluster labels match externally supplied class labels.
o Internal Index: used to measure the goodness of a clustering structure without respect to external information.
o Relative Index: used to compare two different clusterings or clusters.
Sometimes these are referred to as criteria instead of indices.
Measures of Cluster Validity
Two matrices:
o The proximity matrix.
o An "incidence" matrix, with one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, 0 otherwise.
Compute the correlation between the two matrices. Since both matrices are symmetric, only n(n-1)/2 entries need to be compared.
High correlation indicates that points that belong to the same cluster are close to each other.
Not a good measure for some density- or contiguity-based clusters.
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
[Figure: two scatter plots over x, y in [0, 1].]
Corr = -0.9235 and Corr = -0.5810 for the two data sets, respectively.
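This correlation can be computed directly; a sketch with a hand-made data set (the points and labels below are invented for illustration). Because the proximity entries here are distances, a good clustering yields a strongly negative correlation, in line with the Corr values above:

```python
# Pearson correlation between an incidence matrix (1 if same cluster, else 0)
# and a distance matrix, taken over the n(n-1)/2 point pairs.
import math

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def incidence_proximity_corr(points, labels):
    inc, prox = [], []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            inc.append(1.0 if labels[i] == labels[j] else 0.0)
            prox.append(math.dist(points[i], points[j]))
    return corr(inc, prox)

# Two tight, well-separated clusters: the correlation is strongly negative.
pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
labels = [0, 0, 0, 1, 1, 1]
print(incidence_proximity_corr(pts, labels))
```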
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
[Figure: a scatter plot of well-separated points (x, y in [0, 1]) and its similarity matrix with rows and columns sorted by cluster label; blocks of high similarity appear along the diagonal. Similarity scale runs from 0 to 1.]
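The visual check can be scripted: compute pairwise similarities, order rows and columns by cluster label, and compare the average similarity inside the diagonal blocks to the average outside them. The data and the particular similarity function below are illustrative assumptions:

```python
# Sort a similarity matrix by cluster label and measure its block structure.
import math

def similarity(p, q):
    return 1.0 / (1.0 + math.dist(p, q))  # similarity decays with distance

pts = [(9, 9), (0, 0), (9, 10), (0, 1), (10, 9), (1, 0)]
labels = [1, 0, 1, 0, 1, 0]
n = len(pts)

# Reorder points so that members of the same cluster are adjacent.
order = sorted(range(n), key=lambda i: labels[i])
S = [[similarity(pts[i], pts[j]) for j in order] for i in order]
sorted_labels = [labels[i] for i in order]

within = [S[i][j] for i in range(n) for j in range(n)
          if i != j and sorted_labels[i] == sorted_labels[j]]
between = [S[i][j] for i in range(n) for j in range(n)
           if sorted_labels[i] != sorted_labels[j]]
within_avg = sum(within) / len(within)
between_avg = sum(between) / len(between)
print(round(within_avg, 3), round(between_avg, 3))
```

For crisp clusters the within-block average is far above the between-block average; for random data the two come out close, which is exactly what the "not so crisp" heat maps show.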
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: sorted similarity matrix (similarity scale 0 to 1) and scatter plot for DBSCAN on random data; the diagonal blocks are much less distinct.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: sorted similarity matrix and scatter plot for K-means on random data.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: sorted similarity matrix and scatter plot for Complete Link on random data.]
Using Similarity Matrix for Cluster Validation
[Figure: a more complicated data set with clusters labeled 1 through 7, found by DBSCAN, and its sorted similarity matrix (similarity scale 0 to 1).]
Clusters in more complicated figures aren't well separated.
Internal Index: used to measure the goodness of a clustering structure without respect to external information; SSE is one such measure.
SSE is good for comparing two clusterings or two clusters (average SSE).
SSE can also be used to estimate the number of clusters.
Internal Measures SSE
[Figure: a ten-cluster data set and the plot of SSE versus K for K = 2 to 30; the knee of the curve suggests an appropriate number of clusters.]
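The elbow heuristic can be sketched with a tiny 1-D K-means (Lloyd's algorithm with a deterministic initialization; the data below are invented for illustration). SSE always falls as K grows, but the drop flattens once K passes the natural number of clusters:

```python
# SSE as a function of K for a tiny 1-D data set with two obvious groups.
def kmeans_sse(data, k, iters=20):
    # Deterministic init: spread the centroids across the sorted data.
    data = sorted(data)
    centroids = [data[i * (len(data) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            # Assign each point to its nearest centroid.
            clusters[min(range(k), key=lambda c: (x - centroids[c]) ** 2)].append(x)
        # Recompute centroids (keep the old one if a cluster goes empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sum((x - centroids[min(range(k), key=lambda c: (x - centroids[c]) ** 2)]) ** 2
               for x in data)

data = [1, 2, 4, 5]
for k in (1, 2, 3):
    print(k, kmeans_sse(data, k))
```

For these four points the SSE drops from 10 at K=1 to 1 at K=2 (the two natural groups), then shrinks only modestly afterwards.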
Internal Measures SSE
SSE curve for a more complicated data set
[Figure: the complicated data set with clusters labeled 1 through 7, and the SSE of clusters found using K-means.]
Need a framework to interpret any measure: for example, if our measure of evaluation has the value 10, is that good, fair, or poor?
Statistics provide a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data.
For comparing the results of two different sets of cluster analyses, a framework is less necessary; however, there is still the question of whether the difference between two index values is significant.
Framework for Cluster Validity
Example
Statistical Framework for SSE
[Figure: histogram of SSE values (roughly 0.016 to 0.034) obtained by clustering sets of random data points, alongside the scatter plot of the actual data set (x, y in [0, 1]).]
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
[Figure: the same two data sets as before (x, y in [0, 1]).]
Corr = -0.9235 and Corr = -0.5810, respectively.
Cluster Cohesion: measures how closely related the objects in a cluster are.
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.
Example: Squared Error
Internal Measures: Cohesion and Separation
Cohesion is measured by the within-cluster sum of squares: $WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2$
Separation is measured by the between-cluster sum of squares: $BSS = \sum_i |C_i| (m - m_i)^2$
where $m_i$ is the centroid of cluster $C_i$, $|C_i|$ is its size, and $m$ is the overall mean of the data.
Internal Measures Cohesion and Separation
Example: SSE for the four points 1, 2, 4, 5 on a number line, with cluster centroids m1 = 1.5 and m2 = 4.5 and overall mean m = 3.

K=1 cluster:
WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
BSS = 4 * (3-3)^2 = 0
Total = 10 + 0 = 10

K=2 clusters ({1, 2} and {4, 5}):
WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
BSS = 2 * (3-1.5)^2 + 2 * (4.5-3)^2 = 9
Total = 1 + 9 = 10

Note that WSS + BSS (the Total SSE) is the same in both cases.
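The arithmetic above can be checked with a direct computation of WSS and BSS:

```python
# WSS (cohesion) and BSS (separation) for the worked example above.
def wss_bss(data, clusters):
    m = sum(data) / len(data)                 # overall mean
    wss = bss = 0.0
    for c in clusters:
        mi = sum(c) / len(c)                  # cluster centroid
        wss += sum((x - mi) ** 2 for x in c)  # within-cluster scatter
        bss += len(c) * (m - mi) ** 2         # between-cluster scatter
    return wss, bss

data = [1, 2, 4, 5]
print(wss_bss(data, [[1, 2, 4, 5]]))    # K=1: (10.0, 0.0)
print(wss_bss(data, [[1, 2], [4, 5]]))  # K=2: (1.0, 9.0)
```

Both calls sum to the same total of 10, confirming that WSS + BSS is constant across clusterings of the same data.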
A proximity graph based approach can also be used for cohesion and separation: cluster cohesion is the sum of the weights of all links within a cluster, and cluster separation is the sum of the weights of links from nodes in the cluster to nodes outside it.
Internal Measures: Cohesion and Separation
[Figure: a proximity graph with within-cluster edges illustrating cohesion and between-cluster edges illustrating separation.]
Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.
For an individual point i:
o Calculate a = average distance of i to the points in its own cluster.
o Calculate b = min (average distance of i to the points in another cluster).
o The silhouette coefficient is s = (b - a) / max(a, b); it lies between -1 and 1, and the closer to 1 the better.
Can calculate the average silhouette width for a cluster or a clustering.
Internal Measures: Silhouette Coefficient
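A per-point implementation of the standard formula s = (b - a) / max(a, b), on invented 1-D data (and assuming every cluster has at least two points, so that a is defined):

```python
# Silhouette coefficient per point: a = mean distance to own cluster,
# b = smallest mean distance to any other cluster.
def silhouette(data, labels):
    scores = []
    for i, x in enumerate(data):
        own = [abs(x - y) for j, y in enumerate(data)
               if j != i and labels[j] == labels[i]]
        a = sum(own) / len(own)  # assumes the cluster has another member
        others = set(labels) - {labels[i]}
        b = min(
            sum(abs(x - y) for j, y in enumerate(data) if labels[j] == L)
            / labels.count(L)
            for L in others
        )
        scores.append((b - a) / max(a, b))
    return scores

data = [1.0, 2.0, 8.0, 9.0]
labels = [0, 0, 1, 1]
s = silhouette(data, labels)
print(s, sum(s) / len(s))  # per-point scores and average silhouette width
```

With two tight, well-separated groups, every score comes out close to 1.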
External Measures of Cluster Validity Entropy and Purity
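The slide names entropy and purity without giving formulas; the usual definitions, for cluster j containing n_ij members of class i and n_j points total, are purity_j = max_i (n_ij / n_j) and entropy_j = -sum_i (n_ij / n_j) log2(n_ij / n_j), with overall values weighted by cluster size. A sketch with invented counts:

```python
import math

# counts[j][i] = number of points of class i in cluster j (illustrative).
counts = [[5, 1, 0],   # cluster 0: mostly class 0
          [1, 4, 1],   # cluster 1: mostly class 1
          [0, 0, 6]]   # cluster 2: pure class 2

def purity(counts):
    n = sum(map(sum, counts))
    return sum(max(row) for row in counts) / n

def entropy(counts):
    n = sum(map(sum, counts))
    h = 0.0
    for row in counts:
        nj = sum(row)
        hj = -sum(p * math.log2(p) for p in (c / nj for c in row) if p > 0)
        h += (nj / n) * hj  # weight each cluster's entropy by its size
    return h

print(purity(counts), entropy(counts))
```

Higher purity and lower entropy both indicate clusters that agree with the external class labels; a perfect clustering has purity 1 and entropy 0.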
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
(Algorithms for Clustering Data, Jain and Dubes)
Final Comment on Cluster Validity
A sample similarity matrix for five items:

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
How to Define Inter-Cluster Similarity

[Figure: points p1-p5 and their proximity matrix.]

o MIN
o MAX
o Group Average
o Distance Between Centroids
o Other methods driven by an objective function (Ward's Method uses squared error)
Cluster Similarity MIN or Single Link: Similarity of two clusters is based on the two most similar (closest) points in the different clusters.
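In distance terms, single link scores a pair of clusters by their closest pair of points. A minimal sketch of that computation (the function names are illustrative, not from the lecture):

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link_distance(cluster_a, cluster_b):
    """MIN / single link: distance between the two closest points
    drawn from different clusters."""
    return min(dist(p, q) for p in cluster_a for q in cluster_b)

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
print(single_link_distance(a, b))  # 2.0, from the pair (1,0) and (3,0)
```

Replacing `min` with `max` gives complete link (MAX), and averaging over all pairs gives group average.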
MST Divisive Hierarchical Clustering: Use the minimum spanning tree (MST) for constructing a hierarchy of clusters.
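One common reading of this procedure: build the MST over the points, then repeatedly break the longest remaining edge; the connected components at each step form the hierarchy. A sketch under that assumption (names are illustrative):

```python
import math

def mst_edges(points):
    """Prim's algorithm; returns MST edges as (weight, i, j) triples."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        w, i, j = min(
            (math.dist(points[i], points[j]), i, j)
            for i in in_tree for j in range(n) if j not in in_tree
        )
        in_tree.add(j)
        edges.append((w, i, j))
    return edges

def divisive_clusters(points, k):
    """Cut the k-1 longest MST edges; return clusters as sets of point indices."""
    keep = sorted(mst_edges(points))[: len(points) - k]  # drop the k-1 longest
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, i, j in keep:               # union endpoints of the kept edges
        parent[find(i)] = find(j)
    groups = {}
    for idx in range(len(points)):
        groups.setdefault(find(idx), set()).add(idx)
    return list(groups.values())

pts = [(0, 0), (0, 1), (5, 0), (5, 1)]
print(divisive_clusters(pts, 2))  # two clusters: {0, 1} and {2, 3}
```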
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is.
For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder." Then why do we want to evaluate them?
Clusters found in Random Data

[Figure: four scatter plots of points in the unit square: the original random points, and the clusterings found by K-means, DBSCAN, and complete link.]
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index: Used to measure the extent to which cluster labels match externally supplied class labels (e.g., entropy).
o Internal Index: Used to measure the goodness of a clustering structure without respect to external information (e.g., SSE).
o Relative Index: Used to compare two different clusterings or clusters, often employing an external or internal index.
Sometimes these are referred to as criteria instead of indices.
Measures of Cluster Validity
Two matrices:
o The proximity matrix.
o The "incidence" matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, 0 otherwise.
Compute the correlation between the two matrices. Since the matrices are symmetric, only the correlation between n(n-1)/2 entries needs to be calculated.
High correlation indicates that points that belong to the same cluster are close to each other.
Not a good measure for some density or contiguity based clusters.
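Concretely, with a distance-based proximity matrix a good clustering yields a strongly negative correlation between the pairwise distances and co-membership. A small sketch over the n(n-1)/2 pairs (illustrative, not the lecture's code):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_validity(points, labels):
    """Correlation between pairwise distances (proximity) and cluster
    co-membership (incidence: 1 if a pair shares a cluster, else 0)."""
    prox, inc = [], []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            prox.append(math.dist(points[i], points[j]))
            inc.append(1.0 if labels[i] == labels[j] else 0.0)
    return pearson(prox, inc)

pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(correlation_validity(pts, [0, 0, 1, 1]))  # strongly negative: tight, separated clusters
```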
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
[Figure: two data sets in the unit square, one with three well-separated clusters and one of uniformly random points.]

Corr = -0.9235 (first data set), Corr = -0.5810 (second data set)
Order the similarity matrix with respect to cluster labels and inspect visually
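The reordering itself is just a permutation of rows and columns by cluster label; a well-separated clustering then shows bright blocks on the diagonal. A sketch (the matrix values here are made up for illustration):

```python
def reorder_by_labels(sim, labels):
    """Permute rows and columns of a similarity matrix so that points
    in the same cluster become adjacent."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    return [[sim[i][j] for j in order] for i in order]

sim = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.1],
    [0.9, 0.1, 1.0],
]
labels = [0, 1, 0]  # points 0 and 2 share a cluster
for row in reorder_by_labels(sim, labels):
    print(row)  # the 0.9 pair moves next to the diagonal
```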
Using Similarity Matrix for Cluster Validation

[Figure: a data set with three well-separated clusters and its similarity matrix with points sorted by cluster label; the matrix shows crisp blocks along the diagonal.]
Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.

[Figure: for the random data set, the scatter plots and reordered similarity matrices of the DBSCAN, K-means, and complete link clusterings show only weak block structure.]

Using Similarity Matrix for Cluster Validation

[Figure: DBSCAN clustering of a more complicated data set with seven labeled clusters, and its similarity matrix.]

Clusters in more complicated figures aren't well separated.

Internal Index: Used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
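Estimating the number of clusters with SSE amounts to running the clustering at several values of K and looking for a knee in the curve. A minimal sketch with a plain K-means (illustrative, not the lecture's code):

```python
import math, random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's K-means; returns the final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)  # keep old centroid if cluster empties
        ]
    return centroids

def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)

pts = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (10, 0), (10, 0.1)]
for k in (1, 2, 3, 4):
    # SSE typically drops sharply until K reaches the natural number of clusters
    print(k, round(sse(pts, kmeans(pts, k)), 3))
```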
Internal Measures SSE

[Figure: an example data set and the plot of SSE versus the number of clusters K, for K from 2 to 30; pronounced knees in the curve suggest natural values of K.]
Internal Measures SSE

SSE curve for a more complicated data set: SSE of clusters found using K-means.

[Figure: a data set with seven labeled clusters and the corresponding SSE curve.]
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Statistical Framework for SSE

Example: compare the SSE of a clustering against the distribution of SSE values obtained by clustering random data.

[Figure: histogram of SSE values (about 0.016 to 0.034) from clustering many sets of random points, shown alongside the original well-separated three-cluster data set.]
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
[Figure: the same two data sets as before, one with three well-separated clusters and one of uniformly random points.]

Corr = -0.9235 (first data set), Corr = -0.5810 (second data set)
Cluster Cohesion: Measures how closely related the objects in a cluster are.
Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters.
Example: Squared Error
Internal Measures Cohesion and Separation
Cohesion is measured by the within-cluster sum of squares (SSE):

    WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2

Separation is measured by the between-cluster sum of squares:

    BSS = \sum_i |C_i| (m - m_i)^2

where m is the overall mean of the data and m_i is the centroid of cluster C_i.
Internal Measures Cohesion and Separation

Example SSE, for the one-dimensional points 1, 2, 4, 5 with overall mean m = 3:

K=1 cluster:
    WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
    BSS = 4 x (3-3)^2 = 0
    Total = 10 + 0 = 10

K=2 clusters, {1, 2} with centroid m1 = 1.5 and {4, 5} with centroid m2 = 4.5:
    WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
    BSS = 2 x (3-1.5)^2 + 2 x (4.5-3)^2 = 9
    Total = 1 + 9 = 10
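The decomposition above (WSS + BSS equals a constant total for the data) is easy to check numerically; a small sketch for one-dimensional data:

```python
def wss_bss(points, clusters):
    """Within- and between-cluster sums of squares for 1-D points.
    WSS + BSS equals the total sum of squares, a constant for the data."""
    m = sum(points) / len(points)          # overall mean
    wss = bss = 0.0
    for c in clusters:
        mi = sum(c) / len(c)               # cluster centroid
        wss += sum((x - mi) ** 2 for x in c)
        bss += len(c) * (m - mi) ** 2
    return wss, bss

pts = [1, 2, 4, 5]
print(wss_bss(pts, [[1, 2, 4, 5]]))    # K=1: (10.0, 0.0)
print(wss_bss(pts, [[1, 2], [4, 5]]))  # K=2: (1.0, 9.0); total is 10 in both cases
```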
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
[Figure: proximity graph; cohesion is the sum of the weights of links within a cluster, separation the sum of the weights of links between clusters.]
Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.

For an individual point i, let a be the average distance of i to the other points in its cluster, and b the smallest average distance of i to the points of any other cluster; then s = (b - a) / max(a, b). Typically s lies between 0 and 1, and the closer to 1 the better.

Can calculate the average silhouette width for a cluster or a clustering.

Internal Measures Silhouette Coefficient
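A sketch of the per-point computation under that formula (illustrative names, assumes distinct points and clusters of size at least two):

```python
import math

def silhouette(i, points, labels):
    """Silhouette coefficient of point i: s = (b - a) / max(a, b), where
    a = mean distance to the other points in i's cluster and
    b = smallest mean distance to the points of any other cluster."""
    own = [p for p, l in zip(points, labels) if l == labels[i]]
    a = sum(math.dist(points[i], p) for p in own if p != points[i]) / (len(own) - 1)
    others = {l for l in labels if l != labels[i]}
    b = min(
        sum(math.dist(points[i], p) for p, l in zip(points, labels) if l == o)
        / labels.count(o)
        for o in others
    )
    return (b - a) / max(a, b)

pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
print(silhouette(0, pts, labels))  # close to 1: tight, well-separated cluster
```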
External Measures of Cluster Validity Entropy and Purity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

(Algorithms for Clustering Data, Jain and Dubes)
Final Comment on Cluster Validity
Hierarchical Clustering MIN

Nested Clusters and Dendrogram

[Figure: six points clustered by single link, with the nested clusters and the corresponding dendrogram (leaf order 3, 6, 2, 5, 4, 1; merge heights between 0.05 and 0.2).]

Strength of MIN

[Figure: original points and the two clusters found.]
• Can handle non-elliptical shapes

Limitations of MIN

[Figure: original points and the two clusters found.]
• Sensitive to noise and outliers

Cluster Similarity MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters.
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index
o Internal Index
o Relative Index

Sometimes these are referred to as criteria instead of indices.
Measures of Cluster Validity
Two matrices:
o Proximity matrix
o "Incidence" matrix: one row and one column per point; an entry is 1 if the associated pair of points belongs to the same cluster, and 0 otherwise

Compute the correlation between the two matrices.

High correlation indicates that points that belong to the same cluster are close to each other.

Not a good measure for some density- or contiguity-based clusters.
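The correlation idea can be sketched as follows. This is an illustrative snippet, not from the slides; it uses distance as the proximity measure, so a good clustering yields a strongly *negative* correlation (same-cluster pairs have incidence 1 and small distance):

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def validity_correlation(points, labels):
    """Correlation between the distance matrix and the incidence matrix."""
    n = len(points)
    dist, incidence = [], []
    for i in range(n):
        for j in range(i + 1, n):  # matrices are symmetric: upper triangle suffices
            d = sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
            dist.append(d)
            incidence.append(1.0 if labels[i] == labels[j] else 0.0)
    return pearson(dist, incidence)


pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
labels = [0, 0, 0, 1, 1, 1]
print(validity_correlation(pts, labels))  # strongly negative for crisp clusters
```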
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
[Figure: the two data sets in the unit square, one with well-separated clusters and one with random points]

Corr = -0.9235 (well-separated clusters)   Corr = -0.5810 (random points)

(The correlations are negative because proximity is measured here by distance: pairs in the same cluster have incidence 1 and small distance.)
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
[Figure: a well-clustered data set and its similarity matrix sorted by cluster label, shown as a heatmap over points 1–100 with similarity from 0 to 1]
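The reordering step can be sketched as below. This is an illustrative snippet; the similarity function 1/(1+d) is an assumption, not from the slides:

```python
def sorted_similarity_matrix(points, labels):
    """Similarity matrix with rows/columns reordered so same-cluster points are adjacent."""
    order = sorted(range(len(points)), key=lambda i: labels[i])

    def sim(i, j):
        d = sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
        return 1.0 / (1.0 + d)  # one simple way to turn distance into similarity

    return [[sim(i, j) for j in order] for i in order]


pts = [(0, 0), (9, 9), (0, 1), (9, 10)]
labels = [0, 1, 0, 1]
m = sorted_similarity_matrix(pts, labels)
for row in m:
    print(["%.2f" % v for v in row])
# After sorting, the large same-cluster similarities form 2x2 blocks
# along the diagonal; crisp clusters show up as bright diagonal blocks.
```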
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: similarity-matrix heatmap and scatter plot for DBSCAN on the random data]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: similarity-matrix heatmap and scatter plot for K-means on the random data]

Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp

[Figure: similarity-matrix heatmap and scatter plot for complete link on the random data]
Using Similarity Matrix for Cluster Validation
[Figure: similarity matrix of the seven clusters found by DBSCAN in a more complicated data set]

Clusters in more complicated figures aren't well separated.

Internal Index: Used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
[Figure: SSE versus number of clusters K (K = 2 to 30) for a data set containing 10 well-separated clusters, alongside a scatter plot of that data set]
Internal Measures SSE
SSE curve for a more complicated data set
[Figure: the more complicated data set with clusters 1–7; SSE of clusters found using K-means]
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
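One simple version of this framework can be sketched as below (an illustrative snippet, not from the slides): compare the observed within-cluster SSE against its distribution under random labelings of the same points.

```python
import random


def wss(points, labels, k):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    total = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if not members:
            continue
        centroid = tuple(sum(xs) / len(members) for xs in zip(*members))
        total += sum(sum((a - b) ** 2 for a, b in zip(p, centroid)) for p in members)
    return total


rng = random.Random(0)
# Two tight blobs with the "true" labeling.
pts = ([(rng.gauss(0, 0.2), rng.gauss(0, 0.2)) for _ in range(20)] +
       [(rng.gauss(5, 5 * 0 + 0.2), rng.gauss(5, 0.2)) for _ in range(20)])
true_labels = [0] * 20 + [1] * 20

observed = wss(pts, true_labels, 2)
# Null distribution: WSS of random labelings of the same points.
null = []
for _ in range(200):
    shuffled = true_labels[:]
    rng.shuffle(shuffled)
    null.append(wss(pts, shuffled, 2))

print(observed, min(null))  # observed WSS is far below every random-labeling WSS
```

If the observed value is far outside the null distribution, the clustering structure is unlikely to be an artifact of randomness.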
Framework for Cluster Validity
Example
Statistical Framework for SSE
[Figure: histogram of SSE values obtained by clustering sets of random points (SSE roughly between 0.016 and 0.034), shown alongside the clustered data set whose observed SSE is compared against this distribution]
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
[Figure: the two data sets in the unit square, one with well-separated clusters and one with random points]

Corr = -0.9235   Corr = -0.5810
Cluster Cohesion: Measures how closely related the objects in a cluster are.

Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters.
Example: Squared Error
Internal Measures Cohesion and Separation
Cohesion is measured by the within-cluster sum of squares (SSE):

    WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2

Separation is measured by the between-cluster sum of squares:

    BSS = \sum_i |C_i| (m - m_i)^2

where m_i is the centroid of cluster C_i, |C_i| is its size, and m is the overall mean.
Internal Measures Cohesion and Separation
Example: SSE

[Figure: points 1, 2, 4, 5 on a line, with cluster centroids m1 = 1.5 and m2 = 4.5 and overall mean m = 3]

K=2 clusters:
    WSS = (1 - 1.5)^2 + (2 - 1.5)^2 + (4 - 4.5)^2 + (5 - 4.5)^2 = 1
    BSS = 2 \times (3 - 1.5)^2 + 2 \times (4.5 - 3)^2 = 9
    Total = 1 + 9 = 10

K=1 cluster:
    WSS = (1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2 = 10
    BSS = 4 \times (3 - 3)^2 = 0
    Total = 10 + 0 = 10
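These numbers can be checked in a few lines; the total (WSS + BSS) always equals the total sum of squares about the overall mean, which is why it is 10 in both cases. A sketch for 1-D points:

```python
def wss_bss(points, clusters):
    """Return (WSS, BSS) for 1-D points partitioned into the given clusters."""
    m = sum(points) / len(points)            # overall mean
    wss = bss = 0.0
    for c in clusters:
        mi = sum(c) / len(c)                 # cluster centroid
        wss += sum((x - mi) ** 2 for x in c)
        bss += len(c) * (m - mi) ** 2
    return wss, bss


print(wss_bss([1, 2, 4, 5], [[1, 2, 4, 5]]))   # K=1: (10.0, 0.0)
print(wss_bss([1, 2, 4, 5], [[1, 2], [4, 5]])) # K=2: (1.0, 9.0)
```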
A proximity-graph-based approach can also be used for cohesion and separation.

Internal Measures Cohesion and Separation

[Figure: proximity graph with intra-cluster edges illustrating cohesion and inter-cluster edges illustrating separation]
Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points as well as for clusters and clusterings.

For an individual point i:
o Calculate a = average distance of i to the points in its own cluster
o Calculate b = minimum, over the other clusters, of the average distance of i to the points in that cluster
o The silhouette coefficient is s = (b - a) / max(a, b); the closer s is to 1, the better

Can calculate the average silhouette width for a cluster or a clustering.

Internal Measures Silhouette Coefficient
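A minimal sketch of the per-point computation, assuming Euclidean distance and at least two points in every cluster:

```python
def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def silhouette(points, labels, i):
    """Silhouette coefficient s = (b - a) / max(a, b) for point i."""
    # a: average distance to the other points in i's own cluster.
    own = [p for j, (p, l) in enumerate(zip(points, labels))
           if l == labels[i] and j != i]
    a = sum(dist(points[i], p) for p in own) / len(own)
    # b: smallest average distance to the points of any other cluster.
    b = min(sum(dist(points[i], p) for p, l in zip(points, labels) if l == ol)
            / labels.count(ol)
            for ol in set(labels) - {labels[i]})
    return (b - a) / max(a, b)


pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels = [0, 0, 1, 1]
avg = sum(silhouette(pts, labels, i) for i in range(len(pts))) / len(pts)
print(avg)  # close to 1: tight, well-separated clusters
```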
External Measures of Cluster Validity Entropy and Purity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

(Algorithms for Clustering Data, Jain and Dubes)
Final Comment on Cluster Validity
Hierarchical Clustering MIN
Nested Clusters Dendrogram
1
2 3
4
5
6
1 2
3
4
5
3 6 2 5 4 10
005
01
015
02
Strength of MIN
Original Points Two Clusters
bull Can handle non-elliptical shapes
Limitations of MIN
Original Points Two Clusters
bull Sensitive to noise and outliers
Cluster Similarity MAX or Complete Linkage
Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
Sheet3
Hierarchical Clustering MIN
Nested Clusters Dendrogram
1
2 3
4
5
6
1 2
3
4
5
3 6 2 5 4 10
005
01
015
02
Strength of MIN
Original Points Two Clusters
bull Can handle non-elliptical shapes
Limitations of MIN
Original Points Two Clusters
bull Sensitive to noise and outliers
Cluster Similarity MAX or Complete Linkage
Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
Hierarchical Clustering MIN
Nested Clusters Dendrogram
1
2 3
4
5
6
1 2
3
4
5
3 6 2 5 4 10
005
01
015
02
Strength of MIN
Original Points Two Clusters
bull Can handle non-elliptical shapes
Limitations of MIN
Original Points Two Clusters
bull Sensitive to noise and outliers
Cluster Similarity MAX or Complete Linkage
Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures: SSE
[Figure: x-y scatter of the data set, and the curve of SSE versus the number of clusters K, for K from 2 to 30]

Internal Measures: SSE
SSE curve for a more complicated data set
[Figure: the data set with clusters labeled 1-7; SSE of clusters found using K-means]
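The "estimate the number of clusters" use can be sketched with scikit-learn, whose `inertia_` attribute is exactly the SSE of a fitted K-means model. The five-blob data set and all parameters below are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Invented data set with 5 well-separated clusters
centers = [(0, 0), (5, 0), (0, 5), (5, 5), (10, 2)]
points = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in centers])

# inertia_ is the SSE of the fitted clustering
sse = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(points).inertia_
       for k in range(1, 11)}
# SSE always decreases as K grows, but the drop flattens once K passes
# the true number of clusters -- the "elbow" used to estimate K.
```

Plotting `sse` against K reproduces the elbow curve shown on the slide: a steep decline up to the true cluster count, then a nearly flat tail.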
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses, a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
[Figure: histogram of SSE values (SSE on the horizontal axis, count on the vertical axis) and an x-y scatter plot of the data points]
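The statistical framework can be illustrated by comparing the SSE of a data set that has real structure against the distribution of SSE values that K-means produces on random data. This is a rough sketch under invented assumptions: the blob data set, sample sizes, and the number of random trials are all arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)

def kmeans_sse(points, k=3):
    # SSE of the best K-means clustering found over n_init restarts
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(points).inertia_

# SSE of a data set with genuine structure (three invented blobs)
blobs = np.vstack([rng.normal(c, 0.04, size=(30, 2))
                   for c in [(0.2, 0.2), (0.5, 0.8), (0.8, 0.3)]])
observed = kmeans_sse(blobs)

# Reference distribution: SSE of K-means on uniformly random point sets
# of the same size, repeated many times
random_sses = [kmeans_sse(rng.uniform(0, 1, size=(90, 2)))
               for _ in range(30)]
# If observed falls far below the entire reference distribution, the
# cluster structure is unlikely to be an artifact of the algorithm.
```

This mirrors the histogram on the slide: the observed SSE sits far to the left of the SSE values obtained from random data, so the structure is judged significant.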
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion: measures how closely related the objects in a cluster are
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters
Example: Squared Error
Internal Measures: Cohesion and Separation
Cohesion is measured by the within-cluster sum of squares: WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2
Separation is measured by the between-cluster sum of squares: BSS = \sum_i |C_i| (m - m_i)^2
(here m_i is the centroid of cluster C_i, |C_i| its size, and m the overall mean of the data)
Internal Measures: Cohesion and Separation
Example: SSE
[Number line 1-5: data points at 1, 2, 4, 5; overall centroid m = 3; for K = 2, cluster centroids m1 = 1.5 and m2 = 4.5]
K = 2 clusters:
WSS = (1 - 1.5)² + (2 - 1.5)² + (4 - 4.5)² + (5 - 4.5)² = 1
BSS = 2 × (3 - 1.5)² + 2 × (4.5 - 3)² = 9
Total = 1 + 9 = 10
K = 1 cluster:
WSS = (1 - 3)² + (2 - 3)² + (4 - 3)² + (5 - 3)² = 10
BSS = 4 × (3 - 3)² = 0
Total = 10 + 0 = 10
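The worked example can be checked mechanically. The helper below (the function name is ours, not from the slides) computes WSS and BSS for any partition of the points:

```python
import numpy as np

# The slide's 1-D example: data points 1, 2, 4, 5; overall mean m = 3
x = np.array([1.0, 2.0, 4.0, 5.0])
m = x.mean()

def wss_bss(clusters):
    """WSS and BSS for a partition of the points into clusters."""
    wss = sum(((c - c.mean()) ** 2).sum() for c in clusters)
    bss = sum(len(c) * (m - c.mean()) ** 2 for c in clusters)
    return wss, bss

one = wss_bss([x])             # K = 1: WSS = 10, BSS = 0
two = wss_bss([x[:2], x[2:]])  # K = 2: WSS = 1,  BSS = 9
# WSS + BSS is the total sum of squares, so it is the same (10) for
# every partition -- minimizing WSS and maximizing BSS are equivalent.
```

This reproduces the numbers above and makes the slide's implicit point explicit: the total WSS + BSS is constant across partitions.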
A proximity-graph-based approach can also be used for cohesion and separation
Internal Measures: Cohesion and Separation
[Figure: proximity graph; cohesion is the sum of the weights of edges within a cluster, separation the sum of the weights of edges between clusters]
Silhouette Coefficient: combines ideas of both cohesion and separation, but for individual points as well as clusters and clusterings
For an individual point i:
  a = average distance of i to the points in its cluster
  b = minimum, over the other clusters, of the average distance of i to the points in that cluster
  s = (b - a) / max(a, b)
s lies between -1 and 1; the closer to 1, the better
Can calculate the average silhouette width for a cluster or a clustering
Internal Measures: Silhouette Coefficient
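Both the per-point values and the average silhouette width are available in scikit-learn as `silhouette_samples` and `silhouette_score`. A small sketch on an invented two-blob data set:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(4)
# Invented two-blob data set
points = np.vstack([rng.normal(c, 0.05, size=(30, 2))
                    for c in [(0.2, 0.2), (0.8, 0.8)]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)

per_point = silhouette_samples(points, labels)  # one s value per point i
avg_width = silhouette_score(points, labels)    # mean over all points
# Well-separated clusters give average silhouette widths close to 1.
```

Averaging `per_point` over the members of one cluster gives that cluster's silhouette width; `avg_width` is the average for the whole clustering.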
External Measures of Cluster Validity: Entropy and Purity
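The slide title names entropy and purity; minimal reference implementations of both follow. These helper functions are our own sketch (assuming integer class labels), not code from the slides.

```python
import numpy as np

def purity(labels_true, labels_pred):
    """Weighted fraction of each cluster taken up by its majority class."""
    total = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        total += np.bincount(members).max()
    return total / len(labels_true)

def cluster_entropy(labels_true, labels_pred):
    """Weighted average class entropy (in bits) over clusters; 0 is best."""
    n, h = len(labels_true), 0.0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        p = np.bincount(members) / len(members)
        p = p[p > 0]
        h += (len(members) / n) * -(p * np.log2(p)).sum()
    return h
```

A clustering that reproduces the class labels exactly has purity 1 and entropy 0; mixing classes within clusters lowers purity and raises entropy.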
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
(Algorithms for Clustering Data, Jain and Dubes)
Final Comment on Cluster Validity
Slide outline:
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters: Well-Separated
Types of Clusters: Center-Based
Types of Clusters: Contiguity-Based
Types of Clusters: Density-Based
Types of Clusters: Conceptual Clusters
Types of Clusters: Objective Function
Types of Clusters: Objective Function …
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering – Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids …
Importance of Choosing Initial Centroids …
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means: Differing Sizes
Limitations of K-means: Differing Density
Limitations of K-means: Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity: MIN or Single Link
Hierarchical Clustering: MIN
Strength of MIN
Limitations of MIN
Cluster Similarity: MAX or Complete Linkage
Hierarchical Clustering: MAX
Strength of MAX
Limitations of MAX
Cluster Similarity: Group Average
Hierarchical Clustering: Group Average
Hierarchical Clustering: Group Average
Cluster Similarity: Ward's Method
Hierarchical Clustering: Comparison
Hierarchical Clustering: Time and Space Requirements
Hierarchical Clustering: Problems and Limitations
MST: Divisive Hierarchical Clustering
MST: Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures: SSE
Internal Measures: SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures: Cohesion and Separation
Internal Measures: Cohesion and Separation
Internal Measures: Cohesion and Separation
Internal Measures: Silhouette Coefficient
External Measures of Cluster Validity: Entropy and Purity
Final Comment on Cluster Validity
Similarity matrix for five items I1-I5:

     I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00
Strength of MIN
[Figure: original points and the two clusters found]
• Can handle non-elliptical shapes
Limitations of MIN
[Figure: original points and the two clusters found]
• Sensitive to noise and outliers
Cluster Similarity: MAX or Complete Linkage
Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
MST: Divisive Hierarchical Clustering
Use a minimum spanning tree (MST) for constructing the hierarchy of clusters
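One way to realize this with SciPy: build the MST of the complete distance graph, then repeatedly delete the largest remaining edge, so each deletion splits one cluster in two. The three-blob data set and the choice of k are invented for the sketch.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(5)
# Invented three-blob data set
points = np.vstack([rng.normal(c, 0.05, size=(20, 2))
                    for c in [(0.1, 0.1), (0.9, 0.9), (0.1, 0.9)]])

# MST of the complete distance graph; cutting its k-1 largest edges
# yields a k-cluster partition (the divisive hierarchy, top down)
mst = minimum_spanning_tree(pairwise_distances(points)).toarray()
k = 3
for _ in range(k - 1):
    i, j = np.unravel_index(np.argmax(mst), mst.shape)
    mst[i, j] = 0.0  # delete the largest remaining edge
n_clusters, labels = connected_components(mst, directed=False)
```

Recording the partition after each deletion, rather than stopping at a fixed k, gives the full divisive hierarchy.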
Cluster Validity
For supervised classification, we have a variety of measures to evaluate how good our model is
For cluster analysis, the analogous question is: how do we evaluate the "goodness" of the resulting clusters? But "clusters are in the eye of the beholder"! Then why do we want to evaluate them?
Clusters found in Random Data
[Figure: four x-y scatter plots (axes 0 to 1) of the same random points: the original random points, and the clusters found by K-means, DBSCAN, and complete link]
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data)
4. Comparing the results of two different sets of cluster analyses to determine which is better
5. Determining the 'correct' number of clusters
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters
Measures of Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Two matrices: the proximity matrix and the incidence matrix (an entry of the incidence matrix is 1 if the corresponding pair of points belongs to the same cluster, and 0 otherwise)
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density-based or contiguity-based clusters
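A hedged sketch of this computation, assuming Euclidean distance as the proximity and scikit-learn's K-means; the data set is made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(1)
# Invented data set: three tight blobs, i.e., real cluster structure
points = np.vstack([rng.normal(c, 0.04, size=(25, 2))
                    for c in [(0.2, 0.2), (0.5, 0.8), (0.8, 0.3)]])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)

proximity = pairwise_distances(points)                         # distances
incidence = (labels[:, None] == labels[None, :]).astype(float)

# Correlate the strict upper triangles: both matrices are symmetric
# and their diagonals carry no information
iu = np.triu_indices(len(points), k=1)
corr = np.corrcoef(proximity[iu], incidence[iu])[0, 1]
# With distance as the proximity, a good clustering yields a strongly
# negative correlation, as in the Corr values quoted on the slides.
```

The sign convention explains the negative Corr values above: same-cluster pairs (incidence 1) have small distances, so distance and incidence are anti-correlated; with a similarity matrix instead, the correlation would be positive.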
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
Limitations of MIN
Original Points Two Clusters
bull Sensitive to noise and outliers
Cluster Similarity MAX or Complete Linkage
Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
Cluster Similarity MAX or Complete Linkage
Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Ward's Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Similarity matrix for five items I1–I5 (used in the hierarchical-clustering examples):

        I1    I2    I3    I4    I5
  I1   1.00  0.90  0.10  0.65  0.20
  I2   0.90  1.00  0.70  0.60  0.50
  I3   0.10  0.70  1.00  0.40  0.30
  I4   0.65  0.60  0.40  1.00  0.80
  I5   0.20  0.50  0.30  0.80  1.00
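This matrix is the running example for the agglomerative-clustering slides. A pure-Python sketch of complete-link (MAX) agglomeration on it (our own minimal implementation, not the lecture's code): at each step, merge the pair of clusters whose least-similar pair of points is most similar.

```python
# Complete-link (MAX) agglomerative clustering on the I1..I5 similarity matrix.
# Cluster-to-cluster similarity = MINIMUM pairwise similarity (farthest pair).

SIM = {
    ("I1", "I2"): 0.90, ("I1", "I3"): 0.10, ("I1", "I4"): 0.65, ("I1", "I5"): 0.20,
    ("I2", "I3"): 0.70, ("I2", "I4"): 0.60, ("I2", "I5"): 0.50,
    ("I3", "I4"): 0.40, ("I3", "I5"): 0.30, ("I4", "I5"): 0.80,
}

def sim(a, b):
    # Matrix is symmetric; look up whichever key order is stored.
    return SIM[(a, b)] if (a, b) in SIM else SIM[(b, a)]

def complete_link_sim(c1, c2):
    return min(sim(a, b) for a in c1 for b in c2)

def agglomerate(items):
    clusters = [frozenset([i]) for i in items]
    merges = []
    while len(clusters) > 1:
        s, a, b = max(((complete_link_sim(a, b), a, b)
                       for i, a in enumerate(clusters) for b in clusters[i + 1:]),
                      key=lambda t: t[0])
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((sorted(a | b), round(s, 2)))
    return merges

for merge in agglomerate(["I1", "I2", "I3", "I4", "I5"]):
    print(merge)
```

The merge order is {I1, I2} at 0.90, {I4, I5} at 0.80, then {I3, I4, I5} at 0.30, and finally everything at 0.10.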
Hierarchical Clustering MAX
Nested Clusters Dendrogram
[Figure: nested clusters for points 1–6 and the corresponding dendrogram; merge heights run from about 0.05 to 0.40, with leaf order 3, 6, 4, 1, 2, 5.]
Strength of MAX
Original Points Two Clusters
• Less susceptible to noise and outliers
Limitations of MAX
Original Points Two Clusters
• Tends to break large clusters
• Biased towards globular clusters
Cluster Similarity Group Average
Proximity of two clusters is the average of pairwise proximity between points in the two clusters:
proximity(C_i, C_j) = ( Σ_{p∈C_i, q∈C_j} proximity(p, q) ) / ( |C_i| × |C_j| )
Need to use average connectivity for scalability since total proximity favors large clusters
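The group-average rule can be sketched in a couple of lines (distances here, but the same form works for similarities; the helper name is ours):

```python
# Group-average inter-cluster proximity: mean of all pairwise distances
# between the two clusters, illustrated on 1-D points.

def group_average(c1, c2, dist=lambda p, q: abs(p - q)):
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

print(group_average([1, 2], [4, 5]))  # (3 + 4 + 2 + 3) / 4 = 3.0
```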
MST Divisive Hierarchical Clustering
Use MST for constructing a hierarchy of clusters: build a minimum spanning tree over the points, then repeatedly break the longest remaining edge; each break splits one cluster in two.
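A hedged sketch of the MST-divisive idea (Prim's algorithm for the tree, then cut the longest edges; cutting k − 1 edges yields k clusters; the toy points and names are ours):

```python
# Divisive clustering via a minimum spanning tree: build the MST,
# then drop the longest edges to split the data into clusters.

def mst_edges(points, dist=lambda p, q: abs(p - q)):
    """Prim's algorithm; returns MST edges as (distance, i, j) tuples."""
    in_tree, out = {0}, set(range(1, len(points)))
    edges = []
    while out:
        d, i, j = min((dist(points[i], points[j]), i, j)
                      for i in in_tree for j in out)
        edges.append((d, i, j))
        in_tree.add(j)
        out.remove(j)
    return edges

def cut(points, k):
    """k clusters = keep the len(points)-k shortest MST edges, take components."""
    edges = sorted(mst_edges(points))[:len(points) - k]
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for _, i, j in edges:          # union the endpoints of each kept edge
        parent[find(i)] = find(j)
    groups = {}
    for idx in range(len(points)):
        groups.setdefault(find(idx), []).append(points[idx])
    return sorted(groups.values())

print(cut([1, 2, 4, 5, 9], 2))
```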
Cluster Validity
For supervised classification, we have a variety of measures to evaluate how good our model is.
For cluster analysis, the analogous question is: how do we evaluate the "goodness" of the resulting clusters? But "clusters are in the eye of the beholder"! Then why do we want to evaluate them?
Clusters found in Random Data
[Figure: 100 uniformly random points in the unit square, together with the "clusters" that K-means, DBSCAN, and complete link each impose on them.]
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index: used to measure the extent to which cluster labels match externally supplied class labels (e.g., entropy)
o Internal Index: used to measure the goodness of a clustering structure without respect to external information (e.g., SSE)
o Relative Index: used to compare two different clusterings or clusters
Sometimes these are referred to as criteria instead of indices.
Measures of Cluster Validity
Two matrices:
o Proximity matrix
o "Incidence" matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, and 0 otherwise
Compute the correlation between the two matrices; since both are symmetric, only the n(n − 1)/2 entries above the diagonal need to be used.
High correlation indicates that points that belong to the same cluster are close to each other.
Not a good measure for some density- or contiguity-based clusters.
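A sketch of the incidence/proximity correlation check (plain Python Pearson correlation over the upper-triangle entries; the 1-D data is made up):

```python
# Correlation between the "incidence" matrix (1 if same cluster, else 0)
# and the proximity (distance) matrix, over the n(n-1)/2 upper-triangle entries.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def incidence_proximity_corr(points, labels, dist=lambda p, q: abs(p - q)):
    inc, prox = [], []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            inc.append(1.0 if labels[i] == labels[j] else 0.0)
            prox.append(dist(points[i], points[j]))
    return pearson(inc, prox)

# Tight, well-separated clusters give a strongly negative correlation,
# because same-cluster pairs have the smallest distances.
print(incidence_proximity_corr([1, 2, 4, 5], [0, 0, 1, 1]))
```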
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
[Figure: the two data sets: three well-separated clusters, and uniformly random points in the unit square.]
Corr = −0.9235 (well-separated clusters)   Corr = −0.5810 (random points)
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
[Figure: three well-separated clusters and their 100×100 similarity matrix sorted by cluster label, showing three bright diagonal blocks.]
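Sorting the rows and columns of the similarity matrix by cluster label, as described above, can be sketched as follows (toy 4×4 matrix, our own labels):

```python
# Reorder a similarity matrix so that points in the same cluster are adjacent;
# a good clustering then shows bright diagonal blocks.

def reorder(sim, labels):
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    return [[sim[i][j] for j in order] for i in order]

labels = [1, 0, 1, 0]
sim = [[1.0, 0.1, 0.9, 0.2],
       [0.1, 1.0, 0.2, 0.8],
       [0.9, 0.2, 1.0, 0.1],
       [0.2, 0.8, 0.1, 1.0]]

for row in reorder(sim, labels):
    print(row)  # two 2x2 high-similarity blocks appear on the diagonal
```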
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: the similarity matrix of the random points sorted by DBSCAN cluster labels shows only weak diagonal blocks.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: the similarity matrix of the random points sorted by K-means cluster labels shows only weak diagonal blocks.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: the similarity matrix of the random points sorted by complete-link cluster labels shows only weak diagonal blocks.]
Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN finds seven clusters in a more complicated data set; the corresponding sorted similarity matrix shows block structure that is not sharply separated.]
Clusters in more complicated figures aren't well separated.
Internal Index: used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
[Figure: SSE as a function of K (K = 2 to 30) for a data set with ten natural clusters; the curve drops sharply up to K = 10.]
Internal Measures SSE
SSE curve for a more complicated data set
[Figure: the seven-cluster data set from the DBSCAN example, with the SSE of clusters found using K-means.]
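Using the SSE curve to estimate the number of clusters can be sketched with a tiny K-means (our own minimal implementation, with deterministic initialization from the first K points, for illustration only):

```python
# SSE as a function of K for a tiny 1-D data set with two obvious clusters;
# the sharp drop from K=1 to K=2 and the small gain afterwards suggest K = 2.

def kmeans_sse(points, k, iters=20):
    centroids = points[:k]  # naive deterministic init
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda c: (x - centroids[c]) ** 2)
            clusters[nearest].append(x)
        # Recompute centroids; keep the old one if a cluster went empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sum((x - centroids[c]) ** 2
               for c, cl in enumerate(clusters) for x in cl)

data = [1, 2, 3, 11, 12, 13]
for k in (1, 2, 3):
    print(k, kmeans_sse(data, k))
```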
Need a framework to interpret any measure: for example, if our measure of evaluation has the value 10, is that good, fair, or poor?
Statistics provide a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data.
For comparing the results of two different sets of cluster analyses, a framework is less necessary; the measures themselves provide a basis for comparison.
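The "atypicality" argument can be sketched by comparing the SSE of a clustering of real data against the SSEs obtained when the same procedure is applied to random data (plain Python with a fixed seed; the splitting rule and thresholds are illustrative, not the lecture's):

```python
import random

# Statistical framework for SSE: how often does random data, "clustered"
# the same way, achieve an SSE as low as the one observed on real data?
# For simplicity we cluster 1-D data by splitting its sorted values in half.

def sse_two_halves(points):
    pts = sorted(points)
    half = len(pts) // 2
    sse = 0.0
    for c in (pts[:half], pts[half:]):
        m = sum(c) / len(c)
        sse += sum((x - m) ** 2 for x in c)
    return sse

random.seed(0)
observed = sse_two_halves([1.0, 1.1, 1.2, 9.0, 9.1, 9.2])  # real structure
random_sses = [sse_two_halves([random.uniform(0, 10) for _ in range(6)])
               for _ in range(500)]
frac_lower = sum(s <= observed for s in random_sses) / 500
print(observed)    # tiny SSE for the structured data
print(frac_lower)  # fraction of random trials doing at least as well
```

An observed SSE far below anything the random trials produce is evidence of valid (non-random) structure.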
Framework for Cluster Validity
Example
Statistical Framework for SSE
[Figure: histogram of the SSE of three clusters found in many sets of random points (values roughly 0.016 to 0.034), alongside the three-well-separated-cluster data set, whose much lower SSE indicates non-random structure.]
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
[Figure: the same two data sets: well-separated clusters and random points.]
Corr = −0.9235 (well-separated clusters)   Corr = −0.5810 (random points)
Internal Measures Cohesion and Separation
Cluster Cohesion: measures how closely related the objects in a cluster are.
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.
Example: Squared Error
Cohesion is measured by the within-cluster sum of squares: WSS = Σ_i Σ_{x ∈ C_i} (x − m_i)^2
Separation is measured by the between-cluster sum of squares: BSS = Σ_i |C_i| (m − m_i)^2
where m_i is the centroid of cluster C_i, |C_i| its size, and m the overall mean.
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
Sheet2
Sheet3
Hierarchical Clustering MAX
Nested Clusters Dendrogram
3 6 4 1 2 50
005
01
015
02
025
03
035
04
1
2
3
4
5
6 1
2 5
3
4
Strength of MAX
Original Points Two Clusters
bull Less susceptible to noise and outliers
Limitations of MAX
Original Points Two Clusters
bullTends to break large clusters
bullBiased towards globular clusters
Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters
Need to use average connectivity for scalability since total proximity favors large clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
Sheet3
Hierarchical Clustering MAX
Nested Clusters Dendrogram
3 6 4 1 2 50
005
01
015
02
025
03
035
04
1
2
3
4
5
6 1
2 5
3
4
Strength of MAX
Original Points Two Clusters
bull Less susceptible to noise and outliers
Limitations of MAX
Original Points Two Clusters
bullTends to break large clusters
bullBiased towards globular clusters
Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters
Need to use average connectivity for scalability since total proximity favors large clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Similarity matrix for points I1-I5:

      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00
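As a quick illustration (my own sketch, not from the slides): the first agglomerative merge on this matrix is found by taking the most similar pair off the diagonal. Note these are similarities, not distances, so the "closest" pair is the one with the largest entry.

```python
import numpy as np

# First agglomerative merge step on the similarity matrix above.
sim = np.array([
    [1.00, 0.90, 0.10, 0.65, 0.20],
    [0.90, 1.00, 0.70, 0.60, 0.50],
    [0.10, 0.70, 1.00, 0.40, 0.30],
    [0.65, 0.60, 0.40, 1.00, 0.80],
    [0.20, 0.50, 0.30, 0.80, 1.00],
])
np.fill_diagonal(sim, 0)                    # ignore self-similarity
i, j = np.unravel_index(sim.argmax(), sim.shape)
print(f"merge I{i + 1} and I{j + 1}")       # merge I1 and I2 (similarity 0.90)
```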
Hierarchical Clustering: MAX
Nested Clusters and Dendrogram
[Figure: complete-link (MAX) clustering of six points; nested clusters shown beside the dendrogram, with merge heights from 0.05 to 0.4 and leaf order 3, 6, 4, 1, 2, 5]
Strength of MAX
[Figure: original points and the two clusters found]
• Less susceptible to noise and outliers
Limitations of MAX
[Figure: original points and the two clusters found]
• Tends to break large clusters
• Biased towards globular clusters
Cluster Similarity: Group Average
Proximity of two clusters is the average of pairwise proximity between points in the two clusters:
proximity(Ci, Cj) = ( Σ over pi ∈ Ci, pj ∈ Cj of proximity(pi, pj) ) / ( |Ci| × |Cj| )
Need to use average connectivity for scalability, since total proximity favors large clusters.
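The definition above translates directly to code. A minimal sketch (hypothetical helper names, 1-D points for illustration):

```python
# Group-average proximity: the average of all pairwise proximities
# between points of the two clusters.
def group_average(ci, cj, dist):
    total = sum(dist(p, q) for p in ci for q in cj)
    return total / (len(ci) * len(cj))

dist = lambda p, q: abs(p - q)        # toy 1-D distance
print(group_average([1.0, 2.0], [4.0, 5.0], dist))  # (3+4+2+3)/4 = 3.0
```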
MST: Divisive Hierarchical Clustering
Use MST for constructing hierarchy of clusters.
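The MST-based divisive idea can be sketched as: build the minimum spanning tree of the points, then repeatedly break the remaining edge with the largest distance; each break splits one cluster in two. A toy sketch (Prim's algorithm on 1-D points; my own illustration, not the lecture's code):

```python
import numpy as np

points = np.array([0.0, 0.1, 0.2, 1.0, 1.1, 2.0])   # toy 1-D data
n = len(points)
dist = np.abs(points[:, None] - points[None, :])

# Prim's algorithm: grow the MST one edge at a time.
in_tree, edges = {0}, []
while len(in_tree) < n:
    i, j = min(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
               key=lambda e: dist[e])
    edges.append((i, j))
    in_tree.add(j)

# Divisive step: dropping the k-1 largest MST edges leaves k clusters
# (the connected components of the remaining edges).
k = 3
edges.sort(key=lambda e: dist[e])
kept = edges[: n - k]

parent = list(range(n))
def find(x):                     # union-find to extract components
    while parent[x] != x:
        x = parent[x]
    return x
for i, j in kept:
    parent[find(i)] = find(j)
clusters = {}
for x in range(n):
    clusters.setdefault(find(x), []).append(x)
print(sorted(map(sorted, clusters.values())))  # [[0, 1, 2], [3, 4], [5]]
```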
Cluster Validity
For supervised classification, we have a variety of measures to evaluate how good our model is.
For cluster analysis, the analogous question is: how do we evaluate the "goodness" of the resulting clusters? But "clusters are in the eye of the beholder." Then why do we want to evaluate them?
Clusters found in Random Data
[Figure: the same set of uniformly random points (x, y in [0, 1]) shown four ways: Random Points, K-means, DBSCAN, and Complete Link]
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information. (Uses only the data.)
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Measures of Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices.
Measuring Cluster Validity Via Correlation
Two matrices: the proximity matrix, and the "incidence" matrix, whose entry for a pair of points is 1 if the two points belong to the same cluster and 0 otherwise.
Compute the correlation between the two matrices.
High correlation indicates that points that belong to the same cluster are close to each other.
Not a good measure for some density- or contiguity-based clusters.
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
[Figure: two scatter plots (x, y in [0, 1])]
Corr = -0.9235    Corr = -0.5810
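A sketch of the computation behind numbers like these (my own toy data, not the slide's): build the distance (proximity) and incidence matrices, then correlate their upper-triangular entries. With distances as proximities, a good clustering gives a strongly negative correlation, since same-cluster pairs have small distances.

```python
import numpy as np

# Toy data: two tight, well-separated clusters.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.2, 0.05, (20, 2)),
                    rng.normal(0.8, 0.05, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)

# Proximity matrix: pairwise Euclidean distances.
prox = np.sqrt(((points[:, None] - points[None, :]) ** 2).sum(-1))

# "Incidence" matrix: 1 if the pair belongs to the same cluster, else 0.
inc = (labels[:, None] == labels[None, :]).astype(float)

# The matrices are symmetric, so correlate only entries above the diagonal.
iu = np.triu_indices(len(points), k=1)
corr = np.corrcoef(prox[iu], inc[iu])[0, 1]
print(corr < -0.8)  # strongly negative: same-cluster pairs are close
```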
Using Similarity Matrix for Cluster Validation
Order the similarity matrix with respect to cluster labels and inspect visually.
[Figure: a scatter plot (x, y in [0, 1]) and its 100 × 100 point-to-point similarity matrix, sorted by cluster label; similarity scale 0 to 1]
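The reordering step itself is just a permutation of rows and columns. A small sketch (toy data; assumes similarity = 1 - distance / max distance, which is one convenient choice, not the slide's definition):

```python
import numpy as np

# Three clusters whose points arrive in arbitrary order.
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(c, 0.05, (10, 2)) for c in (0.2, 0.5, 0.8)])
labels = np.repeat([0, 1, 2], 10)
shuffle = rng.permutation(30)
points, labels = points[shuffle], labels[shuffle]

d = np.sqrt(((points[:, None] - points[None, :]) ** 2).sum(-1))
sim = 1 - d / d.max()                       # similarity in [0, 1]

order = np.argsort(labels, kind="stable")   # group rows/columns by cluster
sim_sorted = sim[np.ix_(order, order)]

# Crisp clusters show up as bright diagonal blocks: within-block similarity
# is clearly higher than similarity across blocks.
within = sim_sorted[:10, :10].mean()
between = sim_sorted[:10, 10:].mean()
print(within > between)
```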
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: DBSCAN clustering of random points (x, y in [0, 1]) and its similarity matrix, sorted by cluster label]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: K-means clustering of random points and its similarity matrix, sorted by cluster label]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: complete-link clustering of random points and its similarity matrix, sorted by cluster label]
Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN clustering of a more complicated data set, with clusters labeled 1-7, and its similarity matrix of roughly 3000 points, sorted by cluster label]
Clusters in more complicated figures aren't well separated.
Internal Index: used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE).
SSE can also be used to estimate the number of clusters.
Internal Measures: SSE
[Figure: example data set and its SSE-vs-K curve for K = 2 to 30]
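Estimating the number of clusters from the SSE curve can be sketched as follows. This uses a toy K-means with farthest-first initialization (my own illustration, not the lecture's code); SSE always decreases as K grows, but the drop flattens past the natural number of clusters, producing the "elbow".

```python
import numpy as np

rng = np.random.default_rng(0)
# Three well-separated clusters of 50 points each.
data = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in ((0, 0), (5, 5), (10, 0))])

def kmeans_sse(X, k, iters=30):
    # Farthest-first initialization, then standard Lloyd iterations.
    centers = X[:1].copy()
    for _ in range(k - 1):
        d = ((X[:, None] - centers[None]) ** 2).sum(-1).min(1)
        centers = np.vstack([centers, X[d.argmax()]])
    for _ in range(iters):
        assign = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([X[assign == j].mean(0) if (assign == j).any()
                            else centers[j] for j in range(k)])
    assign = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
    return ((X - centers[assign]) ** 2).sum()

sse = {k: kmeans_sse(data, k) for k in range(1, 6)}
# SSE drops sharply up to the true K = 3, then levels off.
print(sse[1] > sse[2] > sse[3])
```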
Internal Measures: SSE
SSE curve for a more complicated data set:
[Figure: data set with clusters labeled 1-7; SSE of clusters found using K-means]
Framework for Cluster Validity
Need a framework to interpret any measure.
Statistics provide a framework for cluster validity.
For comparing the results of two different sets of cluster analyses, a framework is less necessary.
Statistical Framework for SSE
Example: compare the SSE of the actual clustering to the distribution of SSE values obtained by clustering random data.
[Figure: histogram of SSE values (roughly 0.016 to 0.034) for clusterings of random data, shown beside the example data set (x, y in [0, 1])]
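One way to realize this framework, sketched under the assumption that a K-means clustering with K = 3 is being judged: cluster many random data sets of the same size, and check whether the real clustering's SSE is atypically low relative to that distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans_sse(X, k=3, iters=20):
    # Toy K-means: farthest-first init + Lloyd iterations (illustration only).
    centers = X[:1].copy()
    for _ in range(k - 1):
        d = ((X[:, None] - centers[None]) ** 2).sum(-1).min(1)
        centers = np.vstack([centers, X[d.argmax()]])
    for _ in range(iters):
        assign = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([X[assign == j].mean(0) if (assign == j).any()
                            else centers[j] for j in range(k)])
    assign = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
    return ((X - centers[assign]) ** 2).sum()

# "Real" data: three tight clusters inside the unit square.
real = np.vstack([rng.normal(c, 0.05, (30, 2)) for c in (0.2, 0.5, 0.8)])
sse_real = kmeans_sse(real)

# SSE distribution over 100 random data sets of the same size.
sse_random = [kmeans_sse(rng.uniform(0, 1, real.shape)) for _ in range(100)]
print(sse_real < min(sse_random))  # the real clustering is atypically tight
```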
Statistical Framework for Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
[Figure: two scatter plots (x, y in [0, 1])]
Corr = -0.9235    Corr = -0.5810
Internal Measures: Cohesion and Separation
Cluster Cohesion: measures how closely related the objects in a cluster are.
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.
Example: Squared Error
Cohesion is measured by the within-cluster sum of squares (SSE):
WSS = Σ_i Σ_{x ∈ Ci} (x − mi)²
Separation is measured by the between-cluster sum of squares:
BSS = Σ_i |Ci| (m − mi)²
where |Ci| is the size of cluster i, mi is its centroid, and m is the overall mean.
Example: SSE for the points 1, 2, 4, 5 on a line (overall mean m = 3; with two clusters, centroids m1 = 1.5 and m2 = 4.5):
K = 1 cluster:
WSS = (1 − 3)² + (2 − 3)² + (4 − 3)² + (5 − 3)² = 10
BSS = 4 × (3 − 3)² = 0
Total = 10 + 0 = 10
K = 2 clusters:
WSS = (1 − 1.5)² + (2 − 1.5)² + (4 − 4.5)² + (5 − 4.5)² = 1
BSS = 2 × (3 − 1.5)² + 2 × (4.5 − 3)² = 9
Total = 1 + 9 = 10
A proximity-graph-based approach can also be used for cohesion and separation.
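The worked example can be checked numerically; note that WSS + BSS (the total sum of squares) is constant regardless of K:

```python
# Cohesion (WSS) and separation (BSS) for a 1-D clustering.
def wss_bss(points, clusters):
    m = sum(points) / len(points)                   # overall mean
    wss = bss = 0.0
    for cluster in clusters:
        mi = sum(cluster) / len(cluster)            # cluster centroid
        wss += sum((x - mi) ** 2 for x in cluster)  # cohesion
        bss += len(cluster) * (m - mi) ** 2         # separation
    return wss, bss

points = [1, 2, 4, 5]
print(wss_bss(points, [points]))          # K=1: (10.0, 0.0)
print(wss_bss(points, [[1, 2], [4, 5]]))  # K=2: (1.0, 9.0)
```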
Internal Measures: Silhouette Coefficient
The silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.
For an individual point i:
- a = average distance of i to the points in its own cluster
- b = minimum, over the other clusters, of the average distance of i to the points in that cluster
- s = (b − a) / max(a, b); the closer s is to 1, the better
Can calculate the average silhouette width for a cluster or a clustering.
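A direct transcription of that definition for a single point (toy 1-D clusters and hypothetical helper names; non-singleton clusters assumed):

```python
# Silhouette coefficient for one point: a = average intra-cluster distance,
# b = minimum average distance to another cluster.
def silhouette(i, clusters, dist):
    own = next(c for c in clusters if i in c)
    a = sum(dist(i, j) for j in own if j != i) / (len(own) - 1)
    b = min(sum(dist(i, j) for j in c) / len(c)
            for c in clusters if c is not own)
    return (b - a) / max(a, b)

dist = lambda p, q: abs(p - q)
s = silhouette(1, [[1, 2], [8, 9, 10]], dist)
print(s)  # a = 1, b = 8, so s = (8 - 1) / 8 = 0.875
```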
External Measures of Cluster Validity: Entropy and Purity
Final Comment on Cluster Validity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
— Algorithms for Clustering Data, Jain and Dubes
Cluster Similarity Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters
Need to use average connectivity for scalability since total proximity favors large clusters
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Using Similarity Matrix for Cluster Validation
Order the similarity matrix with respect to cluster labels and inspect visually.

[Figure: a scatter plot of three well-separated clusters and the corresponding 100×100 similarity matrix ordered by cluster label (similarity scale 0 to 1), showing a sharp block-diagonal structure.]
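The ordering step can be sketched as follows, on synthetic data with ground-truth labels (all names here are illustrative); for well-separated clusters, the within-cluster blocks of the ordered matrix are far more similar than the rest:

```python
import numpy as np

# Similarity matrix ordered by cluster label: for well-separated clusters,
# the block-diagonal (within-cluster) similarity should far exceed the rest.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.05, (20, 2))
               for c in ([0.2, 0.2], [0.8, 0.2], [0.5, 0.8])])
labels = np.repeat([0, 1, 2], 20)

order = np.argsort(labels, kind="stable")    # group rows/columns by cluster
D = np.sqrt(((X[order][:, None] - X[order][None]) ** 2).sum(-1))
S = 1.0 - D / D.max()                        # distances rescaled to [0, 1]

same = labels[order][:, None] == labels[order][None]
within = S[same].mean()                      # mean similarity inside blocks
between = S[~same].mean()                    # mean similarity elsewhere
```

Plotting `S` as a heatmap reproduces the block structure shown on the slide; for random data the contrast between `within` and `between` largely disappears.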
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.

[Figure: random points clustered by DBSCAN and the corresponding ordered 100×100 similarity matrix; the block-diagonal structure is much weaker.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.

[Figure: random points clustered by K-means and the corresponding ordered similarity matrix.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.

[Figure: random points clustered by Complete Link and the corresponding ordered similarity matrix.]
Using Similarity Matrix for Cluster Validation

[Figure: DBSCAN clusters 1–7 on a more complicated data set and the corresponding ordered similarity matrix (about 3000 points).]
Clusters in more complicated figures aren't well separated.

Internal Measures: SSE
Internal Index: used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE).
It can also be used to estimate the number of clusters.
[Figure: a data set with well-separated clusters and its SSE curve for K = 2 to 30; SSE drops sharply until K reaches the natural number of clusters, then flattens.]
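Estimating the number of clusters from the SSE curve can be sketched as below. This is a minimal hand-rolled K-means (Lloyd's algorithm with deterministic farthest-first seeding, which is my choice for reproducibility, not something the slides prescribe) on synthetic three-cluster data:

```python
import numpy as np

def farthest_first(X, k):
    """Deterministic farthest-first seeding; a simple stand-in for the
    random restarts one would normally use."""
    idx = [0]
    for _ in range(k - 1):
        d = ((X[:, None] - X[idx][None]) ** 2).sum(-1).min(1)
        idx.append(int(d.argmax()))
    return X[idx].copy()

def kmeans_sse(X, k, iters=30):
    """SSE of a plain Lloyd's-algorithm K-means clustering."""
    C = farthest_first(X, k)
    for _ in range(iters):
        lab = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        C = np.array([X[lab == j].mean(0) if (lab == j).any() else C[j]
                      for j in range(k)])
    return float(((X - C[lab]) ** 2).sum())

# Three well-separated clusters: the SSE-vs-K curve has its knee at K = 3.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.1, (50, 2)) for c in (0.0, 4.0, 8.0)])
sse_by_k = {k: kmeans_sse(X, k) for k in (1, 2, 3, 4, 5)}
```

SSE keeps decreasing as K grows, but the drop from K = 2 to K = 3 dwarfs the later ones; the knee marks the estimated number of clusters.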
Internal Measures: SSE
SSE curve for a more complicated data set

[Figure: the seven-cluster data set (clusters 1–7) and the SSE of clusters found using K-means.]
Framework for Cluster Validity
Need a framework to interpret any measure: if our measure of evaluation has the value, say, 10, is that good, fair, or poor?
Statistics provide a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data.
For comparing the results of two different sets of cluster analyses, a framework is less necessary; however, there is still the question of whether the difference between two index values is significant.
Statistical Framework for SSE
Example: compare the SSE of a clustering against the distribution of SSE values obtained by clustering random data.

[Figure: histogram of SSE values (roughly 0.016 to 0.034, counts up to 50) for clusters found in random data, alongside the three-cluster reference data set.]
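One way to realize this framework is a Monte Carlo null distribution: cluster many random data sets of the same size and range, and ask where the observed SSE falls. A sketch under those assumptions (helper names are mine):

```python
import numpy as np

def farthest_first(X, k):
    """Deterministic farthest-first seeding (one seed per well-separated blob)."""
    idx = [0]
    for _ in range(k - 1):
        d = ((X[:, None] - X[idx][None]) ** 2).sum(-1).min(1)
        idx.append(int(d.argmax()))
    return X[idx].copy()

def kmeans_sse(X, k, iters=30):
    """SSE after Lloyd's algorithm from farthest-first seeds; a sketch."""
    C = farthest_first(X, k)
    for _ in range(iters):
        lab = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        C = np.array([X[lab == j].mean(0) if (lab == j).any() else C[j]
                      for j in range(k)])
    return float(((X - C[lab]) ** 2).sum())

rng = np.random.default_rng(42)
# Observed data: three tight clusters inside the unit square.
X = np.vstack([rng.normal(m, 0.03, (30, 2))
               for m in ([0.3, 0.3], [0.7, 0.3], [0.5, 0.8])])
obs = kmeans_sse(X, 3)

# Null distribution: SSE of K=3 clusterings of random data of the same
# size and comparable range, repeated 30 times.
null = [kmeans_sse(rng.uniform(0.2, 0.8, (90, 2)), 3) for _ in range(30)]
```

Because the observed SSE falls far below every value in the null distribution, the three-cluster structure is "atypical" of random data and hence likely valid.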
Statistical Framework for Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:

[Figure: the same two scatter plots in the unit square as before.]

Corr = −0.9235; Corr = −0.5810
Internal Measures: Cohesion and Separation
Cluster Cohesion: measures how closely related the objects in a cluster are.
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.
Example: Squared Error
o Cohesion is measured by the within-cluster sum of squares (SSE): WSS = Σ_i Σ_{x∈C_i} (x − m_i)²
o Separation is measured by the between-cluster sum of squares: BSS = Σ_i |C_i| (m − m_i)², where m_i is the centroid of cluster C_i, |C_i| is its size, and m is the overall mean.
Internal Measures: Cohesion and Separation
Example: SSE, with the points 1, 2, 4, 5 on a number line; overall mean m = 3; for K = 2, cluster {1, 2} has centroid m1 = 1.5 and cluster {4, 5} has centroid m2 = 4.5.

K = 1 cluster:
WSS = (1 − 3)² + (2 − 3)² + (4 − 3)² + (5 − 3)² = 10
BSS = 4 × (3 − 3)² = 0
Total = WSS + BSS = 10 + 0 = 10

K = 2 clusters:
WSS = (1 − 1.5)² + (2 − 1.5)² + (4 − 4.5)² + (5 − 4.5)² = 1
BSS = 2 × (3 − 1.5)² + 2 × (4.5 − 3)² = 9
Total = WSS + BSS = 1 + 9 = 10
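The identity Total = WSS + BSS (the total sum of squares is constant across clusterings) can be checked directly in code; a small sketch with illustrative helper names:

```python
import numpy as np

def wss_bss(X, labels):
    """Within- and between-cluster sums of squares; WSS + BSS equals the
    total sum of squares regardless of the clustering."""
    m = X.mean(0)
    wss = bss = 0.0
    for j in np.unique(labels):
        Cj = X[labels == j]
        mj = Cj.mean(0)
        wss += ((Cj - mj) ** 2).sum()
        bss += len(Cj) * ((m - mj) ** 2).sum()
    return float(wss), float(bss)

X = np.array([[1.0], [2.0], [4.0], [5.0]])
print(wss_bss(X, np.array([0, 0, 0, 0])))  # K=1 -> (10.0, 0.0)
print(wss_bss(X, np.array([0, 0, 1, 1])))  # K=2 -> (1.0, 9.0)
```

Both clusterings total 10, matching the worked example above.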
Internal Measures: Cohesion and Separation
A proximity-graph-based approach can also be used for cohesion and separation:
o cluster cohesion is the sum of the weights of all links within a cluster;
o cluster separation is the sum of the weights of links between nodes in the cluster and nodes outside the cluster.

[Figure: a proximity graph illustrating cohesion (edges within a cluster) and separation (edges between clusters).]
Internal Measures: Silhouette Coefficient
The silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.
For an individual point i:
o calculate a = average distance of i to the points in its own cluster;
o calculate b = min (average distance of i to the points in another cluster);
o the silhouette coefficient of the point is s = (b − a) / max(a, b), typically between 0 and 1; the closer to 1, the better.
Can calculate the average silhouette width for a cluster or a clustering.
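A brute-force silhouette computation, following the definition above (a sketch, not an optimized implementation):

```python
import numpy as np

def silhouette(X, labels):
    """Per-point silhouette s_i = (b_i - a_i) / max(a_i, b_i), using
    brute-force pairwise distances."""
    n = len(X)
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    s = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        # a: average distance to the other points in i's own cluster.
        a = D[i, same & (np.arange(n) != i)].mean()
        # b: smallest average distance to the points of another cluster.
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Two tight, well-separated clusters give silhouettes near 1.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
s = silhouette(X, np.array([0, 0, 1, 1]))
```

Averaging `s` over a cluster or over all points gives the average silhouette width mentioned above.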
External Measures of Cluster Validity: Entropy and Purity
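The slide's table is not reproduced here, but the two measures themselves are easy to sketch: for each cluster, purity is the fraction of the majority class and entropy measures class mixing, both weighted by cluster size (the function name and examples below are mine):

```python
import numpy as np
from collections import Counter

def purity_and_entropy(labels, classes):
    """External validity measures against known class labels: size-weighted
    majority-class fraction (purity) and size-weighted class entropy."""
    n = len(labels)
    purity = entropy = 0.0
    for c in set(labels):
        members = [classes[i] for i in range(n) if labels[i] == c]
        counts = np.array(list(Counter(members).values()), dtype=float)
        p = counts / counts.sum()              # class distribution in cluster
        purity += (len(members) / n) * p.max()
        entropy += (len(members) / n) * -(p * np.log2(p)).sum()
    return float(purity), float(entropy)

# Perfectly pure clusters: purity 1.0, entropy 0.0.
pure = purity_and_entropy([0, 0, 1, 1], ['a', 'a', 'b', 'b'])
# One cluster with two classes mixed 50/50: purity 0.5, entropy 1.0.
mixed = purity_and_entropy([0, 0, 0, 0], ['a', 'a', 'b', 'b'])
```

Higher purity and lower entropy indicate better agreement with the external class labels.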
Final Comment on Cluster Validity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
(Algorithms for Clustering Data, Jain and Dubes)
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function …
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering – Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids …
Importance of Choosing Initial Centroids …
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Similarity matrix (embedded spreadsheet of pairwise similarities used in the hierarchical clustering examples):

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
Hierarchical Clustering: Group Average

[Figure: nested clusters and the corresponding dendrogram (merge heights roughly 0.05 to 0.25) for group average on the six-point example.]
Hierarchical Clustering: Group Average
Compromise between Single and Complete Link.
Strengths: less susceptible to noise and outliers.
Limitations: biased towards globular clusters.
Cluster Similarity: Ward's Method
Similarity of two clusters is based on the increase in squared error when the two clusters are merged.
Less susceptible to noise and outliers.
Biased towards globular clusters.
Hierarchical analogue of K-means.
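Ward's merge cost has a closed form: the increase in total squared error from merging clusters A and B is |A||B| / (|A|+|B|) times the squared distance between their centroids. A small sketch (the helper name is mine):

```python
import numpy as np

def ward_increase(A, B):
    """Increase in total squared error caused by merging clusters A and B:
    delta = |A||B| / (|A|+|B|) * ||mean(A) - mean(B)||^2 (Ward's criterion)."""
    mA, mB = A.mean(0), B.mean(0)
    return len(A) * len(B) / (len(A) + len(B)) * ((mA - mB) ** 2).sum()

A = np.array([[0.0, 0.0], [2.0, 0.0]])   # mean (1, 0), SSE 2
B = np.array([[10.0, 0.0]])              # mean (10, 0), SSE 0
d = ward_increase(A, B)
```

Checking by hand: the merged cluster {0, 2, 10} has mean 4 and SSE 16 + 4 + 36 = 56, so the increase is 56 − 2 − 0 = 54, which the closed form reproduces as 2·1/3 · 81 = 54.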
Hierarchical Clustering: Comparison

[Figure: the six-point example clustered by MIN, MAX, Group Average, and Ward's Method; the four linkages produce different nested clusterings.]
Hierarchical Clustering: Time and Space Requirements
O(N²) space, since it uses the proximity matrix (N is the number of points).
O(N³) time in many cases: there are N merge steps, and at each step the size-N² proximity matrix must be searched and updated.
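These costs are visible in a naive implementation: N − 1 merge steps, each scanning at least the O(N²) proximity matrix, giving O(N³)-ish time and O(N²) space. A sketch (single link by default; names are mine):

```python
import numpy as np

def agglomerative(D, linkage=min):
    """Naive agglomerative clustering on a precomputed distance matrix D.
    N-1 merge steps, each scanning all cluster pairs, so the total work is
    cubic (or worse) in N, with O(N^2) space for D. Returns the merge
    sequence as (kept_cluster, absorbed_cluster) pairs."""
    clusters = {i: [i] for i in range(len(D))}
    merges = []
    while len(clusters) > 1:
        # Closest pair of clusters under the given linkage
        # (min = single link, max = complete link).
        a, b = min(((a, b) for a in clusters for b in clusters if a < b),
                   key=lambda p: linkage(D[i][j] for i in clusters[p[0]]
                                         for j in clusters[p[1]]))
        clusters[a] += clusters.pop(b)
        merges.append((a, b))
    return merges

# Distance matrix for the points 0, 1, 10 on a line: 0 and 1 merge first.
pts = np.array([0.0, 1.0, 10.0])
D = np.abs(pts[:, None] - pts[None])
m = agglomerative(D)
```

Practical implementations update the proximity matrix after each merge instead of rescanning raw point pairs, which is how the O(N³), and for some linkages O(N² log N), bounds are obtained.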
Hierarchical Clustering: Problems and Limitations
Once a decision is made to combine two clusters, it cannot be undone.
No objective function is directly minimized.
Different schemes have problems with one or more of the following:
o sensitivity to noise and outliers
o difficulty handling clusters of different sizes and non-convex shapes
o breaking large clusters
MST: Divisive Hierarchical Clustering
Use a minimum spanning tree (MST) for constructing the hierarchy of clusters.
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
I1
I2
I3
I4
I5
I1
100
090
010
065
020
I2
090
100
070
060
050
I3
010
070
100
040
030
I4
065
060
040
100
080
I5
020
050
030
080
100
Sheet2
Sheet3
Hierarchical Clustering Group Average
Nested Clusters Dendrogram
3 6 4 1 2 50
005
01
015
02
025
1
2
3
4
5
6 1
2
5
3
4
Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths
Limitations
Cluster Similarity Wardrsquos Method
Similarity of two clusters is based on the increase in squared error when two clusters are merged
Less susceptible to noise and outliers
Biased towards globular clusters
Hierarchical analogue of K-means
Hierarchical Clustering Comparison
Group Average
Wardrsquos Method
1
2 3
4
5
6 1
2
5
3
4
MIN MAX
1
2 3
4
5
6 1
2
5
3 4
1
2 3
4
5
6 1
2 5
3
4 1
2 3
4
5
6 1
2
3
4
5
Hierarchical Clustering Time and Space requirements
O(N2) space since it uses the proximity matrix
O(N3) time in many cases
Hierarchical Clustering Problems and Limitations
Once a decision is made to combine two clusters it cannot be undone
No objective function is directly minimized
Different schemes have problems with one or more of the following
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Sheet3
Hierarchical Clustering Group Average
Nested Clusters Dendrogram
3 6 4 1 2 50
005
01
015
02
025
1
2
3
4
5
6 1
2
5
3
4
Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths
Limitations
Cluster Similarity Wardrsquos Method
Similarity of two clusters is based on the increase in squared error when two clusters are merged
Less susceptible to noise and outliers
Biased towards globular clusters
Hierarchical analogue of K-means
Hierarchical Clustering Comparison
Group Average
Wardrsquos Method
1
2 3
4
5
6 1
2
5
3
4
MIN MAX
1
2 3
4
5
6 1
2
5
3 4
1
2 3
4
5
6 1
2 5
3
4 1
2 3
4
5
6 1
2
3
4
5
Hierarchical Clustering Time and Space requirements
O(N2) space since it uses the proximity matrix
O(N3) time in many cases
Hierarchical Clustering Problems and Limitations
Once a decision is made to combine two clusters it cannot be undone
No objective function is directly minimized
Different schemes have problems with one or more of the following
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures aren't well separated.
Internal Index: used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE).
It can also be used to estimate the number of clusters.
Internal Measures: SSE
[Plot of SSE versus number of clusters K (K = 2 to 30), shown alongside the 10-cluster data set]
Internal Measures: SSE
SSE curve for a more complicated data set
[The data set with its seven clusters labeled 1-7; SSE of clusters found using K-means]
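The idea of estimating the number of clusters from SSE can be sketched as follows: run K-means for several values of K and look for the "knee" where the curve flattens. This is illustrative code, not the lecture's; the deterministic farthest-point initialization is a detail chosen here for reproducibility, not taken from the slides.

```python
import numpy as np

def kmeans_sse(points, k, steps=30):
    centroids = [points[0]]
    for _ in range(k - 1):                     # farthest-point initialization
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(points[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(steps):                     # standard Lloyd iterations
        d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        centroids = np.array([points[assign == j].mean(axis=0)
                              if np.any(assign == j) else centroids[j]
                              for j in range(k)])
    return ((points - centroids[assign]) ** 2).sum()

rng = np.random.default_rng(0)
centers = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
points = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in centers])
sse = {k: kmeans_sse(points, k) for k in range(1, 7)}
# On this synthetic data the SSE drops sharply up to K = 3 (the true number
# of clusters here) and flattens afterwards.
```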
Need a framework to interpret any measure.
Statistics provide a framework for cluster validity.
For comparing the results of two different sets of cluster analyses, a framework is less necessary.
Framework for Cluster Validity
Example
Statistical Framework for SSE
[Histogram of SSE values (roughly 0.016-0.034) obtained by clustering random data, shown alongside the reference data set]
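One way to sketch this framework in code: compare the SSE obtained on the data against a reference distribution of SSE values from clustering random data of the same size and range. The helper below is illustrative, not the lecture's code.

```python
import numpy as np

def kmeans_sse(points, k=3, steps=30):
    # Deterministic farthest-point initialization, then Lloyd iterations.
    centroids = [points[0]]
    for _ in range(k - 1):
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(points[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(steps):
        d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        centroids = np.array([points[assign == j].mean(axis=0)
                              if np.any(assign == j) else centroids[j]
                              for j in range(k)])
    return ((points - centroids[assign]) ** 2).sum()

rng = np.random.default_rng(0)
clustered = np.vstack([rng.normal(c, 0.05, size=(30, 2))
                       for c in ([0.1, 0.1], [0.9, 0.2], [0.5, 0.9])])
observed = kmeans_sse(clustered)

# Reference distribution: the same procedure applied to uniform random data
# of the same size and range.
reference = [kmeans_sse(rng.uniform(0, 1, size=(90, 2))) for _ in range(20)]
# An observed SSE far below the reference distribution indicates that the
# clustering reflects real structure rather than chance.
```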
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
[Scatter plots of the two data sets: Corr = -0.9235 for the clustered data, Corr = -0.5810 for the random data]
Cluster Cohesion: measures how closely related the objects in a cluster are.
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters.
Example: Squared Error
Internal Measures: Cohesion and Separation
Cohesion is measured by the within-cluster sum of squares:

WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2

Separation is measured by the between-cluster sum of squares:

BSS = \sum_i |C_i| (m - m_i)^2

where m is the overall mean of the data, m_i is the centroid of cluster C_i, and |C_i| is the number of points in C_i.
Internal Measures: Cohesion and Separation
Example: SSE for the points 1, 2, 4, 5 on a line, with overall mean m = 3 and, for K = 2, centroids m1 = 1.5 and m2 = 4.5.

K = 1 cluster:
WSS = (1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2 = 10
BSS = 4 x (3 - 3)^2 = 0
Total = 10 + 0 = 10

K = 2 clusters:
WSS = (1 - 1.5)^2 + (2 - 1.5)^2 + (4 - 4.5)^2 + (5 - 4.5)^2 = 1
BSS = 2 x (3 - 1.5)^2 + 2 x (4.5 - 3)^2 = 9
Total = 1 + 9 = 10

The total sum of squares is the same for both clusterings.
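The worked example can be checked in code; this small sketch computes WSS and BSS from cluster labels and confirms that their sum (the total sum of squares) does not depend on K:

```python
import numpy as np

def wss_bss(points, labels):
    m = points.mean(axis=0)                      # overall mean
    wss = bss = 0.0
    for j in np.unique(labels):
        cluster = points[labels == j]
        mj = cluster.mean(axis=0)                # cluster centroid
        wss += ((cluster - mj) ** 2).sum()
        bss += len(cluster) * ((m - mj) ** 2).sum()
    return wss, bss

points = np.array([[1.0], [2.0], [4.0], [5.0]])

# K = 1: one big cluster.
w1, b1 = wss_bss(points, np.array([0, 0, 0, 0]))   # w1 = 10, b1 = 0
# K = 2: {1, 2} and {4, 5}.
w2, b2 = wss_bss(points, np.array([0, 0, 1, 1]))   # w2 = 1, b2 = 9
# The total sum of squares is the same in both cases.
print(w1 + b1, w2 + b2)   # 10.0 10.0
```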
A proximity-graph-based approach can also be used for cohesion and separation: cohesion is the sum of the weights of the links within a cluster; separation is the sum of the weights of the links between nodes in the cluster and nodes outside it.
Internal Measures: Cohesion and Separation
Silhouette Coefficient: combines the ideas of both cohesion and separation, but for individual points as well as for clusters and clusterings.
For an individual point i: let a be the average distance of i to the points in its own cluster, and b the minimum, over the other clusters, of the average distance of i to the points in that cluster. The silhouette of i is s = (b - a) / max(a, b), so values near 1 are best.
Can calculate the average silhouette width for a cluster or a clustering.
Internal Measures: Silhouette Coefficient
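A sketch of the silhouette computation for a single point, using the common definition s = (b - a) / max(a, b) (the lecture's exact formula may be stated slightly differently):

```python
import numpy as np

def silhouette(points, labels, i):
    # Distances from point i to every point.
    d = np.sqrt(((points - points[i]) ** 2).sum(axis=1))
    own = labels == labels[i]
    own[i] = False                       # exclude the point itself
    a = d[own].mean()                    # avg distance within own cluster
    b = min(d[labels == other].mean()    # nearest other cluster
            for other in np.unique(labels) if other != labels[i])
    return (b - a) / max(a, b)

points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # cluster 0
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])  # cluster 1
labels = np.array([0, 0, 0, 1, 1, 1])
scores = [silhouette(points, labels, i) for i in range(6)]
# Tight, well-separated clusters give silhouette values close to 1;
# averaging the scores gives the average silhouette width.
```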
External Measures of Cluster Validity: Entropy and Purity
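Entropy and purity can be sketched as follows, under their usual definitions (per-cluster majority fraction, and weighted class entropy); the code is illustrative, not the lecture's:

```python
import numpy as np

def purity(clusters, classes):
    # Fraction of points assigned to the majority class of their cluster.
    total = 0
    for c in np.unique(clusters):
        _, counts = np.unique(classes[clusters == c], return_counts=True)
        total += counts.max()
    return total / len(clusters)

def entropy(clusters, classes):
    # Weighted average of the class-label entropy within each cluster.
    h = 0.0
    for c in np.unique(clusters):
        _, counts = np.unique(classes[clusters == c], return_counts=True)
        p = counts / counts.sum()
        h += counts.sum() / len(clusters) * -(p * np.log2(p)).sum()
    return h

clusters = np.array([0, 0, 0, 1, 1, 1])
classes  = np.array([0, 0, 1, 1, 1, 1])
print(purity(clusters, classes))   # 5/6: one point is mislabeled
print(entropy(clusters, classes))  # only the first cluster is impure
```

Higher purity and lower entropy indicate a clustering that agrees with the external class labels.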
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
(Algorithms for Clustering Data, Jain and Dubes)
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Hierarchical Clustering Group Average
Nested Clusters Dendrogram
[Nested clusters and the corresponding dendrogram for group-average clustering of the six-point example (leaves 3, 6, 4, 1, 2, 5; merge heights up to about 0.25)]
Hierarchical Clustering: Group Average is a compromise between single and complete link, with its own strengths and limitations.
Cluster Similarity: Ward's Method
Similarity of two clusters is based on the increase in squared error when the two clusters are merged.
Less susceptible to noise and outliers.
Biased towards globular clusters.
Hierarchical analogue of K-means.
Hierarchical Clustering Comparison
[Side-by-side comparison of MIN, MAX, group average, and Ward's method on the six-point example: each linkage nests clusters 1-6 differently]
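The effect of the linkage choice can be illustrated with a tiny brute-force agglomerative clusterer (illustrative code, not the lecture's; only single and complete link are shown):

```python
import numpy as np

def agglomerate(points, linkage, n_clusters):
    # Each cluster is a list of point indices; merge the closest pair until
    # n_clusters remain. Cluster distance is the min (single link) or max
    # (complete link) pairwise distance between members.
    clusters = [[i] for i in range(len(points))]
    d = np.sqrt(((points[:, None] - points[None, :]) ** 2).sum(-1))
    reduce = min if linkage == "single" else max
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = reduce(d[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)
    return clusters

# A chain of evenly spaced points: single link follows the chain into one
# elongated cluster, while complete link prefers compact groups.
points = np.array([[i, 0.0] for i in range(6)])
print(agglomerate(points, "single", 2))
print(agglomerate(points, "complete", 2))
```

This quadratic-time sketch is only for intuition; `scipy.cluster.hierarchy.linkage` provides the practical implementation of these (and Ward's) linkages.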
Hierarchical Clustering: Time and Space Requirements
O(N^2) space, since it uses the proximity matrix (N is the number of points).
O(N^3) time in many cases: there are N merge steps, and at each step the proximity matrix must be updated and searched.
Hierarchical Clustering: Problems and Limitations
Once a decision is made to combine two clusters, it cannot be undone.
No objective function is directly minimized.
Different schemes have problems with one or more of the following: sensitivity to noise and outliers; difficulty handling clusters of different sizes and non-globular shapes; breaking of large clusters.
MST Divisive Hierarchical Clustering: use a minimum spanning tree (MST) to construct the hierarchy of clusters.
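A sketch of the MST-based divisive idea: build the minimum spanning tree with Prim's algorithm, then cut the longest edges so the remaining forest has the desired number of components. All names here are illustrative.

```python
import numpy as np

def mst_edges(points):
    # Prim's algorithm over the complete Euclidean graph.
    n = len(points)
    d = np.sqrt(((points[:, None] - points[None, :]) ** 2).sum(-1))
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        best = min((d[i, j], i, j) for i in in_tree
                   for j in range(n) if j not in in_tree)
        edges.append(best)
        in_tree.add(best[2])
    return edges

def mst_clusters(points, n_clusters):
    # Keep only the shortest MST edges (i.e. delete the n_clusters - 1
    # longest ones), then return the connected components via union-find.
    edges = sorted(mst_edges(points))[: len(points) - n_clusters]
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for _, i, j in edges:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
print(mst_clusters(points, 2))  # two groups: {0, 1, 2} and {3, 4, 5}
```

Cutting the single longest edge separates the two far-apart groups, which is exactly the first divisive split.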
Cluster Validity
For supervised classification, we have a variety of measures to evaluate how good our model is.
For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"! Then why do we want to evaluate them?
Clusters found in Random Data
[Four scatter plots of the same random points: the raw points, and the "clusters" found in them by K-means, DBSCAN, and complete link]
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices.
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Hierarchical Clustering Group Average Compromise between Single and Complete Link Strengths
Limitations
Cluster Similarity Wardrsquos Method
Similarity of two clusters is based on the increase in squared error when two clusters are merged
Less susceptible to noise and outliers
Biased towards globular clusters
Hierarchical analogue of K-means
Hierarchical Clustering Comparison
Group Average
Wardrsquos Method
1
2 3
4
5
6 1
2
5
3
4
MIN MAX
1
2 3
4
5
6 1
2
5
3 4
1
2 3
4
5
6 1
2 5
3
4 1
2 3
4
5
6 1
2
3
4
5
Hierarchical Clustering Time and Space requirements
O(N2) space since it uses the proximity matrix
O(N3) time in many cases
Hierarchical Clustering Problems and Limitations
Once a decision is made to combine two clusters it cannot be undone
No objective function is directly minimized
Different schemes have problems with one or more of the following
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Cluster Similarity Wardrsquos Method
Similarity of two clusters is based on the increase in squared error when two clusters are merged
Less susceptible to noise and outliers
Biased towards globular clusters
Hierarchical analogue of K-means
Hierarchical Clustering Comparison
Group Average
Wardrsquos Method
1
2 3
4
5
6 1
2
5
3
4
MIN MAX
1
2 3
4
5
6 1
2
5
3 4
1
2 3
4
5
6 1
2 5
3
4 1
2 3
4
5
6 1
2
3
4
5
Hierarchical Clustering Time and Space requirements
O(N2) space since it uses the proximity matrix
O(N3) time in many cases
Hierarchical Clustering Problems and Limitations
Once a decision is made to combine two clusters it cannot be undone
No objective function is directly minimized
Different schemes have problems with one or more of the following
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function …
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering – Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids …
Importance of Choosing Initial Centroids …
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Ward's Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Hierarchical Clustering: Comparison
[Figure: the same six points clustered with MIN, MAX, Group Average, and Ward's Method; each linkage yields a different merge order and different final clusters.]
Hierarchical Clustering: Time and Space Requirements
O(N^2) space, since the proximity matrix is stored (N is the number of points).
O(N^3) time in many cases: there are N merge steps, and at each step the proximity matrix must be searched and updated.
Hierarchical Clustering: Problems and Limitations
Once a decision is made to combine two clusters, it cannot be undone.
No objective function is directly minimized.
Different schemes have problems with one or more of the following: sensitivity to noise and outliers, difficulty handling clusters of different sizes or non-globular shapes, and a tendency to break large clusters.
MST: Divisive Hierarchical Clustering
Use a minimum spanning tree (MST) to construct the hierarchy of clusters: build the MST of the data, then split clusters top-down by removing the largest remaining MST edge at each step.
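The split step can be sketched in pure Python. This is an illustrative implementation, not the slides' code: the function name, Kruskal's algorithm, and Euclidean distance are our choices. Build the MST, then drop its K - 1 largest edges so the connected components become the K clusters.

```python
import math
from itertools import combinations

def mst_divisive(points, k):
    """Divisive clustering sketch: build the MST with Kruskal's
    algorithm, drop its k-1 largest edges, and return a component
    label for every point."""
    n = len(points)
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))

    parent = list(range(n))
    def find(x):                        # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []                            # Kruskal: smallest edges first
    for w, i, j in edges:
        if find(i) != find(j):
            parent[find(i)] = find(j)
            mst.append((w, i, j))

    # The MST has n-1 edges; keeping the n-k smallest splits it into k parts.
    parent = list(range(n))
    for w, i, j in sorted(mst)[: n - k]:
        parent[find(i)] = find(j)
    return [find(i) for i in range(n)]
```

Removing the single largest edge first reproduces the top-down hierarchy one split at a time.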
Cluster Validity
For supervised classification, we have a variety of measures to evaluate how good our model is. For cluster analysis, the analogous question is how to evaluate the “goodness” of the resulting clusters. But “clusters are in the eye of the beholder”! Then why do we want to evaluate them?
Clusters found in Random Data
[Figure: the same uniformly random points shown four ways: as raw Random Points, and as the clusterings produced by K-means, DBSCAN, and Complete Link; each algorithm imposes cluster structure even though none exists.]
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data).
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Measures of Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index: measures the extent to which cluster labels match externally supplied class labels (e.g., entropy, purity).
o Internal Index: measures the goodness of a clustering structure without respect to external information (e.g., SSE).
o Relative Index: compares two different clusterings or clusters; often an external or internal index is used for this purpose.
Sometimes these are referred to as criteria instead of indices.
Measuring Cluster Validity Via Correlation
Two matrices:
o the proximity matrix of the data, and
o an incidence matrix with one row and one column per point, whose entry is 1 if the two points belong to the same cluster and 0 otherwise.
Compute the correlation between the two matrices. High correlation indicates that points that belong to the same cluster are close to each other. This is not a good measure for some density- or contiguity-based clusters.

Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
[Figure: a data set with well-separated clusters, Corr = -0.9235, and a random data set, Corr = -0.5810.]
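A minimal sketch of the computation (pure Python; the naming and the use of Euclidean distance as the proximity are our assumptions): flatten both matrices over the n(n-1)/2 distinct pairs and take the Pearson correlation. With a distance matrix as proximity, good clusterings give a strongly negative correlation, matching the figures.

```python
import math
from itertools import combinations

def incidence_proximity_corr(points, labels):
    """Pearson correlation between pairwise Euclidean distances
    (proximity) and same-cluster indicators (incidence)."""
    pairs = list(combinations(range(len(points)), 2))
    prox = [math.dist(points[i], points[j]) for i, j in pairs]
    inc = [1.0 if labels[i] == labels[j] else 0.0 for i, j in pairs]
    n = len(pairs)
    mp, mi = sum(prox) / n, sum(inc) / n
    cov = sum((p - mp) * (q - mi) for p, q in zip(prox, inc))
    sp = math.sqrt(sum((p - mp) ** 2 for p in prox))
    si = math.sqrt(sum((q - mi) ** 2 for q in inc))
    return cov / (sp * si)
```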
Using Similarity Matrix for Cluster Validation
Order the similarity matrix with respect to cluster labels and inspect it visually.
[Figure: a data set with well-separated clusters and its 100 x 100 similarity matrix sorted by cluster label; the clusters show up as bright blocks along the diagonal.]
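The reordering itself is one line of sorting; a small sketch of ours (the similarity 1/(1 + distance) is an arbitrary monotone choice, not prescribed by the slides):

```python
import math

def sorted_similarity_matrix(points, labels):
    """Return the similarity matrix with rows and columns reordered
    so that points sharing a cluster label are adjacent; with a good
    clustering, high similarities form blocks along the diagonal."""
    order = sorted(range(len(points)), key=lambda i: labels[i])
    return [[1.0 / (1.0 + math.dist(points[i], points[j]))
             for j in order] for i in order]
```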
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: sorted similarity matrices for the DBSCAN, K-means, and Complete Link clusterings of a random data set; the diagonal blocks are far less distinct than for well-separated clusters.]
Using Similarity Matrix for Cluster Validation
[Figure: a DBSCAN clustering with seven clusters of a more complicated data set, and its sorted similarity matrix.]
Clusters in more complicated figures aren't well separated.

Internal Index: used to measure the goodness of a clustering structure without respect to external information.
Internal Measures: SSE
SSE is good for comparing two clusterings or two clusters (average SSE). It can also be used to estimate the number of clusters.
[Figure: a clustered data set and its SSE-vs-K curve (K from 2 to 30); the curve has a knee near the natural number of clusters.]
Internal Measures: SSE
SSE curve for a more complicated data set.
[Figure: the data set (clusters 1-7) and the SSE of clusters found using K-means.]
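An SSE-vs-K curve like the one above can be produced by running K-means for each K and recording the final SSE. Below is a compact pure-Python sketch of Lloyd's algorithm (our own code, not the slides'; it uses a single random initialization, so in practice one would keep the best of several runs):

```python
import math
import random

def kmeans_sse(points, k, iters=50, seed=0):
    """Run Lloyd's K-means on 2-D points and return the final SSE."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: every point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [(sum(x for x, _ in c) / len(c),
                    sum(y for _, y in c) / len(c)) if c else centers[j]
                   for j, c in enumerate(clusters)]
    # SSE of the final solution.
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)
```

Sweeping K, e.g. `{k: kmeans_sse(data, k) for k in range(1, 11)}`, gives the curve; SSE always decreases as K grows, so the estimate of the number of clusters is the knee of the curve, not its minimum.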
Framework for Cluster Validity
We need a framework to interpret any measure; statistics provide such a framework for cluster validity. For comparing the results of two different sets of cluster analyses, a framework is less necessary.
Statistical Framework for SSE
Example: compare the SSE of a clustering against the SSEs obtained by clustering random data.
[Figure: histogram of the SSE values (roughly 0.016 to 0.034) of clusters found in many random data sets, next to one example random data set.]
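One hedged way to realize this framework is a Monte Carlo loop (the code and names below are ours; a single nearest-center assignment pass stands in for a full clustering run): cluster many uniformly random data sets, collect their SSEs, and check how unusual the observed SSE is against that reference distribution.

```python
import random

def random_data_sses(n_points, k, n_trials=200, seed=1):
    """SSEs of crude k-center 'clusterings' of uniformly random
    2-D points, forming a reference distribution."""
    rng = random.Random(seed)
    sses = []
    for _ in range(n_trials):
        pts = [(rng.random(), rng.random()) for _ in range(n_points)]
        centers = rng.sample(pts, k)
        sses.append(sum(min((px - cx) ** 2 + (py - cy) ** 2
                            for cx, cy in centers)
                        for px, py in pts))
    return sses

def empirical_p_value(observed_sse, reference_sses):
    """Fraction of random-data SSEs at least as good (small) as the
    observed SSE; a tiny value suggests the structure is not chance."""
    return (sum(s <= observed_sse for s in reference_sses)
            / len(reference_sses))
```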
Statistical Framework for Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
[Figure: a data set with well-separated clusters, Corr = -0.9235, and a random data set, Corr = -0.5810.]
Internal Measures: Cohesion and Separation
Cluster Cohesion: measures how closely related the objects in a cluster are. Example: squared error, i.e., the within-cluster sum of squares,
WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2.
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters. Example: the between-cluster sum of squares,
BSS = \sum_i |C_i| (m - m_i)^2,
where m_i is the mean of cluster C_i, |C_i| is its size, and m is the overall mean of the data.
Internal Measures: Cohesion and Separation
Example (SSE): four points at 1, 2, 4, 5 on a line, with overall mean m = 3 and cluster means m1 = 1.5 (for {1, 2}) and m2 = 4.5 (for {4, 5}).

K = 2 clusters:
WSS = (1 - 1.5)^2 + (2 - 1.5)^2 + (4 - 4.5)^2 + (5 - 4.5)^2 = 1
BSS = 2 × (3 - 1.5)^2 + 2 × (4.5 - 3)^2 = 9
Total = 1 + 9 = 10

K = 1 cluster:
WSS = (1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2 = 10
BSS = 4 × (3 - 3)^2 = 0
Total = 10 + 0 = 10

Note that Total = WSS + BSS is the same for both values of K.
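The arithmetic above can be checked mechanically with a small 1-D helper (our naming, mirroring the WSS and BSS definitions):

```python
def wss_bss(points, clusters):
    """WSS: squared distances to each cluster mean.
    BSS: cluster sizes times squared distance from the overall
    mean to each cluster mean (1-D version)."""
    m = sum(points) / len(points)   # overall mean of all points
    wss = bss = 0.0
    for c in clusters:
        mi = sum(c) / len(c)        # mean of this cluster
        wss += sum((x - mi) ** 2 for x in c)
        bss += len(c) * (m - mi) ** 2
    return wss, bss

# wss_bss([1, 2, 4, 5], [[1, 2, 4, 5]])   -> (10.0, 0.0)
# wss_bss([1, 2, 4, 5], [[1, 2], [4, 5]]) -> (1.0, 9.0)
```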
Internal Measures: Cohesion and Separation
A proximity-graph-based approach can also be used for cohesion and separation: cohesion is the sum of the weights of the edges within a cluster, and separation is the sum of the weights of the edges between the cluster and the rest of the graph.
Internal Measures: Silhouette Coefficient
The silhouette coefficient combines the ideas of cohesion and separation, but for individual points as well as for clusters and clusterings.
For an individual point i: let a be the average distance of i to the points in its own cluster, and b the minimum, over the other clusters, of the average distance of i to the points in that cluster. The silhouette of i is s = (b - a) / max(a, b); values close to 1 indicate a well-placed point.
The average silhouette width can then be calculated for a cluster or for an entire clustering.
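A direct implementation of the definition (our naming; it assumes Euclidean distance and that every cluster has at least two points):

```python
import math
from statistics import mean

def silhouettes(points, labels):
    """Per-point silhouette s = (b - a) / max(a, b), where a is the
    mean distance to the point's own cluster and b the smallest mean
    distance to any other cluster."""
    out = []
    for i, p in enumerate(points):
        a = mean(math.dist(p, q) for j, q in enumerate(points)
                 if j != i and labels[j] == labels[i])
        b = min(mean(math.dist(p, q) for j, q in enumerate(points)
                     if labels[j] == lab)
                for lab in set(labels) if lab != labels[i])
        out.append((b - a) / max(a, b))
    return out
```

Averaging the returned values over a cluster, or over all points, gives the average silhouette width mentioned above.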
External Measures of Cluster Validity: Entropy and Purity
Purity is the weighted fraction of points in each cluster that belong to that cluster's majority class; entropy is the weighted class-label entropy of each cluster. Both require externally supplied class labels; higher purity and lower entropy indicate better agreement.
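Both measures reduce to per-cluster class counts; an illustrative sketch (our naming; entropy in bits):

```python
import math
from collections import Counter

def purity_and_entropy(cluster_labels, class_labels):
    """Weighted purity and weighted entropy of a clustering against
    externally supplied class labels."""
    n = len(cluster_labels)
    purity = entropy = 0.0
    for lab in set(cluster_labels):
        members = [class_labels[i] for i in range(n)
                   if cluster_labels[i] == lab]
        counts = Counter(members)
        purity += max(counts.values()) / n
        h = -sum((c / len(members)) * math.log2(c / len(members))
                 for c in counts.values())
        entropy += (len(members) / n) * h
    return purity, entropy
```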
Final Comment on Cluster Validity
“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.” (Algorithms for Clustering Data, Jain and Dubes)
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Hierarchical Clustering Time and Space requirements
O(N2) space since it uses the proximity matrix
O(N3) time in many cases
Hierarchical Clustering Problems and Limitations
Once a decision is made to combine two clusters it cannot be undone
No objective function is directly minimized
Different schemes have problems with one or more of the following
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Hierarchical Clustering Problems and Limitations
Once a decision is made to combine two clusters it cannot be undone
No objective function is directly minimized
Different schemes have problems with one or more of the following
MST Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
For cluster analysis the analogous question is how to evaluate the ldquogoodnessrdquo of the resulting clusters But ldquoclusters are in the eye of the beholderrdquo Then why do we want to evaluate them
Clusters found in Random Data
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Random Points
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Complete Link
1 Determining the clustering tendency of a set of data ie distinguishing whether non-random structure actually exists in the data
2 Comparing the results of a cluster analysis to externally known results eg to externally given class labels
3 Evaluating how well the results of a cluster analysis fit the data without reference to external information
- Use only the data 4 Comparing the results of two different sets of cluster analyses to
determine which is better 5 Determining the lsquocorrectrsquo number of clusters For 2 3 and 4 we can further distinguish whether we want to
evaluate the entire clustering or just individual clusters
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
MST Divisive Hierarchical Clustering: Use the minimum spanning tree (MST) for constructing a hierarchy of clusters
Cluster Validity
For supervised classification, we have a variety of measures to evaluate how good our model is.
For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"! Then why do we want to evaluate them?
Clusters found in Random Data
[Four scatter plots of 100 points in the unit square (x, y from 0 to 1): the original Random Points, and the clusterings found by K-means, DBSCAN, and Complete Link.]
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information. Use only the data.
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Different Aspects of Cluster Validation
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices.
Measures of Cluster Validity
Two matrices:
o Proximity matrix
o "Incidence" matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, 0 otherwise.
Compute the correlation between the two matrices.
High correlation indicates that points that belong to the same cluster are close to each other.
Not a good measure for some density- or contiguity-based clusters.
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
[Two scatter plots in the unit square. Corr = -0.9235 and Corr = -0.5810, respectively.]
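The correlation check described above can be sketched in a few lines. The data set, cluster centers, similarity definition, and random seed below are all invented for illustration; since proximity here is a distance (not a similarity), points in the same cluster give small distances but incidence 1, so good clusterings produce a strongly negative correlation, as on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: three well-separated 2-D clusters of 30 points each
centers = ([0.2, 0.2], [0.5, 0.8], [0.8, 0.3])
pts = np.concatenate([rng.normal(c, 0.05, size=(30, 2)) for c in centers])
labels = np.repeat([0, 1, 2], 30)

# Proximity matrix: pairwise Euclidean distances
diff = pts[:, None, :] - pts[None, :, :]
prox = np.sqrt((diff ** 2).sum(-1))

# Incidence matrix: 1 if the two points share a cluster label, 0 otherwise
inc = (labels[:, None] == labels[None, :]).astype(float)

# Correlate only the n(n-1)/2 distinct off-diagonal entries
iu = np.triu_indices(len(pts), k=1)
corr = np.corrcoef(prox[iu], inc[iu])[0, 1]
print(round(corr, 3))  # strongly negative for well-separated clusters
```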
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
[Scatter plot of well-separated clusters in the unit square, and the corresponding 100 x 100 points-by-points similarity matrix (similarity 0 to 1), sorted by cluster label.]
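The sorting trick can be demonstrated numerically. Everything below (the 1-D data, the similarity definition as one minus normalized distance, the group sizes) is a made-up sketch, not the slide's actual data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up 1-D data: two tight groups of 20 points each
pts = np.concatenate([rng.normal(0.2, 0.03, 20), rng.normal(0.8, 0.03, 20)])
labels = np.array([0] * 20 + [1] * 20)

# Similarity = 1 - (distance / max distance)
d = np.abs(pts[:, None] - pts[None, :])
sim = 1 - d / d.max()

# Sort rows and columns by cluster label: crisp clusters appear as bright diagonal blocks
order = np.argsort(labels, kind="stable")
sim = sim[np.ix_(order, order)]

within = sim[:20, :20].mean()   # inside the first diagonal block
across = sim[:20, 20:].mean()   # off-diagonal block
print(within > across)          # True: the diagonal blocks are brighter
```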
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.

[Sorted 100 x 100 similarity matrices (similarity 0 to 1) and the corresponding unit-square scatter plots for the DBSCAN, K-means, and Complete Link clusterings of the random data.]
Using Similarity Matrix for Cluster Validation
[DBSCAN clustering of a more complicated data set, clusters numbered 1-7, with its sorted similarity matrix (about 3000 points, similarity 0 to 1).]

Clusters in more complicated figures aren't well separated.

Internal Index: Used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE).
Can also be used to estimate the number of clusters.
Internal Measures SSE
[Scatter plot of a ten-cluster data set (axes from -6 to 6) and the corresponding SSE-versus-K curve for K = 2 to 30 (SSE from 0 to 10).]
Internal Measures SSE
SSE curve for a more complicated data set
[The more complicated data set with clusters numbered 1-7: SSE of clusters found using K-means.]
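Tracking SSE as K varies, as in the curve above, can be sketched with a minimal Lloyd's-algorithm implementation. The 1-D data, seed, and function name here are invented for illustration:

```python
import random

def kmeans_sse(points, k, iters=50, seed=0):
    """SSE of a K-means clustering (minimal 1-D Lloyd's algorithm sketch)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                 # random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:                              # assignment step
            clusters[min(range(k), key=lambda j: (x - centroids[j]) ** 2)].append(x)
        centroids = [sum(c) / len(c) if c else centroids[j]   # update step
                     for j, c in enumerate(clusters)]
    return sum(min((x - c) ** 2 for c in centroids) for x in points)

data = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]  # made-up data with three groups
for k in (1, 2, 3, 4):
    print(k, round(kmeans_sse(data, k), 2))
```

Plotting SSE against K for such data shows the characteristic elbow at the true number of groups.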
Need a framework to interpret any measure: for example, if our measure of evaluation has the value 10, is that good, fair, or poor?
Statistics provide a framework for cluster validity.
For comparing the results of two different sets of cluster analyses, a framework is less necessary; however, there is the question of whether the difference between two index values is significant.
Framework for Cluster Validity
Example
Statistical Framework for SSE
[Histogram (counts 0 to 50) of SSE values, ranging roughly from 0.016 to 0.034, for clusters found in random data, shown beside the scatter plot of the original well-separated data set.]
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
[The same two scatter plots as before. Corr = -0.9235 and Corr = -0.5810, respectively.]
Cluster Cohesion: Measures how closely related the objects in a cluster are.
Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters.
Example: Squared Error
Internal Measures Cohesion and Separation
Cohesion is measured by the within-cluster sum of squares: WSS = \sum_{i} \sum_{x \in C_i} (x - m_i)^2
Separation is measured by the between-cluster sum of squares: BSS = \sum_{i} |C_i| (m - m_i)^2
where m_i is the centroid of cluster C_i, |C_i| is the number of points in C_i, and m is the overall mean of the data.
Internal Measures Cohesion and Separation
Example: SSE for the points {1, 2, 4, 5} on a line, with overall mean m = 3 and, for K = 2, cluster centroids m1 = 1.5 and m2 = 4.5.

K = 1 cluster:
WSS = (1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2 = 10
BSS = 4 x (3 - 3)^2 = 0
Total = WSS + BSS = 10 + 0 = 10

K = 2 clusters ({1, 2} and {4, 5}):
WSS = (1 - 1.5)^2 + (2 - 1.5)^2 + (4 - 4.5)^2 + (5 - 4.5)^2 = 1
BSS = 2 x (3 - 1.5)^2 + 2 x (4.5 - 3)^2 = 9
Total = WSS + BSS = 1 + 9 = 10
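The worked example above can be checked with a short pure-Python sketch (the function name is an invented helper):

```python
def wss_bss(points, labels):
    """Within- and between-cluster sums of squares (cohesion and separation)."""
    m = sum(points) / len(points)                    # overall mean
    clusters = {}
    for x, c in zip(points, labels):
        clusters.setdefault(c, []).append(x)
    wss = sum((x - sum(cl) / len(cl)) ** 2 for cl in clusters.values() for x in cl)
    bss = sum(len(cl) * (m - sum(cl) / len(cl)) ** 2 for cl in clusters.values())
    return wss, bss

print(wss_bss([1, 2, 4, 5], [0, 0, 0, 0]))  # K=1: (10.0, 0.0)
print(wss_bss([1, 2, 4, 5], [0, 0, 1, 1]))  # K=2: (1.0, 9.0)
```

In both cases WSS + BSS = 10: the total sum of squares is constant, so minimizing WSS is equivalent to maximizing BSS.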
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
[Proximity graph illustration: within-cluster edges measure cohesion; between-cluster edges measure separation.]
Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings.
For an individual point i:
- Calculate a = average distance of i to the points in its own cluster.
- Calculate b = minimum, over the other clusters, of the average distance of i to the points in that cluster.
- The silhouette coefficient is s = (b - a) / max(a, b); the closer to 1, the better.
Can calculate the average silhouette width for a cluster or a clustering.
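The per-point computation of a, b, and s can be sketched in pure Python for 1-D points (the function name and data are invented; libraries such as scikit-learn provide production implementations):

```python
def silhouette(points, labels, i):
    """Silhouette coefficient of point i (pure-Python sketch, 1-D points)."""
    n = len(points)
    same = [points[j] for j in range(n) if labels[j] == labels[i] and j != i]
    # a: average distance to the other points in i's own cluster
    a = sum(abs(points[i] - p) for p in same) / len(same)
    # b: smallest average distance to the points of any other cluster
    b = min(
        sum(abs(points[i] - points[j]) for j in range(n) if labels[j] == c)
        / labels.count(c)
        for c in set(labels) if c != labels[i]
    )
    return (b - a) / max(a, b)

print(silhouette([1, 2, 4, 5], [0, 0, 1, 1], 0))  # (3.5 - 1) / 3.5 ≈ 0.714
```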
External Measures of Cluster Validity Entropy and Purity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis.
Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
— Algorithms for Clustering Data, Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function …
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering – Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids …
Importance of Choosing Initial Centroids …
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Ward's Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information; uses only the data.
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Measures of Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices.
Measuring Cluster Validity Via Correlation
o Two matrices: the proximity matrix and the incidence matrix, whose entry (i, j) is 1 if points i and j belong to the same cluster and 0 otherwise.
o Compute the correlation between the two matrices.
o High correlation indicates that points that belong to the same cluster are close to each other.
o Not a good measure for some density- or contiguity-based clusters.
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
[Figure: two scatter plots (x vs. y, axes 0 to 1); Corr = -0.9235 and Corr = -0.5810]
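The correlation computation described above can be sketched in plain Python. This is an illustrative sketch, not code from the course; `pearson` and `validity_correlation` are hypothetical helper names. Since the proximity here is Euclidean distance, a good clustering yields a strongly negative correlation, consistent with the negative Corr values on this slide.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def validity_correlation(points, labels):
    """Correlate the incidence matrix (1 if same cluster, else 0) with
    the proximity matrix (Euclidean distance), over the upper triangle."""
    prox, inc = [], []
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):  # both matrices are symmetric
            prox.append(math.dist(points[i], points[j]))
            inc.append(1.0 if labels[i] == labels[j] else 0.0)
    return pearson(inc, prox)

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(validity_correlation(pts, [0, 0, 1, 1]))  # strongly negative (near -1)
```

Tight, well-separated clusters drive the correlation toward -1; labels assigned at random push it toward 0.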
Using Similarity Matrix for Cluster Validation
Order the similarity matrix with respect to cluster labels and inspect visually.
[Figure: scatter plot of the data (x vs. y) and the corresponding sorted similarity matrix (points x points, similarity scale 0 to 1)]
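The reordering step itself takes only a few lines of Python. A toy sketch: the 1-D points, the similarity function 1/(1 + |a - b|), and the name `sorted_similarity` are illustrative assumptions, not from the slides.

```python
def sorted_similarity(points, labels):
    """Reorder a similarity matrix so rows/columns of points with the
    same cluster label are adjacent; good clusters then show up as
    high-similarity blocks on the diagonal."""
    order = sorted(range(len(points)), key=lambda i: labels[i])
    return [[1.0 / (1.0 + abs(points[a] - points[b])) for b in order]
            for a in order]

pts = [0.0, 5.0, 0.2, 5.1]                 # two clusters, interleaved
sim = sorted_similarity(pts, [0, 1, 0, 1])
for row in sim:
    print([round(v, 2) for v in row])
# The top-left and bottom-right 2x2 blocks are bright (similarity near 1);
# the off-diagonal blocks are dim.
```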
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: DBSCAN on random data (scatter plot and sorted similarity matrix)]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: K-means on random data (scatter plot)]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: Complete Link on random data (scatter plot and sorted similarity matrix)]
Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN on a more complicated data set (clusters 1-7) and its sorted similarity matrix]
Clusters in more complicated figures aren't well separated.

Internal Measures: SSE
o Internal Index: used to measure the goodness of a clustering structure without respect to external information.
o SSE is good for comparing two clusterings or two clusters (average SSE).
o SSE can also be used to estimate the number of clusters.
[Figure: SSE as a function of K (K = 2 to 30), with a scatter plot of the corresponding data set]
Internal Measures: SSE
SSE curve for a more complicated data set:
[Figure: SSE of clusters found using K-means on the seven-cluster data set]
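Estimating the number of clusters from the SSE curve is usually done by looking for a "knee" where the curve flattens out. A rough plain-Python sketch, using a deliberately simple 1-D k-means; the function names and the synthetic three-cluster data are illustrative assumptions, not part of the slides.

```python
import random

def kmeans_sse(points, k, iters=20, seed=0):
    """Lloyd's algorithm on 1-D data; returns the final SSE (= WSS)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:                       # assignment step
            i = min(range(k), key=lambda j: (x - centroids[j]) ** 2)
            clusters[i].append(x)
        centroids = [sum(c) / len(c) if c else centroids[i]   # update step
                     for i, c in enumerate(clusters)]
    return sum((x - centroids[i]) ** 2
               for i, c in enumerate(clusters) for x in c)

# Three well-separated 1-D clusters around 0, 10, and 20.
rng = random.Random(1)
data = [c + rng.gauss(0, 0.5) for c in (0, 10, 20) for _ in range(30)]
for k in range(1, 6):
    print(k, round(kmeans_sse(data, k), 1))
# SSE typically drops sharply until K reaches the true number of clusters.
```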
Framework for Cluster Validity
o We need a framework to interpret any measure.
o Statistics provide a framework for cluster validity.
o For comparing the results of two different sets of cluster analyses, a framework is less necessary.
Statistical Framework for SSE
Example:
[Figure: histogram of SSE values (count vs. SSE) and a scatter plot of the data (x vs. y)]
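The histogram here likely illustrates the standard resampling idea: cluster many random data sets to build a null distribution of SSE, then check whether the observed SSE is unusually low. A stdlib-only sketch of that idea, with a crude threshold-based two-cluster split standing in for k-means; all names and data below are illustrative assumptions.

```python
import random

def best_split_sse(xs):
    """SSE of the best two-cluster split of 1-D data, minimizing over
    candidate thresholds (a crude stand-in for k-means with K=2)."""
    best = float("inf")
    for t in xs:
        left = [x for x in xs if x <= t]
        right = [x for x in xs if x > t]
        if not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((x - ml) ** 2 for x in left)
               + sum((x - mr) ** 2 for x in right))
        best = min(best, sse)
    return best

rng = random.Random(0)
# Null distribution: SSE of clusterings of uniformly random data.
null = [best_split_sse([rng.random() for _ in range(40)]) for _ in range(200)]
# Observed: data with two genuinely tight clusters.
observed = best_split_sse([rng.gauss(0.2, 0.02) for _ in range(20)]
                          + [rng.gauss(0.8, 0.02) for _ in range(20)])
print(observed, min(null))  # observed falls far below the null distribution
```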
Statistical Framework for Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets:
[Figure: two scatter plots (x vs. y); Corr = -0.9235 and Corr = -0.5810]
Internal Measures: Cohesion and Separation
o Cluster Cohesion: measures how closely related the objects in a cluster are. Example (squared error): the within-cluster sum of squares,
  WSS = Σ_i Σ_{x ∈ C_i} (x − m_i)²
o Cluster Separation: measures how distinct or well-separated a cluster is from other clusters. Example: the between-cluster sum of squares,
  BSS = Σ_i |C_i| (m − m_i)²
  where m_i is the centroid of cluster C_i and m is the overall mean.

Internal Measures: Cohesion and Separation
Example: SSE for the points 1, 2, 4, 5 (overall mean m = 3; cluster centroids m1 = 1.5 and m2 = 4.5):
o K = 1 cluster:
  WSS = (1 − 3)² + (2 − 3)² + (4 − 3)² + (5 − 3)² = 10
  BSS = 4 × (3 − 3)² = 0
  Total = WSS + BSS = 10 + 0 = 10
o K = 2 clusters, {1, 2} and {4, 5}:
  WSS = (1 − 1.5)² + (2 − 1.5)² + (4 − 4.5)² + (5 − 4.5)² = 1
  BSS = 2 × (1.5 − 3)² + 2 × (4.5 − 3)² = 9
  Total = WSS + BSS = 1 + 9 = 10

Internal Measures: Cohesion and Separation
o A proximity-graph-based approach can also be used for cohesion and separation.
[Figure: proximity graph illustrating cohesion (within-cluster edges) and separation (between-cluster edges)]
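The WSS/BSS arithmetic in the example above is easy to check in code. A minimal sketch (the function names are illustrative); note that WSS + BSS stays constant across clusterings.

```python
def wss(clusters):
    """Within-cluster sum of squares: sum over clusters of (x - m_i)^2."""
    total = 0.0
    for pts in clusters:
        m = sum(pts) / len(pts)                # cluster centroid m_i
        total += sum((x - m) ** 2 for x in pts)
    return total

def bss(clusters):
    """Between-cluster sum of squares: sum of |C_i| * (m - m_i)^2,
    where m is the overall mean."""
    all_pts = [x for pts in clusters for x in pts]
    m = sum(all_pts) / len(all_pts)
    return sum(len(pts) * (sum(pts) / len(pts) - m) ** 2 for pts in clusters)

print(wss([[1, 2, 4, 5]]), bss([[1, 2, 4, 5]]))      # K=1: 10.0 0.0
print(wss([[1, 2], [4, 5]]), bss([[1, 2], [4, 5]]))  # K=2: 1.0 9.0
# WSS + BSS = 10 in both cases.
```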
Internal Measures: Silhouette Coefficient
o The silhouette coefficient combines the ideas of both cohesion and separation, but for individual points as well as for clusters and clusterings.
o For an individual point i: let a be the average distance of i to the points in its own cluster, and b the lowest average distance of i to the points in another cluster; the silhouette coefficient of i is s = (b − a) / max(a, b).
o We can also calculate the average silhouette width for a cluster or a clustering.
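A sketch of the per-point silhouette computation, s = (b − a) / max(a, b), on 1-D data; the data and the function name are illustrative, not from the slides.

```python
def silhouette(i, points, labels):
    """Silhouette coefficient of point i: a is the average distance to
    its own cluster, b the lowest average distance to another cluster."""
    own = [abs(points[i] - points[j]) for j in range(len(points))
           if j != i and labels[j] == labels[i]]
    a = sum(own) / len(own)
    other = {}
    for j in range(len(points)):
        if labels[j] != labels[i]:
            other.setdefault(labels[j], []).append(abs(points[i] - points[j]))
    b = min(sum(d) / len(d) for d in other.values())
    return (b - a) / max(a, b)

pts = [1.0, 1.2, 5.0, 5.3]
print(silhouette(0, pts, [0, 0, 1, 1]))  # about 0.95: point 0 is well placed
```

Values near 1 indicate a point deep inside a well-separated cluster; values near 0 (or negative) flag points on cluster boundaries or in the wrong cluster.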
External Measures of Cluster Validity: Entropy and Purity

Final Comment on Cluster Validity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
Algorithms for Clustering Data, Jain and Dubes
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function …
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering – Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids …
Importance of Choosing Initial Centroids …
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Ward's Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types o External Index
o Internal Index
o Relative Index
Sometimes these are referred to as criteria instead of indices
Measures of Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Two matrices
Compute the correlation between the two matrices
High correlation indicates that points that belong to the same cluster are close to each other
Not a good measure for some density or contiguity based clusters
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Order the similarity matrix with respect to cluster labels and inspect visually
Using Similarity Matrix for Cluster Validation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
— Algorithms for Clustering Data, Jain and Dubes
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function …
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering – Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids …
Importance of Choosing Initial Centroids …
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Ward's Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Using Similarity Matrix for Cluster Validation

Order the similarity matrix with respect to cluster labels and inspect visually.
[Figure: scatter plot of the clustered data (x, y) and the corresponding similarity matrix reordered by cluster label (points × points; similarity from 0 to 1), showing block structure along the diagonal.]
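A minimal sketch of the reordering step (the 1/(1 + distance) similarity used here is an arbitrary choice for illustration, not the slide's measure):

```python
import math

def sorted_similarity_matrix(points, labels):
    """Reorder points by cluster label, then build a similarity
    matrix; well-separated clusters show up as high-similarity
    blocks on the diagonal."""
    order = sorted(range(len(points)), key=lambda i: labels[i])
    return [[1.0 / (1.0 + math.dist(points[i], points[j]))
             for j in order]
            for i in order]

pts = [(0, 0), (5, 5), (0.1, 0.1), (5.1, 5.1)]
labs = [0, 1, 0, 1]
sim = sorted_similarity_matrix(pts, labs)
# After sorting, rows 0-1 are cluster 0 and rows 2-3 are cluster 1:
# within-cluster entries are high, cross-cluster entries are low.
print(sim[0][1], sim[0][2])
```

Plotting `sim` as a heat map (e.g. with matplotlib's imshow) reproduces the block-diagonal pictures on these slides.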
Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.
[Figure: DBSCAN clustering of random data (x, y scatter) and its similarity matrix, reordered by cluster label.]
Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.
[Figure: K-means clustering of random data (x, y scatter) and its similarity matrix, reordered by cluster label.]
Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp.
[Figure: complete-link clustering of random data (x, y scatter) and its similarity matrix, reordered by cluster label.]
Using Similarity Matrix for Cluster Validation

[Figure: DBSCAN clustering of a more complicated data set, with clusters labeled 1–7, and its similarity matrix, reordered by cluster label.]
Clusters in more complicated figures aren't well separated.

Internal Measures: SSE

Internal index: used to measure the goodness of a clustering structure without respect to external information.
SSE is good for comparing two clusterings or two clusters (average SSE).
It can also be used to estimate the number of clusters.
[Figure: scatter plot of the data (x from 5 to 15, y from −6 to 6) and the SSE vs. K curve (K from 2 to 30, SSE from 0 to 10).]
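Estimating the number of clusters from an SSE-vs-K curve can be sketched as follows. To keep the example exact, this uses brute-force optimal 1-D clustering rather than K-means (in one dimension, optimal clusters are contiguous, so it suffices to try the k − 1 split points); the "elbow" where SSE stops dropping sharply suggests the number of clusters.

```python
from itertools import combinations

def best_sse(points, k):
    """Optimal SSE for 1-D data by brute-forcing the k-1 split
    points over the sorted data (optimal 1-D clusters are
    contiguous segments)."""
    pts = sorted(points)
    n = len(pts)

    def sse(seg):
        m = sum(seg) / len(seg)
        return sum((x - m) ** 2 for x in seg)

    best = None
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        total = sum(sse(pts[bounds[i]:bounds[i + 1]]) for i in range(k))
        best = total if best is None else min(best, total)
    return best

data = [1, 2, 3, 11, 12, 13]  # two obvious groups
curve = [best_sse(data, k) for k in range(1, 5)]
print(curve)  # -> [154.0, 4.0, 2.5, 1.0]
```

The huge drop from K = 1 to K = 2, followed by diminishing returns, is the elbow that points at the true number of clusters (here, 2).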
Internal Measures: SSE

SSE curve for a more complicated data set.
[Figure: data set with clusters labeled 1–7 and the SSE of clusters found using K-means.]
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
DBSCAN
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
K-means
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Points
Poin
ts
20 40 60 80 100
10
20
30
40
50
60
70
80
90
100Similarity
0
01
02
03
04
05
06
07
08
09
1
Complete Link
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Using Similarity Matrix for Cluster Validation
1 2
3
5
6
4
7
DBSCAN
0
01
02
03
04
05
06
07
08
09
1
500 1000 1500 2000 2500 3000
500
1000
1500
2000
2500
3000
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Internal Measures: Silhouette Coefficient
The silhouette coefficient combines the ideas of cohesion and separation, but for individual points as well as for clusters and clusterings.
For an individual point i:
    a = the average distance of i to the other points in its own cluster
    b = the minimum, over the other clusters, of the average distance of i to the points in that cluster
    s = (b - a) / max(a, b)
The average silhouette width can then be calculated for a cluster or for an entire clustering.
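A minimal sketch of the per-point computation (written for this transcript, assuming every cluster has at least two points; the function name `silhouette` is illustrative):

```python
import math

def silhouette(points, labels, i):
    """Silhouette coefficient s(i) = (b - a) / max(a, b) for point i:
    a = mean distance to the other points in i's own cluster,
    b = smallest mean distance to the points of any other cluster.
    Assumes i's cluster contains at least two points."""
    def mean_dist(idxs):
        return sum(math.dist(points[i], points[j]) for j in idxs) / len(idxs)
    own = [j for j, lab in enumerate(labels) if lab == labels[i] and j != i]
    other_labels = {lab for lab in labels if lab != labels[i]}
    a = mean_dist(own)
    b = min(mean_dist([j for j, lab in enumerate(labels) if lab == lab2])
            for lab2 in other_labels)
    return (b - a) / max(a, b)

# Well-separated blobs: each point's silhouette is close to 1.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]
s0 = silhouette(pts, labels, 0)
```

Averaging `silhouette` over all points gives the average silhouette width of the clustering; values near 1 indicate tight, well-separated clusters, while values near 0 or below indicate overlap or misassignment.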
External Measures of Cluster Validity: Entropy and Purity
Final Comment on Cluster Validity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
(Algorithms for Clustering Data, Jain and Dubes)
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function …
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering – Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids …
Importance of Choosing Initial Centroids …
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Ward's Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Clusters in more complicated figures arenrsquot well separated Internal Index Used to measure the goodness of a clustering structure
without respect to external information
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
Internal Measures SSE
2 5 10 15 20 25 300
1
2
3
4
5
6
7
8
9
10
K
SSE
5 10 15
-6
-4
-2
0
2
4
6
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Internal Measures SSE
SSE curve for a more complicated data set
1 2
3
5
6
4
7
SSE of clusters found using K-means
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Need a framework to interpret any measure
Statistics provide a framework for cluster validity
For comparing the results of two different sets of cluster analyses a framework is less necessary
Framework for Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Example
Statistical Framework for SSE
0016 0018 002 0022 0024 0026 0028 003 0032 00340
5
10
15
20
25
30
35
40
45
50
SSE
Cou
nt
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
Statistical Framework for Correlation
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
0 02 04 06 08 10
01
02
03
04
05
06
07
08
09
1
x
y
Corr = -09235 Corr = -05810
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function hellip
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering ndash Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids hellip
Importance of Choosing Initial Centroids hellip
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Wardrsquos Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity
Cluster Cohesion Measures how closely related are objects in a cluster
Cluster Separation Measure how distinct or well-separated a cluster is from other clusters
Example Squared Error
Internal Measures Cohesion and Separation
sum sumisin
minus=i Cx
ii
mxWSS 2)(
sum minus=i
ii mmCBSS 2)(
Internal Measures Cohesion and Separation
Example SSE
1 2 3 4 5 times times times m1 m2
m
10919)354(2)513(2
1)545()544()512()511(22
2222
=+==minustimes+minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=2 clusters
100100)33(4
10)35()34()32()31(2
2222
=+==minustimes=
=minus+minus+minus+minus=
TotalBSSWSSK=1 cluster
A proximity graph based approach can also be used for cohesion and separation
Internal Measures Cohesion and Separation
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation but for individual points as well as clusters and clusterings
For an individual point i
Can calculate the Average Silhouette width for a cluster or a clustering
Internal Measures Silhouette Coefficient
ab
External Measures of Cluster Validity Entropy and Purity
ldquoThe validation of clustering structures is the most difficult and frustrating part of cluster analysis
Without a strong effort in this direction cluster analysis will remain a black art accessible only to those true believers who have experience and great couragerdquo
Algorithms for Clustering Data Jain and Dubes
Final Comment on Cluster Validity
Slide Number 1
What is Cluster Analysis
Applications of Cluster Analysis
What is not Cluster Analysis
Notion of a Cluster can be Ambiguous
Types of Clusterings
Partitional Clustering
Hierarchical Clustering
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters Well-Separated
Types of Clusters Center-Based
Types of Clusters Contiguity-Based
Types of Clusters Density-Based
Types of Clusters Conceptual Clusters
Types of Clusters Objective Function
Types of Clusters Objective Function …
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering – Details
Two different K-means Clusterings
Importance of Choosing Initial Centroids
Importance of Choosing Initial Centroids
Evaluating K-means Clusters
Importance of Choosing Initial Centroids …
Importance of Choosing Initial Centroids …
Problems with Selecting Initial Points
10 Clusters Example
10 Clusters Example
10 Clusters Example
10 Clusters Example
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Bisecting K-means Example
Limitations of K-means
Limitations of K-means Differing Sizes
Limitations of K-means Differing Density
Limitations of K-means Non-globular Shapes
Overcoming K-means Limitations
Overcoming K-means Limitations
Overcoming K-means Limitations
Hierarchical Clustering
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
How to Define Inter-Cluster Similarity
Cluster Similarity MIN or Single Link
Hierarchical Clustering MIN
Strength of MIN
Limitations of MIN
Cluster Similarity MAX or Complete Linkage
Hierarchical Clustering MAX
Strength of MAX
Limitations of MAX
Cluster Similarity Group Average
Hierarchical Clustering Group Average
Hierarchical Clustering Group Average
Cluster Similarity Ward's Method
Hierarchical Clustering Comparison
Hierarchical Clustering Time and Space requirements
Hierarchical Clustering Problems and Limitations
MST Divisive Hierarchical Clustering
MST Divisive Hierarchical Clustering
Cluster Validity
Clusters found in Random Data
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Measuring Cluster Validity Via Correlation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Using Similarity Matrix for Cluster Validation
Internal Measures SSE
Internal Measures SSE
Framework for Cluster Validity
Statistical Framework for SSE
Statistical Framework for Correlation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Cohesion and Separation
Internal Measures Silhouette Coefficient
External Measures of Cluster Validity Entropy and Purity
Final Comment on Cluster Validity