Clustering
Hierarchical clustering and k-means clustering
Genome 373
Genomic Informatics
Elhanan Borenstein
The clustering problem:
partition genes into distinct sets with high homogeneity and high separation
Many different representations
Many possible distance metrics; the metric matters
Homogeneity vs. separation
A quick review
A good clustering solution should have two features:
1. High homogeneity: homogeneity measures the similarity between genes assigned to the same cluster.
2. High separation: separation measures the distance/dissimilarity between clusters. (If two clusters have similar expression patterns, they should probably be merged into one cluster.)
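These two criteria can be made concrete with average pairwise distances. The sketch below is illustrative (the points and function names are not from the lecture), using Euclidean distance:

```python
# Sketch: quantifying homogeneity (within-cluster spread) and separation
# (between-cluster distance) via average pairwise Euclidean distances.
import math
from itertools import combinations

def homogeneity(cluster):
    """Average pairwise distance within a cluster (lower = more homogeneous)."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def separation(c1, c2):
    """Average distance between members of two clusters (higher = better separated)."""
    return sum(math.dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

c1 = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]  # a tight cluster near the origin
c2 = [(5.0, 5.0), (6.0, 5.0)]              # a second cluster far away
print(homogeneity(c1), separation(c1, c2))
```

For a good clustering, the within-cluster averages should be much smaller than the between-cluster averages, as they are here.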
The clustering problem
“Unsupervised learning” problem
No single solution is necessarily the true or correct one!
There is usually a tradeoff between homogeneity and separation:
More clusters → increased homogeneity but decreased separation
Fewer clusters → increased separation but reduced homogeneity
Method matters; metric matters; definitions matter.
There are many formulations of the clustering problem; most of them are NP-hard (why?).
In most cases, heuristic methods or approximations are used.
The “philosophy” of clustering
Many algorithms:
Hierarchical clustering
k-means
self-organizing maps (SOM)
kNN
PCC
CAST
CLICK
The results (i.e., the obtained clusters) can vary drastically depending on:
Clustering method
Parameters specific to each clustering method (e.g., the number of centers for k-means, the agglomeration rule for hierarchical clustering, etc.)
One problem, numerous solutions
Hierarchical clustering
An agglomerative clustering method
Takes as input a distance matrix
Progressively regroups the closest objects/groups
The result is a tree - intermediate nodes represent clusters
Branch lengths represent distances between clusters
Hierarchical clustering
[Figure: dendrogram over objects 1-5, with leaf nodes, branches, intermediate cluster nodes c1-c4, and a root]
Tree representation

         object 1  object 2  object 3  object 4  object 5
object 1     0.00      4.00      6.00      3.50      1.00
object 2     4.00      0.00      6.00      2.00      4.50
object 3     6.00      6.00      0.00      5.50      6.50
object 4     3.50      2.00      5.50      0.00      4.00
object 5     1.00      4.50      6.50      4.00      0.00
Distance matrix
mmm… Déjà vu anyone?
1. Assign each object to a separate cluster.
2. Find the pair of clusters with the shortest distance, and regroup them into a single cluster.
3. Repeat 2 until there is a single cluster.
Hierarchical clustering algorithm
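The three steps above can be sketched in a few lines. This is an illustrative implementation (not the lecture's own code), run with single linkage on the 5-object distance matrix from the slides:

```python
# Agglomerative clustering sketch: start with singleton clusters, then
# repeatedly merge the closest pair until one cluster remains.
D = {
    (1, 2): 4.0, (1, 3): 6.0, (1, 4): 3.5, (1, 5): 1.0,
    (2, 3): 6.0, (2, 4): 2.0, (2, 5): 4.5,
    (3, 4): 5.5, (3, 5): 6.5,
    (4, 5): 4.0,
}

def d(a, b):
    # Symmetric lookup in the upper-triangular distance matrix.
    return D[(a, b)] if (a, b) in D else D[(b, a)]

def single_linkage(g1, g2):
    # Distance between the closest objects of the two groups.
    return min(d(a, b) for a in g1 for b in g2)

# Step 1: assign each object to a separate cluster.
clusters = [{i} for i in range(1, 6)]
merges = []
# Steps 2-3: merge the closest pair of clusters until a single cluster remains.
while len(clusters) > 1:
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
    )
    merged = clusters[i] | clusters[j]
    merges.append(sorted(merged))
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

print(merges)
```

On this matrix, objects 1 and 5 (distance 1.0) merge first, then 2 and 4 (distance 2.0), matching the intermediate nodes of the tree.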
Hierarchical clustering
One needs to define a (dis)similarity metric between two groups. There are several possibilities:
Average linkage: the average distance between objects from groups A and B
Single linkage: the distance between the closest objects from groups A and B
Complete linkage: the distance between the most distant objects from groups A and B
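The three agglomeration rules can be computed for a small illustrative pair of groups (the points below are chosen for clarity, not taken from the slides):

```python
# Average, single, and complete linkage between two groups A and B,
# using Euclidean distances over all cross-group pairs.
import math

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 0.0), (6.0, 0.0)]

cross = [math.dist(a, b) for a in A for b in B]  # all cross-group distances
average_linkage = sum(cross) / len(cross)  # mean of 4.0, 6.0, 3.0, 5.0
single_linkage = min(cross)                # the closest pair of objects
complete_linkage = max(cross)              # the most distant pair of objects
print(single_linkage, average_linkage, complete_linkage)
```

Note that single linkage is always the smallest of the three and complete linkage the largest, which is why they produce such different trees.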
Impact of the agglomeration rule
These four trees were built from the same distance matrix, using four different agglomeration rules.
Note: these trees were computed from a matrix of random numbers. The impression of structure is thus a complete artifact.
Single linkage typically creates nested, chain-like clusters.
Complete linkage creates more balanced trees.
Hierarchical clustering result
Five clusters
K-means clustering
Divisive
Non-hierarchical
An algorithm for partitioning n observations/points into k clusters such that each observation belongs to the cluster with the nearest mean/center
Note that this is a somewhat strange definition:
Assignment of a point to a cluster is based on the proximity of the point to the cluster mean
But the cluster mean is calculated based on all the points assigned to the cluster.
K-means clustering
[Figure: points partitioned into two clusters, with the mean of each cluster marked]
The chicken-and-egg problem:
I do not know the means before I determine the partitioning into clusters.
I do not know the partitioning into clusters before I determine the means.
Key principle - cluster around mobile centers:
Start with some random locations of the means/centers, partition into clusters according to these centers, and then correct the centers according to the clusters (somewhat similar to the expectation-maximization algorithm).
K-means clustering: Chicken and egg
The number of centers, k, has to be specified a priori
Algorithm:
1. Arbitrarily select k initial centers
2. Assign each element to the closest center
3. Re-calculate centers (mean position of the assigned elements)
4. Repeat 2 and 3 until one of the following termination conditions is reached:
i. The clusters are the same as in the previous iteration
ii. The difference between two iterations is smaller than a specified threshold
iii. The maximum number of iterations has been reached
K-means clustering algorithm
How can we do this efficiently?
Assigning elements to the closest center
Partitioning the space
[Figure, built up over several slides: the plane around centers A, B, and C is divided into regions; each region contains the points closest to one center ("closest to A", "closest to B", "closest to C")]
Decomposition of a metric space determined by distances to a specified discrete set of “centers” in the space
Each colored cell represents the collection of all points in this space that are closer to a specific center s than to any other center
Several algorithms exist to find the Voronoi diagram.
Voronoi diagram
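In other words, the assignment step computes, for each point, which center's Voronoi cell it falls in. A minimal sketch (the centers and test points are illustrative):

```python
# For each point, find the center whose Voronoi cell contains it,
# i.e. the nearest center under Euclidean distance.
import math

centers = {"A": (0.0, 0.0), "B": (4.0, 0.0), "C": (2.0, 3.0)}

def voronoi_cell(p):
    """Name of the center whose Voronoi cell contains point p."""
    return min(centers, key=lambda name: math.dist(p, centers[name]))

print(voronoi_cell((0.5, 0.2)), voronoi_cell((3.9, 0.1)), voronoi_cell((2.0, 2.9)))
```

Computing the full diagram explicitly is only needed when the cell boundaries themselves are of interest; for k-means, a nearest-center lookup like this suffices.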
The number of centers, k, has to be specified a priori
Algorithm:
1. Arbitrarily select k initial centers
2. Assign each element to the closest center (Voronoi)
3. Re-calculate centers (mean position of the assigned elements)
4. Repeat 2 and 3 until one of the following termination conditions is reached:
i. The clusters are the same as in the previous iteration
ii. The difference between two iterations is smaller than a specified threshold
iii. The maximum number of iterations has been reached
K-means clustering algorithm
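The full algorithm above can be sketched in a few lines. This is an illustrative implementation (assumed Euclidean distance, hand-picked initial centers), not the lecture's own code:

```python
# Minimal k-means: assign points to the closest center, re-calculate the
# centers, and repeat until nothing changes (termination condition i).
import math

def mean(points):
    # Coordinate-wise mean of a non-empty list of points.
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def kmeans(points, centers, max_iter=100):
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign each element to the closest center (its Voronoi cell).
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3: re-calculate each center as the mean of its assigned elements.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        # Termination condition (i): the clusters (and centers) are unchanged.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
print(centers)
```

On these six points the algorithm converges after a single update, with each center landing on the mean of its three assigned points.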
K-means clustering example
Two sets of points are randomly generated: 200 centered on (0,0) and 50 centered on (1,1).
K-means clustering example
Two points are randomly chosen as centers (stars).
K-means clustering example
Each dot can now be assigned to the cluster with the closest center.
K-means clustering example
First partition into clusters.
K-means clustering example
Centers are re-calculated.
K-means clustering example
And the new centers are again used to partition the points.
K-means clustering example
Second partition into clusters.
K-means clustering example
Re-calculating centers again.
K-means clustering example
And we can again partition the points.
K-means clustering example
Third partition into clusters.
K-means clustering example
After 6 iterations, the calculated centers remain stable.
K-means clustering: Summary
The convergence of k-means is usually quite fast (sometimes a single iteration yields a stable solution).
K-means is time- and memory-efficient.
Strengths:
Simple to use
Fast
Can be used with very large data sets
Weaknesses:
The number of clusters has to be predetermined
The results may vary depending on the initial choice of centers
K-means clustering: Variations
Expectation-maximization (EM): maintains probabilistic (soft) assignments to clusters instead of deterministic ones, and models each cluster with a multivariate Gaussian distribution instead of just a mean.
k-means++: attempts to choose better starting points.
Some variations attempt to escape local optima by swapping points between clusters
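The k-means++ seeding idea can be sketched as follows: after a uniformly random first center, each new center is drawn with probability proportional to its squared distance from the nearest center chosen so far. This is an illustrative implementation, not the lecture's code:

```python
# k-means++ seeding sketch: spread the initial centers out by sampling
# points far from the centers already chosen.
import math
import random

def kmeans_pp_init(points, k, seed=0):
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Distant points are proportionally more likely to become centers.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = kmeans_pp_init(points, k=2)
print(centers)
```

Because an already-chosen point has weight zero, the same point is never picked twice, and the seeding tends to place one center per natural cluster.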
The take-home message
[Figure: a guide for choosing a clustering approach (D'haeseleer, 2005): hierarchical clustering, k-means clustering, or something else?]
What else are we missing?
What if the clusters are not “linearly separable”?
What else are we missing?
Clustering in both dimensions
We can cluster genes, conditions (samples), or both.
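Concretely, clustering conditions rather than genes just means clustering the columns of the expression matrix, i.e. the rows of its transpose. A tiny illustrative sketch (the matrix values are made up):

```python
# An expression matrix can be clustered along either dimension:
# rows are gene profiles; rows of the transpose are condition profiles.
matrix = [
    [2.0, 1.9, 0.1],  # gene 1 measured in 3 conditions
    [2.1, 2.0, 0.2],  # gene 2
    [0.0, 0.1, 1.8],  # gene 3
]

genes = matrix                                # cluster these rows to group genes
conditions = [list(c) for c in zip(*matrix)]  # cluster these rows to group conditions

print(len(genes), "gene profiles;", len(conditions), "condition profiles")
```

Either set of profiles can then be fed to hierarchical clustering or k-means unchanged; clustering both dimensions at once (biclustering) requires dedicated methods.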