Transcript of "Clustering: Hierarchical" (Applied Multivariate Analysis)

Page 1:

Clustering: Hierarchical - Applied Multivariate Analysis

Lecturer: Darren Homrighausen, PhD

Page 2:

From K-means to hierarchical clustering

Recall two properties of K-means clustering:

1. It fits exactly K clusters.

2. Final clustering assignments depend on the chosen initial cluster centers.

Alternatively, we can use hierarchical clustering. This has the advantage that

1. No need to choose the number of clusters beforehand.

2. There is no random component (nor choice of starting point).

Page 3:

From K-means to hierarchical clustering

There is a catch: we need to choose a way to measure dissimilarity between clusters, called the linkage.

Given the linkage, hierarchical clustering produces a sequence of clustering assignments.

At one end, all points are in their own cluster.

At the other, all points are in one cluster.

In the middle, there are nontrivial solutions.

Page 4:

Agglomerative vs. divisive

Two types of hierarchical clustering algorithms

Agglomerative (i.e., bottom-up): Start with each point in its own cluster and merge until all points are in the same cluster (think of forward selection).

Divisive (i.e., top-down): Start with all points in one cluster; until every point is in its own cluster, repeatedly split a group into the two parts with the biggest dissimilarity (think of backward selection).

Agglomerative methods are simpler, so we’ll focus on them.

Page 5:

Agglomerative example

Given these data points, an agglomerative algorithm might decide on the following clustering sequence; an R sketch of reading off such a sequence follows the list. (Important: different choices of linkage would result in different solutions.)

[Figure: scatterplot of the seven data points, labeled 1-7, in the (x1, x2) plane.]

1. {1}, {2}, {3}, {4}, {5}, {6}, {7}
2. {1, 2}, {3}, {4}, {5}, {6}, {7}
3. {1, 2}, {3}, {5}, {6}, {4, 7}
4. {1, 2, 6}, {3}, {5}, {4, 7}
5. {1, 2, 4, 6, 7}, {3}, {5}
6. {1, 2, 3, 4, 6, 7}, {5}
7. {1, 2, 3, 4, 5, 6, 7}
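In R, a sequence of assignments like this can be read off a fitted tree with cutree, one value of k per step. A minimal sketch with made-up coordinates (not the lecture's seven points; the exact sequence depends on the data and the linkage):

set.seed(1)
x = matrix(runif(14), ncol = 2)                # 7 made-up points in the plane
out = hclust(dist(x), method = 'complete')
sapply(7:1, function(k) cutree(out, k = k))    # one column of cluster labels per step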

Page 6:

Reminder: What's a dendrogram?

We encountered dendrograms when we talked about classification and regression trees.

Dendrogram: A convenient graphic to display a hierarchical sequence of clustering assignments. This is simply a tree where:

• Each branch represents a group

• Each leaf (terminal) node is a singleton (i.e., a group containing a single data point)

• The root node is a group containing the whole data set

• Each internal node has two daughter nodes (children), representing the groups that were merged to form it.

Remember: the choice of linkage determines how we measure dissimilarity between groups.

Each internal node is drawn at a height proportional to the linkage distance between its two daughter nodes.

Page 7:

We can also represent the sequence of clustering assignments as a dendrogram.

[Figure: the scatterplot of the seven points in (x1, x2) next to the corresponding dendrogram, hclust (*, "complete"), with height on the vertical axis running from 0 to about 1.4.]

Note that cutting the dendrogram horizontally partitions the data points into clusters.

Page 8:

Back to the example

[Figure: the same scatterplot and complete-linkage dendrogram as on the previous slide.]

For instance, the linkage distance between the cluster {4, 7} and the cluster {1, 2, 6} is about 0.65.

Page 9:

Linkages

Notation: Define X1, . . . , Xn to be the data.

Let the dissimilarities be dij between each pair Xi, Xj.

At any level, a clustering assignment can be expressed by sets G = {i1, i2, . . . , ir} giving the indices of the points in that group. Define |G| to be the size of G.

Linkage: The function d(G, H) that takes two groups G, H and returns the linkage distance between them.

Agglomerative clustering, given the linkage:

• Start with each point in its own group

• Until there is only one cluster, repeatedly merge the two groups G, H that minimize d(G, H) (a minimal sketch of this loop follows).
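A minimal R sketch of this procedure (not the hclust implementation; it assumes a full dissimilarity matrix d, e.g. as.matrix(dist(x)), and a linkage function that summarizes the cross-group dissimilarities):

agglomerate = function(d, linkage = min) {
  groups = as.list(seq_len(nrow(d)))     # start: every point is its own group
  merges = list()
  while (length(groups) > 1) {
    best = c(1, 2); best.val = Inf
    for (g in seq_along(groups)) for (h in seq_along(groups)) {
      if (g < h) {
        val = linkage(d[groups[[g]], groups[[h]]])   # linkage over the cross block
        if (val < best.val) { best.val = val; best = c(g, h) }
      }
    }
    merges[[length(merges) + 1]] = list(merged = groups[best], height = best.val)
    groups[[best[1]]] = c(groups[[best[1]]], groups[[best[2]]])
    groups[[best[2]]] = NULL
  }
  merges   # merge history: which groups were joined, and at what linkage distance
}
# agglomerate(as.matrix(dist(x)))                  # single linkage (min)
# agglomerate(as.matrix(dist(x)), linkage = max)   # complete linkage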

Page 10:

Single linkage

In single linkage (a.k.a. nearest-neighbor linkage), the linkage distance between G and H is the smallest dissimilarity between two points in different groups:

dsingle(G, H) = min_{i∈G, j∈H} dij

Example: There are two clusters, G and H (red and blue). The single linkage distance, dsingle(G, H), is the dissimilarity between the closest pair (the length of the black line segment).

[Figure: two groups of points in the (x1, x2) plane with a black segment marking d(G, H) between the closest pair.]
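Computed directly from a dissimilarity matrix, single linkage is just the minimum over the cross-group block. A small sketch (D = as.matrix(dist(x)); G and H are index vectors for the two groups):

d.single = function(D, G, H) min(D[G, H])   # smallest cross-group dissimilarity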

Page 11:

Single linkage example

Here n = 60, Xi ∈ R2, dij = ||Xi − Xj||2. Cutting the tree at h = 0.8 gives the cluster assignments marked by colors.

[Figure: single-linkage dendrogram of the 60 points, hclust (*, "single"), with height from 0 to 1, alongside a scatterplot of the points in (X1, X2) colored by the clusters from the cut at h = 0.8.]

Cut interpretation: For each point Xi, there is another point Xj in the same cluster with dij ≤ 0.8 (assuming more than one point in the cluster). Also, no points in different clusters are closer than 0.8.

Page 12:

Complete linkage

In complete linkage (i.e. farthest-neighbor linkage), the dissimilarity between G and H is the largest dissimilarity between two points in different clusters:

dcomplete(G, H) = max_{i∈G, j∈H} dij

Example: There are two clusters, G and H (red and blue). The complete linkage distance, dcomplete(G, H), is the dissimilarity between the farthest pair (the length of the black line segment).

[Figure: two groups of points in the (x1, x2) plane with a black segment marking d(G, H) between the farthest pair.]

Page 13:

Complete linkage example

Same data as before. Cutting the tree at h = 3.5 gives the clustering assignment.

[Figure: complete-linkage dendrogram of the 60 points, hclust (*, "complete"), with height from 0 to about 4, alongside a scatterplot of the points in (X1, X2) colored by the clusters from the cut at h = 3.5.]

Cut interpretation: For each point Xi, every other point Xj in the same cluster has dij ≤ 3.5.

Page 14:

Average linkage

In average linkage, the linkage distance between G and H is the average dissimilarity over all pairs of points in different clusters:

daverage(G, H) = (1 / (|G| · |H|)) Σ_{i∈G, j∈H} dij

Example: There are two clusters, G and H (red and blue). The average linkage distance, daverage(G, H), is the average dissimilarity between all points in different clusters (the average of the lengths of the colored line segments).

[Figure: two groups of points in the (x1, x2) plane with line segments connecting every cross-group pair.]
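Complete and average linkage are the maximum and the mean of the same cross-group block that single linkage takes the minimum of. A small sketch in the same style as before (D a full dissimilarity matrix, G and H index vectors):

d.complete = function(D, G, H) max(D[G, H])    # most dissimilar cross-group pair
d.average  = function(D, G, H) mean(D[G, H])   # average over all |G|*|H| cross-group pairs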

Page 15:

Average linkage example

Same data as before. Cutting the tree at h = 1.75 gives the clustering assignment.

[Figure: average-linkage dendrogram of the 60 points, hclust (*, "average"), with height from 0 to about 2.5, alongside a scatterplot of the points in (X1, X2) colored by the clusters from the cut at h = 1.75.]

Cut interpretation: ??

Page 16:

Common properties

Single, complete, and average linkage share the following:

• They all operate on the dissimilarities dij. This means that the points we are clustering can be quite general (number of mutations on a genome, polygons, faces, whatever).

• Running agglomerative clustering with any of these linkages produces a dendrogram with no inversions.

No inversions means that the linkage distance between merged clusters only increases as we run the algorithm.

In other words, we can draw a proper dendrogram, where the height of a parent is always higher than the height of either daughter. (We'll return to this again shortly.)

Page 17:

Shortcomings of single and complete linkage

Single and complete linkage have practical problems:

Single linkage: Often suffers from chaining; that is, we only need a single pair of points to be close in order to merge two clusters. Therefore, clusters can be too spread out and not compact enough.

Complete linkage: Often suffers from crowding; that is, a point can be closer to points in other clusters than to points in its own cluster. Therefore, the clusters are compact, but not far enough apart.

Average linkage tries to strike a balance between these two.

Page 18:

Example of chaining and crowding

[Figure: three scatterplots of the 60 points in (X1, X2), colored by cluster assignment, for single linkage, complete linkage, and average linkage.]

Page 19:

Shortcomings of average linkage

Average linkage isn’t perfect.

• It isn't clear what properties the resulting clusters have when we cut an average linkage tree.

• Results of average linkage clustering can change with a monotone increasing transformation of the dissimilarities (that is, if we changed the distances but maintained their ranking, the cluster solution could change).

Neither of these problems afflicts single or complete linkage.
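A quick way to see this in R: squaring Euclidean distances is a monotone increasing transformation, and it leaves the single-linkage merge order alone while it can change the average-linkage tree. A sketch with simulated data (not the lecture's 60 points):

set.seed(1)
X = matrix(rnorm(120), ncol = 2)   # stand-in data
d1 = dist(X); d2 = d1^2            # d2 is a monotone transformation of d1
identical(hclust(d1, method = 'single')$merge,  hclust(d2, method = 'single')$merge)   # TRUE
identical(hclust(d1, method = 'average')$merge, hclust(d2, method = 'average')$merge)  # typically FALSE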

Page 20:

Example of monotone increasing problem

[Figure: scatterplots of the cluster assignments in (X1, X2) under average linkage and under single linkage, for the two dissimilarities described below.]

Left: dij = ||Xi − Xj||2. Right: dij = ||Xi − Xj||2^2 (the squared distance).

Page 21:

Hierarchical agglomerative clustering in R

There’s an easy way to generate distance-based dissimilarities in R

> x = c(1,2)
> y = c(2,1)
> dist(cbind(x,y), method='euclidean')
         1
2 1.414214
> dist(cbind(x,y), method='euclidean', diag=T, upper=T)
         1        2
1 0.000000 1.414214
2 1.414214 0.000000

Page 22:

Hierarchical agglomerative clustering in R

The function hclust is in base R.

Delta = dist(x)
out.average = hclust(Delta, method='average')
plot(out.average)
abline(h = 1.75, col='red')

[Figure: the average-linkage dendrogram, hclust (*, "average"), with a red horizontal line drawn at height 1.75.]

cutree(out.average, k=3)
cutree(out.average, h=1.75)
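For this tree the cut at h = 1.75 leaves three clusters, so the two calls above should give the same partition; one way to check is a cross-tabulation:

table(cutree(out.average, k = 3), cutree(out.average, h = 1.75))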

Page 23:

Recap

Hierarchical agglomerative clustering: Start with all data points in their own groups, and repeatedly merge groups based on the linkage function. Stop when all points are in one group (this is agglomerative; there is also divisive).

This produces a sequence of clustering assignments, visualized by a dendrogram (i.e., a tree). Each node in the tree represents a group, and its height is proportional to the linkage distance of its daughters.

The three most common linkage functions are single, complete, and average linkage. Single linkage measures the least dissimilar pair between groups, complete linkage measures the most dissimilar pair, and average linkage measures the average dissimilarity over all pairs.

Each linkage has its strengths and weaknesses

Page 24:

Careful example

Page 25:

Careful Example

[Figure: scatterplot of the seven points in (x1, x2), with the first two merge heights 0.12 and 0.36 marked, next to three dendrograms of Delta from hclust: single (heights up to about 0.7), complete (heights up to about 1.4), and average (heights up to about 1.0).]

Distance matrix (∆)

1 2 3 4 5 6 7

1 0.00 0.37 0.49 0.61 0.96 1.24 1.42

2 0.37 0.00 0.12 0.64 0.98 1.44 1.24

3 0.49 0.12 0.00 0.65 0.97 1.48 1.15

4 0.61 0.64 0.65 0.00 0.36 0.85 0.90

5 0.96 0.98 0.97 0.36 0.00 0.71 0.72

6 1.24 1.44 1.48 0.85 0.71 0.00 1.39

7 1.42 1.24 1.15 0.90 0.72 1.39 0.00

(All linkages merge {1} and {2, 3})
Single: 0.37
Complete: 0.49
Average: (0.37 + 0.49)/2 = 0.43

(Next agglomeration)
Single: merges {4, 5} and {1, 2, 3} at 0.61
Complete: merges {4, 5} and {6} at 0.85
Average: merges {4, 5} and {6} at (0.85 + 0.71)/2 = 0.78
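These numbers can be checked directly from the distance matrix above, e.g. for the first merge of {1} with {2, 3}:

Delta = matrix(c(0.00,0.37,0.49,0.61,0.96,1.24,1.42,
                 0.37,0.00,0.12,0.64,0.98,1.44,1.24,
                 0.49,0.12,0.00,0.65,0.97,1.48,1.15,
                 0.61,0.64,0.65,0.00,0.36,0.85,0.90,
                 0.96,0.98,0.97,0.36,0.00,0.71,0.72,
                 1.24,1.44,1.48,0.85,0.71,0.00,1.39,
                 1.42,1.24,1.15,0.90,0.72,1.39,0.00), 7, 7, byrow = TRUE)
G = 1; H = c(2, 3)
min(Delta[G, H]); max(Delta[G, H]); mean(Delta[G, H])   # 0.37, 0.49, 0.43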

Page 26:

Another linkage

Centroid linkage is a commonly used and relatively new approach. Assume:

• Xi ∈ Rp

• dij = ||Xi − Xj||2

Let X̄G and X̄H denote the group averages (centroids) of G and H. Then

dcentroid(G, H) = ||X̄G − X̄H||2

Example: There are two clusters (red and blue). The centroid linkage distance, dcentroid(G, H), is the distance between the centroids (the black line segment).

[Figure: two groups of points in the (x1, x2) plane with their centroids X̄G and X̄H marked and a black segment between them.]
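A small sketch of the centroid linkage computation (X is the data matrix with points in its rows; G and H are index vectors for the two groups):

d.centroid = function(X, G, H)
  sqrt(sum((colMeans(X[G, , drop = FALSE]) - colMeans(X[H, , drop = FALSE]))^2))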

Page 27:

Centroid linkage

Centroid linkage is

• ... quite intuitive

• ... widely used

• ... nicely analogous to K-means.

• ... closely related to average linkage (and much, much faster)

However, it has a very unsavory feature: inversions.

An inversion occurs when an agglomeration happens at a smaller linkage distance than a previous one, so the merge heights are not monotone increasing.
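A tiny example that produces an inversion under centroid linkage: three points forming a (nearly) equilateral triangle. The first merge happens at squared distance about 1, but the merged pair's centroid is closer to the third point than that, so the second merge height is smaller:

X = rbind(c(0, 0), c(1, 0), c(0.5, sqrt(3)/2))
out = hclust(dist(X)^2, method = 'centroid')   # squared Euclidean distances, as in the careful example below
out$height                                     # roughly 1 then 0.75: the second merge is lower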

Page 28:

Centroid linkage example

Same data as before. We can't look at cutting the tree, but we can still look at a 3-cluster solution.

[Figure: centroid-linkage dendrogram of the 60 points, hclust (*, "centroid"), with height from 0 to about 2, alongside a scatterplot of the points in (X1, X2) colored by the 3-cluster solution.]

Cut interpretation: Even if there are no inversions, there still is no cut interpretation.

Page 29:

Careful Example: Steps 1,2,3

[Figure: scatterplot of the seven points in (x1, x2) next to the centroid-linkage cluster dendrogram, hclust (*, "centroid") on Delta, with height from 0 to about 1.5.]

Distance matrix (∆)

1 2 3 4 5 6 7

1 0.00 0.40 0.35 1.62 1.20 2.16 3.67

2 0.40 0.00 0.34 1.96 0.37 2.54 3.66

3 0.35 0.34 0.00 0.68 0.42 1.05 1.94

4 1.62 1.96 0.68 0.00 1.45 0.04 0.46

5 1.20 0.37 0.42 1.45 0.00 1.86 2.35

6 2.16 2.54 1.05 0.04 1.86 0.00 0.27

7 3.67 3.66 1.94 0.46 2.35 0.27 0.00

(This is squared Euclidean distance)

Centroid of {4, 6}: (1.68, 0.76). Centroid of {2, 3}: (0.58, 1.25). Squared distance between the centroids: 1.46.

Page 30:

Careful Example: Step 4

[Figure: scatterplot of the seven points with candidate centroid distances 0.29, 0.31, and 0.36 marked, next to the centroid-linkage dendrogram, hclust (*, "centroid") on Delta.]

Distance matrix (∆)

1 2 3 4 5 6 7

1 0.00 0.40 0.35 1.62 1.20 2.16 3.67

2 0.40 0.00 0.34 1.96 0.37 2.54 3.66

3 0.35 0.34 0.00 0.68 0.42 1.05 1.94

4 1.62 1.96 0.68 0.00 1.45 0.04 0.46

5 1.20 0.37 0.42 1.45 0.00 1.86 2.35

6 2.16 2.54 1.05 0.04 1.86 0.00 0.27

7 3.67 3.66 1.94 0.46 2.35 0.27 0.00

(This is squared Euclidean distance)

Which one gets merged?

({1} and {2, 3})

method = 'centroid'
out = hclust(Delta, method=method)
plot(out)   # rect.hclust highlights groups on an existing dendrogram plot
rect.hclust(out, k=5,
    border=c('white','red','white','white','blue'))

Page 33:

Linkage Summary

Page 34:

Linkages summary

Linkage    No inversions?   Unchanged w/ monotone transformation?   Cut interpretation?   Notes
Single     Yes              Yes                                     Yes                   chaining
Complete   Yes              Yes                                     Yes                   crowding
Average    Yes              No                                      No
Centroid   No               No                                      No                    inversions

Final notes:

• None of this helps determine what the best linkage is.

• Use the linkage that seems the most appropriate for the types of clusters you want to get.

Page 35:

Designing a clever radio system

We have a lot of songs and dissimilarities between them (dij)

We want to build a clever radio system that takes a song specified by the user and produces a song of the “same” type.

We ask the user how “risky” he or she wants to be

How can we use hierarchical clustering and with what linkage?

Page 36:

Linkages summary: Cut interpretations

Suppose we cut the tree at height h = 1.

Single: For each point Xi, there is another point Xj in the same cluster with dij ≤ 1 (assuming more than one point in the cluster). Also, no points in different clusters are closer than 1.

Complete: For each point Xi, every other point Xj in the same cluster has dij ≤ 1.

Page 37:

Data analysis example

Diffuse large B-cell lymphoma (DLBCL) is the most common type of non-Hodgkin's lymphoma.

It is clinically heterogeneous:

• 40% of patients respond well

• 60% of patients succumb to the disease

The researchers propose that this difference is due to unrecognized molecular heterogeneity in the tumors.

We examine the extent to which genomic-scale gene expression profiling can further the understanding of B-cell malignancies.

Page 38:

Data analysis example

Here, we have gene expression data at 2,000 genes for 62 cancer cells.

There are 3 cancer diagnoses: FL, CLL, DLBCL. Each corresponds to a type of malignant lymphoma.

We want to use hierarchical clustering to understand this data set better.

load('../data/alizadeh.RData')
genesT = alizadeh$x
genes = t(genesT)
Yfull = alizadeh$type
Y = as.vector(Yfull)
Y[Yfull == "DLBCL-A"] = 'DLBCL'   # collapse the two DLBCL sub-types into one label
Y[Yfull == "DLBCL-G"] = 'DLBCL'
Y = as.factor(Y)
dist.mat = dist(genes)
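The PCA plots on the next two slides are not accompanied by code in the lecture; a plausible sketch (assumed, using prcomp on the same genes matrix and the labels Y):

pca = prcomp(genes)
plot(pca$x[, 1:2], col = as.integer(Y), pch = 19, xlab = 'PC1', ylab = 'PC2')
legend('topleft', legend = levels(Y), col = seq_along(levels(Y)), pch = 19)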

Page 39:

PCA plot

[Figure: scatterplot of the samples on the first two principal components (PC1, PC2), colored by diagnosis: CLL, DLBCL, FL.]

There are two clear groups for FL and CLL.

DLBCL somewhat appears to be one group, but it is much more diffuse.

Page 40:

PCA plot

[Figure: the same PCA scatterplot, now colored by CLL, DLBCL-A, DLBCL-G, and FL.]

Here are the two sub-types identified by the researchers.

Let's look at their results further.

Page 41:

Four hierarchical cluster solutions

[Figure: four dendrograms of the 62 samples, with leaves labeled by diagnosis (DLBCL, CLL, FL): hclust (*, "single"), hclust (*, "complete"), hclust (*, "average"), and hclust (*, "centroid").]
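A sketch of how four such dendrograms can be drawn from dist.mat and Y (assumed code; the lecture does not show it, and the centroid panel may have used squared distances):

par(mfrow = c(2, 2))
for (m in c('single', 'complete', 'average', 'centroid'))
  plot(hclust(dist.mat, method = m), labels = Y, main = m, xlab = '', sub = '')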

Page 42:

Complete linkage: A closer look

[Figure: the complete-linkage dendrogram of the 62 samples, hclust (*, "complete"), with leaves labeled by diagnosis (FL, CLL, DLBCL) and 12 clusters boxed by rect.hclust.]

out.com = hclust(dist.mat, method='complete')
plot(out.com, xlab='', main='', labels=Y)
rect.hclust(out.com, k=12)
out.cut = cutree(out.com, k=12)

Notice that FL and CLL are distinctly grouped, while there are many clusters inside the DLBCL type.
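A quick way to quantify this (an assumed follow-up, not shown in the lecture) is to cross-tabulate the 12 complete-linkage clusters against the diagnoses:

table(out.cut, Y)   # rows: the 12 clusters; columns: CLL, DLBCL, FL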

Page 43:

Data analysis example: Conclusion

We find evidence of more distinct cancer sub-types within previous diagnoses.

This result is quite interesting

Clinically heterogeneous outcomes could be driven by inadequate clusterings.
