Hierarchical Clustering - cs.tau.ac.ilrshamir/abdbm/pres/17/HierarchicalClustering.pdf ·...
Hierarchical Clustering
Some slides by Serafim Batzoglou
1 ABDBM © Ron Shamir
From expression profiles to distances
[Figure: the “Raw Data” matrix of expression levels (genes × experiments) and the gene-by-gene similarity matrix derived from it]
From the Raw Data matrix we compute the similarity matrix S. Sij reflects the similarity of the expression patterns of gene i and gene j.
In some situations the input for clustering is only the similarities / distances.
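The step from raw data to a similarity matrix can be sketched as follows. The slide does not say which similarity measure is used here, so Pearson correlation is an assumption, and the function names and toy data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumed similarity measure).

    A constant profile has zero spread and would raise ZeroDivisionError.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def similarity_matrix(raw):
    """raw[i] is the expression profile of gene i across the experiments."""
    n = len(raw)
    return [[pearson(raw[i], raw[j]) for j in range(n)] for i in range(n)]

# Hypothetical toy data: 3 genes x 4 experiments.
raw = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # same shape as gene 0
    [4.0, 3.0, 2.0, 1.0],  # opposite trend to gene 0
]
S = similarity_matrix(raw)
```

Under this choice S is symmetric, genes 0 and 1 come out as maximally similar, and genes 0 and 2 as maximally dissimilar.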
More generally
• In K-means and SOM the input was a vector for each item (e.g. a point in R^n)
• Here we have a matrix of pairwise distances between items, and we wish to cluster the items.
• A distance-based clustering algorithm
An Alternative view of Clustering
Form a tree-hierarchy of the input elements satisfying:
• More similar elements are placed closer along the tree.
• Or: Tree distances reflect element similarity
• Note: No explicit partition into clusters.
Partitioning vs Hierarchical Representations
“dendrogram”
Hierarchical Representations (2)
[Figure: two tree representations over leaves 1, 3, 4, 2; values shown: 2.8, 4.5, 5.0]
Ultrametric: rooted tree, all root-leaf distances are equal
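A distance matrix fits an ultrametric tree exactly when, for every triple of elements, the two largest of the three pairwise distances are equal (the three-point condition). A minimal sketch of that check, with a hypothetical function name and made-up matrices:

```python
from itertools import combinations

def is_ultrametric(d, tol=1e-9):
    """Three-point condition: in every triple the two largest distances coincide."""
    for i, j, k in combinations(range(len(d)), 3):
        a, b, c = sorted((d[i][j], d[i][k], d[j][k]))
        if abs(c - b) > tol:
            return False
    return True

# Hypothetical example: leaves {0,1} join at distance 2, {2,3} at 4, all at 6.
ultra = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 4],
    [6, 6, 4, 0],
]
# Hypothetical counterexample: the triple has distances 2, 5, 6.
not_ultra = [
    [0, 2, 5],
    [2, 0, 6],
    [5, 6, 0],
]
```

Here `is_ultrametric(ultra)` holds and `is_ultrametric(not_ultra)` does not.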
UPGMA Clustering (unweighted pair group method using arithmetic averages)
• Approach: Form a tree; closer species according to input distances should be closer in the tree
• Build the tree bottom up, each time merging two smaller trees
• All leaves are at the same distance from the root
Hierarchical Clustering: UPGMA Sokal & Michener 58, Lance & Williams 67
UPGMA (unweighted pair group method using arithmetic averages)
Given two disjoint clusters $C_i$, $C_j$:
$$d_{ij} = \frac{1}{|C_i| \cdot |C_j|} \sum_{p \in C_i,\, q \in C_j} d_{pq}$$
If $C_k = C_i \cup C_j$, then the distance from $C_k$ to another cluster $C_l$ is:
$$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}$$
Algorithm: UPGMA
Initialization:
• Assign each $x_i$ into its own cluster $C_i$
• Define one leaf per sequence, height 0
Iteration:
• Find two clusters $C_r$, $C_s$ s.t. $d_{rs}$ is minimal
• Define a new cluster $C_t = C_r \cup C_s$
• Define node $A_{rs}$ connecting $C_r$, $C_s$, at height $d_{rs}/2$
• Delete $C_r$, $C_s$
• $d_{it} = d_{ti} = (|C_r| \cdot d_{ir} + |C_s| \cdot d_{is}) / (|C_r| + |C_s|)$
• length($C_r$, $A_{rs}$) = height($A_{rs}$) - height($C_r$)
• length($C_s$, $A_{rs}$) = height($A_{rs}$) - height($C_s$)
Termination: When all sequences belong to one cluster
[Figure: example with five elements, 1 to 5, and the tree built over them]
Time: Naïve: $O(n^3)$; can show $O(n^2 \log n)$ (ex.); $O(n^2)$ (harder ex.)
Thm: If the input distances match an ultrametric tree – UPGMA finds it.
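The algorithm above can be sketched in a few lines of Python. This is the naïve cubic-time version, not an optimized one; the function name `upgma`, the nested-tuple tree representation, and the toy matrix are assumptions made for illustration.

```python
def upgma(d, labels):
    """Naive UPGMA on a symmetric distance matrix d (list of lists).

    Returns (tree, root_height, merges); tree is a nested tuple of labels and
    each merge records (left, right, node_height, left_length, right_length).
    """
    n = len(labels)
    # Active clusters: id -> (subtree, size, height); every leaf has height 0.
    clusters = {i: (labels[i], 1, 0.0) for i in range(n)}
    # Distances between active clusters, keyed by the pair of cluster ids.
    dist = {frozenset((i, j)): d[i][j] for i in range(n) for j in range(i + 1, n)}
    merges, t = [], n
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)          # closest pair C_r, C_s
        r, s = sorted(pair)
        d_rs = dist.pop(pair)
        tree_r, n_r, h_r = clusters.pop(r)      # delete C_r, C_s
        tree_s, n_s, h_s = clusters.pop(s)
        height = d_rs / 2                       # height of the new node A_rs
        for i in clusters:                      # size-weighted average update
            d_ir = dist.pop(frozenset((i, r)))
            d_is = dist.pop(frozenset((i, s)))
            dist[frozenset((i, t))] = (n_r * d_ir + n_s * d_is) / (n_r + n_s)
        clusters[t] = ((tree_r, tree_s), n_r + n_s, height)
        merges.append((tree_r, tree_s, height, height - h_r, height - h_s))
        t += 1
    (tree, _, root_height), = clusters.values()
    return tree, root_height, merges

# Hypothetical ultrametric input: A,B join at 2, C,D at 4, everything at 6.
d = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 4],
    [6, 6, 4, 0],
]
tree, root_height, merges = upgma(d, ["A", "B", "C", "D"])
```

On this input the sketch recovers the tree `(("A", "B"), ("C", "D"))` with the root at height 3, consistent with the theorem for ultrametric inputs. Scanning all pairs for the minimum in every iteration is what makes it cubic.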
http://lectures.molgen.mpg.de/Phylogeny/Ultrametric/
Robert R. Sokal (1926-2012)
Ph.D. 1952, University of Chicago. Was at the Dept. of Ecology and Evolution, SUNY Stony Brook.
Member of the National Academy of Sciences & American Academy of Arts and Sciences.
Promoted the use of statistics in biology and co-founded the field of numerical taxonomy.
Together with P.H.A. Sneath, authored the two defining texts in this field.
Along with F. James Rohlf, authored the very popular biostatistics book, Biometry.
Editor of the American Naturalist, president of several learned societies.
Results (2)
• 10 major groups with similar patterns of co-occurrence, confirming that specific groups of phenotypes co-occur within families.
• Certain malformations co-occur in more than one group, e.g. TGA, AVSD.
• Some differences from a proposed taxonomy (Houyel 11)
• (Also: co-occurrence of defects in families is caused by shared susceptibility genes.)
• A starting point for further biomedical research
Variants on hierarchical clustering
• Input: Distance matrix $D_{ij}$
• Initially each element is a cluster.
• Find min element $D_{rs}$ in $D$; merge clusters r, s
• Delete elements r, s; add new element t with updated weights
• Repeat
• Variants:
– Average linkage: UPGMA
– Single linkage: $D_{it} = \min(D_{ir}, D_{is})$
– Max linkage: $D_{it} = \max(D_{ir}, D_{is})$
• Sometimes the number of clusters is needed. Methods abound.
• Sometimes leaf order matters and not only topology.
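The three variants differ only in how the new row $D_{it}$ is computed, so one loop with a pluggable update rule covers all of them. The sketch below is illustrative: the names `agglomerate`, `average`, `single`, `maximum` and the toy distances are hypothetical. "Max linkage" is the rule more commonly called complete linkage.

```python
def agglomerate(d, update):
    """Generic merge loop; `update(d_ir, d_is, n_r, n_s)` returns D_it.

    Returns the list of merges as (r, s, D_rs); new clusters get ids n, n+1, ...
    """
    n = len(d)
    size = {i: 1 for i in range(n)}
    dist = {frozenset((i, j)): d[i][j] for i in range(n) for j in range(i + 1, n)}
    merges, t = [], n
    while len(size) > 1:
        pair = min(dist, key=dist.get)       # min element D_rs
        r, s = sorted(pair)
        d_rs = dist.pop(pair)
        n_r, n_s = size.pop(r), size.pop(s)  # delete elements r, s
        for i in size:                       # add element t with updated weights
            d_ir = dist.pop(frozenset((i, r)))
            d_is = dist.pop(frozenset((i, s)))
            dist[frozenset((i, t))] = update(d_ir, d_is, n_r, n_s)
        size[t] = n_r + n_s
        merges.append((r, s, d_rs))
        t += 1
    return merges

def average(d_ir, d_is, n_r, n_s):  # UPGMA
    return (n_r * d_ir + n_s * d_is) / (n_r + n_s)

def single(d_ir, d_is, n_r, n_s):
    return min(d_ir, d_is)

def maximum(d_ir, d_is, n_r, n_s):  # "max" (complete) linkage
    return max(d_ir, d_is)

# Hypothetical distances between four points on a line at 0, 1, 3, 7.
d = [
    [0, 1, 3, 7],
    [1, 0, 2, 6],
    [3, 2, 0, 4],
    [7, 6, 4, 0],
]
single_heights = [m[2] for m in agglomerate(d, single)]
average_heights = [m[2] for m in agglomerate(d, average)]
max_heights = [m[2] for m in agglomerate(d, maximum)]
```

On this toy input all three rules merge in the same order, but the merge distances differ: single linkage gives the smallest values and max linkage the largest, with average linkage in between.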
Hierarchical clustering of GE data
Eisen et al., PNAS 1998
• Growth response: Starved human fibroblast cells, added serum
• Monitored levels of 8600 genes over 13 time-points
• $t_{ij}$: level of target gene i in condition j
• $r_{ij}$: same for reference
• $D_{ij} = \log(t_{ij}/r_{ij})$
• $D^*_{ij} = [D_{ij} - E(D_i)] / \mathrm{std}(D_i)$
• Similarity of genes k, l: $S_{kl} = (\sum_j D^*_{kj} \cdot D^*_{lj}) / N_{cond}$
• Applied average linkage method
• Ordered leaves by increasing subtree weight: average expression level, time of maximal induction, other criteria
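The normalization and similarity formulas above can be sketched as follows. Two details are assumptions because the slide leaves them open: the logarithm base (base 2 here; the choice cancels out after standardization) and the std convention (population std, dividing by $N_{cond}$, which makes the similarity of a gene with itself equal to 1). Function names and data are hypothetical.

```python
import math

def standardized_log_ratios(t_row, r_row):
    """D*_i for one gene: log ratios, centered and scaled to unit std.

    Assumes population std (divide by N_cond); a flat profile has std 0
    and would raise ZeroDivisionError.
    """
    d = [math.log2(t / r) for t, r in zip(t_row, r_row)]
    mean = sum(d) / len(d)
    std = math.sqrt(sum((x - mean) ** 2 for x in d) / len(d))
    return [(x - mean) / std for x in d]

def similarity(dk, dl):
    """S_kl = (sum_j D*_kj * D*_lj) / N_cond."""
    return sum(a * b for a, b in zip(dk, dl)) / len(dk)

# Hypothetical toy data: 3 genes x 3 conditions, reference level 1 everywhere.
target = [[2.0, 4.0, 8.0], [1.0, 2.0, 4.0], [8.0, 4.0, 2.0]]
reference = [[1.0, 1.0, 1.0]] * 3
D_star = [standardized_log_ratios(t, r) for t, r in zip(target, reference)]
S = [[similarity(a, b) for b in D_star] for a in D_star]
```

With these conventions $S_{kl}$ is the Pearson correlation of the two log-ratio profiles: genes 0 and 1 rise together and score 1, while gene 2 falls and scores -1 against gene 0.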
[Figure: clustering the same data after random permutation within rows (1), within columns (2), and both (3)]
Observations
• Distinct measurements of the same genes cluster together
• Genes of similar function cluster together
• Many cluster-function specific insights
• Interpretation is a REAL biological challenge
Yeast GE data
Mike Eisen & Pat Brown
More on hierarchical methods (2)
• The methods described above are agglomerative (bottom up)
• An alternative approach: divisive (top down)
• Advantages:
– Gives a single coherent global picture
– Intuitive for biologists (from phylogeny)
• Disadvantages:
– No single partition; no specific clusters
– Forces all elements to fit a tree
• There are other methods that do not assume an ultrametric solution, notably Neighbor Joining. In genomics UPGMA still rules.
Hierarchical Clustering & Congenital Heart Defects
Ellsoe et al. (Soren Brunak lab) European Heart Journal (2017)
CHD
• Congenital heart defects (CHD) affect almost 1% of all live-born children
• Number of adults with CHD is increasing
• Recurrence patterns in families are poorly understood
• Do cases in the same family tend to have similar types of malformations?
Study
• 1163 families, 3080 family members with clinical diagnosis (avg 2.65 CHD cases/family)
• Each case is identified as having one or more of 41 different types of CHD lesions: AVD, BSD, VSD, …
Concordant & discordant disease pairs
Concordant: (ASD,ASD), (ASD,VSD)… Discordant: (BAV,BAV)
Gender ratio, concordance & discordance
Scoring pairs of defects
• N(A,B): # families with A, B
• N(A,¬B): # families with A, not B
• N(¬A,B): # families with B, not A
• N(¬A,¬B): # families with none
• The odds ratio (OR) between phenotypes A and B:
• $OR(A,B) = \frac{N(A,B)\,N(\neg A,\neg B)}{N(A,\neg B)\,N(\neg A,B)}$
• ??!
• Perhaps $OR(A,B) = \frac{N(A,B)}{N(A,\neg B) + N(\neg A,B)}$
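Both scores on this slide can be computed from the four family counts. The sketch below assumes each family is represented as a set of phenotype labels; that representation, the function names, and the toy counts are hypothetical and are not taken from the study.

```python
def contingency(families, a, b):
    """Count families by presence or absence of phenotypes a and b."""
    n_ab = sum(1 for f in families if a in f and b in f)
    n_a_only = sum(1 for f in families if a in f and b not in f)
    n_b_only = sum(1 for f in families if b in f and a not in f)
    n_none = sum(1 for f in families if a not in f and b not in f)
    return n_ab, n_a_only, n_b_only, n_none

def odds_ratio(families, a, b):
    """OR(A,B) = N(A,B) N(not A, not B) / (N(A, not B) N(not A, B)).

    Raises ZeroDivisionError when either one-sided count is zero.
    """
    n_ab, n_a_only, n_b_only, n_none = contingency(families, a, b)
    return (n_ab * n_none) / (n_a_only * n_b_only)

def alternative_score(families, a, b):
    """The slide's tentative alternative: N(A,B) / [N(A, not B) + N(not A, B)]."""
    n_ab, n_a_only, n_b_only, _ = contingency(families, a, b)
    return n_ab / (n_a_only + n_b_only)

# Hypothetical toy cohort of 16 families (made-up counts).
families = (
    [{"ASD", "VSD"}] * 4   # both phenotypes present
    + [{"ASD"}] * 2        # ASD only
    + [{"VSD"}] * 2        # VSD only
    + [{"BAV"}] * 8        # neither ASD nor VSD
)
or_value = odds_ratio(families, "ASD", "VSD")
alt_value = alternative_score(families, "ASD", "VSD")
```

The toy cohort shows why the two scores can disagree: the odds ratio grows with the number of families that have neither phenotype, while the alternative score ignores those families.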
Results