Cutting the dendrogram through permutation tests
Transcript of Cutting the dendrogram through permutation tests
Dario Bruzzese Domenico [email protected] [email protected]
Dario Bruzzese, Domenico Vistocco () Compstat 2010 1 / 19
Cutting the dendrogram throughpermutation tests
Department ofPreventive Medical Sciences
UNIVERSITY OF NAPLES ITALY
Department ofEconomics
UNIVERSITY OF CASSINO ITALY
La Carte
1 Motivation
2 The stairstep-like permutation procedureNotationThe outline
3 Some resultsReal datasetsSynthetic dataset
4 ToDo List
Dario Bruzzese, Domenico Vistocco () Compstat 2010 2 / 19
La Carte
1 Motivation
2 The stairstep-like permutation procedureNotationThe outline
3 Some resultsReal datasetsSynthetic dataset
4 ToDo List
Dario Bruzzese, Domenico Vistocco () Compstat 2010 3 / 19
Motivation
Dario Bruzzese, Domenico Vistocco () Compstat 2010 4 / 19
Automatically determine the optimal cut-off level of a dendrogramExplore partitions different from those allowed by an horizontal cut
Motivation
Dario Bruzzese, Domenico Vistocco () Compstat 2010 4 / 19
Automatically determine the optimal cut-off level of a dendrogramExplore partitions different from those allowed by an horizontal cut
The rep1HighNoise datasetYeung KY, Medvedovic M, Bumgarner KY:Clustering gene-expression data withrepeated measurements.
Genome Biology, 2003, 4:R34
n = 200p = 20
Motivation
Dario Bruzzese, Domenico Vistocco () Compstat 2010 4 / 19
Automatically determine the optimal cut-off level of a dendrogramExplore partitions different from those allowed by an horizontal cut
Horizontal cutk = 3
Motivation
Dario Bruzzese, Domenico Vistocco () Compstat 2010 4 / 19
Automatically determine the optimal cut-off level of a dendrogramExplore partitions different from those allowed by an horizontal cut
An alternative cutk = 3
La Carte
1 Motivation
2 The stairstep-like permutation procedureNotationThe outline
3 Some resultsReal datasetsSynthetic dataset
4 ToDo List
Dario Bruzzese, Domenico Vistocco () Compstat 2010 5 / 19
Notation
Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19
Let:
n the number of objects to classify;
CkL and Ck
R the two classes merged at level k(k=1,...,n-1)
h(
CkL ∪ Ck
R
)the height necessary to merge
CkL and Ck
R
h(
Ckj
)the height at which Ck
j has been obtained(j ∈ { L, R })
Notation
Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19
Let:n the number of objects to classify;
CkL and Ck
R the two classes merged at level k(k=1,...,n-1)
h(
CkL ∪ Ck
R
)the height necessary to merge
CkL and Ck
R
h(
Ckj
)the height at which Ck
j has been obtained(j ∈ { L, R })
︸ ︷︷ ︸︸ ︷︷ ︸
Notation
Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19
Let:n the number of objects to classify;
CkL and Ck
R the two classes merged at level k(k=1,...,n-1)
h(
CkL ∪ Ck
R
)the height necessary to merge
CkL and Ck
R
h(
Ckj
)the height at which Ck
j has been obtained(j ∈ { L, R })
Notation
Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19
Let:n the number of objects to classify;
CkL and Ck
R the two classes merged at level k(k=1,...,n-1)
h(
CkL ∪ Ck
R
)the height necessary to merge
CkL and Ck
R
h(
Ckj
)the height at which Ck
j has been obtained(j ∈ { L, R })
C1L C1
R
Notation
Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19
Let:n the number of objects to classify;
CkL and Ck
R the two classes merged at level k(k=1,...,n-1)
h(
CkL ∪ Ck
R
)the height necessary to merge
CkL and Ck
R
h(
Ckj
)the height at which Ck
j has been obtained(j ∈ { L, R })
C2L C2
R
Notation
Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19
Let:n the number of objects to classify;
CkL and Ck
R the two classes merged at level k(k=1,...,n-1)
h(
CkL ∪ Ck
R
)the height necessary to merge
CkL and Ck
R
h(
Ckj
)the height at which Ck
j has been obtained(j ∈ { L, R })
C3L C3
R
Notation
Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19
Let:n the number of objects to classify;
CkL and Ck
R the two classes merged at level k(k=1,...,n-1)
h(
CkL ∪ Ck
R
)the height necessary to merge
CkL and Ck
R
h(
Ckj
)the height at which Ck
j has been obtained(j ∈ { L, R })
Notation
Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19
Let:n the number of objects to classify;
CkL and Ck
R the two classes merged at level k(k=1,...,n-1)
h(
CkL ∪ Ck
R
)the height necessary to merge
CkL and Ck
R
h(
Ckj
)the height at which Ck
j has been obtained(j ∈ { L, R })
C1L C1
R
h(
C1L ∪ C1
R
)
Notation
Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19
Let:n the number of objects to classify;
CkL and Ck
R the two classes merged at level k(k=1,...,n-1)
h(
CkL ∪ Ck
R
)the height necessary to merge
CkL and Ck
R
h(
Ckj
)the height at which Ck
j has been obtained(j ∈ { L, R })
C2L C2
R
h(
C2L ∪ C2
R
)
Notation
Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19
Let:n the number of objects to classify;
CkL and Ck
R the two classes merged at level k(k=1,...,n-1)
h(
CkL ∪ Ck
R
)the height necessary to merge
CkL and Ck
R
h(
Ckj
)the height at which Ck
j has been obtained(j ∈ { L, R })
C3L C3
R
h(
C3L ∪ C3
R
)
Notation
Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19
Let:n the number of objects to classify;
CkL and Ck
R the two classes merged at level k(k=1,...,n-1)
h(
CkL ∪ Ck
R
)the height necessary to merge
CkL and Ck
R
h(
Ckj
)the height at which Ck
j has been obtained(j ∈ { L, R })
Notation
Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19
Let:n the number of objects to classify;
CkL and Ck
R the two classes merged at level k(k=1,...,n-1)
h(
CkL ∪ Ck
R
)the height necessary to merge
CkL and Ck
R
h(
Ckj
)the height at which Ck
j has been obtained(j ∈ { L, R })
C1L
h(
C1L
)
Notation
Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19
Let:n the number of objects to classify;
CkL and Ck
R the two classes merged at level k(k=1,...,n-1)
h(
CkL ∪ Ck
R
)the height necessary to merge
CkL and Ck
R
h(
Ckj
)the height at which Ck
j has been obtained(j ∈ { L, R })
C1R
h(
C1R
)
Notation
Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19
Let:n the number of objects to classify;
CkL and Ck
R the two classes merged at level k(k=1,...,n-1)
h(
CkL ∪ Ck
R
)the height necessary to merge
CkL and Ck
R
h(
Ckj
)the height at which Ck
j has been obtained(j ∈ { L, R })
C2L
h(
C2L
)
Notation
Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19
Let:n the number of objects to classify;
CkL and Ck
R the two classes merged at level k(k=1,...,n-1)
h(
CkL ∪ Ck
R
)the height necessary to merge
CkL and Ck
R
h(
Ckj
)the height at which Ck
j has been obtained(j ∈ { L, R })
C2R
h(
C2R
)
Notation
Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19
Let:n the number of objects to classify;
CkL and Ck
R the two classes merged at level k(k=1,...,n-1)
h(
CkL ∪ Ck
R
)the height necessary to merge
CkL and Ck
R
h(
Ckj
)the height at which Ck
j has been obtained(j ∈ { L, R })
C3L
h(
C3L
)
Notation
Dario Bruzzese, Domenico Vistocco () Compstat 2010 6 / 19
Let:n the number of objects to classify;
CkL and Ck
R the two classes merged at level k(k=1,...,n-1)
h(
CkL ∪ Ck
R
)the height necessary to merge
CkL and Ck
R
h(
Ckj
)the height at which Ck
j has been obtained(j ∈ { L, R })
C3R
h(
C3R
)
The algorithm - Pseudo CodeInput: A dataset and its related dendrogramOutput: A partition of the dataset
initialization:aggregationLevelsToVisit← h(C1
L ∪ C1R)
permClusters← [ ]i← 1repeat
if C iL ≡ C i
R thenadd C i
L ∪ C iR to permClusters
elseadd h(C i
L) and h(C iR) to aggregationLevelsToVisit
sort aggregationLevelsToVisit in descending orderendremove the first element from aggregationLevelsToVisiti← i+1
until aggregationLevelsToVisit is empty
Dario Bruzzese, Domenico Vistocco () Compstat 2010 7 / 19
The algorithm - Pseudo CodeInput: A dataset and its related dendrogramOutput: A partition of the dataset
initialization:aggregationLevelsToVisit← h(C1
L ∪ C1R)
permClusters← [ ]i← 1
repeatif C i
L ≡ C iR then
add C iL ∪ C i
R to permClusterselse
add h(C iL) and h(C i
R) to aggregationLevelsToVisitsort aggregationLevelsToVisit in descending order
endremove the first element from aggregationLevelsToVisiti← i+1
until aggregationLevelsToVisit is empty
Dario Bruzzese, Domenico Vistocco () Compstat 2010 7 / 19
The algorithm - Pseudo CodeInput: A dataset and its related dendrogramOutput: A partition of the dataset
initialization:aggregationLevelsToVisit← h(C1
L ∪ C1R)
permClusters← [ ]i← 1repeat
if C iL ≡ C i
R thenadd C i
L ∪ C iR to permClusters
elseadd h(C i
L) and h(C iR) to aggregationLevelsToVisit
sort aggregationLevelsToVisit in descending orderend
remove the first element from aggregationLevelsToVisiti← i+1
until aggregationLevelsToVisit is empty
Dario Bruzzese, Domenico Vistocco () Compstat 2010 7 / 19
The algorithm - Pseudo CodeInput: A dataset and its related dendrogramOutput: A partition of the dataset
initialization:aggregationLevelsToVisit← h(C1
L ∪ C1R)
permClusters← [ ]i← 1repeat
if C iL ≡ C i
R thenadd C i
L ∪ C iR to permClusters
elseadd h(C i
L) and h(C iR) to aggregationLevelsToVisit
sort aggregationLevelsToVisit in descending orderendremove the first element from aggregationLevelsToVisiti← i+1
until aggregationLevelsToVisit is empty
Dario Bruzzese, Domenico Vistocco () Compstat 2010 7 / 19
The algorithm - Pseudo CodeInput: A dataset and its related dendrogramOutput: A partition of the dataset
initialization:aggregationLevelsToVisit← h(C1
L ∪ C1R)
permClusters← [ ]i← 1repeat
if C iL ≡ C i
R thenadd C i
L ∪ C iR to permClusters
elseadd h(C i
L) and h(C iR) to aggregationLevelsToVisit
sort aggregationLevelsToVisit in descending orderendremove the first element from aggregationLevelsToVisiti← i+1
until aggregationLevelsToVisit is empty
Dario Bruzzese, Domenico Vistocco () Compstat 2010 7 / 19
The algorithm - The outline
Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19
Initializationi ← 0
aggregationLevelsToVisit
h(C1L ∪ C1
R)
permClusters
The algorithm - The outline
Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19
Iterationi ← 1
aggregationLevelsToVisit
h(C1L ∪ C1
R)
permClusters
h(
C1L ∪ C1
R
)
The algorithm - The outline
Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19
Iterationi ← 1
aggregationLevelsToVisit
h(C1L ∪ C1
R)
permClusters
clusters to compare
H0 : C1L ≡ C1
R 7→ reject
C1L C1
R
The algorithm - The outline
Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19
Iterationi ← 1
permClusters
aggregationLevelsToVisit
h(C1L ∪ C1
R),h(C1R),h(C
1L)
h(
C1L
)h(
C1R
)
The algorithm - The outline
Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19
Iterationi ← 1
permClusters
aggregationLevelsToVisit
h(C1R),h(C
1L)
The algorithm - The outline
Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19
permClusters
Iterationi ← 2
aggregationLevelsToVisit
h(C1R),h(C
1L)
h(
C1R
)
The algorithm - The outline
Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19
permClusters
Iterationi ← 2
aggregationLevelsToVisit
h(C1R),h(C
1L)
clusters to compare
H0 : C2L ≡ C2
R 7→ reject
C2L C2
R
The algorithm - The outline
Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19
permClusters
Iterationi ← 2
aggregationLevelsToVisit
h(C1R),h(C
1L),h(C
2R),h(C
2L)
h(
C2L
)h(
C2R
)
The algorithm - The outline
Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19
permClusters
Iterationi ← 2
aggregationLevelsToVisit
h(C1L),h(C
2R),h(C
2L)
The algorithm - The outline
Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19
permClusters
Iterationi ← 3
aggregationLevelsToVisit
h(C1L),h(C
2R),h(C
2L)
h(
C1L
)
The algorithm - The outline
Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19
permClusters
Iterationi ← 3
aggregationLevelsToVisit
h(C1L),h(C
2R),h(C
2L)
clusters to compare
H0 : C3L ≡ C3
R 7→ reject
C3L C3
R
The algorithm - The outline
Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19
permClusters
Iterationi ← 3
h(
C3L
)h(
C3R
) aggregationLevelsToVisit
h(C3R),h(C
2R),h(C
2L),h(C
3L)
The algorithm - The outline
Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19
permClusters
Iterationi ← 4
aggregationLevelsToVisit
h(C3R),h(C
2R),h(C
2L),h(C
3L)
h(
C3R
)
The algorithm - The outline
Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19
permClusters
Iterationi ← 4
aggregationLevelsToVisit
h(C3R),h(C
2R),h(C
2L),h(C
3L)
C4L C4
R
clusters to compare
H0 : C4L ≡ C4
R 7→ accept
The algorithm - The outline
Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19
Iterationi ← 4
aggregationLevelsToVisit
h(C3R),h(C
2R),h(C
2L),h(C
3L)
clusters to compare
H0 : C4L ≡ C4
R 7→ accept
permClusters
C4L ∪ C4
R ⇔ C3R
C3R
The algorithm - The outline
Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19
Iterationi ← 9
aggregationLevelsToVisit
permClusters
C3L ,C
3R,C
2L ,C
4L ,C
4R
The algorithm - The outline
Dario Bruzzese, Domenico Vistocco () Compstat 2010 8 / 19
Iterationi ← 9
aggregationLevelsToVisit
permClusters
C3L ,C
3R,C
2L ,C
4L ,C
4R
The algorithm - The permutation Test
Dario Bruzzese, Domenico Vistocco () Compstat 2010 9 / 19
For each aggregation level k a permutation test isdesigned to test the Null Hypothesis that the twogroups Ck
L and CkR really belong to the same
cluster, i.e. :
H0 : CkL ≡ Ck
R
Under this null, mixing up (permuting) the statisticalunits of Ck
L and CkR should not alter the aggregation
process resulting in their merging in.
The algorithm - The permutation Test
Dario Bruzzese, Domenico Vistocco () Compstat 2010 9 / 19
For each k , the difference between maxj∈{L,R}
h(
Ckj
)and min
j∈{L,R}h(
Ckj
)can be considered as the
minimum cost necessary to merge the two classes..
min h(C3j )
max h(C3j )
The algorithm - The permutation Test
Dario Bruzzese, Domenico Vistocco () Compstat 2010 9 / 19
For each k , the difference between maxj∈{L,R}
h(
Ckj
)and min
j∈{L,R}h(
Ckj
)can be considered as the
minimum cost necessary to merge the two classes.
The difference between h(
CkL ∪ Ck
R
)and
maxj∈{L,R}
h(
Ckj
)can be, instead, considered as the
cost actually incurred for merging CkL and Ck
R .
h(C3L ∪ C3
R )
max h(C3j )
The algorithm - The permutation Test
Dario Bruzzese, Domenico Vistocco () Compstat 2010 9 / 19
For each k , the difference between maxj∈{L,R}
h(
Ckj
)and min
j∈{L,R}h(
Ckj
)can be considered as the
minimum cost necessary to merge the two classes.
The difference between h(
CkL ∪ Ck
R
)and
maxj∈{L,R}
h(
Ckj
)can be, instead, considered as the
cost actually incurred for merging CkL and Ck
R .
The ratio between these two differences:
cost(
CkL ∪ Ck
R
)=
maxj∈{L,R}
h(
Ckj
)− min
j∈{L,R}h(
Ckj
)h(Ck
L ∪ CkR
)− max
j∈{L,R}h(
Ckj
)is thus a measure that characterizes the aggregation process resulting in thenew class Ck
L ∪ CkR
The algorithm - The permutation Test
Dario Bruzzese, Domenico Vistocco () Compstat 2010 10 / 19
C3L C3
R
mC3LmC3
R
The algorithm - The permutation Test
Dario Bruzzese, Domenico Vistocco () Compstat 2010 10 / 19
C3L C3
R
mC3LmC3
R
mC3L
mC3R
h(mC3L)
h(mC3R)
The algorithm - The permutation Test
Dario Bruzzese, Domenico Vistocco () Compstat 2010 10 / 19
The ratio:
cost(
mCkL ∪ mCk
R
)=
maxj∈{L,R}
h(
mCkj
)− min
j∈{L,R}h(
mCkj
)h(Ck
L ∪ CkR
)− max
j∈{L,R}h(
mCkj
)is thus a measure that characterizes the aggregation process resulting in thenew (potential) class mCk
L ∪ mCkR
C3L C3
R
mC3LmC3
R
mC3L
mC3R
h(mC3L)
h(mC3R)
The algorithm - The permutation Test
Dario Bruzzese, Domenico Vistocco () Compstat 2010 10 / 19
Under H0 the aggregation process resulting in the new cluster CkL ∪ Ck
R should be very similar
to the one that potentially produces mCkL ∪ mCk
R ; thus the two values cost(
mCkL ∪ mCk
R
)and
cost(
CkL ∪ Ck
R
)should be close enough.
C3L C3
R
mC3LmC3
R
mC3L
mC3R
h(mC3L)
h(mC3R)
The algorithm - The permutation Test
Dario Bruzzese, Domenico Vistocco () Compstat 2010 10 / 19
The permutation procedure is repeated M times and each time a new couple mCkL , mCk
R isobtained. The pvalue Montecarlo is thus computed as:
p =#
{cost
(mCk
L ∪ mCkR
)≤ cost
(Ck
L ∪ CkR
)}+ 1
M + 1
C3L C3
R
mC3LmC3
R
mC3L
mC3R
h(mC3L)
h(mC3R)
La Carte
1 Motivation
2 The stairstep-like permutation procedureNotationThe outline
3 Some resultsReal datasetsSynthetic dataset
4 ToDo List
Dario Bruzzese, Domenico Vistocco () Compstat 2010 11 / 19
Some results - Real datasets
The yeast galactosedatasetIdeker T, Thorsson V, Ranish JA,Christmas R, Buhler J, Eng JK,Bumgarner RE, Goodlett DR,Aebersold R, Hood LIntegrated genomic andproteomic analyses of asystemically perturbed metabolicnetwork.
Science 2001, 292:929-934.
n = 205p = 80
Dario Bruzzese, Domenico Vistocco () Compstat 2010 12 / 19
Some results - Real datasets
Dario Bruzzese, Domenico Vistocco () Compstat 2010 12 / 19
% of misclassification = 1.5
Some results - Real datasets
The diabetes datasetBanfield JD, Raftery AEModel–based Gaussian andNon–Gaussian Clustering.
Biometrics, 1993, 49, 803-821.
n = 145p = 3
Dario Bruzzese, Domenico Vistocco () Compstat 2010 13 / 19
Some results - Real datasets
Dario Bruzzese, Domenico Vistocco () Compstat 2010 13 / 19
% of misclassification = 15.2
Some results - Synthetic dataset
QIU W.-L, JOE H. (2009). clusterGeneration: random clustergeneration (with specified degree of separation). R package version1.2.7.
different number of clusters (k = 2; 3; 4; 5; 6; 7)separation index = 0.01different number of variables (p = 5; 10; 15)100 replications for each combination of k and p
Dario Bruzzese, Domenico Vistocco () Compstat 2010 14 / 19
Some results - Synthetic dataset (p=5)
Dario Bruzzese, Domenico Vistocco () Compstat 2010 15 / 19
Some results - Synthetic dataset (p=10)
Dario Bruzzese, Domenico Vistocco () Compstat 2010 16 / 19
Some results - Synthetic dataset (p=15)
Dario Bruzzese, Domenico Vistocco () Compstat 2010 17 / 19
La Carte
1 Motivation
2 The stairstep-like permutation procedureNotationThe outline
3 Some resultsReal datasetsSynthetic dataset
4 ToDo List
Dario Bruzzese, Domenico Vistocco () Compstat 2010 18 / 19
ToDo List
Statistical issues
Introducing a penalty term in the permutation test stepQuality measures of the obtained partitionMultiple Testing Problem (???)
Computational issues
profiling and optimizing the R codeI use of compiled codeI deploying a package
Dario Bruzzese, Domenico Vistocco () Compstat 2010 19 / 19