[IEEE The 2006 IEEE International Joint Conference on Neural Network Proceedings - Vancouver, BC,...
Cluster Ensemble for Gene Expression Microarray
Data: Accuracy and Diversity
Marcilio C. P. de Souto, Daniel S. A. de Araujo, Bruno L. C. da Silva
Department of Informatics and
Applied Mathematics
Federal University of Rio Grande do Norte
Natal, RN, Brazil, 59072-970
E-mail: [email protected], [email protected], [email protected]
Abstract— Historically, the classification of different types of cancer depended on efforts by biologists who tried to establish, based on assumptions, the subtypes of a given tumor. However, up to now, there is no well-grounded methodology to aid in such a task. One step in this direction has arisen with the idea of analyzing the gene expression of tumors and basing the decision on such an analysis. In this context, we analyze the potential of applying cluster ensemble techniques to gene expression microarray data. Our experimental results show that there is often a significant improvement in the results obtained with the use of ensembles when compared to those based on the clustering techniques used individually.
I. INTRODUCTION
The aim of this paper is to investigate the potential
of applying cluster analysis techniques to gene expression
microarray data. As pointed out by [1], this type of un-
supervised analysis is of increasing interest in the field
of functional genomics and gene expression data analysis.
One of the reasons for this is the need for molecular-
based refinement of broadly defined biological classes, with
implications in cancer diagnosis, prognosis and treatment [1],
[2], [3].
More specifically, we develop experiments with the cluster
ensemble methods described in [4], [5], [6]. In a previous
work [7], we presented preliminary results with the technique
in [5]. The goal of cluster combination methods (ensem-
bles or committees) is to find a consensus among different
results (partitions) generated with one or more clustering
algorithms. The resulting consensus partition should be more
robust than the original ones [8].
In order to develop our experiments, the partitions pro-
duced by three clustering algorithms, representative of dif-
ferent clustering paradigms, are selected as input for building
the ensembles: the k-means, the Expectation-Maximization
(EM) algorithm, and the hierarchical method with average
linkage [9], [10]. These algorithms have been widely used
in the gene expression literature [1], [2], [3].
Furthermore, three gene expression datasets are used in
this work [1]: St. Jude Leukemia, Novartis multi-tissue,
Gaussian3. These datasets allow us, among other things, to
analyze the ability of the clustering methods to find refinements
of broadly defined biological classes.
The remainder of this paper is divided into six sections.
Section II presents the main concepts on the cluster ensem-
bles methods. In Section III, we introduce the evaluation
methodology for our experiments. The description of the
datasets used are shown in Section IV. Our experimental
setup is presented in Section V. In Section VI, we show the
results obtained with the individual clustering methods, as
well as with the homogeneous and the heterogeneous ensembles
(in terms of accuracy and diversity). Finally, Section VII
summarizes the main results obtained and indicates some
possible future work.
II. CLUSTER ENSEMBLE
A cluster ensemble consists of two parts [11], [12], [8]: a
base partition constructor and a consensus function. Given a
data set, a base partition constructor generates a collection
of clustering (base partition) solutions. A consensus function
then combines the base partitions and produces a single
partition as the final output of the ensemble system. Below
we formally describe these two parts.
• [11], [12] Base partition generation strategy. Given
a set of n objects (instances) X = {x1, x2, ..., xn},
a base partition constructor generates a collection of
partitions, denoted by Π = {π1, π2, ..., πr}, where r is
the number of base partitions. Each πi is a partition of
X into Ki disjoint groups (clusters) of objects, denoted
by πi = {ci1, ci2, ..., ciKi}, where ∪_{j=1}^{Ki} cij = X.
• [12] Integration strategy (consensus function).
Given a set Π of base partitions and a number K of
groups to be generated, a consensus function Γ uses the
information provided by Π to partition X into K
disjoint groups, yielding πf as the final solution.
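To make the notation above concrete, the base partitions Π can be represented as label vectors over the n objects. The sketch below is our own illustration in Python, not code from the paper; it shows this representation and a helper that recovers the groups cij of each partition.

```python
# Each base partition pi_i over n objects is stored as a label vector of
# length n; Pi is the collection {pi_1, ..., pi_r} fed to a consensus function.
n = 6
Pi = [
    [0, 0, 1, 1, 2, 2],  # pi_1: K_1 = 3 groups
    [1, 1, 0, 0, 0, 0],  # pi_2: K_2 = 2 groups
    [0, 0, 0, 1, 1, 1],  # pi_3: K_3 = 2 groups
]

def clusters_of(labels):
    """Recover the disjoint groups c_i1, ..., c_iKi from a label vector."""
    groups = {}
    for obj, lab in enumerate(labels):
        groups.setdefault(lab, set()).add(obj)
    return list(groups.values())

# Sanity check: the groups of each partition are disjoint and cover X.
for pi in Pi:
    assert set().union(*clusters_of(pi)) == set(range(n))
```

A consensus function Γ is then any procedure that maps such a collection Π, plus the desired number of groups K, to a single final label vector πf.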
With respect to the first part, several approaches compara-
ble to those used in supervised learning have been proposed
to introduce artificial instabilities in clustering algorithms [8].
For example, in this paper, we analyze ensembles based on
combining several runs of the same clustering algorithm with
different initial conditions. Another approach studied is to
generate ensembles from the combination of the results of
different clustering methods.
The combination of multi-partitions, the second part, is
often accomplished via the definition of a consensus
function. For example, in this paper, we develop experiments
with three successful cluster ensemble (consensus functions)
methods:
• Co-association matrix [4], [13].
• Re-labeling and voting [6].
• Methods based on graph partitioning [5].
In the combination strategy based on the co-association ma-
trix, the similarity between two objects is estimated by counting
the number of base partitions in which the two objects are placed
in the same cluster. The underlying assumption is that
objects belonging to a natural cluster are very likely to be placed
in the same group in different partitions. This similarity expresses
the strength of the co-association of each pair of objects [4], [13].
In fact, this matrix can be seen as a similarity matrix.
Therefore, it can be used as input to any clustering algorithm
that operates directly with a similarity matrix. For example,
in order to find the consensus partition, [4] apply the hierar-
chical method with single linkage to this matrix - the partition
is generated by cutting the dendrogram at a given height. In this
work, as in [4], we will use the hierarchical method with single
linkage to find the consensus partition. Hereafter, for short,
we will refer to this method as Co-association.
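As an illustration of this scheme, the sketch below (our own Python, not the authors' code) builds the co-association matrix from label-vector base partitions and extracts a consensus partition with a naive single-linkage merge on the similarities; an optimized implementation would use an efficient hierarchical clustering routine instead.

```python
from itertools import combinations

def co_association(Pi, n):
    """n x n matrix whose (a, b) entry is the fraction of base partitions
    in which objects a and b are placed in the same cluster."""
    M = [[0.0] * n for _ in range(n)]
    for labels in Pi:
        for a, b in combinations(range(n), 2):
            if labels[a] == labels[b]:
                M[a][b] += 1.0 / len(Pi)
                M[b][a] = M[a][b]
    for a in range(n):
        M[a][a] = 1.0
    return M

def single_linkage_consensus(M, n, K):
    """Cut a single-linkage agglomeration of the co-association
    similarities into K clusters: repeatedly merge the pair of clusters
    whose maximum pairwise co-association is largest (naive O(n^3))."""
    clusters = [{i} for i in range(n)]
    while len(clusters) > K:
        i, j = max(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: max(M[x][y]
                               for x in clusters[ab[0]]
                               for y in clusters[ab[1]]),
        )
        clusters[i] |= clusters.pop(j)
    return clusters
```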
The other consensus function analyzed is based on a
voting scheme [6]. In the voting scheme, we try to find a
correspondence between the cluster labels across the different
partitions. Once this correspondence is found, the clusters
with the same label are fused. Solving the correspondence
problem is important because the labels that we assign
to the clusters in the individual partitions are arbitrary.
Consequently, if two identical partitions have the labels of
their clusters permuted, they could be perceived as different
ones. For k clusters, there are k! label permutations and an
exhaustive search may not be possible for large k.
In order to avoid the need for an exhaustive search, [14]
proposed a heuristic approximation to consistent labeling.
Their algorithm combines r partitions on a sequential basis:
at each step, a locally optimal permutation is found and two
partitions are fused. After voting over the r partitions, we obtain
for every data point xi and every cluster j a value pij . This
value represents the fraction of times that xi has been assigned
to cluster j. For interpreting the final result, we can either accept
this fuzzy decision or assign every data point xi to the cluster
with the highest pij . In this paper, we use the latter.
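The re-labeling-and-voting idea can be sketched as follows. This is our simplified Python illustration: labels are aligned with a greedy overlap matching rather than the locally optimal permutation search of [14], and the hard (argmax) decision is taken at the end.

```python
def relabel(reference, target, K):
    """Greedily map each label of `target` to the label of `reference`
    with which it overlaps most (a simplification of the permutation
    search in the sequential scheme of [14])."""
    overlap = [[0] * K for _ in range(K)]
    for r, t in zip(reference, target):
        overlap[t][r] += 1
    mapping, used = {}, set()
    pairs = sorted(((overlap[t][r], t, r)
                    for t in range(K) for r in range(K)), reverse=True)
    for _, t, r in pairs:
        if t not in mapping and r not in used:
            mapping[t] = r
            used.add(r)
    return [mapping[t] for t in target]

def vote(Pi, K):
    """p[i][j] is the fraction of (relabeled) partitions assigning object
    i to cluster j; the hard consensus assigns i to argmax_j p[i][j]."""
    ref = Pi[0]
    n = len(ref)
    p = [[0.0] * K for _ in range(n)]
    for labels in Pi:
        for i, j in enumerate(relabel(ref, labels, K)):
            p[i][j] += 1.0 / len(Pi)
    return [max(range(K), key=lambda j: p[i][j]) for i in range(n)]
```

Note how the second partition below is identical to the first up to a label permutation; after relabeling, the vote is unanimous.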
With respect to the ensemble methods in [5], they for-
mulate the consensus function as a graph partitioning task.
Given a weighted graph, a general goal of graph partitioning
is to find a K-way partition that minimizes the cut (sum
of the weights of the edges to be removed), subject to the
constraint that each part should contain roughly the same
number of vertices [12]. Each connected component after
the cut corresponds to a cluster in the consensus partition.
In practice, various graph partitioning algorithms define
different optimization criteria based on the above goal. In
this context, [5] introduce three different consensus func-
tions, each of which formulates and solves a different graph
partitioning problem given the set of base partitions. The
first approach, CSPA (Cluster-based Similarity Partitioning
Algorithm), builds a co-association matrix, which is used to
create a graph - objects are vertices and their similarities
are the weights of the edges. The second function, HGPA
(HyperGraph-Partitioning Algorithm), models every cluster
of the base partitions as a hyperedge of a hypergraph. Finally,
the third consensus function, MCLA (Meta-CLustering
Algorithm), models clusters as vertices in an r-partite graph,
where r is the number of base partitions. Hereafter, for short,
we will refer to these three consensus functions as Graph.
III. EVALUATION METHODOLOGY
The evaluation we use is aimed mainly at assessing how
good the clustering methods investigated are at recovering
known clusters from gene expression microarray data. In
order to do so, we consider three datasets for which multi-
class distinction is available. As in other works, each of
these datasets constitutes the gold standard against which
we evaluate the clustering results [1], [9]. Following the
convention in [1], we refer to the gold standard partition
as classes, while we reserve the word clusters for the partition
returned by the clustering algorithm.
A. Accuracy and Diversity
In the context of clustering algorithms, there is no
definitive measure of accuracy [9]. Nevertheless, in cases in
which class labels are available, external evaluation criteria
can provide an effective means of assessing the quality of a
partition [9], [8].
For example, the cluster composition can be evaluated by
measuring the degree of agreement between two partitions
(U and V), where partition U is the result of a clustering
method and partition V (the gold standard) is formed from a
priori information independent of partition U, such as a class
label. There are a number of external indices defined in the
literature, such as Hubert, Jaccard, Rand and corrected Rand
(or adjusted Rand) [9] that can be used for this measurement.
One characteristic of most of these indices is that they can
be sensitive to the number of classes in the partitions or to
the distributions of elements in the clusters. For example,
some indices have a tendency to present higher values for
partitions with more classes (Hubert and Rand), others for
partitions with a smaller number of classes (Jaccard) [15].
The corrected Rand index, which has its values corrected
for chance agreement, does not have any of these undesirable
characteristics [16]. Thus, the corrected Rand index - cR,
for short - is the external index used in the evaluation
methodology used in this work. The corrected Rand index
can take values from -1 to 1, with 1 indicating perfect
agreement between the partitions, and values near 0
or negative values corresponding to cluster agreement found
by chance.
Formally, let U = {u1, . . . , ur, . . . , uR} be the partition
given by the clustering solution, and V = {v1, . . . , vc, . . . , vC}
be the partition formed from a priori information independent
of partition U (the gold standard). The corrected Rand is defined as
cR = \frac{\sum_{i}^{R} \sum_{j}^{C} \binom{n_{ij}}{2} - \binom{n}{2}^{-1} \sum_{i}^{R} \binom{n_{i\cdot}}{2} \sum_{j}^{C} \binom{n_{\cdot j}}{2}}{\frac{1}{2}\left[\sum_{i}^{R} \binom{n_{i\cdot}}{2} + \sum_{j}^{C} \binom{n_{\cdot j}}{2}\right] - \binom{n}{2}^{-1} \sum_{i}^{R} \binom{n_{i\cdot}}{2} \sum_{j}^{C} \binom{n_{\cdot j}}{2}}
where (1) nij represents the number of objects in both cluster
ui and class vj ; (2) ni· indicates the number of objects in cluster
ui; (3) n·j indicates the number of objects in class vj ; (4)
n is the total number of objects; and (5) \binom{a}{b} is the binomial
coefficient a!/(b!(a−b)!).
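The cR definition above can be computed directly from the contingency counts; the sketch below is our own minimal Python illustration, not the implementation used in the experiments.

```python
from math import comb

def corrected_rand(U, V):
    """Corrected (adjusted) Rand index between two label vectors."""
    labels_u, labels_v = sorted(set(U)), sorted(set(V))
    n = len(U)
    # contingency counts n_ij and the marginal sums over rows and columns
    nij = {(u, v): 0 for u in labels_u for v in labels_v}
    for a, b in zip(U, V):
        nij[(a, b)] += 1
    sum_ij = sum(comb(c, 2) for c in nij.values())
    sum_i = sum(comb(sum(1 for x in U if x == u), 2) for u in labels_u)
    sum_j = sum(comb(sum(1 for x in V if x == v), 2) for v in labels_v)
    expected = sum_i * sum_j / comb(n, 2)   # chance-agreement term
    max_index = 0.5 * (sum_i + sum_j)
    return (sum_ij - expected) / (max_index - expected)
```

The index is invariant to label permutations: two identical partitions with swapped labels still yield cR = 1.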
Besides assessing the accuracy of the ensembles by means
of the corrected Rand index, we also investigate the rela-
tionship between accuracy and diversity. Diversity within an
ensemble is of vital importance for its success. An ensemble
formed from identical base partitions will not outperform its
individual members [17].
In order to analyze this aspect, we will use a diversity
measure proposed in [17], which is based on the corrected
Rand index (cR). This measure is a pairwise one in that the
diversity of an ensemble will be proportional to the diversity
between its individual components (Equation 1). Formally,
suppose that our ensemble partition πf was generated from
Π = {π1, π2, ..., πr}; its diversity can then be calculated using
the following pairwise measure:
D = \frac{2}{r(r-1)} \sum_{i=1}^{r-1} \sum_{j=i+1}^{r} \left(1 - cR(\pi_i, \pi_j)\right) \qquad (1)
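Equation 1 is straightforward to compute once a pairwise cR function is available. In this sketch (ours, for illustration), the similarity is passed in as an argument; in the test a crude label-equality stand-in is used in place of the real corrected Rand index.

```python
def diversity(Pi, cR):
    """Mean pairwise disagreement of Equation 1: the average of
    1 - cR(pi_i, pi_j) over all pairs of base partitions in Pi."""
    r = len(Pi)
    total = sum(1.0 - cR(Pi[i], Pi[j])
                for i in range(r - 1)
                for j in range(i + 1, r))
    return 2.0 * total / (r * (r - 1))
```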
B. Cross-validation
In order to compare the performance of the clustering
algorithms, we calculate the mean of the cR via an unsu-
pervised k-fold cross-validation procedure [18], in which the
data set is divided into k folds. At each iteration of the procedure, one
fold is used as the test set, and the remaining folds as the
training set. The training set is presented to a clustering
method, giving a partition as result (training partition). Then,
the nearest centroid technique is used to build a classifier
from the training partition. The centroid technique calculates
the proximity between the elements in the test set and
the centroids of each cluster in the training partition (the
proximity must be measured with the same proximity index
used by the clustering method evaluated).
A new partition (test partition) is then obtained by as-
signing each object in the test set to the cluster with nearest
centroid. Next, the test partition is compared with the a priori
partition (gold standard) by using an external index (this a
priori partition contains only the objects of the test partition).
At the end of the procedure, a sample with size k of the
values for the external index is available.
The general idea of the k-fold cross-validation procedure
is to observe how well data from an independent set is
clustered, given the training results. If the results of a training
set have a low agreement with the a priori classification,
so should the results of the respective test set. In
conclusion, the objective of the procedure is to obtain k
observations of the accuracy of the unsupervised methods
with respect to the gold standard, all this with the use of
independent test folds.
IV. DATASETS
The three gene expression datasets used in this work are
the St. Jude leukemia, Novartis multi-tissue, and Gaussian3
- exactly as presented in [1] - Table I. St. Jude Leukemia
and Novartis multi-tissue are real datasets. More specif-
ically, the former is composed of instances representing
diagnosed samples of bone marrow from pediatric acute
leukemia patients, corresponding to six prognostically impor-
tant leukemia subtypes - 43 T-lineage ALL; 27 E2A-PBX1;
15 BCR-ABL; 79 TEL-AML1 and 20 MLL rearrangements;
and 64 “hyperdiploid > 50” chromosomes. The instances of
the latter represent tissue samples from four distinct cancer
types - 16 breast, 26 prostate, 28 lung, and 23 colon samples.
TABLE I
DESCRIPTION OF THE DATASETS
Dataset # Classes # Instances # Attributes
Gaussian3 3 60 600
Novartis multi-tissues 4 103 1000
St. Jude leukemia 6 248 1000
In contrast to St. Jude Leukemia and Novartis multi-
tissue datasets, Gaussian3 is an artificial dataset. Gaussian3
represents the union of three Gaussian distributions (three
classes) in a 600-dimensional space. These data simulate
the process of gene co-regulation. Each class is uniquely
characterized by a subset formed by 200 attributes (genes),
which are set to represent up-regulated values in this class
and down-regulated values for the other two.
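A dataset of this shape might be generated roughly as follows. This is a hypothetical sketch: the function name and all numeric parameters (means, variance, seed) are our assumptions for illustration, not the values used in [1].

```python
import random

def make_gaussian3(per_class=20, dim=600, block=200,
                   up=2.0, down=-2.0, sigma=1.0, seed=0):
    """Three Gaussian classes in a `dim`-dimensional space: class k is
    up-regulated on its own block of `block` attributes and
    down-regulated on the rest. Parameters are illustrative only."""
    rng = random.Random(seed)
    data, labels = [], []
    for k in range(3):
        for _ in range(per_class):
            point = [rng.gauss(up if block * k <= d < block * (k + 1)
                               else down, sigma)
                     for d in range(dim)]
            data.append(point)
            labels.append(k)
    return data, labels
```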
V. EXPERIMENTS
The experiments were accomplished by presenting the
three datasets (Gaussian3, Novartis multi-tissue, and St. Jude
Leukemia) to the individual clustering methods: k-means,
EM algorithm, and the hierarchical method with average link-
age (all of them implemented with the Euclidean distance).
Initially, five replications of the 2-fold cross-validation of each
dataset were performed such that 10 samples of each dataset
were formed.
In terms of parameter settings, the number of clusters for
the k-means and the EM algorithm was varied in two ways.
First, we conducted experiments in which such a number was
set to the number of classes in the dataset being considered -
e.g., c=3 for the Gaussian3 dataset. Then, we did another
run of experiments in which the number of clusters was set
to twice the number of classes - e.g., c=6 for the
Gaussian3 dataset.
Furthermore, as these algorithms are dependent on the
choice of the initial conditions (e.g., initial centers and
means), for them we repeated each run 10 times, each one
with a distinct random initialization. For example, for each
one of the 10 cross-validation samples of each dataset, the
experiment with k-means (or the EM algorithm) yielded 10
partitions - Table II (cR stands for the mean of the corrected
Rand for the independent test sets).
Since the hierarchical method with average linkage is
deterministic, only 10 runs of this algorithm were executed,
that is, one run for each cross-validation sample of the
dataset. Also, as the external index used in this work is
suitable only for partition comparison, in order to build the
partitions from the hierarchical method, the trees were traversed
from the root to the leaves, and the first c sub-trees were taken
as the clusters (with c equal to both the exact number and twice
the number of classes in the dataset) - Table II.
The ensembles were built according to the type of base
partitions used as input (with c or 2*c clusters per partition) -
Tables III, IV, V. The first type - Ensemble-(c,c) - received as
input base partitions whose number of clusters was set to the
exact number c of classes in the datasets. The final partition
was also set to have K=c clusters. In contrast, the number of
clusters for the base partitions for the second configuration
- Ensemble-(2*c,c) - was set to 2*c. Here, like in the first
case, the number of clusters in the final partition was also
set to K=c.
In the context above, the ensembles were formed as
follows. First, for the k-means (EM algorithm) we created 10
ensembles, where each ensemble was formed by combining
the 10 runs of the k-means (EM algorithm) for a given
cross-validation partition (10 samples in total - 2x5 cross-
validation). That is, we formed homogeneous ensembles in
that the partitions used to form the consensus come from the
same type of algorithm. As the voting scheme in [6] imposes
that the number of clusters in the final partition and in the
base partitions be the same, such a method was used only
with the first ensemble configuration - Ensemble(c,c). Thus,
in Tables III, IV, V, the term “n/a” stands for not applicable.
We also combined partitions produced by the different
clustering algorithms (heterogeneous ensembles). But before
doing so, for the k-means experiment (EM algorithm ex-
periment), from the 10 partitions generated by the repeated
random restarts, we chose only one for further analysis: the
one with the largest Silhouette index [19]. In a Silhouette
calculation, the distance from each data point in a cluster
to all other data points within the same cluster (tightness)
and to all data points in the closest cluster (separation)
are determined. Based on these values, one can calculate the
average Silhouette for the partition being considered. A good
partition of the dataset is indicated by large values of
the Silhouette index.
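The Silhouette computation just described can be sketched as follows (our illustrative Python, not the implementation used in the experiments; it assumes every cluster contains at least two points).

```python
def silhouette_mean(points, labels):
    """Mean Silhouette width [19]. For each point: a = mean distance to
    the other points of its own cluster (tightness), b = smallest mean
    distance to the points of another cluster (separation);
    s = (b - a) / max(a, b)."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    ks = sorted(set(labels))
    total = 0.0
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        a = sum(own) / len(own)
        b = min(sum(dist(p, points[j]) for j in range(len(points))
                    if labels[j] == k) / labels.count(k)
                for k in ks if k != labels[i])
        total += (b - a) / max(a, b)
    return total / len(points)
```

A well-separated partition scores close to 1; mixing the two natural groups drives the mean Silhouette down (often below 0).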
In this context, the heterogeneous ensembles were formed
according to the cross-validation samples for each dataset,
that is, we built 10 different ensembles. More specifically,
the inputs for building a given ensemble (e.g., for a given
cross-validation sample for a certain dataset) were the mul-
tiple clustering label vectors, where each vector represented
the resulting partition of a given individual technique - in
Tables III, IV, V, the term “Heter.” stands for heterogeneous
ensemble.
Finally, for all the experiments, the mean values of the
corrected Rand index (cR) for the test folds were measured.
Next, the means of cR obtained with the individual methods
were compared two by two to those obtained with the
ensembles. This was accomplished by means of a paired t-
test, as described in [20].
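The paired t statistic over the k matched cR samples (one per test fold) can be computed as below; this is our own sketch, and the comparison of the statistic against the critical value of the t distribution with k−1 degrees of freedom (via a table or a statistics library) is not shown.

```python
from math import sqrt

def paired_t(xs, ys):
    """Paired t statistic for two matched samples of cR values,
    e.g. an ensemble vs. an individual method over the same folds [20]."""
    diffs = [x - y for x, y in zip(xs, ys)]
    k = len(diffs)
    mean = sum(diffs) / k
    var = sum((d - mean) ** 2 for d in diffs) / (k - 1)  # sample variance
    return mean / sqrt(var / k)  # compare against t_{k-1} critical value
```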
VI. RESULTS
A. Homogeneous ensembles: accuracy
According to Table III, for the Gaussian3 dataset, the Co-
association consensus function obtained, for both k-means
and EM (homogeneous ensembles), an average accuracy
significantly higher than that of the respective individual methods.
Still in this context, there is no statistical evidence of a
difference between the performance of the Co-association
ensemble with the configurations (c,c) and (2 ∗ c,c). For the
Voting consensus function, besides the significant gain in
accuracy, we can observe a great decrease in the standard
deviation. For example, the individual EM algorithm for c=3
(Table II) presented a cR mean of 0.58 with a standard
deviation of 0.26, whereas the Voting ensemble formed by
these EM partitions (Table III) showed a cR mean of 0.99
with a standard deviation of 0.03. There was an even greater
improvement in the results when the Graph consensus func-
tion was applied. Such ensembles presented a cR mean of
1.00, that is, a perfect agreement with the a priori partition.
For the Novartis multi-tissue dataset, according to Ta-
ble IV, the use of the Co-association consensus function
does not lead to a gain in performance when compared to
the individual methods. In fact, in the case of the (2 ∗ c,c)
configuration, the null hypothesis was rejected in favor of
the k-means and the EM. In contrast, for the case of the
Voting consensus function, like the results obtained for the
Gaussian3 dataset, there was a significant increase in terms
of accuracy (cR mean) as well as robustness (standard
deviation). For instance, the k-means for c=4 (Table II)
presented a cR mean of 0.75 with a standard deviation of
0.20, whereas the Voting ensemble formed by these k-means
partitions (Table IV) showed a cR mean of 0.94 with a
standard deviation of 0.01. The Graph consensus function
presented a behavior similar to the Voting one.
In general, the St. Jude leukemia dataset (Table V) was
the hardest one to be clustered by the ensemble methods.
For example, most of the homogeneous ensembles formed
presented a performance comparable or inferior to that of the
individual clustering methods.
B. Heterogeneous ensembles: accuracy
For the Gaussian3 dataset (Table III), the results obtained
with the Co-association consensus function were, in
general, inferior to those of the individual methods. In
contrast, the Voting and the Graph consensus functions, like
in the case of homogeneous ensembles, obtained a cR mean
of 1.0, that is, a perfect agreement with the a priori partition.
Table IV illustrates the performance of the ensemble methods
for the Novartis multi-tissue dataset. Here, as for the other
datasets, the results obtained with the Co-association con-
sensus function were similar (configuration (c,c)) or worse
(configuration (2 ∗ c,c)) than those obtained with the respective
individual methods. On the other hand, in the case of the
Voting consensus function, the same behavior pattern of
the homogeneous ensembles occurred - increase in terms
of accuracy (cR mean) as well as robustness (standard
deviation). For instance, the hierarchical method with average
linkage for c=4 (Table II) presented a cR mean of 0.78 with
a standard deviation of 0.14, whereas the Voting ensemble
formed by the three different methods (Table IV) showed
a cR mean of 0.96 with a standard deviation of 0.03. The
Graph consensus function also presented a performance sig-
nificantly superior to the individual methods. In this context,
there was no statistical evidence to state the difference of
performance between the (c,c) and (2 ∗ c,c) configurations.
In terms of the St. Jude leukemia dataset, for the two
ensemble configurations - (c,c) and (2 ∗ c,c) - the results
obtained with the Co-association consensus function were
inferior to those achieved by the respective individual meth-
ods. The null hypotheses were rejected in favor of the
individual methods. The heterogeneous ensembles built with
the Voting consensus function (Table V) presented a signifi-
cantly higher accuracy (0.93) than the individual methods (k-
means=0.80, EM=0.83, and hierarchical=0.50). The Graph
consensus function presented a behavior similar to that of the
Co-association one, that is, no improvement when compared
to the individual methods.
C. Ensemble diversity
Table VI illustrates the mean of the diversity measure (D)
- Equation 1 - for each set of base partitions used as input
to build the ensembles. According to this table, although
combining clusters from multiple partitions is useful only
if there is disagreement among the partitions, diversity in
itself is not a sufficient condition to determine the future
behavior of the ensemble formed.
For instance, with the exception of the St. Jude leukemia data
for c=6, the diversities of the base partitions used as input
for the heterogeneous ensembles were much lower than the
ones for the homogeneous ensembles. On the other hand,
the accuracy of the heterogeneous ensembles was, with the
exception of the Co-association consensus function, superior
or equivalent to that of the homogeneous ensembles. Thus,
not only the diversity, but also the choice of the type of consensus
function has a great impact on the success of the ensemble.
The behavior described in the previous paragraph can be
observed, for example, in the case of the Gaussian3 dataset.
The diversity mean for c=3 for the k-means ensembles was
0.80 (Table VI), whereas the accuracy means for the Co-
association, the Voting, and the Graph consensus functions
were, respectively, 0.68, 0.92, and 1.0. That is, despite
the high diversity of the base partitions, the Co-association
consensus function was not very successful in recovering the
underlying structure in the dataset.
This kind of behavior was also observed in the context
of lower diversity. For instance, in the case of c=3 for the
Gaussian3 dataset, the diversity mean for the base partitions
for the heterogeneous ensembles was only 0.21 (Table VI).
As one can see in Table II, the means (and standard devia-
tions) of the corrected Rand indexes for the k-means and the
EM were, respectively, 0.56± 0.30 and 0.58± 0.26. That is,
if the Silhouette index (see Section V) were able to identify
the best partitions to form the ensembles, they would have
corrected Rand indexes around, respectively, 0.86 and 0.84.
Thus, by presenting high values for the corrected Rand
index, these base partitions, though presenting a lower
diversity (see Equation 1), would have high quality in terms
of accuracy. In this case, most of the diversity would come
from the partitions built with the hierarchical method - they
presented a cR mean of only 0.50 with a very small standard
deviation (0.03).
In the context above, whereas the Voting and the Graph
consensus functions were able to take the best of the base
partitions in order to improve the mean accuracy to
1.0 ± 0.0 (perfect agreement with the a priori partition), the
Co-association consensus function led the accuracy to only
0.48 ± 0.29.
VII. CONCLUSIONS
In this paper, we discussed cluster ensemble methods and
conducted a series of experiments with real and synthetic
gene expression datasets. More specifically, we analyzed,
from an experimental point of view, some strategies for
generating (homogeneous and heterogeneous) and integrating
(Co-association, Voting, and Graph consensus functions) the
ensembles.
The experimental analysis indicates that the ensemble
techniques used often offered considerable potential to
improve the accuracy when compared to that achieved
by the individual clustering methods (k-means, EM, and
hierarchical method with average linkage). In general, there
was no significant difference among the cluster ensemble
methods studied, with the exception of the Co-association
consensus function with the hierarchical method with single
linkage as the meta-clustering algorithm, which often presented
a poorer performance.
This can be clearly observed for the St. Jude leukemia
dataset (Table V). One of the reasons for this poor perfor-
mance could be the fact that clusters in such a dataset are not
well separated. This kind of situation is not handled well
by the single-linkage algorithm [9], [13]. An alternative to
this problem would be the use of other clustering methods,
such as average linkage, as the meta-clustering algorithm [13].
Our results have also shown, at least for the methods and
the datasets used, that there was no significant performance
gain by combining partitions produced by different clustering
algorithms (heterogeneous ensembles), compared to the strat-
egy of combining several runs of each clustering algorithm
(homogeneous ensembles). However, this issue should be
further investigated.
Furthermore, the experiments conducted showed that even
when the consensus functions were set to (1) receive as input
base partitions whose number of clusters was twice
the number of known classes, and (2) produce as output a
final partition with the true number of classes, the final partitions
generated were as accurate as the original base partitions.
Finally, our experimental results corroborate the claim that,
although the combination of clusters from multiple partitions
is useful only if there is disagreement among the partitions,
diversity alone is not a sufficient condition to determine the
future behavior of the ensemble formed. In fact, not only the
diversity, but also the choice of the type of consensus function
has a great impact on the success of the ensemble.
TABLE II
INDIVIDUAL METHODS
Dataset    Algorithm  cR for c      cR for 2 ∗ c
Gaussian3  k-means    0.56 ± 0.30   0.77 ± 0.24
           EM         0.58 ± 0.26   0.76 ± 0.24
           Hier.      0.50 ± 0.03   0.93 ± 0.03
Novartis   k-means    0.75 ± 0.20   0.80 ± 0.08
           EM         0.75 ± 0.20   0.79 ± 0.09
           Hier.      0.78 ± 0.14   0.80 ± 0.08
St. Jude   k-means    0.80 ± 0.12   0.70 ± 0.09
           EM         0.83 ± 0.09   0.70 ± 0.09
           Hier.      0.50 ± 0.03   0.92 ± 0.03
TABLE III
ENSEMBLE METHODS - GAUSSIAN3
Ensemble        Base     cR for c=3    cR for c=6
Co-association  k-means  0.68 ± 0.23   0.82 ± 0.37
                EM       0.73 ± 0.23   0.63 ± 0.30
                Heter.   0.48 ± 0.29   0.17 ± 0.27
Voting          k-means  0.92 ± 0.18   n/a
                EM       0.99 ± 0.03   n/a
                Heter.   1.00 ± 0.00   n/a
Graph           k-means  1.00 ± 0.00   1.00 ± 0.00
                EM       1.00 ± 0.00   1.00 ± 0.00
                Heter.   1.00 ± 0.00   1.00 ± 0.00
TABLE IV
ENSEMBLE METHODS - NOVARTIS MULTI-TISSUE
Ensemble        Base     cR for c=4    cR for c=8
Co-association  k-means  0.71 ± 0.22   0.60 ± 0.24
                EM       0.84 ± 0.16   0.86 ± 0.08
                Heter.   0.69 ± 0.21   0.50 ± 0.30
Voting          k-means  0.94 ± 0.01   n/a
                EM       0.95 ± 0.03   n/a
                Heter.   0.96 ± 0.03   n/a
Graph           k-means  0.96 ± 0.02   0.96 ± 0.02
                EM       0.96 ± 0.02   0.96 ± 0.02
                Heter.   0.96 ± 0.02   0.95 ± 0.03
ACKNOWLEDGMENT
We would like to thank Evgenia Dimitriadou for providing
her code for the voting-merging clustering algorithm. The
authors would also like to thank CNPq for financial support
via grant 470319/2004-6.
TABLE V
ENSEMBLE METHODS - ST. JUDE LEUKEMIA
Ensemble        Base     cR for c=6    cR for c=12
Co-association  k-means  0.25 ± 0.13   0.22 ± 0.15
                EM       0.18 ± 0.18   0.28 ± 0.16
                Heter.   0.35 ± 0.07   0.16 ± 0.12
Voting          k-means  0.77 ± 0.13   n/a
                EM       0.75 ± 0.12   n/a
                Heter.   0.93 ± 0.04   n/a
Graph           k-means  0.79 ± 0.07   0.46 ± 0.02
                EM       0.87 ± 0.07   0.45 ± 0.03
                Heter.   0.74 ± 0.10   0.82 ± 0.08
TABLE VI
MEAN OF THE DIVERSITY INDEX
Dataset    Algorithm  D for c       D for 2 ∗ c
Gaussian3  k-means    0.80 ± 0.09   0.75 ± 0.06
           EM         0.79 ± 0.08   0.71 ± 0.04
           Heter.     0.21 ± 0.16   0.52 ± 0.12
Novartis   k-means    0.40 ± 0.11   0.41 ± 0.04
           EM         0.40 ± 0.05   0.43 ± 0.04
           Heter.     0.16 ± 0.13   0.24 ± 0.06
St. Jude   k-means    0.31 ± 0.05   0.46 ± 0.02
           EM         0.25 ± 0.04   0.45 ± 0.03
           Heter.     0.36 ± 0.02   0.32 ± 0.06
REFERENCES
[1] S. Monti, P. Tamayo, J. Mesirov, and T. Golub, “Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data,” Machine Learning, vol. 52, pp. 91–118, 2003.
[2] J. Quackenbush, “Computational analysis of cDNA microarray data,” Nature Reviews, vol. 6, no. 2, pp. 418–428, 2001.
[3] D. Slonim, “From patterns to pathways: gene expression data analysis comes of age,” Nature Genetics, vol. 32, pp. 502–508, 2002.
[4] A. L. N. Fred and A. K. Jain, “Data clustering using evidence accumulation,” in 16th International Conference on Pattern Recognition, 2002, pp. 276–280.
[5] A. Strehl and J. Ghosh, “Cluster ensembles – a knowledge reuse framework for combining multiple partitions,” Journal of Machine Learning Research (JMLR), vol. 3, pp. 583–617, 2002.
[6] E. Dimitriadou, A. Weingessel, and K. Hornik, “A cluster ensembles framework,” in Third International Conference on Hybrid Intelligent Systems (HIS), 2003, pp. 528–534.
[7] M. C. P. de Souto, S. C. M. Silva, V. G. Bittencourt, and D. S. A. de Araujo, “Cluster ensemble for gene expression microarray data,” in Proc. of the International Joint Conference on Neural Networks (IJCNN). IEEE Press, 2005, pp. 487–492.
[8] L. I. Kuncheva, Combining Pattern Classifiers. Wiley, 2004.
[9] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Prentice Hall, 1988.
[10] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed. USA: Morgan Kaufmann Publishers, 2004.
[11] A. P. Topchy, A. K. Jain, and W. F. Punch, “Combining multiple weak clusterings,” in Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), 2003, pp. 331–338.
[12] X. Z. Fern and C. E. Brodley, “Cluster ensembles for high dimensional clustering: an empirical study,” 2004, http://web.engr.oregonstate.edu/~xfern/clustensem.pdf.
[13] A. L. N. Fred and A. K. Jain, “Combining multiple clusterings using evidence accumulation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 835–850, 2005.
[14] E. Dimitriadou, A. Weingessel, and K. Hornik, “A combination scheme for fuzzy clustering,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 16, no. 7, pp. 901–912, 2002.
[15] R. Dubes, “How many clusters are best? An experiment,” Pattern Recognition, vol. 20, no. 6, pp. 645–663, 1987.
[16] G. W. Milligan and M. C. Cooper, “A study of the comparability of external criteria for hierarchical cluster analysis,” Multivariate Behavioral Research, vol. 21, pp. 441–458, 1986.
[17] S. T. Hadjitodorov, L. I. Kuncheva, and L. P. Todorova, “Moderate diversity for better cluster ensembles,” Information Fusion, 2005, to be published.
[18] I. G. Costa, F. A. T. de Carvalho, and M. C. P. de Souto, “Comparative study on proximity indices for cluster analysis of gene expression time series,” Journal of Intelligent and Fuzzy Systems, pp. 133–142, 2003.
[19] P. Rousseeuw, “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,” Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[20] T. G. Dietterich, “Approximate statistical tests for comparing supervised classification learning algorithms,” Neural Computation, vol. 10, no. 7, pp. 1895–1923, 1998.