
Cluster Ensemble for Gene Expression Microarray

Data: Accuracy and Diversity

Marcilio C. P. de Souto, Daniel S. A. de Araujo, Bruno L. C. da Silva

Department of Informatics and

Applied Mathematics

Federal University of Rio Grande do Norte

Natal, RN, Brazil, 59072-970

E-mail: [email protected], [email protected], [email protected]

Abstract— The classification of different types of cancer has historically depended on efforts by biologists who tried to establish, based on assumptions, the subtypes of a given tumor. However, up to now, there is no well-grounded methodology to aid in such a task. One step towards this has arisen with the idea of analyzing the gene expression of tumors and basing the decision on such an analysis. In this context, we analyze the potential of applying cluster ensemble techniques to gene expression microarray data. Our experimental results show that there is often a significant improvement in the results obtained with the use of ensembles when compared to those based on the clustering techniques used individually.

I. INTRODUCTION

The aim of this paper is to investigate the potential

of applying cluster analysis techniques to gene expression

microarray data. As pointed out by [1], this type of un-

supervised analysis is of increasing interest in the field

of functional genomics and gene expression data analysis.

One of the reasons for this is the need for molecular-

based refinement of broadly defined biological classes, with

implications in cancer diagnosis, prognosis and treatment [1],

[2], [3].

More specifically, we develop experiments with the cluster

ensemble methods described in [4], [5], [6]. In a previous

work [7], we presented preliminary results with the technique

in [5]. The goal of the clustering combining methods (ensem-

bles or committees) is to find a consensus among different

results (partitions) generated with one or more clustering

algorithms. The resulting consensus partition should be more

robust than the original ones [8].

In order to develop our experiments, the partitions pro-

duced by three clustering algorithms, representative of dif-

ferent clustering paradigms, are selected as input for building

the ensembles: the k-means, the Expectation-Maximization

(EM) algorithm, and the hierarchical method with average

linkage [9], [10]. These algorithms have been widely used

in the gene expression literature [1], [2], [3].

Furthermore, three gene expression datasets are used in

this work [1]: St. Jude Leukemia, Novartis multi-tissue,

Gaussian3. These datasets allow us, among other things, to analyze the ability of the clustering methods to find refinements of broadly defined biological classes.

The remainder of this paper is divided into six sections.

Section II presents the main concepts on the cluster ensem-

bles methods. In Section III, we introduce the evaluation

methodology for our experiments. The description of the

datasets used are shown in Section IV. Our experimental

setup is presented in Section V. In Section VI, we show the

results obtained with the individual clustering methods, as well as for the homogeneous and the heterogeneous ensembles

(in terms of accuracy and diversity). Finally, Section VII

summarizes the main results obtained and indicates some

possible future work.

II. CLUSTER ENSEMBLE

A cluster ensemble consists of two parts [11], [12], [8]: a

base partition constructor and a consensus function. Given a

data set, a base partition constructor generates a collection

of clustering (base partition) solutions. A consensus function

then combines the base partitions and produces a single

partition as the final output of the ensemble system. Below

we formally describe these two parts.

• [11], [12] Base partition generation strategy. Given a set of n objects (instances) X = {x1, x2, ..., xn}, a base partition constructor generates a collection of partitions, denoted by Π = {π1, π2, ..., πr}, where r is the number of base partitions. Each πi is a partition of X into Ki disjoint groups (clusters) of objects, denoted by πi = {c^i_1, c^i_2, ..., c^i_Ki}, where ∪_{j=1}^{Ki} c^i_j = X.

• [12] Integration strategy (consensus function). Given a set Π of base partitions and a number K of groups to be generated, a consensus function Γ uses the information provided by Π to partition X into K disjoint groups, yielding πf as the final solution.

With respect to the first part, several approaches compara-

ble to those used in supervised learning have been proposed

to introduce artificial instabilities in clustering algorithms [8].

For example, in this paper, we analyze ensembles based on

combining several runs of the same clustering algorithm with

different initial conditions. Another approach studied is to

generate ensembles from the combination of the results of

different clustering methods.
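
For concreteness, the two generation strategies can be sketched as follows. This is an illustrative sketch only, assuming scikit-learn and parameter values of our own choosing (number of runs, random seeds), not the exact settings of the experiments reported here.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

def homogeneous_base_partitions(X, n_clusters, n_runs=10, seed=0):
    """Several runs of the same algorithm (here k-means) with different random initializations."""
    rng = np.random.RandomState(seed)
    return [KMeans(n_clusters=n_clusters, n_init=1,
                   random_state=rng.randint(10**6)).fit_predict(X)
            for _ in range(n_runs)]

def heterogeneous_base_partitions(X, n_clusters, seed=0):
    """One partition from each of three different clustering paradigms."""
    km = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(X)
    em = GaussianMixture(n_components=n_clusters, random_state=seed).fit(X).predict(X)
    hier = AgglomerativeClustering(n_clusters=n_clusters,
                                   linkage="average").fit_predict(X)
    return [km, em, hier]
```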

The combination of multi-partitions, the second part, is

often accomplished via the definition of a consensus function.


For example, in this paper, we develop experiments

with three successful cluster ensemble (consensus functions)

methods:

• Co-association matrix [4], [13].

• Re-labeling and voting [6].

• Methods based on graph partitioning [5].

In the combination strategy based on co-association ma-

trix, the similarity between two objects is estimated by count-

ing the number of shared clusters in all the base partitions.

The underlying assumption is that objects belonging to a

natural cluster are very likely to be placed in the same group

in different partitions. This similarity expresses the force of

co-association of each pair of objects [4], [13].

In fact, this matrix can be seen as a similarity matrix.

Therefore, it can be used as input to any clustering algorithm

that operates directly with a similarity matrix. For example,

in order to find the consensus partition, [4] apply the hierar-

chical method with single linkage to this matrix - the partition is generated by cutting the dendrogram at a given height. In this work, as in [4], we use the hierarchical method with single linkage to find the consensus partition. Hereafter, for short, we will refer to this method as Co-association.
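
A minimal sketch of this Co-association consensus, assuming numpy/scipy and 0-indexed label vectors; unlike [4], who cut the dendrogram at a given height, the sketch simply cuts it into K clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def co_association_consensus(base_partitions, n_clusters):
    """Combine base partitions via a co-association matrix and single linkage."""
    labels = np.asarray(base_partitions)              # shape (r, n): r partitions of n objects
    r, n = labels.shape
    # Fraction of base partitions in which each pair of objects shares a cluster.
    co = np.zeros((n, n))
    for part in labels:
        co += (part[:, None] == part[None, :]).astype(float)
    co /= r
    # Single linkage needs dissimilarities: turn co-association into a distance.
    dist = 1.0 - co
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="single")
    return fcluster(Z, t=n_clusters, criterion="maxclust")  # labels 1..K
```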

The other consensus function analyzed is based on a

voting scheme [6]. In the voting scheme, we try to find a

correspondence between the cluster labels across the different

partitions. Once this correspondence is found, the clusters

with the same label are fused. Solving the correspondence

problem is important because the labels that we assign

to the clusters in the individual partitions are arbitrary.

Consequently, if two identical partitions have the labels of

their clusters permuted, they could be perceived as different

ones. For k clusters, there are k! label permutations and an

exhaustive search may not be possible for large k.

In order to avoid the need of an exhaustive search, [14]

proposed a heuristic approximation to consistent labeling.

Their algorithm combines r partitions on a sequential basis:

at each step, a locally optimal permutation is found and two

partitions are fused. After voting of r partitions, we obtain

for every data point xi and every cluster j a value pij . This

value represents the fraction of times that xi has been assigned to cluster j. To interpret the final result, we can either accept

this fuzzy decision or assign every data point xi to the cluster

with highest pij . In this paper, we use the latter.
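
The code of [6] was kindly provided by its authors (see the Acknowledgment). Purely to illustrate the re-labeling idea, the sketch below aligns each partition to a reference with the Hungarian algorithm and then assigns every object to the cluster with the highest pij. It is an approximation under our own assumptions (0-indexed labels, all partitions with the same number of clusters), not the voting-merging algorithm itself.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def relabel_to_reference(reference, partition, n_clusters):
    """Permute the labels of `partition` so that they best match `reference`."""
    contingency = np.zeros((n_clusters, n_clusters), dtype=int)
    for a, b in zip(reference, partition):
        contingency[a, b] += 1
    # Hungarian algorithm on negated counts gives the label matching with maximum overlap.
    rows, cols = linear_sum_assignment(-contingency)
    mapping = {c: r for r, c in zip(rows, cols)}
    return np.array([mapping[b] for b in partition])

def voting_consensus(base_partitions, n_clusters):
    """Align all partitions to the first one, then assign each object by majority vote."""
    parts = [np.asarray(p) for p in base_partitions]
    aligned = [parts[0]] + [relabel_to_reference(parts[0], p, n_clusters)
                            for p in parts[1:]]
    votes = np.stack(aligned)                         # shape (r, n)
    counts = np.zeros((n_clusters, votes.shape[1]), dtype=int)
    for j in range(n_clusters):
        counts[j] = (votes == j).sum(axis=0)          # counts[j, i] is proportional to p_ij
    return counts.argmax(axis=0)
```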

With respect to the ensembles methods in [5], they for-

mulate the consensus function as a graph partitioning task.

Given a weighted graph, a general goal of graph partitioning

is to find a K-way partition that minimizes the cut (sum

of the weights of the edges to be removed), subject to the

constraint that each part should contain roughly the same

number of vertices [12]. Each connected component after

the cut corresponds to a cluster in the consensus partition.

In practice, various graph partitioning algorithms define

different optimization criteria based on the above goal. In

this context, [5] introduce three different consensus func-

tions, each of which formulates and solves a different graph

partitioning problem given the set of base partitions. The

first approach, CSPA (Cluster-based Similarity Partitioning

Algorithm), builds a co-association matrix, which is used to

create a graph - objects are vertices and their similarities

are the weights of the edges. The second function, HGPA (HyperGraph-Partitioning Algorithm), models every cluster of the base partitions as a hyperedge of a hypergraph. Finally, the third consensus function, MCLA (Meta-CLustering Algorithm), models clusters as vertices in an r-partite graph, where r is the number of base partitions. Hereafter, for short, we will refer to these three consensus functions as Graph.
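
Strehl and Ghosh solve the resulting graph partitioning problems with dedicated (hyper)graph partitioners such as METIS. The sketch below is only a rough CSPA-flavoured stand-in, not the implementation of [5]: it feeds the co-association similarities to a spectral clustering routine with a precomputed affinity, which approximates a balanced cut.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cspa_like_consensus(base_partitions, n_clusters, seed=0):
    """CSPA-style consensus: co-association similarities used as graph edge weights."""
    labels = np.asarray(base_partitions)
    r, _ = labels.shape
    # Pairwise similarity = fraction of base partitions in which two objects co-cluster.
    similarity = sum((p[:, None] == p[None, :]).astype(float) for p in labels) / r
    model = SpectralClustering(n_clusters=n_clusters,
                               affinity="precomputed", random_state=seed)
    return model.fit_predict(similarity)
```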

III. EVALUATION METHODOLOGY

The evaluation we use is aimed mainly at assessing how

good the clustering methods investigated are at recovering

known clusters from gene expression microarray data. In

order to do so, we consider three datasets for which multi-

class distinction is available. As in other works, each of these datasets constitutes the gold standard against which we evaluate the clustering results [1], [9]. Following the convention in [1], we refer to the groups of the gold standard partition as classes, while we reserve the word clusters for those of the partition returned by the clustering algorithm.

A. Accuracy and Diversity

In the context of clustering algorithms, there is no

definitive measure of accuracy [9]. Nevertheless, in cases in

which class labels are available, external evaluation criteria

can provide an effective means of assessing the quality of a

partition [9], [8].

For example, the cluster composition can be evaluated by

measuring the degree of agreement between two partitions

(U and V), where partition U is the result of a clustering

method and partition V (the gold standard) is formed by an a

priori information independent of partition U, such as a class

label. There are a number of external indices defined in the

literature, such as Hubert, Jaccard, Rand and corrected Rand

(or adjusted Rand) [9] that can be used for this measurement.

One characteristic of most of these indices is that they can

be sensitive to the number of classes in the partitions or to

the distributions of elements in the clusters. For example,

some indices have a tendency to present higher values for

partitions with more classes (Hubert and Rand), others for

partitions with a smaller number of classes (Jaccard) [15].

The corrected Rand index, which has its values corrected

for chance agreement, does not have any of these undesirable

characteristics [16]. Thus, the corrected Rand index - cR,

for short - is the external index used in the evaluation

methodology used in this work. The corrected Rand index

can take values from -1 to 1, with 1 indicating a perfect

agreement between the partitions, and the values near 0

or negatives corresponding to cluster agreement found by

chance.

Formally, let U = {u1, . . . , ur, . . . , uR} be the par-

tition given by the clustering solution, and V = {v1, . . . , vc, . . . , vC} be the partition formed by an a priori

information independent of partition U (the gold standard).

The corrected Rand is defined as


$$
cR = \frac{\sum_{i=1}^{R}\sum_{j=1}^{C}\binom{n_{ij}}{2} \;-\; \binom{n}{2}^{-1}\sum_{i=1}^{R}\binom{n_{i\cdot}}{2}\sum_{j=1}^{C}\binom{n_{\cdot j}}{2}}
{\frac{1}{2}\left[\sum_{i=1}^{R}\binom{n_{i\cdot}}{2} + \sum_{j=1}^{C}\binom{n_{\cdot j}}{2}\right] \;-\; \binom{n}{2}^{-1}\sum_{i=1}^{R}\binom{n_{i\cdot}}{2}\sum_{j=1}^{C}\binom{n_{\cdot j}}{2}}
$$

where (1) nij represents the number of objects in both cluster ui and cluster vj; (2) ni· indicates the number of objects in cluster ui; (3) n·j indicates the number of objects in cluster vj; (4) n is the total number of objects; and (5) \binom{a}{b} is the binomial coefficient a!/(b!(a−b)!).
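
This index coincides with what is often called the adjusted Rand index (e.g., sklearn.metrics.adjusted_rand_score). As a sanity check, a direct transcription of the formula from the contingency table:

```python
import numpy as np
from scipy.special import comb

def corrected_rand(labels_u, labels_v):
    """Corrected (adjusted) Rand index between two partitions U and V."""
    _, u_idx = np.unique(labels_u, return_inverse=True)
    _, v_idx = np.unique(labels_v, return_inverse=True)
    contingency = np.zeros((u_idx.max() + 1, v_idx.max() + 1), dtype=int)
    for i, j in zip(u_idx, v_idx):
        contingency[i, j] += 1                           # n_ij
    n = contingency.sum()
    sum_ij = comb(contingency, 2).sum()                  # sum of C(n_ij, 2)
    sum_i = comb(contingency.sum(axis=1), 2).sum()       # sum of C(n_i., 2)
    sum_j = comb(contingency.sum(axis=0), 2).sum()       # sum of C(n_.j, 2)
    expected = sum_i * sum_j / comb(n, 2)
    max_index = 0.5 * (sum_i + sum_j)
    return (sum_ij - expected) / (max_index - expected)
```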

Besides assessing the accuracy of the ensembles by means

of the corrected Rand index, we also investigate the rela-

tionship between accuracy and diversity. Diversity within an

ensemble is of vital importance for its success. An ensemble

formed from identical base partitions will not outperform its

individual members [17].

In order to analyze this aspect, we will use a diversity

measure proposed in [17], which is based on the corrected

Rand index (cR). This measure is a pairwise one in that the

diversity of an ensemble will be proportional to the diversity

between its individual components (Equation 1). Formally,

suppose that our ensemble partition πf was generated from

Π = {π1, π2, ..., πr}, its diversity can be calculated using

the following pairwise measure.

$$
D = \frac{2}{r(r-1)} \sum_{i=1}^{r-1} \sum_{j=i+1}^{r} \bigl(1 - cR(\pi_i, \pi_j)\bigr) \qquad (1)
$$
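
Equation 1 translates directly into a few lines; here we use scikit-learn's adjusted_rand_score as the corrected Rand cR:

```python
from itertools import combinations
from sklearn.metrics import adjusted_rand_score  # corrected Rand (cR)

def ensemble_diversity(base_partitions):
    """Pairwise diversity D: mean of (1 - cR) over all pairs of base partitions."""
    # Dividing by the number of pairs, r(r-1)/2, is the same as the 2/(r(r-1)) factor.
    pairs = list(combinations(base_partitions, 2))
    return sum(1.0 - adjusted_rand_score(p, q) for p, q in pairs) / len(pairs)
```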

B. Cross-validation

In order to compare the performance of the clustering

algorithms, we calculate the mean of the cR via an unsupervised k-fold cross-validation procedure [18]. In this procedure, the data set is divided into k folds. At each iteration of the procedure, one

fold is used as the test set, and the remaining folds as the

training set. The training set is presented to a clustering

method, giving a partition as result (training partition). Then,

the nearest centroid technique is used to build a classifier

from the training partition. The centroid technique calculates

the proximity between the elements in the test set and

the centroids of each cluster in the training partition (the

proximity must be measured with the same proximity index

used by the clustering method evaluated).

A new partition (test partition) is then obtained by as-

signing each object in the test set to the cluster with nearest

centroid. Next, the test partition is compared with the a priori

partition (gold standard) by using an external index (this a

priori partition contains only the objects of the test partition).

At the end of the procedure, a sample with size k of the

values for the external index is available.

The general idea of the k-fold cross-validation procedure

is to observe how well data from an independent set is

clustered, given the training results. If the results of a training

set have a low agreement with the a priori classification,

so should the results of the respective test set. In

conclusion, the objective of the procedure is to obtain k

observations of the accuracy of the unsupervised methods

with respect to the gold standard, all this with the use of

independent test folds.
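
A minimal sketch of this unsupervised k-fold cross-validation, assuming scikit-learn-style clusterers with fit_predict, Euclidean distance for the nearest-centroid step, and the corrected Rand as external index; the names and defaults are ours:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import adjusted_rand_score  # corrected Rand

def unsupervised_kfold_cv(X, gold_labels, clusterer, k=2, seed=0):
    """Cluster each training fold, label the test fold by nearest centroid,
    and compare the resulting test partition with the gold standard."""
    scores = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True,
                                     random_state=seed).split(X):
        train_labels = clusterer.fit_predict(X[train_idx])
        # Centroid of each training cluster (same Euclidean metric as the clusterer).
        centroids = np.stack([X[train_idx][train_labels == c].mean(axis=0)
                              for c in np.unique(train_labels)])
        dists = np.linalg.norm(X[test_idx][:, None, :] - centroids[None, :, :], axis=2)
        test_labels = dists.argmin(axis=1)
        scores.append(adjusted_rand_score(gold_labels[test_idx], test_labels))
    return np.mean(scores), np.std(scores)
```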

IV. DATASETS

The three gene expression datasets used in this work are

the St. Jude leukemia, Novartis multi-tissue, and Gaussian3

- exactly as presented in [1] - Table I. St. Jude Leukemia

and Novartis multi-tissue are real datasets. More specif-

ically, the former is composed of instances representing

diagnosed samples of bone marrow from pediatric acute

leukemia patients, corresponding to six prognostically impor-

tant leukemia subtypes - 43 T-lineage ALL; 27 E2A-PBX1;

15 BCR-ABL; 79 TEL-AML1 and 20 MLL rearrangements;

and 64 “hyperdiploid > 50” chromosomes. The instances of

the latter represent tissue samples from four distinct cancer

types - 16 breast, 26 prostate, 28 lung, and 23 colon samples.

TABLE I
DESCRIPTION OF THE DATASETS

Dataset                 # Classes   # Instances   # Attributes
Gaussian3                   3            60            600
Novartis multi-tissue       4           103           1000
St. Jude leukemia           6           248           1000

In contrast to St. Jude Leukemia and Novartis multi-

tissue datasets, Gaussian3 is an artificial dataset. Gaussian3

represents the union of three Gaussian distributions (three

classes) in a 600-dimensional space. Such data simulates

the process of gene co-regulation. Each class is uniquely

characterized by a subset formed by 200 attributes (genes),

which are set to represent up-regulated values in this class

and down-regulated values for the other two.
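
The exact generation parameters of Gaussian3 are those of [1]; the sketch below only illustrates the structure described above (three classes, 600 attributes, one block of 200 up-regulated attributes per class), with made-up means and variances:

```python
import numpy as np

def make_gaussian3_like(n_per_class=20, n_attrs=600, up=2.0, down=0.0,
                        noise=1.0, seed=0):
    """Three Gaussian classes, each with its own block of 200 up-regulated attributes."""
    rng = np.random.RandomState(seed)
    X_parts, y = [], []
    for c in range(3):
        mean = np.full(n_attrs, down)
        mean[c * 200:(c + 1) * 200] = up          # class-specific up-regulated block
        X_parts.append(rng.normal(mean, noise, size=(n_per_class, n_attrs)))
        y += [c] * n_per_class
    return np.vstack(X_parts), np.array(y)
```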

V. EXPERIMENTS

The experiments were accomplished by presenting the

three datasets (Gaussian3, Novartis multi-tissue, and St. Jude Leukemia) to the individual clustering methods: k-means, the EM algorithm, and the hierarchical method with average linkage (all of them implemented with the Euclidean distance).

Initially, five replications of the 2-fold cross-validation of each dataset were performed, such that 10 samples of each dataset were formed.

In terms of parameter settings, the number of clusters for the k-means and the EM algorithm was varied in two ways.

First, we conducted experiments in which such a number was

set to the number of classes in the dataset being considered -

e.g., c=3 for the Gaussian3 dataset. Then, we did another

run of experiments in which the number of clusters was set

to the double of the number of classes - e.g., c=6 for the

Gaussian3 dataset.

Furthermore, as these algorithms are dependent on the

choice of the initial conditions (e.g., initial centers and

means), for them we repeated each run 10 times, each one

with a distinct random initialization. For example, for each


one of the 10 cross-validation samples of each dataset, the

experiment with k-means (or the EM algorithm) yielded 10

partitions - Table II (cR stands for the mean of the corrected

Rand for the independent test sets).

Since the hierarchical method with average linkage is

deterministic, only 10 runs of this algorithm were executed,

that is, one run for each cross-validation sample of the

dataset. Also, as the external index used in this work is

suitable only for partition comparison, in order to build the

partition from the hierarchical method, the trees were cut from the root toward the leaves, and the first c sub-trees were taken

as the clusters (with c equal to both the exact and the double

of the number of classes in the dataset) - Table II.

The ensembles were built according to the type of base

partitions used as input (with c or 2*c clusters per partition) -

Tables III, IV, V. The first type - Ensemble-(c,c) - received as

input base partitions whose number of clusters was set to the

exact number c of classes in the datasets. The final partition

was also set to have K=c clusters. In contrast, the number of

clusters for the base partitions for the second configuration

- Ensemble-(2*c,c) - was set to 2*c. Here, like in the first

case, the number of clusters in the final partition was also

set to K=c.

In the context above, the ensembles were formed as

follows. First, for the k-means (EM algorithm) we created 10 ensembles, where each ensemble was formed by combining

the 10 runs of the k-means (EM algorithm) for a given

cross-validation partition (10 samples in total - 2x5 cross-

validation). That is, we formed homogeneous ensembles in

that the partitions used to form the consensus come from the

same type of algorithm. As the voting scheme in [6] imposes

that the number of clusters in the final partition and in the

base partitions be the same, such a method was used only

with the first ensemble configuration - Ensemble(c,c). Thus,

in Tables III, IV, V, the term “n/a” stands for not applicable.

We also combined partitions produced by the different

clustering algorithms (heterogeneous ensembles). But before

doing so, for the k-means experiment (EM algorithm ex-

periment), from the 10 partitions generated by the repeated

random restarts, we chose only one for further analysis: the one with the largest Silhouette index [19]. In a Silhouette

calculation, the distance from each data point in a cluster

to all other data points within the same cluster (tightness)

and to all data points in the closest cluster (separation)

are determined. Based on this value, one can calculate the

average Silhouette for the partition being considered. A good

partition of the dataset is indicated by a large value of the Silhouette index.
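
Selecting the partition with the largest average Silhouette can be sketched with scikit-learn's silhouette_score (assuming Euclidean distance and at least two clusters per partition):

```python
from sklearn.metrics import silhouette_score

def best_partition_by_silhouette(X, partitions):
    """Return the label vector with the largest average Silhouette width."""
    return max(partitions, key=lambda labels: silhouette_score(X, labels))
```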

In this context, the heterogeneous ensembles were formed

according to the cross-validation samples for each dataset,

that is, we built 10 different ensembles. More specifically,

the inputs for building a given ensemble (e.g., for a given

cross-validation sample for a certain dataset) were the mul-

tiple clustering label vectors, where each vector represented

the resulting partition of a given individual technique - in

Tables III, IV, V, the term “Heter.” stands for heterogeneous ensemble.

Finally, for all the experiments, the mean values of the

corrected Rand index (cR) for the test folds were measured.

Next, the means of cR obtained with the individual methods were compared, two by two, to those obtained with the

ensembles. This was accomplished by means of a paired t-

test, as described in [20].
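
For illustration, such a comparison over the fold-wise cR values can be sketched with a standard paired t-test (scipy's ttest_rel); [20] discusses the caveats of such tests when folds are resampled. The function name and significance level below are ours:

```python
from scipy.stats import ttest_rel

def compare_methods(cr_individual, cr_ensemble, alpha=0.05):
    """Paired t-test on fold-wise corrected Rand values of two methods."""
    statistic, p_value = ttest_rel(cr_ensemble, cr_individual)
    return p_value < alpha, statistic, p_value
```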

VI. RESULTS

A. Homogeneous ensembles: accuracy

According to Table III, for the Gaussian3 dataset, the Co-

association consensus function obtained for both k-means

and EM (homogeneous ensembles) an average accuracy

significantly higher than the respective individual methods.

Still in this context, there is no statistical evidence of a difference between the performance of the Co-association

ensemble with the configurations (c,c) and (2 ∗ c,c). For the

Voting consensus function, besides the significant gain in

accuracy, we can observe a great decrease in the standard

deviation. For example, the individual EM algorithm for c=3

(Table II) presented a cR mean of 0.58 with a standard

deviation of 0.26, whereas the Voting ensemble formed by

these EM partitions (Table III) showed a cR mean of 0.99

with a standard deviation of 0.03. There was even a greater

improvement of the results when the Graph consensus func-

tion was applied. Such ensembles presented a cR mean of

1.00, that is, a perfect agreement with the a priori partition.

For the Novartis multi-tissue dataset, according to Ta-

ble IV, the use of the Co-association consensus function

does not lead to a gain in performance when compared to

the individual methods. In fact, in the case of the (2 ∗ c,c)

configuration, the null hypothesis was rejected in favor of

the k-means and the EM. In contrast, for the case of the

Voting consensus function, like the results obtained for the

Gaussian3 dataset, there was a significant increase in terms

of accuracy (cR mean) as well as robustness (standard

deviation). For instance, the k-means for c=4 (Table II)

presented a cR mean of 0.75 with a standard deviation of

0.20, whereas the Voting ensemble formed by these k-means

partitions (Table IV) showed a cR mean of 0.94 with a

standard deviation of 0.01. The Graph consensus function

presented a behavior similar to the Voting one.

In general, the St. Jude leukemia dataset (Table V) was

the hardest one to be clustered by the ensemble methods.

For example, most of the homogeneous ensembles formed

presented a performance comparable or inferior to the indi-

vidual clustering methods.

B. Heterogeneous ensembles: accuracy

For the Gaussian3 dataset (Table III), the results obtained with the Co-association consensus function were, in

general, inferior to those of the individual methods. In

contrast, the Voting and the Graph consensus functions, like

in the case of homogeneous ensembles, obtained a cR mean

of 1.0, that is, a perfect agreement with the a priori partition.

Table IV illustrates the performance of ensembles methods

for the Novartis multi-tissue dataset. Here, as for the other


datasets, the results obtained with the Co-association con-

sensus function were similar (configuration (c,c)) or worse

(configuration (2 ∗ c,c)) than those obtained with the respective

individual methods. On the other hand, in the case of the

Voting consensus function, the same behavior pattern of

the homogenous ensembles occurred - increase in terms

of accuracy (cR mean) as well as robustness (standard

deviation). For instance, the hierarchical method with average

linkage for c=4 (Table II) presented a cR mean of 0.78 with

a standard deviation of 0.14, whereas the Voting ensemble

formed by the three different methods (Table IV) showed

a cR mean of 0.96 with a standard deviation of 0.03. The

Graph consensus function also presented a performance sig-

nificantly superior to the individual methods. In this context,

there was no statistical evidence of a difference in performance between the (c,c) and (2 ∗ c,c) configurations.

In terms of the St. Jude leukemia dataset, for the two

ensemble configurations - (c,c) and (2 ∗ c,c) - the results

obtained with the Co-association consensus function were

inferior to those achieved by the respective individual meth-

ods. The null hypotheses were rejected in favor of the

individual methods. The heterogeneous ensembles built with

the Voting consensus function (Table V) presented a signifi-

cantly higher accuracy (0.93) than the individual methods (k-

means=0.80, EM=0.83, and hierarchical=0.50). The Graph

consensus function presented a behavior similar to that of the

Co-association one, that is, no improvement when compared

to the individual methods.

C. Ensemble diversity

Table VI illustrates the mean of the Diversity measure (D)

- Equation 1, for each set of base partitions used as input

to build the ensembles. According to this table, although

combining clusters from multiple partitions is useful only

if there is disagreement between the partitions, diversity in

itself is not a sufficient condition to determine the future

behavior of the ensemble formed.

For instance, with the exception of the St. Jude leukemia data

for c=6, the diversities of the base partitions used as input

for the heterogeneous ensembles were much lower than the

ones for the homogeneous ensembles. On the other hand,

the accuracies of the heterogeneous ensembles were, with the

exception of the Co-association consensus function, superior

or equivalent to those of the homogeneous ensembles. Thus,

not only the diversity, but the choice of the type of consensus

function has a great impact on the success of the ensemble.

The behavior described in the previous paragraph can be

observed, for example, in the case of the Gaussian3 dataset.

The diversity mean for c=3 for the k-means ensembles was

0.80 (Table VI), whereas the accuracy means for the Co-association, the Voting, and the Graph consensus functions were, respectively, 0.68, 0.92, and 1.0. That is, despite

the high diversity of the base partitions, the Co-association

consensus function was not very successful in recovering the

underlying structure in the dataset.

This kind of behavior was also observed in the context

of lower diversity. For instance, in the case of c=3 for the

Gaussian3 dataset, the diversity mean for the base partitions for the heterogeneous ensembles was only 0.21 (Table VI).

As one can see at Table II, the means (and standard devia-

tions) of the corrected Rand indexes for the k-means and the

EM were, respectively, 0.56± 0.30 and 0.58± 0.26. That is,

if the Silhouette index (see Section V) was able to identify

the best partitions to form the ensembles, they would have

corrected Rand indexes around, respectively, 0.86 and 0.84.

Thus, for presenting a high value for the corrected Rand

indexes, these base partitions, though presenting a lower

diversity (see Equation 1), would have high quality in terms

of accuracy. In this case, most of the diversity would come

from the partitions built with the hierarchical method - they

presented a cR mean of only 0.50 with very small standard

deviation (0.03).

In the context above, whereas the Voting and the Graph

consensus functions were able to take the best of the base

partitions in order to improve the mean of the accuracy to

1.0± 0.0 (perfect agreement with the a priori partition); the

Co-association consensus function led the accuracy to only

0.48± 0.29.

VII. CONCLUSIONS

In this paper, we discussed cluster ensemble methods and

conducted a series of experiments with real and synthetic

gene expression datasets. More specifically, we analyzed,

from an experimental point of view, some strategies for

generating (homogeneous and heterogeneous) and integrating

(Co-association, Voting, and Graph consensus functions) the

ensembles.

The experimental analysis indicates that the ensemble

techniques used often offered considerable potential to improve the accuracy when compared to that achieved

by the individual clustering methods (k-means, EM, and

hierarchical method with average linkage). In general, there

was no significant difference between the cluster ensemble methods studied, with the exception of the Co-association consensus function with the hierarchical method with single linkage as meta-clustering algorithm, which often presented a poorer performance.

This can be clearly observed for the St. Jude leukemia

dataset (Table V). One of the reasons for this poor perfor-

mance could be the fact that clusters in such a dataset are not

well-separated. This kind of situation is not very well handled by the single linkage algorithm [9], [13]. An alternative for

this problem would be the use of other clustering methods,

such as the average linkage, as meta-clustering [13].

Our results have also shown, at least for the methods and

the datasets used, that there was no significant performance

gain by combining partitions produced by different clustering

algorithms (heterogeneous ensembles), compared to the strat-

egy of combining several runs of each clustering algorithm

(homogeneous ensembles). However, this issue should be

further investigated.

Furthermore, the experiments conducted showed that even

when the consensus functions were set to (1) get as input

base partitions whose number of clusters was the double of


the number of known classes, and (2) to produce as output a final partition with the true number of classes, the final partitions generated were as accurate as the original base partitions.

Finally, our experimental results corroborate the claim that, although the combination of clusters from multiple partitions is useful only if there is disagreement between the partitions, diversity alone is not a sufficient condition to determine the future behavior of the ensemble formed. In fact, not only the

diversity, but the choice of the type of consensus function

has a great impact on the success of the ensemble.

TABLE II
INDIVIDUAL METHODS

Dataset     Algorithm   cR for c      cR for 2 ∗ c
Gaussian3   k-means     0.56 ± 0.30   0.77 ± 0.24
            EM          0.58 ± 0.26   0.76 ± 0.24
            Hier.       0.50 ± 0.03   0.93 ± 0.03
Novartis    k-means     0.75 ± 0.20   0.80 ± 0.08
            EM          0.75 ± 0.20   0.79 ± 0.09
            Hier.       0.78 ± 0.14   0.80 ± 0.08
St. Jude    k-means     0.80 ± 0.12   0.70 ± 0.09
            EM          0.83 ± 0.09   0.70 ± 0.09
            Hier.       0.50 ± 0.03   0.92 ± 0.03

TABLE III
ENSEMBLE METHODS - GAUSSIAN3

Ensemble         Base      cR for c=3    cR for c=6
Co-association   k-means   0.68 ± 0.23   0.82 ± 0.37
                 EM        0.73 ± 0.23   0.63 ± 0.30
                 Heter.    0.48 ± 0.29   0.17 ± 0.27
Voting           k-means   0.92 ± 0.18   n/a
                 EM        0.99 ± 0.03   n/a
                 Heter.    1.00 ± 0.00   n/a
Graph            k-means   1.00 ± 0.00   1.00 ± 0.0
                 EM        1.00 ± 0.00   1.00 ± 0.0
                 Heter.    1.00 ± 0.00   1.00 ± 0.0

TABLE IV
ENSEMBLE METHODS - NOVARTIS MULTI-TISSUE

Ensemble         Base      cR for c=4    cR for c=8
Co-association   k-means   0.71 ± 0.22   0.60 ± 0.24
                 EM        0.84 ± 0.16   0.86 ± 0.08
                 Heter.    0.69 ± 0.21   0.50 ± 0.30
Voting           k-means   0.94 ± 0.01   n/a
                 EM        0.95 ± 0.03   n/a
                 Heter.    0.96 ± 0.03   n/a
Graph            k-means   0.96 ± 0.02   0.96 ± 0.02
                 EM        0.96 ± 0.02   0.96 ± 0.02
                 Heter.    0.96 ± 0.02   0.95 ± 0.03

ACKNOWLEDGMENT

We would like to thank Evgenia Dimitriadou for providing

her code for the voting-merging clustering algorithm. The

authors would also like to thank CNPq for the financial support via grant 470319/2004-6.

TABLE V
ENSEMBLE METHODS - ST. JUDE LEUKEMIA

Ensemble         Base      cR for c=6    cR for c=12
Co-association   k-means   0.25 ± 0.13   0.22 ± 0.15
                 EM        0.18 ± 0.18   0.28 ± 0.16
                 Heter.    0.35 ± 0.07   0.16 ± 0.12
Voting           k-means   0.77 ± 0.13   n/a
                 EM        0.75 ± 0.12   n/a
                 Heter.    0.93 ± 0.04   n/a
Graph            k-means   0.79 ± 0.07   0.46 ± 0.02
                 EM        0.87 ± 0.07   0.45 ± 0.03
                 Heter.    0.74 ± 0.10   0.82 ± 0.08

TABLE VI
MEAN OF THE DIVERSITY INDEX

Dataset     Algorithm   D for c       D for 2 ∗ c
Gaussian3   k-means     0.80 ± 0.09   0.75 ± 0.06
            EM          0.79 ± 0.08   0.71 ± 0.04
            Heter.      0.21 ± 0.16   0.52 ± 0.12
Novartis    k-means     0.40 ± 0.11   0.41 ± 0.04
            EM          0.40 ± 0.05   0.43 ± 0.04
            Heter.      0.16 ± 0.13   0.24 ± 0.06
St. Jude    k-means     0.31 ± 0.05   0.46 ± 0.02
            EM          0.25 ± 0.04   0.45 ± 0.03
            Heter.      0.36 ± 0.02   0.32 ± 0.06

REFERENCES

[1] S. Monti, P. Tamayo, J. Mesirov, and T. Golub, "Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data," Machine Learning, vol. 52, pp. 91–118, 2003.
[2] J. Quackenbush, "Computational analysis of cDNA microarray data," Nature Reviews, vol. 6, no. 2, pp. 418–428, 2001.
[3] D. Slonim, "From patterns to pathways: gene expression data analysis comes of age," Nature Genetics, vol. 32, pp. 502–508, 2002.
[4] A. L. N. Fred and A. K. Jain, "Data clustering using evidence accumulation," in 16th International Conference on Pattern Recognition, 2002, pp. 276–280.
[5] A. Strehl and J. Ghosh, "Cluster ensembles – a knowledge reuse framework for combining multiple partitions," Journal of Machine Learning Research (JMLR), vol. 3, pp. 583–617, 2002.
[6] E. Dimitriadou, A. Weingessel, and K. Hornik, "A cluster ensembles framework," in Third International Conference on Hybrid Intelligent Systems (HIS), 2003, pp. 528–534.
[7] M. C. P. de Souto, S. C. M. Silva, V. G. Bittencourt, and D. S. A. de Araujo, "Cluster ensemble for gene expression microarray data," in Proc. of the International Joint Conference on Neural Networks (IJCNN). IEEE Press, 2005, pp. 487–492.
[8] L. I. Kuncheva, Combining Pattern Classifiers. Wiley, 2004.
[9] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Prentice Hall, 1988.
[10] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed. USA: Morgan Kaufmann Publishers, 2004.
[11] A. P. Topchy, A. K. Jain, and W. F. Punch, "Combining multiple weak clusterings," in Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), 2003, pp. 331–338.
[12] X. Z. Fern and C. E. Brodley, "Cluster ensembles for high dimensional clustering: an empirical study," 2004, http://web.engr.oregonstate.edu/ xfern/clustensem.pdf.
[13] A. L. N. Fred and A. K. Jain, "Combining multiple clusterings using evidence accumulation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 835–850, 2005.
[14] E. Dimitriadou, A. Weingessel, and K. Hornik, "A combination scheme for fuzzy clustering," International Journal of Pattern Recognition and Artificial Intelligence, vol. 16, no. 7, pp. 901–912, 2002.


[15] R. Dubes, "How many clusters are best? An experiment," Pattern Recognition, vol. 20, no. 6, pp. 645–663, 1987.
[16] G. W. Milligan and M. C. Cooper, "A study of the comparability of external criteria for hierarchical cluster analysis," Multivariate Behavioral Research, vol. 21, pp. 441–458, 1986.
[17] S. T. Hadjitodorov, L. I. Kuncheva, and L. P. Todorova, "Moderate diversity for better cluster ensembles," Information Fusion, 2005, to be published.
[18] I. G. Costa, F. A. T. de Carvalho, and M. C. P. de Souto, "Comparative study on proximity indices for cluster analysis of gene expression time series," Journal of Intelligent and Fuzzy Systems, pp. 133–142, 2003.
[19] P. Rousseeuw, "Silhouettes: a graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[20] T. G. Dietterich, "Approximate statistical tests for comparing supervised classification learning algorithms," Neural Computation, vol. 10, no. 7, pp. 1895–1923, 1998.
