Assessing clustering quality: Scoring clustering solutions by their biological relevance


Summary

- Motivation.
- Existing solutions and their problems.
- Background (Average Silhouette, Jaccard coefficient, Homogeneity and Separation, ANOVA, KW test).
- The Clustering Quality Score (CQS) method.
- Results (the method applied to real and simulated data).
- Conclusion.

Motivation

- Different clustering algorithms yield different clustering solutions on the same data.
- The same algorithm yields different results for different parameter settings.
- Few works have addressed the systematic comparison and evaluation of clustering results.
- There is no consensus on how to choose among the clustering algorithms.
- Different measures for the quality of a clustering solution are applicable in different situations; the choice depends on the data and on the availability of the true solution.
- When the true solution is known and we wish to compare it to another solution, we can use the Minkowski measure or the Jaccard coefficient.
- When the true solution is not known, there are some approaches (Homogeneity and Separation, Average Silhouette), but there is no agreed-upon way to evaluate the quality of a suggested solution.

Quality assessment: existing approaches

Homogeneity and Separation

- The method is based on intra-cluster homogeneity or inter-cluster separation.
- The problem is that the homogeneity and separation criteria are inherently conflicting: an improvement in one usually corresponds to a worsening of the other.
- Two ways to address this conflict:
  - Fix the number of clusters and seek a solution with maximum homogeneity (the K-means algorithm).
  - Present a curve of homogeneity versus separation (such a curve can show that one algorithm dominates another if it provides better homogeneity for all separation values, but typically different algorithms dominate in different value ranges).

Quality assessment: existing approaches (cont.)

- Clustering solutions can be assessed by applying standard statistical techniques: multivariate analysis of variance (MANOVA) for high-dimensional data and discriminant analysis for normally distributed data.
- For non-normal data, there are several extensions that require the data to be either low-dimensional or continuous.
- None of these methods is applicable when we wish to test the significance of a clustering solution based on high-dimensional vectors of dependent biological attributes that do not necessarily follow a normal distribution and may even be discrete.

Background: definitions

- N: a set of n elements; C = {C_1, ..., C_l}: a partition of these elements into l clusters.
- Mates: two elements from the same cluster (with respect to C).
- The homogeneity of C is the average distance (similarity) between mates.
- The separation of C is the average distance (dissimilarity) between non-mates.
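These two quantities are straightforward to compute from a pairwise distance matrix. A minimal sketch in Python, assuming Euclidean distance as the (dis)similarity measure (the slides leave the actual metric open):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def homogeneity_separation(X, labels):
    """H: average distance between mates; S: average distance between non-mates.

    X: (n, p) data matrix; labels: length-n array of cluster assignments.
    """
    labels = np.asarray(labels)
    D = squareform(pdist(X))                     # n x n pairwise distance matrix
    mates = labels[:, None] == labels[None, :]   # True where i and j share a cluster
    off_diag = ~np.eye(len(labels), dtype=bool)  # exclude self-pairs
    H = D[mates & off_diag].mean()               # homogeneity of C
    S = D[~mates].mean()                         # separation of C
    return H, S
```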

Jaccard coefficient

- The Jaccard coefficient is the proportion of correctly identified mates out of the sum of the correctly identified mates plus the total number of disagreements (pairs of elements that are mates in exactly one of the two solutions).
- A perfect solution has score 1; the higher the score, the better the solution.
- The method is useful only when the true solution is known.
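In the standard pair-counting notation (not spelled out on the slide), with n_11 the number of pairs that are mates in both solutions, and n_10 and n_01 the numbers of pairs that are mates in exactly one of the two solutions, the coefficient reads:

```latex
J(C_1, C_2) = \frac{n_{11}}{n_{11} + n_{10} + n_{01}}
```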

Average silhouette

- The silhouette of element j is defined as s_j = (b_j − a_j) / max(a_j, b_j), where a_j is the average distance of element j from the other elements of its own cluster, b_jk is the average distance of element j from the members of cluster C_k, and b_j = min_{k : j not in C_k} b_jk.
- The average silhouette is the mean of this ratio over all elements (genes).
- This method performs well in general, but fails to detect fine cluster structures.
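scikit-learn ships exactly this average, which makes a quick check easy; a toy illustration (data not from the talk):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two well-separated one-dimensional clusters.
X = np.array([[0.1], [0.2], [0.3], [5.1], [5.2], [5.3]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Mean of (b_j - a_j) / max(a_j, b_j) over all elements.
print(silhouette_score(X, labels))   # close to 1 for well-separated clusters
```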

The use of external information

- The main focus is the evaluation of clustering solutions using external information.
- We are given an n × p attribute matrix A. The rows of A correspond to elements, and the i-th row vector is called the attribute vector of element i.
- We are also given a clustering C = {C_1, ..., C_l} of the elements, where S_i = |C_i|.
- We index the attribute vectors by the clustering, writing a_ij for the attribute vector of element j in cluster i.
- C is obtained without using the information in A. The goal is to evaluate C with respect to A.

ANOVA test

- Suppose p = 1, the attribute is normally distributed, and the variances of the l population distributions are identical.
- Then we can use standard analysis of variance (ANOVA):
  - a_ij: the attribute of element j in cluster i.
  - ā_i: the mean of the elements in cluster i.
  - ā: the total mean of all n elements.
  - H_0: μ_1 = μ_2 = ··· = μ_l, where μ_i is the expectation of group i.

ANOVA test (cont.)

The test statistic is F_H, built from the between-groups and within-groups sums of squares:

  SSH = Σ_i S_i (ā_i − ā)²
  SSE = Σ_i Σ_j (a_ij − ā_i)²
  F_H = (SSH / (l − 1)) / (SSE / (n − l))

- The F_H statistic has an F distribution with l − 1 and n − l degrees of freedom.
- For the multidimensional case (p > 1), the MANOVA test applies the same objective function F_H if the attribute matrix is multinormally distributed.
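For the p = 1 case, this is exactly what scipy.stats.f_oneway computes; a toy example with made-up values:

```python
from scipy.stats import f_oneway

# One normally distributed attribute, observed in three clusters.
c1 = [5.1, 4.9, 5.3, 5.0]
c2 = [6.2, 6.0, 5.9, 6.4]
c3 = [4.2, 4.5, 4.1, 4.4]

# f_oneway returns F_H = (SSH / (l - 1)) / (SSE / (n - l)) and its p-value.
F_H, p = f_oneway(c1, c2, c3)
print(F_H, p)
```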

Kruskal-Wallis (KW) test

- If the attribute does not follow a normal distribution, we can use the KW test as a non-parametric ANOVA.
- The test assumes that the clusters are independent and have similar shape.
- We denote by P_KW(C, A) the p-value obtained by the KW test for a clustering C using the attribute A.
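The same interface exists for the KW test; a minimal sketch of P_KW(C, A) for a single attribute:

```python
import numpy as np
from scipy.stats import kruskal

def p_kw(attribute, labels):
    """P_KW(C, A): the KW-test p-value for one attribute across the clusters of C."""
    attribute, labels = np.asarray(attribute), np.asarray(labels)
    groups = [attribute[labels == c] for c in np.unique(labels)]
    return kruskal(*groups).pvalue

# Toy usage: a random attribute over three clusters of 20 elements each.
rng = np.random.default_rng(0)
print(p_kw(rng.normal(size=60), np.repeat([0, 1, 2], 20)))
```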

The CQS method

- We introduce a statistically based method for comparing clustering solutions according to prior biological knowledge.
- The solutions are ranked according to their correspondence to prior knowledge about the clustered elements.
- The method tests the dependency between the attributes and the grouping of the elements.
- The method computes a quality score for the functional enrichment of attribute classes among each solution's clusters.
- The result is reported as the CQS (Clustering Quality Score) of the clustering.

Computing a linear combination of the attributes

- Each element is assigned a real value, which is a weighted sum of its attributes.
- An attribute's weight is its coefficient in the linear combination.
- The proposal is to use weights that maximize the ability to discriminate between the clusters using the one-dimensional data.
- The weights are found in the same manner as in Linear Discriminant Analysis (LDA).
- LDA creates a linear combination by maximizing the ratio of between-groups variance to within-groups variance.

Computing a linear combination of the attributes (cont.)

The statistic being maximized is the ratio of MSH to MSE:

  F(w) = [Σ_i S_i (w'ā_i − w'ā)² / (l − 1)] / [Σ_i Σ_j (w'a_ij − w'ā_i)² / (n − l)]

where ā_i is the mean vector of cluster i, ā is the total mean vector, and w is the p-dimensional vector of weights.

Computing a linear combination of the attributes (cont.)

- The maximum value of F(w) is proportional to the greatest root λ of the equation |H − λE| = 0.
- H is a p × p matrix containing the between-groups sums of squares, with entries

    H_rs = Σ_i S_i (ā_ir − ā_r)(ā_is − ā_s),

  where ā_ir is the mean of attribute r in cluster i and ā_r is the total mean of attribute r.
- E is a p × p matrix of the sums of squared errors, with entries

    E_rs = Σ_i Σ_j (a_ijr − ā_ir)(a_ijs − ā_is).

- The desired combination w is the eigenvector corresponding to the greatest root.
- This result holds without assuming any prior distribution on the attributes.

Projection

Applying the linear combination w to the attribute vectors projects these vectors onto the real line:

  z_ij = Σ_t a_ijt w_t

Computing CQS using the projected values

- We evaluate the clustering vis-à-vis the projected attributes using the KW test.
- The value of CQS is −log p, where p = P_KW(C, Z) is the p-value assigned to the clustering by the KW test.
- The p-value is the probability that all values in this particular projection were drawn from the same population.
- CQS favors clustering solutions whose best discriminating weights enable a significant grouping.
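Putting the pieces together, a sketch of the whole CQS computation under the definitions above (the log base and the small ridge on E are implementation choices, not given on the slides):

```python
import numpy as np
from scipy.linalg import eigh
from scipy.stats import kruskal

def cqs(A, labels):
    """Clustering Quality Score: a sketch of the pipeline described above.

    A: (n, p) attribute matrix; labels: length-n array of cluster assignments.
    """
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    p = A.shape[1]
    a_bar = A.mean(axis=0)                        # total mean vector

    # Between-groups (H) and within-groups (E) sum-of-squares matrices.
    H = np.zeros((p, p))
    E = np.zeros((p, p))
    for c in np.unique(labels):
        X = A[labels == c]
        d = X.mean(axis=0) - a_bar
        H += len(X) * np.outer(d, d)
        R = X - X.mean(axis=0)
        E += R.T @ R

    # w is the eigenvector for the greatest root of |H - lambda*E| = 0.
    # eigh solves the generalized symmetric problem H w = lambda E w;
    # the small ridge keeps E positive definite for degenerate attributes.
    eigvals, eigvecs = eigh(H, E + 1e-9 * np.eye(p))
    w = eigvecs[:, np.argmax(eigvals)]

    # Project the attribute vectors onto the real line: z_ij = sum_t a_ijt * w_t.
    z = A @ w
    groups = [z[labels == c] for c in np.unique(labels)]

    # CQS = -log p with p = P_KW(C, Z).
    return -np.log(kruskal(*groups).pvalue)
```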

Estimating confidence

- The accuracy and significance of the scores are estimated by evaluating the sensitivity of CQS to small modifications of the clustering solution.
- Each alternative solution is obtained by introducing k exchanges of random pairs of elements from different clusters of the original solution.
- The larger the influence of small perturbations in the clustering on the CQS value, the smaller the confidence we have in the CQS.
- The CQS confidence is the standard deviation of CQS over the group of alternative clustering solutions.
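A sketch of the confidence estimate, reusing cqs() from above; the number of alternative solutions and the default k are illustrative, since the slides do not fix them:

```python
import numpy as np

def cqs_confidence(A, labels, k=10, n_solutions=20, seed=0):
    """Standard deviation of CQS over solutions perturbed by k random exchanges."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    scores = []
    for _ in range(n_solutions):
        perturbed = labels.copy()
        for _ in range(k):
            # Draw a random pair of elements from different clusters and swap them.
            i, j = rng.choice(len(perturbed), size=2, replace=False)
            while perturbed[i] == perturbed[j]:
                i, j = rng.choice(len(perturbed), size=2, replace=False)
            perturbed[i], perturbed[j] = perturbed[j], perturbed[i]
        scores.append(cqs(A, perturbed))
    return np.std(scores)
```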

The overall procedure

1. Find the eigenvector w corresponding to the greatest root of the system of equations |H − λE| = 0.
2. For each attribute vector a_ij, set z_ij = Σ_t a_ijt w_t.
3. Compute p = P_KW(C, Z); let CQS(C, A) = −log p.
4. Estimate the statistical confidence of the result by perturbations on C.

Results on simulated data

- The effect of one-dimensional projection.
- The effect of solution accuracy on CQS.
- Sensitivity of CQS to the number of clusters.
- The ability of CQS to detect fine clustering structures.

Simulation

- 80 binary attributes.
- 5 groups of n = 50 genes each.
- For each attribute, one group was randomly selected in which the attribute's frequency would be r; in the other 4 groups its frequency was set to r_0.
- The set of r (respectively r_0) genes carrying that attribute was randomly selected from the relevant group.
- The larger the difference between r and r_0, the easier the distinction between the groups.
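A sketch of this generator; the slides give the tested r values but never state r_0, so the defaults below are purely illustrative:

```python
import numpy as np

def simulate(n_attrs=80, n_groups=5, n=50, r=15, r0=6, seed=0):
    """Binary simulation data as described above (the r0 default is hypothetical)."""
    rng = np.random.default_rng(seed)
    A = np.zeros((n_groups * n, n_attrs), dtype=int)
    labels = np.repeat(np.arange(n_groups), n)
    for t in range(n_attrs):
        enriched = rng.integers(n_groups)          # the group with frequency r
        for g in range(n_groups):
            freq = r if g == enriched else r0
            rows = g * n + rng.choice(n, size=freq, replace=False)
            A[rows, t] = 1
    return A, labels
```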

The effect of one-dimensional projection

- Data were simulated with r = 6, 10, 15, 20 and 25.
- The ratio of separation to homogeneity of the true clustering was computed on the original data (S/H) and on the projected data (S*/H*).
- The procedure was repeated 10 times.

The effect of one-dimensional projection: results

- The homogeneity, the separation and their ratio were found to be monotone as a function of r, both on the original data and on the reduced data.
- CQS improves monotonically with r.

[Figure: box plots for the projection of 5 simulated clusters after dimensionality reduction. With r = 6, the clusters look very similar, even though the data were reduced using the best separating linear combination; with r = 25, the inter-cluster separation of most clusters is clearly visible. The y-axis is the real-valued projection of the elements. Each box plot depicts the median of the distribution (dot), the 0.1 and 0.9 distribution quantiles (white box), and the maximum and minimum values.]

The effect of solution accuracy on CQS: results

- CQS of the true partition was compared with that of other, similar and remote partitions.
- Those were produced by starting with the true solution and repeatedly exchanging a randomly chosen pair of elements from different clusters.
- CQS is highest for the true partition and decreases with the number of exchanges applied (200 exchanges generate an essentially random partition).
- The Jaccard coefficient likewise decreases with the number of exchanges.

[Figure: the accuracy of the different clustering solutions is measured by the number of inter-cluster exchanges introduced in the original solution. X-axis: number of exchanges. Y-axis: CQS (left scale) and Jaccard coefficient (right scale).]

Sensitivity of CQS to the number of clusters (splitting)

- How does CQS change when splitting clusters?
- The true 5-cluster solution was compared with a 25-cluster solution obtained by randomly splitting each of the 5 clusters into 5 equal-size sub-clusters.
- This test was repeated 10 times.
- A decrease of the clustering quality measures was observed in all runs.
- The decrease of S/H is maintained in CQS and on the reduced data (S*/H*).

[Figure: CQS, S*/H* and S/H (y-axis) of the true solution (gray) and the modified solution (black).]

Sensitivity of CQS to the number of clusters (merging)

- How does CQS change when merging clusters?
- Two 5-cluster data sets were simulated with n = 25, then combined into a single data set whose true solution consists of 10 equal-size clusters of 25 genes each.
- Pairs of clusters, one from each original data set, were merged to form 5 clusters of 50 genes each. These 5 clusters comprise the alternative (merged) solution.
- All the measures decrease due to the merging, as in the splitting test.
- The decrease of S/H is maintained and enlarged in S*/H* and CQS.

[Figure: CQS, S*/H* and S/H (y-axis) of the true solution (gray) and the modified solution (black).]

Sensitivity of CQS to the number of clusters

- How does CQS compare with the Jaccard coefficient?
- 5-cluster data were simulated and K-means was applied to the data, with K = 2, ..., 15.
- CQS and the Jaccard coefficient were computed for each clustering solution.
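A sketch of this sweep, reusing simulate() and cqs() from above; note that clustering the attribute matrix itself is a simplification of the experimental setup:

```python
from sklearn.cluster import KMeans

# Score K-means solutions with K = 2..15 on simulated 5-cluster data.
A, true_labels = simulate(r=25)
for K in range(2, 16):
    pred = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(A)
    print(K, cqs(A, pred))
```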

Sensitivity of CQS to the number of clusters: results

- CQS behaves like the Jaccard coefficient and S/H, with a maximum at K = 5 (the true number of clusters).
- The rankings of all 14 solutions according to the Jaccard score (which is based on the true solution) and according to CQS (which is based on the attributes only) are virtually identical.
- The ratio score also does quite well, with a maximum at K = 5; however, its ranking of the solutions does not agree with the Jaccard score.

The ability of CQS to detect fine clustering structures

- Profiles of 30 binary attributes were generated for four clusters of n = 50 genes each.
- For each attribute, its frequencies in clusters 1, 2, 3 and 4 were set to 2, b, 50 − b and 48, respectively.
- Data sets were simulated with b = 3, 5, 10, 15, 20.
- For each data set, two clustering solutions were scored: the original 4-cluster solution, and a 2-cluster solution obtained by merging cluster 1 with 2 and cluster 3 with 4.
- For large values of b we expect the 4-cluster solution to score higher than the 2-cluster solution.
- For each data set and each of the two solutions, S/H, CQS and the average silhouette score were computed.
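A sketch of this second generator, directly following the stated per-cluster frequencies (2, b, 50 − b, 48):

```python
import numpy as np

def simulate_fine(b=10, n=50, n_attrs=30, seed=0):
    """Four-cluster binary data with attribute frequencies 2, b, n-b, n-2 per cluster."""
    rng = np.random.default_rng(seed)
    freqs = [2, b, n - b, n - 2]                  # frequencies in clusters 1..4
    A = np.zeros((4 * n, n_attrs), dtype=int)
    labels = np.repeat(np.arange(4), n)
    for t in range(n_attrs):
        for g, f in enumerate(freqs):
            rows = g * n + rng.choice(n, size=f, replace=False)
            A[rows, t] = 1
    return A, labels
```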

The ability of CQS to detect fine clustering structures: results

- The ratios increase with b in all scores.
- The silhouette for the 2-cluster solution is always greater than for the corresponding 4-cluster solution.
- For b = 3, 5, 10, 15, S/H is greater for the incorrect 2-cluster solution.
- CQS is able to identify the fine structure in the data for all b values except b = 3.
- For b = 3, the 2-cluster CQS is higher than the 4-cluster CQS, since there is almost no difference between the clusters with 2 or 3 occurrences of an attribute, or between the clusters with 47 or 48 occurrences.

[Figures: Silhouette, S/H and CQS for the two solutions as a function of b.]

Yeast cell-cycle data

- The tested data set is yeast cell-cycle data.
- The data set contains 698 genes and 72 conditions.
- Each row of the 698 × 72 matrix was normalized to have mean 0 and variance 1.
- We expect to find 5 main clusters in the data.
- The 698 × 72 data set was clustered using four clustering methods: K-means, SOM, CAST and CLICK.

Yeast cell-cycle data

The results were:

- K-means: 5 clusters
- SOM: 6 clusters
- CAST: 5 clusters
- CLICK: 6 clusters + 23 singletons

In addition, the genes were manually divided into 5 groups using their peak of expression; this serves as the 'true' solution. A random clustering of the data into 5 equal-size clusters was also scored.

Yeast cell-cycle data

- As gene attributes, the GO classes and the MIPS annotations were used.
- Overall: 51 GO process attributes, 37 GO function attributes, 27 GO component attributes and 59 MIPS attributes.
- CQS was computed 3 times: using the GO process attributes only, using all GO attributes, and using the MIPS attributes only.
- The experiment shows that different biological attributes lead to different evaluations of clustering solutions.

Yeast cell-cycle data

Scores of the clustering solutions computed using GO level-5 process attributes only:

[Figure]

Yeast cell-cycle data

Scores of the clustering solutions computed using all GO level-5 attributes:

[Figure]

Yeast cell-cycle data

Scores of the clustering solutions computed using MIPS level-4 attributes:

[Figure]

Yeast cell-cycle data

The 22 most enriched attributes for the CLICK solution, using all GO attributes:

[Figure]

Yeast cell-cycle data

The distribution and co-occurrence of the attributes 'DNA metabolism', 'DNA replication' and 'chromosome organization' in the 6 clusters of the CLICK solution:

[Figure]

Conclusion

- The CQS method is based on biological relevance: it uses attributes of the clustered elements that are available independently of the data used to generate the clusters.
- The method can be applied:
  - to compare the functional enrichment of many biological attributes simultaneously in different clustering solutions;
  - to optimize the parameters of a clustering algorithm.

Conclusion (cont.)

- The method outperforms previous numeric methods (the S/H ratio and the average silhouette measure).
- CQS is sensitive to small modifications of the clustering solution and to changes in the simulation setting.
- The attribute weights are computed using information about all the attributes together, without assuming that the attributes are independent.
- CQS has the advantage that it can use continuous data without any assumption on the data distribution.