

Data Mining in Bioinformatics
Day 7: Clustering in Bioinformatics

Karsten Borgwardt

February 21 to March 4, 2011

Machine Learning & Computational Biology Research Group
MPIs Tübingen

Clustering in bioinformatics


Microarrays
Clustering is a widely used tool in microarray analysis.
Class discovery is an important problem in microarray studies for two reasons:
  either the classes are completely unknown beforehand,
  or it is unknown whether a known class contains interesting subclasses.



Examples
Classes unknown:
  Does a disease affect gene expression in a particular tissue?
  Does gene expression differ between two groups in a particular condition?
Subclasses unknown:
  Are there subtypes of a disease?
  Is there even a hierarchy of subclasses within one disease?



Popularity
Clustering tools are available in the large microarray database NCBI Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo/
3002 PubMed hits for 'microarray clustering'
Recent editorial of OUP Bioinformatics

Distance metrics


Euclidean distance
Euclidean distance of genes x and y across n samples, or of samples x and y across n genes:

$$d_{xy} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{1}$$
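Equation (1) can be sketched in a few lines of plain Python. The function name `euclidean_distance` and the expression values are hypothetical and only serve as illustration.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length expression profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Made-up expression values of two genes across n = 4 samples.
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [2.0, 4.0, 6.0, 8.0]
d = euclidean_distance(gene_x, gene_y)
```

Note that the two profiles have the same shape (one is twice the other) and still end up far apart, which is one reason correlation-based measures are often preferred for expression data.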

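Pearson's correlation
Pearson correlation of genes x and y across n samples, or of samples x and y across n genes, where $\bar{x}$ is the mean of x and $\bar{y}$ is the mean of y:

$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{2}$$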
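A minimal sketch of equation (2) follows. The function name is hypothetical, and turning the correlation into a dissimilarity via 1 − r is a common convention assumed here, not something stated on the slides.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation of two equal-length profiles, following equation (2)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two profiles of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    if ss_x == 0.0 or ss_y == 0.0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (math.sqrt(ss_x) * math.sqrt(ss_y))

# Made-up profiles: gene_y is a scaled copy of gene_x.
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [2.0, 4.0, 6.0, 8.0]
r = pearson_correlation(gene_x, gene_y)

# Assumed convention (not from the slides): use 1 - r as a dissimilarity.
correlation_distance = 1.0 - r
```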



Un-centered correlation coefficient
Un-centered correlation coefficient of genes x and y across n samples, or of samples x and y across n genes:

$$r^{u}_{xy} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \tag{3}$$
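Equation (3) is the Pearson formula with the means left out, so a constant offset between two profiles changes the result. The sketch below illustrates that with made-up values; the function name is hypothetical.

```python
import math

def uncentered_correlation(x, y):
    """Un-centered correlation of two equal-length profiles, following equation (3)."""
    if len(x) != len(y) or not x:
        raise ValueError("need two non-empty profiles of equal length")
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi ** 2 for xi in x))
    norm_y = math.sqrt(sum(yi ** 2 for yi in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("undefined for an all-zero profile")
    return dot / (norm_x * norm_y)

# Scaling a profile leaves the coefficient at 1 ...
r_scaled = uncentered_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# ... but adding a constant offset lowers it, unlike the centered version.
r_shifted = uncentered_correlation([1.0, 2.0, 3.0], [11.0, 12.0, 13.0])
```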

Clustering algorithms


Hierarchical Clustering
Single linkage: The linking distance is the minimum distance between two clusters.
Complete linkage: The linking distance is the maximum distance between two clusters.
Average linkage/UPGMA: The linking distance is the average of all pair-wise distances between members of the two clusters. Since all genes and samples carry equal weight, the linkage is an Unweighted Pair Group Method with Arithmetic Means (UPGMA).
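The three linkage rules differ only in how the pair-wise distances between two clusters are summarised. The following naive agglomerative sketch makes that explicit. It assumes Euclidean distance and a fixed target number of clusters; all names are hypothetical, and it is far slower than a library implementation.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def linkage_distance(cluster_a, cluster_b, points, method):
    """Linking distance between two clusters given as lists of point indices."""
    pairwise = [euclidean(points[i], points[j]) for i in cluster_a for j in cluster_b]
    if method == "single":
        return min(pairwise)                  # minimum distance
    if method == "complete":
        return max(pairwise)                  # maximum distance
    if method == "average":
        return sum(pairwise) / len(pairwise)  # UPGMA: mean of all pairs
    raise ValueError("unknown linkage method: " + method)

def agglomerative(points, n_clusters, method="average"):
    """Merge the two closest clusters until n_clusters remain."""
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and len(points)")
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage_distance(clusters[a], clusters[b], points, method)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Made-up 2-D points: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters = agglomerative(points, n_clusters=3, method="single")
```

Recording the merge order and linking distances instead of stopping at a fixed number of clusters would yield the dendrogram that is usually plotted for microarray data.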

'Flat' Clustering
k-means (k from 2 to 15, 3 runs)
k-median (k-medoid)
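A minimal k-means sketch (Lloyd's algorithm) is given below. Reading "3 runs" as three random restarts from which the run with the lowest within-cluster sum of squares is kept is an assumption, as are all function and parameter names.

```python
import random

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points]

def kmeans(points, k, n_iter=100, seed=0):
    """One run of Lloyd's algorithm; returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep centroid of an empty cluster
        if new_centroids == centroids:
            break
        centroids = new_centroids
    labels = assign(points, centroids)
    inertia = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, inertia

def best_of_runs(points, k, n_runs=3):
    """Assumed reading of '3 runs': keep the restart with the lowest inertia."""
    runs = [kmeans(points, k, seed=s) for s in range(n_runs)]
    return min(runs, key=lambda run: run[2])

# Made-up data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids, inertia = best_of_runs(points, k=2)
```

For k-median or k-medoid variants, the centroid update would use a median or a representative data point instead of the mean.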

The two-sample problem


Interpretation of clusters
Clustering introduces 'structure' into microarray datasets.
But is there a statistical or biomedical meaning of these classes?
Biomedical meaning has to be established in experiments.
'Statistical meaning' can be measured using statistical tests, by a so-called two-sample test.

A two-sample test decides whether two samples were drawn from the same probability distribution or not.



Data diversity
Molecular biology produces a wealth of information.
The problem is that these data are generated
  on different platforms,
  by different protocols, and
  under different levels of noise.

Hence data from different labs show
  different scales
  different ranges
  different distributions

Main problem:
Joint data analysis may detect differences in distributions, not biological phenomena!



The two-sample problem
Given two samples X and Y.
Were they generated by the same distribution?

Previous approaches
Two-sample tests exist for univariate and multivariate data.



t-test
A test of the null hypothesis that the means of two normally distributed populations are equal
unpaired/independent (versus paired)
For equal sample sizes and equal variances, the t statistic to test whether the means are different can be calculated as follows:

$$t = \frac{\bar{x} - \bar{y}}{\sigma_{xy} \cdot \sqrt{\frac{2}{n}}} \tag{4}$$

where $\sigma_{xy} = \sqrt{\frac{\sigma_x^2 + \sigma_y^2}{2}}$.
The number of degrees of freedom for this test is 2n − 2, where n is the size of each sample.
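Equation (4) can be computed directly with the standard library. The sketch assumes that the variances in the formula are the unbiased sample variances (as returned by `statistics.variance`); the function name and the measurements are made up.

```python
import math
from statistics import mean, variance

def equal_n_t_statistic(x, y):
    """t statistic and degrees of freedom for two samples of equal size, equation (4)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal size >= 2")
    # Assumption: sigma_x^2 and sigma_y^2 are unbiased sample variances.
    pooled_sd = math.sqrt((variance(x) + variance(y)) / 2.0)
    if pooled_sd == 0.0:
        raise ValueError("t statistic is undefined when both samples are constant")
    t = (mean(x) - mean(y)) / (pooled_sd * math.sqrt(2.0 / n))
    return t, 2 * n - 2

# Made-up expression values of one gene in two groups of five samples.
group_x = [5.1, 4.9, 5.3, 5.0, 5.2]
group_y = [4.2, 4.0, 4.4, 4.1, 4.3]
t, dof = equal_n_t_statistic(group_x, group_y)
```

The resulting t would then be compared against a t distribution with `dof` degrees of freedom to obtain a p-value.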



New challenges in bioinformatics
  high-dimensional
  structured (strings and graphs)
  low sample size

Novel distribution test: Maximum Mean Discrepancy (MMD)

MMD key idea




Key Idea
Avoid density estimator, use means in feature spaces
Maximum Mean Discrepancy (Fortet and Mourier, 1953)

$$D(p, q, \mathcal{F}) := \sup_{f \in \mathcal{F}} \left( E_p[f(x)] - E_q[f(y)] \right)$$

Theorem
$D(p, q, \mathcal{F}) = 0$ iff $p = q$, when $\mathcal{F} = C^0(\mathcal{X})$.
Follows directly, e.g., from Dudley, 1984.

Theorem
$D(p, q, \mathcal{F}) = 0$ iff $p = q$, when $\mathcal{F} = \{ f \mid \|f\|_{\mathcal{H}} \le 1 \}$, provided that $\mathcal{H}$ is a universal RKHS.
Follows via Steinwart, 2001, and Smola et al., 2006.
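Why the unit-ball case reduces to kernel evaluations can be sketched in a few lines. The notation is an assumption not used on the slides: $k$ is the kernel of $\mathcal{H}$, $\phi(x) = k(x, \cdot)$ its feature map, and $\mu_p := E_p[\phi(x)]$, $\mu_q := E_q[\phi(y)]$ are the mean embeddings, which are assumed to exist. The first step uses the reproducing property $f(x) = \langle f, \phi(x) \rangle_{\mathcal{H}}$, the second the Cauchy-Schwarz inequality.

```latex
\begin{align*}
D(p, q, \mathcal{F})
  &= \sup_{\|f\|_{\mathcal{H}} \le 1} \left( E_p[f(x)] - E_q[f(y)] \right)
   = \sup_{\|f\|_{\mathcal{H}} \le 1} \langle f, \mu_p - \mu_q \rangle_{\mathcal{H}} \\
  &= \|\mu_p - \mu_q\|_{\mathcal{H}}, \\
D^2(p, q, \mathcal{F})
  &= \langle \mu_p - \mu_q, \mu_p - \mu_q \rangle_{\mathcal{H}} \\
  &= E_{p,p}\, k(x, x') - 2\, E_{p,q}\, k(x, y) + E_{q,q}\, k(y, y').
\end{align*}
```

Here $x, x'$ are independent draws from $p$ and $y, y'$ independent draws from $q$.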

MMD statistic


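Goal: Estimate D(p, q, F)
For F the unit ball of an RKHS with kernel k, the squared discrepancy can be written in terms of kernel expectations:

$$D^2(p, q, \mathcal{F}) = E_{p,p}\, k(x, x') - 2\, E_{p,q}\, k(x, y) + E_{q,q}\, k(y, y')$$

U-Statistic: Empirical estimate $D^2(X, Y, \mathcal{F})$ from samples $X = \{x_1, \dots, x_m\}$ and $Y = \{y_1, \dots, y_m\}$

$$D^2(X, Y, \mathcal{F}) = \frac{1}{m(m-1)} \sum_{i \neq j} \left[ k(x_i, x_j) - k(x_i, y_j) - k(y_i, x_j) + k(y_i, y_j) \right]$$

Theorem
$D^2(X, Y, \mathcal{F})$ is an unbiased estimator of $D^2(p, q, \mathcal{F})$.

Test
Estimate σ² from data.
Reject the null hypothesis that p = q if $D^2(X, Y, \mathcal{F})$ exceeds the acceptance threshold.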
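The U-statistic can be sketched directly from its definition. Two parts of the sketch are assumptions rather than the slides' method: the Gaussian kernel with bandwidth 1, and the acceptance threshold, which is obtained here from a permutation test instead of the σ²-based threshold mentioned above. All names are hypothetical.

```python
import math
import random

def gaussian_kernel(a, b, bandwidth=1.0):
    """Assumed kernel choice: Gaussian RBF on vectors."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * bandwidth ** 2))

def mmd2_unbiased(xs, ys, kernel=gaussian_kernel):
    """Unbiased U-statistic estimate of the squared MMD for equal sample sizes."""
    m = len(xs)
    if m != len(ys) or m < 2:
        raise ValueError("need two samples of equal size >= 2")
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i != j:
                total += (kernel(xs[i], xs[j]) - kernel(xs[i], ys[j])
                          - kernel(ys[i], xs[j]) + kernel(ys[i], ys[j]))
    return total / (m * (m - 1))

def permutation_test(xs, ys, n_permutations=200, alpha=0.05, seed=0):
    """Swapped-in thresholding: compare the statistic with its permutation distribution."""
    rng = random.Random(seed)
    observed = mmd2_unbiased(xs, ys)
    pooled = list(xs) + list(ys)
    m = len(xs)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if mmd2_unbiased(pooled[:m], pooled[m:]) >= observed:
            exceed += 1
    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, p_value, p_value < alpha

# Made-up one-dimensional samples with clearly different locations.
xs = [(0.1 * i,) for i in range(10)]
ys = [(5.0 + 0.1 * i,) for i in range(10)]
observed, p_value, reject = permutation_test(xs, ys)
```

The double loop reflects the quadratic runtime in the number of data points. Because the kernel is passed in as a function, a string or graph kernel could be substituted without changing the rest of the code.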

Attractive for bioinformatics


MMD
two-sample test in terms of kernels

Computationally attractive
  search infinite space of functions by evaluating one expression
  no optimization problem has to be solved

All thanks to kernels!



Wide applicability
for one- and higher-dimensional vectorial data,
but also for structured data!
Two-sample problems can now be tackled on
  strings: protein and DNA sequences
  graphs: molecules, protein interaction networks
  time series: time series of microarray data
  and sets, trees, ...

Cross-platform comparability


Data
microarray data from two breast cancer studies
  one on cDNA platform (Gruvberger et al., 2001)
  other on oligonucleotide microarray platform (West et al., 2001)

Task
Can MMD help to find out if two sets of observations were generated by
  the same study (both from Gruvberger or both from West)?
  different studies (one Gruvberger, one West)?



Experiment
sample size each: 25
dimension of each datapoint: 2,116
significance level: α = 0.05

100 times: 1 sample from Gruvberger, 1 from West
100 times: both from Gruvberger or both from West
report percentage of correct decisions
compare to t-test, Friedman-Rafsky Wald-Wolfowitz and Smirnov
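The evaluation protocol can be sketched as a small harness around any two-sample test. Everything here is a stand-in: the pools are synthetic scalars rather than the Gruvberger and West data, `mean_gap_rejects` is a toy decision rule instead of MMD, and drawing the two same-study samples as disjoint halves of one draw is an assumption about how the splits were made. The percentages it returns say nothing about the results reported in the lecture.

```python
import random
from statistics import mean

def evaluate(pool_a, pool_b, rejects_null, sample_size=25, n_trials=100, seed=0):
    """Percentage of correct decisions for different-study and same-study trials."""
    rng = random.Random(seed)
    correct_different = 0
    for _ in range(n_trials):
        xs = rng.sample(pool_a, sample_size)
        ys = rng.sample(pool_b, sample_size)
        if rejects_null(xs, ys):            # correct: samples come from different studies
            correct_different += 1
    correct_same = 0
    for trial in range(n_trials):
        pool = pool_a if trial % 2 == 0 else pool_b
        draw = rng.sample(pool, 2 * sample_size)   # assumed: disjoint halves of one study
        if not rejects_null(draw[:sample_size], draw[sample_size:]):
            correct_same += 1               # correct: null hypothesis retained
    return 100.0 * correct_different / n_trials, 100.0 * correct_same / n_trials

def mean_gap_rejects(xs, ys, threshold=1.0):
    """Toy stand-in for a two-sample test on scalar data."""
    return abs(mean(xs) - mean(ys)) > threshold

# Synthetic pools standing in for the two studies.
pool_a = [0.01 * (i % 10) for i in range(60)]
pool_b = [5.0 + v for v in pool_a]
pct_different, pct_same = evaluate(pool_a, pool_b, mean_gap_rejects)
```

Any function that takes two samples and returns a reject/accept decision at the chosen significance level can be plugged in as `rejects_null`.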

Kernel-based statistical test


Novel statistical test for the two-sample problem:
  easy to implement
  non-parametric
  first for structured data
  best on high-dimensional data
  quadratic runtime w.r.t. the number of data points
  impressive accuracy in our experiments

Kernel method for the two-sample problem:
  all kernels recently defined in molecular biology can be re-used for data integration
  applicable to vectors, strings, sets, trees, graphs and time series

Biclustering


Clustering in two dimensions
alternative names: co-clustering, two-mode clustering
A bicluster is a subset of genes that show similar activity patterns under a subset of conditions.
Cluster patients and conditions
Earliest work by Hartigan, 1972: Divide a matrix into submatrices with minimum variance.
Most interesting cases are NP-complete.
Many extensions in bioinformatics (e.g. Cheng and Church, 2000)
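The minimum-variance criterion can be illustrated by scoring one candidate bicluster. The sketch only evaluates a given choice of rows and columns; searching over all subsets is the hard part and is not attempted. The function name and the matrix are made up.

```python
def submatrix_variance(matrix, rows, cols):
    """Variance of the entries selected by a row subset and a column subset."""
    values = [matrix[r][c] for r in rows for c in cols]
    if not values:
        raise ValueError("the submatrix must contain at least one entry")
    center = sum(values) / len(values)
    return sum((v - center) ** 2 for v in values) / len(values)

# Made-up genes-by-conditions matrix with a constant 2x2 block in the top-left corner.
matrix = [
    [5.0, 5.0, 1.0, 9.0],
    [5.0, 5.0, 7.0, 2.0],
    [0.0, 8.0, 3.0, 4.0],
    [6.0, 1.0, 2.0, 7.0],
]
block_variance = submatrix_variance(matrix, rows=[0, 1], cols=[0, 1])
whole_variance = submatrix_variance(matrix, rows=range(4), cols=range(4))
```

A candidate with low variance relative to the whole matrix, such as the top-left block here, is the kind of submatrix the criterion favours.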

References and further reading


References

[1] Gretton, Borgwardt, Rasch, Schölkopf, Smola: A kernel method for the two-sample problem. NIPS 2006

The end


See you tomorrow! Next topic: Feature Selection in Bioinformatics
