Karsten Borgwardt: Data Mining in Bioinformatics, Page 1
Data Mining in Bioinformatics
Day 7: Clustering in Bioinformatics
Karsten Borgwardt
February 21 to March 4, 2011
Machine Learning & Computational Biology Research Group
MPIs Tübingen
Clustering in bioinformatics
Microarrays
  Clustering is a widely used tool in microarray analysis
  Class discovery is an important problem in microarray studies for two reasons:
    either the classes are completely unknown beforehand
    or it is unknown whether a known class contains interesting subclasses
Examples
  Classes unknown:
    Does a disease affect gene expression in a particular tissue?
    Does gene expression differ between two groups in a particular condition?
  Subclasses unknown:
    Are there subtypes of a disease?
    Is there even a hierarchy of subclasses within one disease?
Popularity
  Clustering tools are available in the large microarray database NCBI Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo/
  3002 PubMed hits for ‘microarray clustering’
  Recent editorial of OUP Bioinformatics
Distance metrics
Euclidean distance
  Euclidean distance of gene x and y of n samples or sample x and y of n genes:
$$ d_{xy} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{1} $$
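Pearson’s Correlation
  Pearson correlation of gene x and y of n samples or sample x and y of n genes, where $\bar{x}$ is the mean of x and $\bar{y}$ is the mean of y:
$$ r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{2} $$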
Un-centered correlation coefficient
  Un-centered correlation coefficient of gene x and y of n samples or sample x and y of n genes:
$$ r^{u}_{xy} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \tag{3} $$
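As a minimal sketch, the three measures in equations (1) to (3) could be computed as follows. The function names and the two toy expression profiles are illustrative assumptions, not part of the lecture; only the Python standard library is used.

```python
import math

def euclidean_distance(x, y):
    """Equation (1): square root of the summed squared differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def pearson_correlation(x, y):
    """Equation (2): correlation of the mean-centered profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x))
    norm_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return cov / (norm_x * norm_y)

def uncentered_correlation(x, y):
    """Equation (3): same form as Pearson, but the means are not subtracted."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi ** 2 for xi in x))
    norm_y = math.sqrt(sum(yi ** 2 for yi in y))
    return dot / (norm_x * norm_y)

# Hypothetical expression profiles of two genes across four samples.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]

print(euclidean_distance(gene_a, gene_b))
print(pearson_correlation(gene_a, gene_b))
print(uncentered_correlation(gene_a, gene_b))
```

The two correlation coefficients are similarities rather than distances. A common convention, which the slides do not state, is to cluster on 1 − r instead.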
Clustering algorithms
Hierarchical Clustering
  Single linkage: The linking distance is the minimum distance between two clusters.
  Complete linkage: The linking distance is the maximum distance between two clusters.
  Average linkage/UPGMA: The linking distance is the average of all pair-wise distances between members of the two clusters. Since all genes and samples carry equal weight, the linkage is an Unweighted Pair Group Method with Arithmetic Means (UPGMA).
‘Flat’ Clustering
  k-means (k from 2 to 15, 3 runs)
  k-median (k-medoid)
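The three linkage rules can be illustrated with a deliberately naive agglomerative procedure. This is a sketch under stated assumptions: Euclidean distance between points, a hypothetical stopping rule `num_clusters`, and four toy points. It is not the implementation used by any particular microarray tool.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def linkage_distance(cluster_a, cluster_b, method):
    """Distance between two clusters (lists of points) under a linkage rule."""
    pairwise = [euclidean(a, b) for a in cluster_a for b in cluster_b]
    if method == "single":      # minimum pair-wise distance
        return min(pairwise)
    if method == "complete":    # maximum pair-wise distance
        return max(pairwise)
    if method == "average":     # UPGMA: mean of all pair-wise distances
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")

def agglomerative(points, num_clusters, method="average"):
    """Merge the two closest clusters until num_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage_distance(clusters[i], clusters[j], method)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
result = agglomerative(points, num_clusters=2, method="single")
print(result)
```

Recomputing every pair-wise distance at each merge keeps the sketch short but is slow. Practical implementations cache and update a distance matrix instead.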
The two-sample problem
Interpretation of clusters
  Clustering introduces ‘structure’ into microarray datasets
  But is there a statistical or biomedical meaning of these classes?
  Biomedical meaning has to be established in experiments
  ‘Statistical meaning’ can be measured using statistical tests, by a so-called two-sample test
  A two-sample test decides whether two samples were drawn from the same probability distribution or not
Data diversity
  Molecular biology produces a wealth of information
  The problem is that these data are generated
    on different platforms
    by different protocols
    and under different levels of noise
  Hence data from different labs show
    different scales
    different ranges
    different distributions
  Main problem:
    Joint data analysis may detect differences in distributions, not biological phenomena!
The two-sample problem
  Given two samples X and Y.
  Were they generated by the same distribution?
Previous approaches
  two-sample tests exist for univariate and multivariate data
t-test
  A test of the null hypothesis that the means of two normally distributed populations are equal
  unpaired/independent (versus paired)
  For equal sample sizes and equal variances, the t statistic to test whether the means are different can be calculated as follows:
$$ t = \frac{\bar{x} - \bar{y}}{\sigma_{xy} \cdot \sqrt{\frac{2}{n}}} \tag{4} $$
  where $\sigma_{xy} = \sqrt{\frac{\sigma_x^2 + \sigma_y^2}{2}}$.
  The number of degrees of freedom for this test is 2n − 2, where n is the size of each sample.
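Equation (4) can be checked on a small example. The sketch below assumes that the two variances are the unbiased sample variances (the slide does not say which estimator is meant) and uses two made-up samples of size n = 5.

```python
import math
import statistics

def equal_variance_t_statistic(x, y):
    """t statistic of equation (4) for two samples of equal size n."""
    n = len(x)
    if len(y) != n:
        raise ValueError("this form of the test assumes equal sample sizes")
    var_x = statistics.variance(x)  # unbiased sample variance (assumption)
    var_y = statistics.variance(y)
    sigma_xy = math.sqrt((var_x + var_y) / 2)
    t = (statistics.mean(x) - statistics.mean(y)) / (sigma_xy * math.sqrt(2 / n))
    degrees_of_freedom = 2 * n - 2
    return t, degrees_of_freedom

# Hypothetical measurements: y is x shifted by one unit.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 3.0, 4.0, 5.0, 6.0]
t, df = equal_variance_t_statistic(x, y)
print(t, df)
```

Here both sample variances are 2.5, so the denominator is 1 and t is −1 with 8 degrees of freedom. Turning t into a p-value additionally requires the t distribution, which is not part of this sketch.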
New challenges in bioinformatics
  high-dimensional
  structured (strings and graphs)
  low sample size
Novel distribution test: Maximum Mean Discrepancy (MMD)
MMD key idea
Key Idea
  Avoid density estimator, use means in feature spaces
  Maximum Mean Discrepancy (Fortet and Mourier, 1953)
$$ D(p, q, \mathcal{F}) := \sup_{f \in \mathcal{F}} \left( E_p[f(x)] - E_q[f(y)] \right) $$
Theorem
  $D(p, q, \mathcal{F}) = 0$ iff $p = q$, when $\mathcal{F} = C^0(\mathcal{X})$.
  Follows directly, e.g. from Dudley, 1984.
Theorem
  $D(p, q, \mathcal{F}) = 0$ iff $p = q$, when $\mathcal{F} = \{ f \mid \|f\|_{\mathcal{H}} \le 1 \}$, provided that $\mathcal{H}$ is a universal RKHS.
  (follows via Steinwart, 2001, Smola et al., 2006).
MMD statistic
Goal: Estimate $D(p, q, \mathcal{F})$
$$ E_{p,p}\, k(x, x') - 2\, E_{p,q}\, k(x, y) + E_{q,q}\, k(y, y') $$
U-Statistic: Empirical estimate $D(X, Y, \mathcal{F})$
$$ \frac{1}{m(m-1)} \sum_{i \neq j} \left[ k(x_i, x_j) - k(x_i, y_j) - k(y_i, x_j) + k(y_i, y_j) \right] $$
Theorem
  $D(X, Y, \mathcal{F})$ is an unbiased estimator of $D(p, q, \mathcal{F})$.
Test
  Estimate $\sigma^2$ from data.
  Reject null hypothesis that $p = q$ if $D(X, Y, \mathcal{F})$ exceeds acceptance threshold.
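The U-statistic above can be evaluated directly from kernel values. The following sketch assumes a Gaussian RBF kernel with bandwidth `sigma` and two equally sized toy samples; the kernel choice, the function names and the data are assumptions for illustration. In Gretton et al. [1] the corresponding quantity is the estimate of the squared MMD. Estimating the variance and the acceptance threshold is not shown.

```python
import math

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel (an assumed choice; any valid kernel could be passed in)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def mmd_u_statistic(X, Y, kernel=gaussian_kernel):
    """Sum over i != j of k(xi,xj) - k(xi,yj) - k(yi,xj) + k(yi,yj), divided by m(m-1)."""
    m = len(X)
    if len(Y) != m:
        raise ValueError("this estimator assumes equally sized samples")
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            total += (kernel(X[i], X[j]) - kernel(X[i], Y[j])
                      - kernel(Y[i], X[j]) + kernel(Y[i], Y[j]))
    return total / (m * (m - 1))

# Toy one-dimensional samples: 'near' overlaps nothing in 'far'.
near = [[0.0], [0.1], [0.2]]
far = [[5.0], [5.1], [5.2]]

print(mmd_u_statistic(near, near))  # identical samples
print(mmd_u_statistic(near, far))   # clearly different samples
```

The double loop makes the quadratic runtime in the number of data points explicit. Because only kernel evaluations are needed, a string or graph kernel could be passed as `kernel` without changing the estimator.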
Attractive for bioinformatics
MMD
  two-sample test in terms of kernels
Computationally attractive
  search infinite space of functions by evaluating one expression
  no optimization problem has to be solved
All thanks to kernels!
Wide applicability
  for one- and higher-dimensional vectorial data,
  but also for structured data!
  two-sample problems can now be tackled on
    strings: protein and DNA sequences
    graphs: molecules, protein interaction networks
    time series: time series of microarray data
    and sets, trees, ...
Cross-platform comparability
Data
  microarray data from two breast cancer studies
  one on cDNA platform (Gruvberger et al., 2001)
  other on oligonucleotide microarray platform (West et al., 2001)
Task
  Can MMD help to find out if two sets of observations were generated by
    the same study (both from Gruvberger or both from West)?
    different studies (one Gruvberger, one West)?
Experiment
  sample size each: 25
  dimension of each data point: 2,116
  significance level: α = 0.05
  100 times: 1 sample from Gruvberger, 1 from West
  100 times: both from Gruvberger or both from West
  report percentage of correct decisions
  compare to t-test, Friedman-Rafsky Wald-Wolfowitz and Smirnov
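The evaluation protocol can be sketched as a resampling loop around any two-sample test. Everything concrete below is an assumption for illustration: the pools are scalar toy data rather than the 2,116-dimensional microarray profiles, `toy_rejects` is a crude stand-in for a real test at α = 0.05, and drawing the two same-study samples without overlap is one plausible reading of the setup.

```python
import random
import statistics

def percent_correct(pool_a, pool_b, rejects, sample_size=25, repeats=100, seed=0):
    """Return (% correct on different-study pairs, % correct on same-study pairs)."""
    rng = random.Random(seed)
    correct_different = 0
    correct_same = 0
    for _ in range(repeats):
        # One sample from each study: a correct test rejects p = q.
        x = rng.sample(pool_a, sample_size)
        y = rng.sample(pool_b, sample_size)
        correct_different += rejects(x, y)
        # Two disjoint samples from a single study: a correct test does not reject.
        pool = rng.choice([pool_a, pool_b])
        drawn = rng.sample(pool, 2 * sample_size)
        correct_same += not rejects(drawn[:sample_size], drawn[sample_size:])
    return 100.0 * correct_different / repeats, 100.0 * correct_same / repeats

# Toy stand-ins for the two studies (hypothetical values, not real data).
pool_a = [0.01 * i for i in range(60)]
pool_b = [10.0 + 0.01 * i for i in range(60)]

def toy_rejects(x, y):
    """Placeholder decision rule: reject when the sample means differ by more than 5."""
    return abs(statistics.mean(x) - statistics.mean(y)) > 5.0

different_pct, same_pct = percent_correct(pool_a, pool_b, toy_rejects)
print(different_pct, same_pct)
```

Replacing `toy_rejects` with a function that thresholds a test statistic would reproduce the structure of the comparison. The percentages printed here say nothing about the actual experiment.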
Kernel-based statistical test
novel statistical test for two-sample problem:
  easy to implement
  non-parametric
  first for structured data
  best on high-dimensional data
  quadratic runtime w.r.t. the number of data points
  impressive accuracy in our experiments
kernel method for two-sample problem:
  all kernels recently defined in molecular biology can be re-used for data integration
  applicable to vectors, strings, sets, trees, graphs and time series
Biclustering
Clustering in two dimensions
  alternative names: co-clustering, two-mode clustering
  A bicluster is a subset of genes that show similar activity patterns under a subset of conditions.
  Clustering in 2 dimensions
    Cluster patients and conditions
  Earliest work by Hartigan, 1972: Divide a matrix into submatrices with minimum variance.
  Most interesting cases are NP-complete.
  Many extensions in bioinformatics (e.g. Cheng and Church, 2000)
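To make the minimum-variance idea concrete, the sketch below scores one candidate bicluster by the variance of its entries. It only evaluates a given choice of rows and columns; it does not search for good submatrices, which is the hard part. The function name and the toy matrix are assumptions.

```python
def submatrix_variance(matrix, rows, cols):
    """Population variance of the entries selected by the given rows and columns."""
    values = [matrix[r][c] for r in rows for c in cols]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Toy expression matrix: rows are genes, columns are conditions.
expression = [
    [1.0, 1.0, 9.0],
    [1.0, 1.0, 8.0],
    [7.0, 3.0, 2.0],
]

# Genes 0 and 1 behave identically under conditions 0 and 1.
bicluster_score = submatrix_variance(expression, rows=[0, 1], cols=[0, 1])
whole_score = submatrix_variance(expression, rows=[0, 1, 2], cols=[0, 1, 2])
print(bicluster_score, whole_score)
```

A constant submatrix scores zero, so lower values indicate more homogeneous biclusters under this criterion.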
References and further reading
References
[1] Gretton, Borgwardt, Rasch, Schölkopf, Smola: A kernel method for the two-sample problem. NIPS 2006
The end
See you tomorrow! Next topic: Feature Selection in Bioinformatics