ICA-based Clustering of Genes from Microarray Expression Data Su-In Lee 1, Serafim Batzoglou 2...

ICA-based Clustering of Genes from Microarray Expression Data

Su-In Lee¹, Serafim Batzoglou²
silee@stanford.edu, serafim@cs.stanford.edu
¹Department of Electrical Engineering, ²Department of Computer Science, Stanford University

1. ABSTRACT
To cluster genes from DNA microarray data, we propose an unsupervised methodology using independent component analysis (ICA). Based on an ICA mixture model of genomic expression patterns, linear and nonlinear ICA find components that are specific to certain biological processes. Genes that exhibit significant up-regulation or down-regulation within each component are grouped into clusters. We test the statistical significance of enrichment of gene annotations within each cluster. ICA-based clustering outperformed other leading methods in constructing functionally coherent clusters on various datasets. This result supports our model of genomic expression data as a composite effect of independent biological processes. A comparison of clustering performance among various ICA algorithms, including a kernel-based nonlinear ICA algorithm, shows that nonlinear ICA performed best for small datasets and natural-gradient maximum-likelihood estimation worked well for all the datasets.

2. GENE EXPRESSION MODEL
The expression pattern of genes in a given condition is a composite effect of the independent biological processes that are active in that condition. For example, suppose that there are 9 genes and 3 biological processes taking place inside a cell.

[Figure: Genes 1 to 9 and three biological processes (Ribosome Biosynthesis, Oxidative Phosphorylation, Cell Cycle Regulation), shown with the genome, the messenger RNA, and the resulting expression pattern in an experimental condition.]

Each biological process becomes active by turning on the genes associated with it.

The observed genomic expression pattern can be seen as a combined effect of the genomic expression programs of the biological processes that are active in that condition.

We can measure the expression levels of genes using microarrays.

3. Microarray Data
Microarray data display the expression levels of a set of genes measured in various experimental conditions.

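[Figure: microarray data matrix with gene labels G1, G2, ..., GN-1, GN and experiment labels Exp 1, Exp 2, Exp 3. One slice of the matrix gives the expression levels of a gene Gi across experimental conditions; the other gives the expression patterns of genes under an experimental condition Expi.]

Examples of conditions: heat shock, G phase in cell cycle, etc.
Examples of samples: liver cancer patient, normal person, etc.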

4. Mathematical Modeling
The expression measurements of K genes observed in three conditions, denoted by x1, x2 and x3, can be expressed as linear combinations of the genomic expression programs of three biological processes, denoted by s1, s2 and s3.

[Figure: genomic expression programs of biological processes pass through an unknown mixing system to produce the genomic expression patterns observed in certain experimental conditions. Labels in the figure: Ribosome Biogenesis, Heat Shock, Starvation, Hyper-Osmotic Shock.]

Given a microarray dataset, can we recover genomic expression programs of biological processes?

In other words, can we decompose a matrix X into A and S so that each row of S represents a genomic expression program of a biological process?
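The mixing model X = AS can be illustrated with a small sketch. The numbers below are hypothetical and chosen only to echo the 9-gene, 3-process example; `mix`, `A`, `S` and `X` are illustrative names, and the values are not taken from any dataset.

```python
def mix(A, S):
    """Return X = A S for matrices given as lists of rows."""
    n_cond, n_proc, n_genes = len(A), len(S), len(S[0])
    return [[sum(A[c][p] * S[p][g] for p in range(n_proc))
             for g in range(n_genes)]
            for c in range(n_cond)]

# Hypothetical expression programs: 3 processes (rows) x 9 genes (columns).
# Each process turns a different group of three genes up or down.
S = [
    [2.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0, -1.0, -1.0],
]

# Hypothetical mixing matrix: 3 conditions (rows) x 3 processes (columns).
# Entry A[c][p] is how strongly process p is active in condition c.
A = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 1.0],
    [0.5, 0.5, 0.0],
]

# Each row of X is the expression pattern observed in one condition.
X = mix(A, S)
for row in X:
    print(row)
```

In a real experiment only X is observed; the question posed above is whether A and S can be recovered from X alone.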

5. ICA Algorithm
Using the log-likelihood maximization approach, we can find the W that maximizes the log-likelihood L(y,W).

8. Microarray Datasets
For testing, five microarray datasets were used. For each dataset, the clustering performance of our approach was compared with that of another approach applied to the same dataset.

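Equations for the ICA algorithm (Section 5):

Density of the observations: $p(x) = |\det(W)|\, p(y)$

The $y_i$'s are assumed to be statistically independent: $p(y) = \prod_i p_i(y_i)$

Log-likelihood: $L(y, W) = \log p(x) = \log|\det(W)| + \sum_i \log p_i(y_i)$

Update rule: $\Delta W \propto \frac{\partial L(y, W)}{\partial W} = (W^T)^{-1} - \varphi(y)\, x^T$, where $\varphi(y) = (\varphi_1(y_1), \ldots, \varphi_n(y_n))^T$ and $\varphi_i(y_i) = -\frac{\partial \log p_i(y_i)}{\partial y_i}$

Prior information on y: super-Gaussian or sub-Gaussian?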
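As a rough illustration of the log-likelihood maximization in Section 5, the sketch below performs one batch step of the natural-gradient variant. Right-multiplying the ordinary gradient by W^T W gives the update (I - phi(y) y^T) W, which needs no matrix inverse. The function name, the learning rate and the choice phi(y) = tanh(y) (a super-Gaussian prior with p_i(y) proportional to 1/cosh(y)) are assumptions made for this example, not details given on the poster.

```python
import math

def natural_gradient_step(W, X, lr=0.1):
    """One batch natural-gradient ascent step on the ICA log-likelihood.

    W:  n x n unmixing matrix, as a list of rows.
    X:  n x K data matrix; each column is one observation x.
    lr: learning rate (hypothetical default).

    Assumes a super-Gaussian prior p_i(y) ~ 1/cosh(y), so that
    phi(y) = -d/dy log p_i(y) = tanh(y).
    """
    n, K = len(W), len(X[0])
    # Current source estimates: Y = W X.
    Y = [[sum(W[i][j] * X[j][k] for j in range(n)) for k in range(K)]
         for i in range(n)]
    # G = I - (1/K) * sum over observations of phi(y) y^T.
    G = [[(1.0 if i == j else 0.0)
          - sum(math.tanh(Y[i][k]) * Y[j][k] for k in range(K)) / K
          for j in range(n)]
         for i in range(n)]
    # W <- W + lr * G W.
    return [[W[i][j] + lr * sum(G[i][m] * W[m][j] for m in range(n))
             for j in range(n)]
            for i in range(n)]

# Usage: start from the identity and apply one step to a tiny data matrix.
W0 = [[1.0, 0.0], [0.0, 1.0]]
X_demo = [[1.0, -1.0, 0.5], [0.2, 0.3, -0.4]]
W1 = natural_gradient_step(W0, X_demo)
print(W1)
```

In practice the step would be repeated until W stops changing, and a sub-Gaussian prior would call for a different phi.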

7. Measuring significance of ICA-based clusters
The statistical significance of the biological coherence of clusters was measured using gene annotation databases such as Gene Ontology (GO).

6. ICA-based Clustering

Step 1: Apply ICA to microarray data X to obtain Y.
Step 2: Cluster genes based on the independent components, the rows of Y.

Based on our gene expression model, the independent components y1, ..., yn are assumed to be expression programs of biological processes. For each yi, genes are ordered by their activity level on yi, and the C% (C = 7.5) of genes showing significantly high or low levels are grouped into a cluster.
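Step 2 can be sketched as follows for a single independent component. The text does not say whether high-load and low-load genes go into one cluster or two; this example assumes two clusters per component, each holding C% of the genes, and the function name and rounding rule are hypothetical.

```python
def clusters_from_component(y, gene_ids, c_percent=7.5):
    """Pick the C% highest-load and C% lowest-load genes on one component.

    y:        load of every gene on one independent component (a row of Y).
    gene_ids: gene identifiers, in the same order as y.
    Returns (high, low), two lists of gene identifiers.
    Assumption: the cluster size is C% of the gene count, rounded down,
    with a minimum of one gene.
    """
    n_pick = max(1, int(len(y) * c_percent / 100.0))
    order = sorted(range(len(y)), key=lambda g: y[g])  # ascending load
    low = [gene_ids[g] for g in order[:n_pick]]
    high = [gene_ids[g] for g in order[-n_pick:]]
    return high, low

# Usage with 40 hypothetical genes whose loads run from -20 to 19.
genes = ["g%d" % i for i in range(40)]
loads = [float(i - 20) for i in range(40)]
high, low = clusters_from_component(loads, genes)
print(high, low)
```

Running the function once per row of Y yields the full set of clusters that is then scored against GO categories.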

[Figure: clusters from ICA (Cluster 1, Cluster 2, Cluster 3, ..., Cluster n) matched against GO categories (GO 1, GO 2, ..., GO i, ..., GO m); Cluster i and GO j share k genes.]

For every combination of one of our clusters and a GO category, we calculated the p-value: the chance probability, based on the hypergeometric distribution, that the cluster and the GO category share the observed number of genes.

g: # of genes in all clusters and GOs
f: # of genes in GO j
n: # of genes in Cluster i
k: # of genes GO j and Cluster i share
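A minimal sketch of this p-value, using the symbols g, f, n and k defined above. It assumes the p-value is the upper tail of the hypergeometric distribution, that is, the probability of sharing at least k genes by chance; the text only says "the observed number of genes", so the tail convention and the function name are assumptions.

```python
from math import comb

def enrichment_p_value(g, f, n, k):
    """Chance probability that Cluster i and GO j share at least k genes.

    g: number of genes in all clusters and GO categories
    f: number of genes in GO j
    n: number of genes in Cluster i
    k: number of genes shared by GO j and Cluster i
    """
    total = comb(g, n)  # ways to draw a cluster of n genes out of g
    upper = min(f, n)   # the overlap cannot exceed either set size
    tail = sum(comb(f, i) * comb(g - f, n - i) for i in range(k, upper + 1))
    return tail / total

# Usage with small hypothetical counts.
print(enrichment_p_value(g=10, f=5, n=5, k=5))
```

A cluster and GO category pair with a small p-value shares more genes than chance alone would explain.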

9. Results
For each method, the minimum p-values (< 10^-7) corresponding to each GO functional class were collected and compared.

Microarray datasets used for testing:

ID   Description                         Genes   Exps   Compared with
D1   Yeast during cell cycle             5679    22     PCA
D2   Yeast during cell cycle             6616    17     k-means clustering
D3   Yeast under stressful conditions    6152    173    Bayesian approach, Plaid model
D4   C. elegans in various conditions    17817   553    Topomap approach
D5   19 kinds of normal human tissue     7070    59     PCA
