Microarrays, Expression, and Regulatory Networks

Thanks to Prof. Mehmet Koyuturk, Case Western Reserve University.

Central Dogma

3. DNA Microarrays

A functional protein (or sometimes, RNA) that is coded by a particular gene is often called the product of that gene

Gene Expression

Gene expression is the process of synthesizing a functional gene product (protein or RNA) from a segment of DNA that specifies inheritable information (a gene)

In a multicellular organism, all cells contain identical genomes, but different genes are expressed in different types of cells

Regulation of gene expression: development, response to environmental signals

Studying Genome-wide Expression

The types and expression levels of expressed genes are linked to the phenotype of a cell

The concentration of mRNA corresponding to each gene in the genome provides a measure of gene expression

Gene expression is regulated at various stages, so mRNA concentration is not necessarily a perfect indicator of the activity of a gene (in terms of the activity of the gene product)

Splice variants

Post-translational modification

Proteomics

Research Questions

Knowledge of genome-wide expression makes it possible to study fundamental questions related to gene expression

What genes are expressed in what cell types (e.g., different tissues)?

How does gene expression change over time (e.g., cell cycle)?

How does a certain type of disease influence the expression of one or more genes to alter phenotype?

How do the expression levels of different groups of genes change under different conditions?

How do genes regulate each other’s expression?

DNA Microarray Technology

Measure the amount of mRNA (corresponding to each gene) present in a given cell, in bulk (for thousands of genes)

The major tool in transcriptomics

It is possible to measure the expression of a large number of genes (often, the entire genome) in a single sample

Makes it possible to compare the expression levels of several genes in one sample

Large-scale application of traditional techniques: hybridization-based methods

Hybridization

The process of joining two complementary strands of DNA or one each of DNA and RNA to form a double-stranded molecule

Key idea in microarray technology

What is a DNA Microarray?

A DNA microarray is a slide onto which a regular pattern of spots is deposited

Each spot contains many copies of a specified single-stranded DNA sequence (i.e., multiple biologically identical sequences)

All sequences are chemically bonded to the surface of the slide

There is a different DNA sequence at each spot (i.e., the sequences at different spots are biologically different)

Spots are small: thousands of spots can fit on a single slide a few centimeters across

How do DNA Microarrays work?

The DNA sequences in spots act as probes that hybridize with complementary sequences

If a complementary sequence exists in the sample, the corresponding DNA sequence in a spot will hybridize

Solutions extracted from tissue samples contain large numbers of mRNAs of many different types that happen to be present in the cells at the time of the experiment

The amount of mRNA hybridized in a spot provides an estimate of the concentration of the mRNA in the sample

Each spot targets a different type of mRNA (gene)

Measuring mRNA Concentration

The spots in which hybridization takes place can be visualized using fluorescence techniques

Sequences in the sample are fluorescently labeled

The DNA that hybridizes is visually identifiable as glowing spots on the array

Spots that have nothing hybridized are not visible

The intensity of fluorescence at each spot is proportional to the amount of the corresponding type of mRNA in the sample

DNA microarrays can detect presence of sequences corresponding to all spots simultaneously

Types of Microarrays

1. Oligonucleotide arrays

2. cDNA arrays

Oligonucleotide Arrays

Oligo: just a few, scanty

Oligonucleotide arrays use short DNA sequences (in the spots)

Usually 25 nucleotides

Several spots correspond to one gene

Oligonucleotides are synthesized in situ based on the sequences, one base at a time

Sometimes called a chip

Commercially manufactured by Affymetrix

Production of Oligonucleotide Arrays

Photolithography & combinatorial chemistry

Hybridization Specificity

Each oligonucleotide should hybridize to a specific gene in the organism

Short sequences => cross-hybridization is more probable

There are a lot of genes in an organism with related sequences

Perfect match/mismatch (PM/MM) probe strategy

Two spots for each oligonucleotide

PM: identical to the target

MM: differs only at the base in the middle of the sequence

Assumption: non-specific binding is identical for PM and MM probes

PM - MM provides a measure of specific hybridization

Use of Oligonucleotide Arrays

cDNA Arrays

A cDNA is a DNA strand synthesized using a reverse transcriptase enzyme, which makes a DNA sequence that is complementary to an RNA template

Reverse of what happens in transcription

It is possible to synthesize cDNAs from mRNAs present in cells

There are cDNA libraries that contain sequences of genes known to be expressed in particular cell types

Use cDNA sequences as probe sequences on microarrays

Knowledge of sequence is not necessary

Experimentally identifying a set of suitable cDNAs is sufficient

cDNA Arrays

cDNAs are quite long: 500-2000 bases

Hybridization is much more specific

A cDNA contains a large fraction of a gene sequence, but not necessarily the entire gene

Generally, one spot is adequate to recognize a single gene

The process of array manufacture is less reproducible

It is not easy to control the amount of DNA at each spot

It is not usually possible to compare absolute intensities of spots from different slides

Use two samples on one array!

Two-color Hybridization

Two samples

One test sample

One control (reference) sample

Prepare RNA extracts from each sample separately

Make cDNA from each sample using nucleotides labeled with a different color

Reference sample: green (Cy3)

Test sample: red (Cy5)

Mix the labeled populations and let the mixture hybridize with the array

cDNAs from different samples should bind to a spot in proportion to their concentrations

Two-Color Hybridization

Red spot: The gene is expressed significantly more in the test sample

Green spot: The gene is expressed significantly more in the reference sample

Yellow spot: The gene has about the same expression level in both samples

The ratio of red intensity to green intensity provides a measure of relative expression

Use of cDNA Arrays

Use of Two-Color Hybridization

Compare cell/tissue samples

Cells before and after an experimental perturbation

Successive times during a temporally staged process

Between stages of differentiation

Mutant cell vs. wild type

How do we compare multiple samples?

Time-course experiments

Comparing Multiple Samples

Choose a single reference sample

Need not be related to samples being examined

Time-course experiments: initial sample

Since the concentration of each mRNA in the reference sample is fixed, the relative expression with respect to the reference sample provides a fair comparison between all other samples

The reference sample should provide a hybridization signal for each gene (should have non-zero mRNA concentration)

Approximation to ideal reference sample: equal mixture of material from all samples

Oligonucleotide vs. cDNA Arrays

cDNA does not require probe design

cDNA provides higher specificity due to longer sequences of targets

However, cDNA may contain repetitive sequences that are often found in various genes

Techniques like PM/MM enhance specificity of oligonucleotides

cDNA arrays are more useful on a global level

Screening steady-state mRNA expression levels

Oligonucleotide arrays are more useful when more precise analysis is required

SNPs

Relative Expression

$R_i$: red intensity (test sample)

$G_i$: green intensity (reference sample)

Intensity ratio: $T_i = R_i / G_i$

If $T_i > 1$, the gene is up-regulated in the test sample

If $T_i < 1$, the gene is down-regulated in the test sample

Eliminates spot-to-spot variability to a certain extent

Channel Normalization

There are millions of individual mRNA molecules in one sample

It can be assumed that the average mass of each molecule is approximately the same

It can be assumed that arrayed elements represent a random sampling of the genes in the organism

We use two samples of equal mass, so the total hybridization intensities should be the same

log Ratio

$M_i = \log_2(R_i / G_i)$

Log-transformation makes the distribution closer to a normal distribution

$M_i = 1$ => gene i’s expression level is doubled

$M_i = -1$ => gene i’s expression level is halved

$M_i = 0$ => gene i’s expression level is unchanged

Average Intensity of a Spot

$A_i = \log_2 \sqrt{R_i G_i} = \frac{1}{2} \log_2 (R_i G_i)$

log-scaled geometric average of the intensities for the test and reference samples

A measure of the overall expression of a gene
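
The log ratio and average intensity can be computed per spot as in the following minimal sketch. The function name `ma_values` and the toy intensities are illustrative assumptions, and base-2 logarithms are assumed for both quantities.

```python
import math

def ma_values(red, green):
    """Return (M, A) lists for paired red (test) and green (reference) intensities."""
    # M: log2 ratio, the change in expression between the two samples
    M = [math.log2(r / g) for r, g in zip(red, green)]
    # A: log2 of the geometric mean, the overall expression of the gene
    A = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return M, A

# Hypothetical spot intensities: 4x up, unchanged, 4x down
red = [400.0, 100.0, 50.0]
green = [100.0, 100.0, 200.0]
M, A = ma_values(red, green)
```

Plotting `M` against `A` gives the ratio/intensity (M/A) plot.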

Ratio/Intensity Plot

x axis: overall expression of a gene

y axis: change in expression of a gene (across samples)

Mean Relative Intensity

For a gene whose expression level has not changed, we expect that $R_i / G_i = 1$, so that $M_i = 0$

Most genes should have unchanged expression level

In our example, most points are below the horizontal axis

This is likely to be because of a systematic bias, rather than suggesting that most genes are down-regulated in the experiment

Dye bias

Efficiency of labeling in two DNA populations may be different

Binding between DNA and probe may be affected by the dye in a systematic way

Efficiency of detecting fluorescent signal may be different

Array Normalization

Used to minimize systematic variations in the gene expression levels of the two samples hybridized to the array, and to allow comparison of gene expression levels across multiple slides

Main assumption: after log-transformation, the distribution of relative intensity values approaches a normal distribution

Housekeeping Genes

Normalize using housekeeping genes

A housekeeping gene is one that is assumed to be expressed at a constant level that does not change between reference and test samples

Shift data so that we will have Mi = 0 for housekeeping genes

It is not easy to find genes whose expression will surely remain unchanged

Global Normalization

Subtract the mean relative intensity over all spots from all spots so that the mean will be zero

$\hat{M}_i = M_i - \bar{M}$, where $\bar{M}$ is the mean of $M_i$ over all spots

All these methods are global in the sense that they only change the position of the cloud of points in the M/A plot, not the shape
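
A minimal sketch of global normalization, assuming the log ratios are held in a plain list; the function name `global_normalize` is hypothetical.

```python
def global_normalize(M):
    """Shift all log ratios so that their mean over the array is zero."""
    mean_m = sum(M) / len(M)
    return [m - mean_m for m in M]

# Hypothetical log ratios with a systematic offset of +1
M_raw = [1.0, -1.0, 3.0]
M_hat = global_normalize(M_raw)
```

Differences between spots are preserved, so only the position of the point cloud changes.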

Self-Normalization

Dye-flip experiments

Another way of eliminating dye bias

Perform a second experiment in which the red and green labeling of samples is done in reverse

Subtract $M_i$ values from each other

Result will be twice the unbiased $M_i$ value, since the term that corresponds to bias will be canceled out

The normalized value of each spot depends only on the measured intensity ratios for that spot

Bias is assumed to be independent in all spots

Bias is assumed to be reproducible between arrays
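
The cancellation can be illustrated with a small sketch. It assumes an additive bias model that is not spelled out in the slides: the forward array measures (true value + bias) and the dye-flipped array measures (-true value + bias) at the same spot. The function name is hypothetical.

```python
def dye_flip_normalize(M_forward, M_flipped):
    """Half the difference of the two log ratios; a per-spot additive bias cancels."""
    return [(mf - mr) / 2.0 for mf, mr in zip(M_forward, M_flipped)]

# Toy data under the assumed model: measured = +/- true + bias
true_M = [1.0, -0.5, 0.0]
bias = [0.3, 0.3, 0.1]
M_forward = [t + b for t, b in zip(true_M, bias)]
M_flipped = [-t + b for t, b in zip(true_M, bias)]
recovered = dye_flip_normalize(M_forward, M_flipped)
```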

Intensity-Dependent Bias

Bias may depend on the average intensity on a spot

In our example, there is an upward trend in the $M_i$ values for higher values of $A_i$

Whether a gene is up- or down-regulated (in a global sense) should not depend on its average expression level

Fluorescence detector may be saturated at high intensity

LOWESS

LOcally WEighted Scatterplot Smoothing

Fit a smooth curved function $m(A)$ through the data points

This is an estimate of bias as a function of average intensity

Correct values as $\hat{M}_i = M_i - m(A_i)$

The shift depends on the average intensity on the spot, but the function that determines shift is global

Neither global nor self-normalization
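
The intensity-dependent correction can be sketched as follows. This is a simplified stand-in for LOWESS, not the full procedure: it fits a tricube-weighted local line around each point and omits the robustness iterations. The function names and the `frac` parameter (fraction of points in each local window) are assumptions for illustration.

```python
import math

def local_fit(A, M, a0, frac=0.5):
    """Estimate m(a0) with a tricube-weighted linear fit on the nearest points."""
    k = max(2, math.ceil(frac * len(A)))
    # Bandwidth: distance from a0 to its k-th nearest neighbor
    h = sorted(abs(a - a0) for a in A)[k - 1] or 1e-12
    w = [(1.0 - min(abs(a - a0) / h, 1.0) ** 3) ** 3 for a in A]
    sw = sum(w)
    if sw == 0.0:
        return sum(M) / len(M)
    mean_a = sum(wi * a for wi, a in zip(w, A)) / sw
    mean_m = sum(wi * m for wi, m in zip(w, M)) / sw
    sxx = sum(wi * (a - mean_a) ** 2 for wi, a in zip(w, A))
    sxy = sum(wi * (a - mean_a) * (m - mean_m) for wi, a, m in zip(w, A, M))
    slope = sxy / sxx if sxx > 1e-12 else 0.0
    return mean_m + slope * (a0 - mean_a)

def lowess_normalize(A, M, frac=0.5):
    """Subtract the fitted intensity-dependent bias m(A_i) from each M_i."""
    return [m - local_fit(A, M, a, frac) for a, m in zip(A, M)]

# Toy data: a purely intensity-dependent bias, M = 0.5 * A - 4
A_vals = [6.0 + i for i in range(8)]
M_vals = [0.5 * a - 4.0 for a in A_vals]
M_corrected = lowess_normalize(A_vals, M_vals)
```

In this toy case the trend is exactly linear, so the corrected values are numerically zero; on real data the fitted curve follows the local trend of the M/A plot.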

Normalization by LOWESS

Gene Normalization

Array normalization makes arrays cross-comparable

Two identically expressed genes in terms of Cy5 intensities may end up having different log ratios

Solution: Center expression values for each gene so that each gene will have mean (or median) expression value of 0

Example (on blackboard)
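
A small sketch of gene normalization by mean-centering, assuming the expression matrix is a list of rows with one row per gene; the function name `center_genes` is hypothetical, and the median could be used in place of the mean.

```python
def center_genes(E):
    """Center each gene (row) so that its mean expression value is 0."""
    centered = []
    for row in E:
        mu = sum(row) / len(row)
        centered.append([x - mu for x in row])
    return centered

# Two hypothetical genes measured in three samples
E = [[1.0, 2.0, 3.0],
     [10.0, 10.0, 10.0]]
E_centered = center_genes(E)
```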

Gene Expression Matrix

[Figure: gene expression matrix, with samples along one axis]

Now, we are ready to analyze our data!

4. Gene Expression Data Analysis

Analyzing Gene Expression Data

Clustering

How are genes related in terms of their expression under different conditions?

Differential gene expression

Which genes are affected by change in condition, tissue, disease?

Classification (supervised analysis)

Given the expression profile for a gene, can we assign a function?

Given the expression levels of several genes in a sample, can we characterize the type of sample (e.g., cancerous or normal)?

Regulatory network inference

How do genes regulate each other’s expression to orchestrate cellular function?

Clustering

Group similar items together

Clustering genes based on their expression profiles

We can measure the expression of multiple genes in multiple samples

Genes that are functionally related should have similar expression profiles

Gene expression profile

A vector (or a point) in multi-dimensional space, where each dimension corresponds to a sample

Clustering of multi-dimensional real-valued data is a well-studied problem

Motivating Example

Expression levels of 2,000 genes in 22 normal and 40 tumor colon tissues (Alon et al. , PNAS,

Applications of Clustering

Functional annotation

If a gene with unknown function is clustered together with genes that perform a particular function, then that gene is likely to be associated with that function

Identification of regulatory motifs

If a group of genes are co-regulated, then it is likely that their regulation is modulated by similar transcription factors, so by looking for common elements in the neighborhood of the coding sequences of genes in a cluster, we can identify regulatory motifs and their location (promoters)

Modular analysis

Gene Expression Matrix

m genes, n samples

Generally, m >> n

$m = O(10^3)$, $n = O(10^1)$

$E = [e_{ij}]$, $1 \le i \le m$, $1 \le j \le n$

Each row is an n-dimensional vector

Expression profile: $e_i = [e_{i1}, e_{i2}, \ldots, e_{in}]^T$

Proximity Measures

How do we decide which genes are similar to each other?

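Euclidean distance

$\mathrm{Euclidean}(e_i, e_j) = \|e_i - e_j\|_2 = \sqrt{\sum_{k=1}^{n} (e_{ik} - e_{jk})^2}$

Manhattan distance

$\mathrm{Manhattan}(e_i, e_j) = \|e_i - e_j\|_1 = \sum_{k=1}^{n} |e_{ik} - e_{jk}|$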

Distance

Minkowski distance

General version of Euclidean, Manhattan, etc.

p is a parameter

$\mathrm{Minkowski}(e_i, e_j) = \|e_i - e_j\|_p = \left( \sum_{k=1}^{n} |e_{ik} - e_{jk}|^p \right)^{1/p}$
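
The three distances can be written as one function, since Manhattan and Euclidean are the p = 1 and p = 2 cases of Minkowski. A minimal sketch; the function names are illustrative.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two expression profiles."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def manhattan(x, y):
    return minkowski(x, y, 1)

def euclidean(x, y):
    return minkowski(x, y, 2)

# Two hypothetical profiles over two samples
d1 = manhattan([0.0, 0.0], [3.0, 4.0])
d2 = euclidean([0.0, 0.0], [3.0, 4.0])
```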

Normalization

If we want to measure the distance between directions rather than absolute magnitude, it may be necessary to standardize mean and variation of expression levels for each gene

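$e'_i = [e'_{i1}, e'_{i2}, \ldots, e'_{in}]^T$, where $e'_{ik} = (e_{ik} - \mu_i) / \sigma_i$, with $\mu_i$ and $\sigma_i$ the mean and standard deviation of $e_i$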

Correlation

The similarity between the variation of two random variables

A vector is treated as sampling of a random variable

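Covariance

$\mathrm{Var}[e_i] = \frac{1}{n} \sum_{k=1}^{n} (e_{ik} - \mu_i)^2$, where $\mu_i$ is the mean of $e_i$

$\mathrm{Cov}[e_i, e_j] = \frac{1}{n} \sum_{k=1}^{n} (e_{ik} - \mu_i)(e_{jk} - \mu_j)$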

Pearson Correlation Coefficient

Pearson correlation coefficient

$\mathrm{Pearson}(e_i, e_j) = \frac{\mathrm{Cov}[e_i, e_j]}{\sqrt{\mathrm{Var}[e_i]\,\mathrm{Var}[e_j]}}$

Pearson correlation is equal to the cosine of the angle between (or inner product of) normalized expression profiles

Pearson correlation is normalized

$-1 \le \mathrm{Pearson}(e_i, e_j) \le 1$

$\mathrm{Pearson}(e_i, e_j) = \mathrm{Pearson}(e'_i, e'_j)$

Euclidian Distance & Correlation

Euclidean distance (normalized) and Pearson correlation coefficient are closely related

$\mathrm{Euclidean}(e'_i, e'_j) = \sqrt{2n\,(1 - \mathrm{Pearson}(e_i, e_j))}$

These are the two most commonly used proximity measures in gene expression data analysis

Without loss of generality, we will use $\delta_{ij} = \delta(e_i, e_j)$ to denote the distance between two expression profiles
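
The relation between normalized Euclidean distance and Pearson correlation can be checked numerically. The sketch below assumes profiles are standardized to zero mean and unit variance with the variance normalized by n; under that assumption the squared distance equals 2n(1 - Pearson). Function names and the toy profiles are illustrative.

```python
import math

def mean(v):
    return sum(v) / len(v)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def standardize(x):
    """Shift to zero mean and scale to unit variance (variance divided by n)."""
    mx = mean(x)
    sd = math.sqrt(sum((a - mx) ** 2 for a in x) / len(x))
    return [(a - mx) / sd for a in x]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = [1.0, 2.0, 3.0, 4.0, 6.0]
y = [2.0, 1.0, 5.0, 3.0, 8.0]
lhs = euclidean(standardize(x), standardize(y))
rhs = math.sqrt(2 * len(x) * (1.0 - pearson(x, y)))
```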

Other Measures of correlation

Pearson is vulnerable to outliers

If two genes have very high expression in a single sample, it might dominate and make the two expression profiles appear highly correlated

Jackknife correlation: estimate n correlations by leaving each dimension (sample) out in turn, then take the minimum among them

Pearson is not robust for non-Gaussian distributions

Spearman’s rank order correlation coefficient: rank expression levels, replace each expression level with its rank

More robust against outliers

A lot of loss of information
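
A sketch of both alternatives on a toy pair of profiles that share one outlier sample. The function names are illustrative, and the ranking helper assumes there are no tied values (ties would need averaged ranks).

```python
import math

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def ranks(values):
    """Replace each value with its rank (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order):
        result[index] = position + 1
    return result

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def jackknife(x, y):
    """Minimum Pearson correlation over all leave-one-sample-out profiles."""
    return min(
        pearson(x[:k] + x[k + 1:], y[:k] + y[k + 1:])
        for k in range(len(x))
    )

# The last sample is an outlier that dominates the plain Pearson correlation
x = [1.0, 2.0, 3.0, 4.0, 100.0]
y = [2.0, 1.0, 4.0, 3.0, 90.0]
r_pearson, r_spearman, r_jackknife = pearson(x, y), spearman(x, y), jackknife(x, y)
```

Here the plain Pearson correlation is close to 1, while the rank-based and jackknife values are noticeably lower because they discount the single extreme sample.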

Clustering Methods

Hierarchical clustering

Group genes into a tree (a.k.a. dendrogram), so that each branch of the tree corresponds to a cluster

Higher branches correspond to coarser clusters

Partitioning

Partition genes into several groups so that similar genes will be in the same partition

Hierarchical clustering

Direction of clustering

Bottom-up (agglomerative): start from individual genes, join them into groups until only one group is left

Top-down (divisive): start with one group consisting of all genes, keep partitioning groups until each group contains exactly one gene

Agglomerative clustering is computationally less expensive

Why?

Hierarchical clustering methods are greedy

Once a decision is made, it cannot be undone

Agglomerative clustering

Start with m clusters: Each cluster contains one gene

At each step, choose two clusters that are closest (or most correlated), merge them

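How do we evaluate the distance between two clusters?

Single-linkage: if clusters contain two very close genes, then the clusters are close to each other

$\delta(C_k, C_l) = \min_{i \in C_k,\, j \in C_l} \delta_{ij}$, with $\delta_{ij}$ the distance between genes i and j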

Agglomerative Clustering

Complete linkage: two clusters are close to each other only if all genes inside them are close to each other

$\delta(C_k, C_l) = \max_{i \in C_k,\, j \in C_l} \delta_{ij}$

Group average: two clusters are close to each other if their centers are close to each other

$\delta(C_k, C_l) = \frac{1}{|C_k|\,|C_l|} \sum_{i \in C_k} \sum_{j \in C_l} \delta_{ij}$
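
The three linkage rules plug into the same agglomerative loop. The following is a naive sketch (it recomputes cluster distances at every step, so it is only suitable for small inputs); the function names, the use of Euclidean distance, and the stopping rule `num_clusters` are assumptions for illustration.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def linkage_distance(ck, cl, dist, method):
    """Distance between clusters ck and cl (lists of gene indices)."""
    pairwise = [dist[i][j] for i in ck for j in cl]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    return sum(pairwise) / len(pairwise)  # group average

def agglomerate(profiles, num_clusters, method="single"):
    """Merge the two closest clusters until num_clusters remain."""
    m = len(profiles)
    dist = [[euclidean(profiles[i], profiles[j]) for j in range(m)] for i in range(m)]
    clusters = [[i] for i in range(m)]  # start with one gene per cluster
    while len(clusters) > num_clusters:
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda p: linkage_distance(clusters[p[0]], clusters[p[1]], dist, method),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Four hypothetical one-dimensional profiles forming two obvious groups
result = agglomerate([[0.0], [1.0], [10.0], [11.0]], 2, method="complete")
```

Recording the order of merges instead of stopping at `num_clusters` would give the dendrogram.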

Divisive Clustering

Recursive bipartitioning

Find an “optimal” partitioning of the genes into two clusters

Recursively work on each partition

Since the number of clusters is an issue for partitioning-based clustering algorithms, the magic number 2 solves a lot of problems

May be computationally expensive

The problem is “global”

At every level of the tree, we have to work on all of the genes

If the tree is imbalanced, there might be as many as m levels

With a reasonable stopping criterion, may be considered a partition-based clustering as well

Partition Based Clustering

Find groups of genes such that genes in each group are similar to each other, while being somewhat less similar to those in other clusters

Easily interpretable, especially for large datasets (as compared to hierarchical)

Number of Clusters

Clustering is “unsupervised”, so generally we do not have prior knowledge on how many clusters underlie the data

It is very difficult to partition data into an “unknown” number of clusters

Most algorithms assume that K (number of clusters) is known

Try different values of K, find the one that results in best clustering

Very expensive

Overlapping vs. Disjoint Clusters

Genes do not have a single function

Most genes might be involved in different processes, so their expression profiles might demonstrate similarities with different genes in different contexts

Can we allow a gene to be included in more than one cluster?

Allowing overlaps between clusters poses additional challenges

To what extent do we allow overlaps? (We definitely don’t want to identify two identical clusters)

Fuzzy Clustering

Assign weights to each gene-cluster pair, showing the extent (or likelihood) of the gene belonging to the cluster

Difficult interpretation

Partitioning is a special case of fuzzy clustering, where the weights are restricted to binary values

Hierarchical clustering is also “fuzzy” in some sense

Continuous relaxation might alleviate computational complexity as well

K-Means Clustering

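The most famous clustering algorithm

Given K, find K disjoint clusters such that the total intracluster variation is minimized

Cluster mean: $\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} e_i$

Intracluster variation: $V(C_k) = \sum_{i \in C_k} \delta(e_i, \mu_k)^2$

Total intracluster variation: $V = \sum_{k=1}^{K} V(C_k)$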

K-Means Algorithm

K-Means is an iterative algorithm that alternately updates cluster assignments and cluster centers based on each other’s values until no improvement is possible

1. Choose K expression profiles randomly, designate each of them as the center of one of the K clusters

2. Assign each gene to a cluster

2.1. Each gene is assigned to the cluster with closest center to its profile

3. Redetermine cluster centers

4. If any gene was moved, go back to Step 2, else stop
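
The steps above can be sketched in plain Python as follows. The function names, the `seed` and `max_iter` parameters, the use of squared Euclidean distance, and the choice to leave an empty cluster's center unchanged are assumptions made for illustration.

```python
import random

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(profiles, K, seed=0, max_iter=100):
    """Return (assignment, centers) after iterating until no gene moves."""
    rng = random.Random(seed)
    # Step 1: pick K profiles at random as the initial centers
    centers = [list(p) for p in rng.sample(profiles, K)]
    assignment = [None] * len(profiles)
    for _ in range(max_iter):
        moved = False
        # Step 2: assign each gene to the cluster with the closest center
        for i, p in enumerate(profiles):
            best = min(range(K), key=lambda k: sq_dist(p, centers[k]))
            if best != assignment[i]:
                assignment[i] = best
                moved = True
        # Step 4: stop when no gene changed cluster
        if not moved:
            break
        # Step 3: recompute each center as the mean of its members
        for k in range(K):
            members = [p for p, a in zip(profiles, assignment) if a == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

# Four hypothetical profiles forming two well-separated groups
profiles = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
assignment, centers = kmeans(profiles, K=2)
```

The result depends on the random initial centers in general, so in practice the algorithm is usually run several times and the solution with the lowest total intracluster variation is kept.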

Sample Run of K-Means

Self Organizing Maps

Just like K-means, we have K clusters, but this time they are organized into a map

Often a 2D grid

We want to organize clusters so that similar clusters will be in proximity in the map

A way of visualizing in low-dimensional (2D) space

Just like K-means, each cluster is associated with a weight vector

It was the cluster center in K-means

Each weight vector is first initialized randomly to some gene’s expression profile

SOM Algorithm

At each step, a gene is selected at random

The distance between the gene’s expression profile and each cluster’s weight vector is calculated, and the cluster with the closest weight vector becomes the winner

The winner’s and its neighbors’ (according to the 2D mapping) weight vectors are adjusted to represent the gene’s expression profile better

$w_k(t+1) = w_k(t) + \alpha(t)\,\theta(C_j, C_k)\,(e_i - w_k(t))$

$C_j$ is the winner cluster for gene i at time t

$\alpha$ is a decreasing function of time, $\theta$ is the neighborhood function
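
A single SOM update step might look like the sketch below. The Gaussian form of the neighborhood function, the `radius` parameter, and the function names are assumptions; the slides only state that the winner and its grid neighbors are moved toward the selected profile.

```python
import math

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def som_update(weights, grid, profile, alpha, radius):
    """Apply one update in place and return the index of the winner cluster.

    weights[k] is the weight vector of cluster k, grid[k] its (row, col) position.
    """
    winner = min(range(len(weights)), key=lambda k: sq_dist(weights[k], profile))
    for k in range(len(weights)):
        grid_d2 = (grid[k][0] - grid[winner][0]) ** 2 + (grid[k][1] - grid[winner][1]) ** 2
        # Assumed Gaussian neighborhood: 1 for the winner, decaying with grid distance
        theta = math.exp(-grid_d2 / (2.0 * radius ** 2))
        weights[k] = [w + alpha * theta * (e - w) for w, e in zip(weights[k], profile)]
    return winner

# Two clusters on a 1 x 2 grid and one hypothetical expression profile
weights = [[0.0, 0.0], [10.0, 10.0]]
grid = [(0, 0), (0, 1)]
winner = som_update(weights, grid, [1.0, 1.0], alpha=0.5, radius=1.0)
```

A full run would repeat this step for randomly selected genes while decreasing `alpha` (and typically `radius`) over time.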

Sample SOM Output

Gene Co-expression Network

Nodes represent genes

Weighted edges between nodes represent proximity (correlation) between genes’ expression profiles

This is indeed a way of predicting interactions between genes

Graph Theoretical Clustering

Partition the graph into heavy subgraphs

Maximize total weight (number of edges) inside a cluster

Minimize total weight (number of edges) between clusters

Heuristic algorithms

CLICK: recursive min-cut

CAST: iterative improvement one by one for each cluster

Loss of information?

Model Based Clustering

Generating model

Each cluster is associated with a distribution (that generates expression profiles for associated genes) specified by model parameters

The probability that a gene belongs to a cluster is specified by hidden parameters

Expectation Maximization (EM) algorithm

Start with a guess of model parameters

E-step: compute expected values of hidden parameters based on model parameters

M-step: based on hidden parameters, estimate model parameters to maximize the likelihood of observing the data at hand

Iterate

K-means is a special case

Evaluation of Clusters

In general, we want to maximize intra-cluster similarity, while minimizing inter-cluster similarity

Homogeneity, separation

Based on the proximity metric

Reference partition

Information on “true clusters” that comes from a different source (apart from expression data)

Molecular annotation (e.g., Gene Ontology)

Jaccard coefficient, sensitivity, specificity

Cluster annotation

Processes that are significantly enriched in a cluster

Homogeneity & Separation

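Heterogeneity (or homogeneity in reverse direction)

How similar are the genes in one cluster?

$H(C_k) = \frac{1}{\binom{|C_k|}{2}} \sum_{i, j \in C_k,\, i < j} \delta_{ij}$, with $\delta_{ij}$ the distance between genes i and j

Separation

How dissimilar are different clusters?

$S(C_k, C_l) = \frac{1}{|C_k|\,|C_l|} \sum_{i \in C_k} \sum_{j \in C_l} \delta_{ij}$

Good clustering: low heterogeneity, high separation (both are measured as distances)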

Overall Quality

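Overall heterogeneity

$H = \frac{1}{m} \sum_{k} |C_k|\, H(C_k)$

Overall separation

$S = \frac{1}{\sum_{k \ne l} |C_k|\,|C_l|} \sum_{k \ne l} |C_k|\,|C_l|\, S(C_k, C_l)$

How do these change with respect to the number of clusters?

Can we optimize these values to choose the best number of clusters?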

Bayesian Information Criterion

A statistical criterion for evaluating a model

Penalizes model complexity (number of free parameters to be estimated)

k is the number of free parameters in the model, which increases with the number of clusters

RSS is the “total error” in the model

Trade off the number of clusters against the optimization function to choose the best number of clusters

Reference Partitioning

If there is information about “ground truth” from an independent source, we can compare our clustering to such reference partitioning

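Pairwise assessment

Let $C_{ij} = 1$ if gene i and gene j are assigned to the same cluster by the clustering algorithm, 0 otherwise

Let $R_{ij} = 1$ if gene i and gene j are in the same cluster according to the reference partition, 0 otherwise

$n_{ab}$: number of gene pairs $(i, j)$, $i < j$, with $C_{ij} = a$ and $R_{ij} = b$, for $a, b \in \{0, 1\}$

Comparing Partitions

Rand index (symmetric)

$\mathrm{Rand} = \frac{n_{11} + n_{00}}{n_{11} + n_{10} + n_{01} + n_{00}}$

Jaccard coefficient (sparse)

$\mathrm{Jaccard} = \frac{n_{11}}{n_{11} + n_{10} + n_{01}}$

Minkowski measure (sparse)

$\mathrm{Minkowski} = \sqrt{\frac{n_{10} + n_{01}}{n_{11} + n_{01}}}$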
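
The pair counts and the three measures can be computed directly from two label vectors, as in this sketch. The function names are illustrative, and the Minkowski measure is normalized here by the number of pairs that are together in the reference partition, which is one common convention and an assumption about the slide's formula.

```python
import math
from itertools import combinations

def pair_counts(clustering, reference):
    """Count gene pairs by (same cluster in clustering?, same cluster in reference?)."""
    n = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for i, j in combinations(range(len(clustering)), 2):
        c = int(clustering[i] == clustering[j])
        r = int(reference[i] == reference[j])
        n[(c, r)] += 1
    return n

def rand_index(n):
    return (n[(1, 1)] + n[(0, 0)]) / sum(n.values())

def jaccard(n):
    return n[(1, 1)] / (n[(1, 1)] + n[(1, 0)] + n[(0, 1)])

def minkowski_measure(n):
    return math.sqrt((n[(1, 0)] + n[(0, 1)]) / (n[(1, 1)] + n[(0, 1)]))

# Hypothetical cluster labels for four genes
counts = pair_counts([0, 0, 1, 1], [0, 0, 0, 1])
scores = (rand_index(counts), jaccard(counts), minkowski_measure(counts))
```

Rand counts agreements on both "together" and "apart" pairs, while Jaccard and Minkowski ignore the usually dominant "apart in both" pairs.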

Cluster Annotation

Clustering results in groups of genes that are co-expressed (or co-regulated)

For each group, can we tell something about the biological phenomena that underlie our observation (their co-expression)?

We have partial knowledge on the function of many individual genes

Gene Ontology, COG (Clusters of Orthologous Groups), Pfam (protein domain families)

Taking a statistical approach, we can assign function to each group of genes

A function popular in a cluster is associated with that cluster

Gene Ontology

Ontology: study of being (e.g., conceptualization)

Gene Ontology is an attempt to develop a standardized library of cellular function

Unified view of life: processes, structures, and functions recur in diverse organisms

Three concepts of Gene Ontology

Biological process: A recognized series of events or molecular functions (e.g., cell cycle, development, metabolism)

Molecular function: What does a gene’s product do? (e.g., binding, enzyme activity, receptor activity)

Cellular component: Localization within the cell (e.g., membrane, nucleus, ubiquitin ligase complex)

Hierarchy in Gene Ontology

Gene Ontology is hierarchical

A process might have subprocesses

Seed maturation is part of seed development

A process might be described at different levels of detail

Seed dormancy is a(n example of) seed maturation

Same for function and component

Gene Ontology terms are related to each other via “is a” and “part of” relationships

If process A is part of process B, then A is B’s child (B is A’s parent); B involves A

If function C is a function D, then C is D’s child; C is a more detailed specification of D

GO Hierarchy is a DAG

Gene Ontology is hierarchical, but the hierarchy is not represented by a tree; it is represented by a directed acyclic graph (DAG)

A GO term can have multiple parents (and obviously a GO term might (should?) have multiple children)

Annotation

GO-based annotation assigns GO terms to a gene

A gene might have multiple functions, can be involved in multiple processes

Multiple genes might be associated with the same function, multiple genes take part in a process

True-path rule

If a gene is annotated with a term, then it is also annotated by its parents (consequently, all ancestors)

How does the number of genes associated with each term change as we go down the GO DAG?

GO Annotation of Gene Clusters

There are |C| genes in a cluster C

|T| genes are associated with GO term t

|C ∩ T| genes are in C and are associated with t

What is the association between cluster C and term t?

If we chose random clusters, would we be able to observe that at least this many (|C ∩ T|) of the |C| genes in C are associated with t?

What is the probability of this observation?

Statistical significance based on hypergeometric distribution

Hypergeometric Distribution

We have n items, m of which are good

If we choose r items from the entire set of items at random, what is the probability that at least k of them will be good?

$p = \sum_{i=k}^{\min(m, r)} \frac{\binom{m}{i} \binom{n-m}{r-i}}{\binom{n}{r}}$

n is the number of genes in the organism

m = |T|, r = |C|, k = |C ∩ T|

The lower p is, the more likely that there is an underlying association between the term and the cluster (the term is significantly enriched in the cluster)
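
The enrichment p-value is the upper tail of the hypergeometric distribution and can be computed exactly with integer binomial coefficients. A minimal sketch using `math.comb` from the Python standard library (Python 3.8 or later); the function name and the toy numbers are illustrative.

```python
from math import comb

def enrichment_pvalue(n, m, r, k):
    """P(at least k good items among r drawn without replacement from n, m of them good)."""
    tail = sum(comb(m, i) * comb(n - m, r - i) for i in range(k, min(m, r) + 1))
    return tail / comb(n, r)

# Hypothetical numbers: 10 genes in total, 4 annotated with the term,
# a cluster of 3 genes, 2 of which carry the term
p = enrichment_pvalue(n=10, m=4, r=3, k=2)
```

For genome-scale values of n the binomial coefficients become very large integers, which Python handles exactly, although a log-space computation may be preferable for speed.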

GO Hierarchy & Cluster Annotation

How specific (general) is the annotation we attach to a cluster?

If a cluster is larger, then it might correspond to a more general process

Some processes might be over-represented in the study set

How do we find the best location of a cluster in the GO hierarchy?

Parent-child annotation

Condition the probability of enrichment of a term in a cluster on the enrichment of its parent terms in the cluster

The gene space is defined as the set of genes that are associated with t’s parents

Parent-Child Annotation

Multiple Hypotheses Testing

The p-value for a single term provides an estimate of the probability of having the observed number of genes attached to that particular term

We have many terms; even if the likelihood of enrichment is small for a particular term, it might be very probable that some term will be enriched as much as observed in the cluster

We have to account for all hypotheses being tested simultaneously

Bonferroni correction: apply the union bound, i.e., sum the p-value over all hypotheses tested (multiply it by the number of tests)

Which terms should we consider while correcting for multiple hypotheses for a single term?

Representativity of Terms

How well does a significantly enriched term represent a cluster?

How many of the genes in the cluster are attached to the term?

How many of the genes attached to the term are in the cluster?

For term t that is significantly enriched in cluster C

Specificity: |C ∩ T|/|C|, a.k.a. precision

Sensitivity: |C ∩ T|/|T|, a.k.a. recall

Biclustering

A particular process might be active in certain conditions

A group of genes might be expressed (or up-regulated, suppressed, co-regulated, etc.) in only a subset of samples

They might behave almost independently under other conditions

Clustering vs. Biclustering

Clustering is a global approach

Each gene is a point in the space defined by all samples

How about points that are clustered in a subspace?

Biclustering: while clustering genes, also choose a set of dimensions (samples) that provides best clustering, and vice versa

a.k.a. co-clustering, subspace clustering…

This is a much harder problem, because you are not only trying to find groups of points that are close to each other in multi-dimensional space, but also trying to identify a subspace in which groups are more evident

Biclustering Applications

Sample/tissue classification for diagnosis

The samples with leukemia show specific characteristics for a subset of genes

Identification of co-regulated genes

Certain sets of genes exhibit coherent activations under specific conditions (while behaving more or less arbitrarily with respect to each other under other conditions)

Functional annotation

Biological processes, functional classes are overlapping

Different sets of samples reveal different functional relationships

Biclustering Principles

A cluster of genes is defined with respect to a cluster of samples and vice versa

The clusters are not necessarily exclusive or exhaustive

A gene/condition may belong to more than one cluster

A gene/condition may not belong to any cluster at all

Biclusters are not “perfect”

Noise

Statistical inference becomes particularly important

Biclustering Formulation

Given a gene expression matrix A with gene set G and sample set S, a bicluster is defined by a subset of genes I and a subset of samples J

General idea: a bicluster is a “good” one if $A_{IJ}$, the submatrix defined by I and J, has some coherence (low variance, low rank, similar ordering of rows, etc.)

The biclustering problem can be defined as one of finding a single bicluster in the entire gene expression matrix, or as one of extracting all biclusters (with some restriction on the relationship between biclusters)

Coherence of a Submatrix

Distribution of Biclusters

Bipartite Graph Model

Just like symmetric matrices, which can be modeled as arbitrary graphs, rectangular matrices can be modeled using bipartite graphs

With proper definition of edge weights, biclustering can be posed as the problem of finding “heavy” subgraphs

Row, Column, Matrix Means

Objective Function

Low-variance (constant) bicluster Ideal bicluster: Minimize bicluster variance

Low-rank (constant row, constant column, coherent values) bicluster Ideal constant row: Ideal constant column: General rank-one bicluster: Define residue for each value: Minimize mean squared residue
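The residue-based objective can be sketched in a few lines. This is a minimal illustration, assuming the commonly used Cheng and Church style definition in which the residue of an entry is the value minus its row mean, minus its column mean, plus the bicluster mean. The function name and the toy matrix are made up for the example.

```python
def mean_squared_residue(A, rows, cols):
    """Mean squared residue of the submatrix of A selected by rows and cols.

    Assumed definition: residue(i, j) = a_ij - row_mean_i - col_mean_j + mean.
    """
    sub = [[A[i][j] for j in cols] for i in rows]
    n_r, n_c = len(rows), len(cols)
    row_mean = [sum(r) / n_c for r in sub]
    col_mean = [sum(sub[i][j] for i in range(n_r)) / n_r for j in range(n_c)]
    all_mean = sum(row_mean) / n_r
    total = 0.0
    for i in range(n_r):
        for j in range(n_c):
            residue = sub[i][j] - row_mean[i] - col_mean[j] + all_mean
            total += residue ** 2
    return total / (n_r * n_c)


# Toy matrix (hypothetical): the first three columns follow an additive
# row-offset + column-offset pattern, the last column does not.
A = [
    [1.0, 2.0, 4.0, 9.0],
    [2.0, 3.0, 5.0, 0.0],
    [4.0, 5.0, 7.0, 3.0],
]
print(mean_squared_residue(A, [0, 1, 2], [0, 1, 2]))     # close to zero
print(mean_squared_residue(A, [0, 1, 2], [0, 1, 2, 3]))  # clearly larger
```

A submatrix whose entries are a row effect plus a column effect has residue (numerically) zero, which is why this score rewards coherent rather than merely constant biclusters.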

Missing Values

Not all expression levels are available for each gene/sample pair

A solution is to replace missing values (random values, gene mean, sample mean, regression)

Generalize the definitions of row, column, and bicluster means to handle missing values implicitly

Occupancy threshold: a bicluster is one with an adequate number of (non-missing) values in each row and column

Overlapping Biclusters

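The expression of a gene in one sample may be thought of as a superposition of contributions from multiple biclusters

Plaid model: $a_{ij} = \sum_{k=0}^{K} \theta_{ijk} \rho_{ik} \kappa_{jk}$

$\theta_{ijk}$: contribution of bicluster k to the expression value of the ith gene in the jth sample

$\rho_{ik}$ and $\kappa_{jk}$ (generally binary) specify the membership of row i and column j in the kth bicluster, respectively

Minimize $\sum_{i,j} \left( a_{ij} - \sum_{k=0}^{K} \theta_{ijk} \rho_{ik} \kappa_{jk} \right)^2$

$\theta_{ijk}$ is defined to reflect “bicluster type”: $\mu_k$, $\mu_k + \alpha_{ik}$, $\mu_k + \beta_{jk}$, or $\mu_k + \alpha_{ik} + \beta_{jk}$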

Discrete Coherence

A bicluster is defined to be one with a coherent ordering of the values on rows and/or columns (as opposed to the values themselves)

Order-preserving submatrix (OPSM): a submatrix is order preserving if there is an ordering of its columns such that the sequence of values in every row is increasing

Gene expression motifs (xMOTIFs): the expression level of a gene is conserved across a subset of conditions if the gene is in the same “state” in each of the conditions

An xMOTIF is a subset of genes that are simultaneously conserved across a subset of samples
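The OPSM definition can be checked directly for a given submatrix. The sketch below is only an illustration of the definition, not the search algorithm: it assumes strictly increasing rows, and the function name is hypothetical.

```python
def order_preserving_permutation(sub):
    """Return a column order under which every row of sub is strictly
    increasing, or None if no such order exists."""
    if not sub:
        return None
    n_cols = len(sub[0])
    # The first row admits only one candidate order: its columns sorted
    # by value. Every other row must agree with that order.
    order = sorted(range(n_cols), key=lambda j: sub[0][j])
    for row in sub:
        values = [row[j] for j in order]
        if any(a >= b for a, b in zip(values, values[1:])):
            return None
    return order


# Rows differ a lot in magnitude but share the same column ordering.
coherent = [[3, 1, 2], [30, 10, 25], [0.5, 0.1, 0.2]]
print(order_preserving_permutation(coherent))          # [1, 2, 0]
print(order_preserving_permutation([[1, 2], [2, 1]]))  # None
```

Note that only the ranks within each row matter, so rows on very different scales can still belong to the same bicluster.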

Binary Biclusters

Quantize the gene expression matrix to binary values

SAMBA: a 1 corresponds to a significant change in the expression value

PROXIMUS: a 1 means that the gene is “expressed” in the corresponding sample

A bicluster is a “dense submatrix”, i.e., one with significantly more 1’s than one would expect

Bipartite graph model: bicliques, heavy subgraphs

It is possible to statistically quantify the density of a submatrix, through its log-likelihood or its p-value

Biclustering Algorithms

Enumeration: go for it!

Greedy algorithms: make a locally optimal choice at every step

Divide and conquer: solve the problem recursively

Alternating iterative heuristics: fix one dimension, solve for the other, alternate iteratively

Model-based: parameter estimation, e.g., the EM algorithm

Enumerating Biclusters

With m rows and n columns in the matrix, there are $2^m \times 2^n$ possible biclusters in total

Not doable in realistic amounts of time. Is it really necessary?

Put some restriction on the size of biclusters

SAMBA models the problem as one of finding heavy subgraphs in a bipartite graph

Key assumption is sparsity: nodes of the bipartite graph have bounded degree

Find K heavy bipartite subgraphs (biclusters) with bounded-degree enumeration

Refine them to optimize overlap, and add/remove nodes that improve bicluster quality

Greedy Algorithms

Basic idea: refine existing biclusters by adding/removing genes/samples to improve the objective function

Generally quite fast

How to choose initial biclusters?

How to jump over bad local optima? (Global awareness, hill-climbing)

Optimization function: mean squared residue

Node deletion: start with a large bicluster, keep removing the genes/samples that contribute most to the total residue

Node addition: start with a small bicluster, keep adding the genes/samples that contribute least to the total residue

Repeat these alternately to improve global awareness
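The node deletion step can be sketched as follows. This is a simplified single-node deletion loop under assumptions not stated on the slide: the residue is defined as value minus row mean minus column mean plus bicluster mean, and the loop stops once the mean squared residue drops below a user-chosen threshold `delta`. All names are illustrative.

```python
def residue_scores(A, rows, cols):
    """Return (mean squared residue, per-row scores, per-column scores)."""
    row_mean = {i: sum(A[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(A[i][j] for i in rows) / len(rows) for j in cols}
    mean = sum(row_mean.values()) / len(rows)
    sq = {(i, j): (A[i][j] - row_mean[i] - col_mean[j] + mean) ** 2
          for i in rows for j in cols}
    h = sum(sq.values()) / (len(rows) * len(cols))
    row_score = {i: sum(sq[i, j] for j in cols) / len(cols) for i in rows}
    col_score = {j: sum(sq[i, j] for i in rows) / len(rows) for j in cols}
    return h, row_score, col_score


def node_deletion(A, delta):
    """Greedy sketch: start from the whole matrix and repeatedly drop the
    row or column that contributes most to the residue."""
    rows = list(range(len(A)))
    cols = list(range(len(A[0])))
    while len(rows) > 1 and len(cols) > 1:
        h, row_score, col_score = residue_scores(A, rows, cols)
        if h <= delta:
            break
        worst_row = max(rows, key=row_score.get)
        worst_col = max(cols, key=col_score.get)
        if row_score[worst_row] >= col_score[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
    return rows, cols


# Hypothetical data: the last column breaks an otherwise additive pattern.
A = [
    [1.0, 2.0, 4.0, 9.0],
    [2.0, 3.0, 5.0, 0.0],
    [4.0, 5.0, 7.0, 3.0],
]
print(node_deletion(A, delta=0.01))
```

On this toy matrix the loop removes the incoherent last column and then stops, since the remaining block has essentially zero residue. Node addition would run the same scoring in the opposite direction.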

Finding All Biclusters

If biclusters are identified one by one, we should make sure that we do not identify the same bicluster again and again

Masking discovered biclusters: fill the bicluster with random values

First identify disjoint biclusters, then grow them to capture overlaps

Flexible Overlapped Biclustering (FLOC): generate K initial biclusters, then make decisions from the gene/sample perspective (as opposed to the bicluster perspective), i.e., choose the best (maximum gain) action for each gene

Generalizing K-Means to Biclustering

Assume K gene clusters and L sample clusters

Notice that this is a little counter-intuitive: we do not have well-defined biclusters; rather, we have clusters of genes and clusters of samples, and each pair of gene and sample clusters defines a bicluster

R: m×k gene clustering matrix, C: n×l sample clustering matrix

R(i,k)=1 if gene i belongs to cluster k (actually, columns are normalized to have unit norm)

Minimize total residue:

KL-Means Algorithm

We can show that Batch iteration

Given R, compute (mxl matrix) serves as a prototype for column

clusters For each column, find the column of that is

closest to that column, update the corresponding entry of C accordingly

Once C is fixed, repeat the same for rows to compute R from

Converges to a local minimum of the objective function

OPSM Algorithm

Recall that an order-preserving submatrix (OPSM) is one such that all rows have their entries in the same order

Growing partial models: fix the extremes first

The idea: columns with very high or low values are more informative for identifying rows that support the assumed linear order

Start with all (1,1) partial models, i.e., only consider the preservation of the first and last elements, and keep the best ones

Expand these to obtain (2,1) models, then (2,2) models, and so on until we have (s/2, s/2) models, s being the number of columns in the target bicluster

Divide and Conquer Algorithms

Block clustering (a.k.a. direct clustering): recursive bipartitioning

Sort rows according to their mean, and choose a row such that the total variance above and below the row is minimized

Do the same for columns

Pick the row or column that results in minimum intra-cluster variance, and split the matrix into two based on that row or column

Continue splitting recursively

One problem is that once two rows/columns go to different biclusters, they can never come together

Gap statistics: find a large number of biclusters, then recombine

Binormalization Normalize matrix on both dimensions Independent scaling of rows and columns

Here, R and C are diagonal matrices that contain row

and column means, respectively Bistochastization

Goal: Rows will add up to a constant (or will have constant norm), columns will add up to a separate constant

Repeat independent scaling of rows and columns until stability is reached

The residual of entire matrix is also normalized in the sense that both rows and columns have zero mean

Spectral Biclustering

Singular value decomposition

The nonzero eigenvalues of the matrices $A^T A$ and $A A^T$ (say, $\sigma^2$) are the same

Each $\sigma$ is called a singular value of A, and the corresponding eigenvectors of $A A^T$ and $A^T A$ are called the left and right singular vectors

If $\sigma_1$ is the largest singular value of A, with $A^T A v_1 = \sigma_1^2 v_1$ and $A A^T u_1 = \sigma_1^2 u_1$, then $\sigma_1 u_1 v_1^T$ is the best rank-one approximation to A, i.e., $\|A - \sigma u v^T\|_2$ is minimized by $\sigma_1$, $u_1$, and $v_1$ (over all pairs of unit-norm vectors u and v)

Consequently, the entries of u and v are ordered in such a way that similar rows have similar values on u, and similar columns have similar values on v

Split the matrix based on u and v

6. Gene Regulatory Networks

Regulation of Gene Expression

Transcriptional Regulation of telomerase protein component gene hTERT

Genetic Regulation & Cellular Signaling

Organization of Genetic Regulation

Gene

Up-regulation

Down-regulation

Negative ligand-independent repression at chromatin level

Genetic network that controls flowering time in A. thaliana(Blazquez et al, EMBO Reports, 2001)

Gene Regulatory Networks: Transcriptional Regulatory Networks

Nodes with outgoing edges are limited to transcription factors

Can be reconstructed by identifying regulatory motifs (through clustering of gene expression & sequence analysis) and finding transcription factors that bind to the corresponding promoters (through structural/sequence analysis)

Gene Regulatory Networks: Gene Expression Networks

General model of genetic regulation

Identify the regulatory effects of genes on each other, independent of the underlying regulatory mechanism

Can be inferred from correlations in gene expression data, time-series gene expression data, and/or gene knock-out experiments

Observation Inference

Boolean Network Model

Binary model: a gene has only two states

ON (1): the gene is expressed

OFF (0): the gene is not expressed

Each gene’s next state is determined by a boolean function of the current states of a subset of other genes

A boolean network is specified by two sets: a set of nodes (genes), where the state of each gene is binary, and a collection of boolean functions, one per gene

Logic Diagram

Cell cycle regulation

Retinoblastoma (Rb) inhibits DNA synthesis

Cyclin Dependent Kinase 2 (cdk2) & cyclin E inactivate Rb to release cell into S phase

Up-regulated by CAK complex and down-regulated by p21/WAF1

Wiring Diagram

Dynamics of Boolean Networks

Gene activity profile (GAP): the collection of the states of individual genes in the genome (network)

The number of possible GAPs is $2^n$, where n is the number of genes

The system ultimately transitions into attractor states

Steady state (point) attractors

Dynamic attractors: state cycles

Each transient state is associated with an attractor (basins of attraction)

In practice, only a small number of GAPs correspond to attractors

What is the biological meaning of an attractor?
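Attractors and basins of attraction are easy to see by brute force on a small network. The sketch below uses a made-up 3-gene network with synchronous updates (the rules are an assumption for illustration, not taken from the slides) and groups all $2^n$ states by the attractor they eventually reach.

```python
from itertools import product

# Hypothetical 3-gene network; a state is the tuple (x0, x1, x2).
RULES = (
    lambda s: s[1],         # x0 copies x1
    lambda s: s[0],         # x1 copies x0
    lambda s: s[0] & s[1],  # x2 is ON only if both x0 and x1 are ON
)


def step(state, rules=RULES):
    """Synchronous update: every gene reads the current state at once."""
    return tuple(rule(state) for rule in rules)


def find_attractor(state, rules=RULES):
    """Follow the trajectory until a state repeats.

    Returns (transient states, attractor cycle)."""
    seen = {}
    trajectory = []
    while state not in seen:
        seen[state] = len(trajectory)
        trajectory.append(state)
        state = step(state, rules)
    start = seen[state]
    return trajectory[:start], trajectory[start:]


def basins(n_genes=3, rules=RULES):
    """Group all 2^n states by the attractor they flow into."""
    result = {}
    for state in product((0, 1), repeat=n_genes):
        _, cycle = find_attractor(state, rules)
        result.setdefault(frozenset(cycle), []).append(state)
    return result


for attractor, members in basins().items():
    print(sorted(attractor), "<-", members)
```

This toy network has two point attractors, (0,0,0) and (1,1,1), and one state cycle of length two, so it shows both attractor types named above. Exhaustive enumeration is only feasible for small n.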

State Space of Boolean Networks

Equate cellular states with attractors

Attractor states are stable under small perturbations

Most perturbations cause the network to flow back to the attractor

Some genes are more important, and changing their activation can cause the system to transition to a different attractor

This slide is taken from the presentation by I. Shmulevich

Identification of Boolean Networks

We have the “truth table” available: binarize time-series gene expression data

REVEAL: use mutual information to derive the logical rules that determine each variable

If the mutual information between a set of variables and the target variable is equal to the entropy of the target variable, then that set of variables completely determines the target variable

For each variable, consider functions consisting of 1 variable, then 2, then 3, …, then i, …, until one is found

Once the minimum set of variables that determine a variable is found, we can infer the function from the truth table

In general, the indegrees of genes in the network are small
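The mutual information test can be sketched on a full state transition table. This is a simplified illustration of the idea rather than the REVEAL implementation: it assumes noise-free binary data, and the toy update rule and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations, product
from math import log2


def entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())


def mutual_information(xs, ys):
    """M(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def minimal_inputs(transitions, target, max_k=3, tol=1e-9):
    """Smallest set of genes whose current states determine the next state
    of `target`, found by trying input sets of size 1, 2, ..., max_k."""
    inputs = [state for state, _ in transitions]
    outputs = [nxt[target] for _, nxt in transitions]
    h_out = entropy(outputs)
    n_genes = len(inputs[0])
    for k in range(1, max_k + 1):
        for subset in combinations(range(n_genes), k):
            xs = [tuple(s[g] for g in subset) for s in inputs]
            if abs(mutual_information(xs, outputs) - h_out) < tol:
                return subset
    return None


def toy_step(s):
    """Hypothetical network: x0 <- x1, x1 <- x0, x2 <- x0 AND x1."""
    return (s[1], s[0], s[0] & s[1])


transitions = [(s, toy_step(s)) for s in product((0, 1), repeat=3)]
for gene in range(3):
    print(gene, minimal_inputs(transitions, gene))
```

For the AND-regulated gene, no single input carries enough information, so the search only succeeds at size two. With real, noisy time series the equality test would have to be relaxed.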

REVEAL

Limitations of Boolean Networks

The effect of intermediate gene expression levels is ignored

It is assumed that the transitions between states are synchronous

A model incorporates only a partial description of a physical system: noise, effects of other factors

One may wish to model an open system: a particular external condition may alter the parameters of the system

Boolean networks are inherently deterministic

Probabilistic Models

Stochasticity can account for noise, variability in the biological system, and aspects of the system that are not captured by the model

Random variables include observed attributes (the expression level of a particular gene in a particular sample) and hidden attributes (the boolean function assigned to a gene?)

Probabilistic Boolean Networks

Each gene is associated with multiple boolean functions

Each function is associated with a probability

Can characterize the stochastic behavior of the system

Bayesian Networks

A Bayesian network is a representation of a joint probability distribution

A Bayesian network $B = (G, \Theta)$ is specified by two components

A directed acyclic graph G, in which directed edges represent the conditional dependence between expression levels of genes (represented by the nodes of the graph)

A function $\Theta$ that specifies the conditional distribution of the expression level of each gene, given the expression levels of its parents

Gene A is gene B’s parent if there is a directed edge from A to B

$P(B \mid Pa(B)) = \Theta(B, Pa(B))$

Conditional Independence

In a Bayesian network, if there is no direct edge between two genes, then these genes are said to be conditionally independent

The probability of observing a cellular state (configuration of expression levels) can be decomposed into product form: $P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i))$
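The product decomposition can be made concrete with a tiny discrete network. In this sketch each gene contributes one factor, its probability given its parents; the three genes, the graph, and all probabilities are invented for illustration.

```python
from itertools import product

# Hypothetical network: A regulates both B and C (A -> B, A -> C).
PARENTS = {"A": (), "B": ("A",), "C": ("A",)}

# P(gene is ON | parent states); the numbers are made up.
P_ON = {
    "A": {(): 0.3},
    "B": {(0,): 0.1, (1,): 0.8},
    "C": {(0,): 0.5, (1,): 0.2},
}


def joint_probability(assignment):
    """P(assignment) as the product of one conditional factor per gene."""
    p = 1.0
    for gene, parents in PARENTS.items():
        p_on = P_ON[gene][tuple(assignment[q] for q in parents)]
        p *= p_on if assignment[gene] == 1 else 1.0 - p_on
    return p


print(joint_probability({"A": 1, "B": 1, "C": 0}))

# Sanity check: the probabilities of all 2^3 cellular states sum to one.
total = sum(
    joint_probability(dict(zip("ABC", values)))
    for values in product((0, 1), repeat=3)
)
print(total)
```

With binary genes, each gene with k parents needs one number per parent configuration, which is why the tables grow exponentially in the number of parents.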

Variables in Bayesian Networks

Discrete variables: again, genes’ expression levels are modeled as ON and OFF (or more discrete levels)

If a gene has k parents in the network, then the conditional distribution is characterized by $r^k$ parameters (r is the number of discrete levels)

Continuous variables: real-valued expression levels; we have to specify multivariate continuous distribution functions

Hybrid networks

Equivalence Classes of Bayesian Nets

Observe that each network structure implies a set of independence assumptions

More than one graph can imply exactly the same set of independencies (e.g., X->Y and Y->X); such graphs are said to be equivalent

By looking at observations of a distribution, we cannot distinguish between equivalent graphs

An equivalence class can be uniquely represented by a partially directed graph (some edges are undirected)

Learning Bayesian Networks

Given a training set $D = \{x_1, x_2, \ldots, x_m\}$ of m independent instances of the n random variables, find an equivalence class of networks $B = (G, \Theta)$ that best matches D

The x’s are the gene expression profiles

Based on Bayes’ formula, the posterior probability of a network given the data can be evaluated as

$S(G : D) = \log P(G \mid D) = \log P(D \mid G) + \log P(G) + C$

where C is a constant (independent of G) and

$P(D \mid G) = \int P(D \mid G, \Theta) \, P(\Theta \mid G) \, d\Theta$

is the marginal likelihood that averages the probability of the data over all possible parameter assignments to G

Learning Algorithms

The Bayes score S(G : D) depends on the particular choice of priors $P(G)$ and $P(\Theta \mid G)$

The priors can be chosen to be structure equivalent, so that equivalent networks will have the same score, and decomposable, so that the score can be represented as the superposition of contributions of each gene

The problem becomes finding the optimal structure (G)

We can estimate the gain associated with the addition, removal, and reversal of an edge

Then, we can use greedy-like heuristics (e.g., hill climbing)

Causal Patterns

Bayesian networks model dependencies between multiple measurements

How about the mechanism that generated these measurements?

Causal network model: flow of causality

Model not only the distribution of observations, but also the effect of interventions

If gene X codes for a transcription factor of gene Y, manipulating X will affect Y, but not vice versa

But in Bayesian networks, X->Y and Y->X are equivalent

Intervention experiments (as opposed to passive observation): knock X out, then measure Y

Dynamic Bayesian Networks

Dependencies do not uncover temporal relationships

Gene expression varies over time

Dynamic Bayesian networks model the dependency between a gene’s expression level at time t and the expression levels of its parent genes at time t-1

Topology of Biological Networks

Topological Characteristics of Networks

Local characteristics: subgraphs, motifs, clustering

Global characteristics: degree distribution, reachability, hierarchy, assortativity

Topology & Function

Robustness: degree distribution, reachability, hierarchy

Modularity: motifs, clustering, hierarchy

Dynamics: do general topological properties determine behavior?

8. Topology of Biological Networks

Real-World Networks

Biological networks at different scales: population, tissue, cell

Cellular networks: metabolic pathways, transcriptional networks, protein-protein interactions

Other networks: Internet, social networks (Erdős number, Kevin Bacon network, friendship), electronic circuits, parallel computers

Understanding Networks

Are there commonalities in the topological characteristics of different networks? Turns out to be yes

Do these have anything to do with the function, origin, and growth of these networks?

How about differences?

Internet vs. S. cerevisiae PPI network

Graphs vs. Networks

A network is a “functional” structure, in which nodes and links are “active”: information flow, underlying dynamics

A graph is an abstraction of a network (or, in general, of pairwise relationships between entities)

The two terms are commonly used interchangeably

Modeling Networks

Graph representation of metabolic pathways (a)

Edges may represent substrate-product relationships between metabolites (b), or producer-consumer relationships between enzymes

We can drop “common” metabolites (c)

Directed vs. Undirected Graphs

An edge (or link) is directed if there is a specified directionality (such as cause and effect) in the relationship between the two objects represented by the nodes

Metabolic pathways are directed, because many reactions are irreversible

Protein-protein interactions are generally undirected, because in most cases all we know is that they bind to each other

The semantics of topological properties (and/or motifs) may be different for directed and undirected graphs, e.g., cycles

Connectivity

Degree of a node in the network: how many links does a node have to other nodes?

Social networks: how “social” or “active” is a person?

Protein-protein interactions: how “sticky” or “functional” is a protein?

Directed graphs: in-degree and out-degree

Internet: how popular is a website?

Metabolic pathways: how many reactions use a metabolite as substrate?

Degree Distribution

Define P(k) as the probability (relative frequency) that a selected node has exactly k links

N(k) = number of nodes with degree k

P(k) = N(k)/N, where N is the total number of nodes

Degree is a local property, degree distribution is a global property

Figure: average degree distribution of the metabolic networks of 43 organisms
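The definition P(k) = N(k)/N translates directly into code. A minimal sketch for an undirected edge list follows; it assumes every node appears in at least one edge (isolated nodes would need to be passed in separately), and the toy graph is hypothetical.

```python
from collections import Counter


def degree_distribution(edges):
    """Return {k: P(k)} for an undirected graph given as a list of edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n_nodes = len(degree)                 # N: total number of nodes
    n_of_k = Counter(degree.values())     # N(k): nodes with degree k
    return {k: n_of_k[k] / n_nodes for k in sorted(n_of_k)}


# Toy graph: one small hub with a short tail.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("c", "d")]
print(degree_distribution(edges))  # {1: 0.6, 2: 0.2, 3: 0.2}
```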

Reachability

Path: a sequence of linked nodes that connects two specified nodes to each other

Shortest path: the path between two nodes that contains the minimum number of edges (length)

Quantifies the reachability between two molecules; the length of the shortest path is a.k.a. distance

Network’s overall navigability

Diameter: maximum distance in a network

Mean path length: average distance in a network
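Distances, diameter, and mean path length can all be obtained from breadth-first search. The sketch below assumes an unweighted, undirected, connected graph stored as an adjacency dictionary; the function names and the toy path graph are illustrative.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path length (number of edges) from source to each node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter_and_mean_path_length(adj):
    """Maximum and average distance over all ordered pairs of distinct nodes."""
    lengths = []
    for source in adj:
        dist = bfs_distances(adj, source)
        lengths.extend(d for node, d in dist.items() if node != source)
    return max(lengths), sum(lengths) / len(lengths)


# Toy path graph a - b - c - d.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(diameter_and_mean_path_length(adj))
```

Running one BFS per node costs on the order of N times the number of edges, which is fine for networks of the size discussed here.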

Small World Effect

First identified on social networks

People were sent letters and asked to forward the letter to a specified person if they personally knew that person; if not, they were supposed to send it to a friend who could be likely to…

Result: “Six degrees of separation” (average)

Most natural networks demonstrate the small world phenomenon: neural networks, WWW

Metabolism: paths of three or four reactions can link most metabolite pairs

Local perturbations in metabolite concentrations can reach the whole network very quickly

Clustering

A network is clustered if we can say that if A and B are connected and B and C are connected, then it is likely that A and C are connected

Clustering coefficient: the fraction of observed triangles among all possible triangles around a node

$C_I = \frac{2 n_I}{k(k-1)}$, where k is the degree of node I, and $n_I$ is the number of pairs of neighbors of I that are connected to each other

Distribution of clustering coefficients

The function C(k): average clustering coefficient of nodes with degree k

Diameter and average degree depend on the total number of nodes, but P(k) and C(k) do not
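The clustering coefficient of a node is the number of connected neighbor pairs divided by the k(k-1)/2 possible pairs. A minimal sketch, assuming an undirected graph stored as a dictionary of neighbor sets and defining the coefficient as 0 for nodes with fewer than two neighbors (a convention chosen here, not stated on the slide):

```python
from itertools import combinations


def clustering_coefficient(adj, node):
    """2 * n / (k * (k - 1)), with n the number of linked neighbor pairs."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    linked = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * linked / (k * (k - 1))


# Toy graph: a triangle a-b-c with a pendant node d attached to a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
for node in sorted(adj):
    print(node, clustering_coefficient(adj, node))
```

Averaging this value over all nodes of the same degree gives the function C(k).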

Random Graphs

Known as Erdős-Rényi graphs

Mark N nodes, draw an edge between any pair of nodes with fixed probability p

Mean path length is proportional to log(N)

Degree distribution peaks around the average degree; clustering coefficient does not depend on degree

Figure: random network, degree distribution, clustering coefficient distribution
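The random graph construction described above takes only a few lines. This sketch uses the standard library random number generator with a fixed seed for reproducibility; the function name and parameter values are illustrative.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, seed=0):
    """Draw each of the n*(n-1)/2 possible edges independently with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


edges = erdos_renyi(100, 0.05)
# The expected number of edges is p * n * (n - 1) / 2, i.e., 247.5 here.
print(len(edges))
```

Since every node has the same chance of receiving each edge, degrees concentrate around the average p(N-1), which is the peak mentioned above.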

Scale-Free Networks

The degree distribution follows a power law

$P(k) \sim k^{-\gamma}$, where $\gamma$ is the degree exponent

In other words, the number of nodes with degree k is inversely proportional to a power of k

Many low-degree nodes, a few hubs

Mean path length is proportional to log(log(N))

Terminology: absence of a typical node in the network

Figure: scale-free network, degree distribution, clustering coefficient distribution

Scale-Free Networks in Nature

Degree exponents $\gamma$: metabolic network 2.26, actor collaboration 2.3, WWW 2.1, power grid 4

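Mathematical Model for Power Law

There are y nodes of degree x, where $\log y = \alpha - \gamma \log x$, i.e., $y = e^{\alpha} / x^{\gamma}$

The maximum degree in the graph is $e^{\alpha/\gamma}$

The number of vertices, n, is: $n = \sum_{x=1}^{e^{\alpha/\gamma}} \frac{e^{\alpha}}{x^{\gamma}} \approx \zeta(\gamma) e^{\alpha}$, where $\zeta$ is the Riemann zeta function

The number of edges, E, is: $E = \frac{1}{2} \sum_{x=1}^{e^{\alpha/\gamma}} x \frac{e^{\alpha}}{x^{\gamma}} \approx \frac{1}{2} \zeta(\gamma - 1) e^{\alpha}$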

Role of Degree Exponent

The smaller the value of $\gamma$, the more important the hubs are

If $\gamma > 3$, then the hubs are not relevant

For $2 < \gamma < 3$, there is a hierarchy of hubs, with the most connected hub being in contact with a small fraction of all nodes

For $\gamma = 2$, a star-like network emerges, with the largest hub being in contact with a large fraction of nodes

Scale-free networks are generally interesting for $2 < \gamma < 3$

Unusual properties emerge for this regime

This is the range that is observed in most biological (as well as non-biological) networks

Hierarchical Networks

“General” scale-free networks still do not capture one observed property of cellular networks

Hierarchical networks are clusters of clusters of clusters of… connected through local hubs, less local hubs, …, global hubs

Figure: average clustering coefficient distribution for the metabolic networks of 43 organisms, with $C(k) \sim 1/k$

Hubs in Cellular Networks

PPI networks: kinases form the core of the network

Genetic regulation: most transcription factors regulate a few genes, while a few general transcription factors interact with many genes

Recall that the gene expression matrix of the yeast cell cycle contains a few strong principal components

However, the incoming degree distribution is better approximated by an exponential function

Most genes are regulated by only one to three transcription factors

Growth Models

How do these networks gain these properties?

Preferential attachment: at each time point, a node is added to the network and connected to an existing node with probability proportional to the current degree of that node

1st order:

2nd order:

This growth model generates scale-free networks with degree exponent 3
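Preferential attachment is straightforward to simulate. The sketch below is a simplified variant in which each new node brings exactly one edge (an assumption made for brevity); it samples the target in proportion to degree by picking uniformly from a list that holds every node once per incident edge.

```python
import random
from collections import Counter


def preferential_attachment(n_nodes, seed=0):
    """Grow a network node by node; each new node links to one existing
    node chosen with probability proportional to its current degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    endpoints = [0, 1]  # node i appears here degree(i) times
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend((new, target))
    return edges


edges = preferential_attachment(1000)
degree = Counter(node for edge in edges for node in edge)
# Most nodes keep degree 1, while a few early nodes grow into hubs.
print(degree.most_common(5))
```

Because early nodes have more time to accumulate links, the hubs in such a simulation are typically among the oldest nodes.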

Duplication/Divergence

Gene duplications are considered one of the driving forces of molecular evolution

When a gene is duplicated, the corresponding protein has two copies, so an additional node with the same neighbors is added to the network

Proteins with already high degree are likely to have their neighbors duplicated => preferential attachment

Network Evolution & Topology

The scale-free model predicts that the nodes that appeared early in the history of the network are the most connected nodes

Remnants of the RNA world, such as coenzyme A, NAD, and GTP, are among the most connected substrates in the metabolic network

So are elements of the most ancient metabolic pathways, such as glycolysis and the tricarboxylic acid cycle

In PPI networks, cross-genome comparisons indicate that, on average, there is a positive correlation between evolutionary history and the number of links a protein has

Assortativity

Social networks are generally assortative: people who know many people also know each other; productive authors do write papers together

Most cellular networks are observed to be disassortative: in general, hubs avoid linking directly to each other

This holds for metabolic pathways and PPI networks, as well as the WWW

Function of disassortativity? Selective value of disassortativity? Evolution of disassortativity?

Do existing models generate disassortativity?

Modules & Clustering

Modularity: groups of physically or functionally linked molecules that work together to perform a (relatively) distinct function

Friend groups in social networks, labs in co-authorship networks

Protein complexes

Temporally co-regulated groups of genes

High clustering in cellular networks

Average clustering coefficient is independent of network size for metabolic pathways

For an arbitrary scale-free network, the average clustering coefficient decreases with network size

PPI and DDI networks also have high clustering coefficients

Subgraphs as Elementary Units

Subgraphs capture specific patterns of interconnections that characterize a given network at the local level

Not all subgraphs are equally significant

The abundance of squares and the absence of triangles can tell us something fundamental about the architecture of the square lattice

Figure: in a square lattice, triangles do not exist at all, while squares are abundant

Network Motifs

A network motif is a subgraph (in topological terms, i.e., ignoring the identity of nodes) that occurs much more frequently in the network of interest than in a random network that has similar global properties

Generating Random Graphs

Random graphs are used to assess the statistical significance of the frequency of a motif

As degree distribution is the key characteristic of scale-free networks, the graphs are generally randomized to preserve degree distribution

Simulation: edge switching algorithm

Analytical methods: arbitrary degree distribution

The probability of existence of an edge between u and v is defined as

$P(uv \in E) = \frac{d_u d_v}{\sum_{w} d_w}$

where $d_u$ and $d_v$ are the specified “expected degrees” of u and v

Observe that $E[D_u] = d_u$, where $D_u$ is the corresponding random variable

However, in order for P to be a well-defined probability function, we must have

$d_{\max}^2 \le \sum_{w} d_w$, where $d_{\max} = \max_u d_u$

In general, this is not the case for PPI networks

These models are generally useful for multigraphs rather than simple graphs, because of dependencies
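The edge switching idea used for simulation can be sketched as follows: pick two edges (a, b) and (c, d) at random and rewire them to (a, d) and (c, b), rejecting any switch that would create a self-loop or a duplicate edge. Every node keeps its degree. This is a minimal illustration for simple undirected graphs; the function name, the number of switches, and the toy graph are assumptions.

```python
import random
from collections import Counter


def edge_switch(edges, n_swaps, seed=0):
    """Randomize a simple undirected graph while preserving all degrees."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue  # would create a self-loop
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


def degrees(edge_list):
    return Counter(node for edge in edge_list for node in edge)


original = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3), (1, 4)]
shuffled = edge_switch(original, n_swaps=10, seed=1)
print(degrees(shuffled) == degrees(original))  # True
```

Counting a motif in many such randomized copies gives an empirical null distribution against which its frequency in the real network can be compared.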

Common Network Motifs

PPI and Transcription

Integrate protein-protein interactions and transcriptional regulation

Motifs might reveal how these two types of interaction work together for the regulation of cellular processes

Possible interaction patterns between a pair of proteins: red directed arrows represent transcriptional regulation, black bidirectional arrows represent protein-protein interaction

Motifs in Integrated TRI-PPI Network

Conservation of Motifs

Is there a “selective value” of motifs?

If motifs are conserved, then one might expect that proteins that are parts of motifs will also be conserved

Motif Constituents Conserved Together

Motif Clusters

Motifs generally tend to form clusters: hierarchical modularity

On the left, 209 bi-fan motifs on the E. coli TRN are shown altogether; shared edges are in blue, others in red

Topological Robustness

Scale-free networks are robust to random attacks: it is not easy to disconnect the network via random node deletions

In random (Erdős-Rényi) graphs, the network falls apart when the number of accidental node failures reaches a certain threshold

Scale-free networks do not have such a threshold: even if 80% of the nodes fail at random, the remaining 20% are still connected

Attack vulnerability: dependence on hubs

If a key hub fails, the network turns into a collection of small isolated node clusters

Robustness, Lethality, & Redundancy

Lethal proteins: only about 10% of the proteins with fewer than 5 interactions are essential; this rate is 60% for proteins with more than 15 interactions

Redundancy: only 18.7% of S. cerevisiae proteins are lethal when deleted individually

Evolution of robustness: highly connected yeast genes have a smaller evolutionary distance to their orthologs in C. elegans

The structure of important proteins is subject to more selective pressure

Functional and Dynamical Robustness

Nodes have different biological functions

Network topology is not the sole indicator of lethality

Experimentally identified protein complexes tend to be composed of uniformly essential or non-essential proteins

Dispensability of the whole complex determines the importance of its subunits
