Analysis of gene expression data (Nominal explanatory variables)

Shyamal D. Peddada, Biostatistics Branch, National Inst. Environmental Health Sciences (NIH), Research Triangle Park, NC

Page 1: Analysis of gene expression data (Nominal explanatory variables)

Analysis of gene expression data

(Nominal explanatory variables)

Shyamal D. Peddada

Biostatistics Branch

National Inst. Environmental Health Sciences (NIH)

Research Triangle Park, NC

Page 2: Analysis of gene expression data (Nominal explanatory variables)

Outline of the talk

Two types of explanatory variables (“experimental conditions”)

Some scientific questions of interest

A brief discussion on false discovery rate (FDR) analysis

Some existing statistical methods for analyzing microarray data

Page 3: Analysis of gene expression data (Nominal explanatory variables)

Types of explanatory variables

Page 4: Analysis of gene expression data (Nominal explanatory variables)

Types of explanatory variables (“experimental conditions”)

Nominal variables:

– No intrinsic order among the levels of the explanatory variable(s).

– No loss of information if we permuted the labels of the conditions.

E.g. Comparison of gene expression of samples from “normal” tissue with those from “tumor” tissue.

Page 5: Analysis of gene expression data (Nominal explanatory variables)

Types of explanatory variables (“experimental conditions”)

Ordinal/interval variables:

– Levels of the explanatory variables are ordered.

– E.g.

Comparison of gene expression of samples from different stages of severity of lesions such as “normal”, “hyperplasia”, “adenoma” and “carcinoma”. (categorically ordered)

Time-course/dose-response experiments. (numerically ordered)

Page 6: Analysis of gene expression data (Nominal explanatory variables)

Focus of this talk: Nominal explanatory variables

Page 7: Analysis of gene expression data (Nominal explanatory variables)

Types of microarray data

Independent samples

– E.g. comparison of gene expression of independent samples drawn from normal patients versus independent samples from tumor patients.

Dependent samples

– E.g. comparison of gene expression of samples drawn from normal tissues and tumor tissues from the same patient.

Page 8: Analysis of gene expression data (Nominal explanatory variables)

Possible questions of interest

Identify significant “up/down” regulated genes for a given “condition” relative to another “condition” (adjusted for other covariates).

Identify genes that discriminate between various “conditions” and predict the “class/condition” of a future observation.

Cluster genes according to patterns of expression over “conditions”.

Other questions?

Page 9: Analysis of gene expression data (Nominal explanatory variables)

Challenges

Small sample size but a large number of genes.

Multiple testing – Since each microarray has thousands of genes/probes, several thousand hypotheses are being tested. This impacts the overall Type I error rates.

Complex dependence structure between genes and possibly among samples.

– Difficult to model and/or account for the underlying dependence structures among genes.

Page 10: Analysis of gene expression data (Nominal explanatory variables)

Multiple Testing: Type I Errors

- False Discovery Rates …

Page 11: Analysis of gene expression data (Nominal explanatory variables)

The Decision Table

                        Number of $H_0$    Number of $H_0$
                        not rejected       rejected           Total
Number of true $H_0$    U                  V                  $m_0$
Number of true $H_a$    T                  S                  $m_1$
Total                   W                  R                  m

The only observable values are those in the Total row: W, R and m.

Page 12: Analysis of gene expression data (Nominal explanatory variables)

Strong and weak control of type I error rates

Strong control: control type I error rate under any combination of true $H_0$ and $H_a$.

Weak control: control type I error rate only when all null hypotheses are true.

Since we do not know a priori which hypotheses are true, we will focus on strong control of type I error rate.

Page 13: Analysis of gene expression data (Nominal explanatory variables)

Consequences of multiple testing

Suppose we test each hypothesis at 5% level of significance.

– Suppose n = 10 independent tests are performed and all null hypotheses are true. Then the probability of declaring at least 1 of the 10 tests significant is $1 - 0.95^{10} = 0.401$.

– If 50,000 independent tests are performed, as in Affymetrix microarray data, then, if all null hypotheses are true, you should expect 2500 false positives!
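The two numbers on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names are made up for the example, and both calculations assume independent tests with every null hypothesis true.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """P(at least one rejection) for n independent tests when every null is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of false rejections when every null is true."""
    return n_tests * alpha


fwer_10 = prob_at_least_one_false_positive(10)
expected_50k = expected_false_positives(50_000)
print(round(fwer_10, 3), expected_50k)
```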

Page 14: Analysis of gene expression data (Nominal explanatory variables)

Types of errors in the context of multiple testing

Per-Family Error “Rate” (PFER): E(V)

– Expected number of false rejections of $H_0$.

Per-Comparison Error Rate (PCER): E(V)/m

– Expected proportion of false rejections of $H_0$ among all m hypotheses.

Family-Wise Error Rate (FWER): P(V > 0)

– Probability of at least one false rejection of $H_0$ among all m hypotheses.

Page 15: Analysis of gene expression data (Nominal explanatory variables)

Types of errors in the context of multiple testing

False Discovery Rate (FDR):

– Expected proportion of Type I errors among all rejected hypotheses.

Benjamini-Hochberg (BH): Set V/R = 0 if R = 0.

$$\mathrm{FDR} = E\left(\frac{V}{R}\,1_{\{R>0\}}\right) = E\left(\frac{V}{R}\,\Big|\,R>0\right)P(R>0)$$

Storey: Only interested in the case R > 0. (Positive FDR)

$$\mathrm{pFDR} = E\left(\frac{V}{R}\,\Big|\,R>0\right)$$

Page 16: Analysis of gene expression data (Nominal explanatory variables)

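Some useful inequalities

Since $V \le R \le m$, therefore

$$\frac{V}{m} \le \frac{V}{R}\,1_{\{R>0\}} \qquad (1)$$

Again, since $V \le R$, $V > 0 \Rightarrow R > 0$. Therefore $1_{\{V>0\}} \le 1_{\{R>0\}}$. Thus

$$\frac{V}{R}\,1_{\{R>0\}} \le 1_{\{V>0\}} \qquad (2)$$

Also

$$1_{\{V>0\}} \le V \qquad (3)$$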

Page 17: Analysis of gene expression data (Nominal explanatory variables)

Some useful inequalities

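Combining (1), (2) and (3), we have:

$$\frac{V}{m} \le \frac{V}{R}\,1_{\{R>0\}} \le 1_{\{V>0\}} \le V \qquad (4)$$

Taking expectations in (4) we have:

$$E\left(\frac{V}{m}\right) \le E\left\{\frac{V}{R}\,1_{\{R>0\}}\right\} \le E\left\{1_{\{V>0\}}\right\} \le E(V) \qquad (5)$$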

Page 18: Analysis of gene expression data (Nominal explanatory variables)

Some useful inequalities

Thus we have:

$$\mathrm{PCER} \le \mathrm{FDR} \le \mathrm{FWER} \le \mathrm{PFER} \qquad (6)$$

Trivially

$$\mathrm{FDR} \le \mathrm{pFDR} \qquad (7)$$
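The ordering of the four error rates can be checked numerically with a small Monte Carlo sketch. Everything in it is an assumption made for illustration: the numbers of true and false nulls, the fixed per-test level, and the choice of $U^4$ (with U uniform) as the p-value distribution under the alternative. Because the inequalities hold for every simulated data set, the estimated rates obey the same ordering.

```python
import random


def simulate_error_rates(m0=80, m1=20, alpha=0.05, n_sim=2000, seed=1):
    """Estimate PCER, FDR, FWER and PFER when each test is run at level alpha."""
    rng = random.Random(seed)
    m = m0 + m1
    sum_v = 0.0      # accumulates V (false rejections)
    sum_fdp = 0.0    # accumulates (V/R) * 1{R > 0}
    sum_any = 0.0    # accumulates 1{V > 0}
    for _ in range(n_sim):
        # True nulls: p-values are uniform on (0, 1).
        v = sum(rng.random() <= alpha for _ in range(m0))
        # False nulls: p = U**4 is stochastically smaller (assumed alternative).
        s = sum(rng.random() ** 4 <= alpha for _ in range(m1))
        r = v + s
        sum_v += v
        sum_fdp += v / r if r > 0 else 0.0  # BH convention: V/R = 0 when R = 0
        sum_any += 1.0 if v > 0 else 0.0
    pfer = sum_v / n_sim
    return {
        "PCER": pfer / m,
        "FDR": sum_fdp / n_sim,
        "FWER": sum_any / n_sim,
        "PFER": pfer,
    }


rates = simulate_error_rates()
print(rates)
```

With these settings the four estimates differ by a wide margin, which illustrates why controlling FWER is much more stringent than controlling FDR.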

Page 19: Analysis of gene expression data (Nominal explanatory variables)

Conclusion

It is conservative to control FWER rather than FDR!

It is conservative to control pFDR rather than FDR!

Page 20: Analysis of gene expression data (Nominal explanatory variables)

Some useful inequalities

Question: Is $\mathrm{pFDR} \le \mathrm{FWER}$?

Page 21: Analysis of gene expression data (Nominal explanatory variables)

Some useful inequalities

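Note: $m = m_0 + m_1$ and $R = V + S$.

Example: Suppose $m_0 = m$. Then $m_1 = 0$, so $S = 0$ and $R = V$. Hence

$$\mathrm{FDR} = E\left(\frac{V}{R}\,1_{\{R>0\}}\right) = E\left(1_{\{V>0\}}\right) = P(V>0) = \mathrm{FWER}$$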

Page 22: Analysis of gene expression data (Nominal explanatory variables)

Some useful inequalities

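But

$$\mathrm{pFDR} = E\left(\frac{V}{R}\,\Big|\,R>0\right) = E(1 \mid R>0) = 1.$$

Hence, if $m_0 = m$ then

$$\mathrm{FWER} = \mathrm{FDR} \le \mathrm{pFDR} = 1$$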

Page 23: Analysis of gene expression data (Nominal explanatory variables)

Some useful inequalities

However, in most applications such as microarrays, one expects $m_1 > 0$.

In general, there is no proof of the statement $\mathrm{pFDR} \le \mathrm{FWER}$.

Page 24: Analysis of gene expression data (Nominal explanatory variables)

Some popular Type I error controlling procedures

Let $P_{(1)} \le P_{(2)} \le \cdots \le P_{(m)}$ denote the ordered p-values for the ‘m’ tests that are being performed.

Let $\alpha_{(1)} \le \alpha_{(2)} \le \cdots \le \alpha_{(m)}$ denote the ordered levels of significance used for testing the ‘m’ null hypotheses $H_{0(1)}, H_{0(2)}, \ldots, H_{0(m)}$, respectively.

Page 25: Analysis of gene expression data (Nominal explanatory variables)

Some popular controlling procedures

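Step-down procedure:

Step 1: If $P_{(1)} \le \alpha_{(1)}$ then reject $H_{0(1)}$ and go to Step 2. Else stop.

Step 2: If $P_{(2)} \le \alpha_{(2)}$ then reject $H_{0(2)}$ and go to Step 3. Else stop.

Step 3: If $P_{(3)} \le \alpha_{(3)}$ then reject $H_{0(3)}$ and go to Step 4. Else stop.

and so on.

Hypotheses rejected: $H_{(1)}, H_{(2)}, \ldots, H_{(r)}$, where r is the last step at which a rejection occurs.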
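The step-down logic can be written as a short loop. The sketch below is illustrative only: the function name `step_down`, the example p-values and the critical constants are all made up for the example, and the comparison is taken to be "less than or equal".

```python
def step_down(pvalues, alphas):
    """Return the indices (into pvalues) of the rejected hypotheses.

    alphas[k] is the critical constant compared with the (k+1)-th smallest
    p-value. Testing starts at the smallest p-value and stops at the first
    p-value that exceeds its constant.
    """
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    rejected = []
    for rank, idx in enumerate(order):
        if pvalues[idx] <= alphas[rank]:
            rejected.append(idx)
        else:
            break  # stop: this and all larger p-values are not rejected
    return rejected


pvals = [0.030, 0.001, 0.200, 0.012]      # hypothetical p-values
alphas = [0.0125, 0.0167, 0.025, 0.05]    # hypothetical increasing constants
rejected_down = step_down(pvals, alphas)
print(rejected_down)  # [1, 3]
```

Here 0.001 and 0.012 pass their constants, 0.030 exceeds 0.025, and the procedure stops without looking at 0.200.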

Page 26: Analysis of gene expression data (Nominal explanatory variables)

Some popular controlling procedures

Step-up procedure:

Step 1: If $P_{(m)} \le \alpha_{(m)}$ then reject $H_{0(i)}$, $i = 1, 2, \ldots, m$, and stop. Else go to Step 2.

Step 2: If $P_{(m-1)} \le \alpha_{(m-1)}$ then reject $H_{0(i)}$, $i = 1, 2, \ldots, m-1$, and stop. Else go to Step 3.

Step 3: If $P_{(m-2)} \le \alpha_{(m-2)}$ then reject $H_{0(i)}$, $i = 1, 2, \ldots, m-2$, and stop. Else go to Step 4.

and so on!
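A step-up procedure scans in the opposite direction, starting from the largest p-value. As before, this is a sketch with a hypothetical function name and made-up inputs, not a reference implementation.

```python
def step_up(pvalues, alphas):
    """Return the indices (into pvalues) of the rejected hypotheses.

    Scanning starts at the largest p-value. At the first ordered p-value that
    is at or below its critical constant, that hypothesis and every hypothesis
    with a smaller p-value are rejected.
    """
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    for rank in range(len(order) - 1, -1, -1):
        if pvalues[order[rank]] <= alphas[rank]:
            return sorted(order[: rank + 1])
    return []


pvals = [0.022, 0.001, 0.200, 0.020]      # hypothetical p-values
alphas = [0.0125, 0.0167, 0.025, 0.05]    # hypothetical increasing constants
rejected_up = step_up(pvals, alphas)
print(rejected_up)  # [0, 1, 3]
```

With the same constants, a step-down scan of these p-values would stop at 0.020 (which exceeds 0.0167) and reject only one hypothesis, so the step-up version rejects at least as much.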

Page 27: Analysis of gene expression data (Nominal explanatory variables)

Some popular controlling procedures

Single-step procedure

A stepwise procedure with the same critical constant for all ‘m’ hypotheses: $\alpha_{(1)} = \alpha_{(2)} = \cdots = \alpha_{(m)}$.

Page 28: Analysis of gene expression data (Nominal explanatory variables)

Some typical stepwise procedures: FWER controlling procedures

Bonferroni: A single-step procedure with $\alpha_i = \alpha/m$.

Sidak: A single-step procedure with $\alpha_i = 1 - (1-\alpha)^{1/m}$.

Holm: A step-down procedure with $\alpha_i = \alpha/(m-i+1)$.

Hochberg: A step-up procedure with $\alpha_i = \alpha/(m-i+1)$.

minP method: A resampling-based single-step procedure with $\alpha_i = c_\alpha$, where $c_\alpha$ is the $\alpha$ quantile of the distribution of the minimum p-value.
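The critical constants of the first four procedures are simple enough to tabulate directly. The helper names below are hypothetical, and the minP constant is left out because it requires resampling the data.

```python
def bonferroni_constants(m, alpha=0.05):
    """Single-step: the same constant alpha/m for every hypothesis."""
    return [alpha / m] * m


def sidak_constants(m, alpha=0.05):
    """Single-step: the same constant 1 - (1 - alpha)^(1/m) for every hypothesis."""
    return [1.0 - (1.0 - alpha) ** (1.0 / m)] * m


def holm_hochberg_constants(m, alpha=0.05):
    """alpha_i = alpha / (m - i + 1), i = 1..m (Holm: step-down; Hochberg: step-up)."""
    return [alpha / (m - i + 1) for i in range(1, m + 1)]


m = 5
bonf = bonferroni_constants(m)
sidak = sidak_constants(m)
holm = holm_hochberg_constants(m)
print([round(a, 5) for a in bonf])
print([round(a, 5) for a in sidak])
print([round(a, 5) for a in holm])
```

For m = 5 the Sidak constant is slightly larger than the Bonferroni constant, and the Holm/Hochberg constants grow from α/m up to α.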

Page 29: Analysis of gene expression data (Nominal explanatory variables)

Comments on the methods

Bonferroni: Very general but can be too conservative for large number of hypotheses.

Sidak: More powerful than Bonferroni, but applicable when the test statistics are independent or have certain types of positive dependence.

Page 30: Analysis of gene expression data (Nominal explanatory variables)

Comments on the methods

Holm: More powerful than Bonferroni and is applicable for any type of dependence structure between test statistics.

Hochberg: More powerful than Holm’s procedure, but the test statistics should either be independent or have the MTP2 property.

Page 31: Analysis of gene expression data (Nominal explanatory variables)

Comments on the methods

Multivariate Total Positivity of Order 2 (MTP2)

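f(x) is said to be MTP2 if for all $x, y \in R^p$,

$$f(x \wedge y)\, f(x \vee y) \ge f(x)\, f(y)$$

where $x \wedge y$ and $x \vee y$ denote the componentwise minimum and maximum of x and y.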

Page 32: Analysis of gene expression data (Nominal explanatory variables)

Some typical stepwise procedures: FDR controlling procedure

Benjamini-Hochberg:

A step-up procedure with $\alpha_i = i\alpha/m$.
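A minimal sketch of the Benjamini-Hochberg step-up rule follows, with a single-step Bonferroni comparison on the same inputs. The p-values are invented for the example and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Step-up procedure with constants i * alpha / m; returns rejected indices."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    for rank in range(m, 0, -1):  # rank plays the role of i in i * alpha / m
        if pvalues[order[rank - 1]] <= rank * alpha / m:
            return sorted(order[:rank])
    return []


pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
bh_rejected = benjamini_hochberg(pvals)
bonf_rejected = [i for i, p in enumerate(pvals) if p <= 0.05 / len(pvals)]
print(bh_rejected)    # [0, 1]
print(bonf_rejected)  # [0]
```

On these inputs the FDR-controlling rule rejects two hypotheses while Bonferroni rejects one.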

Page 33: Analysis of gene expression data (Nominal explanatory variables)

An Illustration

Lobenhofer et al. (2002) data:

Expose breast cancer cells to estradiol for 1 hour or (12, 24, 36 hours).

Number of genes on the cDNA 2 spot array - 1900.

Number of samples per time point: 8.

Compare 1 hour with (12, 24 and 36 hours) using a two-sided bootstrap t-test.

Page 34: Analysis of gene expression data (Nominal explanatory variables)

Some Popular Methods of Analysis

Page 35: Analysis of gene expression data (Nominal explanatory variables)

1. Fold-change

Page 36: Analysis of gene expression data (Nominal explanatory variables)

1. Fold-change in gene expression

For gene “g” compute the fold change between two conditions (e.g. treatment and control):

$$f_g = \frac{\bar{X}_{trt}}{\bar{X}_{cont}}$$

Page 37: Analysis of gene expression data (Nominal explanatory variables)

1. Fold-change in gene expression

$R_1, R_2$: pre-defined constants.

$f_g \ge R_1$: gene “g” is “up-regulated”.

$f_g \le R_2$: gene “g” is “down-regulated”.
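The fold-change rule amounts to a ratio of group means compared with two cut-offs. In this sketch the thresholds R1 = 2 and R2 = 0.5, the use of "greater/less than or equal", and the expression values (on the raw, not log, scale) are all assumptions chosen for illustration.

```python
from statistics import mean


def fold_change(trt, cont):
    """Ratio of mean expression under treatment to mean expression under control."""
    return mean(trt) / mean(cont)


def classify(fc, r1=2.0, r2=0.5):
    """Label a gene from its fold change using hypothetical cut-offs r1 and r2."""
    if fc >= r1:
        return "up-regulated"
    if fc <= r2:
        return "down-regulated"
    return "unchanged"


genes = {
    "gene_a": ([200, 220, 180], [100, 90, 110]),
    "gene_b": ([50, 40, 60], [100, 110, 90]),
    "gene_c": ([100, 105, 95], [100, 98, 102]),
}
calls = {name: classify(fold_change(trt, cont)) for name, (trt, cont) in genes.items()}
print(calls)
```

Note that nothing in the rule uses the spread of the replicates, which is the drawback discussed on the next slide.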

Page 38: Analysis of gene expression data (Nominal explanatory variables)

1. Fold-change in gene expression

Strengths:

– Simple to implement.

– Biologists find it very easy to interpret.

– It is widely used.

Drawbacks:

– Ignores variability in mean gene expression.

– Genes with subtle changes in expression can be overlooked, i.e. potentially high false negative rates.

– Conversely, high false positive rates are also possible.

Page 39: Analysis of gene expression data (Nominal explanatory variables)

2. t-test type procedures

Page 40: Analysis of gene expression data (Nominal explanatory variables)

2.1 Permutation t-test

For each gene “g” compute the standard two-sample t-statistic:

$$t_g = \frac{\bar{X}_{g,trt} - \bar{X}_{g,cont}}{S_g\sqrt{\dfrac{1}{n_{trt}} + \dfrac{1}{n_{cont}}}}$$

where $\bar{X}_{g,trt}, \bar{X}_{g,cont}$ are the sample means and $S_g$ is the pooled sample standard deviation.

Page 41: Analysis of gene expression data (Nominal explanatory variables)

2.1 Permutation t-test

Statistical significance of a gene is determined by computing the null distribution of $t_g$ using either a permutation or a bootstrap procedure.
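For small samples the permutation null distribution can be enumerated exactly. The sketch below computes the pooled two-sample t-statistic for one gene and a two-sided p-value over all relabellings of the pooled sample. The data are invented, the function names are hypothetical, and the code assumes no relabelling produces a zero pooled standard deviation.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pooled_t(trt, cont):
    """Two-sample t-statistic with pooled sample standard deviation."""
    n1, n2 = len(trt), len(cont)
    m1, m2 = mean(trt), mean(cont)
    ss = sum((x - m1) ** 2 for x in trt) + sum((x - m2) ** 2 for x in cont)
    s = sqrt(ss / (n1 + n2 - 2))  # pooled sample standard deviation
    return (m1 - m2) / (s * sqrt(1 / n1 + 1 / n2))


def permutation_pvalue(trt, cont):
    """Two-sided p-value: share of relabellings with |t| at least the observed |t|."""
    pooled = list(trt) + list(cont)
    n1 = len(trt)
    observed = abs(pooled_t(trt, cont))
    count = total = 0
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(pooled_t(a, b)) >= observed - 1e-12:  # tolerance for rounding
            count += 1
    return count / total


trt = [8.1, 7.9, 8.4, 8.6]    # hypothetical log-expression, treatment
cont = [7.0, 7.2, 6.8, 7.1]   # hypothetical log-expression, control
p_perm = permutation_pvalue(trt, cont)
print(round(p_perm, 4))
```

With four samples per group there are only 70 relabellings, so the smallest attainable two-sided p-value is 2/70. This granularity is one practical consequence of the small sample sizes mentioned earlier.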

Page 42: Analysis of gene expression data (Nominal explanatory variables)

2.1 Permutation t-test

Strengths:

– Simple to implement.

– Biologists find it very easy to interpret.

– It is widely used.

Drawback:

– Potentially, for some genes the pooled sample standard deviation could be very small and hence it may result in inflated Type I errors and inflated false discovery rates.

Page 43: Analysis of gene expression data (Nominal explanatory variables)

2.2 SAM procedure (Significance Analysis of Microarrays)

(Tusher et al., PNAS 2001)

For each gene “g” modify the standard two-sample t-statistic as:

$$d_g = \frac{\bar{X}_{g,trt} - \bar{X}_{g,cont}}{s_0 + S_g\sqrt{\dfrac{1}{n_{trt}} + \dfrac{1}{n_{cont}}}}$$

The “fudge” factor $s_0$ is obtained such that the coefficient of variation in the above test statistic is minimized.
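The effect of the fudge factor can be seen on a gene with a very small pooled standard deviation. In this sketch the value of s0 is simply supplied by hand as an assumption; it is not estimated by minimizing the coefficient of variation, and the data and function name are made up.

```python
from math import sqrt
from statistics import mean


def sam_statistic(trt, cont, s0):
    """Modified t-statistic with a constant s0 added to the denominator."""
    n1, n2 = len(trt), len(cont)
    m1, m2 = mean(trt), mean(cont)
    ss = sum((x - m1) ** 2 for x in trt) + sum((x - m2) ** 2 for x in cont)
    s = sqrt(ss / (n1 + n2 - 2))  # pooled sample standard deviation
    return (m1 - m2) / (s0 + s * sqrt(1 / n1 + 1 / n2))


# A gene with a tiny mean difference and an even tinier spread.
trt = [5.01, 5.02, 5.03]
cont = [5.00, 4.99, 5.01]
d_plain = sam_statistic(trt, cont, s0=0.0)    # reduces to the ordinary t-statistic
d_shrunk = sam_statistic(trt, cont, s0=0.1)   # hand-picked fudge factor
print(round(d_plain, 3), round(d_shrunk, 3))
```

The ordinary statistic is large only because the denominator is close to zero; adding s0 pulls it back towards zero.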

Page 44: Analysis of gene expression data (Nominal explanatory variables)

3. F-test and its variations for more than 2 nominal conditions

Usual F-test and the P-values can be obtained by a suitable permutation procedure.

Regularized F-test: Generalization of Baldi and Long methodology for multiple groups.

– It better controls the false discovery rate, and its power is comparable to that of the F-test.

Cui and Churchill (2003) is a good review paper.

Page 45: Analysis of gene expression data (Nominal explanatory variables)

4. Linear fixed effects models

Effects:

– Array (A) - sample

– Dye (D)

– Variety (V) – test groups

– Genes (G)

– Expression (Y)

Page 46: Analysis of gene expression data (Nominal explanatory variables)

4. Linear fixed effects models (Kerr, Martin, and Churchill, 2000)

Linear fixed effects model:

$$\log(Y_{ijkg}) = \mu + A_i + D_j + G_g + (AD)_{ij} + (AG)_{ig} + (DG)_{jg} + (VG)_{kg} + \varepsilon_{ijkg}$$

$$\varepsilon_{ijkg} \overset{iid}{\sim} N(0, \sigma^2).$$

$$H_0: (VG)_{kg} = 0 \text{ for all } k = 1, 2, \ldots, v$$

Page 47: Analysis of gene expression data (Nominal explanatory variables)

4. Linear fixed effects models

All effects are assumed to be fixed effects.

Main drawback – all genes have same variance!

Page 48: Analysis of gene expression data (Nominal explanatory variables)

5. Linear mixed effects models (Wolfinger et al. 2001)

Stage 1 (Global normalization model)

$$\log(Y_{gij}) = \mu + T_i + A_j + (TA)_{ij} + \varepsilon_{gij}$$

Stage 2 (Gene specific model)

$$\hat{\varepsilon}_{gij} = G_g + (GT)_{gi} + (GA)_{gj} + \gamma_{gij}$$

Page 49: Analysis of gene expression data (Nominal explanatory variables)

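5. Linear mixed effects models

Assumptions:

$$A_j \overset{iid}{\sim} N(0, \sigma_A^2), \qquad (TA)_{ij} \overset{iid}{\sim} N(0, \sigma_{TA}^2)$$

$$\varepsilon_{gij} \overset{iid}{\sim} N(0, \sigma_\varepsilon^2), \qquad (GA)_{gj} \overset{iid}{\sim} N(0, \sigma_{GA_g}^2)$$

$$\gamma_{gij} \overset{iid}{\sim} N(0, \sigma_g^2)$$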

Page 50: Analysis of gene expression data (Nominal explanatory variables)

5. Linear mixed effects models (Wolfinger et al. 2001)

Perform inferences on the interaction term $(GT)_{gi}$.

Page 51: Analysis of gene expression data (Nominal explanatory variables)

A popular graphical representation: The Volcano Plots

A scatter plot of $-\log_{10}(p\text{-value})$ vs $\log_2(\text{fold change})$.

Genes with large fold change will lie outside a pair of vertical “threshold” lines. Further, genes which are highly significant with large fold change will lie either in the upper right hand or upper left hand corner.
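The coordinates of a volcano plot are easy to compute without any plotting library. The sketch below assumes the vertical axis is the negative base-10 logarithm of the p-value, and it uses made-up genes and hypothetical thresholds (fold change of 2 in either direction, p-value of 0.01) to flag the points that would land in the upper corners.

```python
from math import log2, log10


def volcano_point(fold_change, p_value):
    """Return (x, y) = (log2 fold change, -log10 p-value) for one gene."""
    return log2(fold_change), -log10(p_value)


def in_upper_corner(fold_change, p_value, fc_threshold=2.0, p_threshold=0.01):
    """True if the gene lies outside the vertical lines and above the horizontal one."""
    x, y = volcano_point(fold_change, p_value)
    return abs(x) >= log2(fc_threshold) and y >= -log10(p_threshold)


genes = {
    "g1": (4.0, 0.0001),    # large increase, highly significant
    "g2": (0.25, 0.001),    # large decrease, highly significant
    "g3": (3.0, 0.2),       # large change, not significant
    "g4": (1.1, 0.00001),   # significant, but negligible change
}
flags = {name: in_upper_corner(fc, p) for name, (fc, p) in genes.items()}
print(flags)
```

Taking log2 of the fold change makes increases and decreases symmetric about zero, which is why the two interesting regions are the upper right and upper left corners.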

Page 52: Analysis of gene expression data (Nominal explanatory variables)
Page 53: Analysis of gene expression data (Nominal explanatory variables)

A useful review article

Cui, X. and Churchill, G (2003), Genome Biology.

Software:

R package: statistics for microarray analysis.

http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html

SAM: Significance Analysis of Microarray. http://www-stat.stanford.edu/%7Etibs/SAM

Page 54: Analysis of gene expression data (Nominal explanatory variables)

Supervised classification algorithms

Page 55: Analysis of gene expression data (Nominal explanatory variables)

Discriminant analysis based methods

A. Linear and Quadratic Discriminant analysis based methods:

Strength:

– Well studied in the classical statistics literature.

Limitations:

– Based on normality.

– Imposes constraints on the covariance matrices. Need to be concerned about the singularity issue.

– No convenient strategy has been proposed in the literature to select the “best” discriminating subset of genes.

Page 56: Analysis of gene expression data (Nominal explanatory variables)

Discriminant analysis based methods

B. Nonparametric classification using Genetic Algorithm and K-nearest neighbors.

– Li et al. (Bioinformatics, 2001)

Strengths:

– Entirely nonparametric.

– Takes into account the underlying dependence structure among genes.

– Does not require the estimation of a covariance matrix.

Weakness:

– Computationally very intensive.

Page 57: Analysis of gene expression data (Nominal explanatory variables)

GA/KNN methodology – very brief description

Computes the Euclidean distance between all pairs of samples based on a sub-vector of, say, 50 genes.

Assigns each sample to a treatment group (i.e. condition) based on its K nearest neighbors.

Computes a fitness score for each subset of genes based on how many samples are correctly classified. This is the objective function.

The objective function is optimized using Genetic Algorithm
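The fitness score described above can be sketched with a leave-one-out K-nearest-neighbors classifier restricted to a chosen subset of genes. This is an illustration under stated assumptions, not the published GA/KNN code: it uses a simple majority vote among the k neighbors, a tiny invented data set with three genes, and hypothetical function names.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, query, k=3):
    """Majority label among the k training samples closest to query (Euclidean)."""
    nearest = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def fitness(samples, labels, gene_subset, k=3):
    """Number of samples classified correctly by leave-one-out KNN on gene_subset."""
    sub = [[s[g] for g in gene_subset] for s in samples]
    correct = 0
    for i in range(len(sub)):
        train_x = sub[:i] + sub[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if knn_predict(train_x, train_y, sub[i], k) == labels[i]:
            correct += 1
    return correct


# Hypothetical data: genes 0 and 1 separate the groups, gene 2 is noise.
samples = [
    [1.0, 1.2, 5.0], [1.1, 0.9, 1.0], [0.9, 1.0, 3.0],   # treated
    [3.0, 3.1, 1.2], [3.2, 2.9, 4.8], [2.9, 3.0, 3.3],   # control
]
labels = ["trt", "trt", "trt", "cont", "cont", "cont"]
score_good = fitness(samples, labels, gene_subset=[0, 1])
score_noise = fitness(samples, labels, gene_subset=[2])
print(score_good, score_noise)
```

A subset of informative genes classifies every sample correctly, while the noise gene does not; the genetic algorithm searches for subsets with high scores.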

Page 58: Analysis of gene expression data (Nominal explanatory variables)

[Figure: K-nearest neighbors classification (k=3). Axes: expression levels of gene 1 and expression levels of gene 2; the sample to be classified is marked X.]

Page 59: Analysis of gene expression data (Nominal explanatory variables)

[Figure: Subcategories within a class. Axes: expression levels of gene 1 and expression levels of gene 2.]

Page 60: Analysis of gene expression data (Nominal explanatory variables)

Advantages of KNN approach

Simple, performs as well as or better than more complex methods

Free from assumptions such as normality of the distribution of expression levels

Multivariate: takes account of dependence in expression levels

Accommodates or even identifies distinct subtypes within a class

Page 61: Analysis of gene expression data (Nominal explanatory variables)

Expression data: many genes and few samples

There may be many subsets of genes that can statistically discriminate between the treated and untreated.

There are too many possible subsets to look at. With 3,000 genes, there are about $10^{72}$ ways to make subsets of size 30.
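The size of this search space can be checked with the standard library. The snippet counts the subsets of size 30 from 3,000 genes and reports the order of magnitude.

```python
import math

n_subsets = math.comb(3000, 30)           # number of 30-gene subsets of 3,000 genes
magnitude = math.log10(n_subsets)         # order of magnitude of that count
print(f"about 10^{magnitude:.1f} subsets")
```

Exhaustive search is clearly out of reach at this scale, which motivates a stochastic search such as the genetic algorithm.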

Page 62: Analysis of gene expression data (Nominal explanatory variables)

The genetic algorithm

Computer algorithm (John Holland) that works by mimicking Darwin's natural selection

Has been applied to many optimization problems ranging from engine design to protein folding and sequence alignment

Effective in searching high dimensional space

Page 63: Analysis of gene expression data (Nominal explanatory variables)

GA works by mimicking evolution

Randomly select sets (“chromosomes”) of 30 genes from all the genes on the chip

Evaluate the “fitness” of each “chromosome” – how well can it separate the treated from the untreated?

Pass “chromosomes” randomly to next generation, with preference for the fittest
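The three steps above can be sketched as a very small genetic algorithm. This is a toy under several assumptions that are not taken from the talk: a stand-in fitness function that counts how many genes of a hypothetical "informative" set a chromosome contains, selection with probability proportional to fitness plus one, a single-gene mutation per offspring, no crossover, and retention of the best chromosome found so far.

```python
import random


def mutate(chrom, n_genes, rng):
    """Replace one randomly chosen gene with a gene not already in the chromosome."""
    child = list(chrom)
    pos = rng.randrange(len(child))
    candidates = [g for g in range(n_genes) if g not in child]
    child[pos] = rng.choice(candidates)
    return child


def evolve(n_genes, chrom_size, fitness, pop_size=20, generations=30, seed=0):
    """Return the fittest chromosome (list of gene indices) found by the search."""
    rng = random.Random(seed)
    # Randomly select sets ("chromosomes") of genes.
    population = [rng.sample(range(n_genes), chrom_size) for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        scores = [fitness(c) for c in population]
        # Pass chromosomes on at random, with preference for the fittest.
        parents = rng.choices(population, weights=[s + 1 for s in scores], k=pop_size)
        population = [mutate(p, n_genes, rng) for p in parents]
        population[0] = best  # keep the best so far (an added assumption)
        best = max(population, key=fitness)
    return best


informative = set(range(5))  # hypothetical informative genes


def toy_fitness(chrom):
    """Stand-in for classification accuracy: count of informative genes present."""
    return len(informative & set(chrom))


best_chrom = evolve(n_genes=50, chrom_size=5, fitness=toy_fitness)
print(sorted(best_chrom), toy_fitness(best_chrom))
```

In the GA/KNN setting the toy fitness would be replaced by the number of samples correctly classified by K-nearest neighbors on the genes in the chromosome.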

Page 64: Analysis of gene expression data (Nominal explanatory variables)

Summary

Pay attention to multiple testing problem.

– Use FDR over FWER for large data sets such as gene expression microarrays

Linear mixed effects models may be used for comparing expression data between groups.

For classification problem, one may want to consider GA/KNN approach.
