Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad...

Bioinformatics

Expression profiling and functional genomics

Part II: Differential expression

Ad 27/11/2006

ANOVA based

Filtering

Linearisation

Bootstrapping

Log transformation

Array by array approach

Filtering

normalization

Ratio

Test statistic (T-test)

Log transformation

Preprocessing

Background correction

Overview further analysis

Raw data

Preprocessed data

Differentially expressed genes

Clusters of coexpressed genes

Preprocessing

Clustering

Test statistic

Comparison of 2 experiments:

• Fold test

• T-test

• SAM

• …

A plethora of different methods is available

Which one performs best?

Different underlying statistical assumptions

Implication on the final result

Difficult to define the best method

Test Statistic

Preprocessing: test statistic

Type1: Comparison of 2 samples

Statistical testing

Control sample

Induced sample

Retrieve statistically over or under expressed genes

Diff Expr Genes: test statistic

black/white experiment description (array V mice genes)

• Condition 1: pygmy mouse, 10 days old (test)

• Condition 2: normal mouse, 10 days old (ref)

detect differentially expressed genes

Experiment design (Latin Square)

Array 1

• Condition 1, dye 1, replica L

• Condition 1, dye 1, replica R

• Condition 2, dye 2, replica L

• Condition 2, dye 2, replica R

Array 2

• Condition 2, dye 1, replica L

• Condition 2, dye 1, replica R

• Condition 1, dye 2, replica L

• Condition 1, dye 2, replica R

Per gene, per condition 4 measurements available

Diff Expr Genes : test statistic

Fold change (ratio test)

4 measurements per gene, condition

Calculate average

Sort averages

|log2(sample/control)| > threshold (usually a 2-fold change, i.e. a log2 ratio of 1)

• Arbitrary threshold

• Discards all information obtained from replicates

• Implicitly assumes constant variance but variance depends on expression value

$M_i = I_{i,\mathrm{test}} / I_{i,\mathrm{ref}}$


Why does fold change fail:

• Majority of genes expressed at low levels where signal/noise is low

=> not sufficiently conservative

– 2 fold change occurs at random for a large number of genes

– High number of false positives

• At higher levels of expression, smaller changes in gene expression may be real

=> too conservative

– High number of false negatives
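The two failure modes listed above can be illustrated with a minimal sketch. The intensities below are hypothetical, and the helper names `log2_fold_change` and `fold_test` are illustrative choices, not part of any package mentioned in the slides. The sketch averages the 4 measurements per condition and applies a 2-fold cutoff on the log2 ratio.

```python
import math

def log2_fold_change(test, ref):
    """Log2 ratio of the mean test intensity to the mean reference intensity."""
    mean_test = sum(test) / len(test)
    mean_ref = sum(ref) / len(ref)
    return math.log2(mean_test / mean_ref)

def fold_test(test, ref, fold=2.0):
    """True if the averaged change reaches the fold cutoff in either direction."""
    return abs(log2_fold_change(test, ref)) >= math.log2(fold)

# Hypothetical low-expression gene: noisy replicates, 2-fold on average.
low_test, low_ref = [40, 160, 60, 140], [50, 30, 70, 50]
# Hypothetical high-expression gene: very consistent 1.5-fold change.
high_test, high_ref = [1500, 1520, 1480, 1500], [1000, 1010, 990, 1000]

print(fold_test(low_test, low_ref))    # noisy gene passes the cutoff
print(fold_test(high_test, high_ref))  # consistent gene is missed
```

The averaging step throws away the spread of the replicates, which is why the noisy gene is called and the consistent one is not.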

Improvement:

– T-test

– pairwise fold change: genes are considered significantly differentially expressed if an R-fold change is observed consistently between paired samples

– SAM http://www-stat-class.stanford/SAM/SAMServlet


• Possible if replicates of reference and test are available

• Significance of the difference between the reference and test data (level of expression) relative to the observed level of within class variation (consistency)

• Assumptions

• Normal distribution of variables

• Population mean and variance estimated from data => (Student t distribution for H0 hypothesis)

• Not all genes need to have the same variance

• Under null hypothesis sample means should be equal (rescaling obligatory)

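$t_{\mathrm{unpaired}} = \dfrac{m_{\mathrm{test}} - m_{\mathrm{ref}}}{\sqrt{\dfrac{s_{\mathrm{test}}^2}{n_{\mathrm{test}}} + \dfrac{s_{\mathrm{ref}}^2}{n_{\mathrm{ref}}}}}$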

T-test: hypothesis test
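As a rough illustration of the unpaired statistic, the sketch below computes the difference of the sample means divided by the combined standard error, allowing each group its own variance. The function name `unpaired_t` and the log-ratio values are hypothetical. Converting the t-value to a p-value needs the Student t distribution, which is left to a statistics library here.

```python
import math
from statistics import mean, variance

def unpaired_t(test, ref):
    """Unpaired t-value: (m_test - m_ref) / sqrt(s_test^2/n_test + s_ref^2/n_ref)."""
    standard_error = math.sqrt(variance(test) / len(test) + variance(ref) / len(ref))
    return (mean(test) - mean(ref)) / standard_error

# Hypothetical log-transformed expression values, 4 replicates per condition.
test_values = [2.0, 2.2, 1.8, 2.0]
ref_values = [1.0, 1.1, 0.9, 1.0]

t_value = unpaired_t(test_values, ref_values)
print(round(t_value, 2))
```

`statistics.variance` uses the sample variance (division by n - 1), which matches estimating the variance from the data.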


• Consider paired data as new variable

• Calculate average ratio

• Calculate standard deviation of the 4 ratio measurements

Determine t-value

df, student t distribution, t-value

$T = \dfrac{\bar{D} - 0}{S_D / \sqrt{n}}$

p-value

p-value (the probability of obtaining a test statistic at least as extreme as the one observed, if the null hypothesis is true)

Paired t-test (microarray data are paired)
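The paired steps above (treat each pair as one new variable, average it, take its standard deviation, form the t-value) can be sketched as follows. The function name `paired_t` and the four paired log intensities are hypothetical. The p-value lookup in the Student t distribution with n - 1 degrees of freedom is not shown.

```python
import math
from statistics import mean, stdev

def paired_t(test, ref):
    """Return (t-value, degrees of freedom) for the paired differences D = test - ref."""
    differences = [t - r for t, r in zip(test, ref)]
    n = len(differences)
    t_value = (mean(differences) - 0) / (stdev(differences) / math.sqrt(n))
    return t_value, n - 1

# Hypothetical log intensities: large spot-to-spot variation, consistent difference.
test_values = [8.1, 9.0, 7.5, 8.4]
ref_values = [7.1, 8.1, 6.4, 7.4]

t_value, df = paired_t(test_values, ref_values)
print(round(t_value, 1), df)
```

Because the pairing removes the variation shared by both channels of a spot, the t-value here is much larger than an unpaired test on the same numbers would give.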


• Classical hypothesis tests (t-test, Wilcoxon rank-sum test, ...):

– a test statistic is calculated (t-value)

– the probability or p-value is calculated that an equally good or better test statistic is generated if a certain null hypothesis is true

– The null hypothesis: gene has no difference in mean expression levels between 2 conditions

– Low p-value (below rejection level α): the null hypothesis is not likely, so it is rejected: there is a difference in (mean) expression between the two classes

t-test

H0 H1

H0: D=0

H1: D ≠ 0

Gene x

Type I

Type II


Comparison of fold test with paired t-test

• Gene expression levels measured under two different conditions

• Rejection level

– p_j < α: null hypothesis rejected (result Positive)

– p_j > α: null hypothesis not rejected (result Negative)

• But: Multiple testing: Type I and Type II error

= False positives and negatives


• Each gene is assigned a score on the basis of its change in expression relative to the standard deviation of repeated measurements for that gene

• H0 (expected relative difference) is estimated by permutation analysis

– Permute the samples

– Calculate d(i) values for both the experimental samples and the permuted control samples

– Rank genes by magnitude of their d(i) values for both the experimental and the permuted control samples

SAM


Observed values

• Calculate the d(i) value for each gene

• Rank genes according to their d(i) value

Simulated values

• Permute dataset

• Calculate the d(i) value for each gene in each permuted dataset

• Calculate the average d(i) value for each rank

• Rank d(i) values

• Make scatterplot

SAM
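The observed-versus-simulated procedure above can be sketched in a simplified form. Several choices here are assumptions and not taken from the slides: the pooled standard error, the small constant `s0` added to the denominator, the number of permutations, and the rule that a gene is called when its observed d(i) differs from the expected value at the same rank by more than `delta`. The data are simulated, and the names `d_value` and `sam_sketch` are hypothetical.

```python
import math
import random

def d_value(values, labels, s0=0.1):
    """Relative difference d(i): mean difference over (pooled standard error + s0)."""
    test = [v for v, lab in zip(values, labels) if lab == 1]
    ref = [v for v, lab in zip(values, labels) if lab == 0]
    m_test = sum(test) / len(test)
    m_ref = sum(ref) / len(ref)
    ss = sum((v - m_test) ** 2 for v in test) + sum((v - m_ref) ** 2 for v in ref)
    s = math.sqrt((1 / len(test) + 1 / len(ref)) * ss / (len(values) - 2))
    return (m_test - m_ref) / (s + s0)

def sam_sketch(data, labels, delta, n_perm=200, seed=0):
    """Return indices of genes whose observed d(i) departs from the expected d(i)."""
    rng = random.Random(seed)
    # Observed values: one d(i) per gene, ranked.
    observed = sorted((d_value(row, labels), gene) for gene, row in enumerate(data))
    # Simulated values: ranked d(i) from permuted labels, averaged per rank.
    expected = [0.0] * len(data)
    for _ in range(n_perm):
        permuted = labels[:]
        rng.shuffle(permuted)
        ranked = sorted(d_value(row, permuted) for row in data)
        for rank, d in enumerate(ranked):
            expected[rank] += d / n_perm
    return [gene for (d, gene), e in zip(observed, expected) if abs(d - e) > delta]

# Simulated data: 50 genes, 4 test and 4 reference measurements each.
rng = random.Random(1)
labels = [1, 1, 1, 1, 0, 0, 0, 0]
data = [[rng.gauss(0, 1) for _ in labels] for _ in range(50)]
data[0] = [10.0, 10.1, 9.9, 10.0, 0.0, 0.1, -0.1, 0.0]  # one strongly induced gene

significant = sam_sketch(data, labels, delta=5.0)
print(significant)
```

Plotting the ranked observed values against the `expected` list would give the scatterplot described in the last step.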


SAM plot: observed d(i) (vertical axis) against expected d(i) (horizontal axis). Significant: 37; median number of false significant: 2; delta: 1.2; threshold: 0.

SAM


Test statistic | Assumptions | Distribution H0

T-test | Errors normally distributed; errors equal variance (iid) | Parametrized: Student t-distribution

Paired t-test | Errors normally distributed; less stringent assumption | Parametrized: Student t-distribution

SAM | No explicit assumption | Order statistics

Restricted number of repeat measurements: impossible to evaluate the assumption


Multiple testing: problem

• P value: measure of significance in terms of the false positive rate

• The rate that truly null features are called significant

• Significance is 5%: on average 5% of the truly null features will be called significant (type-I error)

• Type I error: Null hypothesis rejected when it is true – ‘accidental’ low p-value – falsely declared differentially expressed = false positive

• Multiple testing: Example: 10000 genes with random expression profiles, α = 5%: one would find about 500 genes with a p-value lower than 5% = false positives

• Type II error: Null hypothesis not rejected when it is not true (false negatives). Gene that is actually differentially expressed is not declared differentially expressed.

Adapted from De Smet et al
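The 10000-gene example can be checked with a small simulation. It relies on the standard fact that p-values are uniformly distributed between 0 and 1 when the null hypothesis is true. The seed and variable names are arbitrary choices.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
n_genes = 10000
alpha = 0.05

# Under the null hypothesis every p-value is a uniform random number.
p_values = [rng.random() for _ in range(n_genes)]

false_positives = sum(p < alpha for p in p_values)
print(false_positives)  # close to n_genes * alpha = 500
```

Every gene counted here is a false positive by construction, since no gene is truly differentially expressed.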


Multiple testing: solutions

• Control of the familywise error rate (FWE): P(FP ≥ 1), protection against type I errors

• Bonferroni correction: reject the null hypothesis at rejection level α/N, which guarantees that FWE = P(FP ≥ 1) ≤ α

• Is OK when very few genes are expected to be actually differentially expressed (i.e., affected by the difference in conditions / for which the null hypothesis is false): every false positive is ‘costly’

• Rejection level becomes very conservative

• But in microarray data, usually a considerable number of genes is actually differentially expressed: control of the FWE results in a severe loss of statistical power (FN or type II error is large)

• In practice we do not have to protect against every possible FP

Better solution: FDR (false discovery rate)
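A minimal sketch of the Bonferroni correction: each p-value is compared with the rejection level divided by the number of tests N. The function name `bonferroni` and the p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a null hypothesis only if its p-value is below alpha / N."""
    cutoff = alpha / len(p_values)
    return [p < cutoff for p in p_values]

# Five hypothetical p-values: the cutoff becomes 0.05 / 5 = 0.01.
p_values = [0.00001, 0.004, 0.03, 0.2, 0.6]
rejected = bonferroni(p_values)
print(rejected)  # [True, True, False, False, False]
```

With thousands of genes the cutoff shrinks accordingly (0.05 / 10000 = 0.000005), which shows how conservative the rejection level becomes.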



• We need a sensible balance between the number of true positives and the number of false positives

• Therefore it is better to control the ‘False Discovery Rate’ (FDR) instead of the FWE:

• The false positive rate: The rate that truly null features are called significant

• The FDR: = % of false positives among all the genes that are declared positive = % of true null hypotheses erroneously rejected among all the null hypotheses rejected


$FDR = E\left(\dfrac{F}{F+T}\right) = E\left(\dfrac{F}{S}\right)$

FDR


Difference p-value and FDR

• 5% FDR: 5% false positives among the features called significant

• 5% p value cutoff: 5% false positives among all the null features in the dataset, says little about the content of the features actually called significant


• An estimate of E[S(t)] is the observed S(t): i = the number of observed p-values ≤ p_i

• E[F(t)] = N0 · p_i

• Estimate N0

$FDR = E\left(\dfrac{F}{F+T}\right) = E\left(\dfrac{F}{S}\right)$

Figure: p-value distributions. No real differential expression (randomised data set): uniform distribution. Non-accidental differential expression: superposition of two distributions; the rejection level separates TP and FP from FN and TN.

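$\hat{\pi}_0(\lambda) = \dfrac{\text{number of genes with p-value} > \lambda}{N(1-\lambda)}$

$FDR(p_i) = \dfrac{\hat{\pi}_0 \cdot N \cdot p_i}{i} = \dfrac{N_0 \cdot p_i}{i}$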
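The FDR estimate for the i-th smallest p-value, expected false positives N0 · p_i divided by the i genes called significant, can be sketched as below. The way N0 is estimated is an assumption of this sketch: it counts the p-values above a cutoff `lam` (hypothetical parameter, 0.5 here) and rescales, on the reasoning that large p-values come almost only from truly null genes with uniformly distributed p-values. Function names and p-values are hypothetical.

```python
def estimate_n0(p_values, lam=0.5):
    """Estimate the number of truly null genes from the p-values above lam."""
    return sum(p > lam for p in p_values) / (1 - lam)

def fdr_estimates(p_values, n0):
    """Return (p_i, estimated FDR) pairs, FDR(p_i) = n0 * p_i / i, capped at 1."""
    ordered = sorted(p_values)
    return [(p, min(1.0, n0 * p / i)) for i, p in enumerate(ordered, start=1)]

# Eight hypothetical p-values.
p_values = [0.001, 0.004, 0.01, 0.02, 0.3, 0.55, 0.7, 0.9]

n0 = estimate_n0(p_values)
for p, fdr in fdr_estimates(p_values, n0):
    print(p, round(fdr, 3))
```

Setting `n0` to the total number of genes instead gives a more conservative estimate that assumes every gene could be null.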

Overview

MICROARRAY PREPROCESSING

• Gene expression

• Omics era

• Transcript profiling

• Experiment design

• Preprocessing

• Slide by slide normalisation

• ANOVA

• Exercises
