
Key ingredients for RNA-seq differential analysis
Neutral comparison study

Etienne Delannoy & Marie-Laure Martin-Magniette

Plant Science Institut of Paris-Saclay (IPS2)

Applied Mathematics and Informatics Unit at AgroParisTech

E. Delannoy & M.-L. Martin-Magniette Differential analysis INRA 1 / 21


Objective of the differential analysis

The aim is to identify a significant difference of expression between two given conditions.

It is performed with a hypothesis test based on gene expression measurements:

H0 = {There is no difference} versus H1 = {There is a difference}


Key steps for a test procedure

Construction of a test
- Formulate the two hypotheses
- Construct the test statistic
- Define its distribution under the null hypothesis
- Calculate the p-value
- Decide whether the null hypothesis is rejected or not with respect to the value of the test statistic

Definition of a p-value
It is the probability of seeing a result as extreme as or more extreme than the observed data when the null hypothesis is true.
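The test-construction steps can be sketched on a toy example. This is a minimal illustration, not one of the methods compared in this study: it assumes a difference-in-means statistic and an exact permutation distribution under the null hypothesis, and the expression values are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    # Test statistic: absolute difference between the two condition means.
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    total = 0
    # Under H0 the condition labels are exchangeable, so every way of
    # assigning n_a values to the first condition is equally likely.
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    # p-value: share of relabelings at least as extreme as the observed one.
    return extreme / total


# Hypothetical log-expression values for one gene, three replicates each.
condition_a = [5.1, 5.3, 5.2]
condition_b = [7.0, 6.8, 7.1]
print(permutation_p_value(condition_a, condition_b))
```

With three replicates per condition there are only 20 possible relabelings, so the smallest achievable two-sided p-value is 2/20 = 0.1, which illustrates why small designs need a parametric model of the counts.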


Multiple testing

The result of a test can be viewed as a random variable:
- 0 if the result is a true positive
- 1 if the result is a false positive

By definition, P(to be a false positive) = α.
If 10,000 tests are performed at level α = 0.05 and every null hypothesis is true, then the expected number of false positives is 10,000 × 0.05 = 500.
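The figure of 500 can be reproduced with a short simulation. This sketch assumes α = 0.05 (the level for which 10,000 tests give 500 expected false positives), that every null hypothesis is true, and that p-values are then uniform on [0, 1]; the function name and the seed are illustrative.

```python
import random


def count_false_positives(n_tests, alpha, seed=0):
    """Count rejections when every null hypothesis is true."""
    rng = random.Random(seed)
    # Under H0 a correctly computed p-value is uniform on [0, 1].
    p_values = [rng.random() for _ in range(n_tests)]
    return sum(p < alpha for p in p_values)


n_tests, alpha = 10_000, 0.05
expected = n_tests * alpha  # expected number of false positives
observed = count_false_positives(n_tests, alpha)
print(expected, observed)
```

The observed count fluctuates around the expected value from one seed to another, which is why the raw p-values have to be adjusted before a decision is made.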


Contingency table for multiple hypothesis testing

                           True null hypotheses   False null hypotheses   Total
Declared non-significant   True Negatives         False Negatives         Negatives
Declared significant       False Positives (FP)   True Positives          Positives (P)

Adjustment of the raw p-values
- FWER = P(FP > 0) (Bonferroni procedure)
- FDR = E(FP/P), where FP/P is taken as 0 when P = 0 (Benjamini-Hochberg procedure)

Decision rule
A gene is declared differentially expressed if its adjusted p-value is lower than a given threshold.
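The two adjustments and the decision rule can be sketched in a few lines. This is a plain-Python illustration of the standard Bonferroni and Benjamini-Hochberg adjusted p-values, not the code used in the study; the raw p-values are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni adjusted p-values (control of the FWER)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (control of the FDR)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay
    # monotone in the rank of the raw p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for five genes.
raw = [0.001, 0.008, 0.02, 0.03, 0.6]
threshold = 0.05
declared_fwer = [p < threshold for p in bonferroni(raw)]
declared_fdr = [p < threshold for p in benjamini_hochberg(raw)]
print(sum(declared_fwer), sum(declared_fdr))
```

On this toy input the FDR adjustment declares more genes than the FWER adjustment, which reflects the less stringent error criterion.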


How to model RNA-seq data ?

Overdispersion between biological replicates
A negative binomial distribution is often assumed: Y ∼ NB(µ, φ)

E(Y ) = µ

V (Y ) = µ(1 + φµ)
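The mean-variance relationship can be checked numerically. A minimal sketch, assuming that NB(µ, φ) corresponds to the usual negative binomial with size 1/φ and success probability 1/(1 + φµ); the function names and the values of µ and φ are illustrative.

```python
from math import exp, lgamma, log


def nb_pmf(k, mu, phi):
    """P(Y = k) for a negative binomial with mean mu and dispersion phi."""
    r = 1.0 / phi        # size parameter
    p = r / (r + mu)     # success probability, equal to 1 / (1 + phi * mu)
    log_pmf = (lgamma(k + r) - lgamma(k + 1) - lgamma(r)
               + r * log(p) + k * log(1.0 - p))
    return exp(log_pmf)


def nb_moments(mu, phi, k_max=5000):
    """Mean and variance obtained by summing the pmf up to k_max."""
    probs = [nb_pmf(k, mu, phi) for k in range(k_max + 1)]
    mean = sum(k * q for k, q in enumerate(probs))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(probs))
    return mean, var


mu, phi = 100.0, 0.1
mean, var = nb_moments(mu, phi)
print(mean, var, mu * (1 + phi * mu))
```

With φ = 0 the variance would reduce to µ, the Poisson case; any φ > 0 inflates the variance, which is how the model accounts for overdispersion between replicates.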


Three statistical frameworks

A negative binomial distribution (2008)
- Expression = library size × λ_{condition}

A NB generalized linear model (2012)
- allows us to decompose the expression
- each condition is described by several factors

log(λ_{condition}) = Cst + α_{genotype} + β_{stress} + γ_{genotype,stress}

- the effect of each factor is tested

A linear model (2014)
- data are transformed to work with a Gaussian distribution
- allows us to decompose the expression
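The log-linear decomposition can be made concrete with a toy gene. The sketch below assumes a reference-level coding (wild type and unstressed as the baseline) and combines it with the relation Expression = library size × λ; the coefficient values, the library size, and the function names are hypothetical.

```python
from math import exp

# Hypothetical coefficients on the log scale for one gene.
COEF = {
    "Cst": -12.0,             # baseline: reference genotype, no stress
    "genotype": 0.5,          # effect of the mutant genotype
    "stress": 1.0,            # effect of the stress
    "genotype:stress": -0.3,  # interaction between the two factors
}


def log_lambda(mutant, stressed):
    """log(lambda) for one condition, built from the factor effects."""
    value = COEF["Cst"]
    if mutant:
        value += COEF["genotype"]
    if stressed:
        value += COEF["stress"]
    if mutant and stressed:
        value += COEF["genotype:stress"]
    return value


def expected_count(library_size, mutant, stressed):
    """Expected read count: library size times lambda for the condition."""
    return library_size * exp(log_lambda(mutant, stressed))


for mutant in (False, True):
    for stressed in (False, True):
        print(mutant, stressed, round(expected_count(20_000_000, mutant, stressed), 1))
```

Testing the effect of a factor then amounts to testing whether the corresponding coefficient is zero; for instance, the interaction term is the part of the mutant-under-stress expression that the two main effects do not explain.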


In practice

Do we filter genes with low expression? (yes or no)

How do we model the gene expression? (NB, GLM or LM)

Which method do we use to estimate the variance of the gene expression? (several methods)


Neutral comparison study

We want to answer these questions with a large evaluation study:

How well do the statistical models fit RNA-seq data?
→ study of the p-value distribution

Do the p-values discriminate well between DE (differentially expressed) and NDE (non-differentially expressed) genes?
→ ROC curves

Are the false positives controlled?
→ proportion of truly NDE genes among the genes declared DE

Are the methods powerful (able to find the truly DE genes)?
→ proportion of truly DE genes that are declared DE
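The last two criteria are simple proportions once the truth is known. A minimal sketch, assuming the false-positive criterion is read as the share of truly NDE genes among the genes declared DE, and the power criterion as the share of truly DE genes that are declared DE; gene identifiers and function names are hypothetical.

```python
def false_discovery_proportion(declared_de, truly_nde):
    """Share of truly NDE genes among the genes declared DE."""
    declared = set(declared_de)
    if not declared:
        return 0.0  # no gene declared, hence no false discovery
    return len(declared & set(truly_nde)) / len(declared)


def true_positive_rate(declared_de, truly_de):
    """Share of truly DE genes that are declared DE."""
    truth = set(truly_de)
    if not truth:
        return 0.0
    return len(set(declared_de) & truth) / len(truth)


# Hypothetical outcome of one analysis; g4 has an unknown status.
declared = ["g1", "g2", "g3", "g4"]
truly_de = ["g1", "g2", "g5"]
truly_nde = ["g3", "g6"]
print(false_discovery_proportion(declared, truly_nde))
print(true_positive_rate(declared, truly_de))
```

Genes whose status is unknown count in the denominator of the first proportion but never as false discoveries, so the value depends on how the set of truly NDE genes is defined.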


Which kind of data is relevant for an evaluation ?

Real data:
- More realistic
- ... but no extensively validated data yet available

Simulated data:
- Truth is well-controlled
- ... but what model should be used to simulate data? How realistic are the simulated data? How much do results depend on the model used?

Our idea was to create synthetic data
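One way to picture a synthetic dataset is as a mix of real measurements taken from two sources. The sketch below is only an assumption about the general idea, not the authors' exact procedure: each gene takes its counts either from a dataset where no difference is expected or from a dataset with differences, according to a chosen proportion. All names, counts, and the proportion are hypothetical.

```python
import random


def make_synthetic(h0_counts, h1_counts, h0_proportion, seed=0):
    """Assign each shared gene its counts from the H0 or the H1 source."""
    genes = sorted(set(h0_counts) & set(h1_counts))
    rng = random.Random(seed)
    n_h0 = round(len(genes) * h0_proportion)
    # Random sub-selection of the genes whose counts come from the H0 source.
    from_h0 = set(rng.sample(genes, n_h0))
    synthetic = {}
    for gene in genes:
        if gene in from_h0:
            synthetic[gene] = ("H0", h0_counts[gene])
        else:
            synthetic[gene] = ("H1", h1_counts[gene])
    return synthetic


# Hypothetical counts (two samples per gene) for ten genes in each source.
h0 = {f"g{i}": [10 * i, 10 * i + 1] for i in range(1, 11)}
h1 = {f"g{i}": [10 * i, 30 * i] for i in range(1, 11)}
synthetic = make_synthetic(h0, h1, h0_proportion=0.3)
print(sum(source == "H0" for source, _ in synthetic.values()))
```

Because the counts are real measurements rather than draws from a model, no simulation model has to be chosen, while the source of each gene remains known by construction.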


Creation of synthetic datasets

[Figure: construction of the synthetic datasets. Labels: Leaves vs Leaves (H0 full dataset, H0 genes); Buds vs Leaves (H1 rich dataset, genes validated by qRT-PCR and genes of unknown status); a random sub-selection from each dataset is combined into the synthetic dataset.]

Definition of the truth

The set of truly DE genes
- 251 DE genes identified by qRT-PCR among 332 randomly chosen genes

The set of truly NDE genes
- The proper identification is not straightforward
- Definition of two sets:
  - NDE.union: may include some genes that are not truly NDE
  - NDE.inter: may exclude some truly NDE genes


The 3 frameworks described by 9 methods

edgeR and DESeq are NB-based methods

Expression = library size × λ_{condition}

glm edgeR and DESeq2 are GLM approaches

log(λ_{condition}) = Cst + α_{tissue} + β_{biological replicate}

limma-voom is a linear model
Data are transformed with the voom method

Expression = Cst + α_{tissue} + β_{biological replicate}

* All methods except DESeq are also applied on filtered data
* In each method, the nominal value of the FDR is 5%


Distribution of the p-values

Method
- When no difference is expected, the histogram of the p-values is expected to be uniform
- For each synthetic dataset, 100 evaluations of the uniformity of the p-value distribution are performed, each on 1000 genes randomly chosen in the full H0 dataset

The raw p-values are not properly calculated (67% of tests are rejected after a strict FP control)

Test statistic values are smaller for linear or generalized linear models
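A uniformity check on 1000 p-values can be sketched as follows. The slides do not say which uniformity test was used; the Kolmogorov-Smirnov distance to the Uniform(0, 1) distribution is an assumption made here for illustration, and both sets of p-values are simulated rather than taken from the study.

```python
import random


def ks_statistic_uniform(p_values):
    """Kolmogorov-Smirnov distance between the sample and Uniform(0, 1)."""
    xs = sorted(p_values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # The empirical CDF jumps from i/n to (i+1)/n at x,
        # while the uniform CDF at x is x itself.
        d = max(d, (i + 1) / n - x, x - i / n)
    return d


rng = random.Random(1)
# Well-behaved p-values for 1000 genes under H0.
uniform_p = [rng.random() for _ in range(1000)]
# Badly calibrated p-values that pile up near 1.
peaked_p = [1 - u ** 3 for u in uniform_p]

print(ks_statistic_uniform(uniform_p))
print(ks_statistic_uniform(peaked_p))
```

A distance close to 0 is compatible with a flat histogram, whereas a large distance signals p-values that do not follow the distribution expected under the null hypothesis.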


Definition of a ROC curve


Discrimination of DE and NDE genes

Method
- sort raw p-values into ascending order
- compare them with the truth
- construct a ROC curve and calculate the AUC
- an AUC close to 1 indicates a good discrimination

For linear models or GLMs, the AUC is high and independent of the proportion of the full H0 dataset.

For NB-based methods, the AUC steadily decreases as the proportion of the full H0 dataset increases once that proportion is larger than 0.3-0.4.
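The ROC construction described in the method can be sketched directly from raw p-values and the truth. This is a minimal plain-Python illustration using the trapezoidal rule, with tied p-values processed together; the p-values and the DE labels are hypothetical.

```python
def roc_auc(p_values, is_de):
    """AUC of the ROC curve when genes are ranked by ascending p-value."""
    ranked = sorted(zip(p_values, is_de), key=lambda pair: pair[0])
    n_pos = sum(1 for flag in is_de if flag)
    n_neg = len(is_de) - n_pos
    tp = fp = 0
    prev_fpr = prev_tpr = 0.0
    auc = 0.0
    i = 0
    while i < len(ranked):
        # Move the threshold past all genes sharing the same p-value.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        fpr, tpr = fp / n_neg, tp / n_pos
        # Trapezoid between the previous and the current ROC point.
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_fpr, prev_tpr = fpr, tpr
        i = j
    return auc


# Hypothetical raw p-values and truth (True = truly DE).
p_values = [0.001, 0.02, 0.03, 0.5]
is_de = [True, False, True, False]
print(roc_auc(p_values, is_de))
```

The AUC equals the probability that a randomly chosen truly DE gene receives a smaller p-value than a randomly chosen truly NDE gene, so it measures the ranking only and does not depend on any significance threshold.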


FDR estimation

Method
- Proportion of truly NDE genes among the genes declared DE
- Expected value: 5%

- For NB-based methods, both bounds are close to 0
- For DESeq2, the FDR is always lower than 5%
- For glm edgeR, the interval generally contains 5%
- For limma-voom, the FDR control is more variable but the filtering step stabilizes its behavior


Are truly DE genes declared DE?

Method
- Proportion of truly DE genes that are declared DE (true positive rate, TPR)

- LM- or GLM-based methods show a high TPR
- For NB-based methods, the TPR is a function of the full H0 dataset proportion
- The variance-mean relationship modeling and the data filtering seem to have only a limited impact


Conclusions

modeling ≥ filtering ≥ dispersion

- Synthetic data are a relevant framework
- Forget edgeR and DESeq
- Use glm edgeR, DESeq2 or limma-voom
- Include biological replicate as a factor
- Filtering allows methods to control FDR


Definition of an indicator of quality

A p-value histogram with a peak on the right side = analysis of bad quality

Let's play a game: which analysis is correct?


Acknowledgements

Guillem Rigaill (IPS2, Genomic networks, Paris-Saclay)

The transcriptomic platform of IPS2 (data generation and bioinformatics analysis)

The ANR project MixStatSeq coordinated by C. Maugis (IMT, Toulouse) and involving A. Rau (GABI, INRA) and G. Celeux (INRIA, Saclay)
