Differential expression in RNA-Seq

32
Introduction Normalization Differential Expression Differential Expression in RNA-Seq Sonia Tarazona [email protected] Data analysis workshop for massive sequencing data Granada, June 2011 Sonia Tarazona Differential Expression in RNA-Seq

description

Differential expression in RNA-SeqSonia Tarazona -Massive sequencing data analysis workshop -Granada 2011

Transcript of Differential expression in RNA-Seq

Page 1: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

Differential Expression in RNA-Seq

Sonia [email protected]

Data analysis workshop for massive sequencing dataGranada, June 2011

Sonia Tarazona Differential Expression in RNA-Seq

Page 2: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

Outline

1 Introduction

2 Normalization

3 Differential Expression

Sonia Tarazona Differential Expression in RNA-Seq

Page 3: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

Some questionsSome definitionsRNA-seq expression data

Outline

1 Introduction

Some questions

Some definitions

RNA-seq expression data

2 Normalization

3 Differential Expression

Sonia Tarazona Differential Expression in RNA-Seq

Page 4: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

Some questionsSome definitionsRNA-seq expression data

Some questions

How do we measure expression?

What is differential expression?

Experimental design in RNA-Seq

Can we used the same statistics as in microarrays?

Do I need any “normalization”?

Sonia Tarazona Differential Expression in RNA-Seq

Page 5: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

Some questionsSome definitionsRNA-seq expression data

Some definitions

Expression level

RNA-Seq: The number of reads (counts) mapping to the biological feature ofinterest (gene, transcript, exon, etc.) is considered to be linearlyrelated to the abundance of the target feature.

Microarrays: The abundance of each sequence is a function of the fluorescencelevel recovered after the hybridization process.

Sonia Tarazona Differential Expression in RNA-Seq

Page 6: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

Some questionsSome definitionsRNA-seq expression data

Some definitions

Sequencing depth: Total number of reads mapped to the genome. Library size.

Gene length: Number of bases.

Gene counts: Number of reads mapping to that gene (expression measurement).

Sonia Tarazona Differential Expression in RNA-Seq

Page 7: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

Some questionsSome definitionsRNA-seq expression data

Some definitions

What is differential expression?

A gene is declared differentially expressed if an observed difference or change inread counts between two experimental conditions is statistically significant, i.e.whether it is greater than what would be expected just due to natural randomvariation.

Statistical tools are needed to make such a decision by studying countsprobability distributions.

Sonia Tarazona Differential Expression in RNA-Seq

Page 8: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

Some questionsSome definitionsRNA-seq expression data

RNA-Seq expression data

Experimental design

Pairwise comparisons: Only two experimental conditions or groups are to becompared.

Multiple comparisons: More than two conditions or groups.

Replication

Biological replicates. To draw general conclusions: from samples to population.

Technical replicates. Conclusions are only valid for compared samples.

Sonia Tarazona Differential Expression in RNA-Seq

Page 9: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

Why?MethodsExamples

Outline

1 Introduction

2 Normalization

Why?

Methods

Examples

3 Differential Expression

Sonia Tarazona Differential Expression in RNA-Seq

Page 10: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

Why?MethodsExamples

Why Normalization?

RNA-seq biases

Influence of sequencing depth: The higher sequencing depth, the higher counts.

Dependence on gene length: Counts are proportional to the transcript lengthtimes the mRNA expression level.

Differences on the counts distribution among samples.

Sonia Tarazona Differential Expression in RNA-Seq

Page 11: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

Why?MethodsExamples

Why Normalization?

RNA-seq biases

Influence of sequencing depth: The higher sequencing depth, the higher counts.

Dependence on gene length: Counts are proportional to the transcript lengthtimes the mRNA expression level.

Differences on the counts distribution among samples.

Sonia Tarazona Differential Expression in RNA-Seq

Page 12: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

Why?MethodsExamples

Why Normalization?

RNA-seq biases

Influence of sequencing depth: The higher sequencing depth, the higher counts.

Dependence on gene length: Counts are proportional to the transcript lengthtimes the mRNA expression level.

Differences on the counts distribution among samples.

Sonia Tarazona Differential Expression in RNA-Seq

Page 13: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

Why?MethodsExamples

Why Normalization?

RNA-seq biases

Influence of sequencing depth: The higher sequencing depth, the higher counts.

Dependence on gene length: Counts are proportional to the transcript lengthtimes the mRNA expression level.

Differences on the counts distribution among samples.

Options

1 Normalization: Counts should be previously corrected in order to minimize thesebiases.

2 Statistical model should take them into account.

Sonia Tarazona Differential Expression in RNA-Seq

Page 14: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

Why?MethodsExamples

Some Normalization Methods

RPKM (Mortazavi et al., 2008): Counts are divided by the transcript length(kb) times the total number of millions of mapped reads.

RPKM =number of reads of the region

total reads1000000

× region length1000

Upper-quartile (Bullard et al., 2010): Counts are divided by upper-quartile ofcounts for transcripts with at least one read.

TMM (Robinson and Oshlack, 2010): Trimmed Mean of M values.

Quantiles, as in microarray normalization (Irizarry et al., 2003).

FPKM (Trapnell et al., 2010): Instead of counts, Cufflinks software generatesFPKM values (Fragments Per Kilobase of exon per Million fragments mapped)to estimate gene expression, which are analogous to RPKM.

Sonia Tarazona Differential Expression in RNA-Seq

Page 15: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

Why?MethodsExamples

Example on RPKM normalization

Sequencing depth

Marioni et al., 2008

Kidney: 9.293.530 reads

Liver: 8.361.601 reads

RPKM normalization

Length: 1500 bases.

Kidney: RPKM = 6209293530

106 × 15001000

= 44,48

Liver: RPKM = 7468361601

106 × 15001000

= 59,48

Sonia Tarazona Differential Expression in RNA-Seq

Page 16: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

Why?MethodsExamples

Example

Sonia Tarazona Differential Expression in RNA-Seq

Page 17: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

MethodsNOISeqExercisesConcluding

Outline

1 Introduction

2 Normalization

3 Differential Expression

Methods

NOISeq

Exercises

Concluding

Sonia Tarazona Differential Expression in RNA-Seq

Page 18: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

MethodsNOISeqExercisesConcluding

Differential Expression

Parametric approaches

Counts are modeled using known probability distributions such as Binomial, Poisson,Negative Binomial, etc.

R packages in Bioconductor:

edgeR (Robinson et al., 2010): Exact test based on Negative Binomialdistribution.

DESeq (Anders and Huber, 2010): Exact test based on Negative Binomialdistribution.

DEGseq (Wang et al., 2010): MA-plots based methods (MATR and MARS),assuming Normal distribution for M|A.

baySeq (Hardcastle et al., 2010): Estimation of the posterior likelihood ofdifferential expression (or more complex hypotheses) via empirical Bayesianmethods using Poisson or NB distributions.

Sonia Tarazona Differential Expression in RNA-Seq

Page 19: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

MethodsNOISeqExercisesConcluding

Differential Expression

Non-parametric approaches

No assumptions about data distribution are made.

Fisher’s exact test (better with normalized counts).

cuffdiff (Trapnell et al., 2010): Based on entropy divergence for relativetranscript abundances. Divergence is a measurement of the ”distance”betweenthe relative abundances of transcripts in two difference conditions.

NOISeq (Tarazona et al., coming soon) −→ http://bioinfo.cipf.es/noiseq/

Sonia Tarazona Differential Expression in RNA-Seq

Page 20: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

MethodsNOISeqExercisesConcluding

Differential Expression

Drawbacks of differential expression methods

Parametric assumptions: Are they fulfilled?

Need of replicates.

Problems to detect differential expression in genes with low counts.

NOISeq

Non-parametric method.

No need of replicates.

Less influenced by sequencing depth or number of counts.

Sonia Tarazona Differential Expression in RNA-Seq

Page 21: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

MethodsNOISeqExercisesConcluding

Differential Expression

Drawbacks of differential expression methods

Parametric assumptions: Are they fulfilled?

Need of replicates.

Problems to detect differential expression in genes with low counts.

NOISeq

Non-parametric method.

No need of replicates.

Less influenced by sequencing depth or number of counts.

Sonia Tarazona Differential Expression in RNA-Seq

Page 22: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

MethodsNOISeqExercisesConcluding

NOISeq

Outline

Signal distribution. Computing changes in expression of each gene between thetwo experimental conditions.

Noise distribution. Distribution of changes in expression values when comparingreplicates within the same condition.

Differential expression. Comparing signal and noise distributions to determinedifferentially expressed genes.

Sonia Tarazona Differential Expression in RNA-Seq

Page 23: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

MethodsNOISeqExercisesConcluding

NOISeq

NOISeq-real

Replicates are available for each condition.

Compute M-D in noise by comparing each pair of replicates within the samecondition.

NOISeq-sim

No replicates are available at all.

NOISeq simulates technical replicates for each condition. The replicates aregenerated from a multinomial distribution taking the counts in the only sampleas the probabilities for the distribution.

Compute M-D in noise by comparing each pair of simulated replicates within thesame condition.

Sonia Tarazona Differential Expression in RNA-Seq

Page 24: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

MethodsNOISeqExercisesConcluding

NOISeq

NOISeq-real

Replicates are available for each condition.

Compute M-D in noise by comparing each pair of replicates within the samecondition.

NOISeq-sim

No replicates are available at all.

NOISeq simulates technical replicates for each condition. The replicates aregenerated from a multinomial distribution taking the counts in the only sampleas the probabilities for the distribution.

Compute M-D in noise by comparing each pair of simulated replicates within thesame condition.

Sonia Tarazona Differential Expression in RNA-Seq

Page 25: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

MethodsNOISeqExercisesConcluding

NOISeq: Signal distribution

1 Calculate the expression of each gene at each experimental condition

(expressioni , for i = 1, 2).

Technical replicates: expressioni = sum(valuesi )Biological replicates: expressioni = mean(valuesi )No replicates: expressioni = valuei

2 Compute for each gene the statistics measuring changes in expression:M = log2

expression1expression2

D = |expression1–expression2 |

Sonia Tarazona Differential Expression in RNA-Seq

Page 26: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

MethodsNOISeqExercisesConcluding

NOISeq: Noise distribution

With replicates (NOISeq-real): For each condition, all the possible comparisonsamong the available replicates are used to compute M and D values. All theM-D values for all the comparisons, genes and for both conditions are pooledtogether to create the noise distribution.If the number of pairwise comparisons for a certain condition is higher than 30,only 30 randomly selected comparisons are made.

Without replicates (NOISeq-sim): Replicates are simulated and the sameprocedure is used to derive noise distribution.

Sonia Tarazona Differential Expression in RNA-Seq

Page 27: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

MethodsNOISeqExercisesConcluding

NOISeq: Differential expression

Probability for each gene of being differentially expressed: It is obtained bycomparing M-D values of that gene against noise distribution and computing thenumber of cases in which the values in noise are lower than the values for signal.

A gene is declared as differentially expressed if this probability is higher than q.

The threshold q is set to 0.8 by default, since this value is equivalent to an oddsof 4:1 (the gene in 4 times more likely to be differentially expressed than not).

Sonia Tarazona Differential Expression in RNA-Seq

Page 28: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

MethodsNOISeqExercisesConcluding

NOISeq

Input

Data: datos1, datos2

Features length: long (only if length correction is to be applied)

Normalization: norm = {“rpkm”, “uqua”, “tmm”, “none”}; lc = “lengthcorrection”

Replicates: repl = {“tech”, “bio”}Simulation: nss = “number of replicates to be simulated”; pnr = “total countsin each simulated replicate”; v = variability for pnr

Probability cutoff: q (≥ 0,8)

Others: k = 0,5

Sonia Tarazona Differential Expression in RNA-Seq

Page 29: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

MethodsNOISeqExercisesConcluding

NOISeq

Output

Differential expression probability. For each feature, probability of beingdifferentially expressed.

Differentially expressed features. List of features names which are differentiallyexpressed according to q cutoff.

M-D values. For signal (between conditions and for each feature) and for noise(among replicates within the same condition, pooled).

Sonia Tarazona Differential Expression in RNA-Seq

Page 30: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

MethodsNOISeqExercisesConcluding

Exercises

Execute in an R terminal the code provided in exerciseDEngs.r

Reading data

simCount <- readData(file = "simCount.txt", cond1 = c(2:6),

cond2 = c(7:11), header = TRUE)

lapply(simCount, head)

depth <- as.numeric(sapply(simCount, colSums))

Differential Expression by NOISeq

NOISeq-realres1noiseq <- noiseq(simCount[[1]], simCount[[2]], nss = 0, q = 0.8,

repl = ‘‘tech’’)

NOISeq-simres2noiseq <- noiseq(as.matrix(rowSums(simCount[[1]])),

as.matrix(rowSums(simCount[[2]])), q = 0.8, pnr = 0.2, nss = 5)

See the tutorial in http://bioinfo.cipf.es/noiseq/ for more information.

Sonia Tarazona Differential Expression in RNA-Seq

Page 31: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

MethodsNOISeqExercisesConcluding

Some remarks

Experimental design is decisive to answer correctly your biological questions.

Differential expression methods for RNA-Seq data must be different tomicroarray methods.

Normalization should be applied to raw counts, at least a library size correction.

Sonia Tarazona Differential Expression in RNA-Seq

Page 32: Differential expression in RNA-Seq

IntroductionNormalization

Differential Expression

MethodsNOISeqExercisesConcluding

For Further Reading

Oshlack, A., Robinson, M., and Young, M. (2010) From RNA-seq reads todifferential expression results. Genome Biology , 11, 220+.Review on RNA-Seq, including differential expression.

Auer, P.L., and Doerge R.W. (2010) Statistical Design and Analysis of RNASequencing Data. Genetics, 185, 405-416.

Bullard, J. H., Purdom, E., Hansen, K. D., and Dudoit, S. (2010) Evaluation ofstatistical methods for normalization and differential expression in mRNA-Seqexperiments. BMC Bioinformatics, 11, 94+.Normalization methods (including Upper Quartile) and differential expression.

Tarazona, S., Garcıa-Alcalde, F., Dopazo, J., Ferrer A., and Conesa, A.(submitted) Differential expression in RNA-seq: a matter of depth.

Sonia Tarazona Differential Expression in RNA-Seq