Introduction to DESeq and edgeR packages


Peter A.C. ’t Hoen

Poisson distribution

• discrete probability distribution that expresses the probability of a number of events occurring in a fixed period of time if these events occur with a known average rate and independently of the time since the last event

$P(X = k) = \dfrac{\lambda^k e^{-\lambda}}{k!}$, where λ = expected number of occurrences and k = number of occurrences
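
As an illustration (not part of the original slides), the Poisson probabilities can be evaluated directly from λ and k. The sketch below uses only the Python standard library; the function name `poisson_pmf` and the rate λ = 3 are choices made for this example.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam), evaluated on the log scale for stability."""
    if k < 0:
        return 0.0
    if lam == 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

lam = 3.0                                   # example rate (assumption)
probs = [poisson_pmf(k, lam) for k in range(100)]   # tail beyond 99 is negligible here
mean = sum(k * p for k, p in enumerate(probs))
var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
print("total probability:", sum(probs))
print("mean:", mean, "variance:", var)
```

Both the mean and the variance come out at approximately λ, which is the property the next slides rely on.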

Count process

• Poisson distribution

$Y_t \sim \mathrm{Poisson}(\lambda_t)$ with $\lambda_t = p_t n$

t: tag

$\lambda_t$: true expression

$Y_t$: observed expression

$p_t$: probability (of sampling tag t)

n: total number of RNA molecules

• Truncated Poisson distribution: zero can mean not expressed or not counted

• Count variance ~ $\lambda_t$

• Murray F. Freeman and John W. Tukey, Ann. Math. Statist. 21:607-611 (1950)
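
A small numerical sketch of why a zero count is ambiguous: under the Poisson count process, a tag with a small expected count λ has a substantial probability exp(-λ) of not being counted at all. The sketch assumes p is the probability that a sampled molecule belongs to the tag; the function names and the numbers are hypothetical.

```python
import math

def expected_count(p_t: float, n: int) -> float:
    """lambda_t = p_t * n: expected count for a tag sampled with probability p_t."""
    return p_t * n

def prob_zero(lam: float) -> float:
    """P(Y_t = 0) under Poisson(lam): the tag is expressed but was not counted."""
    return math.exp(-lam)

n = 1_000_000  # hypothetical total number of molecules
for p_t in (1e-7, 1e-6, 1e-5):
    lam = expected_count(p_t, n)
    # For a Poisson count the variance equals the mean lam.
    print(f"p_t={p_t:g}  lambda={lam:g}  P(zero)={prob_zero(lam):.4f}")
```

With these numbers, a tag expected once per library is missed in roughly a third of libraries, while a tag expected ten times is almost never missed.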

Negative binomial distribution

• discrete probability distribution of the number of successes in a sequence of Bernoulli trials before a specified (non-random) number r of failures occurs

• also arises as a continuous mixture of Poisson distributions where the mixing distribution of the Poisson rate is a gamma distribution. That is, we can view the negative binomial as a Poisson(λ) distribution, where λ is itself a random variable, distributed according to Gamma(r, p/(1 − p)), with shape r and scale p/(1 − p), where p is the success probability.
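
The gamma-Poisson view can be checked numerically. The sketch below compares the negative binomial probabilities (k successes with probability p before the r-th failure) with a numerical integral of Poisson(λ) probabilities over a Gamma density with shape r and scale p/(1 − p). The parameter values, the integration range, and the function names are assumptions chosen for illustration, and the midpoint rule is only one simple way to approximate the integral.

```python
import math

def nb_pmf(k: int, r: float, p: float) -> float:
    """P(K = k): k successes (probability p each) before the r-th failure."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
               + r * math.log(1.0 - p) + k * math.log(p))
    return math.exp(log_pmf)

def poisson_pmf(k: int, lam: float) -> float:
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def gamma_pdf(lam: float, shape: float, scale: float) -> float:
    return math.exp((shape - 1.0) * math.log(lam) - lam / scale
                    - math.lgamma(shape) - shape * math.log(scale))

def mixture_pmf(k: int, r: float, p: float,
                upper: float = 60.0, steps: int = 60_000) -> float:
    """Midpoint-rule approximation of the integral of Poisson(k | lam) * Gamma(lam)."""
    scale = p / (1.0 - p)
    h = upper / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h          # midpoints are never 0, so log(lam) is safe
        total += poisson_pmf(k, lam) * gamma_pdf(lam, r, scale)
    return total * h

r, p = 2.5, 0.4                      # example parameters (assumption)
diffs = []
for k in range(6):
    exact, mixed = nb_pmf(k, r, p), mixture_pmf(k, r, p)
    diffs.append(abs(exact - mixed))
    print(k, exact, mixed)
max_abs_diff = max(diffs)
```

The two columns agree up to the numerical integration error.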

edgeR (1)

• Robinson and Smyth (Biostatistics 2008; Bioinformatics 2007)

• Package available from Bioconductor with a very informative vignette

$Y_{ij} \sim \mathrm{NB}(\mu_{ij}, \phi)$

$\mathrm{Var}(Y_{ij}) = \mu_{ij}(1 + \mu_{ij}\phi)$

• Negative binomial (gamma-Poisson) with mean μ

• φ is the overdispersion parameter (biological variation)

• φ = 0 gives the Poisson distribution
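
To connect the mean and overdispersion parameterization with the classical one, the sketch below uses r = 1/φ and p = μφ/(1 + μφ), then checks numerically that the resulting distribution has mean μ and variance μ(1 + μφ). This mapping is a standard reparameterization rather than something stated on the slides, and the function names and example values are hypothetical.

```python
import math

def nb_variance(mu: float, phi: float) -> float:
    """Variance mu * (1 + mu * phi); phi = 0 reduces to the Poisson variance mu."""
    return mu * (1.0 + mu * phi)

def nb_pmf_mu_phi(k: int, mu: float, phi: float) -> float:
    """NB probability of count k given mean mu and overdispersion phi > 0."""
    r = 1.0 / phi                       # number of failures in the classical form
    p = mu * phi / (1.0 + mu * phi)     # success probability in the classical form
    return math.exp(math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
                    + r * math.log(1.0 - p) + k * math.log(p))

mu, phi = 10.0, 0.2                     # example values (assumption)
probs = [nb_pmf_mu_phi(k, mu, phi) for k in range(2000)]
num_mean = sum(k * q for k, q in enumerate(probs))
num_var = sum((k - num_mean) ** 2 * q for k, q in enumerate(probs))
print("mean:", num_mean, "variance:", num_var, "formula:", nb_variance(mu, phi))
print("Poisson limit (phi = 0):", nb_variance(mu, 0.0))
```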

Overdispersion in our data

edgeR (2)

• Test per gene

$Y_{gij} \sim \mathrm{NB}(\mu_{gij}, \phi_g)$ where $\mu_{gij} = M_j\, p_{gi}$

$\mathrm{Var}(Y_{gij}) = \mu_{gij}(1 + \mu_{gij}\phi_g)$

$p_{gi}$ is the proportion of tags for tag g in sample i

$M_j$ is the library size for library j of sample i

$\phi_g$ is the dispersion parameter for tag g

edgeR (3)

• Estimation of the common dispersion parameter by conditioning on the sum of counts for each tag g and maximizing the common likelihood

$l_C(\phi) = \sum_g l_g(\phi)$

• Common dispersion parameter OR weighted linear combination of common and individual likelihoods

$\mathrm{WL}(\phi_g) = l_g(\phi_g) + \alpha\, l_C(\phi_g)$, where α is the weight given to the common likelihood
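
The idea of combining individual and common likelihoods can be sketched as follows. This is a simplified illustration, not edgeR's implementation: it uses the plain (unconditional) NB log-likelihood with the mean fixed at each tag's sample mean instead of the conditional likelihood, assumes equal library sizes, and maximizes over a grid. The weight `alpha`, the counts, and all function names are hypothetical.

```python
import math

def nb_logpmf(y, mu, phi):
    """log P(Y = y) for an NB with mean mu and dispersion phi."""
    r = 1.0 / phi
    return (math.lgamma(y + r) - math.lgamma(y + 1) - math.lgamma(r)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))

def gene_loglik(counts, phi):
    """Per-tag log-likelihood with the mean fixed at the sample mean (simplification)."""
    mu = sum(counts) / len(counts)
    if mu == 0:
        return 0.0  # all-zero tag: every count has probability 1 under mean 0
    return sum(nb_logpmf(y, mu, phi) for y in counts)

def common_loglik(all_counts, phi):
    """Common log-likelihood: sum of the per-tag log-likelihoods at a shared phi."""
    return sum(gene_loglik(counts, phi) for counts in all_counts)

def weighted_dispersion(counts, all_counts, alpha, grid):
    """Grid value of phi maximizing the per-tag term plus alpha times the common term."""
    return max(grid, key=lambda phi: gene_loglik(counts, phi)
               + alpha * common_loglik(all_counts, phi))

# Hypothetical counts: 4 tags x 4 libraries of equal size.
tags = [[10, 12, 9, 11], [5, 30, 2, 18], [100, 90, 120, 95], [3, 0, 8, 1]]
grid = [0.01 * i for i in range(1, 201)]     # candidate dispersions 0.01 .. 2.00

common_phi = max(grid, key=lambda phi: common_loglik(tags, phi))
for counts in tags:
    tagwise = weighted_dispersion(counts, tags, 0.0, grid)
    shrunk = weighted_dispersion(counts, tags, 1.0, grid)
    print(counts, "tagwise:", round(tagwise, 2),
          "weighted:", round(shrunk, 2), "common:", round(common_phi, 2))
```

With weight 0 the estimate is purely tagwise; a larger weight pulls each tag's estimate toward the value favored by the common likelihood.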

edgeR (4)

• Exact test replacing hypergeometric probabilities with NB-derived probabilities (qCML) for single-factor experiments

• Generalized linear models and the Cox-Reid profile-adjusted likelihood (CR) method for multifactorial experiments
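
A minimal sketch of an NB exact test for two groups, under simplifying assumptions: equal library sizes, a known dispersion φ, and a pooled mean under the null hypothesis. It uses the fact that the sum of n independent NB(μ, φ) counts is NB(nμ, φ/n), conditions on the total count, and adds up the probabilities of all splits that are no more likely than the observed one. This is one common way to define the two-sided p-value and is not claimed to match the exact procedure or defaults of either package; the function names and counts are hypothetical.

```python
import math

def nb_logpmf(y, mu, phi):
    """log P(Y = y) for an NB with mean mu and dispersion phi."""
    r = 1.0 / phi
    return (math.lgamma(y + r) - math.lgamma(y + 1) - math.lgamma(r)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))

def nb_exact_test(counts_a, counts_b, phi):
    """Two-sided p-value conditioning on the total count of both groups."""
    n_a, n_b = len(counts_a), len(counts_b)
    sum_a = sum(counts_a)
    total = sum_a + sum(counts_b)
    if total == 0:
        return 1.0                          # no information in an all-zero tag
    mu = total / (n_a + n_b)                # pooled mean under the null hypothesis

    def joint(a):
        # Probability of the split (a, total - a) between the two group sums.
        return math.exp(nb_logpmf(a, n_a * mu, phi / n_a)
                        + nb_logpmf(total - a, n_b * mu, phi / n_b))

    probs = [joint(a) for a in range(total + 1)]
    observed = probs[sum_a]
    # Small relative tolerance so that ties with the observed split are included.
    extreme = sum(q for q in probs if q <= observed * (1.0 + 1e-9))
    return min(1.0, extreme / sum(probs))

p_same = nb_exact_test([10, 10], [10, 10], phi=0.1)
p_diff = nb_exact_test([5, 7], [60, 70], phi=0.1)
print("identical groups:", p_same)
print("different groups:", p_diff)
```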

edgeR: what is new?

• Exact test cannot handle confounders; replaced by a generalized linear model with a likelihood ratio test

• Abundance trending in dispersion estimates

Dispersion trend

[Figure: dispersion plotted against abundance]

Dispersion trending (after filtering for low abundance)

[Figure: dispersion plotted against abundance]

DESeq (1)

• Anders and Huber: Genome Biology (2010) 11:R106

• Roughly same principles as edgeR

• No multifactorial analysis implemented yet

DESeq (2)

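(1) $Y_{ij} \sim \mathrm{NB}(\mu_{ij}, \sigma^2_{ij})$

(2) $\mu_{ij} = s_j\, q_{i,\rho(j)}$

$s_j$: scaling factor for sample j

$q_{i,\rho(j)}$: proportional concentration of tag i in condition ρ(j)

(3) $\sigma^2_{ij} = \mu_{ij} + s_j^2\, \nu_{i,\rho(j)}$

$\nu_{i,\rho(j)}$ is a smooth function depending on $q_{i,\rho(j)}$ (concentration)

First term ($\mu_{ij}$): count noise. Second term ($s_j^2\, \nu_{i,\rho(j)}$): extra variance.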
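
The variance decomposition can be sketched in a few lines. The smooth function ν is passed in as an ordinary Python function; the quadratic choice ν(q) = φq² used here is only a hypothetical example, picked because it reproduces the variance μ(1 + μφ) of a constant-dispersion NB model. All names and numbers are assumptions for illustration.

```python
def deseq_mean(s_j: float, q: float) -> float:
    """mu = s_j * q: scaling factor times proportional concentration."""
    return s_j * q

def deseq_variance(s_j: float, q: float, raw_variance) -> float:
    """sigma^2 = mu (count noise) + s_j^2 * nu(q) (extra variance)."""
    return deseq_mean(s_j, q) + s_j ** 2 * raw_variance(q)

def nb_size(mu: float, sigma2: float) -> float:
    """NB size parameter r = mu^2 / (sigma^2 - mu); requires sigma^2 > mu."""
    return mu ** 2 / (sigma2 - mu)

phi = 0.1                                   # hypothetical constant dispersion

def quadratic_nu(q: float) -> float:
    """Hypothetical smooth function nu(q) = phi * q^2."""
    return phi * q ** 2

s_j, q = 1.5, 20.0                          # example scaling factor and concentration
mu = deseq_mean(s_j, q)
sigma2 = deseq_variance(s_j, q, quadratic_nu)
print("mu:", mu, "sigma^2:", sigma2, "mu*(1+mu*phi):", mu * (1 + mu * phi))
print("NB size r:", nb_size(mu, sigma2), "1/phi:", 1 / phi)
```

Letting ν be an arbitrary smooth function of q, rather than a fixed multiple of q², is what allows the variance to follow a data-driven trend with expression level.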

DESeq (3): variance trend with expression

[Figure legend] Purple: Poisson. Dashed orange: edgeR (before trending). Orange: DESeq.

You can derive: squared CV is 1/μ + φ
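
The derivation is short, assuming the NB variance μ(1 + μφ) with mean μ and dispersion φ:

```latex
\begin{align*}
\mathrm{CV}^2
  &= \frac{\mathrm{Var}(Y)}{\mathrm{E}(Y)^2}
   = \frac{\mu\,(1 + \mu\phi)}{\mu^2}
   = \frac{1}{\mu} + \phi .
\end{align*}
```

The term 1/μ is the Poisson (count noise) contribution and vanishes for highly expressed tags, so the squared CV approaches φ at high expression; with φ = 0 only the Poisson term remains.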

DESeq (3)

• Differences from edgeR:

• Complete shrinkage to trended dispersion; limited tagwise dispersion estimates

• Different variance estimates for different sample groups allowed

• Deals better with samples with large differences in read depth?

DESeq (4): statistical testing

• In analogy to the initial edgeR implementation: exact test on the NB probabilities in the two conditions

Conclusions

• edgeR and DESeq are comparable implementations of statistical tests using the NB distribution

• edgeR and DESeq produce largely similar results

• Implementation of generalized linear models in edgeR allows for testing with confounders

• Results comparable to limma for medium- to high-expressed genes: modeling of stochastic effects is particularly important for low-expressed genes

Comparison to limma (on sqrt scaled data)
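
As background for the square-root scale (an illustration added here, not taken from the slides): for Poisson counts the variance on the raw scale grows with the mean, whereas the variance of the square root of the count stays close to 1/4 once the mean is not too small. The sketch below computes both variances from the Poisson probabilities; the function names and the chosen means are hypothetical.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def variance_raw_and_sqrt(lam: float, k_max: int = 2000):
    """Variance of Y and of sqrt(Y) for Y ~ Poisson(lam), by direct summation."""
    ks = range(k_max + 1)
    probs = [poisson_pmf(k, lam) for k in ks]
    mean_raw = sum(k * q for k, q in zip(ks, probs))
    var_raw = sum((k - mean_raw) ** 2 * q for k, q in zip(ks, probs))
    mean_sqrt = sum(math.sqrt(k) * q for k, q in zip(ks, probs))
    var_sqrt = sum((math.sqrt(k) - mean_sqrt) ** 2 * q for k, q in zip(ks, probs))
    return var_raw, var_sqrt

for lam in (5.0, 50.0, 200.0):
    var_raw, var_sqrt = variance_raw_and_sqrt(lam)
    print(f"lambda={lam:g}  Var(Y)={var_raw:.3f}  Var(sqrt(Y))={var_sqrt:.4f}")
```

For small means the square-root scale no longer stabilizes the variance well, which is consistent with the point that count-based modeling matters most for low-expressed genes.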