Tauno Metsalu BIIT Research Group 12.03 · 2014. 3. 18. · Statistical modeling (3) •...

Tauno Metsalu BIIT Research Group

12.03.2014

Introduction

•  One of the most common tasks with gene expression data is to find differentially expressed (DE) genes in two conditions

•  Various methods for RNA-seq data have been proposed

•  This article compares the methods both methodologically and in practice

Methods compared

•  Cuffdiff

•  edgeR

•  DESeq

•  PoissonSeq

•  baySeq

•  limmaQN (quantile normalization)

•  limmaVoom (voom)

Methodological background

Starting point

•  All methods except Cuffdiff start from read counts assigned to each gene (HTSeq)

•  Cuffdiff – starts from transcript level to account for different isoforms (Cufflinks)

•  Some normalization is needed to take different sequencing depths into account

Normalization (1)

•  DESeq calculates a scaling factor for each sample – the read count of each gene is divided by the geometric mean of that gene's read counts across all samples, and the median of these ratios is taken

•  Cuffdiff – similar, but performs intra-condition scaling first and then inter-condition scaling; it additionally uses transcript-specific normalization
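The DESeq-style scaling factor from this slide can be sketched in a few lines. This is a minimal illustration, not DESeq's actual code: the function name `size_factors`, the toy counts, and the choice to skip genes containing a zero count are assumptions made here.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios scaling factors.

    counts: list of genes, each a list of per-sample read counts.
    Returns one scaling factor per sample.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        # Assumption: genes with any zero count are skipped, because
        # their geometric mean across samples is zero.
        if any(c == 0 for c in gene):
            continue
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # The median ratio per sample is that sample's scaling factor.
    return [math.exp(median(r)) for r in log_ratios]

# Toy data (hypothetical): sample 2 is sequenced twice as deep as sample 1.
toy_counts = [[10, 20], [100, 200], [5, 10], [0, 7]]
factors = size_factors(toy_counts)
```

Dividing each sample's counts by its factor puts the samples on a common scale; in the toy data the two factors differ by a factor of two, matching the difference in depth.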

Normalization (2)

•  edgeR uses the trimmed mean of M values (TMM) – a weighted average over the subset of genes that remains after excluding genes with high average read counts and/or large differences in expression between two experiments

•  baySeq – uses the upper quartile (75% quantile) to normalize library sizes
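Upper-quartile scaling, as mentioned for baySeq, might look roughly like the sketch below (TMM is more involved and is not shown). The function name `upper_quartile_factors`, the toy data, and the rescaling of the factors to have mean 1 are assumptions for illustration, not baySeq's implementation.

```python
from statistics import quantiles

def upper_quartile_factors(samples):
    """samples: list of samples, each a list of per-gene read counts.

    Returns one factor per sample, proportional to the sample's
    75% quantile of counts.
    """
    # quantiles(..., n=4) returns the three quartile cut points;
    # index 2 is the upper quartile.
    uq = [quantiles(s, n=4, method="inclusive")[2] for s in samples]
    # Assumption: rescale so the factors average to 1.
    mean_uq = sum(uq) / len(uq)
    return [u / mean_uq for u in uq]

# Toy data (hypothetical): the second sample has three times the depth.
toy_samples = [[1, 2, 3, 4, 5], [3, 6, 9, 12, 15]]
uq_factors = upper_quartile_factors(toy_samples)
```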

Normalization (3)

•  PoissonSeq – least differentiated gene set between two conditions is used to compute normalization factors

•  limmaQN – quantile normalization makes counts across all samples have the same empirical distribution

•  limmaVoom – locally weighted scatterplot smoothing (LOWESS) to estimate mean-variance relation and transform read counts
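Quantile normalization, the idea behind limmaQN, can be illustrated with a short sketch. It assumes every sample has the same number of genes and handles ties naively; the function name `quantile_normalize` and the toy values are hypothetical, and this is not limma's own implementation.

```python
def quantile_normalize(columns):
    """columns: list of samples, each a list of values for the same genes.

    Returns new columns that all share the same empirical distribution.
    """
    n_genes = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [
        sum(col[k] for col in sorted_cols) / len(columns)
        for k in range(n_genes)
    ]
    normalized = []
    for col in columns:
        # Gene indices ordered by value in this sample.
        order = sorted(range(n_genes), key=lambda i: col[i])
        out = [0.0] * n_genes
        for rank, i in enumerate(order):
            # The gene with the k-th smallest value gets the k-th reference value.
            out[i] = reference[rank]
        normalized.append(out)
    return normalized

# Toy data (hypothetical): two samples, three genes.
normalized = quantile_normalize([[5, 2, 3], [4, 1, 6]])
```

After the transformation each sample contains the same set of values; only the assignment of values to genes differs between samples.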

Statistical modeling (1)

•  edgeR – uses negative binomial distribution as a model for read counts; overdispersion factor is estimated using both gene-specific and common dispersion effect

•  DESeq – similar to edgeR, but overdispersion factor is estimated using mean expression of a gene and biological expression variability
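For reference, the role of the overdispersion factor in a negative binomial model is commonly written as below. The symbols are notation assumed here, not taken from the slides: K_{gj} is the read count of gene g in sample j, \mu_{gj} its mean, and \phi_g the dispersion of gene g.

```latex
\operatorname{Var}(K_{gj}) = \mu_{gj} + \phi_g \, \mu_{gj}^{2}
```

With \phi_g = 0 the variance equals the mean, which is the Poisson case; the methods differ mainly in how \phi_g is estimated.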


Statistical modeling (2)

•  Cuffdiff – separate variance model for single-isoform (similar to DESeq) and multi-isoform genes (mixture model of negative binomials)

•  baySeq – Bayesian model of negative binomial distributions


Statistical modeling (3)

•  PoissonSeq – gene counts are modeled as a Poisson variable whose mean depends on the normalized library size, the expression of the gene and the correlation of the gene with the respective condition

•  limmaQN and limmaVoom assume that the transformed values are ready for linear modeling (so that any statistical method can be used)

Test for differential expression (1)

•  edgeR and DESeq – exact test using negative binomial distribution

•  Cuffdiff – t-test for mean-variance ratio test statistic

•  limmaQN and limmaVoom – moderated t-statistic

•  baySeq – posterior likelihood of DE

Test for differential expression (2)

•  PoissonSeq – tests for the significance of the correlation between gene and condition using chi-square distribution

•  All methods except PoissonSeq use standard FDR (Benjamini-Hochberg) whereas PoissonSeq implements a novel way of finding FDR
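The standard Benjamini-Hochberg adjustment mentioned above can be sketched as follows. This is an illustrative implementation with a hypothetical function name and toy p-values, not the code used by any of the compared packages.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy p-values (hypothetical).
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

Genes whose adjusted value falls below the chosen FDR threshold would be called DE.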

Performance in practice

Reference datasets used

•  Sequencing Quality Control (SEQC) – Replicated samples of the human whole body reference RNA and human brain reference RNA

– Spike-in synthetic oligonucleotides with different mixing ratios

– Roughly 1000 genes validated by TaqMan qPCR

•  Biological replicates from three cell lines (part of ENCODE project)

Features compared

•  Normalization of count data

•  Sensitivity and specificity of DE detection

•  Performance on the subset of genes that are expressed in one condition only

•  Effect of reduced sequencing depth and number of replicates

Correlation with qPCR

AUC with different DE cutoffs

Distribution of p-values under null model

Signal-to-noise vs significance for genes expressed in one condition

False positives

Sensitivity

Overview

Summary

•  No single method was best in all comparisons

•  Cuffdiff performed the worst, possibly due to its normalization, which accounts for isoforms

•  limma, which was developed for expression microarray data, performed comparably to the RNA-seq-specific methods

•  Including more replicate samples should be preferred over increasing the number of sequencing reads
