Tauno Metsalu, BIIT Research Group, 12.03 · 2014. 3. 18.
Tauno Metsalu BIIT Research Group
12.03.2014
Introduction
• One of the most common tasks with gene expression data is to find differentially expressed (DE) genes between two conditions
• Various methods for RNA-seq data have been proposed
• This article compares the methods both methodologically and in practice
Methods compared
• Cuffdiff
• edgeR
• DESeq
• PoissonSeq
• baySeq
• limmaQN (quantile normalization)
• limmaVoom (voom)
Methodological background
Starting point
• All methods except Cuffdiff start from read counts assigned to each gene (HTSeq)
• Cuffdiff – starts from transcript level to account for different isoforms (Cufflinks)
• Some normalization is needed to take different sequencing depths into account
Normalization (1)
• DESeq calculates a scaling factor for each sample – the ratio of each gene's read count to the geometric mean of that gene's counts across all samples, and then takes the median of these ratios
• Cuffdiff – similar, but performs intra-condition scaling first and then inter-condition scaling; it additionally uses transcript-specific normalization
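The DESeq-style scaling factor described above can be sketched in a few lines of Python. This is a minimal illustration, not the DESeq implementation: the function name `size_factors`, the toy counts, and the choice to skip genes containing a zero count are all assumptions made for the example.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios scaling factors.

    counts: list of genes, each a list of read counts (one per sample).
    Returns one scaling factor per sample.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        # Assumption: a gene with any zero count has a geometric mean of
        # zero, so it is skipped rather than producing an undefined ratio.
        if min(gene) <= 0:
            continue
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # Median of the per-gene ratios, computed on the log scale.
    return [math.exp(median(r)) for r in log_ratios]

# Hypothetical toy data: sample 2 is sequenced twice as deeply as sample 1.
toy_counts = [[10, 20], [100, 200], [50, 100], [30, 60]]
factors = size_factors(toy_counts)
```

Dividing each sample's counts by its factor puts the samples on a common scale; in the toy data the second factor is twice the first, mirroring the doubled sequencing depth.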
Normalization (2)
• edgeR uses the trimmed mean of M values (TMM) – a weighted average of log expression ratios (M values) over the subset of genes remaining after excluding genes with high average read counts and/or large differences in expression between two experiments
• baySeq – uses the upper quartile (75% quantile) to normalize library sizes
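Upper-quartile scaling, as in the baySeq bullet above, can be illustrated as follows. This is a sketch under stated assumptions: the function name `upper_quartile_factors`, the rescaling of the factors to average 1, and the toy data are hypothetical and not taken from baySeq itself.

```python
from statistics import quantiles

def upper_quartile_factors(samples):
    """samples: one list of gene counts per sample.

    Returns a scaling factor per sample, proportional to that sample's
    75% quantile and rescaled so that the factors average to 1.
    """
    # quantiles(..., n=4) returns the three quartile cut points;
    # index 2 is the upper quartile.
    uq = [quantiles(s, n=4, method="inclusive")[2] for s in samples]
    mean_uq = sum(uq) / len(uq)
    return [q / mean_uq for q in uq]

# Hypothetical toy data: sample B has twice the depth of sample A.
sample_a = [10, 20, 30, 40, 50]
sample_b = [20, 40, 60, 80, 100]
uq_factors = upper_quartile_factors([sample_a, sample_b])
```

Compared with scaling by total read count, the upper quartile is less affected by a handful of very highly expressed genes.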
Normalization (3)
• PoissonSeq – the least differentiated gene set between two conditions is used to compute normalization factors
• limmaQN – quantile normalization makes counts across all samples have the same empirical distribution
• limmaVoom – locally weighted scatterplot smoothing (LOWESS) to estimate the mean-variance relation and transform read counts
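Quantile normalization, as used by limmaQN above, is easy to show on a toy example. The sketch below is a generic version of the technique, not limma's code; the function name and data are hypothetical, and ties are broken arbitrarily, which real implementations handle more carefully.

```python
def quantile_normalize(columns):
    """columns: one list of values per sample, all of equal length.

    Returns new columns that all share the same empirical distribution.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [sum(col[k] for col in sorted_cols) / len(columns)
                 for k in range(n)]
    normalized = []
    for col in columns:
        # Indices of this sample's values in ascending order.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            # Replace each value by the reference value of the same rank.
            out[i] = reference[rank]
        normalized.append(out)
    return normalized

# Hypothetical toy data: two samples, three genes.
normalized = quantile_normalize([[5, 2, 3], [4, 1, 6]])
```

After the transformation every sample contains the same set of values (here 1.5, 3.5 and 5.5), only assigned to genes according to each sample's own ranking.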
Statistical modeling (1)
• edgeR – uses the negative binomial distribution as a model for read counts; the overdispersion factor is estimated by combining gene-specific and common dispersion estimates
• DESeq – similar to edgeR, but the overdispersion factor is estimated using the mean expression of a gene and biological expression variability
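For reference, the role of the overdispersion factor in the negative binomial model can be written out. The symbols are assumptions introduced here, not notation from the slides: K_g is the read count of gene g, mu_g its mean, and phi_g its dispersion.

```latex
\operatorname{Var}(K_g) = \mu_g + \phi_g \, \mu_g^{2}
```

With phi_g = 0 this reduces to the Poisson case, where the variance equals the mean; the second term captures the extra variability between replicates.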
Statistical modeling (2)
• Cuffdiff – separate variance models for single-isoform genes (similar to DESeq) and multi-isoform genes (mixture model of negative binomials)
• baySeq – Bayesian model of negative binomial distributions
Statistical modeling (3)
• PoissonSeq – gene counts are modeled as a Poisson variable whose mean depends on the normalized library size, the expression of the gene and the correlation of the gene with the respective condition
• limmaQN and limmaVoom assume that the transformed values are ready for linear modeling (so that standard statistical methods can be applied)
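One log-linear form consistent with the PoissonSeq bullet above is sketched below. The notation is an assumption made for illustration: N_{ij} is the count of gene j in sample i, d_i the normalized library size, beta_j the expression level of gene j, y_i the condition label of sample i, and gamma_j the association of gene j with the condition.

```latex
N_{ij} \sim \mathrm{Poisson}(\mu_{ij}), \qquad
\log \mu_{ij} = \log d_i + \log \beta_j + \gamma_j \, y_i
```

Under this form, a gene is not differentially expressed when gamma_j = 0.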
Test for differential expression (1)
• edgeR and DESeq – exact test using the negative binomial distribution
• Cuffdiff – t-test on a mean-variance ratio test statistic
• limmaQN and limmaVoom – moderated t-statistic
• baySeq – posterior likelihood of DE
Test for differential expression (2)
• PoissonSeq – tests for the significance of the correlation between gene and condition using the chi-square distribution
• All methods except PoissonSeq use the standard FDR procedure (Benjamini-Hochberg), whereas PoissonSeq implements a novel way of estimating FDR
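The Benjamini-Hochberg adjustment mentioned above can be sketched in plain Python. The function name and the example p-values are hypothetical; the packages compared here use their own implementations.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for four genes.
p = [0.01, 0.04, 0.03, 0.20]
q = benjamini_hochberg(p)
# Genes called DE at an FDR threshold of 0.05.
significant = [i for i, value in enumerate(q) if value <= 0.05]
```

In this toy example only the first gene passes the 0.05 threshold: the genes with raw p-values 0.03 and 0.04 both receive an adjusted value of about 0.053.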
Performance in practice
Reference datasets used
• Sequencing Quality Control (SEQC) – replicated samples of the human whole body reference RNA and human brain reference RNA
– Spike-in synthetic oligonucleotides with different mixing ratios
– Roughly 1000 genes validated by TaqMan qPCR
• Biological replicates from three cell lines (part of the ENCODE project)
Features compared
• Normalization of count data
• Sensitivity and specificity of DE detection
• Performance on the subset of genes that are expressed in one condition only
• Effect of reduced sequencing depth and number of replicates
Correlation with qPCR
AUC with different DE cutoffs
Distribution of p-values under null model
Signal-to-noise vs significance for genes expressed in one condition
False positives
Sensitivity
Overview
Summary
• No single method was best in all comparisons
• Cuffdiff performed the worst, possibly due to its normalization, which accounts for isoforms
• limma, which was developed for expression microarray data, performed comparably to the other methods
• Including more replicate samples should be preferred over increasing the number of sequencing reads