Tauno Metsalu · BIIT Research Group · 12.03 · 2014. 3. 18. · Statistical modeling (3) • ...

Tauno Metsalu, BIIT Research Group

12.03.2014

Introduction

• One of the most common tasks with gene expression data is to find differentially expressed (DE) genes between two conditions

• Various methods for RNA-seq data have been proposed

• This article compares the methods both methodologically and in practice

Methods compared

• Cuffdiff
• edgeR
• DESeq
• PoissonSeq
• baySeq
• limmaQN (quantile normalization)
• limmaVoom (voom)

Methodological background

Starting point

• All methods except Cuffdiff start from read counts assigned to each gene (HTSeq)

• Cuffdiff starts from the transcript level to account for different isoforms (Cufflinks)

• Some normalization is needed to take different sequencing depths into account

Normalization (1)

• DESeq calculates a scaling factor for each sample: the read count of each gene is divided by the geometric mean of that gene's read counts across all samples, and the median of these ratios is taken

• Cuffdiff – similar, but performs intra-condition scaling first and then inter-condition scaling; it additionally uses transcript-specific normalization
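
The median-of-ratios scaling described for DESeq can be sketched in a few lines. This is a minimal illustration rather than DESeq's own code: the function name `size_factors`, the toy count matrix and the choice to skip genes with a zero count in any sample are assumptions made for the example.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: list of genes, each a list of read counts (one per sample)."""
    n_samples = len(counts[0])
    # Assumption: use only genes with a non-zero count in every sample,
    # so that the geometric mean is defined and positive.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of the geometric mean of each gene across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Log ratio of each gene's count in sample j to its geometric mean.
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Toy data: sample 2 was sequenced twice as deeply as sample 1.
counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 5],  # skipped because of the zero count
]
factors = size_factors(counts)
print(factors)
```

Dividing each sample's counts by its factor puts the samples on a common scale; in the toy data the second factor is twice the first.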

Normalization (2)

• edgeR uses the trimmed mean of M-values (TMM) – a weighted average of log expression ratios over the subset of genes that remain after excluding genes with high average read counts and/or large differences in expression between two experiments

• baySeq – uses the upper quartile (75% quantile) to normalize library sizes
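
Upper-quartile scaling, as mentioned for baySeq, can be illustrated as follows. This is a sketch under stated assumptions and not baySeq's implementation: the function name `upper_quartile_factors`, the exclusion of zero counts and the rescaling by the mean upper quartile are choices made for the example.

```python
from statistics import quantiles

def upper_quartile_factors(samples):
    """samples: dict mapping sample name -> list of per-gene read counts."""
    uq = {}
    for name, counts in samples.items():
        # Assumption: zero counts are ignored before taking the quantile.
        expressed = [c for c in counts if c > 0]
        # quantiles(..., n=4) returns the three quartile cut points;
        # index 2 is the 75% quantile.
        uq[name] = quantiles(expressed, n=4, method="inclusive")[2]
    # Rescale so that the factors average to 1 across samples.
    mean_uq = sum(uq.values()) / len(uq)
    return {name: q / mean_uq for name, q in uq.items()}

samples = {
    "a": [0, 10, 20, 30, 40, 50],
    "b": [0, 20, 40, 60, 80, 100],
}
uq_factors = upper_quartile_factors(samples)
print(uq_factors)
```

Sample "b" has twice the counts of sample "a", so its factor comes out twice as large.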

Normalization (3)

• PoissonSeq – the set of genes that differ least between the two conditions is used to compute normalization factors

• limmaQN – quantile normalization makes counts across all samples have the same empirical distribution

• limmaVoom – uses locally weighted scatterplot smoothing (LOWESS) to estimate the mean-variance relation and transform read counts
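
Quantile normalization, as named in the limmaQN bullet, can be sketched like this. The code is a simplified illustration, not limma's routine: the function name `quantile_normalize` is hypothetical, ties are broken arbitrarily instead of being averaged, and any prior log transformation of the counts is left out.

```python
def quantile_normalize(samples):
    """samples: list of samples, each a list of per-gene values (equal lengths)."""
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [
        sum(s[k] for s in sorted_samples) / len(samples) for k in range(n_genes)
    ]
    normalized = []
    for s in samples:
        # Gene indices ordered by increasing value in this sample.
        order = sorted(range(n_genes), key=lambda g: s[g])
        out = [0.0] * n_genes
        for rank, g in enumerate(order):
            out[g] = reference[rank]  # gene with this rank gets the reference value
        normalized.append(out)
    return normalized

samples = [[5, 2, 3], [4, 1, 6]]
normalized = quantile_normalize(samples)
print(normalized)
```

After the transformation every sample contains the same set of values, so all samples share one empirical distribution while each gene keeps its rank within its sample.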

Statistical modeling (1)

• edgeR – uses the negative binomial distribution as a model for read counts; the overdispersion factor is estimated using both gene-specific and common dispersion effects

• DESeq – similar to edgeR, but the overdispersion factor is estimated using the mean expression of a gene and biological expression variability
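
To see why an overdispersion factor matters, the snippet below compares the Poisson variance with the negative binomial variance under the common mean-dispersion parameterization (variance = mean + dispersion × mean²). The function name `nb_variance` and the dispersion value 0.1 are illustrative assumptions; how edgeR and DESeq actually estimate the dispersion is not shown.

```python
def nb_variance(mean, dispersion):
    """Variance of a negative binomial count with the given mean and dispersion."""
    return mean + dispersion * mean ** 2

for mean in (10, 100, 1000):
    poisson_var = mean  # Poisson: the variance equals the mean
    print(mean, poisson_var, nb_variance(mean, dispersion=0.1))
```

With zero dispersion the negative binomial variance reduces to the Poisson variance; for highly expressed genes the quadratic term dominates.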

Statistical modeling (2)

• Cuffdiff – separate variance models for single-isoform genes (similar to DESeq) and multi-isoform genes (mixture model of negative binomials)

• baySeq – Bayesian model of negative binomial distributions

Statistical modeling (3)

• PoissonSeq – gene counts are modeled as a Poisson variable whose mean depends on the normalized library size, the expression of the gene and the correlation of the gene with the respective condition

• limmaQN and limmaVoom assume that the transformed values are ready for linear modeling (so that standard statistical methods can be applied)
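
The PoissonSeq mean described above can be written as a log-linear model. The sketch assumes the form log(mean) = log(depth) + log(beta) + gamma × condition, which is how the three ingredients in the bullet are usually combined; the function name `poisson_mean`, the parameter names and the 0/1 coding of the condition are assumptions made for illustration.

```python
import math

def poisson_mean(depth, beta, gamma, condition):
    """Expected count under log(mean) = log(depth) + log(beta) + gamma * condition.

    depth     -- normalized library size of the sample
    beta      -- baseline expression level of the gene
    gamma     -- association of the gene with the condition (0 means not DE)
    condition -- numeric condition label of the sample (assumed 0 or 1 here)
    """
    return depth * beta * math.exp(gamma * condition)

# A gene with gamma = 0 has the same expected count in both conditions.
print(poisson_mean(2.0, 50.0, 0.0, 0), poisson_mean(2.0, 50.0, 0.0, 1))
# A gene with gamma = log(2) is expected to double in condition 1.
print(poisson_mean(2.0, 50.0, math.log(2), 0), poisson_mean(2.0, 50.0, math.log(2), 1))
```

Under this form, testing for differential expression amounts to testing whether gamma differs from zero.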

Test for differential expression (1)

• edgeR and DESeq – exact test using the negative binomial distribution

• Cuffdiff – t-test for a mean-variance ratio test statistic

• limmaQN and limmaVoom – moderated t-statistic

• baySeq – posterior likelihood of DE

Test for differential expression (2)

• PoissonSeq – tests for the significance of the correlation between gene and condition using the chi-square distribution

• All methods except PoissonSeq use the standard FDR procedure (Benjamini-Hochberg), whereas PoissonSeq implements a novel way of estimating FDR
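
The Benjamini-Hochberg adjustment mentioned above can be sketched as follows. This is a plain illustration of the standard procedure, not code taken from any of the compared packages; the function name `benjamini_hochberg` and the example p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvalues = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(pvalues)
print(adjusted)
```

Genes whose adjusted p-value falls below the chosen FDR level (for example 0.05) are reported as differentially expressed.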

Performance in practice

Reference datasets used

• Sequencing Quality Control (SEQC)

  – Replicated samples of the human whole body reference RNA and human brain reference RNA

  – Spike-in synthetic oligonucleotides with different mixing ratios

  – Roughly 1000 genes validated by TaqMan qPCR

• Biological replicates from three cell lines (part of the ENCODE project)

Features compared

• Normalization of count data

• Sensitivity and specificity of DE detection

• Performance on the subset of genes that are expressed in one condition only

• Effect of reduced sequencing depth and number of replicates

Correlation with qPCR

AUC with different DE cutoffs

Distribution of p-values under null model

Signal-to-noise vs significance for genes expressed in one condition

False positives

Sensitivity

Overview

Summary

• No single method was best in all comparisons

• Cuffdiff performed the worst, possibly due to its normalization, which accounts for isoforms

• Limma, which was developed for expression microarray data, had comparable performance

• Including more replicate samples should be preferred over increasing the number of sequencing reads