RDP TECHNICAL REPORT
Created 04/12/2014, Updated 08/08/2014

Comparison of Three Fungal ITS Reference Sets

Qiong Wang and Jim R. Cole
[email protected], [email protected]

Summary

In this report, we evaluate the performance of three fungal ITS reference sets using the RDP Classifier (1). The genera covered differ substantially among the three sets. UNITE is the largest set, covering 85% and 73% of the genera in Warcup and DOE_SFA, respectively. DOE_SFA is the smallest, containing only 48% and 39% of the Warcup and UNITE genera, respectively.

Warcup showed the highest and tightest similarity within species, with a median of 96%. UNITE_sh (sequences grouped by UNITE “species hypothesis” accession code) has a median similarity within species of 90%. Warcup and UNITE_sh performed similarly in leave-one-sequence-out testing: 85% and 88% accuracy at the species level and 93% and 90% at the genus level, respectively. Warcup showed the best accuracy in leave-one-taxon-out testing, with UNITE_sh second.

Classifying 1,000 near-full-length ITS sequences with the UNITE_sh training set took 80 seconds on a single CPU of a Mac with a 3.2 GHz Intel Core i5 processor. With the Warcup training set, classification was about twice as fast, roughly in proportion to the relative number of species. When trained on the UNITE_name set, in which sequences are grouped by UNITE taxon name, the Classifier performed much worse than when trained on UNITE_sh.

Both the Warcup and UNITE_sh ITS training sets are available on the RDP Classifier web site, the RDP SourceForge repository (http://sourceforge.net/projects/rdp-classifier/) and the RDP GitHub repository (https://github.com/rdpstaff).

ITS Reference Sets

DOE_SFA ref set: This is a published hand-curated set. The sequences and taxonomy construction of this set were described in detail in Porras-Alfaro et al. (U.S. Department of Energy Science Focus Area; 2). Briefly, the majority of sequences were selected from published phylogenies or from NCBI searches. It contains lineage information only down to the genus level.

Warcup ref set: A version from an active curatorial effort, kindly provided by Paul Greenfield and Vinita Deshpande of the Australian Commonwealth Scientific and Industrial Research Organization (manuscript in preparation). It also incorporates some training sequences from the DOE_SFA and UNITE ref sets. It contains lineages to the species level.

UNITE ref set: A set consisting of UNITE core sequences (excluding chimeric and low-quality sequences) for each dynamic species hypothesis, provided by Kessy Abarenkov of UNITE on July 4, 2014. This file uses the UNITE “dynamic species hypotheses”. These were created using a two-tier clustering process, which first clusters sequences to the subgenus/genus level and then to the finer species level (3). In addition to the UNITE “species hypothesis” accession code, each sequence is labeled with a lineage that includes a more traditional “UNITE taxon name” as the species designation. We tested the UNITE set twice: once grouping by UNITE taxon name as terminal taxa (UNITE_name), and a second time using a concatenation of the UNITE “species hypothesis” and the UNITE taxon name to group sequences into terminal taxa (UNITE_sh). For example, instead of having one terminal taxon Cortinarius_caesiocortinatus, the UNITE_sh set has two terminal taxa, “Cortinarius_caesiocortinatus|SH192002.06FU” and “Cortinarius_caesiocortinatus|SH192062.06FU”. Except for the grouping of sequences into terminal taxa, the sequences included in these two UNITE ref sets are identical.

For each of the ref sets described above, we constructed a unique set by removing any sequence identical to, or a substring of, another sequence in the same training set. Removing duplicates is important when evaluating the performance of a dataset, because duplicates inflate the results. The taxonomic composition and the number of sequences are listed in Table 1a. In addition to the common domain Fungi, the DOE_SFA set contains 1 sequence from each of three domains (Protozoa, Viridiplantae and Stramenopiles); UNITE contains 56 sequences from domain Protozoa. The vast majority of sequences in all three datasets cover the full ITS region, including ITS1, 5.8S and ITS2 (Table 1b).

Table 1a: Taxonomic composition at major ranks

Rank               Warcup   DOE_SFA   UNITE
domain (kingdom)   1        2         2
phylum             8        11        10
class              40       36        45
order              131      118       167
family             364      328       523
genus              1,620    1,134     2,135
species            8,967    NA        20,221*
Unique sequences   17,923   6,889     145,019

* UNITE_sh has 20,221 species-level taxa; UNITE_name has 10,346.
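The two species-level counts in the footnote follow from how terminal taxa are labeled. The following is a minimal Python sketch of the two labeling schemes, using the Cortinarius example from the text. The function name and the (taxon name, accession) record layout are hypothetical and are not the format of the UNITE release.

```python
def terminal_label(taxon_name, sh_accession, use_sh=True):
    """Return the terminal taxon label for one sequence.

    use_sh=True mimics UNITE_sh (taxon name joined to the species
    hypothesis accession); use_sh=False mimics UNITE_name.
    """
    return f"{taxon_name}|{sh_accession}" if use_sh else taxon_name


# Two records from the example in the text (hypothetical record layout).
records = [
    ("Cortinarius_caesiocortinatus", "SH192002.06FU"),
    ("Cortinarius_caesiocortinatus", "SH192062.06FU"),
]

sh_taxa = {terminal_label(name, sh) for name, sh in records}
name_taxa = {terminal_label(name, sh, use_sh=False) for name, sh in records}
print(len(sh_taxa), len(name_taxa))
```

Grouping by the concatenated label keeps the two species hypotheses apart, whereas grouping by name alone merges them into a single terminal taxon.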

Table 1b: Completeness of the unique sequences

Completeness (%)   Warcup   DOE_SFA   UNITE
(Near) complete    95.2     94.6      97.5
Incomplete ITS1    2.4      2.7       1.2
Incomplete ITS2    2        2.2       1.1
Incomplete both    0.3      0.4       0.1
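The unique-set construction described for the ref sets (dropping any sequence that is identical to, or a substring of, another sequence in the same set) can be sketched as follows. This is an illustrative quadratic-time version under stated assumptions, not the code used for the report. The function name and the ID-to-sequence mapping are hypothetical, and a set the size of UNITE would need a faster index.

```python
def unique_set(seqs):
    """Drop sequences identical to, or a substring of, another sequence.

    seqs maps a sequence ID to its sequence string. Longer sequences are
    visited first, so each sequence is only compared against kept
    sequences of equal or greater length. Among identical sequences the
    one with the smallest ID is kept.
    """
    kept = {}
    ordered = sorted(seqs.items(), key=lambda item: (-len(item[1]), item[0]))
    for seq_id, seq in ordered:
        if not any(seq in other for other in kept.values()):
            kept[seq_id] = seq
    return kept


# Toy input: "c" duplicates "a", and "b" is a substring of "a".
toy = {"a": "ACGTACGT", "b": "CGTA", "c": "ACGTACGT", "d": "TTTT"}
unique = unique_set(toy)
print(sorted(unique))
```

Comparing only against kept sequences is sufficient: a sequence is dropped only when a kept sequence contains it, so anything contained in a dropped sequence is also contained in a kept one.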

Results

Commonality

We compared the three ref sets to measure the extent to which genera and sequences are shared between them (Tables 2a and 2b). UNITE is the largest set, containing 85% of the genera from Warcup and 73% of the genera from DOE_SFA. It also contains more than half of the sequences (GenBank accession numbers) from Warcup and DOE_SFA. Warcup is the second largest set, containing 69% of the genera from DOE_SFA and 64% of the genera from UNITE. The percentage of sequences from the other sets found in either Warcup or DOE_SFA was 15% or less (Table 2b). The numbers of shared genera and shared sequences between each pair of ref sets are shown in a Venn diagram (Fig. 1).

Table 2a: Shared genera (percent of the row set's genera also found in the column set)

          Warcup   DOE_SFA   UNITE
Warcup    -        48%       85%
DOE_SFA   69%      -         73%
UNITE     64%      39%       -

Table 2b: Shared sequences (percent of the row set's sequences also found in the column set)

          Warcup   DOE_SFA   UNITE
Warcup    -        6%        66%
DOE_SFA   15%      -         56%
UNITE     8%       3%        -
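The pairwise percentages in Tables 2a and 2b are directional: each cell is the share of one set's members that also appear in another set. Here is a minimal Python sketch of that calculation on toy data. The set names "A", "B" and "C", their memberships and the function name are made up for illustration and are not the report's data.

```python
def percent_shared(sets):
    """For each ordered pair (row, col), percent of row's members found in col."""
    table = {}
    for row, row_members in sets.items():
        for col, col_members in sets.items():
            if row != col and row_members:
                shared = len(row_members & col_members)
                table[(row, col)] = 100.0 * shared / len(row_members)
    return table


# Toy genus memberships for three hypothetical reference sets.
genera = {
    "A": {"Amanita", "Boletus", "Russula", "Tuber"},
    "B": {"Amanita", "Boletus"},
    "C": {"Amanita", "Boletus", "Russula", "Tuber", "Mucor"},
}

shared = percent_shared(genera)
for (row, col), pct in sorted(shared.items()):
    print(f"{pct:.0f}% of the genera in {row} are also in {col}")
```

Because the denominator is the size of the row set, the table is not symmetric, which is why a small set can be almost fully contained in a large one while covering only a fraction of it.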

Figure 1: Venn diagram of shared genera and shared sequences.
[Figure 1 image not reproduced. Panels: Shared genera; Shared sequences. Circles: DOE_SFA, Warcup, UNITE. Region counts as extracted, with region assignment not recoverable: shared genera 109, 646, 723, 56, 246, 40, 657; shared sequences 3068, 11111, 796, 239, 2786, 5777, 130044.]

Taxa Similarity

We examined how similar the sequences were within taxa and between taxa. Because no good multiple-alignment methods are available for ITS, we used Sab scores as a measure of similarity between sequences. DOE_SFA does not group sequences at the species rank; its median Sab score within genera is 56% and drops to 31% among families (Fig. 2). Warcup showed the highest and tightest similarity within species, with a median of 96%. UNITE_sh has a median similarity within species of 90%, with a large range from 72% (2nd percentile) to 99% (98th percentile). For both DOE_SFA and UNITE_sh, sequences at the higher ranks were slightly less similar than in Warcup. UNITE_name has the lowest median similarity within species, 37%. The similarity between species (or higher ranks) was low for all the sets.

           

Figure 2: Box-and-whisker plots showing intra-taxon similarity (Sab score) for each major rank. The 1st quartile, median and 3rd quartile are shown as the bottom, middle and top of each box; the 2nd and 98th percentiles are indicated by the whiskers. Clockwise: Warcup, DOE_SFA, UNITE_name and UNITE_sh. Note that DOE_SFA does not group at the species rank.
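The five values drawn for each box in Figure 2 can be computed from a pool of Sab scores as sketched below. This is an illustration only: the report does not state which percentile interpolation was used, so the "inclusive" method of Python's `statistics.quantiles` is an assumption, as are the function name and the toy scores.

```python
from statistics import quantiles


def box_summary(scores):
    """Return the 2nd, 25th, 50th, 75th and 98th percentiles of scores."""
    # quantiles(..., n=100) returns 99 cut points; cut point i is the
    # (i + 1)th percentile, so index 1 is the 2nd percentile, and so on.
    cuts = quantiles(scores, n=100, method="inclusive")
    return {
        "p2": cuts[1],
        "q1": cuts[24],
        "median": cuts[49],
        "q3": cuts[74],
        "p98": cuts[97],
    }


# Toy pool: the integers 0..100 stand in for Sab scores in percent.
summary = box_summary(range(101))
print(summary)
```

Using the 2nd and 98th percentiles for the whiskers, rather than the minimum and maximum, keeps a handful of extreme pairs from dominating the plot.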

Leave-One-Out Testing

We performed both leave-one-sequence-out and leave-one-taxon-out testing on the three fungal ITS datasets. All Warcup and DOE_SFA sequences were used for testing. For the UNITE_sh and UNITE_name sets, one sequence from each species was chosen randomly as the query for these tests.

Classification without a bootstrap cutoff was used for these accuracy measurements. Warcup achieved 85% accuracy at the species level and 93% accuracy at the genus level. UNITE_sh showed 88% at the species level and 90% at the genus level (Fig. 3). DOE_SFA showed only 79% at the genus level. Our results differ notably from those in the publication describing the DOE_SFA dataset (2): duplicate sequences were not removed from the training set in those tests, whereas they were removed for this report.

When the taxon of the test sequence was removed from the training set, the accuracy at the lower ranks (order, family, genus) decreased for all the data sets. For example, if the species was not present in the training set, the Classifier trained on the Warcup set assigned a sequence to the correct genus in 73% of the cases, but in only 58% of the cases when trained on the UNITE_sh set. If the genus was not present, the Classifier trained on the Warcup set made the correct family assignment 90% of the time, but only 77% of the time when trained on the UNITE_sh set and 60% when trained on the DOE_SFA set.

When tested on the UNITE_name ref set, constructed using the species name as the terminal taxon name, the Classifier showed only 74% accuracy at the species level and 80% at the genus level with leave-one-sequence-out testing. The accuracy of leave-one-taxon-out testing using the UNITE_name set was also worse than with the UNITE_sh set.

Further investigating the sequences misclassified during leave-one-out testing, we found that for every set except UNITE_name, the majority have their closest match (highest Sab score) to a sequence from a different taxon (Table 3).

Table 3: Percent of misclassified sequences with closest match in a different taxon

Set          # misclassified seqs   % misclassified seqs with closest match in different taxon
Warcup       2920                   66.3%
DOE_SFA      1347                   78.6%
UNITE_sh     3337                   58.2%
UNITE_name   5373                   30.1%
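The leave-one-sequence-out accuracy reported in this section can be sketched as a simple loop. In the Python sketch below, the classifier is a stand-in that returns the label of the training sequence sharing the most 8-mers with the query; it is NOT the naive Bayesian RDP Classifier. All function names and the toy sequences are hypothetical. Singleton taxa are skipped, following the rule described under Methods.

```python
from collections import Counter


def kmers(seq, k=8):
    """Return the set of unique k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def nearest_label(query, training):
    """Stand-in classifier: label of the sequence sharing the most 8-mers."""
    query_kmers = kmers(query)
    best = max(training, key=lambda rec: len(query_kmers & kmers(rec[0])))
    return best[1]


def leave_one_sequence_out(records, classify=nearest_label):
    """records is a list of (sequence, label); returns (accuracy, n_tested)."""
    counts = Counter(label for _, label in records)
    correct = tested = 0
    for i, (seq, label) in enumerate(records):
        if counts[label] < 2:
            continue  # singleton: its taxon would vanish from the training set
        training = records[:i] + records[i + 1:]  # hold out the query
        tested += 1
        correct += classify(seq, training) == label
    return (correct / tested if tested else float("nan")), tested


# Toy records: two sequences each for species A and B, one singleton C.
toy = [
    ("ACGTACGTTTGACCAA", "A"),
    ("ACGTACGTTTGACCAT", "A"),
    ("GGGCCCTTTAAAGGGC", "B"),
    ("GGGCCCTTTAAAGGGA", "B"),
    ("TTTTTTTTCCCCCCCC", "C"),
]
result = leave_one_sequence_out(toy)
print(result)
```

Leave-one-taxon-out testing would differ only in the training set: instead of dropping just the query, every record carrying the query's lowest taxon would be dropped, and the assignment would be scored at the next rank up.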

     

     

   

Figure 3: Classification accuracy at each major taxon rank from leave-one-out testing. The RDP Classifier was trained on each of the four fungal ITS sets. No bootstrap cutoff was applied in the accuracy calculation. Top: leave-one-sequence-out testing. Bottom: leave-one-taxon-out testing.
[Figure 3 plots not reproduced. Y-axis: accuracy, 30% to 100%. X-axis: domain, phylum, class, order, family, genus, species (top panel); domain through genus (bottom panel). Series: DOE_SFA, Warcup, UNITE_sh, UNITE_name.]

Methods

Leave-one-sequence-out testing: In each iteration, one sequence from the training set was chosen as the test sequence and removed from the training set. The assignment produced by the Classifier for that sequence was compared to its original taxonomy label to measure the accuracy of the Classifier. Singleton sequences could still be included in the accuracy calculation at a higher rank if the higher-rank taxon contained multiple sequences. For example, if a sequence is the only sequence for a species, it is not included in the accuracy calculation for the species rank; but if the sequence belongs to a genus containing multiple species, it is included in the accuracy calculation for the genus rank.

Leave-one-taxon-out testing: This is very similar to leave-one-sequence-out testing, except that for each test sequence, the lowest taxon to which that sequence is assigned (either a species or a genus node) was removed from the training set. This tests how likely the Classifier is to assign the sequence to the correct genus or higher taxon when the species or genus is not present in the training set.

Sab score: The percent of shared 8-mers between two sequences. This is the same score as the one calculated by RDP SeqMatch, except that the latter uses 7-mers. We used 8-mers here because the Classifier performs best with 8-mers when trained on 16S rRNA datasets.

Taxa Similarity: For each pair of sequences from a set, we calculated the Sab score and added the score to the pool of the lowest common ancestor taxon of the two sequences. For example, if the two sequences were from the same species, the Sab score was added to the species pool to measure how close sequences are within species. If they were from the same genus but not from the same species, the Sab score was added to the genus pool to measure how close sequences are between species. The Sab scores for each rank were used to generate the box-and-whisker plots.
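The Sab score and the Taxa Similarity pooling described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions. The text says only "percent of shared 8-mers"; dividing by the smaller of the two unique 8-mer counts is an assumption borrowed from the usual SeqMatch definition. The function names, the toy lineages and the toy sequences are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

RANKS = ("domain", "phylum", "class", "order", "family", "genus", "species")


def sab(a, b, k=8):
    """Percent of shared unique k-mers (denominator is an assumption)."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    if not ka or not kb:
        return 0.0
    return 100.0 * len(ka & kb) / min(len(ka), len(kb))


def lowest_common_rank(lineage_a, lineage_b):
    """Deepest rank at which two lineages (ordered as RANKS) still agree."""
    common = None
    for rank, x, y in zip(RANKS, lineage_a, lineage_b):
        if x != y:
            break
        common = rank
    return common


def pool_scores(records):
    """Add each pairwise Sab score to the pool of the pair's lowest common rank."""
    pools = defaultdict(list)
    for (seq_a, lin_a), (seq_b, lin_b) in combinations(records, 2):
        rank = lowest_common_rank(lin_a, lin_b)
        if rank is not None:
            pools[rank].append(sab(seq_a, seq_b))
    return pools


# Toy data: two sequences of one species and one of a sister species.
higher = ("Fungi", "P1", "C1", "O1", "F1", "G1")
toy = [
    ("ACGTACGTTTGACCAA", higher + ("G1_sp1",)),
    ("ACGTACGTTTGACCAT", higher + ("G1_sp1",)),
    ("GGGCCCTTTAAAGGGC", higher + ("G1_sp2",)),
]
pools = pool_scores(toy)
print(dict(pools))
```

The species pool then measures similarity within species, while the genus pool collects only pairs from different species of the same genus, which is what the between-species comparison relies on.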
Completeness Measurement: Sequence records were retrieved from GenBank using the GenBank accession numbers from all three ref sets. Only sequences with both the feature “internal transcribed spacer 1” and the feature “internal transcribed spacer 2” were considered complete, and the corresponding sequence region was kept. This resulted in 13,912 complete reference sequences (called the COMBO set). For each query sequence in each of the ref sets, the pairwise alignment between the query and the COMBO sequence with the best alignment score was used to determine completeness. A query is marked “Incomplete ITS1” if the query alignment contains at least 50 inserts at the beginning, “Incomplete ITS2” if it contains at least 50 inserts at the end of the alignment, or both.

References

1. Wang Q, Garrity GM, Tiedje JM, Cole JR. 2007. Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy. Appl Environ Microbiol. 73(16):5261-5267.

2. Porras-Alfaro A, Liu KL, Kuske CR, Xie G. 2014. From genus to phylum: large-subunit and internal transcribed spacer rRNA operon regions show similar classification accuracies influenced by database composition. Appl Environ Microbiol. 80(3):829-840.

3. Kõljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor A, Bahram M, Bates ST, Bruns TD, Bengtsson-Palme J, Callaghan TM, et al. 2013. Towards a unified paradigm for sequence-based identification of fungi. Molecular Ecology 22:5271–5277.