RDP Tools for Gene-Targeted Analysis (rdp.cme.msu.edu/download/posters/ISME2012.pdf)


Xander is a de Bruijn graph assembler designed for gene-targeted metagenomic assembly. We use a space-efficient graph representation to enable scaling to large datasets. Xander is a local assembly tool: starting from a node in the graph, we walk in each direction, using a Hidden Markov Model as a guide, to assemble genes of interest. To explore population-level diversity, we have developed methods to find additional, sub-optimal paths.
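As a rough illustration of a guided local walk on a de Bruijn graph, the Python sketch below builds a k-mer graph from a few toy reads and extends a start k-mer to the right, picking at each branch the successor that a scoring function prefers. The scoring function is a hypothetical stand-in (it rewards matches to a toy target string) for the Hidden Markov Model that Xander uses. The function names, the value of k, and the reads are all assumptions made for illustration; none of this reflects Xander's actual data structures or search strategy.

```python
from collections import defaultdict

def build_graph(reads, k):
    """Map each k-mer to the set of bases that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + k])
    return graph

def walk_right(graph, start, score, max_steps=100):
    """Greedy extension: at each node append the successor base with the best score."""
    contig = start
    kmer = start
    seen = {start}
    for _ in range(max_steps):
        choices = graph.get(kmer)
        if not choices:
            break  # dead end: no read continues this k-mer
        # len(contig) is the position of the base about to be appended
        base = max(sorted(choices), key=lambda b: score(len(contig), b))
        kmer = kmer[1:] + base
        if kmer in seen:
            break  # stop rather than loop around a cycle
        seen.add(kmer)
        contig += base
    return contig

# Toy guide standing in for an HMM: reward bases that match a target position.
TARGET = "ATGGCTAAGCTT"

def toy_score(position, base):
    return 1 if position < len(TARGET) and TARGET[position] == base else 0

# The last read creates an off-target branch after the k-mer GCTA.
reads = ["ATGGCTAAG", "GGCTAAGCTT", "GGCTATGCAA"]
graph = build_graph(reads, k=4)
contig = walk_right(graph, "ATGG", toy_score)
print(contig)
```

Swapping in a different scoring function sends the walk down the other branch, which is the sense in which the guide, rather than the graph alone, decides which path is assembled.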

HMP Dataset:

100M 101-bp reads (15G) of metagenomic shotgun human gut data from an ulcerative colitis (UC) patient who underwent a colectomy followed by ileal pouch anal anastomosis. In this procedure, the entire colon is resected, the terminal ileum is fashioned into a pouch and connected to the anal canal, and intestinal flow is re-established.

Gene: but (butyryl-CoA transferase)

Butyrate serves as the major energy source of colonocytes, has anti-inflammatory properties, and regulates gene expression, differentiation and apoptosis in host cells. In healthy individuals the but pathway is the major pathway for butyrate production in the human gut.

Results:

Xander searched for and assembled 56 unique protein sequences with length >100. Only two nearly identical sequences were full length. These were very similar (2 and 4 AA substitutions) to a but gene from the HMP reference genome sequence of Acidaminococcus sp. D21, isolated from a healthy human gut.

Sequence   Nucleotide Substitutions   AA Substitutions
1          4                          20 V→I; 194 Q→P
2          8                          20 V→I; 139 V→G; 141 A→S; 194 Q→P

Substitutions found in the two full-length but sequences assembled by Xander

§  FrameBot is a tool for frameshift correction and nearest-neighbor classification that uses a dynamic programming algorithm to align a query DNA sequence against a set of protein sequences.

§  Given a DNA read and a set of known target sequences, it produces corrected protein and DNA sequences and an optimal global or local protein alignment. It also helps filter out non-target reads.

§  The algorithm allows amino acid changes, insertions and deletions, and nucleotide insertions and deletions of one or two bases, thereby repairing frameshifts.

§  To speed up FrameBot as a nearest-neighbor classifier, we employed a metric indexing strategy. By pre-computing the distances between the target sequences and applying the metric index, the number of comparisons required to find the closest reference was greatly reduced.

§  The tool was tested on 454 Titanium amplicon libraries from defined communities covering portions of three genes: nitrogenase reductase (nifH), biphenyl dioxygenase alpha subunit (bphA) and butyryl-CoA:acetate CoA-transferase (but).
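The metric indexing idea in the list above can be sketched with the triangle inequality. The poster does not describe FrameBot's actual metric or index structure, so the Python sketch below makes two assumptions: plain Levenshtein distance between protein strings stands in for the alignment-based distance, and the index is a single pivot with pre-computed distances to every reference. For any reference r and pivot p, the quantity |d(q,p) - d(r,p)| is a lower bound on d(q,r), so references whose bound already exceeds the best distance found so far can be skipped without alignment. All sequences and names are made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance, a true metric on strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def build_index(refs, pivot):
    """Pre-compute the distance from every reference to one pivot sequence."""
    return [(edit_distance(r, pivot), r) for r in refs]

def nearest(query, index, pivot):
    """Return (best_ref, best_dist, comparisons) using triangle-inequality pruning."""
    d_qp = edit_distance(query, pivot)
    comparisons = 1
    best_ref, best_dist = None, float("inf")
    # Visit references in order of increasing lower bound |d(q,p) - d(r,p)|.
    for d_rp, ref in sorted(index, key=lambda t: abs(t[0] - d_qp)):
        if abs(d_qp - d_rp) >= best_dist:
            break  # no remaining reference can beat the current best
        d = edit_distance(query, ref)
        comparisons += 1
        if d < best_dist:
            best_ref, best_dist = ref, d
    return best_ref, best_dist, comparisons

refs = ["MKVLAAGIT", "MKVLSAGIT", "MRRQWPLNE", "TTTTTTTTTTTTTTT", "MK"]
pivot = refs[0]
index = build_index(refs, pivot)
best_ref, best_dist, comparisons = nearest("MKVLAAGLT", index, pivot)
print(best_ref, best_dist, comparisons)
```

A production index would normally use several pivots or a tree of pivots; the single-pivot version is kept only because it shows the pruning rule in the fewest lines.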

Acknowledgements: RDP is supported by the Office of Science (BER), U.S. Dept. of Energy, Grant No. DE-FG02-99ER62848, and NIEHS Grant No. 5P42 ES004911-18/19

RDP Tools for Gene-Targeted Analysis

James Cole1,4*, Benli Chai1,4, Jordan Fish1,4, Qiong Wang1,4, Donna McGarrell1,4, C. Titus Brown2,3, Yanni Sun2 and James Tiedje1,3,4

1Center for Microbial Ecology; 2Computer Science & Engineering; 3Microbiology & Molecular Genetics; 4Plant, Soil and Microbial Sciences; Michigan State University, East Lansing, Michigan 48824

RDP Classifier Feature Highlights:

§  Newly updated Bacterial and Archaeal training set version 9 now contains 10,049 training sequences.

§  New output format offering a choice of a fixed number of ranks, removing uncommon ranks (suborder, subgenera).

§  Command line version now has options for minimum word count, number of bootstrap tests, and ability to train on any taxonomic rank.

§  Provides a convenient leave-one-out testing package for users to check the classification accuracy rate and identify errors in their own training taxonomy and sequences.

§  Source code, compiled programs and training data are freely available at http://sourceforge.net/projects/rdp-classifier/

§  Available online and as a SOAP service through RDP’s website.
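The leave-one-out testing mentioned in the list above can be illustrated with a toy example. The Python sketch below is not the RDP Classifier or its testing package: the word-frequency scorer, the 4-base word size, the smoothing constants, and the sequences and genus names are all assumptions chosen so the example stays small. It holds out each training sequence in turn, retrains on the rest, and reports sequences whose predicted taxon disagrees with their label, which is how a mislabeled training entry tends to surface.

```python
from collections import Counter, defaultdict

def words(seq, k=4):
    """Set of overlapping k-base 'words' in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def train(examples, k=4):
    """Count, per taxon, how many training sequences contain each word."""
    model = defaultdict(Counter)
    sizes = Counter()
    for taxon, seq in examples:
        sizes[taxon] += 1
        model[taxon].update(words(seq, k))
    return model, sizes

def classify(seq, model, sizes, k=4):
    """Pick the taxon with the highest product of smoothed word frequencies."""
    best, best_score = None, -1.0
    for taxon in sorted(sizes):
        score = 1.0
        for w in words(seq, k):
            score *= (model[taxon][w] + 0.5) / (sizes[taxon] + 1)
        if score > best_score:
            best, best_score = taxon, score
    return best

def leave_one_out(examples, k=4):
    """Hold out each sequence, retrain on the rest, and record misclassifications."""
    errors = []
    for i, (taxon, seq) in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]
        model, sizes = train(rest, k)
        predicted = classify(seq, model, sizes, k)
        if predicted != taxon:
            errors.append((i, taxon, predicted))
    accuracy = 1 - len(errors) / len(examples)
    return accuracy, errors

examples = [
    ("GenusA", "ACGTACGTACGTAAAA"),
    ("GenusA", "ACGTACGTACGTAAAC"),
    ("GenusA", "ACGTACGTACGTAACA"),
    ("GenusB", "TTGCATTGCATTGCAG"),
    ("GenusB", "TTGCATTGCATTGCAT"),
    ("GenusB", "ACGTACGTACGTAAAG"),  # deliberately mislabeled for the demo
]
accuracy, errors = leave_one_out(examples)
print(f"accuracy: {accuracy:.3f}")
print("suspect entries:", errors)
```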

RDP MultiClassifier:

§  Processes multiple samples, produces taxon vs. sample spreadsheet.

§  Output is formatted for taxonomy-supervised diversity analysis as described in Sul et al. Proc Natl Acad Sci USA. 2011 Aug 30;108(35):14637-42.

§  Results from multiple runs can be combined into a single spreadsheet.
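To make the taxon vs. sample layout concrete, the Python sketch below tallies per-sample taxon assignments into a table and merges tables from separate runs. It only illustrates the shape of such a spreadsheet; the function names, sample names, tab-separated output and the rule of summing overlapping cells are assumptions, not the MultiClassifier's actual file format or behavior.

```python
from collections import Counter

def taxon_by_sample(assignments):
    """assignments: {sample: [taxon, ...]} -> {taxon: {sample: count}}"""
    table = {}
    for sample, taxa in assignments.items():
        for taxon, n in Counter(taxa).items():
            table.setdefault(taxon, {})[sample] = n
    return table

def merge_tables(*tables):
    """Combine tables from separate runs, summing counts where taxon and sample coincide."""
    merged = {}
    for table in tables:
        for taxon, row in table.items():
            out = merged.setdefault(taxon, {})
            for sample, n in row.items():
                out[sample] = out.get(sample, 0) + n
    return merged

def to_tsv(table):
    """Render the table with one row per taxon and one column per sample."""
    samples = sorted({s for row in table.values() for s in row})
    lines = ["taxon\t" + "\t".join(samples)]
    for taxon in sorted(table):
        cells = (str(table[taxon].get(s, 0)) for s in samples)
        lines.append(taxon + "\t" + "\t".join(cells))
    return "\n".join(lines)

run1 = taxon_by_sample({
    "soil_A": ["Bacillus", "Bacillus", "Clostridium"],
    "soil_B": ["Clostridium"],
})
run2 = taxon_by_sample({"soil_C": ["Bacillus"]})
merged = merge_tables(run1, run2)
tsv = to_tsv(merged)
print(tsv)
```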

RDP Fungal Classifier:

We are excited to be able to offer the RDP Classifier trained on the fungal LSU taxonomic data published by Andrea Porras-Alfaro, Gary Xie, Cheryl Kuske and their co-workers (Liu et al., Appl Environ Microbiol. 2012 Mar;78(5):1523-33). We expect this collaboration to lead to development of other analysis tools for the fungal community.

§  The recently named phylum "Armatimonadetes" replaces candidate division OP10.

§  To avoid confusion for those aggregating high-throughput reads from environments containing chloroplast DNA that may be amplified with universal primers, the phylum Cyanobacteria has been labeled Cyanobacteria/Chloroplast. Under this phylum, the chloroplasts have been added as a new artificial class-level taxon, "Chloroplast".

§  The Acidobacteria and Verrucomicrobia have been modified to include additional new formal taxa. Published informal taxonomies have been maintained for these groups as isolates do not yet cover much of the known diversity.

§  The genus Clostridium has been rearranged based on the Revised Road Map to the Phylum Firmicutes from Bergey's Manual of Systematic Bacteriology to better represent the polyphyletic nature of the genus.

§  The recently named Negativicutes class in the phylum Firmicutes has been added along with associated rearrangements of the other Firmicutes classes.

§  An additional 39 recently named genera have been added to the taxonomy.

myRDP space: upload and analyze your own 16S sequences in your private space

SOAP Services: integrate RDP tools into your pipeline

Browsers: browse and select from taxonomic hierarchy / powerful search and selection features

RDP Classifier: places sequences into bacterial and fungal taxonomies

ProbeMatch: fast search algorithm, limit searches to specific regions

Pyro and FunGene Pipelines: tools for gene-targeted metagenomics

MIMARKS GoogleSheet: helps organize standards-compliant metadata

§  Each tutorial comes with sample input and output data with step-by-step instructions.

§  Tutorials cover individual tools and complex workflow routines such as processing NGS data.

§  Features both web and command-line tools (e.g., RDP Classifier, MultiClassifier and mcClust).

Helping researchers understand best practices for using RDP tools: http://rdp.cme.msu.edu/tutorials/  

SeqMatch: finds nearest neighbor, more accurate than BLAST

The RDP Fungal Classifier is trained on 8506 curated fungal 28S rRNA gene reference sequences, along with a hand-vetted fungal taxonomy including 1702 genera plus higher-level taxa.

Contact: [email protected]

http://rdp.cme.msu.edu

RDP has developed a GoogleSheet for all 14 MIMARKS environmental packages to help you manage the metadata for your samples. These sheets are ideal for use in a remote collaborative environment; and since they are stored by Google, they don’t require any special user IT infrastructure.

RDP also aids myRDP users by helping with SRA submissions to the European Nucleotide Archive (ENA) and NCBI’s GenBank.

See our GoogleSheets Help to begin – http://rdp.cme.msu.edu/misc/googleSheetsHelp.jsp

DESIGN: RDP initiated two surveys (RDP Habitat and Terragenome) to assess metadata requirements for GSC

PROMOTE: RDP and Terragenome have dedicated pages to make our users aware of the GSC standards

MANAGE: RDP develops the MIMARKS GoogleSheet to help our users collect and manage MIMARKS metadata for their samples

SUBMIT: The myRDP SRA PrepKit helps our users prepare the complicated XML documents required for submitting to the SRA with metadata conforming to the MIMARKS specification

PUBLIC USE: Hierarchy Browser search function works on MIMARKS attributes

Genomic Standards Consortium (GSC) http://gensc.org/gsc/

International Soil Metagenome Sequencing Consortium (Terragenome)

http://www.terragenome.org

PS20 Bioinformatics in Microbial Ecology Poster 006A

§  GFClassify takes nucleotide sequences as input so it can be used as a fast prefilter on raw reads.

§  It uses interpolated context models (ICMs) to classify query sequences into one of a set of predefined classes.

§  ICMs predict the probability of a symbol (in this case a nucleotide) given the preceding k symbols, and work well for separating differentially conserved sequences.

§  Tested on 454 amplicon libraries constructed using primers that co-amplify ammonium monooxygenase (amoA) and particulate methane monooxygenase (pmoA) to separate reads of the two paralogous gene families, and on 454 libraries constructed using primers amplifying multiple well-defined groups in the nifH and nifH-like protein families to separate reads into the individual nifH groups.
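The idea of scoring a read by the probability of each nucleotide given the preceding k symbols can be sketched with a fixed-order Markov model per class. This is a deliberately simplified stand-in: a real ICM interpolates across several context lengths, whereas the Python sketch below uses a single order k with add-one smoothing. The class names, training sequences and k=2 are assumptions for illustration and are not real amoA or pmoA data, nor GFClassify's implementation.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def train_model(seqs, k=2):
    """Count (context, next-base) pairs for a fixed-order-k Markov model."""
    counts, totals = Counter(), Counter()
    for s in seqs:
        for i in range(k, len(s)):
            counts[(s[i - k:i], s[i])] += 1
            totals[s[i - k:i]] += 1
    return counts, totals, k

def log_prob(seq, model):
    """Log-probability of seq under the model, with add-one smoothing."""
    counts, totals, k = model
    lp = 0.0
    for i in range(k, len(seq)):
        ctx, base = seq[i - k:i], seq[i]
        lp += math.log((counts[(ctx, base)] + 1) / (totals[ctx] + len(ALPHABET)))
    return lp

def classify(seq, models):
    """Assign seq to exactly one class: the model with the highest log-probability."""
    return max(sorted(models), key=lambda name: log_prob(seq, models[name]))

# Hypothetical training sets for two made-up gene families.
models = {
    "familyA": train_model(["ACACACACACACGT", "ACACACACGTACAC"]),
    "familyB": train_model(["GGTTGGTTGGTTGG", "TTGGTTGGTTGGTT"]),
}
for read in ["ACACACGTAC", "GGTTGGTT"]:
    print(read, "->", classify(read, models))
```

Because every read is scored against every class model and the maximum is taken, each read lands in exactly one class; a background class can be handled the same way by training one more model.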

Method       amoA   pmoA   bg       Error Rate   Time (s)
GFClassify   984    982    10,673   6.0E-4       19.00
HMMER3       984    982    10,673   6.0E-4       423.63

Performance on Database Sequences

                        Nearest Neighbor Class
GFClassify nifH Group   1         2        3     4-5
1                       249,829   432      48    0
2                       20        27,158   12    0
3                       0         0        785   1
4-5                     16        2        2     25

Amplicon Classification Errors
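The counts in the table above can be summarized as an overall disagreement rate between the two methods. The short Python sketch below does that arithmetic; it assumes that rows are GFClassify groups and columns are nearest-neighbor classes, an orientation inferred from the flattened poster layout rather than stated in it.

```python
# Rows: GFClassify nifH group; columns: nearest-neighbor class (order: 1, 2, 3, 4-5).
# The row/column orientation is an assumption read from the poster layout.
labels = ["1", "2", "3", "4-5"]
matrix = [
    [249829, 432, 48, 0],
    [20, 27158, 12, 0],
    [0, 0, 785, 1],
    [16, 2, 2, 25],
]

total = sum(sum(row) for row in matrix)
agree = sum(matrix[i][i] for i in range(len(labels)))
disagree = total - agree

print(f"total reads: {total}")
print(f"disagreements: {disagree} ({100 * disagree / total:.2f}%)")
for i, label in enumerate(labels):
    row_total = sum(matrix[i])
    print(f"group {label}: {100 * matrix[i][i] / row_total:.2f}% agreement")
```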

Group   AK      FL      HI      UT      Overall
I       91.44   97.94   92.52   96.57   92.84
II      8.52    2.01    7.29    3.40    7.00
III     0.04    0.03    0.19    0.03    0.16
IV-V    0.00    0.01    0.00    0.00    0.00

Groups: I, typical Mo–Fe; II, anaerobes & Archaea; III, alternate metal; IV-V, nifH-like

[Figure: alignment errors per amino acid (%) versus percent identity to subject sequence; recovered legend entries: HMMFrame, FragGeneScan]

Accuracy rate of FrameBot compared to FragGeneScan and HMMFrame using nifH defined community sequences

NMDS using Chao-corrected Jaccard index distance for Neon samples based on FrameBot nearest match

Amplicon Data Analysis:

We applied FrameBot to a large set of nifH pyrosequencing libraries from 222 soil samples collected from four NEON ecological observatories in four different ecoclimatic zones, totaling over 1.1 million reads (278,561 unique sequences). These reads were compared to a reference library of 675 protein sequences based on the Zehr collection (http://www.es.ucsc.edu/~wwwzehr/research/database/). It required 11.5 hours to frameshift-correct and calculate nearest neighbors for these unique reads using a single CPU on a Mac Pro with a 2.5 GHz Intel Core i5 processor, a 15-fold speed-up compared to a non-indexed approach.

Poster panel titles:

§  Xander: Gene-Targeted Shotgun Assembly

§  RDP Classifier and New Fungal Classifier

§  RDP Taxonomy Major Updates

§  RDP’s Popular Online Tools

§  FrameBot: A Tool for Frameshift Correction and Nearest-Neighbor Classification

§  GFClassify: Classify Query Sequences to Exactly One of a Set of Related Gene Families

§  New Interactive Workflow Tutorials

§  RDP Actively Supports Community-Driven Annotation Standards