Tools for Metagenomics with 16S/ITS and Whole Genome Shotgun Sequences

Post on 10-May-2015


Presented at the Cornell Symbiosis symposium. Workflows for processing amplicon-based 16S/ITS sequences as well as whole genome shotgun sequences are described. Slides include a short description and a link for each tool. DISCLAIMER: This is a small subset of the tools out there. No disrespect to methods not mentioned.


Computational Tools for Metagenomics

Surya Saha Twitter: @SahaSurya / LinkedIn: www.linkedin.com/in/suryasaha/

Magdalen Lindeberg Plant Pathology & Plant-Microbe Biology

Microbial Friends & Foes, Sep 25, 2012

Temperton, Current Opinion in Microbiology, 2012

Impact of Technology on Metagenomics

Types of “Meta” genomics

16S rRNA survey of bacterial microbiome

ITS survey of fungal microbiome

Bellemain, BMC Microbiology 2010 Slide: Julien Tremblay, JGI

Types of “Meta” genomics

Whole genome shotgun

• Varying complexity of microbial communities

• High coverage sequencing

• Sophisticated informatics

• Host associated metagenomes

– Deep sequencing of host meta-genome

– Bioinformatic screening of host sequences

• Environmental metagenomes

– E.g. soil samples

– Requires very high depth of coverage

– Complicated to assemble

Big picture!!

What users see

What users want!!

16S/ITS community surveys

• Multiple target regions in 16S gene and ITS region

• Comparison of results requires amplification of same region

• Advantages

– Fast survey of large communities

– Mature set of tools and statistics for analysis

– Good for first round survey

• 454 16S tags or pyrotags (~700 bp) have been the preferred method

• Illumina MiSeq runs (2x150 bp, 2x250 bp) are the next workhorses

• Depth of sampling

– 2,000-6,000 reads/sample for simple communities

– 20,000 reads/sample for complex soil metagenomes
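Because sampling depth differs between simple and complex communities, samples are often subsampled to a common depth before they are compared. The following is a minimal sketch of that idea, assuming reads are held in memory as a list; the sample names, read IDs and the `rarefy` helper are hypothetical and do not come from any tool in these slides.

```python
import random

def rarefy(reads, depth, seed=0):
    """Randomly subsample `depth` reads without replacement.

    Returns None when the sample has fewer reads than `depth`,
    so that shallow samples can be dropped explicitly.
    """
    if len(reads) < depth:
        return None
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    return rng.sample(reads, depth)

# Hypothetical samples with the depths mentioned on the slide
samples = {
    "simple_community": [f"read{i}" for i in range(6000)],
    "soil": [f"read{i}" for i in range(20000)],
}

# Subsample every sample down to the shallowest depth
target = min(len(reads) for reads in samples.values())
rarefied = {name: rarefy(reads, target) for name, reads in samples.items()}
```

Subsampling discards data from the deeper sample, so the target depth is a trade-off between keeping samples and keeping reads.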

16S/ITS issues

• Lack of tools for processing ITS/Fungal microbiome data sets

– RDP classifier targets only ITS

– No ITS reconstruction tools

• Amplification bias affects accuracy and replication

• Use of short reads prevents disambiguation of similar strains

• 16S or ITS may not differentiate between similar strains

– Clustering is done at 97%

– Regions may be >99% similar

• Sequencing error inflates number of OTUs

• Chloroplast 16S sequences can get amplified in plant metagenomes
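The point about 97% clustering versus >99% similarity can be made concrete with a small calculation. This is a toy sketch, assuming two equal-length, already aligned sequences; the sequences and the `percent_identity` helper are made up for illustration.

```python
def percent_identity(a, b):
    """Percent of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

# Two hypothetical strains that differ at a single position in a 200 bp region
strain_a = "ACGT" * 50
strain_b = "T" + strain_a[1:]

identity = percent_identity(strain_a, strain_b)
same_otu = identity >= 97.0  # OTU clustering threshold from the slide

print(f"identity = {identity:.1f}%, same OTU at 97%: {same_otu}")
```

Both strains fall into the same OTU here, which is why a marker gene survey alone may not separate closely related strains.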

16S/ITS sequence processing workflow

Filter for contaminants and low quality reads

Assemble overlapping reads

Reduce datasets (clustering)

Perform taxonomic classification and compute diversity metrics

16S/ITS sequence processing workflow: Filter for contaminants and low quality reads

• Quality plots and read trimming

– FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

– FASTX http://hannonlab.cshl.edu/fastx_toolkit/

• Chimera removal

– AmpliconNoise http://code.google.com/p/ampliconnoise/

– UCHIME http://www.drive5.com/uchime/
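As a rough illustration of what read trimming does, the sketch below removes low quality bases from the 3' end of a single read. It assumes Phred+33 encoded quality strings and is not the implementation used by FastQC or the FASTX toolkit; the function names, thresholds and the example read are hypothetical.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(c) - offset for c in quality_line]

def trim_3prime(seq, qual, min_q=20, min_len=5):
    """Strip trailing bases with quality below `min_q`.

    Returns (sequence, quality) or None if the trimmed read is
    shorter than `min_len`.
    """
    scores = phred_scores(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    if end < min_len:
        return None
    return seq[:end], qual[:end]

# Hypothetical read: 'I' encodes Q40 and '#' encodes Q2 in Phred+33
trimmed = trim_3prime("ACGTACGTAC", "IIIIIIII##")
print(trimmed)
```

Real trimmers typically offer more options, such as sliding windows or adapter removal, so this only shows the basic principle.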

Impact of Sequence Length

Slide: Feng Chen, JGI

16S/ITS sequence processing workflow: Assemble overlapping reads

• Merge overlapping paired end reads

– FLASH http://www.genomics.jhu.edu/software/FLASH/index.shtml

– FastqJoin http://code.google.com/p/ea-utils/wiki/FastqJoin

– CD-HIT read-linker http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit-auxtools-manual
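Merging a read pair amounts to finding where the end of the forward read overlaps the reverse complement of the reverse read. The sketch below shows that idea in its simplest form; it is an assumption-laden toy (no quality scores, longest acceptable overlap wins) and not the algorithm of FLASH, FastqJoin or CD-HIT. The parameter names and the example fragment are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def merge_pair(fwd, rev, min_overlap=10, max_mismatch_frac=0.1):
    """Merge a read pair into one fragment, or return None if no overlap is found."""
    rev_rc = reverse_complement(rev)
    # Try the longest possible overlap first, then shorter ones
    for olen in range(min(len(fwd), len(rev_rc)), min_overlap - 1, -1):
        tail, head = fwd[-olen:], rev_rc[:olen]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * olen:
            return fwd + rev_rc[olen:]
    return None

# Hypothetical 30 bp fragment sequenced from both ends with 20 bp reads
fragment = "ACGTTGCAAGGCTTAACCGGTTAGCATCGA"
fwd_read = fragment[:20]
rev_read = reverse_complement(fragment[10:])

merged = merge_pair(fwd_read, rev_read)
print(merged == fragment)
```

Dedicated mergers also use base qualities to resolve mismatches in the overlap, which this sketch leaves out.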

16S/ITS sequence processing workflow: Reduce datasets (clustering)

• Clustering with high stringency

– UCLUST/USEARCH (16S only) http://www.drive5.com/usearch/

– CD-HIT-OTU (16S only) http://weizhong-lab.ucsd.edu/cd-hit-otu/

– phylOTU (16S only) https://github.com/sharpton/PhylOTU
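High stringency clustering can be pictured as a greedy pass over the reads in which each read either joins an existing cluster or starts a new one. The sketch below is a simplified illustration under strong assumptions (equal-length, pre-aligned reads; first matching centroid wins); it is not the UCLUST, CD-HIT-OTU or phylOTU algorithm, and all names and sequences are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions; assumes equal-length, pre-aligned reads."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, threshold=0.97):
    """Assign each read to the first centroid it matches at `threshold` identity."""
    centroids = []  # one representative sequence per OTU
    clusters = []   # members of each OTU, parallel to `centroids`
    for seq in seqs:
        for i, centroid in enumerate(centroids):
            if len(centroid) == len(seq) and identity(seq, centroid) >= threshold:
                clusters[i].append(seq)
                break
        else:
            # No centroid was close enough, so this read seeds a new OTU
            centroids.append(seq)
            clusters.append([seq])
    return clusters

# Hypothetical 100 bp reads
base = "ACGT" * 25
near = base[:-1] + "A"            # 1 mismatch, 99% identical to base
far = "TTTTTTTT" + base[8:]       # 6 mismatches, 94% identical to base

otus = greedy_cluster([base, near, far])
print(len(otus))
```

The result of greedy clustering depends on input order, which is one reason tools usually sort reads by abundance or length first.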

16S/ITS sequence processing workflow: Perform taxonomic classification and compute diversity metrics

• Composition based classifiers

– RDP database + classifier http://rdp.cme.msu.edu/classifier/classifier.jsp

• Homology based classifiers

– ARB + Silva database (16S only) http://www.arb-home.de/

– GreenGenes database (16S only) http://greengenes.lbl.gov/cgi-bin/nph-index.cgi

– UNITE database (ITS only) http://unite.ut.ee/

– FungalITSPipeline (ITS only) http://www.emerencia.org/fungalitspipeline.html
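Composition based classification relies on the k-mer content of a read rather than on an alignment. The sketch below is a toy version of that idea: it assigns a query to the reference that shares the most k-mers with it. It is not the statistical model of the RDP classifier, and the reference sequences, taxon names and function names are hypothetical.

```python
def kmers(seq, k=4):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify(query, references, k=4):
    """Return (best_taxon, score), where score is the fraction of query k-mers found."""
    q = kmers(query, k)
    if not q:
        raise ValueError("query is shorter than k")
    scores = {
        taxon: len(q & kmers(ref, k)) / len(q)
        for taxon, ref in references.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical reference sequences, one per taxon
references = {
    "taxon_A": "ACGTACGTTAGCTAGCTAGG",
    "taxon_B": "TTTTGGGGCCCCAAAATTTT",
}

# The query is a fragment of the taxon_A reference
result = classify("ACGTTAGCTAG", references)
print(result)
```

A real classifier would train on many sequences per taxon and report a confidence estimate for each rank.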

QIIME

• http://www.qiime.org/

• Comprehensive suite of tools

– OTU picking

– Taxonomic classification

– Construction of phylogenetic trees

– Visualization

– Compute diversity statistics

• Available as Amazon EC2 image
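Two common diversity statistics are observed richness and the Shannon index, both computed from per-OTU read counts. The sketch below computes them in plain Python to show what the numbers mean; it does not use QIIME, and the OTU tables are made-up examples.

```python
import math

def observed_richness(otu_counts):
    """Number of OTUs with at least one read."""
    return sum(1 for n in otu_counts.values() if n > 0)

def shannon_index(otu_counts):
    """Shannon diversity H = -sum(p * ln p) over the OTU proportions."""
    total = sum(otu_counts.values())
    return -sum(
        (n / total) * math.log(n / total)
        for n in otu_counts.values()
        if n > 0
    )

# Hypothetical samples: same richness, very different evenness
even_sample = {"OTU1": 25, "OTU2": 25, "OTU3": 25, "OTU4": 25}
skewed_sample = {"OTU1": 97, "OTU2": 1, "OTU3": 1, "OTU4": 1}

for name, counts in [("even", even_sample), ("skewed", skewed_sample)]:
    print(name, observed_richness(counts), round(shannon_index(counts), 3))
```

Both samples contain four OTUs, but the Shannon index is much lower for the skewed one because a single OTU dominates.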

Whole Genome Shotgun (WGS) Metagenomics

• Better classification with increasing number of complete genomes

• Focus on whole genome based phylogeny (whole genome phylotyping)

• Advantages

– No amplification bias like in 16S/ITS

• Issues

– Poor sampling of fungal diversity

– Assembly of metagenomes is complicated due to uneven coverage

– Requires high depth of coverage

WGS sequence processing workflow

Filter for low quality reads

Assemble reads

Perform taxonomic classification and compute diversity metrics

WGS sequence processing workflow: Filter for low quality reads

• Quality plots and read trimming

– FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

– FASTX http://hannonlab.cshl.edu/fastx_toolkit/

WGS sequence processing workflow: Assemble reads

• NGS assembly with uneven depth

– IDBA-UD http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/

– MIRA http://www.chevreux.org/projects_mira.html

– Velvet / MetaVelvet http://www.ebi.ac.uk/~zerbino/velvet/ http://metavelvet.dna.bio.keio.ac.jp/

WGS sequence processing workflow: Perform taxonomic classification and compute diversity metrics

• Hybrid composition/homology based classifiers

– FCP http://kiwi.cs.dal.ca/Software/FCP

– Phymm/PhymmBL http://www.cbcb.umd.edu/software/phymm/

– AMPHORA2 http://wolbachia.biology.virginia.edu/WuLab/Software.html

– NBC http://nbc.ece.drexel.edu/

– MEGAN http://ab.inf.uni-tuebingen.de/software/megan/
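When a shotgun read has homology hits to several reference genomes, one conservative way to classify it is to report the lowest common ancestor (LCA) of those hits, an idea that MEGAN is known for. The sketch below shows only the LCA step on root-to-leaf lineages; the hit lists are hypothetical, the lineages are simplified, and no real tool's scoring or filtering is reproduced.

```python
def lowest_common_ancestor(lineages):
    """Longest shared prefix of several root-to-leaf taxonomic lineages."""
    if not lineages:
        return []
    lca = []
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            lca.append(ranks[0])
        else:
            break
    return lca

# Hypothetical hits for a single read (simplified lineages)
hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Escherichia"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Salmonella"],
]

assignment = lowest_common_ancestor(hits)
print(assignment[-1])
```

The read is assigned at class level because the two hits disagree at genus level; a read with a single unambiguous hit would keep its full lineage.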

WGS sequence processing workflow: Perform taxonomic classification and compute diversity metrics

• Web based classifiers

– MG-RAST http://metagenomics.anl.gov/

– CAMERA http://camera.calit2.net/

– IMG/M http://img.jgi.doe.gov/cgi-bin/m/main.cgi

MetaPhlAn

• Unique clade-specific markers for sequenced bacteria and archaea

• 400 genera / 4000 genomes including HMP genomes

• Species level resolution

• MetaPhlAn 2 in the works

– Eukaryotes including Fungi

– Viruses

– Higher coverage of archaea

• Krona and GraPhlAn for visualization of output

• Websites

– https://bitbucket.org/nsegata/metaphlan

– http://huttenhower.sph.harvard.edu/metaphlan

PhyloSift/pplacer

• Reference database of marker genes

• Places reads on tree of life based on homology to reference protein

• Integration with metAMOS for pre-assembling next-generation datasets

• Bacterial and Archaeal classification only

• Plant and Fungi marker genes are being added

• Websites

– http://phylosift.wordpress.com/

– https://github.com/gjospin/PhyloSift

Real cost of Sequencing!!

Sboner, Genome Biology, 2011

Acknowledgements

Funding

Magdalen Lindeberg Cornell University

Dave Schneider USDA-ARS, Ithaca

Citrus greening / Wolbachia (wACP)

Thank you!

Surya Saha ss2489@cornell.edu

Suggestions

• Plan informatics workflow as early as possible

• Incorporate statistics at different stages in the workflow