16S classifier

16S Classifier: a tool for fast and accurate classification of 16S rRNA sequences

Ashok K. Sharma, Research Scholar

Metagenomics and Systems Biology Laboratory, Indian Institute of Science Education and Research, Bhopal


Overview

Species diversity

[Figure: genera in an example metagenome (Arcobacter, Paludibacter, Shewanella, Pseudomonas, unknown), illustrating species richness]

Our knowledge of the microbial diversity of soil and of extreme environments is still limited

Only 1-3% of soil microbes are culturable

1 g of soil is estimated to contain 4,000-5,000 different bacterial genomic units

Bacteria and fungi play an important role in biogeochemical cycles, and especially in human health

Species diversity consists of: 1. species richness, 2. total number of species, and 3. distribution of species

Methods of studying microbial diversity

Biochemical

Plate count

Community-level physiological profiling (CLPP)

Fatty acid methyl ester (FAME) analysis: fatty acids make up a constant proportion of cell biomass

Molecular

G+C content

Nucleic acid re-association and hybridization

DNA microarray

DNA cloning and sequencing-based methods

Plate count is fast and cost-effective, but it cannot detect unculturable microbes and is biased towards fast-growing organisms and fungal species

CLPP is fast, highly reproducible and inexpensive and generates a large amount of data, but it represents only the culturable community and favours fast-growing organisms

FAME needs no culturing because fatty acids are extracted directly from soil, but the results are affected by external factors

Metagenomic reads vs 16S rRNA for microbial diversity identification

Metagenome

DNA isolation

Route 1: fragmentation of DNA, metagenomic reads, microbial diversity (tools: Kraken, PhyloPythiaS, Phymm, PhymmBL, MetaBin)

Route 2: amplification of 16S rRNA, 16S rRNA from multiple species, microbial diversity

16S rRNA: a gold standard for microbial molecular identification

Universal

Highly conserved

Long enough (~1500 bp) to provide significant discrimination between many species

Structural information can guide alignment and phylogenetic reconstruction

Many species are now represented in the database

16S rRNA gene sequencing

Earlier: by sequencing the whole gene

Now: by sequencing short variable regions

Limitations: insufficient and underestimated diversity

16S rRNA gene

16S rRNA: to understand microbial diversity

Community composition shifts over time as revealed by 16S data

Software and tools available for the analysis of 16S rRNA data

CloVR-16S

QIIME: a Python-based workflow package that allows sequence processing and phylogenetic analysis using different methods, including the phylogenetic distance metric UniFrac, UCLUST, PyNAST and the RDP Bayesian classifier

Mothur: a C++-based software package for 16S analysis

Metastats and custom R scripts, used to generate additional statistical and graphical evaluations

Most recent: 16S Classifier, a Random Forest-based standalone package designed especially for short hypervariable regions

Material and methods

Greengenes database

Random Forest

EMBOSS

RDP Classifier

BLAST

Input data for training

In 16S Classifier, we built separate models for different hypervariable regions (HVRs) of the 16S rRNA gene

Took the Greengenes 16S rRNA database

Extracted individual HVRs, as well as combinations of two or more commonly used HVRs, using commonly used universal primers with the help of in-house Perl scripts and the EMBOSS software suite

Discarded HVRs for which primer coverage was less than 50% of all sequences

Clustered out highly similar sequences using CD-HIT at a threshold of 1
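The primer-based extraction and the 50% coverage filter can be illustrated with a minimal Python sketch. This is not the in-house Perl/EMBOSS pipeline: it assumes exact matching of degenerate (IUPAC) primers, and the function names and the toy primers in the usage note are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also covers the degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def primer_regex(primer):
    """Compile a degenerate primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def reverse_complement(seq):
    """Return the reverse complement of a (possibly degenerate) sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def extract_region(seq, fwd_primer, rev_primer):
    """Return the sequence between the two primers, or None if either is absent.

    Assumption: the reverse primer is written 5'->3' on the reverse strand,
    so its reverse complement is searched on the forward strand.
    """
    seq = seq.upper()
    fwd = primer_regex(fwd_primer).search(seq)
    if fwd is None:
        return None
    rev = primer_regex(reverse_complement(rev_primer)).search(seq, fwd.end())
    if rev is None:
        return None
    return seq[fwd.end():rev.start()]


def primer_coverage(seqs, fwd_primer, rev_primer):
    """Fraction of sequences in which both primers are found."""
    hits = sum(extract_region(s, fwd_primer, rev_primer) is not None for s in seqs)
    return hits / len(seqs)
```

For example, with the toy primers "ACGW" and "GAATY", a region would be kept only if `primer_coverage(sequences, "ACGW", "GAATY")` is at least 0.5, mirroring the 50% cut-off described above.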

Table 1. Summary of the number of HVR sequences which were used for the training and testing of RF*.

Parameter optimization

Labeled each sequence with its taxonomic information down to the lowest known level, excluding species

Used the V3 region to optimize the parameters

Calculated 2-mer, 3-mer, 4-mer, 5-mer and 6-mer nucleotide frequencies and tried each as feature input

Tried various mtry values at each k-mer size to find the lowest out-of-bag (OOB) error

Got the best results at k = 4, so 4-mer nucleotide frequencies were used to build the models at ntree = 1000
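The k-mer feature step can be sketched in Python as follows. This is an illustration only, not the package's own code: normalizing counts to frequencies that sum to 1 and skipping windows with ambiguous bases are assumptions, and the function name is hypothetical.

```python
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return k-mer frequencies as a fixed-order vector of length 4**k.

    Overlapping windows are counted; windows containing anything other
    than A, C, G or T are skipped (an assumption of this sketch).
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]
```

With k = 4 each sequence becomes a vector of 256 values, one per possible 4-mer, which is the kind of fixed-length input a Random Forest needs regardless of read length.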

Figure 1. Optimization of parameters using hypervariable region V3

Variables selection

ntree optimization

OOB Error for Different HVRs

Input data for testing

The first test dataset was obtained by randomly extracting ~10% of the sequences that had earlier been clustered out using CD-HIT. Random mutations (1%) were inserted into these sequences to mimic real-life sequencing errors

The second dataset was obtained from real metagenomic sequences available in the NCBI SRA

The performance of 16S Classifier was compared with that of RDP Classifier in terms of accuracy and computation time
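The 1% random mutation step used for the first test dataset can be sketched in Python. The slides do not say which kinds of mutation were introduced, so this sketch assumes substitutions only; the function name and parameters are hypothetical.

```python
import random


def mutate(seq, rate=0.01, rng=None):
    """Replace each A/C/G/T base with a different base with probability `rate`.

    Substitution-only errors are an assumption; insertions and deletions
    are not modelled, so the sequence length is preserved.
    """
    rng = rng or random.Random()
    mutated = []
    for base in seq.upper():
        if base in "ACGT" and rng.random() < rate:
            mutated.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            mutated.append(base)
    return "".join(mutated)
```

Passing a seeded generator, for example `mutate(seq, 0.01, random.Random(42))`, makes the simulated errors reproducible between runs.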

Performance of different RF models on different HVRs and the complete 16S rRNA gene

Performance of RF models on the first test dataset

Comparison of 16S Classifier with RDP Classifier on real datasets

Advantages of 16S Classifier

Extremely fast

High sensitivity as well as specificity

Consistent across various HVRs

Easy availability

Easy to deploy and use

How to use

Download the zip file for a particular hypervariable region, or for the complete 16S gene, freely available at http://metagenomics.iiserb.ac.in/16Sclassifier/download.html

Extract the zip file, which contains a model file (*.Rdata), a script file (*.sh) and an executable file (16Sclassifier.exe)

Other dependencies:

Install R from http://cran.r-project.org/

Install the randomForest package in R

## Command line usage ##

./16sclassifier.exe

The query file should be in FASTA format, and the model name can be one of v2, v3, v4, v5, v6, v7, v8, v23, v34, v35, v45, v56, v67, v78 or complete.
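Before running the classifier, the two inputs can be checked with a short Python sketch. It is not part of the 16S Classifier package: the helper names are hypothetical, and only the model names are taken from the list above.

```python
# Model names as listed on the slide.
MODELS = {
    "v2", "v3", "v4", "v5", "v6", "v7", "v8",
    "v23", "v34", "v35", "v45", "v56", "v67", "v78", "complete",
}


def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_inputs(fasta_text, model):
    """Validate the model name and the query text; return the parsed records."""
    if model not in MODELS:
        raise ValueError(f"unknown model name: {model!r}")
    records = read_fasta(fasta_text)
    if not records:
        raise ValueError("query contains no FASTA records")
    return records
```

A query file could be checked with `check_inputs(open(path).read(), "v3")`, where `path` is whatever file is about to be passed to the classifier.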

Thank You