Post on 23-Dec-2015
Center for Genomic Epidemiology
Aim:
• To provide the scientific foundation for future internet-based solutions, where a central database will enable simplification of total genome sequence information and comparison to all other sequenced isolates, including spatial-temporal analysis.
• To develop algorithms for rapid analyses of whole genome DNA sequences, tools for analysis and extraction of information from the sequence data, and internet/web interfaces for using the tools in the global scientific and medical community.
Tools for species identification
Name of Service | Description | URL (cge.cbs.dtu.dk/services/) | Status | Publication
SpeciesFinder | Species identification using 16S rRNA | SpeciesFinder | Online | Published Feb 2014, PMID: 24574292
KmerFinder | Species identification using overlapping 16mers | KmerFinder | Online | Published Jan 2014, PMID: 24172157
TaxonomyFinder | Taxonomy identification using functional protein domains | TaxonomyFinder | | Published in PMID: 24574292 + Oksana's PhD thesis
Reads2Type | Species identification on client computer | Reads2Type | Online | Published Feb 2014, PMID: 24574292
Training data: 1,647 completed / almost completed genomes downloaded from NCBI in 2011 (1,009 different species)
Evaluation data:
NCBI draft genomes
• 695 isolates from species that overlap with training set (151 species)
SRA draft genomes
• 10,407 sets of short reads from Illumina (168 species)
• 10,407 draft genomes from Illumina data (168 species)
16S rRNA
• 16S rRNA sequencing has dominated molecular taxonomy of prokaryotes for more than 30 years (Fox et al, Int. J. Syst. Bacteriol., 1977)
• Tremendous amounts of 16S rRNA sequence data are available in databases
Concerns:
• Low resolution
• Some genomes contain several copies of the 16S rRNA gene with inter-gene variation
• The 16S rRNA gene represents only about 0.1% of the coding part of a microbial genome
Reference database
• 16S rRNA genes are isolated from genomes in training data using RNAmmer (Lagesen, NAR, 2007).
Method
• Input genomes are BLASTed against 16S rRNA genes in reference database.
• Best hit is selected based on a combination of coverage, % identity, bitscore, number of mismatches and number of gaps in the alignments.
CGE implementation of 16S species identification
SpeciesFinder
KmerFinder
• Genomes in training data are chopped into 16mers:
A T G A C G T A T G A T T G A T G A C G T A G T A G T C C
• Immune system inspired downsampling (figure: MHC-I, 9mer)
• Only 16mers with specific prefix are kept
Unknown isolate, unique 16mers:
ATGAATGTGTGAGTGA
ATGACTGTGCCCCTGA
ATGAAAAAAAAAAAA
16mer database:
ATGAATGTGTGAGTGA: CP001921 (Acinetobacter baumannii), CP000521 (Acinetobacter baumannii), CP002522 (Acinetobacter baumannii)
ATGACTGTGCCCCTGA: CP001921 (Acinetobacter baumannii), CP002301 (Buchnera aphidicola)
Species | Match | No. of Kmer hits
Acinetobacter baumannii | CP001921 | 2
Acinetobacter baumannii | CP000521 | 1
Acinetobacter baumannii | CP002521 | 1
Buchnera aphidicola | CP002301 | 1
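The KmerFinder idea above (chop into 16mers, keep only those with a given prefix, count hits per reference) can be sketched in a few lines of Python. This is a minimal illustration, not the CGE implementation: the function names are hypothetical, the prefix "ATGA" is an assumption read off the example 16mers on the slide, and the small database holds only the two entries shown there.

```python
from collections import Counter

def kmers_with_prefix(seq, k=16, prefix="ATGA"):
    """Return the unique k-mers of seq that start with the given prefix.

    The prefix value is an assumption taken from the slide example.
    """
    seq = seq.upper()
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if seq.startswith(prefix, i)
    }

def count_hits(query_kmers, database):
    """Count, per reference accession, how many query k-mers it contains."""
    hits = Counter()
    for kmer in query_kmers:
        for accession in database.get(kmer, ()):
            hits[accession] += 1
    return hits

# The two database entries shown on the slide.
database = {
    "ATGAATGTGTGAGTGA": ["CP001921", "CP000521", "CP002522"],
    "ATGACTGTGCCCCTGA": ["CP001921", "CP002301"],
}

# Unique k-mers of the unknown isolate, copied from the slide.
query = {"ATGAATGTGTGAGTGA", "ATGACTGTGCCCCTGA", "ATGAAAAAAAAAAAA"}

hits = count_hits(query, database)
best_accession, best_count = hits.most_common(1)[0]
print(best_accession, best_count)
```

The reference with the most shared k-mers is reported as the prediction; the third query k-mer is absent from the database and contributes no hits.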
KmerFinder is very robust: it only needs one 16mer!
Desulfovibrio piger GOR1 SRR097356
>NODE 4 length 92 cov 23.119566
TAGGACGTGGAATATGGCAAGAAAACTGAAAATCATGGAAAATGAGAAACATCCACTTGACGACTTGAAAAATGACGAAATCACTAAAAAACGTGAAAAATGAGAAATGC
>NODE 15 length 82 cov 2.792683
AGCGAAAAATGTCATAACAACGATCACGACCGATAACCATCTTTGGTCCAAACTTACTCACGCAGCAGGCGTATAACTCGCGCATACCAGCTTTGGGCAT
N50 = 110
Total no. of bp: 210
Prediction:
Species | Match | No. of Kmer hits
Flavobacterium psychrophilum | AM398681 | 1
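The assembly statistics quoted for this example (N50 = 110, 210 bp in total) can be reproduced with a short N50 calculation. This is a sketch under an assumption: the two contig lengths, 110 and 100 bp, are inferred from the reported total and N50 rather than stated on the slide, and the function name is hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    Contigs are taken from longest to shortest; the N50 is the length of
    the contig at which the running total reaches half of the assembly.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("at least one contig is required")

# Assumed contig lengths, inferred from the slide's totals.
contig_lengths = [110, 100]
print(sum(contig_lengths), n50(contig_lengths))
```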
Reads2Type
• Reads2Type pushes the analysis to the user's computer; the server provides the 50-mer database
• SuffixTree: efficient data structure for string matching
• Narrow Down Approach:
– Reads2Type compares 50-mers of combined marker genes against raw reads
– Shared Probes vs Unique Probe
• Definition: Quick & dirty taxonomy identification of single isolates
• 50-mer marker gene DB
– 16S rRNA: Training data genomes RNAmmer (other)
– ITS: Training data (Mycobacterium)
– GyrB: Training data (Enterobacteriaceae)
– Resulting database ~5 MB
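The narrow-down idea (shared probes restrict the candidates to a group, unique probes pin down one species) can be illustrated as follows. This is a simplified sketch and not the Reads2Type code: it uses a plain dictionary lookup instead of a suffix tree, the function names and toy sequences are hypothetical, and the example runs with k = 4 instead of 50 only to keep the data readable.

```python
from collections import defaultdict

def build_probe_index(marker_genes, k=50):
    """Map each k-mer probe to the set of species whose marker genes contain it."""
    index = defaultdict(set)
    for species, genes in marker_genes.items():
        for gene in genes:
            for i in range(len(gene) - k + 1):
                index[gene[i:i + k]].add(species)
    return index

def narrow_down(read, index, k=50):
    """Intersect the species sets of every probe found in the read.

    A shared probe keeps several candidates; a unique probe leaves one.
    """
    candidates = None
    for i in range(len(read) - k + 1):
        species = index.get(read[i:i + k])
        if species is None:
            continue
        candidates = set(species) if candidates is None else candidates & species
    return candidates or set()

# Toy marker genes (hypothetical data).
marker_genes = {
    "species_A": ["ACGTACGA"],
    "species_B": ["ACGTTTGA"],
}
index = build_probe_index(marker_genes, k=4)

print(narrow_down("ACGT", index, k=4))   # shared probe only
print(narrow_down("ACGTA", index, k=4))  # shared probe plus a unique probe
```

A read that hits only the shared probe leaves both species as candidates, while a read that also contains a unique probe narrows the result to a single species.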
rMLST
CGE implementation
• For each genome in the training data the 53 ribosomal genes were extracted.
• Genomes in evaluation sets were aligned using BLAT to each gene collection (only hits with at least 95% identity and 95% coverage were considered as a potential match).
• The closest match among the training genomes was selected based on a combination of coverage, % identity, bitscore, number of mismatches and number of gaps in the alignments across all genes.
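The filtering and selection steps above might look like the sketch below. The 95% identity and 95% coverage thresholds come from the slide; everything else is an assumption made for illustration, in particular the ranking rule (most genes matched, then highest summed bitscore), the hit fields, the function name and the example data. The slide does not specify how the alignment statistics are combined.

```python
from collections import defaultdict

MIN_IDENTITY = 95.0  # percent, from the slide
MIN_COVERAGE = 95.0  # percent, from the slide

def closest_genome(hits):
    """Pick the training genome that best matches across all genes.

    hits: iterable of dicts with keys genome, gene, identity, coverage, bitscore.
    Ranking by (genes matched, summed bitscore) is an assumed rule.
    """
    best = {}  # (genome, gene) -> best bitscore among hits passing the thresholds
    for hit in hits:
        if hit["identity"] < MIN_IDENTITY or hit["coverage"] < MIN_COVERAGE:
            continue
        key = (hit["genome"], hit["gene"])
        best[key] = max(best.get(key, 0.0), hit["bitscore"])

    totals = defaultdict(lambda: [0, 0.0])  # genome -> [genes matched, summed bitscore]
    for (genome, _gene), score in best.items():
        totals[genome][0] += 1
        totals[genome][1] += score

    if not totals:
        return None
    return max(totals, key=lambda genome: tuple(totals[genome]))

# Hypothetical hits for two ribosomal genes against two training genomes.
example_hits = [
    {"genome": "genome_A", "gene": "rpsA", "identity": 99.0, "coverage": 100.0, "bitscore": 800.0},
    {"genome": "genome_A", "gene": "rplB", "identity": 98.0, "coverage": 97.0, "bitscore": 500.0},
    {"genome": "genome_B", "gene": "rpsA", "identity": 96.0, "coverage": 100.0, "bitscore": 700.0},
    {"genome": "genome_B", "gene": "rplB", "identity": 90.0, "coverage": 100.0, "bitscore": 450.0},
]
print(closest_genome(example_hits))
```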
Jolley KA, Bliss CM, Bennett JS, Bratcher HB, Brehony C, Colles FM, Wimalarathna H, Harrison OB, Sheppard SK, Cody AJ, Maiden MC. Ribosomal multilocus sequence typing: universal characterization of bacteria from domain to strain. Microbiology. 2012 Apr;158(Pt 4):1005-15.
Isolates in the NCBIdrafts set for which all four methods predict the species to be different from the annotated one. * NZAEPO00000000 has been re-annotated as S. oralis since we downloaded the data.
Speed
Method | Estimated speed (mm:ss)
16S | 00:13*
KmerFinder | 00:09*
TaxonomyFinder | 11:33*
rMLST | 00:45*
Reads2Type | 00:55**
* Estimation based on draft genomes
** Estimation based on short reads
Summary of taxonomy benchmark study
• KmerFinder had the highest accuracy and was the fastest method.
• SpeciesFinder (16S rRNA-based) had the lowest accuracy.
• Methods that only sample genomic loci (16S, Reads2Type, rMLST) had difficulties distinguishing species that only recently diverged, especially when the main difference is a plasmid.
Tools for further typing
Name of Service | Description | URL (https://cge.cbs.dtu.dk/services/) | Publication
MLST | Multilocus sequence typing | MLST | Published Apr 2012, PMID: 22238442
PlasmidFinder | Identification of plasmids in Enterobacteriaceae | PlasmidFinder | Published Apr 2014, PMID: 24777092
pMLST | pMLST of plasmids in Enterobacteriaceae | pMLST | Published Apr 2014, PMID: 24777092
Multilocus Sequence Typing (MLST)
First developed in 1998 for Neisseria meningitidis (Maiden et al. PNAS 1998. 95:3140-3145)
The nucleotide sequences of internal regions of approximately 7 housekeeping genes are determined by PCR followed by Sanger sequencing
Each distinct allele is assigned an arbitrary number
The unique combination of alleles is the sequence type (ST)
www.cbs.dtu.dk/services/MLST
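The step from allele numbers to a sequence type is a table lookup, sketched below. The locus names, allele numbers and ST values are entirely hypothetical and do not come from any real MLST scheme; the function name is also an assumption.

```python
# Hypothetical 7-locus scheme: locus order and profile table are made up.
LOCI = ("locus1", "locus2", "locus3", "locus4", "locus5", "locus6", "locus7")

PROFILES = {
    (1, 1, 1, 1, 1, 1, 1): 1,
    (1, 1, 2, 1, 1, 3, 1): 2,
    (4, 1, 2, 1, 5, 3, 1): 3,
}

def sequence_type(alleles):
    """Return the ST for a dict of locus -> allele number, or None if unknown."""
    profile = tuple(alleles[locus] for locus in LOCI)
    return PROFILES.get(profile)

isolate = {"locus1": 1, "locus2": 1, "locus3": 2, "locus4": 1,
           "locus5": 1, "locus6": 3, "locus7": 1}
print(sequence_type(isolate))
```

A profile that is not in the table returns None, which corresponds to a novel allele combination.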
• Assembled genome
• 454 – single end reads
• 454 – paired end reads
• Illumina – single end reads
• Illumina – paired end reads
• Ion Torrent
• SOLiD – single end reads
• SOLiD – mate pair reads
Acinetobacter baumannii #1, Acinetobacter baumannii #2, Arcobacter, Borrelia burgdorferi, Bacillus cereus, Brachyspira hyodysenteriae, Bifidobacterium, Brachyspira intermedia, Bordetella, Burkholderia pseudomallei, Brachyspira, Burkholderia cepacia complex, Campylobacter jejuni, Clostridium botulinum, Clostridium difficile #1, Clostridium difficile #2, Campylobacter helveticus, Campylobacter insulaenigrae, Clostridium septicum, C. diphtheriae, Campylobacter fetus, Chlamydiales,
Campylobacter lari, Cronobacter, C. upsaliensis, Escherichia coli #1, Escherichia coli #2, Enterococcus faecalis, Enterococcus faecium, F. psychrophilum, Haemophilus influenzae, Haemophilus parasuis, Helicobacter pylori, Klebsiella pneumoniae, Lactobacillus casei, Lactococcus lactis, Leptospira, Listeria, Listeria monocytogenes, Moraxella catarrhalis, Mannheimia haemolytica, Neisseria, P. gingivalis, P. acnes,
Pseudomonas aeruginosa, Pasteurella multocida, Pasteurella multocida, Staphylococcus aureus, Streptococcus agalactiae, Salmonella enterica, Staphylococcus epidermidis, S. maltophilia, Streptococcus pneumoniae, Streptococcus oralis, S. zooepidemicus, Streptococcus pyogenes, Streptococcus suis, Streptococcus thermophilus, Streptomyces, Streptococcus uberis, Vibrio parahaemolyticus, Vibrio vulnificus, Wolbachia, Xylella fastidiosa, Y. pseudotuberculosis
Extended Output
aro: WARNING, Identity: 100%, HSP/Length: 349/498, Gaps: 0, aro_122 is the best match for aro
Tools for phenotyping
Name of Service | Description | URL (https://cge.cbs.dtu.dk/services/) | Publication
ResFinder | Identification of acquired antibiotic resistance genes | ResFinder | Published Nov 2012, PMID: 22782487
VirulenceFinder | Identification of virulence genes in E. coli (and S. aureus and Enterococcus) | VirulenceFinder | E. coli published Feb 2014, PMID: 24574290
MyDbFinder | Identification of genes from the user's own database | MyDbFinder | Will be published in book chapter
PathogenFinder | Prediction of pathogenic potential | PathogenFinder | Published Oct 2013, PMID: 24204795
ResFinder
[Figure: ResFinder workflow. Inputs: NGS reads (Illumina, Ion Torrent, 454, ...) via the assembly pipeline; Sanger; FASTA. Analysis: ResFinder (BLAST). Output: resistance gene profile (list of genes, accession numbers, theoretical resistance phenotype).]
200 isolates from 4 different species (Salmonella Typhimurium, Escherichia coli, Enterococcus faecalis and Enterococcus faecium)
ResFinder, 98% ID, 60% length coverage
Phenotypic tests, 3,051 in total
• 482 Resistant
• 2,569 Susceptible
=> 99.74% of the results were in agreement between ResFinder and the phenotypic tests
23 discrepancies -> 16, typically in relation to spectinomycin in E. coli
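The thresholds used in this evaluation (98% identity, 60% length coverage) amount to a simple filter over alignment hits, sketched below. The tuple layout, function names, gene names and numbers are hypothetical, and length coverage is assumed here to mean alignment length divided by reference gene length; the real ResFinder output format is not reproduced.

```python
def length_coverage(hsp_length, gene_length):
    """Percentage of the reference gene covered by the alignment (assumed definition)."""
    return 100.0 * hsp_length / gene_length

def filter_hits(hits, min_identity=98.0, min_coverage=60.0):
    """Keep the names of hits meeting both thresholds.

    hits: iterable of (gene_name, percent_identity, hsp_length, gene_length).
    """
    kept = []
    for name, identity, hsp_length, gene_length in hits:
        if identity >= min_identity and length_coverage(hsp_length, gene_length) >= min_coverage:
            kept.append(name)
    return kept

# Hypothetical hits: one passes, one fails on identity, one fails on coverage.
hits = [
    ("gene_a", 99.5, 800, 861),
    ("gene_b", 97.0, 861, 861),
    ("gene_c", 100.0, 300, 861),
]
print(filter_hits(hits))
```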
Unpublished or uncategorized
Name of Service | Description | URL (https://cge.cbs.dtu.dk/services/) | Status | Publication
PanFunPro | Groups homologous proteins based on functional domain content | PanFunPro | Online | Published in F1000Research 2013, 2:265
SerotypeFinder | Identification of serotypes | SerotypeFinder-1.0 | Online | Not yet published
Restriction-ModificationFinder | Identification of RM system genes | Restriction-ModificationFinder | Online | Will only be published in book chapter
HostPhinder | Prediction of the host of a bacteriophage | HostPhinder | Online, but under development | Not yet published
MetaVirFinder | Identification of viruses in metagenomic data | MetaVirFinder | Online, but under development | Not yet published
MGmapper | Identifies the content of metagenomic samples | MGmapper | Online, but under development | Not yet published
Tools for phylogeny
Name of Service | Description | URL (cge.cbs.dtu.dk/services) | Status | Publication
SnpTree | Creation of phylogenetic trees based on SNPs | snpTree | Online | Published Dec 2012, PMID: 23281601
CSIPhylogeny | Creation of phylogenetic trees based on SNPs | CSIPhylogeny | Online | Planned
NDtree | Creation of phylogenetic trees | NDtree | Online | Published Feb 2014, PMID: 24505344
Type of data uploaded to MLST web-service
• 454, single reads
• 454, paired-end
• Ion Torrent
• Illumina, single reads
• Illumina, paired-end reads
• Assembled draft genomes