Post on 21-Jan-2018
João André Carriço
Microbiology Institute and Mramirez Lab, Instituto de Medicina Molecular, Faculty of Medicine, University of Lisbon
jcarrico@fm.ul.pt
twitter: @jacarrico
Bioinformatics Open Days 2017
Braga, 22 February 2017
John Snow
English physician 1813-1858
Total: 616 dead
31 Aug - 10 Sep: 500 dead
“the study of what is upon the people”
the branch of medicine which deals with the incidence, distribution, and possible control of diseases and other factors relating to health.
It is the cornerstone of public health, and shapes policy decisions and evidence-based practice by identifying risk factors for disease and targets for preventive healthcare
“shoe-leather epidemiology” and lots of statistics
EBOLA West African Ebola virus epidemic
2011 Germany E. coli O104:H4 outbreak:
bloody diarrhea accompanied by hemolytic-uremic syndrome (HUS)
Genomic sequencing by BGI Shenzhen confirmed a 2001 finding that the O104:H4 serotype has some enteroaggregative E. coli (EAEC or EAggEC) properties, presumably acquired by horizontal gene transfer
On 8 June, the EU's E. coli O104:H4 outbreak was estimated to have cost ~2,690,000,000 EUR in human losses (such as sick leave), not counting material losses (such as dumped cucumbers: ~240M EUR in Spain alone)
Crowdsourcing event started at ABPHM2011
Smith, K. F. et al. Global rise in human infectious disease
outbreaks. Journal of The Royal Society Interface 11,
20140950–20140950 (2014).
Flight paths across North America.
Outbreaks follow flight paths more closely than
simple geographic distance.
Outbreaks have costs in mortality, mobility and
other direct economic impacts (product recall)
Fast intervention can save lives and money
Slide credit: Fiona Brinkman
Didelot, X., Bowden, R., Wilson, D. J., Peto, T. E. A. & Crook, D. W. Transforming clinical microbiology with bacterial genome sequencing. Nat Rev Genet 13, 601–612 (2012).
“Not all bacteria are born equal”
discriminating strains within a species/subspecies
Gel based: Pulsed-Field Gel Electrophoresis (PFGE), RAPD, AFLP
Phenotypic based: Colony morphology/color, Antibiogram, Serotype
Sequence based: MultiLocus Sequence Typing (MLST), emm typing, spa typing
ST aroE gdh gki recP spi xpt ddl
156 7 11 10 1 6 8 1
Bacterial Population Genetics
Pathogenesis and Natural History of Infection
Surveillance of Infectious Diseases
Outbreak Investigation and Control
S. pneumoniae housekeeping genes: aroE, gdh, gki, recP, spi, xpt, ddl
PCR and Sanger sequencing of the 7 loci -> 7 sequences
Retrieve alleles and ST: http://pubmlst.org/spneumoniae/
ST aroE gdh gki recP spi xpt ddl
156 7 11 10 1 6 8 1
Result: ST 156
Greatest advantages:
- Sequence reduced to allele ID
- Portable
- Easy to infer relationship
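The reduction of sequences to allele IDs and then to an ST can be sketched in a few lines. This is a minimal illustration, not the PubMLST implementation: the allele sequences below are made-up placeholders, and only the ST 156 profile is taken from the slide.

```python
from typing import Dict, Optional, Tuple

LOCI = ("aroE", "gdh", "gki", "recP", "spi", "xpt", "ddl")

# Toy allele database: locus -> {sequence: allele ID}.
# The sequences are invented placeholders; real alleles are ~400-500 bp.
ALLELES: Dict[str, Dict[str, int]] = {
    "aroE": {"ACGTAA": 7},
    "gdh": {"ACGTAC": 11},
    "gki": {"ACGTAG": 10},
    "recP": {"ACGTAT": 1},
    "spi": {"ACGTCA": 6},
    "xpt": {"ACGTCC": 8},
    "ddl": {"ACGTCG": 1},
}

# Profile table: allele IDs in LOCI order -> sequence type (from the slide).
PROFILES: Dict[Tuple[int, ...], int] = {(7, 11, 10, 1, 6, 8, 1): 156}


def call_alleles(seqs: Dict[str, str]) -> Tuple[Optional[int], ...]:
    """Map each locus sequence to its allele ID (None if unknown)."""
    return tuple(ALLELES[locus].get(seqs[locus]) for locus in LOCI)


def assign_st(profile: Tuple[Optional[int], ...]) -> Optional[int]:
    """Return the ST for a complete, known profile, else None."""
    if None in profile:
        return None
    return PROFILES.get(profile)


isolate = {locus: next(iter(ALLELES[locus])) for locus in LOCI}
profile = call_alleles(isolate)
print(profile, "-> ST", assign_st(profile))
```

Because the profile is just a tuple of integers, it is portable between labs and cheap to compare, which is the point made on the slide.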
Nomenclature
Enterococcus faecium isolates by source: Clinical, Animal, Community, Hospital, Surv/Outb, NA
The Internet
Sequence-based Information
But only 7 target loci…
HiSeq 2000
MiSeq
PacBio
Oxford Nanopore MinION: https://nanoporetech.com/products/minion
Oxford Nanopore SmidgION: https://nanoporetech.com/products/smidgion
Alikhan, N.-F. et al., 2011. BLAST Ring Image Generator (BRIG): simple prokaryote genome comparisons. BMC genomics, 12, p.402.
Bacterial Draft Genomes:
- 1 circular chromosome
- 1.5-4 Mb (most of them)
- From a few to hundreds of contigs
- May contain plasmids
- Hundreds of thousands of bacterial read sets are already available on SRA/ENA
- Usually sequenced at 30-100x depth of coverage
- Cost: 70-150 EUR (Illumina) / 500-2000 EUR (PacBio/Nanopore)
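The depth-of-coverage figure above follows from simple arithmetic: total sequenced bases divided by genome size. A minimal sketch, with function names chosen here for illustration only:

```python
import math


def depth_of_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_depth: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_depth * genome_size / read_length)


# A 2 Mb genome sequenced with 400,000 reads of 250 bp.
print(depth_of_coverage(400_000, 250, 2_000_000))
# Reads of 150 bp needed for 50x on the same genome.
print(reads_needed(50, 150, 2_000_000))
```

This is an average over the whole genome; real coverage is uneven, so the value is a planning estimate rather than a guarantee for every position.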
Sequencing & Bioinformatics
• Sequencing, Assembly Pipeline Parameters
• QA/QC Metrics
• Tree Construction Details
Sample Information
• Isolation source
• Food, Clinical, Environment
• Food category, Body Product
• Dates, Location
Clinical and Epi Details
• Demographics
• Host disease, Symptoms
• Lab Test Results
• Exposures
Slide credit: Will Hsiao
investigations using integrated microbial genomic data & “metadata” (lab, epidemiological data)
… aiming to save lives, economies
Slide credit: Fiona Brinkman
Chronicle of a Death Foretold
http://en.wikipedia.org/wiki/File:ChronicleOfADeathForetold.JPG
Game Changer for Microbial Typing
From the reads much more information can be extracted:
- gene-by-gene approaches: wgMLST, cgMLST
- SNP comparison approaches: comparison with reference strains
- k-mer based distances
- Ability to recover most of the present sequence-based typing information in a single experimental procedure
- Greatly increased discriminatory power
- Unifies genomics and typing
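As one illustration of the k-mer based distances listed above, the sketch below computes a Jaccard distance between the k-mer sets of two sequences. It is a toy, exact version: tools that work at scale typically use sketching (subsampling of k-mers) rather than full sets, and the choice of k=4 here only suits the tiny example sequences.

```python
from typing import Set


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a: str, b: str, k: int = 4) -> float:
    """1 - |shared k-mers| / |all k-mers| for two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:  # both sequences shorter than k
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


s1 = "ACGTACGTGACCTGA"
s2 = "ACGTACGTGACCTGT"  # one substitution at the last base
print(round(jaccard_distance(s1, s2), 3))
```

No reference genome or alignment is needed, which is why this family of methods is attractive for a quick first comparison of many genomes.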
The Ideal Scenario
Microbiological Sample -> Magic Box of NGS Wonders for Microbiology -> Completely characterized strain:
• Antibiotic resistance profile
• Multilocus Sequence Typing (MLST)
• Virulence factors present
• Other SBTM information, e.g.:
• spa (S. aureus)
• emm (Group A Streptococcus)
Desired end result:
Risk assessment of the strain and useful application of the data to clinical practice
Comparison between groups of strains
Didelot, X., Bowden, R., Wilson, D. J., Peto, T. E. A. & Crook, D. W. Transforming
clinical microbiology with bacterial genome sequencing. Nat Rev Genet 13, 601–612
(2012).
sample -> HTS -> reads + Reference genome -> Read mapping software -> VCF/Fasta File with SNPs -> Phylogenetic/Minimum spanning Tree
• Uses a reference strain:
• Outbreak determination
• Comparative studies
• Monomorphic (Clonal) species
• Recombination/Horizontal gene transfer must be detected and removed from phylogenetic analysis
• Difficult to create a nomenclature (due to different references)
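Once every isolate has been mapped to the same reference, the SNP comparison reduces to counting differing positions between aligned sequences. The sketch below assumes the input is already an alignment of equal-length sequences (for example consensus sequences derived from the VCF files); the isolate names and sequences are invented, and recombination filtering is not attempted.

```python
from itertools import combinations
from typing import Dict, List, Tuple

BASES = set("ACGT")


def snp_distance(a: str, b: str) -> int:
    """Count positions where both sequences have an unambiguous, different base."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x in BASES and y in BASES and x != y)


def variable_sites(alignment: Dict[str, str]) -> List[int]:
    """0-based columns where at least two different unambiguous bases occur."""
    seqs = list(alignment.values())
    return [i for i in range(len(seqs[0]))
            if len({s[i] for s in seqs} & BASES) > 1]


def distance_matrix(alignment: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Pairwise SNP distances for every pair of isolates."""
    return {(p, q): snp_distance(alignment[p], alignment[q])
            for p, q in combinations(sorted(alignment), 2)}


aln = {
    "reference": "ACGTACGT",
    "isolate_1": "ACGTACGA",
    "isolate_2": "ACCTACGA",
    "isolate_3": "ACNTACGT",  # N = position not called
}
print(variable_sites(aln))
print(distance_matrix(aln))
```

Because the distances depend on which reference was used for mapping, results from different references are not directly comparable, which is the nomenclature difficulty noted above.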
61 Streptococcus dysgalactiae subspecies equisimilis isolates
Roary presence and absence matrix (10661 gene clusters)
Core (n or n-1 strains)
Soft-Core (n-2 or n-3 strains)
Shell (8(?) to n-3 strains)
Cloud (<8(?) strains)
Core genome: Core + Soft-Core
Accessory genome: Shell + Cloud
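The categories above can be derived from a presence/absence matrix by counting, for each gene cluster, how many strains carry it. The sketch below follows the thresholds written on the slide, with two labeled assumptions: the uncertain cloud cut-off of 8 is exposed as a parameter, and the shell is taken to end at n-4 so that it does not overlap with the soft-core. Roary itself defines these classes by percentage of isolates, so this is not its exact rule.

```python
from typing import Dict, List


def classify(count: int, n: int, cloud_below: int = 8) -> str:
    """Assign a gene cluster to a category from the number of strains carrying it."""
    if count >= n - 1:
        return "core"
    if count >= n - 3:
        return "soft-core"
    if count >= cloud_below:  # assumption: shell ends at n-4
        return "shell"
    return "cloud"


def summarize(matrix: Dict[str, List[int]]) -> Dict[str, int]:
    """Count clusters per category; matrix maps cluster -> 0/1 per strain."""
    totals = {"core": 0, "soft-core": 0, "shell": 0, "cloud": 0}
    for presence in matrix.values():
        totals[classify(sum(presence), len(presence))] += 1
    return totals


n = 61  # number of isolates on the slide
toy = {
    "geneA": [1] * 61,
    "geneB": [1] * 59 + [0] * 2,
    "geneC": [1] * 30 + [0] * 31,
    "geneD": [1] * 3 + [0] * 58,
}
print(summarize(toy))
```

Core plus soft-core then gives the core genome, and shell plus cloud the accessory genome, as stated above.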
Catarina Inês Mendes (as you already noticed, decorations were only for Xmas)
Virulome
Core genome
Accessory genome
Mobilome
sample -> HTS -> reads -> De novo assembly software -> contigs -> Output: Allelic Profile -> Phylogenetic/Minimum spanning Tree
Central nomenclature server: Schemas, Allele/Profile IDs
• Expansion of the MLST concepts to core/pan genome
• Buffers recombination effect
• Simpler to create a nomenclature
• Population structure of non-monomorphic species
• Easy to compare thousands of samples using thousands of loci
• Handling missing data is still an open problem
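Comparing two allelic profiles comes down to counting loci with different allele IDs. The sketch below shows one possible way to deal with missing loci, namely skipping any locus that is missing in either profile. This is only an illustrative choice, not a recommended solution to the open problem mentioned above, and the use of 0 as the missing-data code is an assumption.

```python
from typing import Sequence, Tuple


def allelic_distance(p: Sequence[int], q: Sequence[int],
                     missing: int = 0) -> Tuple[int, int]:
    """Return (differing loci, compared loci), skipping loci missing in either profile."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same loci")
    compared = differing = 0
    for a, b in zip(p, q):
        if a == missing or b == missing:
            continue  # locus not called in at least one sample
        compared += 1
        if a != b:
            differing += 1
    return differing, compared


sample_1 = [1, 2, 3, 0, 5]
sample_2 = [1, 4, 3, 7, 0]
diff, used = allelic_distance(sample_1, sample_2)
print(f"{diff} differences over {used} comparable loci")
```

Returning the number of compared loci alongside the distance makes it visible when two samples look close only because few loci could be compared.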
This is Chewbacca ... He is chewBBACA’s cousin
Our approach to the problem:
Mickael Silva (He didn’t bring the glasses…)
https://pmcvariety.files.wordpress.com/2014/06/eli-wallach-dead-good-bad-ugly.jpg?w=670&h=377&crop=1
My Goals / Areas that I want to apply WGS to:
• Microbial population structure
• Microbial evolution
• Microbial genomics: gene structure, genome synteny, Mobile Genetic Elements detection
My toolbox is chosen based on my questions and what I want to do!
Trying to avoid: “I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.” - Abraham H. Maslow (1962), Toward a Psychology of Being
Sequence QA/QC: FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Adaptor and quality trimming: Trimmomatic http://www.usadellab.org/cms/?page=trimmomatic
Assembly: SPAdes http://bioinf.spbau.ru/spades
Assembly: Velvet http://www.ebi.ac.uk/~zerbino/velvet/
Mapping: Bowtie2 http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
Annotation: Prokka http://www.vicbioinformatics.com/software.prokka.shtml
Whole genome comparison: BRIG (BLAST Ring Image Generator) http://brig.sourceforge.net/
Whole genome comparison: MAUVE http://darlinglab.org/mauve/mauve.html
http://rugbyea.com/wp-content/uploads/2013/05/blast.jpg
http://www.ecohealthypets.com/writable/pet_report_photos/photo/480x/ball_python_2.jpg
- Perform the same analysis over tens, hundreds or thousands of strains: your own and publicly available
- Integrate multiple analyses in a single pipeline
- Pipelines = reproducibility (if not, something is very wrong)
http://www.ebi.ac.uk/ena
http://www.ncbi.nlm.nih.gov/sra
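A pipeline in this sense can be as simple as a script that produces the same commands for every sample. The sketch below only builds the command lines for three of the tools from the toolbox (FastQC, SPAdes, Prokka) and does not run them. The directory layout and sample names are assumptions made for the example, the trimming step is left out, and the options shown are the basic ones; check each tool's documentation before use.

```python
from pathlib import PurePosixPath
from typing import List


def build_commands(sample: str, r1: str, r2: str, outdir: str) -> List[List[str]]:
    """Return the QC, assembly and annotation commands for one paired-end sample."""
    base = PurePosixPath(outdir) / sample
    qc_dir = base / "fastqc"          # must exist before FastQC is run
    asm_dir = base / "assembly"
    ann_dir = base / "annotation"
    contigs = asm_dir / "contigs.fasta"  # file written by SPAdes in its output dir
    return [
        ["fastqc", "-o", str(qc_dir), r1, r2],
        ["spades.py", "-1", r1, "-2", r2, "-o", str(asm_dir)],
        ["prokka", "--outdir", str(ann_dir), "--prefix", sample, str(contigs)],
    ]


# The same steps, unchanged, for any number of samples (hypothetical file names).
for name in ("strainA", "strainB"):
    for cmd in build_commands(name, f"{name}_1.fastq.gz", f"{name}_2.fastq.gz", "results"):
        print(" ".join(cmd))
```

Each command list could be passed to `subprocess.run(cmd, check=True)` to execute it. Keeping the command construction in one function is what makes the analysis repeatable across your own and public read sets.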
A Standardized Pipeline for Bacterial Genome Assembly and Quality Control
Miguel Machado (not wearing his wolf suit today… I think...)
Virulence Factor Databases
VFDB (http://www.mgc.ac.cn/VFs/main.htm)
Pathosystems Resource Integration Center (PATRIC) VF (https://www.patricbrc.org/)
Victors (http://www.phidias.us/victors/)
PHI-Base (http://www.phi-base.org/)
MvirDB (http://mvirdb.llnl.gov/)
To know more:
- Presentation in the "Controversies in interpreting whole genome sequence data" session: http://eccmidlive.org/#resources/how-can-we-design-actionable-virulome-databases
Comprehensive Antibiotic Resistance Database (CARD) (https://card.mcmaster.ca/)
Repository of Antibiotic resistance Cassettes (RAC) (http://rac.aihi.mq.edu.au/rac/)
INTEGRALL: the integron database (http://integrall.bio.ua.pt/)
(…)
“Formal representation of knowledge as a set of concepts within a domain, and the relationships between those concepts” – Wikipedia
Domain modeling: represents all the concepts involved in microbial typing by sequence-based methods
Provides a shared vocabulary, where the concepts should be unambiguous
Enables a machine-readable format that software and algorithms can use to interact automatically with multiple databases
Existing DBs reuse each other's datasets without true database interoperability: need for common ontologies (controlled vocabularies already exist but are not used by all)
Ontologies and computer-readable data formats (JSON-LD or RDF) can allow for true database interoperability, allowing bioinformaticians to extract the targeted information from a single query reaching multiple databases
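To make the idea of a computer-readable format concrete, the sketch below writes one typing result as a JSON-LD document. Only the `@context`, `@id` and `@type` keywords are part of JSON-LD; the `http://example.org/typing#` vocabulary and every term in it are hypothetical placeholders, standing in for terms that would come from a shared ontology such as those discussed here.

```python
import json
from typing import Dict


def isolate_to_jsonld(isolate_id: str, species: str, st: int,
                      alleles: Dict[str, int]) -> dict:
    """Describe one typed isolate using a (hypothetical) shared vocabulary."""
    return {
        "@context": {
            "ex": "http://example.org/typing#",  # placeholder namespace
            "species": "ex:species",
            "sequenceType": "ex:sequenceType",
            "alleles": "ex:allelicProfile",
        },
        "@id": f"ex:isolate/{isolate_id}",
        "@type": "ex:Isolate",
        "species": species,
        "sequenceType": st,
        "alleles": alleles,
    }


doc = isolate_to_jsonld(
    "demo-001", "Streptococcus pneumoniae", 156,
    {"aroE": 7, "gdh": 11, "gki": 10, "recP": 1, "spi": 6, "xpt": 8, "ddl": 1},
)
print(json.dumps(doc, indent=2))
```

Because each field is tied to a term with a URI, two databases that agree on the vocabulary can exchange and merge such records without custom parsers for each other's formats.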
Trends Microbiol 17, 279–285 (2009).
GenEpiO: Combining Different Epi, Lab, Genomics and Clinical Data Fields.
Lab Analytics: Genomics, PFGE, Serotyping, Phage typing, MLST, AMR
Clinical Data: Patient demographics, Medical History, Comorbidities, Symptoms, Health Status
Reporting: Case/Investigation Status
GenEpiO (Genomic Epidemiology Application Ontology)
See draft version at https://github.com/Public-Health-Bioinformatics/IRIDA_ontology
Original slide from Emma Griffiths
Public Health Surveillance
Case Cluster Analysis
Result Reporting
Infectious Disease Epidemiology (from case to Intervention)
Lab Surveillance (from sample to strain typing results)
Evidence Collection & Outbreak Investigation
Sample Collection & Processing
Sequence Data Generation & Processing
Bioinformatics Analysis
Result Reporting
Whole Genome Sequencing (SO, ERO, OBI etc)
Quality Control (OBI, ERO)
Anatomy (FMA)
Environment (Envo)
Food (FoodOn)
Clinical Sampling (OBI)
Custom LIMS
Quality Control (OBI, ERO)
AMR (ARO)
Virulence (PATO)
Phylogenetic Clustering (EDAM)
Mobile Elements (MobiO)
Quality Control (OBI, ERO)
AMR (ARO) LOINC
Surveillance (SurvO)
Demographics (SIO)
Patient History (SIO)
Symptoms (SYMP)
Exposures (ExO)
Source Attribution (IDO)
Travel (IDO)
Transmission (TRANS)
Food (FoodOn)
Geography (OMRSE)
Outbreak Protocols
Surveillance (SurvO)
Food (FoodOn)
Surveillance (SurvO)
Mobile Elements (MobiO)
Infectious Disease (IDO)
Typing (TypON)
Nomenclature & Taxonomy (NCBItaxon)
Original slide from Emma Griffiths /IRIDA
http://foodontology.github.io/foodon/
(pipeline) NGSOnto
Available databases still lack interfaces for programmatic access: RESTful APIs would allow:
▪ easy automatic querying from scripts without the need of web interfaces or downloads
▪ Database updates by authorized groups (distributed curation effort)
APIs: Application Programming Interfaces
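The kind of scripted querying described above could look like the sketch below. Everything about the service is hypothetical: the base URL, the `profiles` resource, the filter names and the response layout are invented for illustration and do not describe any existing database. Only the standard-library calls are real.

```python
import json
from typing import List
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://typing.example.org/api/v1"  # hypothetical service


def build_query_url(resource: str, **filters) -> str:
    """Build the GET URL for a resource with optional filter parameters."""
    url = f"{BASE_URL}/{resource}"
    if filters:
        url += "?" + urlencode(sorted(filters.items()))
    return url


def parse_profiles(body: str) -> List[dict]:
    """Extract the records from a JSON response (assumed layout)."""
    return json.loads(body)["profiles"]


def fetch_profiles(**filters) -> List[dict]:
    """Query the service; not called here because the URL is a placeholder."""
    with urlopen(build_query_url("profiles", **filters)) as response:
        return parse_profiles(response.read().decode("utf-8"))


print(build_query_url("profiles", species="spneumoniae", st=156))
example_body = '{"profiles": [{"st": 156, "aroE": 7, "gdh": 11}]}'
print(parse_profiles(example_body))
```

With an interface of this kind, a pipeline can look up nomenclature during analysis instead of depending on manual downloads from a web page.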
Now we have thousands of targets for thousands of strains annotated with precious epidemiological data
Traditional phylogenetic analysis methods aren’t able to tackle the existing amount of information
Freely available / Open source Java software
Calculates:
- goeBURST MST
- Hierarchical clustering
- Neighbour Joining
Can be easily applied to:
- MLST/cgMLST/wgMLST
- MLVA
- SNP data*
- Gene presence/absence
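To show what a minimum spanning tree over allelic profiles involves, here is a plain Kruskal MST on Hamming distances. It is not goeBURST: goeBURST adds tie-breaking rules for choosing among equally distant links, whereas this sketch breaks ties by alphabetical order of the labels. The profiles are invented.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def hamming(p: Sequence[int], q: Sequence[int]) -> int:
    """Number of loci with different allele IDs."""
    return sum(a != b for a, b in zip(p, q))


def minimum_spanning_tree(profiles: Dict[str, Sequence[int]]) -> List[Tuple[str, str, int]]:
    """Kruskal's algorithm: add the shortest edges that do not close a cycle."""
    names = sorted(profiles)
    edges = sorted((hamming(profiles[a], profiles[b]), a, b)
                   for a, b in combinations(names, 2))
    parent = {name: name for name in names}

    def find(x: str) -> str:
        # Follow parent links to the component root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for dist, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # edge joins two separate components
            parent[root_a] = root_b
            tree.append((a, b, dist))
    return tree


toy_profiles = {
    "ST-A": (1, 1, 1, 1, 1, 1, 1),
    "ST-B": (1, 1, 1, 1, 1, 1, 2),
    "ST-C": (1, 1, 1, 1, 1, 2, 2),
    "ST-D": (3, 3, 3, 1, 1, 1, 1),
}
for a, b, d in minimum_spanning_tree(toy_profiles):
    print(f"{a} - {b}: {d} locus differences")
```

Computing all pairwise distances grows quadratically with the number of profiles, so this simple form is suitable only for small data sets.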
https://online.phyloviz.net/
API:
* account creation
* profile + metadata upload
* running goeBURST
* retrieving a link
Private or public data sharing
Scalable to thousands of nodes
Tree analysis tools:
- Interactive distance matrix
- NLV graph
Node.js / VivaGraph.js (WebGL)
Screenshot by @happy_khan
With Enterobase data https://enterobase.warwick.ac.uk/
Bruno Gonçalves (He also didn’t bring the glasses…)
• High Throughput Sequencing changed our views and ways to analyze bacterial populations and discriminate strains for outbreak investigation/surveillance purposes -> Genomic Epidemiology
• Bioinformatics is the key item for global genomic epidemiology. Open-source and freely-available tools provide the ability to build custom-made and verifiable pipelines.
• Real-time global data sharing can speed up outbreak investigations and save lives… however, some ethical/confidentiality issues need to be handled
• Analyzing and querying all the data produced is computationally challenging. Most methods don’t scale well
• The future: isolation-free methods are needed to speed up the analysis: metagenomics
Algorithms
Interfaces
Ontologies
UMMI Members: Bruno Gonçalves, Mickael Silva, Catarina Inês Mendes, Miguel Machado, Mário Ramirez, José Melo-Cristino
INESC-ID Alexandre Francisco Cátia Vaz Marta Nascimento
EFSA INNUENDO Project (https://sites.google.com/site/innuendocon/) Mirko Rossi
BACGENTRACK project [FCT / Scientific and Technological Research Council of Turkey (Türkiye Bilimsel ve Teknolojik Araştırma Kurumu, TÜBİTAK), TUBITAK/0004/2014]
ONEIDA project FCT Joint Activities Programme (PAC) - http://www.itqb.unl.pt/oneida
Genome Canada IRIDA project (www.irida.ca) Franklin Bristow, Thomas Matthews, Aaron Petkau, Morag Graham and Gary Van Domselaar (NLM , PHAC) Ed Taboada and Peter Kruczkiewicz (Lab Foodborne Zoonoses, PHAC) Fiona Brinkman (SFU) William Hsiao (BCCDC)
INTEGRATED RAPID INFECTIOUS DISEASE ANALYSIS