Making Use of NGS Data: From Reads to Trees and Annotations

Post on 14-Jan-2017


João André Carriço, PhD
Microbiology Institute / Institute for Molecular Medicine
Faculty of Medicine, University of Lisbon
Portugal

http://im.fm.ul.pt
http://imm.fm.ul.pt
http://www.joaocarrico.info

WORKSHOP 24: NGS FOR MICROBIAL GENOMIC SURVEILLANCE AND MORE - ONE TECHNOLOGY FITS ALL

Conflicts of interest

Nothing to disclose

Disclaimer

This presentation is not intended to cover all available software or databases (we would need several weeks or months to do that).

I’ll present what I use or intend to use in the near future.

I gladly accept any suggestions of tools to include in similar presentations in the future.

It is supposed to be interactive, so ask away during the presentation.

Summary

• What is in the reads FASTQ files
• Available databases
  - Virulence factor and AMR databases
  - Sequence-based typing databases: PubMLST.org / Enterobase
• High Throughput Sequencing data analysis (freeware)
  - Prokka
  - Roary
  - Nullarbor
  - Microreact.org
  - PHYLOViZ
• Commercial solutions
  - Bionumerics 7.5
  - CLC Genomics Workbench (CLC Bio)
  - Ridom SeqSphere+

What is in the reads FASTQ files?

Sequenced reads:
• Isolate genome*
• Other isolates in the sequencing run
• Contamination

* Chromosome + plasmids + phages

Slide source: Nick Loman
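To make the file format concrete, here is a minimal sketch of reading FASTQ records in Python. It assumes the common four-lines-per-read layout and Phred+33 quality encoding; the names `FastqRecord`, `parse_fastq` and `mean_phred` are illustrative and not part of any tool mentioned in this talk.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line): stop reading
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 assumed by default)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


if __name__ == "__main__":
    example = "@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n!!!!\n"
    for record in parse_fastq(io.StringIO(example)):
        print(record.read_id, record.sequence, mean_phred(record.quality))
```

Nothing in a record says which organism the read came from, which is why reads from other isolates in the run or from contaminants can only be detected by downstream analysis.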

Databases

Virulence Factor (VF) Databases

• VFDB (http://www.mgc.ac.cn/VFs/main.htm)
• Pathosystems Resource Integration Center (PATRIC) VF (https://www.patricbrc.org/)
• Victors (http://www.phidias.us/victors/)
• PHI-base (http://www.phi-base.org/)
• MvirDB (http://mvirdb.llnl.gov/)

To know more: presentation in the “Controversies in interpreting whole genome sequence data” session: http://eccmidlive.org/#resources/how-can-we-design-actionable-virulome-databases

Antibiotic Resistance Databases

• Comprehensive Antibiotic Resistance Database (CARD) (https://card.mcmaster.ca/)
• Repository of Antibiotic resistance Cassettes (RAC) (http://rac.aihi.mq.edu.au/rac/)
• INTEGRALL: the integron database (http://integrall.bio.ua.pt/)

(…)

Sequence Based Typing: PubMLST / BIGSdb

http://www.pubmlst.org

http://bigsdb.web.pasteur.fr/

Sequence Based Typing: Enterobase

Slide by @happy_khan

Martin Sergeant, Mark Achtman, Nabil-Fareed Alikhan, Zhemin Zhou

Sequenced my strain…now what?

To know more : http://www.slideshare.net/nickloman/eccmid-2015-so-i-have-sequenced-my-genome-what-now

Data at each stage:
• Reads (fastq files)
• Contigs (fasta files)
• Annotated contigs (gbk/gff files)

Tools:
• De novo assembler
• Prokka
• Roary: pan-genome analysis
• Enterobase
• BIGSdb
• Nullarbor
• PHYLOViZ: tree + metadata visualization
• Microreact.org: tree + metadata + visualization
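As a rough illustration of how the annotation and pan-genome steps of this workflow chain together, the sketch below only builds the command lines for Prokka and Roary; it does not run them. The flags are the ones I recall from each tool's README (check `prokka --help` and `roary -h` for your installed version), and the sample names, paths and helper function names are hypothetical. The assembly step is left out because no specific assembler is named here.

```python
from pathlib import Path
from typing import List, Sequence


def prokka_command(contigs: Path, outdir: Path, prefix: str) -> List[str]:
    """Command line to annotate one assembly with Prokka.

    Prokka is expected to write <outdir>/<prefix>.gff among its outputs.
    """
    return ["prokka", "--outdir", str(outdir), "--prefix", prefix, str(contigs)]


def roary_command(gff_files: Sequence[Path], outdir: Path, threads: int = 1) -> List[str]:
    """Command line to build a pan genome with Roary from annotated GFF files.

    -f sets the output directory, -e requests a core gene alignment,
    -n switches that alignment to the faster MAFFT mode, -p sets threads.
    """
    if len(gff_files) < 2:
        raise ValueError("A pan genome needs at least two annotated genomes")
    return ["roary", "-f", str(outdir), "-e", "-n", "-p", str(threads),
            *[str(path) for path in gff_files]]


if __name__ == "__main__":
    samples = ["isolate_A", "isolate_B"]  # hypothetical sample names
    gffs = []
    for name in samples:
        annotation_dir = Path("annotation") / name
        print(" ".join(prokka_command(Path(f"{name}.fasta"), annotation_dir, name)))
        gffs.append(annotation_dir / f"{name}.gff")
    print(" ".join(roary_command(gffs, Path("pan_genome"), threads=4)))
```

Each list can be passed to `subprocess.run(command, check=True)` once the tools are installed and on the PATH.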

Prokka

Genome annotation made easy, by Torsten Seemann (slides by Torsten)

Genome annotation: adding biological information to the sequence by describing features

To know more: http://www.slideshare.net/torstenseemann/prokka-rapid-bacterial-genome-annotation-abphm-2013

Available at: https://github.com/tseemann/prokka

Roary

Pan-genome analysis by Andrew Page

Available at: https://sangerpathogens.github.io/Roary/

Core genome

Accessory genome

Pan-genome

Roary

Inputs:
• Annotated de novo assemblies (GFF files)
• Typically from the annotation pipeline

Outputs:
• Spreadsheet with presence and absence of genes
• Multi-FASTA alignment of core genes so you can build a tree without a reference
• Multi-FASTA alignments for each gene
• Plots for the open/closed genome, unique genes
• Integrates with Phandango so you can visualise all structural variation
• QC report from Kraken to help identify suspect samples

(Slide by Andrew Page)

Roary outputs

• Core (n or n-1 strains)
• Soft-core (n-2 or n-3 strains)
• Shell (8(?) to n-3 strains)
• Cloud (<8(?) strains)

Core genome: Core + Soft-core

Accessory genome: Shell + Cloud
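The categories above can be sketched as a simple classification over a gene presence/absence table. The cut-offs here are fractions of strains (99%, 95% and 15%), which is my recollection of Roary's default summary thresholds and should be treated as an assumption; the slide expresses them as strain counts for one particular dataset. The function names and the in-memory data layout are illustrative only.

```python
from typing import Dict, Set, Tuple


def classify_genes(
    presence: Dict[str, Set[str]],
    n_strains: int,
    core: float = 0.99,       # assumed threshold, adjust to your analysis
    soft_core: float = 0.95,  # assumed threshold
    shell: float = 0.15,      # assumed threshold
) -> Dict[str, str]:
    """Label each gene by the fraction of strains that carry it."""
    labels = {}
    for gene, strains in presence.items():
        fraction = len(strains) / n_strains
        if fraction >= core:
            labels[gene] = "core"
        elif fraction >= soft_core:
            labels[gene] = "soft-core"
        elif fraction >= shell:
            labels[gene] = "shell"
        else:
            labels[gene] = "cloud"
    return labels


def split_pan_genome(labels: Dict[str, str]) -> Tuple[Set[str], Set[str]]:
    """Core genome = core + soft-core; accessory genome = shell + cloud."""
    core_genome = {g for g, label in labels.items() if label in ("core", "soft-core")}
    accessory = set(labels) - core_genome
    return core_genome, accessory


if __name__ == "__main__":
    strains = [f"strain{i}" for i in range(100)]
    presence = {
        "geneA": set(strains),        # present in all strains
        "geneB": set(strains[:96]),   # present in 96% of strains
        "geneC": set(strains[:50]),   # present in half of the strains
        "geneD": set(strains[:3]),    # rare gene
    }
    labels = classify_genes(presence, n_strains=len(strains))
    print(labels)
    print(split_pan_genome(labels))
```

The pan-genome is simply the union of both sets, that is, every gene seen in at least one strain.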

Roary outputs

iCANDY output of presence and absence of genes in the accessory genome: S. Weltevreden and public S. enterica genomes

(Slide by Andrew Page)

Nullarbor

Complete pipeline from reads to reports, by Torsten Seemann

The objective is to automate analysis for everyday use in public health labs and research settings

Uses and distills the outputs of many software tools

Available at: https://github.com/tseemann/nullarbor

Nullarbor

Slide by Torsten Seemann

From: https://github.com/tseemann/nullarbor

Some Nullarbor outputs in report

Slides by Torsten Seemann

PHYLOViZ (www.phyloviz.net)

Inputs:
- Tab-separated txt (profiles)
- Fasta files
- Automatic database retrieval (MLST)

Outputs:
• goeBURST and goeBURST MST
• Link quality assessment
• High quality images

Can be easily applied to:
- MLST / cgMLST / wgMLST
- MLVA
- SNP data*
- Gene presence/absence
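To illustrate what a tab-separated profile file and a tree built from it look like, here is a minimal sketch that reads allelic profiles, computes pairwise Hamming distances (number of differing loci) and builds a plain minimum spanning tree with Kruskal's algorithm. This is not goeBURST: the goeBURST algorithms apply additional tie-break rules when several links have the same distance, which this sketch does not implement. The column layout (identifier first, one column per locus) and all function names are assumptions for illustration.

```python
import csv
import io
from itertools import combinations
from typing import Dict, List, Tuple

Profile = Tuple[str, ...]


def read_profiles(text: str) -> Dict[str, Profile]:
    """Parse tab-separated profiles: first column is the ID, the rest are loci."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader, None)  # skip the header row
    return {row[0]: tuple(row[1:]) for row in reader if row}


def hamming(a: Profile, b: Profile) -> int:
    """Number of loci at which two profiles carry different alleles."""
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(profiles: Dict[str, Profile]) -> List[Tuple[str, str, int]]:
    """Kruskal's algorithm over all pairwise distances (plain MST, no tie-break rules)."""
    edges = sorted(
        (hamming(profiles[u], profiles[v]), u, v)
        for u, v in combinations(sorted(profiles), 2)
    )
    parent = {name: name for name in profiles}

    def find(node: str) -> str:
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path compression
            node = parent[node]
        return node

    tree = []
    for distance, u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two separate components
            parent[root_u] = root_v
            tree.append((u, v, distance))
    return tree


if __name__ == "__main__":
    example = (
        "ST\tlocus1\tlocus2\tlocus3\tlocus4\n"
        "ST1\t1\t1\t1\t1\n"
        "ST2\t1\t1\t1\t2\n"
        "ST3\t1\t1\t2\t2\n"
        "ST4\t5\t6\t7\t8\n"
    )
    for u, v, distance in minimum_spanning_tree(read_profiles(example)):
        print(u, v, distance)
```

The same distance works for MLST, cgMLST/wgMLST and gene presence/absence profiles, since each is a vector of categorical values per isolate.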

PHYLOViZ 2.0

New features:
• Hierarchical clustering
• Neighbor-Joining
• Project saving

PHYLOViZ Online

Available at http://online.phyloviz.net

Web-based version of PHYLOViZ

Allows users to create their own datasets, save them and share their data (privately or publicly)

REST API available

Scalable to thousands of nodes

Tree analysis tools: interactive distance matrix, NLV graph
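The two tree analysis tools can be sketched in a few lines, assuming that NLV stands for "N locus variants", that is, a graph linking every pair of profiles that differ at no more than N loci. That reading, the toy profiles and the function names are my assumptions; this is not PHYLOViZ Online's implementation.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Profile = Tuple[str, ...]


def hamming(a: Profile, b: Profile) -> int:
    """Number of loci at which two profiles carry different alleles."""
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(profiles: Dict[str, Profile]) -> Dict[str, Dict[str, int]]:
    """Symmetric matrix of pairwise locus differences, keyed by profile ID."""
    return {
        u: {v: hamming(profiles[u], profiles[v]) for v in profiles}
        for u in profiles
    }


def nlv_edges(profiles: Dict[str, Profile], max_differences: int) -> List[Tuple[str, str, int]]:
    """All pairs of profiles that differ at no more than max_differences loci."""
    edges = []
    for u, v in combinations(sorted(profiles), 2):
        distance = hamming(profiles[u], profiles[v])
        if distance <= max_differences:
            edges.append((u, v, distance))
    return edges


if __name__ == "__main__":
    profiles = {
        "ST1": ("1", "1", "1", "1"),
        "ST2": ("1", "1", "1", "2"),
        "ST3": ("1", "1", "2", "2"),
        "ST4": ("5", "6", "7", "8"),
    }
    print(distance_matrix(profiles)["ST1"])
    print(nlv_edges(profiles, max_differences=1))
    print(nlv_edges(profiles, max_differences=2))
```

Unlike a spanning tree, this graph keeps every close link, so it can show alternative connections that a single tree has to discard.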

PHYLOViZ Online

Slide by @happy_khan


NLV Graph

Tree cut-off

Full MST

microreact.org


Create Selections

Change tree options

Available at http://microreact.org/

Presentation in the session “Harnessing whole genome sequence data for public health applications”: Novel open access tools for WGS-based pathogen surveillance and the identification of high-risk clones

http://eccmidlive.org/#resources/novel-open-access-tools-for-wgs-based-pathogen-surveillance-and-the-identification-of-high-risk-clones

Meet The Experts (available on twitter by order of appearance)

Commercial solutions

• Ridom SeqSphere+: http://www.ridom.de/seqsphere/
• Applied Maths Bionumerics 7.6: http://www.applied-maths.com/bionumerics
• CLC bio Genomics Workbench: http://www.clcbio.com/blog/clc-genomics-workbench-7-5/

Take home messages

• Huge variety of software and database solutions

• There is no single One-Size-Fits-All solution (job security for bioinformaticians)

• Different questions require different approaches

• Always question the results and data provenance

More references/presentations

ECCMID2015 Meet-the-expert session on “What bioinformatic tools should I use for analysis of High Throughput Sequencing data for molecular diagnostics?”

Nick Loman: http://www.slideshare.net/nickloman/eccmid-2015-meettheexpert-bioinformatics-tools

João André Carriço: http://www.slideshare.net/joaoandrecarrico/eccmid-meet-theexpert2015

Acknowledgments

UMMI Members: Bruno Gonçalves, Mário Ramirez, José Melo-Cristino

INESC-ID: Alexandre Francisco, Cátia Vaz, Marta Nascimento

EFSA INNUENDO Project (https://sites.google.com/site/innuendocon/): Mirko Rossi

FP7 PathoNGenTrace (http://www.patho-ngen-trace.eu/):
• Dag Harmsen (Univ. Muenster)
• Stefan Niemann (Research Center Borstel)
• Keith Jolley, James Bray and Martin Maiden (Univ. Oxford)
• Joerg Rothganger (RIDOM)
• Hannes Pouseele (Applied Maths)

Genome Canada IRIDA project (Integrated Rapid Infectious Disease Analysis, www.irida.ca):
• Franklin Bristow, Thomas Matthews, Aaron Petkau, Morag Graham and Gary Van Domselaar (NLM, PHAC)
• Ed Taboada and Peter Kruczkiewicz (Lab Foodborne Zoonoses, PHAC)
• Fiona Brinkman (SFU)
• William Hsiao (BCCDC)