GalaxyTrakr: Development of an Accessible Cloud …...Discovery • NGS Assembly – Plasmidspades,...

Post on 03-Aug-2020

1

GalaxyTrakr: Development of an Accessible Cloud-based Bioinformatics Platform

James Pettengill

Geneticist, Biostatistics and Bioinformatics Staff

Center for Food Safety and Applied Nutrition

US Food and Drug Administration

Food Safety & High-Throughput Sequencing (HTS)

Institute for Food Safety and Health

May 31, 2018

3

Outline:

1. Galaxy: a user-friendly interface for bioinformatics

• Introduction to Galaxy

• GalaxyTrakr Overview

• GalaxyTrakr Tools

2. CFSAN cgMLST: rapid screen and clustering of isolates

• Internal rapid identification of SNP clusters for outbreak analyses

• Resource for others/industry

4

What is Galaxy?

5

Why Galaxy?

- Has a graphical user interface (GUI), so it does not require command line experience

- Active community of developers/users sharing the tools they have developed or ported to Galaxy*

- Access programs through the Galaxy Tool Shed

6

Summary of Galaxy on AWS

• Galaxy is distributed under the Academic Free License.

– https://galaxyproject.org/

• Installed on the master node of an AWS CloudFormation cluster.

• Submits jobs to the compute cluster via Grid Engine.

• Compute clusters are elastic and scale with demand.

• Storage is elastic and accessible from multiple master nodes.

• Two options for installation on AWS:

– https://aws.amazon.com/hpc/cfncluster/ **

– https://galaxyproject.org/cloudman/getting-started/

7

GalaxyTrakr Tools

• NGS QC and Manipulation: Trimmomatic, FastQC

• NGS Mapping: Bowtie2, Short Read Sequence Typer (v2), BWA and BWA-MEM, Neptune Signature Discovery

• NGS Assembly: plasmidSPAdes, SPAdes, QUAST

• NGS Screening and Prediction: SeqSero v1 and v2, SeqSero Batch Paired-End Reads, SISTR cmd, BTyper, MLST, ABRicate

• Reference based variant detection: CFSAN SNP Pipeline

• Data Input

– Direct from NCBI in Pileup, BAM or FASTA/Q format

– Upload from local computer via secure FTP or via the GalaxyTrakr web interface

• Data Output

– Download from the GalaxyTrakr web interface

– Download via FTP

11

GalaxyTrakr Stats

• Currently 139 active users across 42 different locations worldwide, adding about 15 users per week

• Over 38,000 jobs processed to date, with top users using over 3,500 hours and 11,000 CPU slots

• Cost with current load is approximately $6,500 a month; the initial target was $10,000 a month

• Custom software for automated monitoring and management; less than 1 full-time staff member manages IT services

– Detailed custom dashboard: http://dash.galaxytrakr.org/

12

An example with SeqSero:

• Uses whole genome sequence data to predict serotype.

• Useful tool for QA/QC

• Maps reads to database of antigen alleles using BWA in multiple steps.

• Chooses alleles to which more reads mapped.

• Uses BLAST to clear up ambiguities.
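
The allele-selection step listed above can be illustrated with a minimal Python sketch. This is not SeqSero's own code: it assumes the per-allele read counts have already been produced by the mapping step, and the function name `pick_best_alleles`, the locus and allele labels, and the counts are all hypothetical. Ties are only reported here; the slide says SeqSero resolves ambiguities with BLAST, which the sketch does not attempt.

```python
from collections import defaultdict


def pick_best_alleles(mapped_counts):
    """For each antigen locus, return the allele(s) with the most mapped reads.

    mapped_counts: dict mapping (locus, allele) -> number of reads mapped.
    Returns a dict locus -> sorted list of top alleles. More than one
    entry means a tie that a later step would have to resolve.
    """
    # Group the read counts by locus.
    by_locus = defaultdict(dict)
    for (locus, allele), count in mapped_counts.items():
        by_locus[locus][allele] = count

    # Keep every allele that reaches the maximum count for its locus.
    best = {}
    for locus, counts in by_locus.items():
        top = max(counts.values())
        best[locus] = sorted(a for a, c in counts.items() if c == top)
    return best


# Hypothetical read counts, for illustration only.
example_counts = {
    ("locus_1", "allele_a"): 950,
    ("locus_1", "allele_b"): 12,
    ("locus_2", "allele_c"): 400,
    ("locus_2", "allele_d"): 400,
}
result = pick_best_alleles(example_counts)
```

In this example `locus_1` has a clear winner, while `locus_2` returns two alleles and would be flagged as ambiguous.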

13

Galaxy homepage

14

Upload your data

15

Choose your data

16

Run it

17

Wait…

18

View and Download the Results

19

CFSAN SNP Pipeline

20

cgMLST: core genome multi-locus sequence typing

[Figure: cgMLST workflow in two panels]

cgMLST creation: collection of reference genomes → annotate with PROKKA (annotated references) → all-against-all comparison of genes → identify single copy core genes → cgMLST database

cgMLST in practice: genomes of interest → annotation → isolate and compare cgMLST loci to determine closest isolates
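
The "identify single copy core genes" step in the creation panel can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the CFSAN implementation: it assumes the all-against-all comparison has already assigned each gene to a shared identifier, so that a gene listed twice for a genome means two copies. The function name `single_copy_core_genes` and the genome and gene labels are hypothetical.

```python
from collections import Counter


def single_copy_core_genes(genome_genes):
    """Return the set of genes found exactly once in every genome.

    genome_genes: dict mapping genome name -> list of gene identifiers.
    A gene that is missing from any genome, or duplicated in any
    genome, is excluded from the result.
    """
    core = None
    for genes in genome_genes.values():
        copies = Counter(genes)
        single = {gene for gene, n in copies.items() if n == 1}
        # Intersect with the single-copy genes seen in earlier genomes.
        core = single if core is None else core & single
    return core if core is not None else set()


# Hypothetical annotated references, for illustration only.
references = {
    "ref_1": ["geneA", "geneB", "geneC"],
    "ref_2": ["geneA", "geneB", "geneB"],           # geneB duplicated, geneC absent
    "ref_3": ["geneA", "geneB", "geneC", "geneD"],
}
core_genes = single_copy_core_genes(references)
```

Only `geneA` survives here, because `geneB` is duplicated in `ref_2` and `geneC` and `geneD` are not present in every reference.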

21

cgMLST has the potential to incorporate many approaches…

• SNP detection/calling (reference)

• Open-reading frame annotation (de-novo); presence/absence and extended MLST

• Whole-chromosome alignment (de-novo)

…providing extreme sensitivity and valuable genotypic and phenotypic information.

22

cgMLST: rapid screening of GenomeTrakr/Pathogen database to identify similar isolates
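
One way such a screen could work is to rank database isolates by how many cgMLST loci differ from the query. The Python sketch below assumes each isolate is represented as a mapping from locus to allele number, with `None` for a missing call. The slides do not state which distance measure CFSAN uses, so the allele-difference count, the function names `allele_distance` and `closest_isolates`, and the example profiles are all assumptions for illustration.

```python
def allele_distance(profile_a, profile_b):
    """Count loci where both profiles have a call and the calls differ."""
    shared = profile_a.keys() & profile_b.keys()
    return sum(
        1
        for locus in shared
        if profile_a[locus] is not None
        and profile_b[locus] is not None
        and profile_a[locus] != profile_b[locus]
    )


def closest_isolates(query, database, top_n=3):
    """Rank database isolates by allele distance to the query profile.

    Returns up to top_n (isolate name, distance) pairs, closest first.
    Ties are broken alphabetically by isolate name.
    """
    scored = [
        (allele_distance(query, profile), name)
        for name, profile in database.items()
    ]
    return [(name, dist) for dist, name in sorted(scored)[:top_n]]


# Hypothetical allele profiles, for illustration only.
query_profile = {"locus1": 1, "locus2": 4, "locus3": 2}
database_profiles = {
    "isolate_A": {"locus1": 1, "locus2": 4, "locus3": 2},
    "isolate_B": {"locus1": 1, "locus2": 5, "locus3": 2},
    "isolate_C": {"locus1": 3, "locus2": 5, "locus3": None},
}
hits = closest_isolates(query_profile, database_profiles, top_n=2)
```

Loci with a missing call are skipped rather than counted as differences, which is one possible design choice; a stricter screen could instead penalize missing loci.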

23

Summary:

1. GalaxyTrakr: a user-friendly interface for bioinformatics

• Open source

• Lots of tools

• Active community and support available

• CFSAN’s GalaxyTrakr may not be suitable for industry because it is intended for public data, but industry could stand up its own Galaxy instance in-house.

2. CFSAN cgMLST: rapid screen and clustering of isolates

• Internal rapid identification of SNP clusters for outbreak analyses

• Resource for others/industry – requires some bioinformatic expertise

24

Acknowledgements

GalaxyTrakr

• James Sanders**

• Charles Strittmatter

• Justin Payne

• Errol Strain

• Hugh Rand

cgMLST

• Arthur Pightling