1
GalaxyTrakr: Development of an
Accessible Cloud-based
Bioinformatics Platform
James Pettengill
Geneticist, Biostatistics and Bioinformatics Staff
Center for Food Safety and Applied Nutrition
US Food and Drug Administration
Food Safety & High-Throughput Sequencing (HTS)
Institute for Food Safety and Health
May 31, 2018
3
Outline:
1. Galaxy: a user-friendly interface for bioinformatics
• Introduction to Galaxy
• GalaxyTrakr Overview
• GalaxyTrakr Tools
2. CFSAN cgMLST: rapid screen and clustering of
isolates
• Internal rapid identification of SNP clusters for outbreak
analyses
• Resource for others/industry
4
What is Galaxy?
5
Why Galaxy?
- Has a graphical user interface (GUI) so does not require
command line experience
- Active community of developers/users sharing the tools they
have developed or ported to Galaxy*
- Access programs through the Galaxy Tool Shed
6
Summary of Galaxy on AWS
• Galaxy has an Academic Free License.
– https://galaxyproject.org/
• Installed on the master node of an AWS CloudFormation cluster.
• Submits jobs to compute cluster via Grid Engine.
• Compute clusters are elastic, based on demand.
• Storage is elastic and accessible from multiple master
nodes.
• Two options for installation on AWS:
– https://aws.amazon.com/hpc/cfncluster/ **
– https://galaxyproject.org/cloudman/getting-started/
7
GalaxyTrakr Tools
• NGS QC and Manipulation – Trimmomatic, FastQC
• NGS Mapping – Bowtie2, Short Read Sequence Typer (v2), BWA and BWA-MEM, Neptune Signature Discovery
• NGS Assembly – plasmidSPAdes, SPAdes, QUAST
• NGS Screening and Prediction – SeqSero v1 and v2, SeqSero Batch Paired-End Reads, SISTR cmd, BTyper, MLST, ABRicate
• Reference-based variant detection – CFSAN SNP Pipeline
• Data Input
– Direct from NCBI in Pileup, BAM or FASTA/Q format
– Upload from local computer via secure FTP or via GalaxyTrakr web interface
• Data Output
– Download from GalaxyTrakr web interface
– Download via FTP
11
GalaxyTrakr Stats
• Currently 139 active users across 42 different locations
worldwide, adding about 15 users per week
• Over 38,000 jobs processed to date, top users using
over 3500 hours and 11,000 CPU slots
• Cost with current load is approximately $6500 a month,
initial target was $10000 a month
• Custom software for automated monitoring and
management, less than 1 full-time staff member
managing IT services - detailed Custom Dashboard:
http://dash.galaxytrakr.org/
12
An example with SeqSero:
• Uses whole genome sequence data to predict serotype.
• Useful tool for QA/QC
• Maps reads to database of antigen alleles using BWA in multiple steps.
• Chooses alleles to which more reads mapped.
• Uses BLAST to clear up ambiguities.
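The allele-selection step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SeqSero's actual code: the function name `pick_alleles`, the `min_margin` threshold, and the allele labels are all hypothetical, and the "ambiguous" flag merely marks calls that a follow-up step such as BLAST would need to resolve.

```python
from collections import Counter


def pick_alleles(read_hits, min_margin=2):
    """Choose, per antigen, the allele with the most mapped reads.

    read_hits: iterable of (antigen, allele) pairs, one per mapped read.
    Returns {antigen: (best_allele, ambiguous)}; `ambiguous` is True when
    the lead over the runner-up is smaller than `min_margin` (an assumed,
    illustrative threshold).
    """
    counts = {}
    for antigen, allele in read_hits:
        counts.setdefault(antigen, Counter())[allele] += 1

    calls = {}
    for antigen, counter in counts.items():
        ranked = counter.most_common(2)
        best, best_n = ranked[0]
        runner_n = ranked[1][1] if len(ranked) > 1 else 0
        calls[antigen] = (best, best_n - runner_n < min_margin)
    return calls


# Hypothetical read-to-allele hits (labels are made up for illustration).
hits = (
    [("O", "O-4")] * 40 + [("O", "O-9")] * 3
    + [("H1", "fliC_i")] * 25 + [("H1", "fliC_r")] * 24
)
calls = pick_alleles(hits)
print(calls)
```

Here the O antigen call is clear-cut, while the H1 call is flagged because the two candidate alleles are nearly tied.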
13
Galaxy homepage
14
Upload your data
15
Choose your data
16
Run it
17
Wait…
18
View and Download the Results
19
CFSAN SNP Pipeline
20
cgMLST: core genome multi-locus sequence type
[Figure: cgMLST workflow]
cgMLST creation
• Collection of annotated reference genomes
• All-against-all comparison of genes
• Identify single-copy core genes
• cgMLST database
cgMLST in practice
• Genome of interest
• Annotate with PROKKA
• Isolate and compare cgMLST loci to determine closest isolates
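Comparing cgMLST loci to find the closest isolates can be illustrated with a small Python sketch. It assumes each isolate is already reduced to a profile mapping locus name to allele number (with `None` for a missing call); the function names, locus names, and profiles are hypothetical and do not describe the CFSAN implementation.

```python
def allele_distance(profile_a, profile_b):
    """Count loci called in both profiles whose allele numbers differ."""
    shared = [
        locus for locus in profile_a
        if locus in profile_b
        and profile_a[locus] is not None
        and profile_b[locus] is not None
    ]
    return sum(profile_a[locus] != profile_b[locus] for locus in shared)


def closest_isolates(query, database, top_n=3):
    """Rank database isolates by allele distance to the query profile."""
    ranked = sorted(
        (allele_distance(query, profile), name)
        for name, profile in database.items()
    )
    return [(name, dist) for dist, name in ranked[:top_n]]


# Hypothetical cgMLST profiles: locus name -> allele number (None = no call).
database = {
    "isolate_A": {"locus1": 1, "locus2": 4, "locus3": 2, "locus4": 7},
    "isolate_B": {"locus1": 1, "locus2": 5, "locus3": 3, "locus4": 7},
    "isolate_C": {"locus1": 2, "locus2": 6, "locus3": 9, "locus4": None},
}
query = {"locus1": 1, "locus2": 4, "locus3": 2, "locus4": 8}

nearest = closest_isolates(query, database, top_n=2)
print(nearest)
```

Skipping loci without a call in either profile is one possible design choice; a real pipeline might instead penalize missing loci or require a minimum number of shared loci.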
21
cgMLST has the potential to incorporate many approaches…
• SNP detection/calling (reference)
• Open-reading frame annotation (de novo)
• Presence/absence and extended MLST
• Whole-chromosome alignment (de novo)
…providing extreme sensitivity and valuable genotypic and phenotypic information.
22
cgMLST: rapid screening of GenomeTrakr/Pathogen
database to identify similar isolates
23
Summary:
1. GalaxyTrakr: a user-friendly interface for bioinformatics
• Open source
• Lots of tools
• Active community and support available
• CFSAN’s GalaxyTrakr may not be suitable for industry because it is intended for public data, but industry could stand up its own Galaxy instance in-house.
2. CFSAN cgMLST: rapid screen and clustering of isolates
• Internal rapid identification of SNP clusters for outbreak analyses
• Resource for others/industry – requires some bioinformatic expertise
24
Acknowledgements
GalaxyTrakr
• James Sanders**
• Charles Strittmatter
• Justin Payne
• Errol Strain
• Hugh Rand
cgMLST
• Arthur Pightling