Boston University / Globus Genomics Proof of Concept Results and Overview November 21, 2013


Page 1

Boston University / Globus Genomics Proof of Concept Results and Overview

November 21, 2013

Page 2

www.globus.org/genomics

Who We Are

• Globus Genomics is developed, operated, and supported by researchers, developers, and bioinformaticians at the Computation Institute – University of Chicago/Argonne National Lab
• We are a non-profit organization building solutions for non-profit researchers
• Our goal is to support the advancement of science by bringing together our strengths and capabilities to help meet the unique needs of researchers and research institutions

Page 3

Challenges in Sequencing Analysis

Data Movement and Access Challenges

[Diagram labels: Sequencing Centers, Seq Center, Public Data, Storage, Local Cluster/Cloud, Research Lab; links labeled FTP, SCP, HTTP]

• Data is distributed in different locations
• Research labs need access to the data for analysis
• Be able to share data with other researchers/collaborators
• Inefficient ways of data movement
• Data needs to be available on the local and distributed compute resources
  – Local clusters, cloud, grid

Manual Data Analysis

How do we analyze this sequence data once we have it?

[Diagram labels: Fastq, Ref Genome, Alignment, Variant Calling, Picard, GATK; cycle of Install, Modify, (Re)Run Script]

• Manually move the data to the compute node
• Install all the tools required for the analysis
  – BWA, Picard, GATK, filtering scripts, etc.
• Shell scripts to sequentially execute the tools
  – Manually modify the scripts for any change
• Error-prone, difficult to keep track of, messy
  – Difficult to maintain and transfer the knowledge

Page 4

Globus Genomics

Data Management

[Diagram labels: Sequencing Centers, Seq Center, Public Data, Storage, Local Cluster/Cloud, Research Lab; links labeled FTP, SCP, HTTP, others, and Globus Genomics]

Globus provides a
• High-performance
• Fault-tolerant
• Secure
file transfer service between all data endpoints

Data Analysis

[Diagram labels: Fastq, Ref Genome, Alignment, Variant Calling, Picard, GATK, Galaxy Data Libraries]

Galaxy-Based Workflow Management System
• Globus integrated within Galaxy
• Web-based UI
• Drag-and-drop workflow creation
• Easily modify workflows with new tools

Globus Genomics on Amazon EC2
• Analytical tools are automatically run on the scalable compute resources when possible

Page 5

Globus integrated with Galaxy – A flexible, scalable, simplified analysis platform

[Diagram labels: Data and Tools, Templates, Publish]

Accessibility
• Unified web interface for obtaining genomic data and applying computational tools to analyze the data
• Easily integrate your own tools and scripts for analysis
• Collection of tools (Tools Panel) that reflect good practices and community insights
• Access every step of analysis and intermediate results:
  – View, Download, Visualize, Reuse (History Panel)

Reproducibility
• Track provenance and ensure repeatability of each analysis step:
  – Input datasets, tools used, parameter values, and output datasets
• Intuitive Workflow Editor to create or modify complex workflows and use them as templates – Reusable and Reproducible

Transparency
• Publish and share metadata, histories, and workflows at multiple levels
• Store public and generated datasets as Data Libraries – e.g., hg19 Ref Genome
• Shared datasets and workflows can be imported by other users for reuse

Globus Integration
• Access Globus Endpoints and transfer data from within Galaxy UI and into Galaxy workspace
• Leverage local cluster or cloud-based scalable computational resources for parallelizing the tools
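The provenance items named under Reproducibility (input datasets, tools used, parameter values, output datasets) can be pictured as one small record per analysis step. The sketch below is only an illustration: the `AnalysisStep` class, its field names, and the tool and file names are hypothetical and are not Galaxy's actual schema or API.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AnalysisStep:
    """One step of an analysis history (hypothetical schema, not Galaxy's)."""
    tool: str                   # tool name and version (made-up examples below)
    inputs: list[str]           # identifiers of the input datasets
    parameters: dict[str, str]  # parameter values used for this run
    outputs: list[str]          # identifiers of the output datasets


def to_json(history: list[AnalysisStep]) -> str:
    """Serialize a history so the same steps can be shared or re-run later."""
    return json.dumps([asdict(step) for step in history], indent=2, sort_keys=True)


# A two-step history with placeholder tool and dataset names.
history = [
    AnalysisStep("aligner 1.0", ["sample1.fastq", "hg19"], {"threads": "8"}, ["sample1.bam"]),
    AnalysisStep("variant-caller 2.0", ["sample1.bam", "hg19"], {"min_quality": "30"}, ["sample1.vcf"]),
]
print(to_json(history))
```

Because every step records what went in, what was run, and what came out, a shared history carries enough information for another user to repeat the analysis.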

Page 6

Globus Overview

• No IT required
  – Software as a Service (SaaS)
    • No client software installation
    • New features automatically available
  – Consolidated support & troubleshooting
  – Works with existing GridFTP servers
  – Globus Connect solves “last mile problem”
• GridFTP-based
  – Open source and freely available
  – Provides a comprehensive security model
  – De facto standard for data movement in large national cyberinfrastructure projects
• >10,000 registered users, >28 PB moved
• Recommended and used by DOE facilities, NSF supercomputing centers, and many campuses

Page 7

Globus Genomics

• Workflows can be easily defined and automated with integrated Galaxy Platform capabilities
• Data movement is streamlined with integrated Globus file-transfer functionality
• Resources can be provisioned on demand with Amazon Web Services cloud-based infrastructure

Page 8

Additional Capabilities

• Professionally managed and supported platform
• Best-practice pipelines
• Enhanced workbench with breadth of analytic tools
• Technical support and bioinformatics consulting
• Access to pre-integrated endpoints for reliable and high-performance data transfer (e.g., Broad Institute, Perkin Elmer, university sequencing centers)
• Cost-effective solution with subscription-based pricing

Page 9

RNA-Seq Pipeline – Proof of Concept

• Worked closely with Andi, Charlie, Adam and Andy
• Set up a data movement capability with Globus Transfer
• Implemented an RNA-Seq pipeline using Globus Genomics
• Set up a scalable analysis instance utilizing Amazon Web Services
• Completed analysis runs on various small test data sets
• Completed analysis runs on multiple 48-sample data sets

Page 10

Huntington’s Disease mRNA-Seq Study

• Huntington’s Disease (HD) mRNA-Seq Study
  – 48 samples: 21 HD and 27 controls
  – Total mRNA extracted from prefrontal cortex of postmortem human brain
• Sequencing Dataset
  – Illumina TruSeq protocol (poly-A tail mRNA selection)
  – HiSeq 2000 @ Tufts sequencing center
  – 101 nucleotides, paired-end reads, unstranded, multiplexed 3/lane
  – Average of 83M reads per sample (55,810,684 – 167,044,880)
  – ~350 Gb of data

Page 11

mRNA-Seq Analysis Workflow: Single Sample

Page 12

mRNA-Seq Analysis Workflow: All Samples

Page 13

RNA-Seq Analysis Workflow – Globus Genomics (Stage 1 – Alignment per sample)

Page 14

RNA-Seq Analysis Workflow – Globus Genomics (Stage 2 – Differential Expression)

Page 15

POC Results – Trial Test Data Set

Page 16

POC Results – Final Test Data Set

• 48-sample run on RNA-Seq pipeline took ~48 hours
  – Submitted two batches of 25 and 23 samples in parallel
  – Each batch completed in ~24 hours
• Average cost per RNA-Seq analysis ~$9.50 / sample + storage
• Expectation is to be able to scale out the analysis to handle 100+ data sets in parallel
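As a rough worked example, the reported average of ~$9.50 per sample can be multiplied out over the two batches of the 48-sample run. The function and constant names below are illustrative only, and storage is left out because the slide gives no figure for it.

```python
COST_PER_SAMPLE_USD = 9.50  # approximate average from the POC results, excluding storage
BATCH_SIZES = [25, 23]      # the two batches reported for the 48-sample run


def compute_cost(n_samples: int, cost_per_sample: float = COST_PER_SAMPLE_USD) -> float:
    """Estimated analysis cost in USD for n_samples, excluding storage."""
    return n_samples * cost_per_sample


total_samples = sum(BATCH_SIZES)
print(total_samples)                           # 48
print(compute_cost(total_samples))             # 456.0
print([compute_cost(n) for n in BATCH_SIZES])  # [237.5, 218.5]
```

Under these assumptions the full run comes to roughly $456 in analysis cost before storage.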

Page 17

Globus Genomics Demonstration

Page 18

Post POC – Next Steps

• Flexibility to add tools (choose among 500+ applications or add your own)
• Custom-build pipelines from scratch, utilize existing best-practice pipelines, or refine pipelines as needed
• Run at very large scale
• Streamline data movement between sequencing centers and collaborators with high-performance, secure transfers and sophisticated data sharing
• Move from a POC setup to a production environment

Page 19

Security Considerations

Globus Genomics compliance with the NCBI Database of Genotypes and Phenotypes (dbGaP) security best practices

Protecting the Security of Controlled Data on Servers

Requirement: Servers must not be accessible directly from the internet and unnecessary services should be disabled.

➢ All Globus Genomics servers are protected by Amazon Security Groups and by stateful packet inspection firewalls. Only necessary services are allowed.

Requirement: Keep systems up to date with security patches.

➢ All relevant security patches are applied as soon as they are available.

Requirement: dbGaP data on the systems must be secured from other users and if exported via file sharing, ensure limited access to remote systems.

➢ Globus Genomics and Globus Online provide sharing solutions that are secure and user controlled.

Page 20

Security Considerations

Globus Genomics compliance with the NCBI Database of Genotypes and Phenotypes (dbGaP) security best practices

Requirement: If accessing system remotely, encrypted data access must be used.

➢ Globus Genomics uses HTTPS and the GridFTP protocol with authentication and encryption when transferring files.

Requirement: Ensure that all users of this data have IT security training suitable for this data access and understand the restrictions and responsibilities involved in access to this data.

➢ Data access is strictly restricted to individual users and only users can share the data with other users. We provide detailed instructions to our users on data security and access control.

Requirement: If data is used on multiple systems, ensure that data access policies are retained throughout the processing of the data on all the other systems. If data is cached on local systems, directory protection must be kept, and data must be removed when processing is complete.

➢ The data sharing and access policies on Globus Genomics are retained across all the systems involved.

Page 21

Subscription Pricing

                              Entry Level    Basic                     Standard
Typical Workload (monthly)
  Exomes                      ~50            ~250                      ~1000
  Whole genomes               ~10            ~50                       ~200
  RNA-seqs                    ~100           ~500                      ~2000
Technical Support             M-F, 9-5 CT,   M-F, 9-5 CT,              M-F, 9-5 CT,
                              best effort    2-business-day response   1-business-day response
Support Channels              Email          Email + Phone             Email + Phone
Support Contacts              1              2                         5
Batch Submission Capability   No             Yes                       Yes
Branding                      No             No                        Yes
Monthly Pricing*              $500           $1,500                    $5,000
Annual Pricing*               $5,000         $13,500                   $50,000

* Does not include AWS costs
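To compare the monthly and annual prices listed on the Subscription Pricing slide, a short calculation helps. This sketch only restates the listed prices; the `PLANS` dictionary and `annual_savings` function are illustrative names, and AWS costs are excluded, as the footnote says.

```python
# (monthly price, annual price) in USD, as listed on the Subscription Pricing slide.
PLANS = {
    "Entry Level": (500, 5_000),
    "Basic": (1_500, 13_500),
    "Standard": (5_000, 50_000),
}


def annual_savings(monthly: int, annual: int) -> int:
    """USD saved per year by paying annually instead of making 12 monthly payments."""
    return 12 * monthly - annual


for name, (monthly, annual) in PLANS.items():
    saved = annual_savings(monthly, annual)
    months_equivalent = annual / monthly
    print(f"{name}: save ${saved:,} per year (annual price = {months_equivalent:g} monthly payments)")
```

On the listed prices, annual billing costs the equivalent of 10 monthly payments for Entry Level and Standard, and 9 for Basic.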

Page 22

Example Collaborations

Cox Lab

Background: A computational lab focusing on identification and characterization of genetic variation influencing susceptibility to complex disorders.

Approach: Develop a consensus variant calling approach to improve quality and confidence in identified variants from multiple variant calling applications.

Results: Consensus caller generated a high-quality list of variants (less than 0.01% Mendelian error rate) for 134 samples in 4 days.

Future Plans: Apply consensus caller to 13,000 exome samples from NDAR
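The slide does not say how the Cox Lab's consensus caller combines the outputs of the different variant callers. One simple rule, shown here purely as an assumption, is to keep a variant only when at least a minimum number of callers report it. All names and the toy variants below are hypothetical.

```python
from collections import Counter

# A variant is identified here by (chromosome, position, ref allele, alt allele).
Variant = tuple[str, int, str, str]


def consensus_calls(call_sets: list[set[Variant]], min_callers: int) -> set[Variant]:
    """Return the variants reported by at least `min_callers` of the call sets."""
    counts = Counter(variant for calls in call_sets for variant in calls)
    return {variant for variant, n in counts.items() if n >= min_callers}


# Toy call sets from three imaginary callers.
caller_a = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}
caller_b = {("chr1", 100, "A", "G"), ("chr2", 300, "G", "A")}
caller_c = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}

print(sorted(consensus_calls([caller_a, caller_b, caller_c], min_callers=2)))
# [('chr1', 100, 'A', 'G'), ('chr1', 200, 'C', 'T')]
```

Raising `min_callers` trades sensitivity for confidence: requiring all three callers here would keep only the chr1:100 variant.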

Page 23

Example Collaborations

Dobyns Lab

Background: Investigate the nature and causes of a wide range of human developmental brain disorders

Approach: Replaced manual analysis with Globus Genomics

Results: Achieved greater than 20X speed-up in analysis of exome data

Future Plans: Leverage scale-out capability of Globus Genomics on 150 exome data set and seek to achieve 50X speed-up in analysis

Page 24

Example Collaborations

Georgetown Medical Center

Background: The Innovation Center for Biomedical Informatics (ICBI) is an academic hub for innovative research in the field of biomedical informatics.

Approach: Augment current team and tools with an NGS analysis platform to support standard and best-practice pipelines while leveraging elastic cloud-based resources.

Results: Pilot effort is nearly complete – improved quality and performance results on whole genome, exome and RNA-Seq pipelines utilizing Globus Genomics

Future Plans: Provide Globus Genomics as a well-managed platform-as-a-service for ICBI collaborators and users

Page 25

Summary

• Successful Proof of Concept
• Look into ways where Globus can streamline data movement and aid in data management
  – Access Globus Endpoints and data from within Galaxy UI and into Galaxy workspace
  – High-performance, reliable data transfer protocol optimized for high-bandwidth wide-area networks
• Leverage the enhanced Galaxy-based solution for NGS analysis needs
  – Collection of tools and workflows that reflect good practices and community insights
  – Intuitive Workflow Editor to create or modify complex workflows and use them as templates – Reusable and Reproducible
• Utilize elastic cloud-based computational resources to execute analysis at large scale
• Deliver an advanced, well-managed, well-supported genomics analysis platform to enable researchers to focus on their research and not IT

Page 26

Questions?