A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308) | AWS re:Invent 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Jafar Shameem

Amazon Web Services

November 14, 2013

The Problem and Promise of Translational Genetics and a Step to the Clouded Solution of Scalable Clinical Whole Genome Sequencing

Agenda

• Bio-Informatics and Amazon Web Services

• Examples of collaboration

• Building Blocks
– Compute
– Storage
– Tools
– Pricing Models

A rich history of collaboration with Life Sciences organizations

• A centralized repository of public datasets

• Seamless integration with cloud-based applications

• No charge to the community

• Some of the datasets available today:

– 1000 Genomes Project

– Human Microbiome Project

– Ensembl

– GenBank

– Illumina – Jay Flatley Human Genome Dataset

– YRI Trio Dataset

– The Cannabis Sativa Genome

– UniGene

– Influenza Virus

– PubChem

• Tell us what else you’d like for us to host …

AWS Public Data Sets

CHARGE Consortium: aimed at better understanding how human genetics contributes to heart disease and aging

• DNAnexus
• Baylor College of Medicine

Compute

EC2 instance types (memory, EC2 Compute Units, virtual cores or local storage, as listed on the slide):
• Micro: 613 MB, up to 2 ECUs
• Small: 1.7 GB, 1 EC2 Compute Unit, 1 virtual core
• Medium: 3.7 GB, 2 EC2 Compute Units, 1 virtual core
• Large: 7.5 GB, 4 EC2 Compute Units, 2 virtual cores
• Extra Large: 15 GB, 8 EC2 Compute Units, 4 virtual cores
• Hi-Mem XL: 17.1 GB, 6.5 EC2 Compute Units, 2 virtual cores
• Hi-Mem 2XL: 34.2 GB, 13 EC2 Compute Units, 4 virtual cores
• Hi-Mem 4XL: 68.4 GB, 8 virtual cores
• High-CPU Med: 1.7 GB, 5 EC2 Compute Units, 2 virtual cores
• High-CPU XL: 7 GB, 8 virtual cores
• Cluster GPU 4XL: 22 GB, 33.5 EC2 Compute Units, 2 x NVIDIA Tesla “Fermi” M2050 GPUs
• Cluster Compute 4XL: 23 GB, 33.5 EC2 Compute Units
• Cluster Compute 8XL: 60.5 GB
• Cluster High Mem 8XL: 244 GB SSD instance storage
• High I/O 4XL: 60.5 GB, 35 EC2 Compute Units, 2 x 1024 GB SSD-based local instance storage
• High Storage 8XL: 117 GB, 24 x 2 TB instance store

Storage

• Relational Database Service: fully managed database (MySQL, Oracle, MSSQL)
• DynamoDB: NoSQL, schemaless, provisioned-throughput database
• S3: object datastore, up to 5 TB per object, 99.999999999% durability
• SimpleDB: NoSQL, schemaless, for smaller datasets
• Redshift: fully managed, petabyte-scale data warehousing service
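One way to read the S3 durability figure is as an expected loss rate. The sketch below assumes the 99.999999999% figure is an annual, per-object survival probability and that losses are independent; both are assumptions made for illustration and are not stated on the slide. The function name is hypothetical.

```python
from decimal import Decimal

# Durability figure quoted on the slide (eleven nines), as a fraction.
DURABILITY = Decimal("99.999999999") / Decimal(100)


def expected_annual_losses(object_count: int) -> Decimal:
    """Expected number of objects lost per year.

    Assumes (for illustration only) that the durability figure is an
    annual per-object probability and that losses are independent.
    """
    loss_probability = Decimal(1) - DURABILITY
    return loss_probability * Decimal(object_count)


if __name__ == "__main__":
    print(expected_annual_losses(10_000_000))
```

Under these assumptions, ten million stored objects give an expected loss of 0.0001 objects per year, or about one object every 10,000 years on average.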

• GATK

• NCBI BLAST

• Crossbow

• CloudBurst

• Myrna

• CloVR

• BioPerl Max

• VIPDAC

• Superfamily

• Cloud-Coffee

• BioNimbus

• GMOD

• CloudAligner

• BioConductor

• QIIME

• SNAP

• BWA

• Bowtie/TopHat/Cufflinks

• STAR, GSNAP, RUM

Get links to AMIs at:

https://github.com/mndoci/mndoci.github.com/wiki/Life-Science-Apps-on-AWS

Tools of the trade

• Cluster provisioning: MIT StarCluster, Galaxy CloudMan, Rocks
• Job schedulers: Torque, Slurm, Condor
• Configuration management: Chef, Puppet, SaltStack

Many purchase models to support different needs

• On-Demand: pay for compute capacity by the hour with no long-term commitments. For spiky workloads, or to define needs.
• Reserved: make a low, one-time payment and receive a significant discount on the hourly charge. For committed utilization.
• Spot: bid for unused capacity, charged at a Spot Price which fluctuates based on supply and demand. For time-insensitive or transient workloads.
• Dedicated: launch instances within Amazon VPC that run on hardware dedicated to a single customer. For highly sensitive or compliance-related workloads.
• Free Tier: get started on AWS with free usage and no commitment. For POCs and getting started.
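The trade-off between On-Demand and Reserved pricing comes down to a break-even point: the one-time payment is recovered once enough discounted hours have been used. The sketch below shows that arithmetic. All prices are hypothetical placeholders chosen for illustration, not actual AWS rates, and the function name is invented for this example.

```python
def reserved_break_even_hours(upfront: float,
                              reserved_hourly: float,
                              on_demand_hourly: float) -> float:
    """Hours of use after which Reserved becomes cheaper than On-Demand."""
    if reserved_hourly >= on_demand_hourly:
        raise ValueError("Reserved hourly rate must be below the On-Demand rate")
    # Each hour on the reserved rate saves (on_demand - reserved);
    # the upfront payment is recovered after this many hours.
    return upfront / (on_demand_hourly - reserved_hourly)


if __name__ == "__main__":
    # Hypothetical prices, for illustration only.
    hours = reserved_break_even_hours(upfront=300.0,
                                      reserved_hourly=0.05,
                                      on_demand_hourly=0.20)
    print(f"Break-even after about {hours:.0f} hours")
```

With these placeholder prices the break-even is about 2,000 hours, roughly 83 days of continuous use; below that, On-Demand is the cheaper choice.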

How to use Spot?

Ideal Applications
• Batch Processing
• Time-Delayable
• Fault-Tolerant or Restartable
• Compute-Intensive
• Horizontally Scalable
• Stateless Worker Nodes
• Region and AZ Independent
• Uses Deployment Automation

Less Ideal Applications
• Interactive
• Strict/Tight SLA for Completion
• Expensive to Handle Terminations
• Data-Intensive
• In-Memory Scaling
• Long-Running Worker Nodes (weeks)
• Requires a Single AZ
• Manually Launched and Managed
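To illustrate why restartable, time-delayable jobs suit Spot, the sketch below simulates a job against an hourly price series. It is a deliberately simplified model, not the actual Spot billing rules: it assumes the instance runs in any hour where the price is at or below the bid, that each such hour is charged at the Spot price rather than the bid, and that work is checkpointed hourly so an interruption loses nothing. The prices and the function name are hypothetical.

```python
def run_on_spot(hourly_prices, bid, work_hours):
    """Simulate a restartable job that needs `work_hours` of compute."""
    hours_done = 0
    cost = 0.0
    interruptions = 0
    running = False
    for price in hourly_prices:
        if hours_done == work_hours:
            break
        if price <= bid:
            # Instance is running this hour; charged the Spot price, not the bid.
            hours_done += 1
            cost += price
            running = True
        else:
            # Price rose above the bid: a running instance is terminated.
            if running:
                interruptions += 1
            running = False
    return {
        "completed": hours_done == work_hours,
        "hours_done": hours_done,
        "cost": cost,
        "interruptions": interruptions,
    }


if __name__ == "__main__":
    # Hypothetical price series in dollars per hour.
    print(run_on_spot([0.10, 0.12, 0.30, 0.11, 0.09], bid=0.15, work_hours=4))
```

In this toy run the job is interrupted once when the price spikes, resumes afterwards, and still completes; an interactive or tightly scheduled workload would not tolerate that gap.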

Stanford University

Tractable, scalable, and economical processing of clinical whole genome sequences in AWS

Clinical Genomics for Cancer Diagnosis

Amazon Web Services re:Invent 2013

Nov 14th, 2013 Las Vegas, NV

Peter J. Tonellato, PhD Harvard Medical School

Dennis P. Wall, PhD* Stanford University*


Objective: The objective of the Whole Genome Breast Cancer Program (WGBC) is to demonstrate the clinical utility and value of whole genome analysis for practical breast cancer detection, diagnosis, prognosis, and improved outcomes.

WGA in Clinical Turn-Around: Demonstrate the use of Amazon Web Services to establish Clinical Whole Genome Analysis in “clinical turn-around”: WG NGS Sequence to Actionable Health Care Information

Clock time: < 3 hours Cost: < $100

Whole Genome Breast Cancer Program


1. Organization and Progress to date

2. Historical BIDMC Breast Cancer cases

3. Clinical Whole Genome Analysis – Laboratory Test

4. COSMOS: Clinical Whole Genome Analysis on AWS



Program organization (chart):
• Oversight: External Advisory Board, Clinical Executive Committee, Cancer Center Regulatory Affairs
• Program Coordinator: Michiyo; Research Assistant* (Emily Poles?); Technician*
• N-MDBCTB
• Disciplines: Bioinformatics (Peter & Dennis), Surgery (Mike), Oncology (Gerburg), Pathology (Stu), Radiation Oncology (Abram), Imaging (Tejas), Genetic Counseling (Jill), Genetics (Nadine), Social Work (Barbara)
• LPM: Sheida, Latrice, Jared, Yassine, Val, Michiyo
• Case management: case identification, case review (N-MDBCTB), consent (RN), clinical data management, tissue collection, sample management, follow-up assay/bioinformatics
• Assay: preparation/storage, DNA extraction/purification, sample delivery (to outsource), whole genome sequencing, OncoScan v3 (BI), DNA/RNA/NGS sequencing (outsource)
• Bioinformatics: data transfer, genome data integration/management, annotation, analysis, translation, case evidence report
• Translation: data integration/management, case report (N-MDBCTB)

* Hiring

BIDMC Breast Cancer Patient/Sample Process (flowchart)

• Flows shown: patient flow, case identification workflow, sample workflow, analysis workflow
• Clinical path: BIDMC Clinic (Oncology, Surgery, Radiation Oncology); clinical evaluation; diagnosis work-up (MMG, US, MRI, biopsy with immunohistochemistry and FISH); blood test; case presentation and schedule; case evaluation at the BCMDTB; surgery decision (yes/no); surgery, including surgery after NAC; adjuvant therapy; chemotherapy and radiation therapy
• Consent: consent to care; consent to research (Yes -> * and **, No -> *); X: no further treatment and research
• Samples: biopsy specimen, surgical specimen, tissue specimen, blood sample; Clinical Lab (pathology, blood test lab) diagnoses are recorded in the OMR
• Research Pathology Lab: DNA and RNA extraction; storage (FFPE, FF)
• Analysis Lab (LPM): DNA sequencing, exome sequencing, OncoScan v3™ (copy number, somatic mutation); gene expression pipeline, OncoScan™ pipeline, SNP chip pipeline, integrative pipeline
• Outcome: case selection, N-BCMDTB result evaluation, translation, identification of targeted therapy, personalized medicine; the OMR holds clinical data and follow-up outcome

Abbreviations: NAC: neoadjuvant chemotherapy; OMR: Online Medical Record; FFPE: formalin-fixed, paraffin-embedded (tissue); FF: fresh frozen (tissue)

IRB Approved Protocol (flowchart)

• Evaluation: treatment decision (Yes -> Undergo surgery)
• Decision: case decision (Yes -> Eligible)
• Consent: getting consent (Yes -> Agreed)
• Clinical Workup: blood test, breast surgery
• Tissue Workup: pathology workup, sample collection (extract DNA/RNA from tissue and blood)
• Analysis (assay): DNA genome sequencing, exome sequencing, copy number and somatic mutation analysis using an array platform (OncoScan)
• Translation: analysis outcome data, discussion at NBCMDTB, identification of targeted therapy and personalized medicine
• Personalized Medicine
• Excluded (No at any step): no surgery, not eligible, disagreed, poor sample; traditional treatment, clinical outcome, clinicopathological characteristic

Breast Cancer Clinical Use of WGA

1. Family and Individual Risk prediction

2. Breast Cancer Tumor Characterization

3. Breast Cancer Diagnosis

4. Breast Cancer Prognosis

5. Prediction of response to targeted therapies

6. Indications of outcome and assessment for future treatment refinement


Breast Cancer Genomic Devices

23andMe* deCODEme* BRACAnalysis* Ambry Genetics* CCDG Panel

SNaPshot MapQuant DX TheraPrint** NexCourse Bca Wash U Panel Target Now

Methyl-Profiler Rotterdam Signature MammoStrat BreastGeneDX Breast Cancer Array OncotypeDX* Breast Cancer Index

OncoScan TargetPrint BluePrint** PAM50* BreastProfile* Her2Pro* MammaPrint

OncoMap3** AsuraSeq-1000** OncoCarta**

Risk Prediction

Research

Prognosis

35 devices reviewed; 26 used clinically

*Associated CPT/CMS codes **Not for clinical use


Clinically Actionable Breast Cancer Information

Data Type: # Unique Entries
• Gene: 773
• SNP: 1733
• Small Insertion: 75
• Small Deletion: 205
• Translocation: 3
• Gene Expression: 383
• Protein Expression: 7
• Amplification: 64
• Deletion: 48
• Total “Clinically” Actionable: 3291

Notes:
• 52 SNPs for risk prediction; 1681 SNPs for prognosis
• Drug target commonly based on gene expression profile
• HER2, Estrogen, Progesterone receptor status
• 9 deletions in BRCA1 or BRCA2 detected by BRACAnalysis confer increased breast cancer risk
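The totals in the table can be checked directly from the per-type counts. The short sketch below sums the entries as listed on the slide and confirms both the overall total and the split of SNPs into risk-prediction and prognosis sets; the variable and function names are illustrative only.

```python
# Unique entries per data type, copied from the slide.
ACTIONABLE_ENTRIES = {
    "Gene": 773,
    "SNP": 1733,
    "Small Insertion": 75,
    "Small Deletion": 205,
    "Translocation": 3,
    "Gene Expression": 383,
    "Protein Expression": 7,
    "Amplification": 64,
    "Deletion": 48,
}

# SNP breakdown quoted next to the table.
SNPS_RISK_PREDICTION = 52
SNPS_PROGNOSIS = 1681


def total_actionable(entries: dict) -> int:
    """Sum the unique entries across all data types."""
    return sum(entries.values())


if __name__ == "__main__":
    print(total_actionable(ACTIONABLE_ENTRIES))
    print(SNPS_RISK_PREDICTION + SNPS_PROGNOSIS == ACTIONABLE_ENTRIES["SNP"])
```

Both checks agree with the slide: the per-type counts add up to 3291, and the two SNP subsets add up to the 1733 SNP entries.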





3. Clinical Whole Genome Analysis (WGA) – Laboratory Test



Clinical WGA Workflow

Elements shown in the diagram:
• Patient Samples
• Next Generation Sequencers
• Bioinformatics Analysis
• Biomedical Report
• Clinical Genomics Interpretation Service
• Clinical Report

Pre-clinical and clinical variant annotation (workflow diagram)

• Data types: DNA-Seq, RNA-Seq, miRNA, Methyl
• Tools: BWA, GATK, Picard; CNV-seq, ReadDepth, Segseq; Tophat, Cufflinks; BLAST; miRNAkey, miRBase; Bismark
• Outputs: SNP/indel, CNV, Gene Exp., miRNA targets, % Gene Methyl
• Downstream: Pathway Analysis, Classification (Tumor, disease), Risk Prediction

Reduced Cost of Next Generation Sequencing (NGS)

• NGS platforms: 5,000 Megabases/day
• Drop in the per-base sequencing cost
• Data on petabyte scale
• NGS analysis involves complex workflows
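To put the 5,000 Megabases/day figure in context, the sketch below estimates how long a single instrument at that rate would take to produce one deeply covered human genome. The genome size (about 3,100 Mb) is an approximate value supplied for this example, and the 60x depth is borrowed from the cost slide later in this talk; neither is stated on this slide, so treat the result as a rough order-of-magnitude illustration.

```python
def days_to_sequence(genome_mb: float,
                     coverage: float,
                     throughput_mb_per_day: float) -> float:
    """Days needed to generate `coverage`-fold data for one genome."""
    total_megabases = genome_mb * coverage
    return total_megabases / throughput_mb_per_day


if __name__ == "__main__":
    # Assumed values: ~3,100 Mb human genome, 60x coverage.
    days = days_to_sequence(genome_mb=3100, coverage=60,
                            throughput_mb_per_day=5000)
    print(f"About {days:.1f} days at 5,000 Mb/day")
```

Under these assumptions the answer is about 37 days, which is why throughput gains matter for any clinical turn-around goal.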


WGA in “Clinical Turn-around” – Future

• Stages: Sample Collection, Sequencing, Analysis, Clinical Action
• Durations shown on the timeline: 12 hours, 500 hours, 12 hours, 40 hours
• Target: < 3 hours, < $100


Current Costs to Run on Amazon Web Services

Details:
• 1 whole genome at 60x
• Spot and Reserved Instances
• Amazon Glacier for long-term storage

Whole Genome Analysis: approximately 1 day, approximately $1500
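Comparing these current figures with the program's stated objective (under 3 hours and under $100) gives the size of the gap to close. The sketch below does that arithmetic, treating "approximately 1 day" as 24 hours; the function name is illustrative.

```python
def required_improvement(current: float, target: float) -> float:
    """Factor by which `current` must shrink to reach `target`."""
    return current / target


if __name__ == "__main__":
    # Current: ~24 hours and ~$1500; objective: < 3 hours and < $100.
    print(f"Speedup needed: more than {required_improvement(24, 3):.0f}x")
    print(f"Cost reduction needed: more than {required_improvement(1500, 100):.0f}x")
```

In other words, the objective calls for roughly an 8-fold speedup together with a 15-fold cost reduction relative to the figures on this slide.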





4. COSMOS: Clinical Whole Genome Analysis on AWS

• AWS
• Applications
• Workflow
• COSMOS



Four approaches to optimize and achieve our Clinical Turn-Around Objective:

• AWS
• Refine and improve WGA applications
• Create a standardized, robust CWGA workflow
• Stabilize a new workflow and distributed computing platform: COSMOS

Clinical Whole Genome Analysis Computational Objective: < 3 hours, < $100


EC2 instances

AMIs

S3 storage

Dynamic cluster: the number and type of instances are adapted to the datasets, jobs, and applications.

• On-demand master(s)
• Load-balanced Spot Instance workers
• BAM files in S3 storage


Optimization: correct type and number of EC2 instances and cluster

Current non-optimized master: CC2.8xlarge
• High memory: single job (BWA) ~10 GB RAM
• High I/O: access to common data files
• Virtualization: HVM for HugePage

Current non-optimized worker: CC2.8xlarge
• High memory: single job (BWA) ~10 GB RAM
• High I/O: access to common data files
• Virtualization: HVM for HugePage
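The memory figure above suggests a simple way to size how many alignment jobs fit on one node. The sketch below takes the 60.5 GB listed earlier for the Cluster Compute 8XL and the ~10 GB per BWA job from this slide; the core count (32) and threads per job (4) are assumptions introduced for illustration, and the function name is hypothetical.

```python
def max_concurrent_jobs(node_memory_gb: float,
                        job_memory_gb: float,
                        node_cores: int,
                        threads_per_job: int) -> int:
    """Jobs that fit on one node, limited by both memory and cores."""
    by_memory = int(node_memory_gb // job_memory_gb)
    by_cores = node_cores // threads_per_job
    return min(by_memory, by_cores)


if __name__ == "__main__":
    # 60.5 GB and ~10 GB per job come from the slides;
    # 32 cores and 4 threads per job are assumed values.
    print(max_concurrent_jobs(60.5, 10, node_cores=32, threads_per_job=4))
```

With these numbers memory is the binding constraint: six jobs fit by memory while eight would fit by cores, which is one reason instance type selection is listed as an optimization target.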


Create stable CWGA AMI(s)

Required applications, libraries, and dependencies:
• Applications (GATK): Samtools, BWA, …
• Human Reference Genome
• Annotation Databases


Optimize: AMI

• Compiler: GCC 4.6+ supports AVX mode; refined GCC parameters
• Compressed libraries: zlib and snappy
• Refined Java parameters for GATK optimization
• Memory: HugePage (2M) configured for every node/application
• Disks: ephemeral: RAID 0; cluster disks: GlusterFS
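As a small illustration of the HugePage setting, the sketch below computes how many 2 MiB pages would be needed to back a job of a given size, such as the ~10 GB BWA job mentioned earlier. Treating that figure as 10 GiB is an assumption, and the function name is hypothetical; how the pages are actually reserved on the nodes is not described in the slides.

```python
import math


def hugepages_needed(job_memory_mib: int, page_size_mib: int = 2) -> int:
    """Number of huge pages required to cover `job_memory_mib`."""
    return math.ceil(job_memory_mib / page_size_mib)


if __name__ == "__main__":
    # ~10 GB BWA job, treated here as 10 GiB = 10 * 1024 MiB.
    print(hugepages_needed(10 * 1024))
```

A 10 GiB job therefore maps to 5,120 pages of 2 MiB instead of about 2.6 million standard 4 KiB pages, which is the reduction in page-table overhead that huge pages are meant to provide.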


S3 storage:
• Storage of BAM files
• Transfer of BAM and other files
• “Checkpoint” after each successful workflow stage
• Backup of intermediate and final results
• Storage of all timings and job information
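The checkpoint idea can be sketched as follows: after each stage succeeds, a marker is written to durable storage, and a restarted run skips any stage whose marker already exists. This is a minimal illustration only. A local directory stands in for the S3 bucket, the stage names follow the three stages named elsewhere in the talk, and the function and marker file names are hypothetical rather than taken from the authors' implementation.

```python
import json
import tempfile
from pathlib import Path

STAGES = ["preparation_alignment", "variant_calling", "annotation"]


def run_with_checkpoints(store, stage_funcs):
    """Run stages in order, skipping those already checkpointed in `store`."""
    log = []
    for name, func in stage_funcs:
        marker = Path(store) / f"{name}.done.json"
        if marker.exists():
            # A previous run already completed this stage.
            log.append(f"skip {name}")
            continue
        result = func()
        # Write the checkpoint only after the stage succeeded.
        marker.write_text(json.dumps({"stage": name, "result": result}))
        log.append(f"ran {name}")
    return log


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as store:
        stages = [(name, lambda name=name: f"{name} output") for name in STAGES]
        print(run_with_checkpoints(store, stages))  # first run executes every stage
        print(run_with_checkpoints(store, stages))  # rerun skips every stage
```

This pattern is what makes Spot workers practical for the pipeline: if a worker is terminated, the rerun resumes from the last completed stage instead of starting over.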


WGA Applications

Genome Analysis Toolkit (GATK) “best practice”.

Source: GATK best practices, BROAD Institute, http://www.broadinstitute.org/gatk/guide/topic?name=best-practices

Stages: Preparation/Alignment, Variant calling, Annotation

[Figure: Applications Parallelization, 5 exomes example; axis from 0 to 600]

Alignment: Burrows-Wheeler Aligner


GenomeKey

Implements GATK “best practices” for variant calling.

Source: GATK best practices, BROAD Institute, http://www.broadinstitute.org/gatk/guide/topic?name=best-practices

Databases Integrated

The_1000g_Febuary_all
NHLBI_Exome_Project_euro
NHLBI_Exome_Project_aa
NHLBI_Exome_Project_all
HGMD_INDEL
HGMD_SNP
COSMIC
GWAS_Catalog
ENCODE_DNaseI_Hypersensitivity
ENCODE_Transcription_Factor
UCSC_Gene
Refseq_Gene
Ensembl_Gene
CCDS_Gene
DrugBank
CytoBank
dbSNP135
TFBS
Segmental_Duplications
RepeatMasker
Self Chain
mirBase
TargetScan
SIFT
PolyPhen2
Mutation_Taster
GERP
PhyloP
LRT
Mce46way
Complete_Genomics_69

Plus support for generic database file formats such as .bed and .gff3
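The two generic formats use different coordinate conventions, which matters when annotation sources are mixed: BED intervals are 0-based and half-open, while GFF3 intervals are 1-based and inclusive. The sketch below converts one BED line to a GFF3 line to show the adjustment. It is not taken from GenomeKey; the function name, the default source and type values, and the example coordinates are placeholders.

```python
def bed_to_gff3_line(bed_line: str,
                     source: str = "example",
                     feature_type: str = "region") -> str:
    """Convert a tab-separated BED line (3 to 6 columns) to a GFF3 line."""
    fields = bed_line.rstrip("\n").split("\t")
    chrom, bed_start, bed_end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else None
    strand = fields[5] if len(fields) > 5 else "."
    # BED start is 0-based; GFF3 start is 1-based. The end value is unchanged
    # because BED's exclusive end equals GFF3's inclusive end.
    gff_start = bed_start + 1
    gff_end = bed_end
    attributes = f"Name={name}" if name else "."
    columns = [chrom, source, feature_type, str(gff_start), str(gff_end),
               ".", strand, ".", attributes]
    return "\t".join(columns)


if __name__ == "__main__":
    # Placeholder interval, for illustration only.
    print(bed_to_gff3_line("chr17\t100\t200\tfeatureA"))
```

The interval covering the same 100 bases is written as 100 to 200 in BED and as 101 to 200 in GFF3; skipping this shift is a common source of off-by-one errors in annotation pipelines.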

Workflow Optimization

• Speed:
– Replace BWA with SNAP (for the same accuracy)
– Re-implement some slow algorithms (e.g. BQSR)
• Accuracy:
– Add quality control steps
– Replace Unified Genotyper with Haplotype Caller


COSMOS

Architecture layers (diagram):
• GenomeKey
• COSMOS (workflow management system): job splitting, job tracking, web interface
• OS & software: grid engine, GlusterFS, MySQL DB, networking
• AWS (instances and storage): EC2 and S3

[Figure: COSMOS Parallelization. Number of jobs (axis from 0 to 1200) for 1 Exome, 5 Exomes, 10 Exomes, and All Runs, broken down by stage: Preparation/Alignment, Variant calling, Annotation]

COSMOS Job Splitting
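The slide does not say how COSMOS splits jobs, so the sketch below shows one common scatter-gather pattern as an assumption: each sample is split by chromosome, the pieces run in parallel, and the outputs are grouped back per sample. The function names are hypothetical and `call_variants` is a placeholder that only returns an output file name.

```python
from concurrent.futures import ThreadPoolExecutor

CHROMOSOMES = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]


def call_variants(sample: str, chrom: str) -> str:
    """Placeholder for a per-chromosome job; returns an output file name."""
    return f"{sample}.{chrom}.vcf"


def split_jobs(samples, chromosomes):
    """Scatter: one independent job per (sample, chromosome) pair."""
    return [(sample, chrom) for sample in samples for chrom in chromosomes]


def scatter_gather(samples, chromosomes, workers: int = 4):
    """Run all split jobs in parallel and gather outputs per sample."""
    jobs = split_jobs(samples, chromosomes)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as `jobs`.
        outputs = list(pool.map(lambda job: call_variants(*job), jobs))
    gathered = {}
    for (sample, _chrom), output in zip(jobs, outputs):
        gathered.setdefault(sample, []).append(output)
    return gathered


if __name__ == "__main__":
    print(len(split_jobs(["exome1", "exome2"], CHROMOSOMES)))
    print(scatter_gather(["exome1"], ["chr1", "chr2"]))
```

Splitting this way turns two exomes into 48 independent jobs, which is the kind of fan-out that lets a cluster of Spot workers stay busy.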


Job Dependency Tracking

PREPARATION / ALIGNMENT

VARIANT CALLING

ANNOTATION
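Dependency tracking between the three stages can be modeled as a small directed acyclic graph in which a job becomes runnable once everything it depends on has finished. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to illustrate the idea; it is not COSMOS code, and the stage identifiers and helper names are chosen for this example.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
DEPENDENCIES = {
    "preparation_alignment": set(),
    "variant_calling": {"preparation_alignment"},
    "annotation": {"variant_calling"},
}


def execution_order(dependencies):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def ready_jobs(dependencies, finished):
    """Stages not yet finished whose dependencies have all finished."""
    return sorted(
        job for job, deps in dependencies.items()
        if job not in finished and deps <= finished
    )


if __name__ == "__main__":
    print(execution_order(DEPENDENCIES))
    print(ready_jobs(DEPENDENCIES, {"preparation_alignment"}))
```

The same structure extends naturally to split jobs: per-chromosome variant calling jobs would each depend on the alignment output for their sample, and annotation would depend on all of them.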


COSMOS Web Interface

PREPARATION / ALIGNMENT

ANNOTATION

VARIANT CALLING


Whole Exome Analysis Pre and Post-Optimization

[Figure: wall time (axis from 0 to 30) before and after optimization for 1 exome, 5 exomes, and 10 exomes. Cost labels in chart order: ~$27, ~$10, ~$48, ~$27, ~$90, ~$47]
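The cost labels can be turned into per-exome figures. The sketch below assumes the six labels pair up as before/after for 1, 5, and 10 exomes in that order (~$27 and ~$10, ~$48 and ~$27, ~$90 and ~$47); that pairing is a reading of the chart layout, not something the slide states explicitly, and the names are illustrative.

```python
# Assumed pairing of the chart's cost labels: exomes -> (before, after) in dollars.
COSTS = {1: (27, 10), 5: (48, 27), 10: (90, 47)}


def per_exome_cost(total_cost: float, exomes: int) -> float:
    """Average cost per exome for a batch."""
    return total_cost / exomes


def savings_percent(before: float, after: float) -> float:
    """Relative cost reduction from optimization, in percent."""
    return 100 * (before - after) / before


if __name__ == "__main__":
    for exomes, (before, after) in COSTS.items():
        print(f"{exomes} exome(s): ${per_exome_cost(after, exomes):.2f} per exome "
              f"after optimization, {savings_percent(before, after):.0f}% saved")
```

Under this reading, batching matters as much as optimization: the optimized cost falls from about $10 for a single exome to about $4.70 per exome in a batch of ten.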



Acknowledgments

• LPM (Tonellato): Erik Gafni (InVitae), Vince Fusaro (InVitae), Jared B. Hawkins, Ryan Powles, Yassine Souilmi
• Wall lab (Harvard & Stanford University): Jae-Yoon Jung, Alex Lancaster, David Tulga
• Autism Speaks: 6000 Exomes (current), 10,000 Genomes
• Ancient Human Genomes: David Reich
