A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308) | AWS re:Invent 2013
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Jafar Shameem
Amazon Web Services
November 14, 2013
The Problem and Promise of Translational Genetics and a
Step to the Clouded Solution of Scalable Clinical Whole
Genome Sequencing
Agenda
• Bio-Informatics and Amazon Web Services
• Examples of collaboration
• Building Blocks
– Compute
– Storage
– Tools
– Pricing Models
A rich history of collaboration with Life Sciences organizations
• A centralized repository of public datasets
• Seamless integration with cloud based applications
• No charge to the community
• Some of the datasets available today:
– 1000 Genomes Project
– Human Microbiome Project
– Ensembl
– GenBank
– Illumina – Jay Flatley Human Genome Dataset
– YRI Trio Dataset
– The Cannabis Sativa Genome
– UniGene
– Influenza Virus
– PubChem
• Tell us what else you’d like for us to host …
AWS Public Data Sets
CHARGE Consortium
- aimed at better understanding how human genetics contributes to heart disease
and aging
DNANexus
Baylor College of Medicine
Understanding how human genetics contributes to heart disease and aging
Instance types (memory, EC2 Compute Units, virtual cores or other notes):
• Small: 1.7 GB, 1 EC2 Compute Unit, 1 virtual core
• Micro: 613 MB, up to 2 ECUs
• Large: 7.5 GB, 4 EC2 Compute Units, 2 virtual cores
• Extra Large: 15 GB, 8 EC2 Compute Units, 4 virtual cores
• Hi-Mem XL: 17.1 GB, 6.5 EC2 Compute Units, 2 virtual cores
• Hi-Mem 2XL: 34.2 GB, 13 EC2 Compute Units, 4 virtual cores
• Hi-Mem 4XL: 68.4 GB, 26 EC2 Compute Units, 8 virtual cores
• High-CPU Med: 1.7 GB, 5 EC2 Compute Units, 2 virtual cores
• High-CPU XL: 7 GB, 20 EC2 Compute Units, 8 virtual cores
• Cluster GPU 4XL: 22 GB, 33.5 EC2 Compute Units, 2 x NVIDIA Tesla “Fermi” M2050 GPUs
• Cluster Compute 4XL: 23 GB, 33.5 EC2 Compute Units
• Medium: 3.7 GB, 2 EC2 Compute Units, 1 virtual core
• High I/O 4XL: 60.5 GB, 35 EC2 Compute Units, 2*1024 GB SSD-based local instance storage
• High Storage 8XL: 117 GB, 35 EC2 Compute Units, 24 * 2 TB instance store
• Cluster High Mem 8XL: 89 EC2 Compute Units, 244 GB SSD instance storage
• Cluster Compute 8XL: 60.5 GB, 88 EC2 Compute Units
Compute
• Relational Database Service: Fully managed database (MySQL, Oracle, MSSQL)
• DynamoDB: NoSQL, schemaless, provisioned throughput database
• S3: Object datastore, up to 5TB per object, 99.999999999% durability
• SimpleDB: NoSQL, schemaless, smaller datasets
• Redshift: Petabyte scale data warehousing service, fully managed
Storage
• GATK
• NCBI BLAST
• Crossbow
• CloudBurst
• Myrna
• Clovr
• BioPerl Max
• VIPDAC
• Superfamily
• Cloud-Coffee
• BioNimbus
• GMOD
• CloudAligner
• BioConductor
• QIIME
• SNAP
• BWA
• Bowtie/TopHat/Cufflinks
• STAR, GSNAP, RUM
Get links to AMIs at:
https://github.com/mndoci/mndoci.github.com/wiki/Life-Science-Apps-on-AWS
MIT StarCluster Galaxy CloudMan Rocks
Torque Slurm Condor
Chef Puppet SaltStack
Tools of the trade
Many purchase models to support different needs
• On-Demand: Pay for compute capacity by the hour with no long-term commitments. For spiky workloads, or to define needs.
• Reserved: Make a low, one-time payment and receive a significant discount on the hourly charge. For committed utilization.
• Spot: Bid for unused capacity, charged at a Spot Price which fluctuates based on supply and demand. For time-insensitive or transient workloads.
• Dedicated: Launch instances within Amazon VPC that run on hardware dedicated to a single customer. For highly sensitive or compliance-related workloads.
• Free Tier: Get started on AWS with free usage and no commitment. For POCs and getting started.
Ideal Applications
Batch Processing
Time-Delayable
Fault-Tolerant or Restartable
Compute-Intensive
Horizontally Scalable
Stateless Worker Nodes
Region and AZ Independent
Uses Deployment Automation
Less Ideal Applications
Interactive
Strict/Tight SLA for Completion
Expensive to Handle Terminations
Data-Intensive
In-Memory Scaling
Long-Running Worker Nodes (weeks)
Requires a Single AZ
Manually Launched and Managed
How to use Spot?
Stanford University
Tractable, scalable, and economical processing of clinical whole genome sequences in AWS
Clinical Genomics for Cancer Diagnosis
Amazon Web Services Re-Invent 2013
Nov 14th, 2013 Las Vegas, NV
Peter J. Tonellato, PhD Harvard Medical School
Dennis P. Wall, PhD* Stanford University*
Objective: The Whole Genome Breast Cancer Program (WGBC) aims to demonstrate the clinical utility and value of whole genome analysis for practical breast cancer detection, diagnosis, prognosis, and improved outcomes.
WGA in Clinical Turn-Around: Demonstrate the use of Amazon Web Services to establish Clinical Whole Genome Analysis in “clinical turn-around”: WG NGS Sequence to Actionable Health Care Information
Clock time: < 3 hours
Cost: < $100
Whole Genome Breast Cancer Program
1. Organization and Progress to date
2. Historical BIDMC Breast Cancer cases
3. Clinical Whole Genome Analysis – Laboratory Test
4. COSMOS: Clinical Whole Genome Analysis on AWS
Whole Genome Breast Cancer Program
Program organization (* = hiring)
• Oversight: External Advisory Board, Clinical Executive Committee, Cancer Center, Regulatory Affairs
• Program Coordinator: Michiyo; Research Assistant* (Emily Poles?); Technician*
• N-MDBCTB
• Discipline leads: Bioinformatics (Peter & Dennis), Surgery (Mike), Oncology (Gerburg), Pathology (Stu), Radiation Oncology (Abram), Imaging (Tejas), Genetic Counseling (Jill), Genetics (Nadine), Social Work (Barbara)
• LPM: Sheida, Latrice, Jared, Yassine, Val, Michiyo
• Case management: Case Identification, Case review (N-MDBCTB), Consent (RN), Clinical Data Management, Tissue Collection, Sample Management, Follow-up Assay/Bioinformatics
• Assay: Preparation/Storage, DNA extraction/purification, Sample delivery (to outsource), Whole genome sequencing, OncoScan v3 (BI), DNA/RNA/NGS sequencing (outsource)
• Bioinformatics: Data Transfer, Genome Data Integration/Management, Annotation, Analysis, Translation, Case Evidence Report
• Translation: Data Integration/Management, Case report (N-MDBCTB)
BIDMC Breast Cancer Patient/Sample Process
[Flowchart. Legend: patient flow, case identification workflow, sample workflow, analysis workflow. Recoverable labels:]
• Clinical steps: BIDMC Clinic (Oncology, Surgery, Radiation Oncology); Clinical Evaluation; Diagnosis Work-up (MMG, US, MRI, Biopsy with Immunohistochemistry and FISH); Blood test; Biopsy; Presentation Case and Schedule; Case Evaluation BCMDTB; Surgery? (Yes/No); Surgery; Surgery After NAC; Adjuvant Therapy; Chemotherapy; Radiation therapy
• Labs: Clinical Lab (Pathology, Blood test lab); Research Pathology Lab; Analysis Lab (LPM)
• Samples: Biopsy specimen; Surgical specimen; Tissue specimen; Tissue sample; Blood sample; FFPE; FF; DNA, RNA Extraction; Storage (FFPE, FF)
• Consent: Consent to care; Consent to Research (Yes -> * and **; No -> *)
• Records: OMR (Clinical Data, Follow up Outcome); Diagnoses
• Tracks: Clinical; Research
• Assays: DNA Sequencing; Exome sequencing; OncoScan v3™ Copy number; Somatic Mutation
• Pipelines: Gene expression pipeline; OncoScan™ pipeline; SNP Chip pipeline; Integrative pipeline
• Outcome: Case selection; N-BCMDTB Result Evaluation; Translation; Identification of Targeted Therapy; Personalized Medicine
• X: No further treatment and research
Abbreviations: NAC: Neoadjuvant chemotherapy. OMR: Online Medical Record. FFPE: Formalin-Fixed, Paraffin-Embedded (tissue). FF: fresh frozen (tissue).
IRB Approved Protocol
[Flowchart. Recoverable stages and labels:]
• Stages: Clinical Workup, Evaluation, Decision, Consent, Tissue Workup, Assay, Analysis, Translation, Personalized Medicine
• Clinical Workup: Blood Test, Breast Surgery
• Evaluation, Decision, Consent: Treatment Decision, Case Decision, Getting consent
• Tissue Workup: Pathology Workup, Sample Collection (Extract DNA/RNA from Tissue and Blood)
• Assay: DNA genome sequencing, Exome sequencing, Copy number and somatic mutations analysis using an array platform (OncoScan)
• Analysis, Translation, Personalized Medicine: Analysis outcome data, Discussion at NBCMDTB, Identification of Targeted Therapy and Personalized Medicine
• Branches: Yes -> Undergo surgery; Yes -> Eligible; Yes -> Agreed; No -> Excluded (No surgery, Not eligible, Disagreed, Poor sample)
• Other labels: Clinical Outcome, Traditional Treatment, Clinicopathological Characteristic
Breast Cancer Clinical Use of WGA
1. Family and Individual Risk prediction
2. Breast Cancer Tumor Characterization
3. Breast Cancer Diagnosis
4. Breast Cancer Prognosis
5. Prediction of response to targeted
therapies
6. Indications of outcome and assessment for
future treatment refinement
Breast Cancer Genomic Devices
23andMe* deCODEme* BRACAnalysis* Ambry Genetics* CCDG Panel
SNaPshot MapQuant DX TheraPrint** NexCourse Bca Wash U Panel Target Now
Methyl-Profiler Rotterdam Signature MammoStrat BreastGeneDX Breast Cancer Array OncotypeDX* Breast Cancer Index
OncoScan TargetPrint BluePrint** PAM50* BreastProfile* Her2Pro* MammaPrint
OncoMap3** AsuraSeq-1000** OncoCarta**
Risk Prediction
Research
Prognosis
35 devices reviewed; 26 used clinically
*Associated CPT/CMS codes **Not for clinical use
Clinically Actionable Breast Cancer Information
Data Type: # Unique Entries
Gene: 773
SNP: 1733
Small Insertion: 75
Small Deletion: 205
Translocation: 3
Gene Expression: 383
Protein Expression: 7
Amplification: 64
Deletion: 48
Total “Clinically” Actionable: 3291
• 52 SNPs for risk prediction; 1681 SNPs for prognosis
• Drug target commonly based on gene expression profile
• HER2, Estrogen, Progesterone receptor status
• 9 Deletions in BRCA1 or BRCA2 detected by BRACAnalysis confer increased breast cancer risk
Patients Samples
Next Generation Sequencers
Biomedical Report
Clinical Genomics Interpretation Service
Clinical Report
Clinical WGA Workflow
Bioinformatics Analysis
Pre-clinical and clinical variant annotation
[Pipeline diagram. Recoverable labels:]
• Inputs: DNA-Seq, RNA-Seq, miRNA, Methyl
• BWA, GATK, Picard: SNP/indel
• CNV-seq, ReadDepth, Segseq: CNV
• Tophat, Cufflinks: Gene Exp.
• BLAST, miRNAkey, miRBase: miRNA targets
• Bismark: % Gene Methyl
• Downstream: Pathway Analysis; Classification (Tumor, disease); Risk Prediction
• NGS platforms: 5,000 Megabases/day
• Drop in the per-base sequencing cost
• Data on petabyte scale
• NGS analysis involves complex workflows
Reduced Cost of Next Generation Sequencing (NGS)
WGA in “Clinical Turn-around” – Future
Sample Collection
Sequencing
Analysis
Clinical Action
12 hours
500 hours
12 hours
40 hours < 3 hours < $100
Current Costs to Run on Amazon Web Services
Details:
• 1 Whole Genome, 60x
• Spot and Reserved Instances
• Utilizing Amazon Glacier for long-term storage
Whole Genome Analysis:
• Approximately 1 day
• Approximately $1500
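The distance between the current figures and the stated objective (< 3 hours, < $100) can be put as a pair of ratios. This is only a back-of-the-envelope sketch: it assumes "approximately 1 day" means 24 hours, and the function name is illustrative.

```python
def required_improvement(current, target):
    """Return the factor by which `current` must shrink to reach `target`."""
    if target <= 0:
        raise ValueError("target must be positive")
    return current / target


# Figures from the slides; 24 hours is an assumed reading of "approximately 1 day".
speedup_needed = required_improvement(current=24.0, target=3.0)
cost_cut_needed = required_improvement(current=1500.0, target=100.0)

print(f"Wall time must drop by at least {speedup_needed:.0f}x")
print(f"Cost must drop by at least {cost_cut_needed:.0f}x")
```

Under these assumptions the analysis needs to run at least 8 times faster and cost at least 15 times less.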
1. Organization and Progress to date
2. Historical BIDMC Breast Cancer cases
3. Clinical Whole Genome Analysis – Laboratory Test
4. COSMOS: Clinical Whole Genome Analysis on AWS
• AWS
• Applications
• Workflow
• COSMOS
Whole Genome Breast Cancer Program
Four approaches to optimize and achieve our
Clinical Turn-Around Objective:
• AWS
• Refine and Improve WGA Applications
• Create a Standardized, Robust CWGA Workflow
• Stabilize a new Workflow and Distributive Computing Platform: COSMOS
Clinical Whole Genome Analysis Computational Objective:
< 3 hours < $100
AWS building blocks: EC2 instances, AMIs, S3 storage
Dynamic cluster with the number and type of instances adapted to data sets, jobs, and applications.
[Diagram labels: BAM files; On-demand Master(s); Load Balanced Spot Instance Workers]
EC2 instances
Optimization: Correct type and number of EC2s and cluster
Current non-optimized Master: CC2.8xlarge
• High Memory: Single job (BWA) ~ 10GB RAM
• High IO: Access to common data files
• Virtualization: HVM for HugePage
Current non-optimized Worker: CC2.8xlarge
• High Memory: Single job (BWA) ~ 10GB RAM
• High IO: Access to common data files
• Virtualization: HVM for HugePage
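Choosing the "correct type and number" of instances can start from memory alone. The sketch below is a rough illustration, not the authors' sizing method: it uses the ~10 GB per BWA job figure from this slide and the 60.5 GB listed for Cluster Compute 8XL on the Compute slide, and it ignores CPU, I/O, and operating-system overhead unless a reserve is passed in. Function names are hypothetical.

```python
import math


def jobs_per_node(node_mem_gb, job_mem_gb, reserve_gb=0.0):
    """How many jobs fit in one node's memory, after an optional reserve."""
    return int((node_mem_gb - reserve_gb) // job_mem_gb)


def nodes_needed(n_jobs, node_mem_gb, job_mem_gb, reserve_gb=0.0):
    """Smallest number of nodes that can hold n_jobs concurrently."""
    per_node = jobs_per_node(node_mem_gb, job_mem_gb, reserve_gb)
    if per_node < 1:
        raise ValueError("a single job does not fit on this node type")
    return math.ceil(n_jobs / per_node)


# 60.5 GB node, ~10 GB per BWA job: six jobs fit per node by memory.
print(jobs_per_node(60.5, 10.0))
# 100 concurrent jobs would then need 17 such nodes.
print(nodes_needed(100, 60.5, 10.0))
```

A real sizing pass would also cap jobs per node by core count and disk throughput, which are not modeled here.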
AMIs
Create stable CWGA AMI(s)
Required applications, libraries, and dependencies:
• Applications (GATK): Samtools, BWA, …
• Human Reference Genome
• Annotation Databases
AMIs
Optimize: AMI
• Compiler: GCC 4.6+ supports AVX mode
• Refined GCC parameters
• Compression libraries: zlib and snappy
• Refined Java parameters for GATK optimization
• Memory: HugePage (2M) configured for every node/application
• Disks: Ephemeral: RAID 0; Cluster Disks: GlusterFS
S3 storage
• Storage of BAM files
• Transfer of BAM and other files
• “checkpoint” after each successful workflow stage
• Backup of intermediate and final results
• Storage of all timings and job information
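The "checkpoint after each successful workflow stage" idea can be sketched as follows. This is a minimal illustration, not the COSMOS implementation: a local directory stands in for the S3 bucket, and the marker file layout, function names, and `fail_at` parameter are all hypothetical. Each marker also records the stage timing, in the spirit of the last bullet.

```python
import json
import os
import time


def marker_path(store_dir, stage):
    """Location of the checkpoint marker for one stage."""
    return os.path.join(store_dir, stage + ".done.json")


def run_workflow(stages, store_dir, fail_at=None):
    """Run stages in order, skipping those that already have a checkpoint.

    stages: list of (name, callable) pairs.
    fail_at: optional stage name at which to simulate losing the worker.
    Returns the names of the stages executed in this invocation.
    """
    os.makedirs(store_dir, exist_ok=True)
    executed = []
    for name, func in stages:
        if os.path.exists(marker_path(store_dir, name)):
            continue  # checkpoint found: stage finished in an earlier run
        if name == fail_at:
            raise RuntimeError("simulated worker loss during " + name)
        start = time.time()
        func()
        # Write the checkpoint only after the stage succeeded.
        with open(marker_path(store_dir, name), "w") as fh:
            json.dump({"stage": name, "seconds": time.time() - start}, fh)
        executed.append(name)
    return executed
```

After an interruption, calling `run_workflow` again with the same `store_dir` resumes at the first stage without a marker, which is what makes interruptible Spot workers tolerable for this kind of pipeline.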
WGA Applications: Genome Analysis Toolkit (GATK) “best practice”.
Source: GATK best practices, Broad Institute, http://www.broadinstitute.org/gatk/guide/topic?name=best-practices
Stages: Preparation/Alignment, Variant calling, Annotation
Applications Parallelization: 5 exomes example
[Chart: two bars labeled "5 exome", y-axis from 0 to 600, broken down by stage: Preparation/Alignment, Variant calling, Annotation.]
Alignment: Burrows-Wheeler Aligner
GenomeKey: Implements GATK "best practices" for variant calling.
Source: GATK best practices, Broad Institute, http://www.broadinstitute.org/gatk/guide/topic?name=best-practices
Stages: Preparation/Alignment, Variant calling, Annotation
Databases Integrated
The_1000g_Febuary_all
NHLBI_Exome_Project_euro
NHLBI_Exome_Project_aa
NHLBI_Exome_Project_all
HGMD_INDEL
HGMD_SNP
COSMIC
GWAS_Catalog
ENCODE_DNaseI_Hypersensitivity
ENCODE_Transcription_Factor
UCSC_Gene
Refseq_Gene
Ensembl_Gene
CCDS_Gene
DrugBank
CytoBank
dbSNP135
TFBS
Segmental_Duplications
RepeatMasker
Self Chain
mirBase
TargetScan
SIFT
PolyPhen2
Mutation_Taster
GERP
PhyloP
LRT
Mce46way
Complete_Genomics_69
Plus support for generic database file formats such as .bed and .gff3
Workflow Optimization
• Speed:
• Replacing BWA with SNAP (for the same accuracy)
• Re-implement some slow algorithms (e.g. BQSR)
• Accuracy:
• Add additional quality control steps
• Replacing Unified Genotyper with Haplotype Caller
COSMOS
[Architecture diagram. Layers, top to bottom:]
• GenomeKey: Preparation/Alignment, Variant calling, Annotation
• COSMOS workflow management system: Job splitting, Job tracking, Web interface
• OS & Software: Grid engine, Gluster FS, MySQL DB, Networking
• AWS: Instances and Storage (EC2 and S3)
COSMOS Parallelization
[Chart: number of jobs (y-axis, 0 to 1200) for 1 Exome, 5 Exomes, 10 Exomes, and All Runs, broken down by stage: Preparation/Alignment, Variant calling, Annotation.]
COSMOS Job Splitting
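The slides do not say how COSMOS divides the work, so the following is only one plausible illustration of job splitting: cutting each chromosome into fixed-size intervals that can be processed as independent jobs. The function name, the interval convention (1-based, inclusive), and the toy chromosome lengths are assumptions made for the example.

```python
def split_into_jobs(chromosome_lengths, chunk_size):
    """Cut each chromosome into intervals of at most chunk_size bases.

    Returns a list of (chromosome, start, end) tuples, 1-based inclusive.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    jobs = []
    for chrom, length in chromosome_lengths.items():
        start = 1
        while start <= length:
            end = min(start + chunk_size - 1, length)
            jobs.append((chrom, start, end))
            start = end + 1
    return jobs


# Toy lengths, not real chromosome sizes.
toy_genome = {"chrA": 250, "chrB": 100}
for job in split_into_jobs(toy_genome, chunk_size=100):
    print(job)
```

Smaller chunks mean more jobs and better load balancing across workers, at the cost of more scheduling overhead and more partial outputs to merge.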
Job Dependency Tracking
PREPARATION / ALIGNMENT
VARIANT CALLING
ANNOTATION
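Dependency tracking across the three stages can be illustrated with a small directed acyclic graph. This sketch uses Python's standard `graphlib` module (Python 3.9+) and is not the COSMOS scheduler; it assumes, for simplicity, that each sample passes through the stages independently, and the job naming is hypothetical.

```python
from graphlib import TopologicalSorter

STAGES = ("preparation_alignment", "variant_calling", "annotation")


def build_dag(samples):
    """Map each (stage, sample) job to the set of jobs it depends on."""
    dag = {}
    for sample in samples:
        previous = None
        for stage in STAGES:
            job = (stage, sample)
            dag[job] = {previous} if previous is not None else set()
            previous = job
    return dag


def execution_waves(dag):
    """Group jobs into waves whose dependencies are all satisfied."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave finished
    return waves


for wave in execution_waves(build_dag(["sample1", "sample2"])):
    print(wave)
```

Jobs within one wave have no mutual dependencies, so they are the ones a scheduler could hand to separate worker nodes at the same time.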
COSMOS Web Interface
PREPARATION / ALIGNMENT
ANNOTATION
VARIANT CALLING
Whole Exome Analysis Pre and Post-Optimization
[Bar chart: wall time (y-axis, 0 to 30) before and after optimization for 1, 5, and 10 exomes. Cost labels in chart order: 1 exome ~$27 before, ~$10 after; 5 exomes ~$48 before, ~$27 after; 10 exomes ~$90 before, ~$47 after.]
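The cost labels on this chart can be turned into per-exome costs and relative savings. The sketch below assumes the labels pair up in the order they appear (before, then after, for 1, 5, and 10 exomes); that pairing is a reading of the chart, not something the slide states outright.

```python
# Approximate run costs in USD, keyed by number of exomes (assumed pairing).
BEFORE = {1: 27, 5: 48, 10: 90}
AFTER = {1: 10, 5: 27, 10: 47}


def summarize(before, after):
    """Return per-run savings and post-optimization cost per exome."""
    rows = []
    for n in sorted(before):
        rows.append({
            "exomes": n,
            "after_per_exome": after[n] / n,
            "saving_fraction": 1 - after[n] / before[n],
        })
    return rows


for row in summarize(BEFORE, AFTER):
    print(f"{row['exomes']:>2} exomes: "
          f"${row['after_per_exome']:.2f} per exome after optimization, "
          f"{row['saving_fraction']:.0%} cheaper than before")
```

Read this way, batching lowers the post-optimization cost per exome from about $10 for a single exome to under $5 at ten exomes.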
Acknowledgments
LPM (Tonellato): Erik Gafni (InVitae), Vince Fusaro (InVitae), Jared B. Hawkins, Ryan Powles, Yassine Souilmi
Wall lab (Harvard & Stanford University): Jae-Yoon Jung, Alex Lancaster, David Tulga
Autism Speaks: 6000 Exomes (current), 10,000 Genomes
Ancient Human Genomes: David Reich