Invited cloud-e-Genome project talk at 2015 NGS Data Congress
Transcript of Invited cloud-e-Genome project talk at 2015 NGS Data Congress
![Page 1: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/1.jpg)
NGS Data Congress, London, June 2015, P. Missier
Scalable WES Processing And Variant Interpretation With Provenance Recording
Using Workflow On The Cloud
Paolo Missier, Jacek Cała, Yaobo Xu,
Eldarina Wijaya, Ryan Kirby
School of Computing Science and Institute of Genetic Medicine
Newcastle University, Newcastle upon Tyne, UK
NGS Data Congress
London, June 15th, 2015
![Page 2: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/2.jpg)
The Cloud-e-Genome project at Newcastle
Objectives:
1. NGS data processing:
• Implement a flexible WES/WGS pipeline
• Scalable deployment over a public cloud
2. Traceable variant interpretation:
• Design a simple-to-use tool to facilitate clinical diagnosis by clinicians
• Maintain history of past investigations for analytical purposes
With an aim to:
• Cost control
• Scalability
• Flexibility: of design, of maintenance
• Ensure accountability through traceability
• Enable analytics over past patient cases
Project:
• 2-year pilot project: 2013-2015
• Funded by UK’s National Institute for Health Research (NIHR)
• Cloud resources from Azure for Research Award
![Page 3: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/3.jpg)
Part I: data processing
Objectives:
• Design and implement a flexible WES/WGS pipeline
• Using workflow technology (high-level programming)
• Providing scalable deployment over a public cloud
![Page 4: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/4.jpg)
Scripted NGS data processing pipeline
• Alignment: aligns sample sequence to HG19 reference genome using BWA aligner
• Cleaning, duplicate elimination: Picard tools
• Recalibration: corrects for system bias on quality scores assigned by sequencer (GATK)
• Coverage: computes coverage of each read
• Variant calling: operates on multiple samples simultaneously; splits samples into chunks; Haplotype caller detects both SNVs and longer indels
• Variant recalibration: attempts to reduce false positive rate from caller
• VCF subsetting by filtering, e.g. non-exomic variants
• Annotation: Annovar functional annotations (e.g. MAF, synonymity, SNPs…) followed by in-house annotations
![Page 5: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/5.jpg)
Scripts to workflow - Design
Design | Cloud Deployment | Execution | Analysis
Theoretical advantages of using a workflow programming model:
• Better abstraction
• Easier to understand, share, maintain
• Better exploit data parallelism
• Extensible by wrapping new tools
![Page 6: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/6.jpg)
Workflow Design
echo Preparing directories $PICARD_OUTDIR and $PICARD_TEMP
mkdir -p $PICARD_OUTDIR
mkdir -p $PICARD_TEMP
echo Starting PICARD to clean BAM files...
$Picard_CleanSam INPUT=$SORTED_BAM_FILE OUTPUT=$SORTED_BAM_FILE_CLEANED
echo Starting PICARD to remove duplicates...
$Picard_NoDups INPUT=$SORTED_BAM_FILE_CLEANED OUTPUT=$SORTED_BAM_FILE_NODUPS_NO_RG \
    METRICS_FILE=$PICARD_LOG REMOVE_DUPLICATES=true ASSUME_SORTED=true
echo Adding read group information to bam file...
$Picard_AddRG INPUT=$SORTED_BAM_FILE_NODUPS_NO_RG OUTPUT=$SORTED_BAM_FILE_NODUPS RGID=$READ_GROUP_ID RGPL=illumina RGSM=$SAMPLE_ID \
    RGLB="${SAMPLE_ID}_${READ_GROUP_ID}" RGPU="platform_Unit_${SAMPLE_ID}_${READ_GROUP_ID}"
echo Indexing bam files...
samtools index $SORTED_BAM_FILE_NODUPS
Diagram labels: “Wrapper” blocks, Utility blocks
![Page 7: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/7.jpg)
Workflow design
Conceptual:
Actual:
![Page 8: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/8.jpg)
Anatomy of a complex parallel dataflow
e-Science Central: simple dataflow model…
Sample-split: parallel processing of samples in a batch
![Page 9: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/9.jpg)
Anatomy of a complex parallel dataflow
… with hierarchical structure
![Page 10: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/10.jpg)
Phase II, top level
Chromosome-split: parallel processing of each chromosome across all samples
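A minimal sketch of the chromosome-split idea, assuming a toy tab-separated input (sample, chromosome, position): records from all samples are regrouped into one file per chromosome, so that each chromosome can then be processed independently. The function name `split_by_chromosome` and the file layout are illustrative assumptions, not part of the e-Science Central workflow.

```shell
#!/usr/bin/env bash
# Toy chromosome split. Assumed input format (not from the talk):
#   sample<TAB>chromosome<TAB>position

# split_by_chromosome INPUT OUTDIR
# Writes OUTDIR/<chromosome>.tsv, each holding that chromosome's
# records from all samples.
split_by_chromosome() {
  local input=$1 outdir=$2
  awk -F'\t' -v dir="$outdir" '{ print > (dir "/" $2 ".tsv") }' "$input"
}

# Demo on three made-up records.
workdir=$(mktemp -d)
mkdir "$workdir/by_chrom"
printf 'sampleA\tchr1\t100\nsampleB\tchr1\t250\nsampleA\tchr2\t300\n' > "$workdir/all.tsv"
split_by_chromosome "$workdir/all.tsv" "$workdir/by_chrom"
ls "$workdir/by_chrom"   # one file per chromosome seen in the input
```

Each per-chromosome file mixes records from every sample, which is what lets a later step work on one chromosome across the whole batch.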
![Page 11: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/11.jpg)
Phase III
Sample-split: parallel processing of samples
![Page 12: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/12.jpg)
Implicit parallelism in the pipeline
[Diagram: three-stage pipeline]
Stage I (per-sample parallel processing): align-clean-recalibrate-coverage, one branch per sample (Sample 1 … Sample n)
Chromosome split
Stage II (per-chromosome parallel processing): variant calling, recalibration
Stage III (per-sample parallel processing): variant filtering, annotation
How does the workflow design exploit this parallelism?
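One way to picture the answer is a fan-out per stage with a barrier between stages. The sketch below assumes placeholder tasks that only append a label to a log file; `task`, `fan_out` and the shortened sample and chromosome lists are hypothetical, and real tasks would be workflow invocations rather than shell jobs.

```shell
#!/usr/bin/env bash
# Placeholder task: appends its label to a log file.
log=$(mktemp)
task() { echo "$1" >> "$log"; }

# fan_out STAGE ITEM...
# Starts one background task per item, then waits for all of them.
# The wait acts as the barrier between consecutive stages.
fan_out() {
  local stage=$1 item
  shift
  for item in "$@"; do
    task "${stage}:${item}" &
  done
  wait
}

samples=(sample1 sample2 sample3)
chroms=(chr1 chr2 chrX)          # shortened list, for illustration only

fan_out stage1 "${samples[@]}"   # align-clean-recalibrate-coverage, per sample
fan_out stage2 "${chroms[@]}"    # variant calling, recalibration, per chromosome
fan_out stage3 "${samples[@]}"   # variant filtering, annotation, per sample
cat "$log"
```

Within a stage the order of log lines may vary from run to run, but no Stage II line can appear before every Stage I task has finished.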
![Page 13: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/13.jpg)
Parallel processing over a batch of exomes
![Page 14: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/14.jpg)
Cloud Deployment
Design | Cloud Deployment | Execution | Analysis
• Scalability
• Fewer installation/deployment requirements, staff hours required
• Automated dependency management, packaging
• Configurable to make most efficient use of a cluster
![Page 15: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/15.jpg)
Workflow on Azure Cloud – modular configuration
[Architecture diagram] Components shown:
• Clients: web browser, rich client app (via REST API and Web UI)
• e-Science Central main server (Azure VM), with JMS queue
• e-SC db backend (Azure VM)
• e-SC blob store (Azure Blob store)
• Workflow engines: 3 worker roles
• Flows between components: workflow invocations, e-SC control data, workflow data
Workflow engines module configuration: 3 nodes, 24 cores
Modular architecture indefinitely scalable!
![Page 16: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/16.jpg)
Scripts to workflow
Design | Cloud Deployment | Execution | Analysis
3. Execution
• Runtime monitoring
• Provenance collection
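As an illustration of provenance collection at execution time, the sketch below records one line per executed step: step name, checksum of the input, checksum of the output, and exit status. The log format and the `run_traced` helper are assumptions made for this example; they are not the provenance model used by e-Science Central, and `sort` and `uniq` stand in for real pipeline tools.

```shell
#!/usr/bin/env bash
# Hypothetical provenance log, one tab-separated line per step:
#   step name, input checksum, output checksum, exit status
prov_log=$(mktemp)

# POSIX cksum on stdin prints "<crc> <size>"; keep the CRC only.
checksum() { cksum < "$1" | cut -d' ' -f1; }

# run_traced STEP INPUT OUTPUT CMD [ARGS...]
# CMD must read its input on stdin and write its result to stdout.
run_traced() {
  local step=$1 input=$2 output=$3 status
  shift 3
  "$@" < "$input" > "$output"
  status=$?
  printf '%s\t%s\t%s\t%s\n' "$step" "$(checksum "$input")" \
    "$(checksum "$output")" "$status" >> "$prov_log"
  return "$status"
}

# Demo with stand-in tools.
workdir=$(mktemp -d)
printf 'b\na\na\n' > "$workdir/raw.txt"
run_traced sort  "$workdir/raw.txt"    "$workdir/sorted.txt" sort
run_traced dedup "$workdir/sorted.txt" "$workdir/unique.txt" uniq
cat "$prov_log"
```

Because the output checksum of one step reappears as the input checksum of the next, the log lines can be chained back into a derivation path for any output file.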
![Page 17: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/17.jpg)
Performance
3 workflow engines perform better than our HPC benchmark on larger sample sizes
Technical configurations for the 3-VM experiments:
HPC cluster (dedicated nodes): 3 × 8-core compute nodes, Intel Xeon E5640, 2.67 GHz CPU, 48 GiB RAM, 160 GB scratch space
Azure workflow engines: D13 VMs, each with an 8-core CPU, 56 GiB of memory and 400 GB SSD, running Ubuntu 14.04
![Page 18: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/18.jpg)
Scalability
There is little incentive to grow the VM pool beyond 6 engines
![Page 19: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/19.jpg)
Cost
Again, a 6-engine configuration achieves near-optimal cost/sample
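The cost/sample figure can be reasoned about with a simple formula, sketched here under the assumption that billing is linear in engine-hours and that storage and data transfer are ignored. All input numbers below are placeholders chosen for the example; none of them come from the talk.

```shell
#!/usr/bin/env bash
# cost_per_sample ENGINES PRICE_PER_ENGINE_HOUR HOURS SAMPLES
# Assumes cost = engines * hourly price * run time, shared equally by the samples.
cost_per_sample() {
  awk -v e="$1" -v p="$2" -v h="$3" -v n="$4" \
    'BEGIN { printf "%.2f\n", e * p * h / n }'
}

# Placeholder inputs: 6 engines, 1.00 currency units per engine-hour,
# a 10-hour run, 24 samples in the batch.
cost_per_sample 6 1.00 10 24
```

Under this model, adding engines lowers cost/sample only while the run time shrinks at least proportionally; once the run time flattens, extra engines raise the cost of each sample.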
![Page 20: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/20.jpg)
Lessons learnt
Design | Cloud Deployment | Execution | Analysis
• Better abstraction
• Easier to understand, share, maintain
• Better exploit data parallelism
• Extensible by wrapping new tools
• Scalability
• Fewer installation/deployment requirements, staff hours required
• Automated dependency management, packaging
• Configurable to make most efficient use of a cluster
• Runtime monitoring
• Provenance collection
• Reproducibility
• Accountability
![Page 21: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/21.jpg)
Part II: SVI - Simple, traceable variant interpretation
Objectives:
• Design a simple-to-use tool to facilitate clinical diagnosis by clinicians
• Maintain history of past investigations for analytical purposes
• Ensure accountability through traceability
• Enable analytics over past patient cases
![Page 22: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/22.jpg)
A database of patient cases and investigations
Cases:
![Page 23: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/23.jpg)
Investigations
![Page 24: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/24.jpg)
Provenance of variant identification
• A provenance graph is generated for each investigation
• It accounts for the filtering process for each variant listed in the result
• Enables analytics over provenance graphs across many investigations
- “Which variants were identified independently in different cases, and how do they correlate with phenotypes?”
Work in progress!
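The first half of the example question can be sketched over plain result lists, which stand in here for queries over provenance graphs. The sketch assumes one file per investigation with one variant identifier per line; the file format, the `shared_variants` name and the identifiers are made up for illustration, and the correlation with phenotypes is not covered.

```shell
#!/usr/bin/env bash
# shared_variants FILE...
# Each FILE lists the variants reported by one investigation,
# one identifier per line (assumed format).
# Prints the variants that occur in at least two files.
shared_variants() {
  local f
  for f in "$@"; do
    sort -u "$f"            # count a variant at most once per case
  done | sort | uniq -c | awk '$1 >= 2 { print $2 }'
}

# Demo with made-up variant identifiers.
workdir=$(mktemp -d)
printf 'varA\nvarB\n'       > "$workdir/case1.txt"
printf 'varB\nvarC\n'       > "$workdir/case2.txt"
printf 'varB\nvarA\nvarA\n' > "$workdir/case3.txt"
shared_variants "$workdir"/case*.txt
```

Deduplicating within each file first means a variant listed twice in one case is not mistaken for an independent identification.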
![Page 25: Invited cloud-e-Genome project talk at 2015 NGS Data Congress](https://reader036.fdocuments.us/reader036/viewer/2022062420/55b6e5e6bb61eb86688b462f/html5/thumbnails/25.jpg)
Summary
What we are delivering to NIHR:
1. WES/WGS data processing to annotated variants
• Scalable, Cloud-based
• High level
• Low cost / sample
2. Variant interpretation:
• Simple
• Targeted at clinicians
• Built-in accountability of genetic diagnosis
• Analytics over a database of past investigations