Building cloud-enabled genomics workflows with Luigi and Docker
Cloud Accelerated Genomics
-
Upload
idan-tohami -
Category
Science
-
view
295 -
download
0
Transcript of Cloud Accelerated Genomics
![Page 1: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/1.jpg)
Cloud Accelerated Genomics
Allen Day, PhD // Science Advocate
@allenday // #genomics #ml #datascience
![Page 2: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/2.jpg)
Table of Contents
Section 1
Section 2
Section 3
Throughout
Getting from Research to Application… FasterWhat are the bottlenecks for translating research into products? Emphasis on information processing.
From CompBio Research to CompBio EngineeringGetting results, more of them, and predictably improving
Data Integration - Cutting Edge Use CasesWhat’s happening right now in industry and academia?
How to use Google Cloud?I’ll introduce specific cloud services, along with examples of how they’ve been used successfully. Compute Engine, Kubernetes, Dataflow, Cloud ML, Genomics API
![Page 3: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/3.jpg)
How to Understand?
Linear B is a syllabic script that was used for writing Mycenaean Greek, the earliest attested form of Greek. The script predates the Greek alphabet by several centuries. The oldest Mycenaean writing dates to about 1450 BC.
![Page 4: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/4.jpg)
Hypothetico-Deductive Method (Iterative)
Organize
Analyze, Interpret, and
Plan
Choose Data
Acquire
![Page 5: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/5.jpg)
Hypothetico-Deductive Method (Iterative)
Organize
Analyze, Interpret, and
Plan
Choose Data
Acquire
Situation:Not enough data.No means to get more.Dead Language.
Outcome:Cannot understand.
Also:Passive learning.No feedback.
![Page 6: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/6.jpg)
DNA Sequencing Value Chain
% E
ffor
t
0
100
Pre-NGS~2000
Future~2020
Now
Sboner, et al, 2011. The real cost of sequencing: higher than you think!
Secondary Analytics
Analytics, Intepretation,
Planning
Experiment Design
DNA Sequencing
![Page 7: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/7.jpg)
Human Genetics Scenario
Sboner, et al, 2011. The real cost of sequencing: higher than you think!
Secondary Analytics
Analytics, Intepretation,
Planning
Experiment Design
% E
ffor
t
0
100
DNA Sequencing
Situation:Unlimited Free DNA
Result:Slow to understand.
Pre-NGS~2000
Future~2020
Now
![Page 8: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/8.jpg)
Q: Why Slow to Understand? A1: Data Processing
Sboner, et al, 2011. The real cost of sequencing: higher than you think!
Secondary Analytics
Analytics, Intepretation,
Planning
Experiment Design
% E
ffor
t
0
100
DNA Sequencing
Situation:We still have an analysis bottleneck
Result:Slow to understand.
Pre-NGS~2000
Future~2020
Now
![Page 9: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/9.jpg)
00:20 - Connecting…01:22 - Link Established
![Page 10: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/10.jpg)
![Page 11: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/11.jpg)
GOOGLE CONFIDENTIAL
Google Cloud Platform lets you run your apps on the same system as Google
![Page 12: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/12.jpg)
GOOGLE CONFIDENTIAL
So you can focus on what matters to your science
![Page 13: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/13.jpg)
Google confidential │ Do not distribute
Google is good at handling massive volumes of data
uploads per minute
users
search index
query response time
300hrs
500M+
100PB+
0.25s
![Page 14: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/14.jpg)
Google confidential │ Do not distribute
Google can is good at handleing massive volumes of genomic data
uploads per minute
users
search index
query response time
300hrs
500M+
100PB+
0.25s
~6WGS
>100x US PhDs
~1M WGS
0.25s
![Page 15: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/15.jpg)
Google confidential │ Do not distribute Google confidential │ Do not distribute
Google GenomicsAugust 2015
![Page 16: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/16.jpg)
Google confidential │ Do not distribute
Google Genomics is more than infrastructure
General-purpose cloud infrastructure
Genomics-specific featuresGenomics API
Virtual Machines & Storage
Data Services & Tools
![Page 17: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/17.jpg)
Google confidential │ Do not distribute
BioQuery Analysis Engine
Medical Records Genomics Devices Imaging Patient Reports
Baseline Study Data Private Data
Pharma Health Providers …
Google’s vision to tackle complex health data
Public Data
![Page 18: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/18.jpg)
Google confidential │ Do not distribute
BioQuery Analysis Engine
Medical Records Genomics Devices Imaging Patient Reports
Baseline Study Data Private Data
Pharma Health Providers …
Google’s vision to tackle complex health data
Public Data
![Page 19: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/19.jpg)
CONFIDENTIAL & PROPRIETARY
3.75 TERABYTES PER HUMAN
1.00 TB GENOME 2.00 TB EPIGENOME 0.70 TB TRANSCRIPTOME 0.06 TB METABOLOME 0.04 TB PROTEOME ~1 MB STANDARD LAB TESTS
5-YR LONGITUDINAL STUDY
BASELINE STUDY: BIG DATA ANALYSISValidate a pipeline to process complex phenotypic, biochemical, and genomic data
● Pilot Study (N=200)○ Determine optimal biospecimen collection strategy for stable sampling
and reproducible assays○ Determine optimal assay methodology ○ Validate quality control methods○ Validate device data against surrogate and primary endpoints
● Baseline Study (N=10,000+) ○ 6 cohorts from low to high risk for cardiovascular and cancer○ Characterize human systems biology ○ Define normal values for a given parameter in heterogeneous states○ Predict meaningful events ○ Validate wearable devices for human monitoring ○ Characterize transitions in disease state
![Page 20: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/20.jpg)
Public Datasets Projecthttps://cloud.google.com/bigquery/public-data/
A public dataset is any dataset that is stored in BigQuery and made available to the general public. This URL lists a special group of public datasets that Google BigQuery hosts for you to access and integrate into your applications. Google pays for the storage of these data sets and provides public access to the data via BigQuery. You pay only for the queries that you perform on the data (the first 1TB per month is free)
![Page 21: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/21.jpg)
Confidential & ProprietaryGoogle Cloud Platform 21
Platinum Genomes
1000 Genomes
Medical (Human)
Population-scale Genome Projects
1000 Bulls
10K Dog Genomes
Veterinary / Agriculture
Open Cannabis Project
Genome To Fields
Panzea (1000 Maize)
AgriculturePersonal Genome Project
Human Microbiome Project
NCBI GEO Human 100K
Cancer Genome Atlas
Many OtherInterestingDatasets...
![Page 22: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/22.jpg)
Google confidential │ Do not distribute
PI / Biologist : variant calls for the 1,000 genomes
![Page 23: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/23.jpg)
Google confidential │ Do not distribute
Information: principal coordinates analysis (1000 genomes)
![Page 24: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/24.jpg)
Google confidential │ Do not distribute
Knowledge: populations cluster together
![Page 25: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/25.jpg)
Bioinformatics scientist: BigQuery enables fast tertiary analysis
![Page 26: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/26.jpg)
Google Cloud Platform
Dataflow + BigQuery
Used for Extract, Transform, Load (ETL), analytics, real-time computation and process orchestration.
cloud.google.com/dataflow
Dataflow
Run SQL queries against multi-terabyte datasets in seconds.
cloud.google.com/bigquery
BigQuery
![Page 27: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/27.jpg)
Google Cloud Platform
Dataflow + BigQuery
Used for Extract, Transform, Load (ETL), analytics, real-time computation and process orchestration.
cloud.google.com/dataflow
Dataflow
Run SQL queries against multi-terabyte datasets in seconds.
cloud.google.com/bigquery
BigQuery
![Page 28: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/28.jpg)
Google Cloud Platform
Dataflow + BigQuery
![Page 29: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/29.jpg)
Google confidential │ Do not distribute
Example: GATK Analysis Pipeline
Old way: install applications on host
kernel
libs
app
app app
app
Makefiles, CWL, WDL
(on a virtual machine)
![Page 30: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/30.jpg)
![Page 31: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/31.jpg)
![Page 32: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/32.jpg)
Google confidential │ Do not distribute
Example: GATK Analysis Pipeline
Old way: install applications on host
kernel
libs
app
app app
app
Makefiles, CWL, WDL
(on a virtual machine)
![Page 33: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/33.jpg)
Google confidential │ Do not distribute
Example: GATK Analysis Pipeline
● Decouple process management from host configuration
● Portable across OS distros and clouds
● Consistent environment from development to production
● Immutable images
New way: deploy containers
Old way: install applications on host
kernel
libs
app
app app
app
libs
app
kernel
libs
app
libs
app
libs
app
Makefiles, CWL, WDL
(on a virtual machine)
Dockerflow:Dataflow + Docker
Benefits
![Page 34: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/34.jpg)
Google confidential │ Do not distribute
Use Case:Reproducible Science with Docker
● Objective: Build a mutation-detection pipeline
● Provided to competitors○ Training data set○ Evalutation data set
● Competitors submit pipelines as Docker images to DREAM Challenge host, Sage Bionetworks● Submitted pipelines were used to process unseen data set● Post-competition, Docker images made public
● Incidentally, Google won this competition with a deep-learning based variant caller called DeepVariant cloud.google.com/genomics/v1alpha2/deepvariant
![Page 35: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/35.jpg)
Confidential & ProprietaryGoogle Cloud Platform 35
An idealized version of the hypothetico-deductive model of the scientific method is shown. Various potential threats to this model exist (indicated in red), including hypothesizing after the results are known (HARKing) and lack of data sharing. Together these undermine the robustness of results, and may impact on the ability of science to self-correct.
Threats to reproducible science.
http://www.nature.com/articles/s41562-016-0021
![Page 36: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/36.jpg)
> java -jar target/dockerflow*dependencies.jar\
--project=YOUR_PROJECT\
--workflow-file=hello.yaml\
--workspace=gs://YOUR_BUCKET/YOUR_FOLDER\
--runner=DataflowPipelineRunner
To run it:
Variant Calls
Your Variant Caller
36PubSub Queue
SequencerDNA Reads
Genomics API
Genomics API
BigQuery
Your Other Tool
![Page 37: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/37.jpg)
GraphConnect SF 2015 / Graphs Are Feeding The World, Tim Williamson, Data Scientist, Monsantohttps://www.youtube.com/watch?v=6KEvLURBenM
![Page 38: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/38.jpg)
GraphConnect SF 2015 / Graphs Are Feeding The World, Tim Williamson, Data Scientist, Monsantohttps://www.youtube.com/watch?v=6KEvLURBenM
![Page 39: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/39.jpg)
Marker-assisted selection for quantitative traits
![Page 40: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/40.jpg)
Marker-assisted selection for quantitative traitshttps://www.sec.gov/Archives/edgar/data/1110783/000095013402011773/c71992exv99w2.htm
![Page 41: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/41.jpg)
Google Cloud Platform
Marker-Assisted Breeding Rapidly Increases Frequency of Favorable Genes
https://www.slideshare.net/finance28/monsanto-082305a
![Page 42: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/42.jpg)
Q: Why Slow to Understand? A1: Data Processing
Sboner, et al, 2011. The real cost of sequencing: higher than you think!
Secondary Analytics
Analytics, Intepretation,
Planning
Experiment Design
% E
ffor
t
0
100
DNA Sequencing
Situation:We still have an analysis bottleneck
Result:Slow to understand.
Pre-NGS~2000
Future~2020
Now
![Page 43: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/43.jpg)
Q: Why Slow to Understand? A2: Limited Feedback
Sboner, et al, 2011. The real cost of sequencing: higher than you think!
Secondary Analytics
Analytics, Intepretation,
Planning
Experiment Design
DNA Sequencing
Situation:Data acquisition cost approaches zero
However, still slow to understand, because:
1. Restricted choice of what can be observed, i.e. controlled modifications and artificial selection
2. Passive Learning. Limited feedback => Low rate of learning
Contrast with active learning...
![Page 44: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/44.jpg)
Act
Observe
Observe
Act
Orient Decide
Decide Act
Biological System
Scientist
Molecular Sensors:DNA sequencer,Mass spectrometer,Etc
However... (Technology)-Limited Experimental Capability
![Page 45: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/45.jpg)
Google Cloud Platform
Even Moore’s Law / Carlson Curve
![Page 46: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/46.jpg)
Google Cloud Platform
Even Moore’s Law / Carlson Curve - also applies to writing DNA
![Page 47: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/47.jpg)
Act
Observe
Observe
Act
Orient Decide
Decide Act
Biological System
Scientist
Molecular Sensors:DNA sequencer,Mass spectrometer,Etc
Bioengineering Tech:DNA synthesizers,CRISPR/Cas9,Etc
![Page 48: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/48.jpg)
Act
Observe
Observe
Act
Orient Decide
Decide Act
Biological System
Scientist
Molecular Sensors:DNA sequencer,Mass spectrometer,Etc
Environmental Sensors:Laser scanners,Hyperspectral scanners,UAVsEtc
Bioengineering Tech:DNA synthesizers,CRISPR/Cas9,Etc
Regulate/Measure System I/O
![Page 49: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/49.jpg)
Google Cloud Platform
Integration with Geospatial, Management, and Terrestrial Sensor Data
anezconsulting.com/precision-agronomy/
![Page 50: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/50.jpg)
Google Cloud Platform
Descartes Labs - Google Cloud Customer
medium.com/@stevenpbrumby/corn-in-the-usa-d487dce84ee1
Cloud ML Engine
TensorFlow
![Page 51: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/51.jpg)
Google Cloud Platform
Phenomobile, http://www.mdpi.com/2073-4395/4/3/349/htm
See also: http://www.genomes2fields.org/
![Page 52: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/52.jpg)
Google Cloud Platform
Temporo-Spatial Imaging of Growing Plants
![Page 53: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/53.jpg)
Google Cloud Platform
Verily: Assisting Pathologists in Detecting Cancer with Deep Learning
research.googleblog.com/2017/03/assisting-pathologists-in-detecting.html
Prediction heatmaps produced by the algorithm had improved so much that the localization score (FROC) for the algorithm reached 89%, which significantly exceeded the score of 73% for a pathologist with no time constraint2. We were not the only ones to see promising results, as other groups were getting scores as high as 81% with the same dataset.
Model generalized very well, even to images that were acquired from a different hospital using different scanners. For full details, see our paper “Detecting Cancer Metastases on Gigapixel Pathology Images”.
![Page 54: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/54.jpg)
00:20 - Connecting…01:22 - Link Established
![Page 55: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/55.jpg)
Google Cloud Platform
~~)( ,Cloud VisionTensorFlowGoogle Genomics Dataflow Cloud ML Engine Docker
Baseline Study Data Private DataPublic Data
![Page 56: Cloud Accelerated Genomics](https://reader031.fdocuments.us/reader031/viewer/2022030313/58d0d6d61a28ab47238b587d/html5/thumbnails/56.jpg)
Build What’s NextThank You!
Allen Day, PhD // Science Advocate // @allenday // #genomics #ml #datascience