Cloud Accelerated Genomics

Cloud Accelerated Genomics Allen Day, PhD // Science Advocate @allenday // #genomics #ml #datascience

Transcript of Cloud Accelerated Genomics

Page 1: Cloud Accelerated Genomics

Cloud Accelerated Genomics

Allen Day, PhD // Science Advocate

@allenday // #genomics #ml #datascience

Page 2: Cloud Accelerated Genomics

Table of Contents

Section 1: Getting from Research to Application… Faster
What are the bottlenecks for translating research into products? Emphasis on information processing.

Section 2: From CompBio Research to CompBio Engineering
Getting results, more of them, and predictably improving.

Section 3: Data Integration - Cutting Edge Use Cases
What’s happening right now in industry and academia?

Throughout: How to use Google Cloud?
I’ll introduce specific cloud services, along with examples of how they’ve been used successfully: Compute Engine, Kubernetes, Dataflow, Cloud ML, Genomics API.

Page 3: Cloud Accelerated Genomics

How to Understand?

Linear B is a syllabic script that was used for writing Mycenaean Greek, the earliest attested form of Greek. The script predates the Greek alphabet by several centuries. The oldest Mycenaean writing dates to about 1450 BC.

Page 4: Cloud Accelerated Genomics

Hypothetico-Deductive Method (Iterative)

Organize

Analyze, Interpret, and Plan

Choose Data

Acquire

Page 5: Cloud Accelerated Genomics

Hypothetico-Deductive Method (Iterative)

Organize

Analyze, Interpret, and Plan

Choose Data

Acquire

Situation: Not enough data. No means to get more. Dead language.

Outcome: Cannot understand.

Also: Passive learning. No feedback.

Page 6: Cloud Accelerated Genomics

DNA Sequencing Value Chain

[Chart: % Effort (0 to 100) across Pre-NGS (~2000), Now, and Future (~2020), split into Experiment Design; DNA Sequencing; Secondary Analytics; and Analytics, Interpretation, Planning]

Sboner, et al, 2011. The real cost of sequencing: higher than you think!

Page 7: Cloud Accelerated Genomics

Human Genetics Scenario

[Chart: % Effort (0 to 100) across Pre-NGS (~2000), Now, and Future (~2020), split into Experiment Design; DNA Sequencing; Secondary Analytics; and Analytics, Interpretation, Planning]

Sboner, et al, 2011. The real cost of sequencing: higher than you think!

Situation: Unlimited free DNA

Result: Slow to understand.

Page 8: Cloud Accelerated Genomics

Q: Why Slow to Understand? A1: Data Processing

[Chart: % Effort (0 to 100) across Pre-NGS (~2000), Now, and Future (~2020), split into Experiment Design; DNA Sequencing; Secondary Analytics; and Analytics, Interpretation, Planning]

Sboner, et al, 2011. The real cost of sequencing: higher than you think!

Situation: We still have an analysis bottleneck

Result: Slow to understand.

Page 9: Cloud Accelerated Genomics

00:20 - Connecting…
01:22 - Link Established

Page 10: Cloud Accelerated Genomics
Page 11: Cloud Accelerated Genomics

GOOGLE CONFIDENTIAL

Google Cloud Platform lets you run your apps on the same system as Google

Page 12: Cloud Accelerated Genomics

So you can focus on what matters to your science

Page 13: Cloud Accelerated Genomics

Google confidential │ Do not distribute

Google is good at handling massive volumes of data

uploads per minute: 300 hrs

users: 500M+

search index: 100 PB+

query response time: 0.25 s

Page 14: Cloud Accelerated Genomics

Google is good at handling massive volumes of genomic data

uploads per minute: 300 hrs (~6 WGS)

users: 500M+ (>100x US PhDs)

search index: 100 PB+ (~1M WGS)

query response time: 0.25 s (0.25 s)

Page 15: Cloud Accelerated Genomics

Google Genomics

August 2015

Page 16: Cloud Accelerated Genomics

Google Genomics is more than infrastructure

General-purpose cloud infrastructure: Virtual Machines & Storage; Data Services & Tools

Genomics-specific features: Genomics API

Page 17: Cloud Accelerated Genomics

Google’s vision to tackle complex health data

BioQuery Analysis Engine

Medical Records, Genomics, Devices, Imaging, Patient Reports

Baseline Study Data, Private Data, Public Data

Pharma, Health Providers, …

Page 18: Cloud Accelerated Genomics

Page 19: Cloud Accelerated Genomics

CONFIDENTIAL & PROPRIETARY

3.75 TERABYTES PER HUMAN

1.00 TB GENOME
2.00 TB EPIGENOME
0.70 TB TRANSCRIPTOME
0.06 TB METABOLOME
0.04 TB PROTEOME
~1 MB STANDARD LAB TESTS

5-YR LONGITUDINAL STUDY

BASELINE STUDY: BIG DATA ANALYSIS

Validate a pipeline to process complex phenotypic, biochemical, and genomic data

● Pilot Study (N=200)
  ○ Determine optimal biospecimen collection strategy for stable sampling and reproducible assays
  ○ Determine optimal assay methodology
  ○ Validate quality control methods
  ○ Validate device data against surrogate and primary endpoints

● Baseline Study (N=10,000+)
  ○ 6 cohorts from low to high risk for cardiovascular and cancer
  ○ Characterize human systems biology
  ○ Define normal values for a given parameter in heterogeneous states
  ○ Predict meaningful events
  ○ Validate wearable devices for human monitoring
  ○ Characterize transitions in disease state
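As a rough sizing sketch for the study figures on this slide, the per-person volume can be multiplied out to cohort scale. This assumes decimal units (1 PB = 1000 TB) and a single 3.75 TB profile per participant; the slide does not say how the 5-year longitudinal design multiplies that figure. All names in the code are illustrative.

```python
# Per-person data volumes as listed on the slide, in terabytes.
COMPONENTS_TB = {
    "genome": 1.00,
    "epigenome": 2.00,
    "transcriptome": 0.70,
    "metabolome": 0.06,
    "proteome": 0.04,
}
HEADLINE_TB_PER_PERSON = 3.75  # the slide's headline figure


def cohort_storage_pb(participants, tb_per_person=HEADLINE_TB_PER_PERSON):
    """Total storage in petabytes, assuming 1 PB = 1000 TB (an assumption)."""
    return participants * tb_per_person / 1000


# The ~1 MB of standard lab tests is negligible at this scale and is left out.
itemized_tb = round(sum(COMPONENTS_TB.values()), 2)
pilot_pb = cohort_storage_pb(200)        # Pilot Study, N=200
baseline_pb = cohort_storage_pb(10_000)  # Baseline Study, N=10,000 (lower bound)

print(f"itemized: {itemized_tb} TB per person")
print(f"pilot: {pilot_pb} PB, baseline: {baseline_pb} PB")
```

The itemized values add up to 3.80 TB, slightly above the 3.75 TB headline; the sketch uses the headline figure. On these assumptions the pilot needs about 0.75 PB and the full study at least 37.5 PB.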

Page 20: Cloud Accelerated Genomics

Public Datasets Project
https://cloud.google.com/bigquery/public-data/

A public dataset is any dataset that is stored in BigQuery and made available to the general public. This URL lists a special group of public datasets that Google BigQuery hosts for you to access and integrate into your applications. Google pays for the storage of these datasets and provides public access to the data via BigQuery. You pay only for the queries that you perform on the data (the first 1 TB per month is free).
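The billing rule above (storage paid by Google, queries paid by you after the first 1 TB each month) can be sketched as simple arithmetic. The per-TB price is deliberately left as a parameter because the slide does not state it; the function names are illustrative, not part of any Google API.

```python
FREE_TB_PER_MONTH = 1.0  # free query tier stated on the slide


def billable_tb(scanned_tb_this_month, free_tb=FREE_TB_PER_MONTH):
    """Terabytes of query scanning that are charged after the free tier."""
    return max(0.0, scanned_tb_this_month - free_tb)


def monthly_query_cost(scanned_tb_this_month, price_per_tb):
    """Query cost for the month; price_per_tb is a caller-supplied assumption."""
    return billable_tb(scanned_tb_this_month) * price_per_tb


# Example with a hypothetical price of 2.0 currency units per TB scanned.
print(billable_tb(0.4))               # under the free tier
print(monthly_query_cost(3.5, 2.0))   # 2.5 TB billable at the hypothetical price
```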

Page 21: Cloud Accelerated Genomics

Population-scale Genome Projects

Medical (Human): Platinum Genomes, 1000 Genomes

Veterinary / Agriculture: 1000 Bulls, 10K Dog Genomes

Agriculture: Open Cannabis Project, Genome To Fields, Panzea (1000 Maize)

Personal Genome Project, Human Microbiome Project, NCBI GEO Human 100K, Cancer Genome Atlas

Many Other Interesting Datasets...

Page 22: Cloud Accelerated Genomics

PI / Biologist: variant calls for the 1000 Genomes

Page 23: Cloud Accelerated Genomics

Information: principal coordinates analysis (1000 genomes)

Page 24: Cloud Accelerated Genomics

Knowledge: populations cluster together
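A minimal sketch of principal coordinates analysis, to show how a distance matrix between samples turns into coordinates in which populations separate. This is a pure-Python toy, not the analysis behind the slide: the genotypes are made up, Euclidean distance is an assumed choice, and only the first axis is computed (by power iteration).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two genotype vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def first_principal_coordinate(dist, iterations=200):
    """First PCoA axis for a symmetric distance matrix."""
    n = len(dist)
    squared = [[d * d for d in row] for row in dist]
    # The matrix is symmetric, so row means double as column means.
    row_mean = [sum(row) / n for row in squared]
    grand_mean = sum(row_mean) / n
    # Gower double-centering turns squared distances into inner products.
    b = [[-0.5 * (squared[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    def multiply(v):
        return [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]

    # Power iteration converges to the eigenvector of the largest eigenvalue.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = multiply(v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(x * y for x, y in zip(v, multiply(v)))
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [x * scale for x in v]


# Toy genotypes (made up): two samples from each of two populations.
samples = {
    "popA_1": [0, 0, 0, 0, 1],
    "popA_2": [0, 0, 0, 0, 0],
    "popB_1": [1, 1, 1, 1, 0],
    "popB_2": [1, 1, 1, 1, 1],
}
names = list(samples)
dist = [[euclidean(samples[a], samples[b]) for b in names] for a in names]
coords = dict(zip(names, first_principal_coordinate(dist)))
for name, value in coords.items():
    print(name, round(value, 3))
```

Samples from the same toy population land on the same side of the first axis, which is the "populations cluster together" pattern in miniature. The overall sign of the axis is arbitrary.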

Page 25: Cloud Accelerated Genomics

Bioinformatics scientist: BigQuery enables fast tertiary analysis

Page 26: Cloud Accelerated Genomics

Google Cloud Platform

Dataflow + BigQuery

Dataflow: Used for Extract, Transform, Load (ETL), analytics, real-time computation and process orchestration.
cloud.google.com/dataflow

BigQuery: Run SQL queries against multi-terabyte datasets in seconds.
cloud.google.com/bigquery

Page 27: Cloud Accelerated Genomics

Page 28: Cloud Accelerated Genomics

Dataflow + BigQuery

Page 29: Cloud Accelerated Genomics

Example: GATK Analysis Pipeline

Old way: install applications on host

[Diagram: four apps sharing one set of libs on a single kernel]

Makefiles, CWL, WDL

(on a virtual machine)

Page 30: Cloud Accelerated Genomics
Page 31: Cloud Accelerated Genomics
Page 32: Cloud Accelerated Genomics

Example: GATK Analysis Pipeline

Old way: install applications on host

[Diagram: four apps sharing one set of libs on a single kernel]

Makefiles, CWL, WDL

(on a virtual machine)

Page 33: Cloud Accelerated Genomics

Example: GATK Analysis Pipeline

Old way: install applications on host

[Diagram: four apps sharing one set of libs on a single kernel]

Makefiles, CWL, WDL

(on a virtual machine)

New way: deploy containers

[Diagram: four containers, each bundling an app with its own libs, on a shared kernel]

Dockerflow: Dataflow + Docker

Benefits

● Decouple process management from host configuration
● Portable across OS distros and clouds
● Consistent environment from development to production
● Immutable images

Page 34: Cloud Accelerated Genomics

Use Case: Reproducible Science with Docker

● Objective: Build a mutation-detection pipeline

● Provided to competitors
  ○ Training data set
  ○ Evaluation data set

● Competitors submit pipelines as Docker images to DREAM Challenge host, Sage Bionetworks
● Submitted pipelines were used to process unseen data set
● Post-competition, Docker images made public

● Incidentally, Google won this competition with a deep-learning-based variant caller called DeepVariant
  cloud.google.com/genomics/v1alpha2/deepvariant
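One plausible way a host could score a submitted pipeline's output on the unseen data set is to compare its variant calls with a truth set. The sketch below uses precision, recall, and F1 over exact-match variant keys; this metric and the tuple layout are assumptions for illustration, since the slide does not describe the challenge's actual scoring.

```python
def score_calls(truth, calls):
    """Precision, recall and F1 of called variants against a truth set.

    Each variant is a (chromosome, position, ref, alt) tuple; exact matching
    is an assumed simplification of how real evaluations compare calls.
    """
    truth, calls = set(truth), set(calls)
    true_positives = len(truth & calls)
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: three of four calls are correct, one true variant is missed.
truth_set = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
             ("chr2", 40, "G", "A"), ("chr3", 7, "T", "C")]
pipeline_calls = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
                  ("chr2", 40, "G", "A"), ("chr5", 900, "A", "T")]

precision, recall, f1 = score_calls(truth_set, pipeline_calls)
print(precision, recall, f1)
```

Because every competitor ships a Docker image, the same scoring function can be run against each image's output without reinstalling anyone's toolchain.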

Page 35: Cloud Accelerated Genomics

Threats to reproducible science.

An idealized version of the hypothetico-deductive model of the scientific method is shown. Various potential threats to this model exist (indicated in red), including hypothesizing after the results are known (HARKing) and lack of data sharing. Together these undermine the robustness of results, and may impact on the ability of science to self-correct.

http://www.nature.com/articles/s41562-016-0021

Page 36: Cloud Accelerated Genomics

To run it:

> java -jar target/dockerflow*dependencies.jar \
    --project=YOUR_PROJECT \
    --workflow-file=hello.yaml \
    --workspace=gs://YOUR_BUCKET/YOUR_FOLDER \
    --runner=DataflowPipelineRunner

[Diagram components: Sequencer, DNA Reads, PubSub Queue, Your Variant Caller, Variant Calls, Genomics API (shown twice), BigQuery, Your Other Tool]

Page 37: Cloud Accelerated Genomics

GraphConnect SF 2015 / Graphs Are Feeding The World, Tim Williamson, Data Scientist, Monsanto
https://www.youtube.com/watch?v=6KEvLURBenM

Page 38: Cloud Accelerated Genomics

Page 39: Cloud Accelerated Genomics

Marker-assisted selection for quantitative traits

Page 40: Cloud Accelerated Genomics

Marker-assisted selection for quantitative traits
https://www.sec.gov/Archives/edgar/data/1110783/000095013402011773/c71992exv99w2.htm

Page 41: Cloud Accelerated Genomics

Marker-Assisted Breeding Rapidly Increases Frequency of Favorable Genes

https://www.slideshare.net/finance28/monsanto-082305a
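To make the headline concrete, here is a textbook single-locus selection model, not the method behind the linked slides. It assumes random mating, a marker that perfectly tags the favorable allele, and breeders who by default discard every plant that is homozygous for the unfavorable allele. Function and parameter names are illustrative.

```python
def next_frequency(p, w_fav_hom=1.0, w_het=1.0, w_unfav_hom=0.0):
    """Favorable-allele frequency after one generation of selection.

    p must be greater than 0. The w_* weights are the fractions of each
    genotype kept for breeding (defaults: cull unfavorable homozygotes).
    """
    q = 1.0 - p
    kept = p * p * w_fav_hom + 2 * p * q * w_het + q * q * w_unfav_hom
    return (p * p * w_fav_hom + p * q * w_het) / kept


def trajectory(p0, generations, **weights):
    """Allele frequency at generation 0, 1, ..., generations."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_frequency(freqs[-1], **weights))
    return freqs


selected = trajectory(0.1, 5)
unselected = trajectory(0.1, 5, w_unfav_hom=1.0)
print([round(f, 3) for f in selected])
print([round(f, 3) for f in unselected])
```

With marker-based culling the favorable allele climbs from 10% to over 50% in a single generation in this model, while without selection the frequency stays where it started.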

Page 42: Cloud Accelerated Genomics

Q: Why Slow to Understand? A1: Data Processing

[Chart: % Effort (0 to 100) across Pre-NGS (~2000), Now, and Future (~2020), split into Experiment Design; DNA Sequencing; Secondary Analytics; and Analytics, Interpretation, Planning]

Sboner, et al, 2011. The real cost of sequencing: higher than you think!

Situation: We still have an analysis bottleneck

Result: Slow to understand.

Page 43: Cloud Accelerated Genomics

Q: Why Slow to Understand? A2: Limited Feedback

[Chart categories: Experiment Design; DNA Sequencing; Secondary Analytics; Analytics, Interpretation, Planning]

Sboner, et al, 2011. The real cost of sequencing: higher than you think!

Situation: Data acquisition cost approaches zero

However, still slow to understand, because:

1. Restricted choice of what can be observed, i.e. controlled modifications and artificial selection

2. Passive Learning. Limited feedback => Low rate of learning

Contrast with active learning...
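A toy illustration of why feedback raises the rate of learning: locating a hidden threshold on the interval [0, 1] with a fixed budget of yes/no experiments. The passive learner fixes all its experiments in advance; the active learner picks each one using the previous answer. The threshold task and all names are assumptions chosen for illustration.

```python
def label(x, threshold):
    """One 'experiment': is x at or above the hidden threshold?"""
    return x >= threshold


def passive_interval(threshold, n_queries):
    """Experiments chosen up front on an even grid, ignoring earlier answers."""
    lo, hi = 0.0, 1.0
    for i in range(n_queries):
        x = (i + 0.5) / n_queries
        if label(x, threshold):
            hi = min(hi, x)
        else:
            lo = max(lo, x)
    return lo, hi


def active_interval(threshold, n_queries):
    """Each experiment bisects what the previous answers left uncertain."""
    lo, hi = 0.0, 1.0
    for _ in range(n_queries):
        x = (lo + hi) / 2
        if label(x, threshold):
            hi = x
        else:
            lo = x
    return lo, hi


hidden = 0.3141  # unknown to both learners
p_lo, p_hi = passive_interval(hidden, 10)
a_lo, a_hi = active_interval(hidden, 10)
print("passive uncertainty:", p_hi - p_lo)
print("active uncertainty: ", a_hi - a_lo)
```

With ten experiments each, the passive learner narrows the threshold to an interval about 0.1 wide, while the active learner reaches 1/1024: uncertainty shrinks linearly without feedback and exponentially with it.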

Page 44: Cloud Accelerated Genomics

[Diagram: Observe / Orient / Decide / Act loops linking Scientist and Biological System]

Molecular Sensors: DNA sequencer, mass spectrometer, etc.

However... (Technology)-Limited Experimental Capability

Page 45: Cloud Accelerated Genomics

Even Moore’s Law / Carlson Curve

Page 46: Cloud Accelerated Genomics

Even Moore’s Law / Carlson Curve - also applies to writing DNA

Page 47: Cloud Accelerated Genomics

[Diagram: Observe / Orient / Decide / Act loops linking Scientist and Biological System]

Molecular Sensors: DNA sequencer, mass spectrometer, etc.

Bioengineering Tech: DNA synthesizers, CRISPR/Cas9, etc.

Page 48: Cloud Accelerated Genomics

[Diagram: Observe / Orient / Decide / Act loops linking Scientist and Biological System]

Molecular Sensors: DNA sequencer, mass spectrometer, etc.

Environmental Sensors: laser scanners, hyperspectral scanners, UAVs, etc.

Bioengineering Tech: DNA synthesizers, CRISPR/Cas9, etc.

Regulate/Measure System I/O

Page 49: Cloud Accelerated Genomics

Integration with Geospatial, Management, and Terrestrial Sensor Data

anezconsulting.com/precision-agronomy/

Page 50: Cloud Accelerated Genomics

Descartes Labs - Google Cloud Customer

medium.com/@stevenpbrumby/corn-in-the-usa-d487dce84ee1

Cloud ML Engine

TensorFlow

Page 51: Cloud Accelerated Genomics

Phenomobile, http://www.mdpi.com/2073-4395/4/3/349/htm

See also: http://www.genomes2fields.org/

Page 52: Cloud Accelerated Genomics

Temporo-Spatial Imaging of Growing Plants

Page 53: Cloud Accelerated Genomics

Verily: Assisting Pathologists in Detecting Cancer with Deep Learning

research.googleblog.com/2017/03/assisting-pathologists-in-detecting.html

Prediction heatmaps produced by the algorithm had improved so much that the localization score (FROC) for the algorithm reached 89%, which significantly exceeded the score of 73% for a pathologist with no time constraint. We were not the only ones to see promising results, as other groups were getting scores as high as 81% with the same dataset.

The model generalized very well, even to images that were acquired from a different hospital using different scanners. For full details, see our paper “Detecting Cancer Metastases on Gigapixel Pathology Images”.

Page 54: Cloud Accelerated Genomics

00:20 - Connecting…
01:22 - Link Established

Page 55: Cloud Accelerated Genomics

Cloud Vision, TensorFlow, Google Genomics, Dataflow, Cloud ML Engine, Docker

Baseline Study Data, Private Data, Public Data

Page 56: Cloud Accelerated Genomics

Build What’s Next

Thank You!

Allen Day, PhD // Science Advocate // @allenday // #genomics #ml #datascience