Genomic Scale Big Data Pipelines

Dr. Denis Bauer & Lynn Langit

Genomic-scale Data Pipelines

Transformational Bioinformatics | Denis C. Bauer | @allPowerde

Transformational Bioinformatics Team

Denis Bauer, PhD

Oscar Luo, PhD

Rob Dunne, PhD

Piotr Szul

Aidan O’Brien

Laurence Wilson, PhD

Adrian White

Andy Hindmarch

Collaborators

David Levy

Software

Dan Andrews

Kaitao Lai, PhD

Natalie Twine, PhD

Arash Bayat

John Hildebrandt

Mia Chapman

Ian Blair

Kelly Williams

Jules Damji

Gaetan Burgio

Lynn Langit

Big Data in 2025…Petabytes?

[Bar chart: Astronomy, Twitter, and YouTube; axis from 0 to 2500]

Genome holds the blueprint for every cell

It affects looks, disease risk, and behavior

GENOMIC Big Data in 2025 - Exabytes

[Bar chart: Astronomy, Twitter, YouTube, and Genomic; axis from 0 to 25]

VCF Data

Genomic Research Workflow

https://www.projectmine.com/about/

Finding the disease gene(s)

Spot the variant that is…

• common amongst all affected

• absent in all unaffected*

* oversimplified
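
The (oversimplified) rule above can be sketched in a few lines of Python. This is an illustrative toy, not the method used by the team: the sample names, the variant identifiers, and the `candidate_variants` helper are all hypothetical, and genotypes are reduced to 1 (carries the variant) or 0 (does not).

```python
from typing import Dict, List

# Hypothetical toy data: 1 = sample carries the variant, 0 = it does not.
GENOTYPES: Dict[str, Dict[str, int]] = {
    "case1":    {"var_gene1": 1, "var_gene2": 1, "var_other": 0},
    "case2":    {"var_gene1": 1, "var_gene2": 0, "var_other": 1},
    "control1": {"var_gene1": 0, "var_gene2": 1, "var_other": 0},
    "control2": {"var_gene1": 0, "var_gene2": 0, "var_other": 1},
}


def candidate_variants(
    genotypes: Dict[str, Dict[str, int]],
    affected: List[str],
    unaffected: List[str],
) -> List[str]:
    """Return variants present in every affected and absent in every unaffected sample."""
    variants = sorted({v for calls in genotypes.values() for v in calls})
    return [
        v
        for v in variants
        if all(genotypes[s].get(v, 0) == 1 for s in affected)
        and all(genotypes[s].get(v, 0) == 0 for s in unaffected)
    ]


hits = candidate_variants(
    GENOTYPES, affected=["case1", "case2"], unaffected=["control1", "control2"]
)
print(hits)
```

With real cohorts the rule is rarely this clean, which is why the slide marks it as oversimplified.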

[Figure: variants in Gene1 and Gene2, compared against controls]

Cloud Data Pipeline Pattern

Problem

• Define business problem

• Quality

• Quantity

• Location

Candidate Technologies

• Ingest

• Clean

• Analyze

• Predict

• Visualize

Build MVPs

• Iterate

• Learn

• Assemble

Assemble Pipeline

• Validate sections

• Test at scale

Machine Learning Pipeline Pattern

What is CSIRO’s solution?

• For scale at reasonable cost: use Apache Hadoop

• For scale at speed: use Apache Spark

• For usability in bioinformatics: create a domain-specific ML API (library)

• For global use: leverage Cloud Pipeline Patterns

GWAS Analysis with Variant-Spark

On-premise Cluster with Apache Hadoop & Spark

Genomics Analysts

CSIRO corporate data center

Why Apache Spark?

BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)

Supervised ML: Wide Random Forests

Solving Important Questions…Cancer genomics?

DEMO: Who is a Hipster?

VariantSpark & Databricks Notebook

databricks Notebook

Performance – Faster and More Accurate

VariantSpark is the only method to scale to 100% of the genome

[Chart axis: Accuracy, low to high]

Scaling to 50 M variables and 10 K samples

100K trees: 5 – 50h

AWS: ~$215.50

100K trees: 200 – 2000h

AWS: ~$8620.00

• YARN cluster

• 12 workers

• 16 x Intel Xeon E5-2660 @ 2.20GHz CPU

• 128 GB of RAM

• Spark 1.6.1 on YARN

• 128 executors

• 6 GB / executor (0.75 TB)

• Synthetic dataset
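
The figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only numbers copied from the slide; the variable names are illustrative, and the "shared hourly rate" reading is an inference from the arithmetic, not something the slide states.

```python
# Figures copied from the slide.
executors = 128
gb_per_executor = 6
cost_a_usd, hours_a = 215.50, (5, 50)        # first "100K trees" entry
cost_b_usd, hours_b = 8620.00, (200, 2000)   # second "100K trees" entry

# 128 executors x 6 GB = 768 GB, which is 0.75 TB at 1024 GB per TB.
total_memory_tb = executors * gb_per_executor / 1024

# The two cost estimates differ by the same factor as the runtimes.
cost_ratio = cost_b_usd / cost_a_usd
runtime_ratio = hours_b[0] / hours_a[0]

# Assumption: each cost is priced at the upper end of its runtime range.
rate_a = cost_a_usd / hours_a[1]
rate_b = cost_b_usd / hours_b[1]

print(total_memory_tb, cost_ratio, runtime_ratio, round(rate_a, 2), round(rate_b, 2))
```

Under that assumption both entries work out to the same cost per hour, so the price gap follows directly from the runtime gap.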

[Chart labels: Whole Genome Range, GWAS Range]

Try it out: VariantSpark Notebook

https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html

Future Directions for VariantSpark RF

Additional feature types

Unordered Categorical

For Scores - Continuous

Different feature ranges

Small and Big Inputs

For Gene Expression analysis

Genome editing can correct genetic diseases, e.g. hypertrophic cardiomyopathy

Editing does not work every time, e.g. only 7 in 10 embryos were mutation free

Aim: Develop a computational guidance framework so that edits work the first time, every time

Ma et al. Nature 2017 *

* Controversy around the paper – stay tuned

Make process parallel and scalable

• SPEED: Each search can be broken down into parallel tasks to then only take seconds

• SCALE: Researchers might want to search the target for one gene or 100,000
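
A minimal sketch of the "break one search into parallel tasks" idea follows. It is not GT-Scan2 code. It assumes, purely for illustration, that a candidate site is a 23-letter window ending in "GG"; the slides do not specify the search criterion. The `scan_range` and `parallel_scan` helpers are hypothetical, and local threads stand in for whatever parallel workers a real deployment would use.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

SITE_LEN = 23  # assumption: window length of one candidate site


def scan_range(seq: str, start: int, stop: int) -> List[int]:
    """Return start positions in [start, stop) whose window ends in 'GG'."""
    last_start = len(seq) - SITE_LEN + 1
    return [
        i
        for i in range(start, min(stop, last_start))
        if seq[i + SITE_LEN - 2 : i + SITE_LEN] == "GG"
    ]


def parallel_scan(seq: str, n_tasks: int = 4) -> List[int]:
    """Split the start positions into n_tasks ranges and scan them concurrently."""
    n_starts = max(len(seq) - SITE_LEN + 1, 0)
    step = max(-(-n_starts // n_tasks), 1)  # ceiling division, at least 1
    ranges = [(lo, min(lo + step, n_starts)) for lo in range(0, n_starts, step)]
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        parts = list(pool.map(lambda r: scan_range(seq, *r), ranges))
    # Ranges are disjoint and ordered, so concatenation keeps positions sorted.
    return [pos for part in parts for pos in part]


example = "A" * 20 + "TGG" + "A" * 10
print(parallel_scan(example, n_tasks=3))
```

Because each range is independent, the number of tasks can grow with the size of the request, whether the search covers one gene or many.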

Scalability + Agility =

One of the first Serverless Applications in Research

Featured in

This is My Architecture

GT-Scan2

Considering Services for GT-Scan2

• Use AWS Step Functions

• Simplify workflow

• Simplify task timeouts

• Simplify task failures

• Must evaluate costs

• SNS vs. Step Functions
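
To make the timeout and failure points concrete, here is a hedged sketch of a Step Functions state machine definition, built as a Python dictionary and serialized to Amazon States Language JSON. The state names, the error name, and the Lambda ARN are hypothetical placeholders, not the actual GT-Scan2 workflow.

```python
import json

# Hypothetical placeholder ARN; not a real resource.
SCORE_TARGET_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:score-target"

state_machine = {
    "Comment": "Sketch: one task with a timeout, retries, and a failure path",
    "StartAt": "ScoreTarget",
    "States": {
        "ScoreTarget": {
            "Type": "Task",
            "Resource": SCORE_TARGET_ARN,
            # Timeout handled by the workflow instead of custom code.
            "TimeoutSeconds": 60,
            # Retry timed-out attempts with exponential backoff.
            "Retry": [
                {
                    "ErrorEquals": ["States.Timeout"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            # Any remaining error is routed to an explicit failure state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "ReportFailure"}],
            "End": True,
        },
        "ReportFailure": {
            "Type": "Fail",
            "Error": "TargetScoringFailed",
            "Cause": "Scoring task failed after retries",
        },
    },
}

definition = json.dumps(state_machine, indent=2)
print(definition)
```

Declaring retries and timeouts in the definition is what keeps them out of the task code; whether that is worth the per-transition cost compared with SNS is the evaluation the slide calls for.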

Pattern steps: Problem, Data, Candidate Technologies, Build MVPs, Assemble Pipeline

1. Analyze/GWAS

• Data: vcf -> S3/Hadoop

• Ingest: S3 -> Databricks DBFS

• ETL: Apache Spark

• Analyze: Variant-Spark ML

• Viz: Notebook SQL, R or Python

2. Search/GTScan2

• Data: S3/fastq -> DynamoDB; S3/fastq, bed

• Ingest: S3

• ETL: Lambda

• Analyze: Lambda

• Viz: Lambda/API Gateway

• Assemble Pipeline: Serverless

Spark Pipeline Pattern

Jupyter Notebook

Serverless Architecture Pattern

Lambda function

buckets with objects

DynamoDB

API Gateway

Users

Step Functions
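
As a rough illustration of how the components above connect, the sketch below shows a Lambda entry point that receives an API Gateway proxy event and returns a proxy-style response. The `gene` query parameter and the shape of the job record are hypothetical, and the storage calls to S3 or DynamoDB are left as a comment so the example stays self-contained.

```python
import json
from typing import Any, Dict


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Entry point invoked by API Gateway (Lambda proxy integration)."""
    # queryStringParameters is None when the request carries no query string.
    params = event.get("queryStringParameters") or {}
    gene = params.get("gene")  # hypothetical parameter name
    if not gene:
        return {
            "statusCode": 400,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"error": "missing 'gene' parameter"}),
        }

    # A deployed function would read inputs from S3 and write the job
    # record to DynamoDB here; both are omitted in this sketch.
    job = {"gene": gene, "status": "queued"}
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(job),
    }


response = handler({"queryStringParameters": {"gene": "GENE1"}}, None)
print(response["statusCode"], response["body"])
```

The function holds no state between invocations, which is what lets the platform run as many copies as there are incoming requests.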

Cloud Genomic Data Pipelines

• Problem #1 – Analyze

• Find the mutated genes

• Solution: Spark-based machine learning

• Problem #2 – Scan

• Find the nucleotides (DNA letters)

• Solution: Serverless

Genomics Big Data Pipelines

Dr. Denis Bauer & Lynn Langit
