VariantSpark - a Spark library for genomics

Post on 21-Jan-2018


VariantSpark: a library for Genomics

Transformational Bioinformatics | Denis C. Bauer | @allPowerde

Lynn Langit

“Genomical” Big Data

Natalie Twine

Transformational Bioinformatics Team


Denis Bauer, Oscar Luo, Rob Dunne, Piotr Szul, Aidan O’Brien, Laurence Wilson

Adrian White

Mia Champion, Gaetan Burgio

Collaborators

David Levy

News

Software

Dan Andrews

Kaitao Lai

Kaylene Simpson

Iva Nikolic

Ian Blair

Kelly Williams

BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)

Cited: 4

VariantSpark | Denis C. Bauer @allPowerde

Unsupervised ML: K-Means

www.cloudaccess.eu

1000 x 40 million variants matrix *

k-means

Predict superpopulation

414 ethnic groups and superpopulations


* VariantSpark can also process phase 3 data: 3000 individuals and 80 million variants
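As a rough illustration of the clustering step on this slide, here is a minimal sketch of k-means on a tiny genotype matrix (rows are individuals, columns are variants). It is plain Lloyd's algorithm in standard-library Python, not VariantSpark's Spark-based implementation. The 0/1/2 genotype encoding, the toy data, and the helper names `sq_dist`, `init_centroids`, and `kmeans` are all assumptions made for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(rows, k):
    """Farthest-first initialisation: deterministic, so no random seed is needed."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        farthest = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def kmeans(rows, k, max_iter=50):
    """Plain Lloyd's algorithm; returns one cluster index per row."""
    centroids = init_centroids(rows, k)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each individual goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(row, centroids[c])) for row in rows
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [r for r, a in zip(rows, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical toy data: 6 individuals x 5 variants,
# genotypes assumed to be coded as alternate-allele counts (0, 1, 2).
genotypes = [
    [0, 0, 1, 2, 2],
    [0, 1, 0, 2, 2],
    [0, 0, 0, 2, 1],
    [2, 2, 1, 0, 0],
    [2, 1, 2, 0, 0],
    [2, 2, 2, 1, 0],
]
labels = kmeans(genotypes, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The first three and last three individuals fall into separate clusters. At the scale on the slide the same two steps would have to run distributed over the matrix, which is where Spark comes in.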

Comparing K-Means Implementations

[Figure: time in seconds by method (Python, R, Hadoop, Adam, ADMIXTURE, VariantSpark), split by task (binary-conversion, clustering, pre-processing). Bar labels: 103, 75, 29, 28, 18, 4 min.]


Supervised ML: Wide Random Forests


Genomic Research Workflow

https://www.projectmine.com/about/

Focus

Performance: Faster and More Accurate

VariantSpark is the only method to scale to 100% of the genome


Scaling to 50 M variables and 10 K samples


100K trees: 5 – 50h

AWS: ~$215.50

100K trees: 200 – 2000h

AWS: ~$8620.00
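A quick check on the two AWS figures: if each quoted cost corresponds to the upper end of its runtime range (an assumption; the slide does not say which end is priced), both estimates imply the same hourly cluster rate. The dictionary keys below are labels made up for this sketch.

```python
# Figures copied from the slide; pairing each cost with the UPPER bound
# of its runtime range is an assumption made for this sketch.
estimates = {
    "5-50 h run": {"max_hours": 50, "aws_cost_usd": 215.50},
    "200-2000 h run": {"max_hours": 2000, "aws_cost_usd": 8620.00},
}

hourly_rates = {
    name: round(e["aws_cost_usd"] / e["max_hours"], 2)
    for name, e in estimates.items()
}

for name, rate in hourly_rates.items():
    print(f"{name}: ~${rate:.2f} per cluster-hour")
```

Under that assumption both work out to about $4.31 per cluster-hour, so the 40x difference in cost simply tracks the 40x difference in runtime.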

• YARN cluster (12 workers)
• 16 x Intel Xeon E5-2660@2.20GHz CPU
• 128 GB of RAM
• Spark 1.6.1 on YARN
• 128 executors
• 6GB / executor (0.75TB)
• Synthetic dataset (mtry = 0.25)
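The memory figure in the list above can be checked with a few lines of arithmetic. The second half of the sketch assumes that mtry = 0.25 means a fraction of all variables is tried at each split, and that the synthetic dataset reaches the 50 M variables named in the slide title. Neither is stated explicitly, so treat those numbers as illustrative.

```python
# Cluster memory: values taken from the bullet list.
executors = 128
gb_per_executor = 6
total_gb = executors * gb_per_executor  # 768 GB
total_tb = total_gb / 1024              # 0.75 TB, matching the slide

# Random forest split candidates.
# ASSUMPTIONS: mtry is a fraction of variables, and the dataset has
# the 50 M variables mentioned in the slide title.
variables = 50_000_000
mtry_fraction = 0.25
candidates_per_split = int(variables * mtry_fraction)

print(total_gb, total_tb, candidates_per_split)
```

Under these assumptions each split would consider 12.5 million candidate variables, which gives a sense of why the forest is called "wide".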

Whole Genome Range

GWAS Range

Databricks & VariantSpark via a Jupyter notebook

Solving Important Questions… Cancer genomics?

DEMO: Who is a Hipster?

• Quickly access a managed Spark cluster - AWS EC2 / spot instances

• Link to your data and perform whole genome analysis in real-time

VariantSpark & Databricks Notebooks


Jupyter Notebook


Joint-loci association test

Hipster-Index = ((2 + GT[B6]) * (1.5 + GT[R1])) + ((0.5 + GT[C2]) * (1 + GT[B2]))

Label = 1 if Hipster-Index > 10

Genomic profile, Label

Samples (n=2500)
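The Hipster-Index rule from this slide can be written out directly. The sketch below assumes each genotype GT[...] is coded 0, 1, or 2 (the slide does not give the encoding); the function names and sample profiles are made up for illustration.

```python
def hipster_index(gt):
    """Hipster-Index as given on the slide; gt maps locus name -> genotype."""
    return ((2 + gt["B6"]) * (1.5 + gt["R1"])) + ((0.5 + gt["C2"]) * (1 + gt["B2"]))


def hipster_label(gt, threshold=10):
    """Label = 1 if the index exceeds the threshold, else 0."""
    return 1 if hipster_index(gt) > threshold else 0


# Hypothetical profiles, genotypes assumed to be coded 0/1/2.
all_ref = {"B6": 0, "R1": 0, "C2": 0, "B2": 0}
all_het = {"B6": 1, "R1": 1, "C2": 1, "B2": 1}
b6_only = {"B6": 2, "R1": 0, "C2": 0, "B2": 0}
b6_and_r1 = {"B6": 2, "R1": 2, "C2": 0, "B2": 0}

for name, gt in [("all_ref", all_ref), ("all_het", all_het),
                 ("b6_only", b6_only), ("b6_and_r1", b6_and_r1)]:
    print(name, hipster_index(gt), hipster_label(gt))
```

With this encoding, B6 alone gives an index of 6.5 (label 0), while B6 together with R1 gives 14.5 (label 1). The loci multiply rather than add, which is why the slide calls this a joint-loci association test.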


Try it out: VariantSpark Notebook

https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html
