VariantSpark - a Spark library for genomics

VariantSpark: a library for Genomics

Transformational Bioinformatics | Denis C. Bauer | @allPowerde

Lynn Langit

“Genomical” Big Data

Natalie Twine

Transformational Bioinformatics Team

Denis Bauer, Oscar Luo, Rob Dunne, Piotr Szul, Aidan O’Brien, Laurence Wilson

Adrian White

Mia Champion, Gaetan Burgio

Collaborators

David Levy

Software

Dan Andrews

Kaitao Lai

Kaylene Simpson

Iva Nikolic

Ian Blair

Kelly Williams

BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)

VariantSpark | Denis C. Bauer @allPowerde

Unsupervised ML : K-Means

www.cloudaccess.eu

1000 x 40 Million variants

Matrix *

k-means

Predict superpopulation

414 ethnic groups and superpopulations

* VariantSpark can also process phase 3 data: 3000 individuals and 80 million variants
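As a rough illustration of the clustering step on this slide, here is a minimal pure-Python k-means sketch on a toy genotype matrix. It is not VariantSpark's Spark implementation: the 0/1/2 genotype encoding, the toy data, the deterministic farthest-first initialisation, and the function names `sq_dist` and `kmeans` are all assumptions made for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iterations=10):
    """Cluster rows into k groups with Lloyd's algorithm.

    Initialisation is deterministic (farthest-first traversal starting
    from the first row) so the toy example is reproducible.
    """
    centroids = [list(rows[0])]
    while len(centroids) < k:
        # Pick the row whose nearest existing centroid is farthest away.
        farthest = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(farthest))

    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: each individual joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(r, centroids[c])) for r in rows]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy matrix: one row per individual, one column per variant,
# values 0/1/2 = number of alternate alleles (assumed encoding).
genotypes = [
    [0, 0, 1, 2, 2],
    [0, 1, 0, 2, 2],
    [0, 0, 0, 2, 1],
    [2, 2, 1, 0, 0],
    [2, 1, 2, 0, 0],
    [2, 2, 2, 0, 1],
]
labels = kmeans(genotypes, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In the slide's setting the rows would be roughly 1000 individuals and the columns 40 million variants, which is why the real workload needs a distributed implementation rather than an in-memory loop like this one.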

Comparing K-Means Implementations

[Figure: runtime by method, split into pre-processing, binary-conversion, and clustering stages; bar labels: 103, 75, 29, 28, 18, 4 min]

Supervised ML: Wide Random Forests

Genomic Research Workflow

https://www.projectmine.com/about/

Performance – Faster and More Accurate

VariantSpark is the only method to scale to 100% of the genome

Scaling to 50 M variables and 10 K samples

100K trees: 5 – 50h

AWS: ~$215.50

100K trees: 200 – 2000h

AWS: ~$8620.00

• YARN cluster (12 workers)

• 16 x Intel Xeon E5-2660@2.20GHz CPU

• 128 GB of RAM

• Spark 1.6.1 on YARN

• 128 executors

• 6GB / executor (0.75TB)

• Synthetic dataset (mtry = 0.25)
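The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below makes two assumptions that the slide does not state: that `mtry = 0.25` is the fraction of variables tried at each split, and that each AWS cost pairs with the upper bound of the runtime range listed just above it.

```python
# Executor memory: 128 executors at 6 GB each.
executors = 128
gb_per_executor = 6
total_gb = executors * gb_per_executor   # 768 GB
total_tb = total_gb / 1024               # 0.75 TB, matching the slide

# Assumption: mtry = 0.25 is a fraction of the 50 M variables.
variables = 50_000_000
mtry_fraction = 0.25
variables_per_split = int(variables * mtry_fraction)  # 12,500,000

# Assumption: each quoted cost corresponds to the upper runtime bound.
rate_short = 215.50 / 50      # implied dollars per hour for the 5-50 h run
rate_long = 8620.00 / 2000    # implied dollars per hour for the 200-2000 h run
cost_ratio = 8620.00 / 215.50

print(total_gb, total_tb, variables_per_split)
print(round(rate_short, 2), round(rate_long, 2), round(cost_ratio, 1))
```

Under those assumptions both runs imply the same hourly cluster rate, so the 40-fold cost difference follows directly from the 40-fold runtime difference.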

Whole Genome Range

GWAS Range

Databricks & VariantSpark via a Jupyter notebook

Solving Important Questions…Cancer genomics?

DEMO: Who is a Hipster?

• Quickly access a managed Spark cluster - AWS EC2 / spot instances

• Link to your data and perform whole genome analysis in real-time

VariantSpark & Databricks Notebooks

Jupyter Notebook

Joint-loci association test

Hipster-Index = ((2 + GT[B6]) * (1.5 + GT[R1])) + ((0.5 + GT[C2]) * (1 + GT[B2]))

Label = 1 if Hipster-Index>10
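The synthetic phenotype above can be sketched in a few lines of Python. The formula and the threshold of 10 come from the slide; the 0/1/2 genotype encoding and the function names `hipster_index` and `hipster_label` are assumptions made for illustration.

```python
def hipster_index(gt):
    """Compute the Hipster-Index from genotypes at the four loci.

    `gt` maps locus name to genotype; values are assumed to be
    0, 1 or 2 (number of alternate alleles).
    """
    return ((2 + gt["B6"]) * (1.5 + gt["R1"])) + ((0.5 + gt["C2"]) * (1 + gt["B2"]))


def hipster_label(gt, threshold=10):
    """Return 1 if the Hipster-Index exceeds the threshold, else 0."""
    return 1 if hipster_index(gt) > threshold else 0


# Hypothetical individuals.
all_ref = {"B6": 0, "R1": 0, "C2": 0, "B2": 0}
all_het = {"B6": 1, "R1": 1, "C2": 1, "B2": 1}
two_het = {"B6": 1, "R1": 1, "C2": 0, "B2": 0}

print(hipster_index(all_ref), hipster_label(all_ref))  # 3.5 0
print(hipster_index(all_het), hipster_label(all_het))  # 10.5 1
print(hipster_index(two_het), hipster_label(two_het))  # 8.0 0
```

Because the index multiplies terms from pairs of loci, no single locus determines the label on its own, which is what makes this a joint-loci association test case.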

Genomic profile Label

Try it out: VariantSpark Notebook

https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html
