Post on 21-Jan-2018
VariantSpark: a library for Genomics
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Lynn Langit
“Genomical” Big Data
Natalie Twine
Transformational Bioinformatics Team
Denis Bauer, Oscar Luo, Rob Dunne, Piotr Szul, Aidan O’Brien, Laurence Wilson
Adrian White
Mia Champion, Gaetan Burgio
Collaborators
David Levy
News
Software
Dan Andrews
Kaitao Lai
Kaylene Simpson
Iva Nikolic
Ian Blair
Kelly Williams
BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)
Cited: 4
VariantSpark | Denis C. Bauer @allPowerde
Unsupervised ML : K-Means
www.cloudaccess.eu
1000 x 40 Million variants
Matrix *
k-means
Predict superpopulation
414 ethnic groups and superpopulations
* VariantSpark can also process phase 3 data: 3000 individuals and 80 million variants
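The clustering step above can be illustrated on a toy genotype matrix. This is a minimal sketch in plain Python, not VariantSpark's implementation. It assumes genotypes are encoded as 0/1/2 alternate-allele counts, and the farthest-first initialisation and the tiny matrix are choices made for this example only.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two genotype vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(rows, k):
    """Deterministic seeding: start at row 0, then repeatedly add the row
    farthest from all centroids chosen so far (an illustrative choice)."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        nxt = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(rows, k, iterations=20):
    """Lloyd's k-means; returns one cluster label per row."""
    centroids = farthest_first(rows, k)
    labels = []
    for _ in range(iterations):
        # Assignment step: each individual joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(row, centroids[c]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy matrix: rows are individuals, columns are variants.
genotypes = [
    [0, 0, 1, 2, 2],
    [0, 1, 1, 2, 2],
    [0, 0, 0, 2, 1],
    [2, 2, 1, 0, 0],
    [2, 1, 2, 0, 0],
    [1, 2, 2, 0, 1],
]
labels = kmeans(genotypes, k=2)
print(labels)
```

At the scale on the slide (1000 individuals by 40 million variants) the matrix would not fit this in-memory approach, which is the motivation for a distributed implementation.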
Comparing K-Means Implementations
[Figure: bar chart of time in seconds (axis 0 to 2000) by method (Python, R, Hadoop, Adam, ADMIXTURE, VariantSpark), broken down by task (binary-conversion, clustering, pre-processing). Bar labels: 103, 75, 29, 28, 18, 4 min.]
Supervised ML: Wide Random Forests
Genomic Research Workflow
https://www.projectmine.com/about/
Focus
Performance: Faster and More Accurate
VariantSpark is the only method to scale to 100% of the genome
Scaling to 50 M variables and 10 K samples
100K trees: 5 – 50h
AWS: ~$215.50
100K trees: 200 – 2000h
AWS: ~$8620.00
• Yarn Cluster (12 workers)
• 16 x Intel Xeon E5-2660@2.20GHz CPU
• 128 GB of RAM
• Spark 1.6.1 on YARN
• 128 executors
• 6GB / executor (0.75TB)
• Synthetic dataset (mtry = 0.25)
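The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative, and it makes no claim about which runtime range belongs to which method.

```python
# Cluster memory: 128 executors at 6 GB each.
executors = 128
gb_per_executor = 6
total_gb = executors * gb_per_executor   # total executor memory in GB
total_tb = total_gb / 1024               # same figure in TB

# Runtime ranges for 100K trees, in hours, as quoted on the slide.
fast_range_h = (5, 50)
slow_range_h = (200, 2000)
runtime_ratios = [slow / fast for slow, fast in zip(slow_range_h, fast_range_h)]

# Quoted AWS cost estimates in dollars.
low_cost = 215.50
high_cost = 8620.00
cost_ratio = high_cost / low_cost

print(total_gb, total_tb)    # executor memory in GB and TB
print(runtime_ratios)        # ratio at each end of the ranges
print(cost_ratio)            # ratio of the two cost estimates
```

The memory total matches the 0.75TB on the slide, and the two cost estimates differ by the same factor as the two runtime ranges.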
Whole Genome Range
GWAS Range
Databricks & VariantSpark via a Jupyter notebook
Solving Important Questions… Cancer genomics?
DEMO: Who is a Hipster?
• Quickly access a managed Spark cluster - AWS EC2 / spot instances
• Link to your data and perform whole genome analysis in real-time
VariantSpark & Databricks Notebooks
Jupyter Notebook
Joint-loci association test
Hipster-Index = ((2 + GT[B6]) * (1.5 + GT[R1])) + ((0.5 + GT[C2]) * (1 + GT[B2]))
Label = 1 if Hipster-Index>10
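The synthetic label can be computed directly from the formula above. This is a minimal sketch that assumes each genotype GT is encoded as 0, 1 or 2 (the count of alternate alleles) and is passed as a dictionary keyed by the variant names on the slide. The function names are made up for this example.

```python
def hipster_index(gt):
    """Evaluate the Hipster-Index formula for one sample.

    `gt` maps the variant names B6, R1, C2 and B2 to genotype values,
    assumed here to be 0, 1 or 2.
    """
    return ((2 + gt["B6"]) * (1.5 + gt["R1"])) + ((0.5 + gt["C2"]) * (1 + gt["B2"]))


def hipster_label(gt, threshold=10):
    """Label = 1 if the Hipster-Index exceeds the threshold, else 0."""
    return 1 if hipster_index(gt) > threshold else 0


# Hypothetical samples covering both labels.
samples = {
    "all_ref": {"B6": 0, "R1": 0, "C2": 0, "B2": 0},
    "all_alt": {"B6": 2, "R1": 2, "C2": 2, "B2": 2},
    "mixed":   {"B6": 1, "R1": 1, "C2": 0, "B2": 0},
}
for name, gt in samples.items():
    print(name, hipster_index(gt), hipster_label(gt))
```

Because each product multiplies two genotypes, no single variant determines the label on its own. That makes the phenotype a test case for methods that detect interacting loci rather than single-variant effects.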
[Figure: samples (n=2500), each shown with its genomic profile and label]
Try it out: VariantSpark Notebook
https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html