VariantSpark: applying Spark-based machine learning methods to genomic information
![Page 1: VariantSpark: applying Spark-based machine learning methods to genomic information](https://reader036.fdocuments.us/reader036/viewer/2022062523/58ed035a1a28ab443a8b4689/html5/thumbnails/1.jpg)
Denis C. Bauer | Bioinformatics | @allPowerde | 20 Nov 2015
CSIRO HEALTH & BIOSECURITY
VariantSpark: applying Spark-based machine learning methods to genomic information
By Tim Cooper
![Page 2: VariantSpark: applying Spark-based machine learning methods to genomic information](https://reader036.fdocuments.us/reader036/viewer/2022062523/58ed035a1a28ab443a8b4689/html5/thumbnails/2.jpg)
Talk Overview
VariantSpark | Denis C. Bauer @allPowerde | Page 2
• Background: Why is genomics so important for medicine
• VariantSpark: Overview
• Whole Genome Analysis: Clustering samples by ethnicity
![Page 3: VariantSpark: applying Spark-based machine learning methods to genomic information](https://reader036.fdocuments.us/reader036/viewer/2022062523/58ed035a1a28ab443a8b4689/html5/thumbnails/3.jpg)
Genome sequencing improves diagnostics
Genomic sequencing can lead to a successful diagnosis in up to 50% of cases where traditional genetic testing failed, and is on average 96% cheaper
Oncology: Identifying tumours by their genome-wide mutation profiles
Tandem duplications
Rare genetic disorders: Identifying causative mutations by interrogating all abnormal variants
http://matt.might.net
Bauer et al. Trends Mol Med. 2014 PMID: 24801560
![Page 4: VariantSpark: applying Spark-based machine learning methods to genomic information](https://reader036.fdocuments.us/reader036/viewer/2022062523/58ed035a1a28ab443a8b4689/html5/thumbnails/4.jpg)
Generating data from 1 Million Americans
Australia: ~$100 million dedicated to clinical genomics
• $25 million Australian Genomics Health Alliance (NHMRC Grant AI Denis)
• VIC and QLD Alliances ($25 million each)
• NSW and ACT (undisclosed $$ to Garvan and JCSMR)
![Page 5: VariantSpark: applying Spark-based machine learning methods to genomic information](https://reader036.fdocuments.us/reader036/viewer/2022062523/58ed035a1a28ab443a8b4689/html5/thumbnails/5.jpg)
Genomics projects are getting bigger
• Human genome: ~1 sample
• The HapMap Project: 270 samples, 2002
• 1000 Genomes Project: 1097 samples, 2012
• The Cancer Genome Atlas: 11,000 samples, 2015
• 100,000 Genomes Project: 70,000 individuals by 2017
• Project MinE: 15,000 people with ALS
• ASPREE: 4000 healthy 70+ year olds
Single samples are around 200GB in size
![Page 6: VariantSpark: applying Spark-based machine learning methods to genomic information](https://reader036.fdocuments.us/reader036/viewer/2022062523/58ed035a1a28ab443a8b4689/html5/thumbnails/6.jpg)
Last Year
![Page 7: VariantSpark: applying Spark-based machine learning methods to genomic information](https://reader036.fdocuments.us/reader036/viewer/2022062523/58ed035a1a28ab443a8b4689/html5/thumbnails/7.jpg)
Data Analysis categories for genomics
• Basic Production Informatics: Produce raw sequence reads
• Advanced Production Informatics: Map to genome and generate raw genomic features (e.g. SNPs)
• Bioinformatics Research: Analyze the data; uncover the biological meaning
![Page 8: VariantSpark: applying Spark-based machine learning methods to genomic information](https://reader036.fdocuments.us/reader036/viewer/2022062523/58ed035a1a28ab443a8b4689/html5/thumbnails/8.jpg)
VariantSpark
MLlib*
VCF
VariantSpark is the interface enabling Spark’s MLlib machine learning algorithms to be applied to genomics data
e.g. grouping samples by genomic profile
Input Genomics Application Result
Large scale compute
* VariantSpark also uses Spark.ML
![Page 9: VariantSpark: applying Spark-based machine learning methods to genomic information](https://reader036.fdocuments.us/reader036/viewer/2022062523/58ed035a1a28ab443a8b4689/html5/thumbnails/9.jpg)
VariantSpark
Accepted BMC Genomics (IF=4)
![Page 10: VariantSpark: applying Spark-based machine learning methods to genomic information](https://reader036.fdocuments.us/reader036/viewer/2022062523/58ed035a1a28ab443a8b4689/html5/thumbnails/10.jpg)
Cluster individuals into ethnic groups based on their genomic profiles
www.cloudaccess.eu
1000 individuals x 40 million variants matrix *
k-means
Predict super population
14 ethnic groups and 4 super populations
* VariantSpark can also process phase 3 data: 3000 individuals and 80 million variants
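The clustering step on this slide can be illustrated with a small, self-contained sketch. The talk runs k-means through Spark's MLlib; the code below is not that implementation, only plain Lloyd's algorithm in pure Python. The function names, the naive "first k points" initialisation, and the four toy genotype vectors (counts of non-reference alleles per variant) are all assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm on dense lists of numbers (illustrative only)."""
    # Assumption: naive initialisation that uses the first k points as centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each sample goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy data: 4 samples x 4 variants, values are alternate-allele counts (0, 1, 2).
samples = [
    [0, 0, 2, 2],
    [2, 2, 0, 0],
    [0, 1, 2, 2],
    [2, 2, 0, 1],
]
labels, centroids = kmeans(samples, k=2)
print(labels)  # [0, 1, 0, 1]
```

Every assignment step compares each sample against each centroid across all variants, which is the cost that grows with the number of variants on a whole-genome matrix.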
![Page 11: VariantSpark: applying Spark-based machine learning methods to genomic information](https://reader036.fdocuments.us/reader036/viewer/2022062523/58ed035a1a28ab443a8b4689/html5/thumbnails/11.jpg)
Clustering result
• Adjusted Rand index (ARI) = 0.84, where 1 is a perfect match, values near 0 indicate independent (chance-level) labeling, and negative values are possible
• The majority of American (AMR) individuals are placed in the same group as Europeans (EUR), likely reflecting their migrational backgrounds.
• ADMIXTURE (state-of-the-art tool for population structure determination) returns a low ARI of 0.25
Admixture: Alexander, D.H., Novembre, J., Lange, K.: Fast model-based estimation of ancestry in unrelated individuals. Genome Res 19(9), 1655–1664 (2009)
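For readers unfamiliar with the score quoted above, the following is a minimal sketch of the adjusted Rand index computed from pair counts. It is a standard textbook formulation, not code taken from VariantSpark or ADMIXTURE; the function name and the toy label lists are assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from pair counts (illustrative implementation)."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treat as trivially matching
    # Count pairs of samples that share a cell, a true class, or a cluster.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(v, 2) for v in cells.values())
    sum_true = sum(comb(v, 2) for v in Counter(labels_true).values())
    sum_pred = sum(comb(v, 2) for v in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (sum_cells - expected) / (max_index - expected)


# Same grouping with renamed cluster ids: perfect agreement.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
# Grouping that splits every true class: worse than chance, so negative.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5
```

Because the score subtracts the agreement expected by chance, cluster ids do not need to match the population labels, only the grouping does.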
![Page 12: VariantSpark: applying Spark-based machine learning methods to genomic information](https://reader036.fdocuments.us/reader036/viewer/2022062523/58ed035a1a28ab443a8b4689/html5/thumbnails/12.jpg)
Comparison to other implementations
• Preprocessing: converting location-centric VCF genotypes into sample-centric numerical vectors
• Clustering: k-means
• ADAM (BigData Genomics): Spark implementation with dense matrix
• Hadoop: MapReduce without in-memory caching
Chromosome 22; VM on Microsoft Azure with A7 Linux instance and 8 cores, 56GB memory running Ubuntu.
[Runtime bar chart; values shown, in minutes: 103, 75, 29, 28, 18, 4]
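The preprocessing step listed above (location-centric VCF rows turned into sample-centric numerical vectors) can be sketched as follows. This is a hedged illustration, not VariantSpark's code: the helper name `vcf_to_sample_vectors`, the toy VCF text, and the encoding (number of non-reference alleles per genotype, with zero entries left out) are assumptions.

```python
from collections import defaultdict


def vcf_to_sample_vectors(vcf_text):
    """Turn VCF rows (one per variant) into one sparse vector per sample.

    Hypothetical helper: a genotype such as "0|1" is encoded as the number
    of non-reference alleles; zero entries are not stored.
    """
    samples = []
    vectors = defaultdict(dict)  # sample name -> {variant index: allele count}
    variant_index = 0
    for line in vcf_text.splitlines():
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            continue
        for sample, call in zip(samples, fields[9:]):
            genotype = call.split(":")[0]  # assumes GT is the first FORMAT key
            alleles = genotype.replace("|", "/").split("/")
            count = sum(1 for a in alleles if a not in ("0", "."))
            if count:
                vectors[sample][variant_index] = count
        variant_index += 1
    return {s: vectors[s] for s in samples}, variant_index


# Toy input: two variants on chromosome 22 for two made-up samples.
TOY_VCF = "\n".join([
    "##fileformat=VCFv4.1",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "S1", "S2"]),
    "\t".join(["22", "100", ".", "A", "G", ".", "PASS", ".", "GT", "0|0", "0|1"]),
    "\t".join(["22", "200", ".", "C", "T", ".", "PASS", ".", "GT", "1|1", "0|0"]),
])
vectors, n_variants = vcf_to_sample_vectors(TOY_VCF)
print(vectors)     # {'S1': {1: 2}, 'S2': {0: 1}}
print(n_variants)  # 2
```

Storing only non-zero entries is one way to avoid the dense matrix that the slide attributes to the ADAM implementation; whether VariantSpark uses exactly this encoding is not stated in the slides.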
![Page 13: VariantSpark: applying Spark-based machine learning methods to genomic information](https://reader036.fdocuments.us/reader036/viewer/2022062523/58ed035a1a28ab443a8b4689/html5/thumbnails/13.jpg)
Scaling VariantSpark to the whole genome
• Pre-processing: scales seamlessly as processes are independent
• Clustering: memory consumption increases linearly with the number of variants (24GB) due to additional distance measurements between variants and k-means centroids
• As total memory was the limiting factor on our infrastructure, the number of simultaneously used nodes had to be reduced, which increased runtime.
CSIRO Spark Cluster: Whole genome; Hadoop 2.5.0, managed by Cloudera's CDH 5. We use Spark 1.3.1. This 13-node cluster has a total of 416 cores and 1.22TB memory.
![Page 14: VariantSpark: applying Spark-based machine learning methods to genomic information](https://reader036.fdocuments.us/reader036/viewer/2022062523/58ed035a1a28ab443a8b4689/html5/thumbnails/14.jpg)
Three things to remember
• VariantSpark is an interface bringing bigLearning tasks to genomics applications
• VariantSpark can cluster 3000 individuals and 80 million variants in under 30 hours using minimal memory (24GB), a task that is not possible in R/Python/ADMIXTURE due to memory limits.
• VariantSpark outperforms ADAM (Big Data Genomics) and an equivalent Hadoop implementation by almost an order of magnitude.
https://github.com/BauerLab/VariantSpark
![Page 15: VariantSpark: applying Spark-based machine learning methods to genomic information](https://reader036.fdocuments.us/reader036/viewer/2022062523/58ed035a1a28ab443a8b4689/html5/thumbnails/15.jpg)
HEALTH AND BIOSECURITY
Thank you
Health & Biosecurity
Denis C. Bauer
t +61 2 9123 4567
e [email protected]
aehrc.com/biomedical-informatics/transformational-bioinformatics/
More talks online: http://www.slideshare.net/allPowerde
Twitter: @allPowerde
Transformational Bioinformatics Team, CSIRO: Aidan O’Brien, Bill Wilson
Former members: Firoz Anwar, Neil Saunders
Newcastle University: Rodney Scott
Data61 CSIRO, Australia: Piotr Szul, Gi Guo, Robert Dunne
Funding: National Health and Medical Research Council; National Breast Cancer Foundation; CSIRO’s Transformational Capability Platform; CSIRO’s IM&T; Science and Industry Endowment Fund
Buske et al., Bioinformatics Jan 2014
O’Brien et al., BMC Genomics Dec 2015
Dunne et al., in preparation
FullySIC: Epistatic Gene Network modelling, in preparation
Anwar et al., in preparation
GOdistinct: GO Enrichment or genesets with distinctive function