Algorithms and Tools for Genomic Analysis on Spark: Spark Summit East talk by Ryan Williams
Algorithms & Tools for Genomic Analysis on Spark
Ryan Williams
Hammer Lab @ Mt. Sinai School of Medicine
http://bit.ly/sse2017

Agenda
• Intro
• Genomics crash course
• Genomics applications
• Fun with magic-rdds
• Scala/Spark project mgmt notes
• Questions

Hammer Lab
• Est. 2013 @ Mt. Sinai School of Medicine
• Initial focus: high-quality bioinformatics/genomics software
  – distributed systems
  – OSS
  – static-typing + functional idioms
• Present-day: cancer immunotherapy research
  – personal-cancer-vaccine clinical trials
  – post-hoc clinical-data analysis
• This talk: ≈3yrs of genomic-analysis tool-building w/ Spark
www.thunderbolts.info

Genome structure makes things difficult
• Excessive repetitiveness
  – 20% retrotransposons
  – L1: 7000bp, 100k copies
  – Pseudogenes
• Impossible to resolve with “short reads”

Digitizing Human Genomes
• 1 genome ≈ 3B base-pairs
• Theory:
  – “2 bits per base-pair” (A, C, G, T)
  – ⇒ 1 genome ≈ 750MB
  – <1% unique, person to person
  – 7BN genomes ≈ 50PB?
• Reality:
  – 1BN 100bp “reads”
  – ⇒ 100BN sequenced bases
  – Cover the genome at average depth 30 (“30x coverage”)
  – 2-bit base, 1-byte quality score ⇒ 100GB / genome
  – 100-100k genomes ⇒ 10TB-10PB
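
The storage figures on this slide can be reproduced with a few lines of arithmetic. The sketch below is not from the talk: the constant and function names are hypothetical, decimal units (1 MB = 10^6 bytes) are assumed, and "<1% unique" is taken as an upper bound of exactly 1%.

```python
# Back-of-envelope storage estimates from the figures quoted on the slide.
# Assumptions (not from the talk): decimal units, and "<1% unique" read as 1%.

GENOME_BP = 3_000_000_000      # ~3B base-pairs per genome
BITS_PER_BASE = 2              # A, C, G, T fit in 2 bits
UNIQUE_PERCENT = 1             # upper bound on person-to-person variation
POPULATION = 7_000_000_000     # 7BN genomes

READS = 1_000_000_000          # 1BN reads per sequenced genome
READ_LEN = 100                 # 100bp per read
QUALITY_BITS = 8               # 1-byte quality score per sequenced base


def theory_bytes_per_genome():
    """Bytes needed to store one genome at 2 bits per base-pair."""
    return GENOME_BP * BITS_PER_BASE // 8


def theory_bytes_population():
    """Bytes needed if only the unique fraction is stored per person."""
    return theory_bytes_per_genome() * UNIQUE_PERCENT // 100 * POPULATION


def sequenced_bases():
    """Total bases produced by one sequencing run."""
    return READS * READ_LEN


def coverage():
    """Average number of reads covering each position of the genome."""
    return sequenced_bases() / GENOME_BP


def reality_bytes_per_genome():
    """Bytes for all sequenced bases plus their quality scores."""
    return sequenced_bases() * (BITS_PER_BASE + QUALITY_BITS) // 8


if __name__ == "__main__":
    print("theory, one genome (MB):", theory_bytes_per_genome() / 1e6)
    print("theory, population (PB):", theory_bytes_population() / 1e15)
    print("coverage (x):", round(coverage(), 1))
    print("reality, one genome (GB):", reality_bytes_per_genome() / 1e9)
```

Under these assumptions the theoretical numbers come out at 750MB per genome and about 52PB for 7BN people, matching the slide. The sequencing numbers come out slightly above the slide's rounded values: roughly 33x coverage and 125GB per genome, which the slide states as 30x and 100GB.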