Dipping into Guacamole - Meetupfiles.meetup.com/20073565/Guacamole Big Data Genetics...Guacamole...

Dipping into Guacamole Tim O’Donnell & Ryan Williams NYC Big Data Genetics Meetup Aug 11, 2016


Page 1

Dipping into Guacamole

Tim O’Donnell & Ryan Williams

NYC Big Data Genetics Meetup

Aug 11, 2016

Page 2

Who we are: Hammer Lab

● Computational lab in the Department of Genetics and Genomic Sciences at Mount Sinai

● Principal investigator: Jeff Hammerbacher

● Focus on informatics for cancer immunotherapy

● Software developed at github.com/hammerlab

Page 3

Guacamole

● Defines abstractions for working with aligned short reads on Apache Spark

● Two somatic callers implemented:

○ Somatic Joint Caller makes use of multiple DNA and RNA samples from the same patient

○ Assembly Caller performs local reassembly for improved indel detection

● Calls variants on 250X whole genome tumor/normal samples in 30 minutes on our 100-node Hadoop cluster

● Callers are still primitive; validation and tuning are ongoing

Page 4

Talk overview

● Background on DNA, RNA, and sequencing

● Guacamole infrastructure

● Variant calling on Guacamole

● Preliminary results

● Final thoughts

Page 5

Image source: https://midhathsblog.wordpress.com/2012/04/24/a-complex-mechanism-preventing-mutation-of-important-cells/

Page 6

Applications of cancer sequencing

● Understand the biology and find new drug targets

● Identify subtypes with similar prognosis or treatment response

● Understand the non-cancer cells surrounding the tumor

● Trace how metastases spread

● Clinical

○ Diagnose

○ Select targeted therapies

Page 7

Page 8

Page 9

Therapeutic vaccine

Image source: http://www.ncbi.nlm.nih.gov/pubmed/24777245

Page 10

DNA, RNA, and sequencing

Page 11

Image source: https://en.wikipedia.org/wiki/DNA#/media/File:DNA_Structure%2BKey%2BLabelled.pn_NoBB.png

Page 12

Image source: http://www-tc.pbs.org/wgbh/nova/assets/img/ten-great-advances-evolution/image-05-large.jpg

Small variation

Page 13

Next generation sequencing

● Pyrosequencing (454)

● Sequencing by ligation (SOLiD)

● Semiconductor (Ion Torrent)

● Bridge amplification + reversible terminator (Illumina)

● Single molecule real time (PacBio)

● Nanopore sequencing (Oxford Nanopore)

Page 14

Image source: https://www.sec.gov/Archives/edgar/data/913275/000095012306014236/x27170e425.htm

Solexa / Illumina

Page 15

Image source:

Base calling

Page 16

Read mapping

Page 17

Somatic Variant Calling

Page 18

Agreement on somatic variant calls across tools is surprisingly poor

Krøigård, A.B. et al., 2016. Evaluation of Nine Somatic Variant Callers for Detection of Somatic Mutations in Exome and Targeted Deep Sequencing Data. PLOS ONE, 11(3), p.e0151664. Available at: http://dx.plos.org/10.1371/journal.pone.0151664.

Page 19

Guacamole infrastructure

Page 20

Apache Spark

● Generalization of MapReduce

● Much more expressive than MapReduce

● Can work on data iteratively in memory

https://www.mapr.com/blog/getting-started-spark-web-ui

Page 21

Existing work

● GATK uses a custom built-in MapReduce framework, but cannot take advantage of co-locating compute and data

● The ADAM project is also working on tools similar to Guacamole

https://software.broadinstitute.org/gatk/guide/article?id=1988

Page 22

Guacamole infrastructure

● Distributed implementations of functionality used by most variant callers

● High level: build “pileups”, call variants

Page 23

Pileups

A pileup gives the read bases aligned to a particular genomic position

Reference sequence:

A C T G G G C A A A A A A C T T G G C

Reads:

A C T G G G C A A A A A A C T T G
  C T G G G C A A A A A A T T T G G C
                  C A A A T T T G G C
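The pileup idea can be sketched in a few lines of Python. This is only an illustration, not Guacamole's implementation: the `Read` type and the `pileup` function are hypothetical names, reads are assumed to align without indels, and the start offsets (0, 1, and 9 relative to the first reference base) are assumed so that the three reads line up with the reference as in the figure.

```python
from typing import List, NamedTuple


class Read(NamedTuple):
    start: int  # offset of the first aligned base (assumed, 0-based)
    bases: str  # aligned bases; indels are ignored in this sketch


def pileup(reads: List[Read], position: int) -> List[str]:
    """Collect the base that each overlapping read shows at `position`."""
    column = []
    for read in reads:
        offset = position - read.start
        if 0 <= offset < len(read.bases):
            column.append(read.bases[offset])
    return column


# The three reads from the slide, with assumed start offsets.
reads = [
    Read(0, "ACTGGGCAAAAAACTTG"),
    Read(1, "CTGGGCAAAAAATTTGGC"),
    Read(9, "CAAATTTGGC"),
]

print(pileup(reads, 9))
```

Under these assumed offsets, position 9 is covered by all three reads, and the third read disagrees with the other two there.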

Page 24

A typical Guacamole analysis

1. Partition the genome

2. Partition reads according to (1)

3. Build pileups at each site

4. Apply user-supplied function at each pileup

5. Write output

Page 25

Coverage

Beware of the long tail! Max depth is often >100k

Page 26

Step 1: Partition the genome

● Partition the genome into intervals, balancing the number of reads overlapping each partition

● Each interval will correspond to one Spark partition

Chromosome 1 Chromosome 2 Chromosome …

Partition 1 Partition 2 … Partition N

Extremely high depth (>500k) regions are removed
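One way to picture Step 1 is a greedy pass over per-bin read counts. The sketch below is an assumption about how such balancing could work, not the algorithm Guacamole uses: `partition_genome`, the fixed-size bins, and the example counts are all hypothetical. Bins whose count exceeds `max_count` are left out of every interval, mirroring the removal of extremely high depth regions.

```python
from typing import List, Tuple


def partition_genome(
    bin_counts: List[int], bin_size: int, num_partitions: int, max_count: int
) -> List[Tuple[int, int, int]]:
    """Greedily cut half-open intervals (start, end, read_count) of similar read count."""
    kept = [c for c in bin_counts if c <= max_count]
    target = max(1.0, sum(kept) / num_partitions)
    intervals = []
    start, running = None, 0
    for i, count in enumerate(bin_counts):
        if count > max_count:
            # Excluded bin: close any open interval just before it.
            if start is not None:
                intervals.append((start, i * bin_size, running))
                start, running = None, 0
            continue
        if start is None:
            start = i * bin_size
        running += count
        if running >= target:
            intervals.append((start, (i + 1) * bin_size, running))
            start, running = None, 0
    if start is not None:
        intervals.append((start, len(bin_counts) * bin_size, running))
    return intervals


# Hypothetical read counts per 100-base bin; the sixth bin is a depth spike.
counts = [20, 20, 40, 10, 30, 600000, 25, 15]
intervals = partition_genome(counts, bin_size=100, num_partitions=3, max_count=500000)
print(intervals)
```

A greedy cut like this only approximates balance, and the number of intervals can differ from `num_partitions` when excluded bins force extra breaks.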

Page 27

Step 2: Partition reads

● All-to-all shuffle of reads based on the genomic partition in Step 1

● A copy of each read goes to each partition it overlaps

Chromosome 1, positions 1000-1500

Read 1 Read 2 Read 3

Partition 1 Partition 2

(Partition 1, Read 1) (Partition 1, Read 2)

(Partition 2, Read 2) (Partition 2, Read 3)

Reads overlapping multiple partitions are duplicated
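The duplication rule in Step 2 can be sketched as a function that emits one (partition ID, read) pair per overlapped partition. The names `assign_partitions`, `partitions`, and the read coordinates are assumptions chosen to reproduce the pairs shown on the slide; in Guacamole the pairs would then be shuffled by Spark, which this sketch does not model.

```python
from typing import Iterator, List, Tuple

Interval = Tuple[int, int]        # half-open [start, end)
NamedRead = Tuple[str, int, int]  # (name, start, end), half-open


def assign_partitions(
    reads: List[NamedRead], partitions: List[Interval]
) -> Iterator[Tuple[int, str]]:
    """Yield (partition ID, read name) for every partition a read overlaps."""
    for name, start, end in reads:
        for pid, (pstart, pend) in enumerate(partitions, start=1):
            if start < pend and end > pstart:
                yield pid, name


# Assumed coordinates: two partitions covering chr1:1000-1500.
partitions = [(1000, 1250), (1250, 1500)]
reads = [("Read 1", 1050, 1150), ("Read 2", 1200, 1300), ("Read 3", 1350, 1450)]

pairs = list(assign_partitions(reads, partitions))
print(pairs)
```

Read 2 spans the boundary at 1250, so it appears once for each partition.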

Page 28

Step 3: Each partition streams through reads to generate pileups

Partition 1

A C T G G G C A A A A A A C T T G G C
A C T G G G C A A A A A A C T T G
  C T G G G C A A A A A A T T T G G C
                  C A A A T T T G G C
                          C T T G G C

Sorted reads: Read at chr1:1200, Read at chr1:1201, Read at chr1:1209, Read at chr1:1213, ...

Pileup at chr1:1209

f(chr1:1209, [A,A,C])

User-supplied function f is called on each pileup
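Streaming through sorted reads can be sketched with a generator that keeps only the reads overlapping the current position. This is a simplified illustration under stated assumptions: `stream_pileups` and `f` are hypothetical, reads are (start, bases) tuples without indels, and the example `f` simply reports depth.

```python
from typing import Iterable, Iterator, List, Tuple

Read = Tuple[int, str]  # (start position, aligned bases); indels ignored


def stream_pileups(
    sorted_reads: Iterable[Read], start: int, end: int
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (position, bases) for each covered position in [start, end)."""
    reads = iter(sorted_reads)
    pending = next(reads, None)
    active: List[Read] = []
    for pos in range(start, end):
        # Drop reads that end before the current position.
        active = [(s, b) for s, b in active if s + len(b) > pos]
        # Pull in reads that start at or before the current position.
        while pending is not None and pending[0] <= pos:
            if pending[0] + len(pending[1]) > pos:
                active.append(pending)
            pending = next(reads, None)
        if active:
            yield pos, [b[pos - s] for s, b in active]


def f(pos: int, bases: List[str]) -> Tuple[int, int]:
    """Example user-supplied function: report the depth at each site."""
    return pos, len(bases)


reads_in_partition = [
    (1200, "ACTGGGCAAAAAACTTG"),
    (1201, "CTGGGCAAAAAATTTGGC"),
    (1209, "CAAATTTGGC"),
    (1213, "CTTGGC"),
]

pileups = dict(stream_pileups(reads_in_partition, 1200, 1219))
depths = dict(f(pos, bases) for pos, bases in stream_pileups(reads_in_partition, 1200, 1219))
print(pileups[1209], depths[1209])
```

Because the generator holds only the currently overlapping reads, the pileups are never stored as a collection of their own.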

Page 29

Implementation notes

● Read partitioning is accomplished with Spark’s repartitionAndSortWithinPartitions, which sends (partition ID, read) pairs to each Spark partition and sorts them by start position

● We never materialize pileups into RDDs; we just stream through reads and build pileups on the fly

Page 30

Other features of the framework

● Sliding windows: for each site, expose all reads within some margin in each direction. Used instead of pileups by more sophisticated callers

● “Regions”: generalization of reads. Things such as variants align to the genome just like a read and can be worked with similarly
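Both ideas can be sketched together: a window query that works on anything with genomic coordinates. The class names `Region`, `Read`, and `Variant` and the function `window` are hypothetical stand-ins, not Guacamole's types.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Region:
    contig: str
    start: int  # inclusive
    end: int    # exclusive


@dataclass(frozen=True)
class Read(Region):
    bases: str = ""


@dataclass(frozen=True)
class Variant(Region):
    alt: str = ""


def window(regions: Sequence[Region], contig: str, site: int, margin: int) -> List[Region]:
    """Return every region overlapping [site - margin, site + margin]."""
    lo, hi = site - margin, site + margin + 1
    return [r for r in regions if r.contig == contig and r.start < hi and r.end > lo]


regions = [
    Read("chr1", 1200, 1217, "ACTGGGCAAAAAACTTG"),
    Read("chr1", 1300, 1350),
    Variant("chr1", 1213, 1214, "T"),
]

nearby = window(regions, "chr1", site=1209, margin=5)
print(nearby)
```

The same `window` call returns the read and the variant because both are treated as regions.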

Page 31

Variant calling on Guacamole

Page 32

Somatic joint caller motivation

For many studies we have multiple tumor samples from the same patient

● DNA and RNA

● Metastases

● Pre- and post-treatment, relapse, etc.

Example: the Australian Ovarian Cancer Study performed whole genome and RNA sequencing on multiple tumor samples from 12 patients

Page 33

Somatic joint caller

● Calls SNVs and small indels on any number of tumor and normal samples from the same patient

● Also considers RNA-seq when available

● Likelihood and filters are similar to Mutect

Page 34

Mixtures

● Across samples, select a consensus variant allele

● For each sample, define a reference mixture and a variant mixture

Page 35

Likelihoods

● For each sample, using the bases sequenced at the site and their quality scores, compute two likelihoods:

○ Likelihood of the data given the reference mixture

○ Likelihood of the data given the variant mixture
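A common way to compute such a likelihood treats each base as an independent draw from the mixture, with the Phred quality giving the error probability. The sketch below assumes that model; the function name `log_likelihood`, the dictionary representation of a mixture, and the example allele fractions are illustrative choices, not details taken from the slides.

```python
import math
from typing import Dict, Sequence


def log_likelihood(bases: Sequence[str], quals: Sequence[int], mixture: Dict[str, float]) -> float:
    """log10 likelihood of the observed bases given allele fractions in `mixture`."""
    total = 0.0
    for base, qual in zip(bases, quals):
        err = 10 ** (-qual / 10)  # Phred score -> error probability
        # Probability of seeing `base`, summed over the allele it may have come from.
        p = sum(
            frac * ((1 - err) if base == allele else err / 3)
            for allele, frac in mixture.items()
        )
        total += math.log10(p)
    return total


bases = ["A", "A", "C", "C", "A", "C"]
quals = [30] * len(bases)

reference_mixture = {"A": 1.0}           # all reads drawn from the reference allele
variant_mixture = {"A": 0.5, "C": 0.5}   # assumed 50% variant allele fraction

ll_ref = log_likelihood(bases, quals, reference_mixture)
ll_var = log_likelihood(bases, quals, variant_mixture)
print(ll_ref, ll_var)
```

With half of the high-quality bases showing C, the variant mixture explains the data far better than the reference mixture.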

Page 36

Priors and posteriors

Remember Bayes’ rule?

Posterior ∝ Likelihood × Prior

Page 37

A very simple prior

● Given no data, how surprised would we be to find a somatic variant?

Page 38

Triggering

● For each sample we compare the posterior probabilities of the variant and reference mixtures

● We also make a “pooled” sample by combining all the tumor DNA reads across samples, and compute the same quantities for that

● If the variant mixture has a higher probability in any sample or in the pooled sample, we trigger a variant call
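The triggering rule can be sketched as a comparison of unnormalized log posteriors. Everything here is illustrative: `favors_variant` and `triggered_by` are hypothetical names, the log-likelihood pairs are made-up inputs, and the prior of 1e-6 is an assumed value rather than the one Guacamole uses.

```python
import math
from typing import Dict, List, Tuple

# (log10 likelihood under the reference mixture, log10 likelihood under the variant mixture)
LogLiks = Tuple[float, float]


def favors_variant(logliks: LogLiks, prior_var: float) -> bool:
    """True when the variant mixture has the higher posterior."""
    loglik_ref, loglik_var = logliks
    post_ref = loglik_ref + math.log10(1.0 - prior_var)
    post_var = loglik_var + math.log10(prior_var)
    return post_var > post_ref


def triggered_by(samples: Dict[str, LogLiks], pooled: LogLiks, prior_var: float) -> List[str]:
    """Names of the samples (and 'pooled') whose posterior favors the variant."""
    hits = [name for name, ll in samples.items() if favors_variant(ll, prior_var)]
    if favors_variant(pooled, prior_var):
        hits.append("pooled")
    return hits


# Made-up values: weak evidence in each tumor, stronger once reads are pooled.
samples = {"tumor_a": (-4.0, -1.0), "tumor_b": (-4.0, -0.5)}
pooled = (-8.0, -1.5)

hits = triggered_by(samples, pooled, prior_var=1e-6)
print(hits)
```

In this example neither tumor sample overcomes the prior alone, but the pooled sample does, so a call would be triggered.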

Page 39

Incorporating RNA as a change of prior

● RNA alone should not trigger a variant call (e.g. RNA editing)

● RNA support should make us more likely to call a variant

● Currently implemented as a change of prior
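A change of prior could look like the sketch below. The slides do not say how the prior is adjusted, so the multiplicative `rna_boost`, the cap at 0.49, and the function names are all assumptions made for illustration.

```python
import math


def variant_prior(base_prior: float, rna_supports_variant: bool, rna_boost: float = 10.0) -> float:
    """Hypothetical rule: raise the variant prior when RNA reads show the variant."""
    prior = base_prior * rna_boost if rna_supports_variant else base_prior
    # Keep the prior below 0.5 so that RNA support alone can never win the comparison.
    return min(prior, 0.49)


def dna_triggers(loglik_ref: float, loglik_var: float, prior_var: float) -> bool:
    """Compare unnormalized log10 posteriors computed from DNA likelihoods."""
    return loglik_var + math.log10(prior_var) > loglik_ref + math.log10(1.0 - prior_var)


# Made-up DNA log-likelihoods: borderline evidence for the variant.
borderline = (-6.5, -1.0)
# Made-up DNA log-likelihoods: DNA clearly favors the reference.
no_dna_support = (-0.1, -3.0)

print(dna_triggers(*borderline, variant_prior(1e-6, False)))
print(dna_triggers(*borderline, variant_prior(1e-6, True)))
print(dna_triggers(*no_dna_support, variant_prior(1e-6, True)))
```

In this sketch RNA support tips a borderline DNA call over the threshold, while a site with no DNA support stays uncalled even when RNA shows the variant.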

Page 40

Filters

● Likelihood computation is naive: assumes read bases are independent of each other given the underlying mixture

● This is not true, as there are many well-documented sequencing artifacts

● Similarly to Mutect, we apply filters to calls to deal with these artifacts

● Currently implemented filters:

○ Strand bias

○ Multiallelic sites

○ Insufficient normal

○ Poor mapping quality

Page 41

Assembly calling

● We also have an experimental germline caller that does assembly similar to GATK HaplotypeCaller

● Currently testing on Illumina Platinum genomes

● We hope to combine this with joint calling eventually to make something comparable to Mutect 2

Page 42

Results

Page 43

Testing cluster

Hardware

Nodes    100
Cores    2400
Memory   12.5 TB
Storage  3.6 PB

Software

Spark    1.6.1
Hadoop   2.6.0-cdh5.5.1
OS       CentOS 7.2.1511

Page 44

Guacamole speed

Guacamole

2 WGS samples (DREAM Synth4)                  22 minutes
3 whole genome samples (AOCS-034)             31 minutes
10 whole exome samples (PT189)                52 minutes

Mutect

2 WGS samples (DREAM Synth4), chromosome 1    158 minutes

We compare to chromosome 1 single-node runtime because Mutect is typically parallelized by chromosome, and chromosome 1 takes the longest

Page 45

Genome partitioning

DREAM Synth4 dataset

Usually drop about 10 kilobases of sequence due to extremely high depth

Page 46

Benchmarks

Validated: TCGA validated calls (sensitivity only); DREAM SYNTH1

Non-validated: Australian Ovarian Cancer Study patient 034

We are collecting variant calling benchmarks at http://github.com/hammerlab/variant-calling-benchmarks

Page 47

TCGA: Source of validated variants?

● A small fraction of TCGA samples have targeted sequencing validation of variant calls

● These can be used to measure recall, but not precision, of a caller
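Why only recall is measurable can be seen in a short sketch. The variant tuples and the `recall` function are hypothetical examples, not TCGA data: calls outside the validated set were never tested, so they cannot be counted as false positives.

```python
from typing import Iterable, Tuple

Variant = Tuple[str, int, str, str]  # (contig, position, ref, alt)


def recall(called: Iterable[Variant], validated: Iterable[Variant]) -> float:
    """Fraction of validated variants that the caller recovered."""
    validated_set = set(validated)
    if not validated_set:
        raise ValueError("need at least one validated variant")
    return len(validated_set & set(called)) / len(validated_set)


# Made-up example: four validated variants, three of them called.
validated = [("chr1", 100, "A", "T"), ("chr1", 250, "G", "C"),
             ("chr2", 75, "C", "T"), ("chr3", 900, "T", "G")]
called = validated[:3] + [("chr5", 10, "A", "G"), ("chr7", 42, "G", "A")]

print(recall(called, validated))
```

The two extra calls on chr5 and chr7 may be real or spurious; without validation at those sites, precision stays unknown.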

Page 48

Sensitivity on TCGA validated variants

● Thresholds chosen by eye performed poorly

● Tuning is difficult using only recall

Page 49

DREAM SMC Challenge

● Public contest for somatic variant calling

● Synthetic data for a number of “patients” with increasingly difficult mutations to call

○ Eventually there will be validation performed on real data

Page 50

Hand-picked thresholds perform poorly

Page 51

Model-based optimization on synth1

Features

● Raw likelihood

● Difference of ref and alt likelihood

● Variant allele fractions

● Allele depths

● Strand bias

Methodology

● Logistic Regression

● 1:1 train/test split

Page 52

Random forest does better

Page 53

Australian Ovarian Cancer Study

● Whole genome sequencing of two tumors from a patient in the Australian Ovarian Cancer Study

● Does the joint caller find new variants by pooling the samples?

Page 54

Calls in agreement with published set were triggered by both individual and pooled samples

Probably indicates our pooled trigger is over-eager

Page 55

Calls not present in the published set were triggered by a mix

Page 56

Final thoughts

Page 57

Open questions

● Should we stick with hard filters or switch to a data-driven model?

● How to integrate joint calling with assembly-based calling?

● Validation

● Validation

Page 58

Spark takeaways

● If you have experience working with distributed systems, you may be able to use Spark effectively on your problem

● It is not magic, and will not keep you from having to think hard about problems like skew

● Ease of debugging has been poor, but is improving

● Finding the right settings for memory sizes, number of cores per node, etc. can be hard. These parameters change for each new dataset we run on

Page 59

Thanks!