Scalable Genome Analysis with ADAM


Frank Austin Nothaft, UC Berkeley AMPLab, @fnothaft

7/23/2015

Analyzing genomes: What is our goal?

• Genomes are the “source” code for life:

• The human genome is a 3.2B character “program”, split across 46 “files”

• Within a species, genomes are ~99.9% similar

• The 0.1% variance gives rise to diverse traits, as well as diseases

The Sequencing Abstraction

It was the best of times, it was the worst of times…

Metaphor borrowed from Michael Schatz

Reads sampled from the text:

• It was the
• the best of
• times, it was
• the worst of
• worst of times
• best of times
• was the worst

• Sequencing is a Poisson substring sampling process

• For $1,000, we can sequence your genome at 30x coverage
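As a rough illustration of substring sampling, the sketch below draws fixed-length reads at uniformly random start positions from the example sentence and measures per-base depth. The function names, the read length of 10, and the fixed seed are assumptions made for this example. Uniform starts only approximate a Poisson process, and the ends of the text are covered less than the middle.

```python
import random

def sample_reads(genome, read_length, coverage, seed=0):
    """Sample fixed-length substrings at uniformly random start positions."""
    rng = random.Random(seed)
    # Enough reads that total sequenced bases ~= coverage * genome length.
    n_reads = (coverage * len(genome)) // read_length
    last_start = len(genome) - read_length
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [(s, genome[s:s + read_length]) for s in starts]

def per_base_depth(genome_length, reads):
    """Count how many reads cover each position."""
    depth = [0] * genome_length
    for start, seq in reads:
        for pos in range(start, start + len(seq)):
            depth[pos] += 1
    return depth

text = "It was the best of times, it was the worst of times"
reads = sample_reads(text, read_length=10, coverage=30)
depth = per_base_depth(len(text), reads)
mean_depth = sum(depth) / len(depth)
print(len(reads), "reads, mean depth", mean_depth)
print("min depth", min(depth), "max depth", max(depth))
```

The mean depth matches the requested coverage, while the depth at any single position varies from read to read, which is why a 30x average is used rather than a single copy.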

Genome Resequencing

• The Human Genome Project identified the “average” genome from 20 individuals at $1B cost

• To make this process cheaper, we use our knowledge of the “average” genome to calculate a diff

• Two problems:

• How do we compute this diff?

• How do we make sense of the differences?

Alignment and Assembly

It was the best of times, it was the worst of times…

Reads to align against the text, or to assemble without it:

• It was the
• the best of
• times, it was
• the worst of
• worst of times
• best of times
• was the worst
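A minimal sketch of computing the diff by alignment, assuming ungapped placement and a brute-force scan over every offset. Real aligners use indexes and handle insertions and deletions, so this is only the idea in miniature. The read "best of tines" is a hypothetical mutated read added for this example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def align(reference, read):
    """Return (offset, mismatches) for the best ungapped placement of read."""
    best = None
    for offset in range(len(reference) - len(read) + 1):
        d = hamming(reference[offset:offset + len(read)], read)
        if best is None or d < best[1]:
            best = (offset, d)
    return best

def diff(reference, reads):
    """Collect (position, reference_char, read_char) for every mismatch."""
    differences = set()
    for read in reads:
        offset, _ = align(reference, read)
        for i, ch in enumerate(read):
            if reference[offset + i] != ch:
                differences.add((offset + i, reference[offset + i], ch))
    return sorted(differences)

reference = "It was the best of times, it was the worst of times"
reads = ["It was the", "the best of", "times, it was", "the worst of",
         "worst of times", "best of tines", "was the worst"]
variants = diff(reference, reads)
print(variants)
```

Note that "the worst of" and "the best of" each place unambiguously here, but short reads from repeated text (such as "of times") would match more than one offset, which is one reason alignment is hard on real genomes.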

What do genomic analysis tools look like?

Genomics Pipelines

Source: The Broad Institute of MIT/Harvard

Flat File Formats

• Scientific data is typically stored in application specific file formats:

• Genomic reads: SAM/BAM, CRAM

• Genomic variants: VCF/BCF, MAF

• Genomic features: BED, NarrowPeak, GTF

Flat Architectures

• APIs present very barebones abstractions:

• GATK: Sorted iterator over the genome

• Why are flat architectures bad?

1. Trivial: low level abstractions are not productive

2. Trivial: flat architectures create technical lock-in

3. Subtle: low level abstractions can introduce bugs

A green field approach

First, define a schema

record AlignmentRecord {
  union { null, Contig } contig = null;
  union { null, long } start = null;
  union { null, long } end = null;
  union { null, int } mapq = null;
  union { null, string } readName = null;
  union { null, string } sequence = null;
  union { null, string } mateReference = null;
  union { null, long } mateAlignmentStart = null;
  union { null, string } cigar = null;
  union { null, string } qual = null;
  union { null, string } recordGroupName = null;
  union { int, null } basesTrimmedFromStart = 0;
  union { int, null } basesTrimmedFromEnd = 0;
  union { boolean, null } readPaired = false;
  union { boolean, null } properPair = false;
  union { boolean, null } readMapped = false;
  union { boolean, null } mateMapped = false;
  union { boolean, null } firstOfPair = false;
  union { boolean, null } secondOfPair = false;
  union { boolean, null } failedVendorQualityChecks = false;
  union { boolean, null } duplicateRead = false;
  union { boolean, null } readNegativeStrand = false;
  union { boolean, null } mateNegativeStrand = false;
  union { boolean, null } primaryAlignment = false;
  union { boolean, null } secondaryAlignment = false;
  union { boolean, null } supplementaryAlignment = false;
  union { null, string } mismatchingPositions = null;
  union { null, string } origQual = null;
  union { null, string } attributes = null;
  union { null, string } recordGroupSequencingCenter = null;
  union { null, string } recordGroupDescription = null;
  union { null, long } recordGroupRunDateEpoch = null;
  union { null, string } recordGroupFlowOrder = null;
  union { null, string } recordGroupKeySequence = null;
  union { null, string } recordGroupLibrary = null;
  union { null, int } recordGroupPredictedMedianInsertSize = null;
  union { null, string } recordGroupPlatform = null;
  union { null, string } recordGroupPlatformUnit = null;
  union { null, string } recordGroupSample = null;
  union { null, Contig } mateContig = null;
}

Stack layers (top to bottom):

• Application: Transformations
• Presentation: Enriched Models
• Evidence Access: MapReduce/DBMS
• Schema: Data Models
• Materialized Data: Columnar Storage
• Data Distribution: Parallel FS
• Physical Storage: Attached Storage








A schema provides a narrow waist









Accelerate common access patterns

• In genomics, we commonly have to find observations that overlap in a coordinate plane

• This coordinate plane is genomics specific, and is known a priori

• We can use our knowledge of the coordinate plane to implement a fast overlap join
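One way to exploit a coordinate space that is known up front is to bin regions by position and only compare regions that share a bin. The sketch below is a single-machine illustration of that idea, not ADAM's implementation. The tuple layout `(contig, start, end)` with half-open intervals and the `bin_size` default are assumptions for this example.

```python
from collections import defaultdict

def overlaps(a, b):
    """True if two (contig, start, end) half-open regions overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def bins(region, bin_size):
    """All (contig, bin index) keys that a region touches."""
    contig, start, end = region
    first, last = start // bin_size, (end - 1) // bin_size
    return [(contig, b) for b in range(first, last + 1)]

def overlap_join(left, right, bin_size=1000):
    """Pair every left region with every right region it overlaps."""
    index = defaultdict(list)
    for r in right:
        for key in bins(r, bin_size):
            index[key].append(r)
    pairs = set()  # a pair can share several bins, so deduplicate
    for l in left:
        for key in bins(l, bin_size):
            for r in index.get(key, ()):
                if overlaps(l, r):
                    pairs.add((l, r))
    return sorted(pairs)

reads = [("chr1", 100, 200), ("chr1", 950, 1050), ("chr2", 100, 200)]
features = [("chr1", 150, 160), ("chr1", 1000, 2000), ("chr1", 200, 300)]
joined = overlap_join(reads, features)
for pair in joined:
    print(pair)
```

Because the bin keys depend only on the coordinate system, the same keys could serve as partition keys in a distributed setting, so that only regions in the same bin need to be co-located.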








Pick appropriate storage

• When accessing scientific datasets, we frequently slice and dice the dataset:

• Algorithms may touch subsets of columns

• We don’t always touch the whole dataset

• This is a good match for columnar storage
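To make the column-subset point concrete, here is a toy in-memory comparison of row and column layouts. The field names come from the schema, with `contig` simplified to a string; the data values and helper names are made up for this example. A real columnar format stores each column contiguously on disk, so the unused columns are never read.

```python
# Row-oriented layout: one dict per read, all fields stored together.
rows = [
    {"contig": "chr1", "start": 100, "mapq": 60, "sequence": "ACGT"},
    {"contig": "chr2", "start": 500, "mapq": 20, "sequence": "TTGA"},
    {"contig": "chr1", "start": 300, "mapq": 30, "sequence": "GGCA"},
]

def to_columns(rows):
    """Pivot a list of records into one list per field."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns

def mean_mapq(columns, contig):
    """Mean mapping quality on one contig; touches only two columns."""
    selected = [q for c, q in zip(columns["contig"], columns["mapq"]) if c == contig]
    return sum(selected) / len(selected)

columns = to_columns(rows)
print(mean_mapq(columns, "chr1"))
```

The query never looks at `sequence` or `start`, which in real read data are among the largest fields.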








Is introducing a new data model really a good idea?

Source: XKCD, http://xkcd.com/927/


A subtle point: Proper stack design can simplify backwards compatibility

To support legacy data formats, you define a way to serialize/deserialize the schema into/from the legacy flat file format.
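A minimal sketch of such a serializer/deserializer pair, assuming a record that carries only a few of the schema's fields and a single-end read. It writes the 11 mandatory SAM columns, sets only the unmapped (0x4) and reverse-strand (0x10) flag bits, and shifts the 0-based `start` to SAM's 1-based POS. The function names are hypothetical and this is not ADAM's converter.

```python
def to_sam_line(record):
    """Serialize a simplified alignment record to one SAM text line."""
    flag = 0
    if not record["readMapped"]:
        flag |= 0x4   # SAM flag bit: read unmapped
    if record["readNegativeStrand"]:
        flag |= 0x10  # SAM flag bit: read on reverse strand
    fields = [
        record["readName"], str(flag), record["contig"],
        str(record["start"] + 1),       # schema start is 0-based, SAM POS is 1-based
        str(record["mapq"]), record["cigar"],
        "*", "0", "0",                  # RNEXT, PNEXT, TLEN: no mate in this sketch
        record["sequence"], record["qual"],
    ]
    return "\t".join(fields)

def from_sam_line(line):
    """Deserialize one SAM text line back into the simplified record."""
    f = line.rstrip("\n").split("\t")
    flag = int(f[1])
    return {
        "readName": f[0],
        "readMapped": not (flag & 0x4),
        "readNegativeStrand": bool(flag & 0x10),
        "contig": f[2],
        "start": int(f[3]) - 1,
        "mapq": int(f[4]),
        "cigar": f[5],
        "sequence": f[9],
        "qual": f[10],
    }

record = {"readName": "read1", "readMapped": True, "readNegativeStrand": False,
          "contig": "chr1", "start": 100, "mapq": 60, "cigar": "4M",
          "sequence": "ACGT", "qual": "IIII"}
line = to_sam_line(record)
print(line)
print(from_sam_line(line) == record)
```

Everything above the schema only ever sees the record, so the legacy file becomes one more storage option rather than a separate code path.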

[Stack diagram: Schema (Data Models) over Materialized Data (Legacy File Format) over Data Distribution]


This is a view!


Using the ADAM stack to analyze genetic variants

What are the challenges?

• The differences between people’s genomes lead to various traits and diseases, but:

• Variants don’t always have straightforward explanations

• A 3B base genome with variation at 0.1% of locations means roughly 3 million variants per person! How do we find the important ones?

Making Sense of Variation

• Variation in the genome can affect biology in several ways:

• A variant can modify or break a protein

• A variant can modify how much of a protein is created

• The subset of your genome that encodes proteins is the exome. This is ~1% of your genome!

How do we link variants to traits?

• Statistical modeling!

• We may use a simple model (e.g., a χ² test)

• Or, we may use a more complex model (e.g., linear regression)

• This is known as a genotype-to-phenotype association test

• Most of these tests can be expressed as aggregation functions, i.e., a reduction. This maps well to Spark!
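As a sketch of the reduction, the example below maps each sample to a one-hot count vector, sums the vectors with `functools.reduce`, and computes the χ² statistic for the resulting 2x2 table (without continuity correction). The sample data and function names are invented for illustration. In Spark the same shape would be a `map` followed by a `reduce` over an RDD.

```python
from functools import reduce

def to_counts(sample):
    """Map one (has_variant, has_trait) sample to a one-hot 2x2 cell count."""
    has_variant, has_trait = sample
    return (
        int(has_variant and has_trait),
        int(has_variant and not has_trait),
        int(not has_variant and has_trait),
        int(not has_variant and not has_trait),
    )

def merge(a, b):
    """Associative, commutative combine step: element-wise sum."""
    return tuple(x + y for x, y in zip(a, b))

def chi_squared(table):
    """Chi-squared statistic for a 2x2 table (a, b, c, d).

    Assumes no row or column total is zero.
    """
    a, b, c, d = table
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical cohort: 20 samples as (has_variant, has_trait).
samples = ([(True, True)] * 8 + [(True, False)] * 2 +
           [(False, True)] * 2 + [(False, False)] * 8)
table = reduce(merge, map(to_counts, samples))
statistic = chi_squared(table)
print(table, statistic)
```

Because `merge` is associative and commutative, partial tables can be built per partition and combined in any order, which is what makes the test easy to distribute.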

What if two variants are close to each other?

• This phenomenon is known as linkage disequilibrium: if two variants are close to each other, they are likely to be inherited together

Levo and Segal, Nat Rev Gen, 2014.

• We can often link expression changes and trait changes to blocks of the genome, but not to specific variants
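Linkage disequilibrium between two variants is commonly summarized by r², computed from haplotype frequencies. The sketch below assumes phased haplotypes encoded as `(allele_a, allele_b)` pairs with 0 for the reference allele and 1 for the alternate allele. The data are made up, and both sites are assumed to be polymorphic so the denominator is nonzero.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic sites from phased haplotypes.

    D = p_AB - p_A * p_B, and r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

# Alternate alleles always travel together: perfect linkage.
linked = [(1, 1), (1, 1), (0, 0), (0, 0)]
# Every allele combination equally common: no linkage.
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]
print(ld_r_squared(linked), ld_r_squared(unlinked))
```

When r² is near 1, an association test cannot tell the two variants apart, which is why signals are often localized to a block rather than to a single variant.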

Can we break up these blocks?

• When a variant modifies a gene, we can predict how the gene is modified:

• Does this variant change how the protein is spliced?

• Does this variant change an amino acid?

• We can discard variants that do not change proteins

• However, if the variant is outside of a gene, how do we make sense of it?

• Are these variants important?
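For the amino acid question, a single-base change inside a coding sequence can be classified by translating the affected codon before and after the change. The sketch below uses the standard genetic code and a made-up coding sequence. It assumes the position lies in a complete codon on the forward strand, and it ignores splicing. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(cds):
    """Translate a coding sequence, ignoring any trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))

def classify_snv(cds, position, alt):
    """Classify a single-base substitution at a 0-based CDS position."""
    mutated = cds[:position] + alt + cds[position + 1:]
    codon_start = position - position % 3
    before = CODON_TABLE[cds[codon_start:codon_start + 3]]
    after = CODON_TABLE[mutated[codon_start:codon_start + 3]]
    if before == after:
        return "synonymous"   # protein unchanged; a candidate to discard
    if after == "*":
        return "nonsense"     # introduces a premature stop codon
    return "missense"         # one amino acid replaced by another

cds = "ATGGAAGGT"  # hypothetical coding sequence: Met-Glu-Gly
print(translate(cds))
print(classify_snv(cds, 5, "G"), classify_snv(cds, 4, "T"), classify_snv(cds, 3, "T"))
```

This kind of rule only exists because the protein-coding "grammar" is known; no comparably simple lookup is available for variants outside of genes.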

Mutations in AML

There is a big “long tail”, including people who have cancer, but who have no “modified” genes!

Ravi Pandya, BeatAML.

Looking outside of the Exome

• We analyze mutations in the exome using the grammar for protein creation

• Can we apply a similar approach outside of the exome?

• Let’s use the grammar for regulation instead!

S. Weingarten-Gabbay and E. Segal, Human Genetics, 2014.

Grammar Enhanced Statistical Models

Levo and Segal, Nat Rev Gen, 2014.

S. Weingarten-Gabbay and E. Segal, Human Genetics, 2014.

You Can Help!

• All of our projects are open source:

• https://www.github.com/bigdatagenomics

• Apache 2 licensed

• For a tutorial, see Chapter 10 of Advanced Analytics with Spark: http://shop.oreilly.com/product/0636920035091.do

• More details on ADAM in our latest paper: https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/adam.pdf

• The GA4GH is looking for coders to help enable genomic data sharing: https://github.com/ga4gh/server

Acknowledgements

• UC Berkeley: Matt Massie, Timothy Danford, André Schumacher, Jey Kottalam, Karen Feng, Eric Tu, Niranjan Kumar, Ananth Pallaseni, Anthony Joseph, Dave Patterson

• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael Linderman, Jeff Hammerbacher

• GenomeBridge: Carl Yeksigian

• Cloudera: Uri Laserson, Tom White

• Microsoft Research: Ravi Pandya, Bill Bolosky

• UC Santa Cruz: Benedict Paten, David Haussler, Hannes Schmidt, Beau Norgeot

• And many other open source contributors, especially Michael Heuer, Neil Ferguson, Andy Petrella, Xavier Tordior

• Total of 40 contributors to ADAM/BDG from >12 institutions
