Scalable Genome Analysis with ADAM

28
Scalable Genome Analysis With ADAM Frank Austin Nothaft, UC Berkeley AMPLab [email protected], @fnothaft 7/23/2015

Transcript of Scalable Genome Analysis with ADAM

Page 1: Scalable Genome Analysis with ADAM

Scalable Genome Analysis With ADAM

Frank Austin Nothaft, UC Berkeley AMPLab [email protected], @fnothaft

7/23/2015

Page 2: Scalable Genome Analysis with ADAM

Analyzing genomes: What is our goal?

• Genomes are the “source” code for life:

• The human genome is a 3.2B character “program”, split across 46 “files”

• Within a species, genomes are ~99.9% similar

• The 0.1% variance gives rise to diverse traits, as well as diseases

Page 3: Scalable Genome Analysis with ADAM

The Sequencing AbstractionIt was the best of times, it was the worst of times…

Metaphor borrowed from Michael Schatz

It was thethe best of

times, it wasthe worst of

worst of timesbest of timeswas the worst

• Sequencing is a Poission substring sampling process

• For $1,000, we can sequence a 30x copy of your genome

Page 4: Scalable Genome Analysis with ADAM

Genome Resequencing• The Human Genome Project identified the “average”

genome from 20 individuals at $1B cost

• To make this process cheaper, we use our knowledge of the “average” genome to calculate a diff

• Two problems:

• How do we compute this diff?

• How do we make sense of the differences?

Page 5: Scalable Genome Analysis with ADAM

Alignment and AssemblyIt was the best of times, it was the worst of times…It was the

the best oftimes, it was

the worst ofworst of times

best of times was the worst

It was thethe best of

times, it wasthe worst of

worst of timesbest of timeswas the worst

Page 6: Scalable Genome Analysis with ADAM

What do genomic analysis tools look like?

Page 7: Scalable Genome Analysis with ADAM

Genomics Pipelines

Source: The Broad Institute of MIT/Harvard

Page 8: Scalable Genome Analysis with ADAM

Flat File Formats

• Scientific data is typically stored in application specific file formats:

• Genomic reads: SAM/BAM, CRAM

• Genomic variants: VCF/BCF, MAF

• Genomic features: BED, NarrowPeak, GTF

Page 9: Scalable Genome Analysis with ADAM

Flat Architectures• APIs present very barebones abstractions:

• GATK: Sorted iterator over the genome

• Why are flat architectures bad?

1. Trivial: low level abstractions are not productive

2. Trivial: flat architectures create technical lock-in

3. Subtle: low level abstractions can introduce bugs

Page 10: Scalable Genome Analysis with ADAM

A green field approach

Page 11: Scalable Genome Analysis with ADAM

First, define a schemarecord AlignmentRecord { union { null, Contig } contig = null; union { null, long } start = null; union { null, long } end = null; union { null, int } mapq = null; union { null, string } readName = null; union { null, string } sequence = null; union { null, string } mateReference = null; union { null, long } mateAlignmentStart = null; union { null, string } cigar = null; union { null, string } qual = null; union { null, string } recordGroupName = null; union { int, null } basesTrimmedFromStart = 0; union { int, null } basesTrimmedFromEnd = 0; union { boolean, null } readPaired = false; union { boolean, null } properPair = false; union { boolean, null } readMapped = false; union { boolean, null } mateMapped = false; union { boolean, null } firstOfPair = false; union { boolean, null } secondOfPair = false; union { boolean, null } failedVendorQualityChecks = false; union { boolean, null } duplicateRead = false; union { boolean, null } readNegativeStrand = false; union { boolean, null } mateNegativeStrand = false; union { boolean, null } primaryAlignment = false; union { boolean, null } secondaryAlignment = false; union { boolean, null } supplementaryAlignment = false; union { null, string } mismatchingPositions = null; union { null, string } origQual = null; union { null, string } attributes = null; union { null, string } recordGroupSequencingCenter = null; union { null, string } recordGroupDescription = null; union { null, long } recordGroupRunDateEpoch = null; union { null, string } recordGroupFlowOrder = null; union { null, string } recordGroupKeySequence = null; union { null, string } recordGroupLibrary = null; union { null, int } recordGroupPredictedMedianInsertSize = null; union { null, string } recordGroupPlatform = null; union { null, string } recordGroupPlatformUnit = null; union { null, string } recordGroupSample = null; union { null, Contig } mateContig = null;}

ApplicationTransformations

Physical StorageAttached Storage

Data DistributionParallel FS

Materialized DataColumnar Storage

Evidence AccessMapReduce/DBMS

PresentationEnriched Models

SchemaData Models

Page 12: Scalable Genome Analysis with ADAM

ApplicationTransformations

Physical StorageAttached Storage

Data DistributionParallel FS

Materialized DataColumnar Storage

Evidence AccessMapReduce/DBMS

PresentationEnriched Models

SchemaData Models

A schema provides a narrow waist

record AlignmentRecord { union { null, Contig } contig = null; union { null, long } start = null; union { null, long } end = null; union { null, int } mapq = null; union { null, string } readName = null; union { null, string } sequence = null; union { null, string } mateReference = null; union { null, long } mateAlignmentStart = null; union { null, string } cigar = null; union { null, string } qual = null; union { null, string } recordGroupName = null; union { int, null } basesTrimmedFromStart = 0; union { int, null } basesTrimmedFromEnd = 0; union { boolean, null } readPaired = false; union { boolean, null } properPair = false; union { boolean, null } readMapped = false; union { boolean, null } mateMapped = false; union { boolean, null } firstOfPair = false; union { boolean, null } secondOfPair = false; union { boolean, null } failedVendorQualityChecks = false; union { boolean, null } duplicateRead = false; union { boolean, null } readNegativeStrand = false; union { boolean, null } mateNegativeStrand = false; union { boolean, null } primaryAlignment = false; union { boolean, null } secondaryAlignment = false; union { boolean, null } supplementaryAlignment = false; union { null, string } mismatchingPositions = null; union { null, string } origQual = null; union { null, string } attributes = null; union { null, string } recordGroupSequencingCenter = null; union { null, string } recordGroupDescription = null; union { null, long } recordGroupRunDateEpoch = null; union { null, string } recordGroupFlowOrder = null; union { null, string } recordGroupKeySequence = null; union { null, string } recordGroupLibrary = null; union { null, int } recordGroupPredictedMedianInsertSize = null; union { null, string } recordGroupPlatform = null; union { null, string } recordGroupPlatformUnit = null; union { null, string } recordGroupSample = null; union { null, Contig } mateContig = null;}

ApplicationTransformations

Physical StorageAttached Storage

Data DistributionParallel FS

Materialized DataColumnar Storage

Evidence AccessMapReduce/DBMS

PresentationEnriched Models

SchemaData Models

Page 13: Scalable Genome Analysis with ADAM

Accelerate common access patterns

• In genomics, we commonly have to find observations that overlap in a coordinate plane

• This coordinate plane is genomics specific, and is known a priori

• We can use our knowledge of the coordinate plane to implement a fast overlap join

ApplicationTransformations

Physical StorageAttached Storage

Data DistributionParallel FS

Materialized DataColumnar Storage

Evidence AccessMapReduce/DBMS

PresentationEnriched Models

SchemaData Models

ApplicationTransformations

Physical StorageAttached Storage

Data DistributionParallel FS

Materialized DataColumnar Storage

Evidence AccessMapReduce/DBMS

PresentationEnriched Models

SchemaData Models

Page 14: Scalable Genome Analysis with ADAM

Pick appropriate storage• When accessing scientific

datasets, we frequently slice and dice the dataset:

• Algorithms may touch subsets of columns

• We don’t always touch the whole dataset

• This is a good match for columnar storage

ApplicationTransformations

Physical StorageAttached Storage

Data DistributionParallel FS

Materialized DataColumnar Storage

Evidence AccessMapReduce/DBMS

PresentationEnriched Models

SchemaData Models

ApplicationTransformations

Physical StorageAttached Storage

Data DistributionParallel FS

Materialized DataColumnar Storage

Evidence AccessMapReduce/DBMS

PresentationEnriched Models

SchemaData Models

Page 15: Scalable Genome Analysis with ADAM

Is introducing a new data model really a good idea?

Source: XKCD, http://xkcd.com/927/

Page 16: Scalable Genome Analysis with ADAM

A subtle point:!Proper stack design can simplify

backwards compatibility

To support legacy data formats, you define a way to serialize/deserialize the schema into/from the

legacy flat file format!

Data Distribution

Materialized DataLegacy File Format

SchemaData Models

Data Distribution

Materialized DataColumnar Storage

SchemaData Models

Page 17: Scalable Genome Analysis with ADAM

A subtle point:!Proper stack design can simplify

backwards compatibility

This is a view!

Data Distribution

Materialized DataLegacy File Format

SchemaData Models

Data Distribution

Materialized DataColumnar Storage

SchemaData Models

Page 18: Scalable Genome Analysis with ADAM

Using the ADAM stack to analyze genetic variants

Page 19: Scalable Genome Analysis with ADAM

What are the challenges?

• The differences between people’s genomes leads to various traits and diseases, but:

• Variants don’t always have straightforward explanations

• 3B base genome with variation at 0.1% of locations —> lots of variants! How do we find the important one?

Page 20: Scalable Genome Analysis with ADAM

Making Sense of Variation• Variation in the genome can affect biology in

several ways:

• A variant can modify or break a protein

• A variant can modify how much of a protein is created

• The subset of your genome that encodes proteins is the exome. This is ~1% of your genome!

Page 21: Scalable Genome Analysis with ADAM

How do we link variants to traits?

• Statistical modeling!

• We may use a simple model (e.g., a 𝛘2 test)

• Or, we may use a more complex model (e.g., linear regression)

• This is known as a genotype-to-phenotype association test

• Most of these tests can be expressed as aggregation functions —> i.e., a reduction. This maps well to Spark!

Page 22: Scalable Genome Analysis with ADAM

What if two variants are close to each other?

• This phenomenon is known as linkage disequilibrium: if two variants are close to each other, they are likely to be inherited together

Levo and Segal, Nat Rev Gen, 2014.

• We can often link expression changes and trait changes to blocks of the genome, but not to specific variants

Page 23: Scalable Genome Analysis with ADAM

Can we break up these blocks?

• When a variant modifies a gene, we can predict how the gene is modified:

• Does this variant change how the protein is spliced?

• Does this variant change an amino acid?

• We can discard variants that do not change proteins

• However, if the variant is outside of a gene, how do we make sense of it?

• Are these variants important?

Page 24: Scalable Genome Analysis with ADAM

Mutations in AML����

������

������

� ������

���

����

���

� ������

���

� ������

���

�����

��

���

��

���

��

���

��

����

���

���

�����

� ��

�����

��

�����

���

��

��� �

��

���

��� ��

����

�� ��

���

� �

������

��

� ��

� � � ��

���

�� �

����

�� ��

����

� ��

�����

�� ��

����

� ��

����

��� ��

������

��� ��

!�!"�

�� ��

������

��� ��

������

��� ��

��"

��� ��

�!�!�

��� ��

��!��

�� ��

��� ��

�����

�� ��

�� ��

���

�� �� � ��

� ��

� !

�� ��

�� ��

����

��

����

� � ��

���

�� �� �� ��

�� ��

����

��

���!

��

� ��

��

����

��

� ��

�� �

� ��

�� ��

���

��

�����

��

�� �� � ��

� ��

����

��

����

��

���� �

�� ��

����

��

������

�� � ��

����

�� �� � ��

� ��

�� ��

There is a big “long tail”, including people who have cancer, but who have no “modified” genes!

Ravi Pandya, BeatAML.

Page 25: Scalable Genome Analysis with ADAM

Looking outside of the Exome

• We analyze mutations in the exome using the grammar for protein creation

• Can we apply a similar approach outside of the exome?

• Let’s use the grammar for regulation instead!

S. Weingarten-Gabbay and E. Segal, Human Genetics, 2014.

Page 26: Scalable Genome Analysis with ADAM

Grammar Enhanced Statistical Models

Levo and Segal, Nat Rev Gen, 2014.S. Weingarten-Gabbay and E. Segal, Human Genetics, 2014.

Page 27: Scalable Genome Analysis with ADAM

You Can Help!• All of our projects are open source:

• https://www.github.com/bigdatagenomics

• Apache 2 licensed

• For a tutorial, see Chapter 10 of Advanced Analytics with Spark

• More details on ADAM in our latest paper

• The GA4GH is looking for coders to help enable genomic data sharing: https://github.com/ga4gh/server

Page 28: Scalable Genome Analysis with ADAM

Acknowledgements• UC Berkeley: Matt Massie, Timothy Danford, André Schumacher, Jey

Kottalam, Karen Feng, Eric Tu, Niranjan Kumar, Ananth Pallaseni, Anthony Joseph, Dave Patterson!

• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael Linderman, Jeff Hammerbacher!

• GenomeBridge: Carl Yeksigian!

• Cloudera: Uri Laserson, Tom White!

• Microsoft Research: Ravi Pandya, Bill Bolosky!

• UC Santa Cruz: Benedict Paten, David Haussler, Hannes Schmidt, Beau Norgeot!

• And many other open source contributors, especially Michael Heuer, Neil Ferguson, Andy Petrella, Xavier Tordior!

• Total of 40 contributors to ADAM/BDG from >12 institutions