Scalable Genome Analysis with ADAM

Post on 13-Aug-2015


Frank Austin Nothaft, UC Berkeley AMPLab fnothaft@berkeley.edu, @fnothaft

7/23/2015

Analyzing genomes: What is our goal?

• Genomes are the “source” code for life:

• The human genome is a 3.2B character “program”, split across 46 “files”

• Within a species, genomes are ~99.9% similar

• The 0.1% variance gives rise to diverse traits, as well as diseases

The Sequencing Abstraction

It was the best of times, it was the worst of times…

Metaphor borrowed from Michael Schatz

[Figure: short overlapping substrings sampled from the sentence, e.g. “It was the”, “the best of”, “times, it was”, “the worst of”, “worst of times”, “best of times”, “was the worst”]

• Sequencing is a Poisson substring sampling process

• For $1,000, we can sequence your genome at 30x coverage
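A quick back-of-the-envelope consequence of that Poisson model (a sketch, not from the slides): at 30x mean coverage, almost every base is sampled at least once, but a small fraction is still sequenced too shallowly to analyze confidently.

```python
import math

# Sequencing depth at a given base is modeled as Poisson(lambda),
# where lambda is the mean coverage (30x here).
coverage = 30

# Probability that a given base is sequenced at least once.
p_covered = 1 - math.exp(-coverage)

# Probability that a base is sequenced fewer than 10 times.
p_low = sum(math.exp(-coverage) * coverage**k / math.factorial(k)
            for k in range(10))

print(f"P(depth >= 1) = {p_covered:.10f}")
print(f"P(depth < 10) = {p_low:.3e}")
```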

Genome Resequencing

• The Human Genome Project identified the “average” genome from 20 individuals, at a cost of ~$1B

• To make this process cheaper, we use our knowledge of the “average” genome to calculate a diff

• Two problems:

• How do we compute this diff?

• How do we make sense of the differences?
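A toy sketch of the first problem, computing the diff: given a read whose alignment position is already known, compare it base-by-base against the reference. (Real pipelines aggregate many overlapping reads and weigh base qualities; the function name and data here are illustrative.)

```python
def diff_against_reference(reference, read, start):
    """Return (position, ref_base, read_base) for each mismatch
    between an aligned read and the reference genome."""
    variants = []
    for offset, base in enumerate(read):
        pos = start + offset
        if reference[pos] != base:
            variants.append((pos, reference[pos], base))
    return variants

reference = "ACGTACGTAC"
read = "GTTCG"  # aligned starting at reference position 2
print(diff_against_reference(reference, read, 2))  # → [(4, 'A', 'T')]
```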

Alignment and Assembly

It was the best of times, it was the worst of times…

[Figure: the sampled substrings (“It was the”, “the best of”, “times, it was”, “the worst of”, “worst of times”, “best of times”, “was the worst”) reassembled against the full sentence]

What do genomic analysis tools look like?

Genomics Pipelines

Source: The Broad Institute of MIT/Harvard

Flat File Formats

• Scientific data is typically stored in application-specific file formats:

• Genomic reads: SAM/BAM, CRAM

• Genomic variants: VCF/BCF, MAF

• Genomic features: BED, NarrowPeak, GTF

Flat Architectures

• APIs present very barebones abstractions:

• GATK: Sorted iterator over the genome

• Why are flat architectures bad?

1. Trivial: low-level abstractions are not productive

2. Trivial: flat architectures create technical lock-in

3. Subtle: low-level abstractions can introduce bugs

A green field approach

First, define a schema

record AlignmentRecord {
  union { null, Contig } contig = null;
  union { null, long } start = null;
  union { null, long } end = null;
  union { null, int } mapq = null;
  union { null, string } readName = null;
  union { null, string } sequence = null;
  union { null, string } mateReference = null;
  union { null, long } mateAlignmentStart = null;
  union { null, string } cigar = null;
  union { null, string } qual = null;
  union { null, string } recordGroupName = null;
  union { int, null } basesTrimmedFromStart = 0;
  union { int, null } basesTrimmedFromEnd = 0;
  union { boolean, null } readPaired = false;
  union { boolean, null } properPair = false;
  union { boolean, null } readMapped = false;
  union { boolean, null } mateMapped = false;
  union { boolean, null } firstOfPair = false;
  union { boolean, null } secondOfPair = false;
  union { boolean, null } failedVendorQualityChecks = false;
  union { boolean, null } duplicateRead = false;
  union { boolean, null } readNegativeStrand = false;
  union { boolean, null } mateNegativeStrand = false;
  union { boolean, null } primaryAlignment = false;
  union { boolean, null } secondaryAlignment = false;
  union { boolean, null } supplementaryAlignment = false;
  union { null, string } mismatchingPositions = null;
  union { null, string } origQual = null;
  union { null, string } attributes = null;
  union { null, string } recordGroupSequencingCenter = null;
  union { null, string } recordGroupDescription = null;
  union { null, long } recordGroupRunDateEpoch = null;
  union { null, string } recordGroupFlowOrder = null;
  union { null, string } recordGroupKeySequence = null;
  union { null, string } recordGroupLibrary = null;
  union { null, int } recordGroupPredictedMedianInsertSize = null;
  union { null, string } recordGroupPlatform = null;
  union { null, string } recordGroupPlatformUnit = null;
  union { null, string } recordGroupSample = null;
  union { null, Contig } mateContig = null;
}

[Figure: the ADAM stack, seven layers from top to bottom]

• Application: Transformations
• Presentation: Enriched Models
• Evidence Access: MapReduce/DBMS
• Schema: Data Models
• Materialized Data: Columnar Storage
• Data Distribution: Parallel FS
• Physical Storage: Attached Storage


A schema provides a narrow waist



Accelerate common access patterns

• In genomics, we commonly have to find observations that overlap in a coordinate plane

• This coordinate plane is genomics-specific, and is known a priori

• We can use our knowledge of the coordinate plane to implement a fast overlap join
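A minimal single-machine sketch of such an overlap join: because both inputs can be sorted by start coordinate up front, a single sweep suffices instead of an all-pairs comparison. (ADAM's actual region join is a distributed Spark implementation; this only illustrates the idea, and the names are made up.)

```python
def overlap_join(left, right):
    """Pair up overlapping half-open intervals (start, end).
    Both input lists must be sorted by start coordinate."""
    results = []
    active = []  # right intervals that may still overlap upcoming lefts
    j = 0
    for ls, le in left:
        # Admit right intervals that start before this left interval ends.
        while j < len(right) and right[j][0] < le:
            active.append(right[j])
            j += 1
        # Retire right intervals that end before this left interval starts.
        active = [(rs, re) for rs, re in active if re > ls]
        results.extend(((ls, le), (rs, re)) for rs, re in active if rs < le)
    return results

reads = [(0, 5), (3, 8)]               # e.g., aligned read intervals
features = [(4, 6), (7, 9), (10, 12)]  # e.g., gene annotations
print(overlap_join(reads, features))
# → [((0, 5), (4, 6)), ((3, 8), (4, 6)), ((3, 8), (7, 9))]
```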


Pick Appropriate Storage

• When accessing scientific datasets, we frequently slice and dice the data:

• Algorithms may touch subsets of columns

• We don’t always touch the whole dataset

• This is a good match for columnar storage
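A toy contrast showing why this access pattern favors columnar layouts (Parquet is what ADAM uses; the records here are made up): with one array per field, a query that only needs `start` and `mapq` never touches the sequence strings at all.

```python
rows = [
    {"readName": "r1", "start": 100, "sequence": "ACGT", "mapq": 60},
    {"readName": "r2", "start": 104, "sequence": "CGTA", "mapq": 30},
]

# Columnar layout: one contiguous array per field.
columns = {field: [row[field] for row in rows] for field in rows[0]}

# A query over just one column reads only that array, skipping the rest.
mean_mapq = sum(columns["mapq"]) / len(columns["mapq"])
print(mean_mapq)  # → 45.0
```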


Is introducing a new data model really a good idea?

Source: XKCD, http://xkcd.com/927/

A subtle point: proper stack design can simplify backwards compatibility

To support legacy data formats, you define a way to serialize/deserialize the schema into/from the legacy flat file format!
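A sketch of that serialize step for reads: project the schema's fields into a SAM-style tab-separated line. (Field names follow the AlignmentRecord schema; the flag handling and column set are simplified for illustration.)

```python
# Column order loosely follows SAM: QNAME FLAG RNAME POS MAPQ CIGAR
# RNEXT PNEXT TLEN SEQ QUAL.
SAM_COLUMNS = ["readName", "flag", "contig", "start", "mapq", "cigar",
               "mateReference", "mateAlignmentStart", "tlen",
               "sequence", "qual"]

def to_sam_line(record):
    """Render a schema-style record dict as one SAM-like text line."""
    out = dict(record)
    out["start"] = record["start"] + 1  # SAM positions are 1-based
    return "\t".join(str(out.get(col, "*")) for col in SAM_COLUMNS)

record = {"readName": "read1", "flag": 0, "contig": "chr1", "start": 99,
          "mapq": 60, "cigar": "4M", "mateReference": "*",
          "mateAlignmentStart": 0, "tlen": 0,
          "sequence": "ACGT", "qual": "IIII"}
print(to_sam_line(record))
```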

[Figure: two instantiations of the stack share the same schema layer; one materializes data in the legacy file format, the other in columnar storage]

In other words, the legacy file format becomes just a view over the same schema!

Using the ADAM stack to analyze genetic variants

What are the challenges?

• The differences between people’s genomes lead to various traits and diseases, but:

• Variants don’t always have straightforward explanations

• A 3B-base genome with variation at 0.1% of locations → lots of variants! How do we find the important ones?

Making Sense of Variation

• Variation in the genome can affect biology in several ways:

• A variant can modify or break a protein

• A variant can modify how much of a protein is created

• The subset of your genome that encodes proteins is the exome. This is ~1% of your genome!

How do we link variants to traits?

• Statistical modeling!

• We may use a simple model (e.g., a χ² test)

• Or, we may use a more complex model (e.g., linear regression)

• This is known as a genotype-to-phenotype association test

• Most of these tests can be expressed as aggregation functions (i.e., reductions), which map well to Spark!
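As a concrete sketch, here is a χ² allele-count association test written as a reduction over per-sample observations (made-up data; in a Spark version, the summation step would be a distributed reduce).

```python
samples = [
    # (has_trait, carries_alt_allele) — one observation per sample
    (True, True), (True, True), (True, False),
    (False, False), (False, False), (False, True),
]

# Reduce step: sum the observations into a 2x2 contingency table.
table = {(p, a): 0 for p in (True, False) for a in (True, False)}
for pheno, allele in samples:
    table[(pheno, allele)] += 1

def chi_squared(table):
    """Pearson chi-squared statistic for a 2x2 phenotype/allele table."""
    n = sum(table.values())
    stat = 0.0
    for p in (True, False):
        for a in (True, False):
            row = table[(p, True)] + table[(p, False)]
            col = table[(True, a)] + table[(False, a)]
            expected = row * col / n
            stat += (table[(p, a)] - expected) ** 2 / expected
    return stat

print(round(chi_squared(table), 3))  # → 0.667
```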

What if two variants are close to each other?

• This phenomenon is known as linkage disequilibrium: if two variants are close to each other, they are likely to be inherited together

Levo and Segal, Nat Rev Gen, 2014.

• We can often link expression changes and trait changes to blocks of the genome, but not to specific variants
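Linkage disequilibrium between a pair of variants is commonly quantified with the r² statistic; a minimal sketch over made-up phased haplotypes:

```python
haplotypes = [
    # (alt allele at site A, alt allele at site B), one tuple per haplotype
    (1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 0),
]

def r_squared(haps):
    """LD statistic r^2 between two biallelic sites, from phased haplotypes."""
    n = len(haps)
    pa = sum(a for a, b in haps) / n              # alt frequency at site A
    pb = sum(b for a, b in haps) / n              # alt frequency at site B
    pab = sum(1 for a, b in haps if a and b) / n  # joint alt frequency
    d = pab - pa * pb                             # disequilibrium coefficient
    return d * d / (pa * (1 - pa) * pb * (1 - pb))

print(round(r_squared(haplotypes), 3))  # → 0.5
```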

Can we break up these blocks?

• When a variant modifies a gene, we can predict how the gene is modified:

• Does this variant change how the protein is spliced?

• Does this variant change an amino acid?

• We can discard variants that do not change proteins

• However, if the variant is outside of a gene, how do we make sense of it?

• Are these variants important?
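A toy sketch of the coding-effect prediction described above: apply a single-base substitution to a codon and check whether the encoded amino acid changes. (Only a handful of codons from the standard genetic code are included; real annotation tools such as SnpEff or VEP also handle splicing, reading frames, and transcripts.)

```python
# A few entries from the standard genetic code (illustrative subset).
CODON_TABLE = {
    "GAA": "Glu", "GAG": "Glu",
    "GAT": "Asp", "GAC": "Asp",
    "GCA": "Ala",
    "TTA": "Leu", "TTG": "Leu",
}

def classify_substitution(codon, pos, new_base):
    """Label a single-base codon change as synonymous or missense."""
    mutated = codon[:pos] + new_base + codon[pos + 1:]
    return ("synonymous" if CODON_TABLE[codon] == CODON_TABLE[mutated]
            else "missense")

print(classify_substitution("GAA", 2, "G"))  # GAA→GAG, both Glu → synonymous
print(classify_substitution("GAA", 1, "C"))  # GAA→GCA, Glu→Ala → missense
```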

Mutations in AML

[Figure: per-patient mutation chart from the BeatAML study; the chart’s characters were lost in transcription]

There is a big “long tail”, including people who have cancer, but who have no “modified” genes!

Ravi Pandya, BeatAML.

Looking outside of the Exome

• We analyze mutations in the exome using the grammar for protein creation

• Can we apply a similar approach outside of the exome?

• Let’s use the grammar for regulation instead!

S. Weingarten-Gabbay and E. Segal, Human Genetics, 2014.

Grammar-Enhanced Statistical Models

Levo and Segal, Nat Rev Gen, 2014.
S. Weingarten-Gabbay and E. Segal, Human Genetics, 2014.

You Can Help!

• All of our projects are open source:

• https://www.github.com/bigdatagenomics

• Apache 2 licensed

• For a tutorial, see Chapter 10 of Advanced Analytics with Spark

• More details on ADAM in our latest paper

• The GA4GH is looking for coders to help enable genomic data sharing: https://github.com/ga4gh/server

Acknowledgements

• UC Berkeley: Matt Massie, Timothy Danford, André Schumacher, Jey Kottalam, Karen Feng, Eric Tu, Niranjan Kumar, Ananth Pallaseni, Anthony Joseph, Dave Patterson

• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael Linderman, Jeff Hammerbacher

• GenomeBridge: Carl Yeksigian

• Cloudera: Uri Laserson, Tom White

• Microsoft Research: Ravi Pandya, Bill Bolosky

• UC Santa Cruz: Benedict Paten, David Haussler, Hannes Schmidt, Beau Norgeot

• And many other open source contributors, especially Michael Heuer, Neil Ferguson, Andy Petrella, Xavier Tordior

• Total of 40 contributors to ADAM/BDG from >12 institutions