A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308) | AWS re:Invent 2013
Scalable Genome Analysis with ADAM
-
Upload
fnothaft -
Category
Engineering
-
view
101 -
download
4
Transcript of Scalable Genome Analysis with ADAM
Scalable Genome Analysis With ADAM
Frank Austin Nothaft, UC Berkeley AMPLab [email protected], @fnothaft
7/23/2015
Analyzing genomes: What is our goal?
• Genomes are the “source” code for life:
• The human genome is a 3.2B character “program”, split across 46 “files”
• Within a species, genomes are ~99.9% similar
• The 0.1% variance gives rise to diverse traits, as well as diseases
The Sequencing AbstractionIt was the best of times, it was the worst of times…
Metaphor borrowed from Michael Schatz
It was thethe best of
times, it wasthe worst of
worst of timesbest of timeswas the worst
• Sequencing is a Poission substring sampling process
• For $1,000, we can sequence a 30x copy of your genome
Genome Resequencing• The Human Genome Project identified the “average”
genome from 20 individuals at $1B cost
• To make this process cheaper, we use our knowledge of the “average” genome to calculate a diff
• Two problems:
• How do we compute this diff?
• How do we make sense of the differences?
Alignment and AssemblyIt was the best of times, it was the worst of times…It was the
the best oftimes, it was
the worst ofworst of times
best of times was the worst
It was thethe best of
times, it wasthe worst of
worst of timesbest of timeswas the worst
What do genomic analysis tools look like?
Genomics Pipelines
Source: The Broad Institute of MIT/Harvard
Flat File Formats
• Scientific data is typically stored in application specific file formats:
• Genomic reads: SAM/BAM, CRAM
• Genomic variants: VCF/BCF, MAF
• Genomic features: BED, NarrowPeak, GTF
Flat Architectures• APIs present very barebones abstractions:
• GATK: Sorted iterator over the genome
• Why are flat architectures bad?
1. Trivial: low level abstractions are not productive
2. Trivial: flat architectures create technical lock-in
3. Subtle: low level abstractions can introduce bugs
A green field approach
First, define a schemarecord AlignmentRecord { union { null, Contig } contig = null; union { null, long } start = null; union { null, long } end = null; union { null, int } mapq = null; union { null, string } readName = null; union { null, string } sequence = null; union { null, string } mateReference = null; union { null, long } mateAlignmentStart = null; union { null, string } cigar = null; union { null, string } qual = null; union { null, string } recordGroupName = null; union { int, null } basesTrimmedFromStart = 0; union { int, null } basesTrimmedFromEnd = 0; union { boolean, null } readPaired = false; union { boolean, null } properPair = false; union { boolean, null } readMapped = false; union { boolean, null } mateMapped = false; union { boolean, null } firstOfPair = false; union { boolean, null } secondOfPair = false; union { boolean, null } failedVendorQualityChecks = false; union { boolean, null } duplicateRead = false; union { boolean, null } readNegativeStrand = false; union { boolean, null } mateNegativeStrand = false; union { boolean, null } primaryAlignment = false; union { boolean, null } secondaryAlignment = false; union { boolean, null } supplementaryAlignment = false; union { null, string } mismatchingPositions = null; union { null, string } origQual = null; union { null, string } attributes = null; union { null, string } recordGroupSequencingCenter = null; union { null, string } recordGroupDescription = null; union { null, long } recordGroupRunDateEpoch = null; union { null, string } recordGroupFlowOrder = null; union { null, string } recordGroupKeySequence = null; union { null, string } recordGroupLibrary = null; union { null, int } recordGroupPredictedMedianInsertSize = null; union { null, string } recordGroupPlatform = null; union { null, string } recordGroupPlatformUnit = null; union { null, string } recordGroupSample = null; union { null, Contig } mateContig = null;}
ApplicationTransformations
Physical StorageAttached Storage
Data DistributionParallel FS
Materialized DataColumnar Storage
Evidence AccessMapReduce/DBMS
PresentationEnriched Models
SchemaData Models
ApplicationTransformations
Physical StorageAttached Storage
Data DistributionParallel FS
Materialized DataColumnar Storage
Evidence AccessMapReduce/DBMS
PresentationEnriched Models
SchemaData Models
A schema provides a narrow waist
record AlignmentRecord { union { null, Contig } contig = null; union { null, long } start = null; union { null, long } end = null; union { null, int } mapq = null; union { null, string } readName = null; union { null, string } sequence = null; union { null, string } mateReference = null; union { null, long } mateAlignmentStart = null; union { null, string } cigar = null; union { null, string } qual = null; union { null, string } recordGroupName = null; union { int, null } basesTrimmedFromStart = 0; union { int, null } basesTrimmedFromEnd = 0; union { boolean, null } readPaired = false; union { boolean, null } properPair = false; union { boolean, null } readMapped = false; union { boolean, null } mateMapped = false; union { boolean, null } firstOfPair = false; union { boolean, null } secondOfPair = false; union { boolean, null } failedVendorQualityChecks = false; union { boolean, null } duplicateRead = false; union { boolean, null } readNegativeStrand = false; union { boolean, null } mateNegativeStrand = false; union { boolean, null } primaryAlignment = false; union { boolean, null } secondaryAlignment = false; union { boolean, null } supplementaryAlignment = false; union { null, string } mismatchingPositions = null; union { null, string } origQual = null; union { null, string } attributes = null; union { null, string } recordGroupSequencingCenter = null; union { null, string } recordGroupDescription = null; union { null, long } recordGroupRunDateEpoch = null; union { null, string } recordGroupFlowOrder = null; union { null, string } recordGroupKeySequence = null; union { null, string } recordGroupLibrary = null; union { null, int } recordGroupPredictedMedianInsertSize = null; union { null, string } recordGroupPlatform = null; union { null, string } recordGroupPlatformUnit = null; union { null, string } recordGroupSample = null; union { null, Contig } mateContig = null;}
ApplicationTransformations
Physical StorageAttached Storage
Data DistributionParallel FS
Materialized DataColumnar Storage
Evidence AccessMapReduce/DBMS
PresentationEnriched Models
SchemaData Models
Accelerate common access patterns
• In genomics, we commonly have to find observations that overlap in a coordinate plane
• This coordinate plane is genomics specific, and is known a priori
• We can use our knowledge of the coordinate plane to implement a fast overlap join
ApplicationTransformations
Physical StorageAttached Storage
Data DistributionParallel FS
Materialized DataColumnar Storage
Evidence AccessMapReduce/DBMS
PresentationEnriched Models
SchemaData Models
ApplicationTransformations
Physical StorageAttached Storage
Data DistributionParallel FS
Materialized DataColumnar Storage
Evidence AccessMapReduce/DBMS
PresentationEnriched Models
SchemaData Models
Pick appropriate storage• When accessing scientific
datasets, we frequently slice and dice the dataset:
• Algorithms may touch subsets of columns
• We don’t always touch the whole dataset
• This is a good match for columnar storage
ApplicationTransformations
Physical StorageAttached Storage
Data DistributionParallel FS
Materialized DataColumnar Storage
Evidence AccessMapReduce/DBMS
PresentationEnriched Models
SchemaData Models
ApplicationTransformations
Physical StorageAttached Storage
Data DistributionParallel FS
Materialized DataColumnar Storage
Evidence AccessMapReduce/DBMS
PresentationEnriched Models
SchemaData Models
Is introducing a new data model really a good idea?
Source: XKCD, http://xkcd.com/927/
A subtle point:!Proper stack design can simplify
backwards compatibility
To support legacy data formats, you define a way to serialize/deserialize the schema into/from the
legacy flat file format!
Data Distribution
Materialized DataLegacy File Format
SchemaData Models
Data Distribution
Materialized DataColumnar Storage
SchemaData Models
A subtle point:!Proper stack design can simplify
backwards compatibility
This is a view!
Data Distribution
Materialized DataLegacy File Format
SchemaData Models
Data Distribution
Materialized DataColumnar Storage
SchemaData Models
Using the ADAM stack to analyze genetic variants
What are the challenges?
• The differences between people’s genomes leads to various traits and diseases, but:
• Variants don’t always have straightforward explanations
• 3B base genome with variation at 0.1% of locations —> lots of variants! How do we find the important one?
Making Sense of Variation• Variation in the genome can affect biology in
several ways:
• A variant can modify or break a protein
• A variant can modify how much of a protein is created
• The subset of your genome that encodes proteins is the exome. This is ~1% of your genome!
How do we link variants to traits?
• Statistical modeling!
• We may use a simple model (e.g., a 𝛘2 test)
• Or, we may use a more complex model (e.g., linear regression)
• This is known as a genotype-to-phenotype association test
• Most of these tests can be expressed as aggregation functions —> i.e., a reduction. This maps well to Spark!
What if two variants are close to each other?
• This phenomenon is known as linkage disequilibrium: if two variants are close to each other, they are likely to be inherited together
Levo and Segal, Nat Rev Gen, 2014.
• We can often link expression changes and trait changes to blocks of the genome, but not to specific variants
Can we break up these blocks?
• When a variant modifies a gene, we can predict how the gene is modified:
• Does this variant change how the protein is spliced?
• Does this variant change an amino acid?
• We can discard variants that do not change proteins
• However, if the variant is outside of a gene, how do we make sense of it?
• Are these variants important?
Mutations in AML����
������
�
������
�
� ������
���
����
���
� ������
���
� ������
���
�����
��
���
��
���
��
���
��
����
���
���
�
�����
� ��
�����
��
�����
�
���
��
��� �
��
���
��� ��
����
�� ��
���
� �
������
��
� ��
� � � ��
���
�� �
����
�� ��
����
� ��
�����
�� ��
����
� ��
����
��� ��
������
��� ��
!�!"�
�� ��
������
��� ��
������
��� ��
��"
��� ��
�!�!�
��� ��
��!��
�� ��
��� ��
�����
�� ��
�� ��
���
�� �� � ��
� ��
� !
�� ��
�� ��
����
��
����
� � ��
���
�� �� �� ��
�� ��
����
��
���!
��
� ��
��
����
��
� ��
�� �
� ��
�� ��
���
��
�����
��
�� �� � ��
� ��
����
��
����
��
���� �
�� ��
����
��
������
�� � ��
����
�� �� � ��
� ��
�� ��
There is a big “long tail”, including people who have cancer, but who have no “modified” genes!
Ravi Pandya, BeatAML.
Looking outside of the Exome
• We analyze mutations in the exome using the grammar for protein creation
• Can we apply a similar approach outside of the exome?
• Let’s use the grammar for regulation instead!
S. Weingarten-Gabbay and E. Segal, Human Genetics, 2014.
Grammar Enhanced Statistical Models
Levo and Segal, Nat Rev Gen, 2014.S. Weingarten-Gabbay and E. Segal, Human Genetics, 2014.
You Can Help!• All of our projects are open source:
• https://www.github.com/bigdatagenomics
• Apache 2 licensed
• For a tutorial, see Chapter 10 of Advanced Analytics with Spark
• More details on ADAM in our latest paper
• The GA4GH is looking for coders to help enable genomic data sharing: https://github.com/ga4gh/server
Acknowledgements• UC Berkeley: Matt Massie, Timothy Danford, André Schumacher, Jey
Kottalam, Karen Feng, Eric Tu, Niranjan Kumar, Ananth Pallaseni, Anthony Joseph, Dave Patterson!
• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael Linderman, Jeff Hammerbacher!
• GenomeBridge: Carl Yeksigian!
• Cloudera: Uri Laserson, Tom White!
• Microsoft Research: Ravi Pandya, Bill Bolosky!
• UC Santa Cruz: Benedict Paten, David Haussler, Hannes Schmidt, Beau Norgeot!
• And many other open source contributors, especially Michael Heuer, Neil Ferguson, Andy Petrella, Xavier Tordior!
• Total of 40 contributors to ADAM/BDG from >12 institutions