Hadoop as a Platform for Genomics

Hadoop as a Platform for Genomics. @AllenDay, Chief Scientist; Sungwook Yoon, Data Scientist; Data Science @MapR

Transcript of Hadoop as a Platform for Genomics

Page 1: Hadoop as a Platform for Genomics

Hadoop as a Platform for Genomics

@AllenDay, Chief Scientist

Sungwook Yoon, Data Scientist

Data Science @MapR

Page 2: Hadoop as a Platform for Genomics

®© 2014 MapR Technologies 2

DNA Sequencing, pre-2004

[Figure: curves over years for CPU transistors/mm2, HDD GB/mm2, and DNA bp/$ (pre-2004)]

Page 3: Hadoop as a Platform for Genomics

DNA Sequencing, 2004 Disruption

[Figure: curves over years for CPU transistors/mm2, HDD GB/mm2, DNA bp/$ (pre-2004), and DNA bp/$ (post-2004)]

Page 4: Hadoop as a Platform for Genomics

DNA Sequencing, 2004 Disruption

[Figure: curves over years for CPU transistors/mm2, HDD GB/mm2, DNA bp/$ (pre-2004), and DNA bp/$ (post-2004)]

Similar disruption occurred for Internet traffic in mid-1990s

Page 5: Hadoop as a Platform for Genomics


Effect: Many DNA-Based Apps Coming…

•  2014: US$ 2B, mostly research, mostly chemical costs

•  2020: US$ 20B, mostly clinical, mostly analytics costs

Macquarie Capital, 2014. Genomics 2.0: It’s just the beginning

[Figure: bar chart for 2014 and 2020, vertical axis 0 to 25, bars split into Clinical and Non-Clinical]

Page 6: Hadoop as a Platform for Genomics

1. What Kind of Analytics Apps?
2. How Do They Work?

Page 7: Hadoop as a Platform for Genomics

Target Audience
•  Fluency in computing, math
•  Basic knowledge of genetics, DNA

…so expect some encapsulated complexity

http://xkcd.com/803/

Page 8: Hadoop as a Platform for Genomics


Clinical Sequencing Business Process Workflow

[Diagram labels: Physician, Patient, Clinic, blood/saliva, Clinical Lab, extract, Analytics]

Page 9: Hadoop as a Platform for Genomics

Step 1: Identify all the Single Nucleotide Polymorphisms (SNPs)
•  Currently ~12MM known SNPs
•  Each person has a unique Genotype
–  Typically 3-5MM SNPs
–  Relative to a reference human
–  diff this.human other.human, essentially
•  Inherited from parents
•  Inexpensive to find as sequencing costs have plummeted

http://learn.genetics.utah.edu/content/pharma/snips/
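The "diff this.human other.human" idea can be sketched in a few lines. This is only a toy illustration, assuming two sequences that are already aligned and of equal length; real pipelines work from millions of short reads and statistical variant callers. The function name `find_snps` and the sample sequences are hypothetical.

```python
from typing import List, Tuple


def find_snps(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for each single-base mismatch.

    Assumes both sequences are already aligned and of equal length,
    which is a strong simplification of real variant calling.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


# Hypothetical toy sequences, not real genomic data.
reference = "ACGTACGTAC"
patient = "ACGTTCGTAA"
print(find_snps(reference, patient))  # [(4, 'A', 'T'), (9, 'C', 'A')]
```

The list of mismatches plays the role of the genotype relative to the reference: only the differences are kept, not the full sequence.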

Page 10: Hadoop as a Platform for Genomics


Step 2: Characterize all the SNPs (ML, AI)

Other data & algorithms

JOIN
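A minimal sketch of the JOIN on this slide, assuming the "other data" is a lookup table keyed by SNP identifier. The identifiers, annotation strings, and the function name `characterize` are all hypothetical; the slide does not say which data sources or algorithms are joined.

```python
from typing import Dict, List

# Hypothetical patient genotype: SNP identifiers found in Step 1.
patient_snps: List[str] = ["snp_a", "snp_b", "snp_c"]

# Hypothetical annotation table ("other data"): SNP id -> known characteristic.
annotations: Dict[str, str] = {
    "snp_a": "drug A: poor response",
    "snp_c": "no known effect",
    "snp_d": "drug B: toxic",
}


def characterize(snps: List[str], table: Dict[str, str]) -> Dict[str, str]:
    """Left join: keep every patient SNP, marking those with no annotation."""
    return {snp: table.get(snp, "uncharacterized") for snp in snps}


print(characterize(patient_snps, annotations))
```

A left join is used so that SNPs without any annotation stay visible instead of silently dropping out of the result.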

Page 11: Hadoop as a Platform for Genomics

Step 3: Use Genotype to Customize Therapy

Pop. Freq | Drug A Response | Drug B Response
10%       | Good            | Good
30%       | Poor            | Fair
30%       | Excellent       | Poor
30%       | Good, but Toxic | Fair

“Nil nocere” – do no harm

Innovation Opportunities
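The response table on this slide can be read as a lookup from genotype group to therapy. The sketch below is one way to express that, assuming a simple ranking of responses and a hard exclusion of toxic options in the spirit of "do no harm". The group labels `g1` to `g4`, the ranking, and the function name `choose_drug` are hypothetical and not part of the original deck.

```python
from typing import Dict

# Responses transcribed from the slide's table; group labels are hypothetical.
RESPONSES: Dict[str, Dict[str, str]] = {
    "g1": {"A": "Good", "B": "Good"},             # 10% of population
    "g2": {"A": "Poor", "B": "Fair"},             # 30%
    "g3": {"A": "Excellent", "B": "Poor"},        # 30%
    "g4": {"A": "Good, but Toxic", "B": "Fair"},  # 30%
}

# Assumed ranking for illustration only.
RANK = {"Poor": 0, "Fair": 1, "Good": 2, "Excellent": 3}


def choose_drug(group: str) -> str:
    """Pick the best-ranked non-toxic drug for a genotype group."""
    safe = {
        drug: RANK[response]
        for drug, response in RESPONSES[group].items()
        if "Toxic" not in response  # "do no harm": never pick a toxic option
    }
    # Sorting the drug names makes ties resolve alphabetically.
    return max(sorted(safe), key=safe.get)


print({group: choose_drug(group) for group in RESPONSES})
```

Under these assumptions, no single drug is the best choice for every group, which is the point of customizing therapy by genotype.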

Page 12: Hadoop as a Platform for Genomics

Jan 30: Obama Unveils “Precision Medicine” Initiative

“Most medical treatments have been designed for the ‘average patient’ … treatments can be very successful for some patients but not for others.”

http://www.msnbc.com/msnbc/obama-seeks-215-million-personalized-medicine

Page 13: Hadoop as a Platform for Genomics

Application: Forensic Analysis

http://cgi.uconn.edu/stranger-visions-forensic-art-exhibit/
http://snapshot.parabon-nanolabs.com/
http://www.nature.com/news/mugshots-built-from-dna-data-1.14899

Page 14: Hadoop as a Platform for Genomics

http://steamcommunity.com/app/203160/discussions/0/846956188647169800/
http://www.vox.com/2015/2/1/7955921/lara-croft-moores-law

Moore’s Law #Dataviz: Lara Croft 230=>40,000 Polygons (1996-2014)

Page 15: Hadoop as a Platform for Genomics

1. What Kind of Analytics Apps?
2. How Do They Work?

Page 16: Hadoop as a Platform for Genomics

Genome Sequencing in a Nutshell

[Diagram: the Reference Human is turned into the Reference Genome by de novo sequencing + assembly; the Patient is turned into the Patient Genotype by resequencing]

Page 17: Hadoop as a Platform for Genomics


Population-Scale Genome Biobanking

Page 18: Hadoop as a Platform for Genomics

GATK: Typical Tool for DNA=>Genotype Conversion

Advantages
•  No consensus alternative… yet
•  Works!
•  Already deployed and being used to save lives

Disadvantages
•  Map-Reduce but not Hadoop (and no plans to support)
•  Compute context cannot span multiple nodes
•  Inefficient use of shared memory (even within one node)
•  Inefficient asymmetric joins. No leverage of context, data locality

Page 19: Hadoop as a Platform for Genomics


GATK: flat after chromosome split

Page 20: Hadoop as a Platform for Genomics


Big Picture

N DNA Input Records

All SNPs

Catalog still growing; Genotype space huge ≫ 8E37

Personal input is fixed N records and trivial to cut into P partitions

A good implementation: scales O(N) ~ F(N,P)
But GATK is SLOW: scales O(N) ~ F(Genotypes)
GATK parallelization metrics / DEAD END attempts: https://github.com/allenday/sequencing-utils
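The claim that a personal input of N records is "trivial to cut into P partitions" can be illustrated with a small map-and-reduce sketch. It assumes the per-record work is independent; GC counting stands in for real work here. Threads are used only to keep the example self-contained, whereas a Hadoop or Spark job would spread partitions across nodes. All names and the toy reads are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def partition(records: Sequence[str], p: int) -> List[Sequence[str]]:
    """Cut N records into P contiguous partitions of near-equal size."""
    n = len(records)
    bounds = [(i * n) // p for i in range(p + 1)]
    return [records[bounds[i]:bounds[i + 1]] for i in range(p)]


def count_gc(reads: Sequence[str]) -> int:
    """Stand-in per-partition work: count G and C bases in the reads."""
    return sum(read.count("G") + read.count("C") for read in reads)


def parallel_gc(records: Sequence[str], p: int) -> int:
    """Map the work over P partitions, then reduce by summing."""
    with ThreadPoolExecutor(max_workers=p) as pool:
        return sum(pool.map(count_gc, partition(records, p)))


reads = ["ACGT", "GGCC", "ATAT", "CGCG", "TTTT"]  # hypothetical toy reads
print(parallel_gc(reads, p=2))  # 10
```

Because each partition is processed independently, the total work depends on N and P only, not on the size of the genotype space.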

Page 21: Hadoop as a Platform for Genomics

Bigger Picture: Human Suffering
•  Widely disliked. Reduction of suffering is good business.

Even Bigger
•  Is it morally wrong to allow others to suffer?
•  If you agree, and there’s a way to reduce suffering, then…
•  We can argue there is a moral imperative to build the most efficient, dependable, inexpensive solution possible

Page 22: Hadoop as a Platform for Genomics


From Feasible to Easy & Efficient

Page 23: Hadoop as a Platform for Genomics

Two Phases of Genome Data Analysis

•  Batch Sequence Processing
–  Align the reads to the correct location
–  Detect variants correctly through statistical modeling

•  Genome / Phenome Data Analysis
–  Find relevant Genotypes for Phenotypes
–  Find relevant Phenotypes for Genotypes

Page 24: Hadoop as a Platform for Genomics


Genome Processing Requirements

Big Storage Big Memory Algorithms

Sorting

Group By

Clustering

Sparse Matrix

Distributed Processing

Which Free SW Has This Solution?

2TB per person

Affordable Hardware

Forward Backward

Page 25: Hadoop as a Platform for Genomics


Genome Processing Needs More Than Hadoop

•  Strong In Memory Computation

•  Strong Sparse Matrix Computation

Which Free SW Has This Solution?

Page 26: Hadoop as a Platform for Genomics


Still One More

Genome Data Format Definition

Records:  Record 1 = (A 1 Z), Record 2 = (B 1 Z), Record 3 = (C 1 Z)
RowBased: A 1 Z B 1 Z C 1 Z
ColBased: A B C 1 1 1 Z Z Z

Sorting Group MLLib
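The row-based versus column-based layouts on this slide differ only in how the same three records are ordered in storage. A minimal sketch of the conversion is below, assuming in-memory tuples rather than any particular file format. The field names `f1` to `f3` and the function name `to_columns` are hypothetical, since the slide does not name the fields.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, int, str]

# The three records from the slide: (A 1 Z), (B 1 Z), (C 1 Z).
rows: List[Record] = [("A", 1, "Z"), ("B", 1, "Z"), ("C", 1, "Z")]


def to_columns(records: List[Record], names: List[str]) -> Dict[str, list]:
    """Transpose row-based records into one list per field (column-based)."""
    return {name: list(col) for name, col in zip(names, zip(*records))}


columns = to_columns(rows, ["f1", "f2", "f3"])
print(columns)  # {'f1': ['A', 'B', 'C'], 'f2': [1, 1, 1], 'f3': ['Z', 'Z', 'Z']}
```

In the column-based form, each field's values sit next to each other, so operations that touch one field (sorting, grouping) can skip the rest, and repeated values such as 1 1 1 and Z Z Z are easy to compress.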

Page 27: Hadoop as a Platform for Genomics


Compute Engines

Data Workflow

ADAM Pipeline

FastQ BAM ADAM ADAM-VCF VCF

Avocado ADAM ADAM Aligner

Super Fast
•  In-memory
•  Scalable compute context

A pipeline in a genomics data workflow is a sequence of data transformations from DNA sequence reads to variant calls

Page 28: Hadoop as a Platform for Genomics


Scale with Machines

From ADAM Tech Report

Page 29: Hadoop as a Platform for Genomics

That’s a lot, but it is just a start

•  Why do we want sequencing?
–  To catch criminals??
•  Police State??
•  Deeper, wider genome study may reveal
–  Future medicine
–  Cure for diseases
–  Maybe … find Heroes??

Page 30: Hadoop as a Platform for Genomics


Variants Accumulate – Need a Scalable Variant Store

ADAM ADAM-VCF

Page 31: Hadoop as a Platform for Genomics

Genome × Phenome Analysis

For a given population, a given SNP 𝛿, and a given phenotype ϕ: count the number of occurrences and use it as the value of the matrix entry.

[Figure: SPARSE matrix; rows 𝛿1, 𝛿3, 𝛿5 (Billion+ Genotypes), columns ϕ1, ϕ3, ϕ5 (Billion+ Phenotypes)]
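A small sketch of how such a count matrix could be built. It assumes that "number of occurrences" means the number of individuals who carry SNP 𝛿 and also show phenotype ϕ; that reading is an assumption, as are the toy population and the function name `cooccurrence`. A dictionary keyed by (SNP, phenotype) stores only non-zero cells, which is what makes the sparse representation practical.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

# Hypothetical individuals: each has a set of SNPs and a set of phenotypes.
population = [
    ({"d1", "d3"}, {"p1"}),
    ({"d1"}, {"p1", "p5"}),
    ({"d5"}, {"p3"}),
]


def cooccurrence(
    people: Iterable[Tuple[Set[str], Set[str]]]
) -> Dict[Tuple[str, str], int]:
    """Count, for each (SNP, phenotype) pair, how many individuals have both.

    Only non-zero cells are stored, so the dictionary acts as a sparse matrix.
    """
    counts: Counter = Counter()
    for snps, phenotypes in people:
        for snp in snps:
            for phenotype in phenotypes:
                counts[(snp, phenotype)] += 1
    return dict(counts)


matrix = cooccurrence(population)
print(matrix[("d1", "p1")])  # 2
```

At population scale the same counting would be a distributed group-by over (SNP, phenotype) keys rather than an in-memory loop.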

Page 32: Hadoop as a Platform for Genomics

Interpreting Genome × Phenome Matrix Factorization Result
•  Row vectors of X represent
–  Archetype set of phenotypes
•  Column vectors of Y represent
–  Archetype set of genotypes

[Figure: matrix with rows 𝛿1, 𝛿3, 𝛿5 and columns ϕ1, ϕ3, ϕ5; the principal column vector gives Archetype Genotypes and the principal row vector gives Archetype Phenotypes]

Sparse Matrix Package is Actively Developed in Spark Community
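One way to write the factorization this slide interprets, assuming M is the genotype-by-phenotype count matrix with genotypes as rows and a rank-K approximation is used. The symbols M, G, P, and K are assumptions introduced here; the slide names only X and Y and does not state the method or the rank.

```latex
M \approx Y X = \sum_{k=1}^{K} y_k \, x_k^{\top},
\qquad
M \in \mathbb{R}^{G \times P},\;
Y \in \mathbb{R}^{G \times K},\;
X \in \mathbb{R}^{K \times P}
```

Here y_k is the k-th column of Y, a weighting over genotypes (an archetype genotype), and x_k^T is the k-th row of X, a weighting over phenotypes (an archetype phenotype).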

Page 33: Hadoop as a Platform for Genomics

Toward Heroes: Genome × Phenome Tensor
•  Aggregating over individuals with a matrix could ignore the correlations among genotypes and phenotypes
•  Maintain individual identity

[Figure: two panels, each with axes Variants and Phenotypes]

Page 34: Hadoop as a Platform for Genomics

Tensor Factorization (Parafac)

[Figure: tensor with axes Genome Variants and Phenome ≈ Principal Variants 1 combined with Principal Phenotypes 1]
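For reference, the standard PARAFAC (CP) form, written under the assumption that the tensor's three modes are individuals (i), variants (j), and phenotypes (k). The symbols T, a, b, c, and R are introduced here for illustration; the slide shows only the first component.

```latex
T_{ijk} \approx \sum_{r=1}^{R} a_{ir} \, b_{jr} \, c_{kr}
```

In this reading, b_{·r} corresponds to "Principal Variants r" and c_{·r} to "Principal Phenotypes r", while a_{·r} records how strongly each individual expresses component r, which is how individual identity is kept.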

Page 35: Hadoop as a Platform for Genomics


From Imaginable to Possible

Page 36: Hadoop as a Platform for Genomics


Genome needs Hadoop

Variant Calling

DNA Sequencer

Reads

Reference Genome

Genotype/ Phenotype/ Individual

Matrix

Cure & Prevent Disease

Medical Records

Patient

Page 37: Hadoop as a Platform for Genomics

Scalable Variant Store – Data Mining

Model P ~ F(G)
Fortunately, this has already been done…

Genotypes Med Record Phenotypes, e.g. disease risk, drug response

Page 38: Hadoop as a Platform for Genomics

Largest Biometric Database in the World

1.2B PEOPLE

Page 39: Hadoop as a Platform for Genomics

Why Create Aadhaar?
•  India: 1.2 billion residents
–  640,000 villages, ~60% live under $2/day
–  ~75% literacy, <3% pay income tax, <20% have bank accounts
–  ~800 million mobile, ~200-300 million migrant workers
•  Govt. spends about $25-40 billion on direct subsidies
–  Residents have no standard identity document
–  Most programs plagued with ghost and multiple identities causing leakage of 30-40%

Standardize identity => Stop leakage

Page 40: Hadoop as a Platform for Genomics


Aadhaar Biometric Capture & Index

Raw Digital Fingerprint

Page 41: Hadoop as a Platform for Genomics

Aadhaar Biometric ID Creation

F(x): unique features
G(x): uncommon features
H(x): other features

•  900MM people loaded in 4 years
•  In production
–  1MM registrations/day
–  200+ trillion lookups/day
•  All built on MapR-DB (HBase)

Page 42: Hadoop as a Platform for Genomics

How Does this Relate to Genomics?

F(x): unique features
G(x): uncommon features
H(x): other features

Same data shape and size
•  Aadhaar: 1B humans, 5MB minutiae
•  Genome: 7B humans, ~3M variants
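The totals behind the two bullets can be worked out directly. This is back-of-the-envelope arithmetic using only the slide's figures; decimal units (1 PB = 1e9 MB) and the constant names are assumptions, and the per-variant storage size is not given, so the genome side is left as a record count.

```python
# Figures taken from the slide; decimal units (1 PB = 1e9 MB) are an assumption.
AADHAAR_PEOPLE = 1_000_000_000
MINUTIAE_MB_PER_PERSON = 5
GENOME_PEOPLE = 7_000_000_000
VARIANTS_PER_PERSON = 3_000_000

aadhaar_total_pb = AADHAAR_PEOPLE * MINUTIAE_MB_PER_PERSON / 1e9
genome_total_variants = GENOME_PEOPLE * VARIANTS_PER_PERSON

print(f"Aadhaar minutiae: {aadhaar_total_pb:.0f} PB")                # 5 PB
print(f"Genome variant records: {genome_total_variants:.1e}")        # 2.1e+16
```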

Page 43: Hadoop as a Platform for Genomics

How Does this Relate to Genomics?

F^-1(x): common features
F(x): unique features
G(x): uncommon features
H(x): other features

Same data shape and size
•  Aadhaar: 1B humans, 5MB minutiae
•  Genome: 7B humans, ~3M variants
•  Genome: variant × phenotype
•  Common variant => effect-causing gene F^-1(x)!

Same data set operations

Page 44: Hadoop as a Platform for Genomics

Genotype/ Phenotype/ Individual Matrix

[Figure: matrix of individuals by fingerprint minutiae] Find genetic basis of fingerprints

[Figure: matrix of medical records by genetic variants] Find genetic basis of disease

Page 45: Hadoop as a Platform for Genomics


Thanks! Questions?

@allenday, @mapr


linkedin.com/in/allenday