Genomics Is Not Special: Towards Data Intensive Biology
-
Upload
uri-laserson -
Category
Science
-
view
2.534 -
download
2
description
Transcript of Genomics Is Not Special: Towards Data Intensive Biology
Genomics IsNot Special
Uri Laserson // [email protected] // 13 November 2014
Toward Data-Intensive Biology
2© 2014 Cloudera, Inc. All rights reserved.http://omicsmaps.com/
>25 Pbp / year
3© 2014 Cloudera, Inc. All rights reserved.
Carr and Church, Nat. Biotech. 27: 1151 (2009)
4© 2014 Cloudera, Inc. All rights reserved.
For every “-ome” there’s a “-seq”
Genome DNA-seq
Transcriptome
RNA-seq
FRT-seq
NET-seq
Methylome Bisulfite-seq
Immunome Immune-seq
ProteomePhIP-seq
Bind-n-seq
http://liorpachter.wordpress.com/seq/
5© 2014 Cloudera, Inc. All rights reserved.
For every “-ome” there’s a “-seq”
Genome DNA-seq
Transcriptome
RNA-seq
FRT-seq
NET-seq
Methylome Bisulfite-seq
Immunome Immune-seq
ProteomePhIP-seq
Bind-n-seq
http://liorpachter.wordpress.com/seq/
6© 2014 Cloudera, Inc. All rights reserved.
Based on IMGT/LIGM release 201111-6
7© 2014 Cloudera, Inc. All rights reserved.
8© 2014 Cloudera, Inc. All rights reserved.
9© 2014 Cloudera, Inc. All rights reserved.
Developer/computational efficiency becoming
paramount
Genome Biology 12: 125 (2011)
10© 2014 Cloudera, Inc. All rights reserved.
Software and data management around
since 1970s
• Version control/reproducibility
• Testing/automation/integration
• Databases and data formats
• API design
• Lots (most?) of big data innovation happening in industry
11© 2014 Cloudera, Inc. All rights reserved.
Example query
For each variant that is• overlapping a DNase HS site
• predicted to be deleterious
• absent from dbSNP
compute the MAF by subpopulation
using samples in Framing Heart Study
PARTNER LOGO
CH
R
POS RE
F
AL
T
POP MAF POLYPHEN
7 122892
37
A G Plain 0.01 possibly
damaging
7 122892
37
A G Star-
bellied
0.03 possibly
damaging
12 228833
2
T C Plain 0.00
3
probably
damaging
12 228833
2
T C Star-
bellied
0.09 probably
damaging
12© 2014 Cloudera, Inc. All rights reserved.
Available data
Data set Format Size
Population genotypes VCF 10-100s of
billions
Dnase HS sites
(ENCODE)
narrowPeak
(BED)
<1 million
dbSNP CSV 10s of millions
Sample phenotypes JSON thousands
13© 2014 Cloudera, Inc. All rights reserved.
Why text data is a bad idea
• Text is highly inefficient• Compresses poorly
• Values must be parsed
• Text is semi-structured at best• Flexible schemas make parsing difficult
• Difficult to make assumptions on data structure
• Text poorly separates the roles of delimiters and data• Requires escaping of control characters
• (ASCII actually includes RS 0x1E and FS 0x1F, but they’re never used)
• But still almost always better than Excel
14© 2014 Cloudera, Inc. All rights reserved.
Some reasons VCF in particular is bad
• Number of records (variants) grows with new variants, rather than new genotypes• difficult to write data
• adding a sample requires rewrite of entire file
• Data must be sorted
• Semi-structured: need to build a parser for each file
• Conflates two functions:• catalogue of variation
• repository actual observed genotypes
• If gzipped, it’s not splittable
• Variants are not encoded uniquely by the VCF spec
15© 2014 Cloudera, Inc. All rights reserved.
Manually executing query in Python
class IntervalTree(object):
def update(self, feature):
pass # ...implement tree update
def overlaps(self, feature):
return True or False
dnase_sites = IntervalTree()
with open('path/to/dnase.narrowPeak', 'r') as ip:
for line in ip:
feature = parse_feature(line)
dnase_sites.update(feature)
samples = {}
with open('path/to/samples.json', 'r') as ip:
for line in ip:
sample = json.loads(line)
if is_framingham(sample):
samples[sample['name']] = sample
dbsnp = set()
with open('path/to/dbsnp.csv', 'r') as ip:
for line in ip:
snp = tuple(line.split()[:3])
dbsnp.add(snp)
16© 2014 Cloudera, Inc. All rights reserved.
Additional metadata must fit in memory
class IntervalTree(object):
def update(self, feature):
pass # ...implement tree update
def overlaps(self, feature):
return True or False
dnase_sites = IntervalTree()
with open('path/to/dnase.narrowPeak', 'r') as ip:
for line in ip:
feature = parse_feature(line)
dnase_sites.update(feature)
samples = {}
with open('path/to/samples.json', 'r') as ip:
for line in ip:
sample = json.loads(line)
if is_framingham(sample):
samples[sample['name']] = sample
dbsnp = set()
with open('path/to/dbsnp.csv', 'r') as ip:
for line in ip:
snp = tuple(line.split()[:3])
dbsnp.add(snp)
17© 2014 Cloudera, Inc. All rights reserved.
Can only read from POSIX filesystem
class IntervalTree(object):
def update(self, feature):
pass # ...implement tree update
def overlaps(self, feature):
return True or False
dnase_sites = IntervalTree()
with open('path/to/dnase.narrowPeak', 'r') as ip:
for line in ip:
feature = parse_feature(line)
dnase_sites.update(feature)
samples = {}
with open('path/to/samples.json', 'r') as ip:
for line in ip:
sample = json.loads(line)
if is_framingham(sample):
samples[sample['name']] = sample
dbsnp = set()
with open('path/to/dbsnp.csv', 'r') as ip:
for line in ip:
snp = tuple(line.split()[:3])
dbsnp.add(snp)
18© 2014 Cloudera, Inc. All rights reserved.
Manually executing query in Python
genotype_data = {}
reader = vcf.Reader('path/to/genotypes.vcf')
for variant in reader:
if (dnase_sites.overlaps(variant) and is_deleterious(call)
and not in_dbsnp(variant)):
for call in variant.samples:
if call.sample in samples:
pop = samples[call.sample]['population']
genotype_data.setdefault((variant, pop), []).append(call)
mafs = {}
for (variant, pop) in genotype_data.iter_keys():
mafs[(variant, pop)] = compute_maf(genotype_data[(variant, pop)])
19© 2014 Cloudera, Inc. All rights reserved.
Genotype data may be split across files
genotype_data = {}
reader = vcf.Reader('path/to/genotypes.vcf')
for variant in reader:
if (dnase_sites.overlaps(variant) and is_deleterious(call)
and not in_dbsnp(variant)):
for call in variant.samples:
if call.sample in samples:
pop = samples[call.sample]['population']
genotype_data.setdefault((variant, pop), []).append(call)
mafs = {}
for (variant, pop) in genotype_data.iter_keys():
mafs[(variant, pop)] = compute_maf(genotype_data[(variant, pop)])
20© 2014 Cloudera, Inc. All rights reserved.
Manually executing query in Python
• If file is gzipped, cannot split file without decompressing (use Snappy)
• Reading files required access to POSIX-style file system
• Probably want to split VCF file into pieces to parallelize• Requires manual scatter-gather
• Samples may be scattered among multiple VCF files (difficult to append to VCF)
• Manually implementing broadcast join• Build side must fit into memory
21© 2014 Cloudera, Inc. All rights reserved.
Manually executing query in Python on HPC
$ bsub –q shared_12h python split_genotypes.py
$ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_1.vcf agg1.csv$ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_2.vcf agg2.csv$ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_3.vcf agg3.csv$ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_4.vcf agg4.csv
$ bsub –q shared_12h python merge_maf.py
22© 2014 Cloudera, Inc. All rights reserved.
Manually executing query in Python on HPC
$ bsub –q shared_12h python split_genotypes.py
$ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_1.vcf agg1.csv$ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_2.vcf agg2.csv$ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_3.vcf agg3.csv$ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_4.vcf agg4.csv
$ bsub –q shared_12h python merge_maf.py
How to serialize intermediate output?
Manually specify requested resources
Manually split and mergeBabysit and
check for errors/failures
23© 2014 Cloudera, Inc. All rights reserved.
HPC separates compute from storage
HPC is about compute.Hadoop is about data.
Storage infrastructure• Proprietary,
distributed file system
• Expensive
Compute cluster• High-perf, reliable
hardware• Expensive
Big network
pipe ($$$)
User typically works by manually submitting jobs to scheduler(e.g., LSF, Grid Engine, etc.)
24© 2014 Cloudera, Inc. All rights reserved.
HPC is lower-level than Hadoop
• HPC only exposes job scheduling
• Parallelization typically through MPI• Very low-level communication primitives
• Difficult to horizontally scale by simply adding nodes
• Large data sets must be manually split
• Failures must be dealt with manually
25© 2014 Cloudera, Inc. All rights reserved.
HPC uses file system as DB; text file as LCD
• All tools assume flat files with POSIX semantics
• Sharing data/collaboration involves copying large files
• Broad joint caller with 25k genomes hits file handle limits
• Files always streamed over network (HPC architecture)
26© 2014 Cloudera, Inc. All rights reserved.
HPC uses job scheduler as workflow tool
• Submitting jobs to scheduler is low level
• Workflow engines/execution models provide high level execution graphs with built-in fault tolerance• e.g., MapReduce, Oozie, Spark, Luigi, Crunch, Cascading, Pig, Hive
27© 2014 Cloudera, Inc. All rights reserved.
Prepping data for local analysis in R/Python
• Manual script to prepare CSV file for working locally
• Same issues as above
• Requires working set of data to fit into memory of a single machine
• Visualization
28© 2014 Cloudera, Inc. All rights reserved.
Domain-specific tools (e.g., PLINK/Seq)
$ pseq path/to/project v-stats --mask phe=framingham locset=dnase ref.ex=dbsnp
one of a limited set of specific, useful tasks
(yet another) custom query specification
29© 2014 Cloudera, Inc. All rights reserved.
Domain-specific tools (e.g., PLINK/Seq)
• Works great if your problem fits into the pre-designed computations
• Only works if your problem fits into the pre-designed computations
• How to do stats by subpopulation?• Probably possible, but need to learn new notation
• Must work to get data in to begin-with
• Not obviously parallelizable for performance on large data sets
• Built on SQLite underneath
30© 2014 Cloudera, Inc. All rights reserved.
RDBMS and SQL (e.g., MySQL)
SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call)
FROM genotypes g
INNER JOIN samples s
ON g.sample = s.sample
INNER JOIN dnase d
ON g.chr = d.chr
AND g.pos >= d.start
AND g.pos < d.end
LEFT OUTER JOIN dbsnp p
ON g.chr = p.chr
AND g.pos = p.pos
AND g.ref = p.ref
AND g.alt = p.alt
WHERE
s.study = "framingham"
p.pos IS NULL AND
g.polyphen IN ( "possibly damaging", "probably damaging" )
GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop
31© 2014 Cloudera, Inc. All rights reserved.
RDBMS and SQL (e.g., MySQL)
• Feature-rich and very mature
• Highly optimized and allows indexing
• Declarative (and abstracted) language for data
• Hassle to get data in; data end up formatted one way
• No clear scalability story
• SQL-only
32© 2014 Cloudera, Inc. All rights reserved.
Problems with old way
• Expensive
• No fault-tolerance
• No horizontal scalability
• Poor separation of data modeling and storage formats• File format proliferation
• Inefficient text formats
33© 2014 Cloudera, Inc. All rights reserved.
34© 2014 Cloudera, Inc. All rights reserved.
Indexing the web
• Web is Huge• Hundreds of millions of pages in 1999
• How do you index it?• Crawl all the pages
• Rank pages based on relevance metrics
• Build search index of keywords to pages
• Do it in real time!
35© 2014 Cloudera, Inc. All rights reserved.
36© 2014 Cloudera, Inc. All rights reserved.
Databases in 1999
• Buy a really big machine
• Install expensive DBMS on it
• Point your workload on it
• Hope it doesn’t fail
• Ambitious: buy another big machine as backup
37© 2014 Cloudera, Inc. All rights reserved.
38© 2014 Cloudera, Inc. All rights reserved.
Database limitations
• Didn’t scale horizontally• High marginal cost ($$$)
• No real fault-tolerance story
• Vendor lock-in ($$$)
• SQL unsuited for search ranking• Complex analysis (PageRank)
• Unstructured data
39© 2014 Cloudera, Inc. All rights reserved.
40© 2014 Cloudera, Inc. All rights reserved.
Google does something different
• Designed their own storage and processing infrastructure• Google File System (GFS) and MapReduce (MR)
• Goals: cheap, scalable, reliable
• General framework for large-scale batch computation
• Powered Google Search for many years• Still used internally to this day (millions of jobs)
41© 2014 Cloudera, Inc. All rights reserved.
Google benevolent enough to publish
2003 2004
42© 2014 Cloudera, Inc. All rights reserved.
Birth of Hadoop at Yahoo!
• 2004-2006: Doug Cutting and Mike Cafarella implement GFS/MR
• 2006: Spun out as Apache Hadoop
• Named after Doug’s son’s yellow stuffed elephant
43© 2014 Cloudera, Inc. All rights reserved.
Open-source proliferation
Google Open-source Function
GFS HDFS Distributed file system
MapReduce MapReduce Batch distributed data
processing
Bigtable HBase Distributed DB/key-value store
Protobuf/Stubb
y
Thrift or Avro Data serialization/RPC
Pregel Giraph Distributed graph processing
Dremel/F1 Impala Scalable interactive SQL (MPP)
FlumeJava Crunch Abstracted data pipelines on
Hadoop
44© 2014 Cloudera, Inc. All rights reserved.
Hadoop provides:
• Data centralization on HDFS• No rewriting data for each tool/application
• Data-local execution to avoid moving terabytes
• High-level execution engines• SQL (Impala, Hive)
• Relational algebra (Spark, MapReduce)
• Bulk synchronous parallel (GraphX)
• Distributed in-memory
• Built-in horizontal scalability and fault-tolerance
• Hadoop-friendly, evolvable serialization formats/RPC
45© 2014 Cloudera, Inc. All rights reserved.
Hadoop provides serialization/RPC formats
(Avro)
• Specify schemas/services in user-friendly IDLs
• Code-generation to multiple languages (wire-compatible/portable)
• Compact, binary formats
• Support for schema evolution
• Like binary JSONrecord Feature {
union { null, string } featureId = null;union { null, string } featureType = null; // e.g., DNase HSunion { null, string } source = null; // e.g., BED, GFF fileunion { null, Contig } contig = null;union { null, long } start = null;union { null, long } end = null;union { null, Strand } strand = null;union { null, double } value = null;array<Dbxref> dbxrefs = [];array<string> parentIds = [];map<string> attributes = {};
}
46© 2014 Cloudera, Inc. All rights reserved.
APIs instead of file formats
• Service-oriented architectures (SOA) ensure stable contracts
• Allows for implementation changes with new technologies
• Software community has lots of experience with SOA, along with mature tools
• Can be implemented in language-independent fashion
47© 2014 Cloudera, Inc. All rights reserved.
Current file format hairball
48© 2014 Cloudera, Inc. All rights reserved.
API-oriented architecture
49© 2014 Cloudera, Inc. All rights reserved.
Hadoop provides columnar storage
(Parquet)
• Designed for general data storage
• Columnar format• read fewer bytes
• compression more efficient
• Splittable
• Avro/Thrift-compatible
• Predicate pushdown
• RLE, dictionary-encoding
50© 2014 Cloudera, Inc. All rights reserved.
Hadoop provides columnar storage
(Parquet)
51© 2014 Cloudera, Inc. All rights reserved.
Hadoop provides columnar storage
(Parquet)
Vertical partitioning(projection pushdown)
Horizontal partitioning(predicate pushdown)
Read only the data you
need!
+ =
52© 2014 Cloudera, Inc. All rights reserved.
Hadoop provides abstractions for data
processing
HDFS (scalable, distributed storage)
YARN (resource management)
MapReduc
e
Impala
(SQL)
Solr
(search)Spark
ADAMquince guacamole …
bdg-f
orm
ats
(A
vro
/Parq
uet)
53© 2014 Cloudera, Inc. All rights reserved.
Hadoop examples: filesystem
[laserson@bottou01-10g ~]$ hadoop fs –ls /user/lasersonFound 16 itemsdrwx------ - laserson laserson 0 2014-11-12 16:00 .Trashdrwxr-xr-x - laserson laserson 0 2014-11-12 00:29 .sparkStagingdrwx------ - laserson laserson 0 2014-06-07 13:27 .stagingdrwxr-xr-x - laserson laserson 0 2014-10-30 14:15 1kgdrwxr-xr-x - laserson laserson 0 2014-05-08 17:29 bigmldrwxr-xr-x - laserson laserson 0 2014-10-30 14:14 bookdrwxrwxr-x - laserson laserson 0 2014-06-16 12:59 editingdrwxr-xr-x - laserson laserson 0 2014-06-06 13:49 gdelt-rw-r--r-- 3 laserson laserson 0 2014-10-27 16:24 hg19_textdrwxr-xr-x - laserson laserson 0 2014-06-12 19:53 madlibportdrwxr-xr-x - laserson laserson 0 2014-03-20 18:09 rock-health-pythondrwxr-xr-x - laserson laserson 0 2014-05-15 13:25 test-udfdrwxr-xr-x - laserson laserson 0 2014-08-21 17:58 test_pymcdrwxr-xr-x - laserson laserson 0 2014-10-27 22:25 tmpdrwxr-xr-x - laserson laserson 0 2014-10-07 20:30 udf-scratchdrwxr-xr-x - laserson laserson 0 2014-03-02 13:50 udfs
54© 2014 Cloudera, Inc. All rights reserved.
Hadoop examples: batch MapReduce job
hadoop jar vcf2parquet-0.1.0-jar-with-dependencies.jar \com.cloudera.science.vcf2parquet.VCFtoParquetDriver \hdfs:///path/to/variants.vcf \hdfs:///path/to/output.parquet
55© 2014 Cloudera, Inc. All rights reserved.
Hadoop examples: interactive Spark shell
[laserson@bottou01-10g ~]$ spark-shell --master yarnWelcome to
____ __/ __/__ ___ _____/ /___\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.1.0/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)Type in expressions to have them evaluated.[...]
scala>
56© 2014 Cloudera, Inc. All rights reserved.
Hadoop examples: interactive Spark shell
def inDbSnp(g: Genotype): Boolean = true or false
def isDeleterious(g: Genotype): Boolean = g.getPolyPhen
val samples = sc.textFile("path/to/samples").map(parseJson(_)).collect()
val dbsnp = sc.textFile("path/to/dbSNP").map(_.split(",")).collect()
val genotypesRDD = sc.adamLoad("path/to/genotypes")
val dnaseRDD = sc.adamBEDFeatureLoad("path/to/dnase")
val filteredRDD = genotypesRDD
.filter(!inDbSnp(_))
.filter(isDeleterious(_))
.filter(isFramingham(_))
val joinedRDD = RegionJoin.partitionAndJoin(sc, filteredRDD, dnaseRDD)
val maf = joinedRDD
.keyBy(x => (x.getVariant, getPopulation(x)))
.groupByKey()
.map(computeMAF(_))
.saveAsNewAPIHadoopFile("path/to/output")
57© 2014 Cloudera, Inc. All rights reserved.
Hadoop provides abstractions for data
processing
HDFS (scalable, distributed storage)
YARN (resource management)
MapReduc
e
Impala
(SQL)
Solr
(search)Spark
ADAMquince guacamole …
bdg-f
orm
ats
(A
vro
/Parq
uet)
58© 2014 Cloudera, Inc. All rights reserved.
Genomics ETL
.fastq .bam .vcf
.bed/.gtf/etc
short read
alignment
genotype calling analysis
59© 2014 Cloudera, Inc. All rights reserved.
Hadoop variant store architecture
Impala shell (SQL)
REST API
JDBC
SQL query
Impala engine
Hive metastoreResult
set
.parquet.vcf
ETL
60© 2014 Cloudera, Inc. All rights reserved.
Data denormalization
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
• Amortize join cost up-front• Replace joins with predicates
(allowing predicate pushdown)
61© 2014 Cloudera, Inc. All rights reserved.
Hadoop solution characteristics
• Data stored as Parquet columnar format for performance and compression
• Impala/Hive metastore provide unified, flexible data model
• Impala implements RDBMS-style operations (by experts in distributed systems)
• Spark offers flexible relational algebra operators (and in-memory computing)
• Built-in fault tolerance for computations and horizontal scalability
62© 2014 Cloudera, Inc. All rights reserved.
Example variant-filtering query
• “Give me all SNPs that are:• chromosome 16
• absent from dbSNP
• present in COSMIC
• observed in breast cancer samples”
• On full 1000 Genome data set• ~37 billion genotypes
• 14 node cluster
• query completion in several seconds
SELECT cosmic as snp_id,
vcf_chrom as chr,
vcf_pos as pos,
sample_id as sample,
vcf_call_gt as genotype,
sample_affection as phenotype
FROM
hg19_parquet_snappy_join_cached_partitioned
WHERE
COSMIC IS NOT NULL AND
dbSNP IS NULL AND
sample_study = ”breast_cancer" AND
VCF_CHROM = "16";
PARTNER LOGO
63© 2014 Cloudera, Inc. All rights reserved.
Other queries/use cases
• All-vs-all eQTL integrated with ENCODE• >120 billion p-values
• “Top 20 eQTLs for 5 genes of interest”: interactive
• “Find all cis-eQTLs”: several minutes
• Population genetics queries (e.g., backend for PLINK)
• Interval arithmetic on large ENCODE data sets
• Duke CHGV• ATAV DSL for preparing data for GWAS
• Week-long queries now take a few hours by parallelizing on Spark
64© 2014 Cloudera, Inc. All rights reserved.
Computational biologists are reinventing the
wheel
• e.g., CRAM (columnar storage)
• e.g., workflow managers (Galaxy)
• e.g., GATK (scatter-gather)
65© 2014 Cloudera, Inc. All rights reserved.
Large-scale data analysis has been solved*
• Cheaper in terms of hardware
• Easier in terms of productivity
• Built-in horizontal scaling
• Built-in fault tolerance
• Layered abstractions for data modeling
• Hadoop!
66© 2014 Cloudera, Inc. All rights reserved.
Science on Hadoop
• ADAM project for genomics on Spark• http://bdgenomics.org/
• Guacamole for somatic variation on Spark• https://github.com/hammerlab/guacamole/
• Thunder project for neuroimaging on Spark• http://thefreemanlab.com/thunder/
• Quince for variant store on Impala• currently barebones, but with examples
• https://github.com/laserson/quince
67© 2014 Cloudera, Inc. All rights reserved.
Suggestions/resources
• Everyone should learn Python• (also, everyone should try some experiments)
• Everyone should use version control (e.g., git)• GitHub enables easy collaboration
• See Titus Brown’s blog
• Use the IPython Notebook (Jupyter) for productivity
• Big data is often about engineering; use the best tools
• For getting industry jobs:• Show people you know how to code: put your projects on GitHub
• You should feel lucky if others will start using your code
68© 2014 Cloudera, Inc. All rights reserved.
69© 2014 Cloudera, Inc. All rights reserved.
Acknowledgements
• Cloudera• Sandy Ryza (Spark development)
• Nong Li (Impala)
• Skye Wanderman-Milne (Impala)
• Impala genomics collaborators• Kiran Mukhyala
• Slaton Lipscomb
• ADAM project• Matt Massie
• Frank Nothaft
• Timothy Danford
• Mount Sinai School of Medicine• Jeff Hammerbacher (+ lab)
• Duke CHGV• Jonathan Keebler
Thank you.