Petascale AnalyJcs in Genomics with...
Transcript of Petascale AnalyJcs in Genomics with...
![Page 1: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/1.jpg)
1©Cloudera,Inc.Allrightsreserved.
11April2016,SwissBigDataUserGroupTomWhite|@tom_e_white
PetascaleAnalyJcsinGenomicswithHadoop
![Page 2: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/2.jpg)
2©Cloudera,Inc.Allrightsreserved.
• DataScienceTeamatCloudera
• ApacheHadoopCommiLer,PMCMember,ApacheMember
• Authorof“Hadoop:TheDefiniJveGuide”
AboutMe
![Page 3: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/3.jpg)
3©Cloudera,Inc.Allrightsreserved.
Whatisgenomics?
![Page 4: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/4.jpg)
4©Cloudera,Inc.Allrightsreserved.
Organism Cell Genome
![Page 5: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/5.jpg)
5©Cloudera,Inc.Allrightsreserved.
![Page 6: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/6.jpg)
6©Cloudera,Inc.Allrightsreserved.
Referencechromosome
Loca4on
![Page 7: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/7.jpg)
7©Cloudera,Inc.Allrightsreserved.“…decodingtheBookofLife”
![Page 8: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/8.jpg)
8©Cloudera,Inc.Allrightsreserved.
![Page 9: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/9.jpg)
9©Cloudera,Inc.Allrightsreserved.
![Page 10: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/10.jpg)
10©Cloudera,Inc.Allrightsreserved.
![Page 11: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/11.jpg)
11©Cloudera,Inc.Allrightsreserved.
![Page 12: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/12.jpg)
12©Cloudera,Inc.Allrightsreserved.
![Page 13: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/13.jpg)
13©Cloudera,Inc.Allrightsreserved.
![Page 14: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/14.jpg)
14©Cloudera,Inc.Allrightsreserved.
>read1TTGGACATTTCGGGGTCTCAGATT>read2
AATGTTGTTAGAGATCCGGGATTT>read3GGATTCCCCGCCGTTTGAGAGCCT>read4AGGTTGGTACCGCGAAAAGCGCAT
Bioinforma4cs!
![Page 15: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/15.jpg)
15©Cloudera,Inc.Allrightsreserved.
Alignment Dedup Recalibrate QC/FilterVariantCalling
VariantAnnota4on
Pipelines!
![Page 16: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/16.jpg)
16©Cloudera,Inc.Allrightsreserved.
##fileformat=VCFv4.1 ##fileDate=20090805 ##source=myImputationProgramV3.1 ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta ##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x> ##phasing=partial ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency"> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele"> ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129"> ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership"> ##FILTER=<ID=q10,Description="Quality below 10"> ##FILTER=<ID=s50,Description="Less than 50% of samples have data"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,. 20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4 20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
Compressedtextfiles(non-spliKable)Semi-structuredPoorlyspecified
Globalsortorder
![Page 17: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/17.jpg)
17©Cloudera,Inc.Allrightsreserved.
.fastq .sam .bam .vcf .gvcf .bcf
bwa
.bed .gff .ped
PLINK/SEQ
Avro IDL(schemas)
bedtools
samtoolsbcftools
htslib GATK
File formats(e.g., Parquet
columnar)
ExecutionModel
ApplicationLayer
DataModel
StorageModel
Hadoopdistributedfilesystem
HBasedistributedkey-value
S3cloud
storage
NFS(POSIX)
SNAP, ADAM(bwa, Picard, GATK)
MapReduce(incl. distributed
in-memory)
distributedSQL queries
PLINK/SEQ genomeassembly
distributed graphcomputation (BSP)
CHPC(scheduler)POSIXfilesystem
JavaHPC(Queue)POSIXfilesystem
C++Single-nodeSQLite
It’sfileformatsallthewaydown!
![Page 18: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/18.jpg)
18©Cloudera,Inc.Allrightsreserved.
Dedup
![Page 19: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/19.jpg)
19©Cloudera,Inc.Allrightsreserved.
/***Mainworkmethod.ReadstheBAMfileonceandcollectssortedinformationabout*the5'endsofbothendsofeachread(orjustoneendinthecaseofpairs).*Thenmakesapassthroughthosedeterminingduplicatesbeforere-readingthe*inputfileandwritingitoutwithduplicationflagssetcorrectly.*/protectedintdoWork(){//buildsomedatastructuresbuildSortedReadEndLists(useBarcodes);generateDuplicateIndexes(useBarcodes);
finalSAMFileWriterout=newSAMFileWriterFactory().makeSAMOrBAMWriter(outputHeader,true,OUTPUT);finalCloseableIterator<SAMRecord>iterator=headerAndIterator.iterator;while(iterator.hasNext()){finalSAMRecordrec=iterator.next();if(!rec.isSecondaryOrSupplementary()){if(recordInFileIndex==nextDuplicateIndex){rec.setDuplicateReadFlag(true);//Nowtryandfigureoutthenextduplicateindexif(this.duplicateIndexes.hasNext()){nextDuplicateIndex=this.duplicateIndexes.next();}else{//Onlyhappensoncewe'vemarkedalltheduplicatesnextDuplicateIndex=-1;}}else{rec.setDuplicateReadFlag(false);}}recordInFileIndex++;if(!this.REMOVE_DUPLICATES||!rec.getDuplicateReadFlag()){out.addAlignment(rec);
Method
Code
![Page 20: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/20.jpg)
20©Cloudera,Inc.Allrightsreserved.
@Option(shortName="MAX_FILE_HANDLES",doc="Maximumnumberoffilehandlestokeepopenwhenspilling"+"readendstodisk.Setthisnumberalittlelowerthanthe"+"per-processmaximumnumberoffilethatmaybeopen.This"+"numbercanbefoundbyexecutingthe'ulimit-n'commandon"+"aUnixsystem.")publicintMAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000;
Dedup
Method
Code
PlaSorm
![Page 21: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/21.jpg)
21©Cloudera,Inc.Allrightsreserved.
Alignment Dedup Recalibrate QC/FilterVariantCalling
VariantAnnota4on
![Page 22: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/22.jpg)
22©Cloudera,Inc.Allrightsreserved.
It’spipelinesallthewaydown!
Alignment Dedup Recalibrate QC/FilterVariantCalling
VariantAnnota4on
Node1
Alignment Dedup Recalibrate QC/FilterVariantCalling
VariantAnnota4on
Node2
Alignment Dedup Recalibrate QC/FilterVariantCalling
VariantAnnota4on
Node3
![Page 23: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/23.jpg)
23©Cloudera,Inc.Allrightsreserved.
![Page 24: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/24.jpg)
24©Cloudera,Inc.Allrightsreserved.
Alignment Dedup Recalibrate QC/FilterVariantCalling
VariantAnnota4on
Alignment Dedup Recalibrate QC/Filter
Alignment Dedup Recalibrate QC/Filter
![Page 25: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/25.jpg)
25©Cloudera,Inc.Allrightsreserved.
Node1
Alignment Dedup Recalibrate QC/FilterVariantCalling
VariantAnnota4on
Node2
Node3
Alignment Dedup Recalibrate QC/Filter
Alignment Dedup Recalibrate QC/Filter
Node4
![Page 26: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/26.jpg)
26©Cloudera,Inc.Allrightsreserved.
Node1
Alignment Dedup QC/FilterVariantCalling
VariantAnnota4on
Node2
Node3
Alignment Dedup QC/Filter
Alignment Dedup QC/Filter
Node4
Recalibrate
![Page 27: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/27.jpg)
27©Cloudera,Inc.Allrightsreserved.
WhyAreWeSJllDefiningFileFormatsByHand?
• InsteadofdefiningcustomfileformatsforeachdatatypeandaccesspaLern…
• ParquetcreatesacompressedformatforeachAvro-defineddatamodel
• ImprovementsoverexisJngformats• ~20%forBAM• ~90%forVCF
![Page 28: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/28.jpg)
28©Cloudera,Inc.Allrightsreserved.
YARN-managedHadoopcluster
Sparkexecutors
Par4alsums
Driver
Applica4oncode
ContEstAlgorithm
![Page 29: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/29.jpg)
29©Cloudera,Inc.Allrightsreserved.
HadoopprovideslayeredabstracJonsfordataprocessing
HDFS(filesystem)
YARN(resourcemanagement)
MapReduce Impala(SQL) Solr(search) Spark
ADAMquince GATK …
bdg-form
ats(Avro/Parque
t)
Hbase(NoSQL)Kudu(columnstore)
![Page 30: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/30.jpg)
30©Cloudera,Inc.Allrightsreserved.
• HostedatBerkeleyandtheAMPLab
• Apache2License• ContributorsfrombothresearchandcommercialorganizaJons
• CorespaJalprimiJves,variantcalling
• AvroandParquetfordatamodelsandfileformats
Spark+Genomics=ADAM
![Page 31: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/31.jpg)
31©Cloudera,Inc.Allrightsreserved.
ExecuJngqueryinHadoop:interacJveSparkshell(ADAM)
def inDbSnp(g: Genotype): Boolean = true or false!def isDeleterious(g: Genotype): Boolean = g.getPolyPhen!
val samples = sc.textFile("path/to/samples").map(parseJson(_)).collect()!val dbsnp = sc.textFile("path/to/dbSNP").map(_.split(",")).collect()!val dnaseRDD = sc.adamBEDFeatureLoad("path/to/dnase”)!val genotypesRDD = sc.adamLoad("path/to/genotypes")!
val filteredRDD = genotypesRDD! .filter(!inDbSnp(_))! .filter(isDeleterious(_))! .filter(isFramingham(_))!val joinedRDD = RegionJoin.partitionAndJoin(sc, filteredRDD, dnaseRDD)!
val maf = joinedRDD! .keyBy(x => (x.getVariant, getPopulation(x)))! .groupByKey()! .map(computeMAF(_))!
maf.saveAsNewAPIHadoopFile("path/to/output")!
applypredicates
loaddata
joindata
group-byaggregate(MAF)
persistdata
![Page 32: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/32.jpg)
32©Cloudera,Inc.Allrightsreserved.
ExecuJngqueryinHadoop:distributedSQL
SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call) FROM genotypes g INNER JOIN samples s ON g.sample = s.sample INNER JOIN dnase d ON g.chr = d.chr AND g.pos >= d.start AND g.pos < d.end LEFT OUTER JOIN dbsnp p ON g.chr = p.chr AND g.pos = p.pos AND g.ref = p.ref AND g.alt = p.alt WHERE s.study = "framingham" p.pos IS NULL AND g.polyphen IN ( "possibly damaging", "probably damaging" ) GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop
applypredicates
“load”andjoindata
group-by
aggregate(UDAF)
![Page 33: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/33.jpg)
33©Cloudera,Inc.Allrightsreserved.
ADAMpreliminaryperformance
![Page 34: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/34.jpg)
34©Cloudera,Inc.Allrightsreserved.
• DevelopedbytheBroadInsJtute• CoreisMITlicense,someproprietarytoolsontop
• Version4hasbeenre-wriLentouseSpark,nowcompeJJvewithADAMforspeed
• UsesexisJngbiofileformatsforinputandoutput,butSparkRDDsforintermediatedata
GenomeAnalysisToolkit(GATK)
![Page 35: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/35.jpg)
35©Cloudera,Inc.Allrightsreserved.
• KudufillsgapbetweenHDFSandHBase
• Fastscansandupdateable• AddnewannotaJonstogenomicsdata(variants)withoutrewriJngwholedataset
• Key=genomeposiJon
• RangeparJJoning
KuduforVariantStores
![Page 36: Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)](https://reader035.fdocuments.us/reader035/viewer/2022081611/5f09c8087e708231d42877b9/html5/thumbnails/36.jpg)
36©Cloudera,Inc.Allrightsreserved.
Acknowledgements
UCBerkeleyMaLMassieFrankNothauMichaelHeuer
TamrTimothyDanford
MSSMJeffHammerbacherRyanWilliams
ClouderaUriLasersonSandyRyzaSeanOwen