MongoDB and academia

MongoDB and academia Jan Aerts, PhD Wellcome Trust Sanger Institute Hinxton, UK [email protected] @jandot

Transcript of Mongo db and_academia

Page 1: Mongo db and_academia

MongoDB and academia

Jan Aerts, PhD

Wellcome Trust Sanger Institute
Hinxton, UK

[email protected] @jandot

Page 2: Mongo db and_academia

Disclaimer 1

Page 3: Mongo db and_academia

Disclaimer 2

Page 4: Mongo db and_academia

Acknowledgments

MongoDB community

Caren Brockington

10gen

Page 5: Mongo db and_academia
Page 6: Mongo db and_academia

transcriptomics

genomics

proteomics

*omics

Page 7: Mongo db and_academia

transcriptomics

genomics

proteomics

*omics

instantiationomics

metabolomics

spliceomics

interactomics

metallomics

lipidomics

orfeomics

phenomics

histomics

Page 8: Mongo db and_academia

Academia != industry

Page 9: Mongo db and_academia

heterogeneous systems

Page 10: Mongo db and_academia

transitory

Page 11: Mongo db and_academia

little optimization

Page 12: Mongo db and_academia

slow adoption of new technology

(don't break anything that works)

Page 13: Mongo db and_academia

data management = afterthought

money

Page 14: Mongo db and_academia

Who are the players?

Page 15: Mongo db and_academia

large genome/data centers

genome hackers (lone bioinformaticians)

bench-based scientists

Drawings by Morag Ann Lewis

Page 16: Mongo db and_academia

genome hackers (lone bioinformaticians)

bench-based scientists

heavy investment in infrastructure/pipelines

data exchange => standards!

large genome/data centers

Page 17: Mongo db and_academia

genome hackers (lone bioinformaticians)

bench-based scientists

little investment in infrastructure

little time/effort for optimization

one-off

getting it done

creating legacy

need IT support for heavier work

large genome/data centers

often self-taught

Page 18: Mongo db and_academia

large genome/data centers

genome hackers (lone bioinformaticians)

bench-based scientists

use whatever everyone else is using

"normalization?"

Page 19: Mongo db and_academia

The data landscape

Page 20: Mongo db and_academia

1. Flat text files

LOCUS       SCU49845 5028 bp DNA PLN 21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p (AXL2)
            and Rev7p (REV7) genes, complete cds.
VERSION     U49845.1 GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina;
            Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1 (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for DNA
            damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  PUBMED    7871890
FEATURES             Location/Qualifiers
     gene            687..3158
                     /gene="AXL2"
     gene            complement(3300..4037)
                     /gene="REV7"
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
      121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
      181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg
      241 ccacactgtc attattataa ttagaaacag aacgcaaaaa ttatccacta tataattcaa
      301 agacgcgaaa aaaaaagaac aacgcgtcat agaacttttg gcaattcgcg tcacaaataa
      361 attttggcaa cttatgtttc ctcttcgagc agtactcgag ccctgtctca agaatgtaat
      421 aatacccatc gtaggtatgg ttaaagatag catctccaca acctc...
//
LOCUS       SCU49845 5028 bp DNA PLN 21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p (AXL2)
            and Rev7p (REV7) ...
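As an illustration of what reading such a flat text record involves, here is a minimal Python sketch that collects the top-level keyword lines of a GenBank-style record into a dictionary. The function name `parse_flat_header` and the simplifications (indented lines, including sub-keywords such as ORGANISM, are folded into the preceding keyword; parsing stops at FEATURES) are assumptions made for this example, not part of the talk or of any toolkit.

```python
def parse_flat_header(text):
    """Collect top-level keyword lines (LOCUS, DEFINITION, ...) into a dict.

    Simplification for this sketch: any indented line is treated as a
    continuation of the previous top-level keyword.
    """
    record, key = {}, None
    for line in text.splitlines():
        # Stop before the feature table and the sequence itself.
        if line.startswith(("FEATURES", "ORIGIN", "//")):
            break
        if not line.strip():
            continue
        if line[0].isspace():
            if key is not None:
                record[key] += " " + line.strip()
        else:
            key, _, rest = line.partition(" ")
            record[key] = rest.strip()
    return record


sample = "\n".join([
    "LOCUS       SCU49845 5028 bp DNA PLN 21-JUN-1999",
    "DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p (AXL2)",
    "            and Rev7p (REV7) genes, complete cds.",
    "VERSION     U49845.1 GI:1293613",
    "FEATURES             Location/Qualifiers",
])

doc = parse_flat_header(sample)
print(doc["VERSION"])
```

The resulting dictionary is already shaped like a document, which is the kind of structure a document store accepts without a predefined schema.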

Page 21: Mongo db and_academia

1. Flat text files

Page 22: Mongo db and_academia

1. Flat text files

##format=PCFv1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
#CHROM  POS      ID  REF  ALT  QUAL      FILTER  INFO                       FORMAT       NA00001
1       967433   .   G    A    151.43    0       AB=0.42;AC=1               GT:DP:GQ     1/0:11:99.00
1       970323   .   G    A    492.61    0       AB=0.41;AC=1;AF=0.50       GT:DP:GQ     1/0:28:99.00
1       970950   .   A    G    1287.90   0       AB=0.55;AC=1;AF=0.50       GT:DP:GQ     0/1:108:99.00
1       972804   .   T    C    210.56    0       AB=0.53;AC=1;AF=0.50       GT:DP:GQ     1/0:13:99.00
1       972857   .   T    C    846.18    0       AB=0.53;AC=1;AF=0.50;AN=2  GT:DP:GQ     1/0:58:99.00
1       974165   .   T    C    810.47    0       AB=0.38;AC=1;AF=0.50;AN=2  GT:DP:GQ     1/0:6:67.05
1       977063   .   C    T    1110.31   0       AB=0.50;AC=1;AF=0.50;AN=2  GT:DP:GQ     0/1:67:99.00
1       1006892  .   C    G    62.39     SF      AC=2;AF=1.00;AN=2          GT:DP:GQ     1/1:2:6.02
1       1148494  .   A    G    5237.88   0       AC=2;AF=1.00;AN=2          GT:DP:GQ     1/1:160:99.00
1       1149380  .   T    C    165.10    0       AC=2;AF=1.00;AN=2          GT:DP:GQ     1/1:6:18.05
1       1212553  .   C    T    426.61    0       AB=0.26;AC=1;AF=0.50;AN=2  GT:DP:GQ     0/1:18:99.00
1       1235867  .   A    G    1158.08   0       AC=2;AF=1.00;AN=2          GT:DP:GQ     1/1:30:90.28
1       1237357  .   T    C    142.01    0       AC=2;AF=1.00;AN=2          GT:DP:GQ     1/1:5:15.04
1       1239050  .   G    A    13952.03  0       AC=2;AF=1.00;AN=2          GT:DP:GQ     1/1:340:99.00
20      14370    .   G    A    29        0       NS=58;DP=258;AF=0.786      GT:GQ:DP:HQ  0|0:48:1:51,51
20      13330    .   T    A    3         q10     NS=55;DP=202;AF=0.024      GT:GQ:DP:HQ  0|0:49:3:58,50
20      1110696  .   A    G,T  67        0       AF=0.421,0.579;AA=T;DB     GT:GQ:DP:HQ  1|2:21:6:23,27
20      10237    .   T    .    47        0       NS=57;DP=257;AA=T          GT:GQ:DP:HQ  0|0:54:7:56,60
...

Page 23: Mongo db and_academia

1. Flat text files

Page 24: Mongo db and_academia

1. Flat text files

perl

java

python

ruby

“tab-delimited” is king
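To illustrate the kind of script the "tab-delimited is king" habit produces, here is a minimal Python sketch that turns one data row of the variant file shown on the preceding slides into a nested dictionary. The function name `parse_variant`, the lower-case output keys, and the choice to split on any whitespace (so that it works on both tab- and space-separated rows) are assumptions of this example; it is not a full parser for the format.

```python
COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_variant(line, sample_names):
    """Split one variant row into fixed columns, INFO pairs and per-sample fields."""
    fields = line.split()  # assumes no whitespace inside a field
    fixed = dict(zip(COLUMNS, fields[:9]))

    # INFO is a ';'-separated list of key=value pairs; bare keys act as flags.
    info = {}
    for item in fixed["INFO"].split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True

    # FORMAT names the ':'-separated subfields of every sample column.
    format_keys = fixed["FORMAT"].split(":")
    samples = {
        name: dict(zip(format_keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }

    return {
        "chrom": fixed["CHROM"],
        "pos": int(fixed["POS"]),
        "ref": fixed["REF"],
        "alt": fixed["ALT"].split(","),
        "qual": float(fixed["QUAL"]),
        "filter": fixed["FILTER"],
        "info": info,
        "samples": samples,
    }


row = "20 1110696 . A G,T 67 0 AF=0.421,0.579;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27"
doc = parse_variant(row, ["NA00001"])
print(doc["alt"], doc["samples"]["NA00001"]["GT"])
```

Each row becomes one self-contained document, with the variable INFO and per-sample fields kept as nested keys rather than forced into fixed columns.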

Page 25: Mongo db and_academia

2. Binary compressed flat files

One experiment

=> One datafile as text: 40-70 GB
=> Compressed to 11-20 GB

Toolkits to access data (and generate tab-delimited)

C

java

Page 26: Mongo db and_academia

3. MySQL and Oracle

Curated data
Meta-data
Raw data: BLOBs

Sequencing: >6 TB/week and growing…

Departmental project: 40 individuals x 42 million datapoints/individual
=> joins?

Denormalized copy
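To make the scale behind the "joins?" question concrete, here is a small Python sketch of the arithmetic for the departmental project, followed by a toy version of what a denormalized copy means. The table layouts and field names (`individuals_by_id`, `individual_id`, `genotype`) are hypothetical and chosen only for this illustration.

```python
individuals = 40
datapoints_per_individual = 42_000_000

# Every datapoint row would have to be joined against its individual.
total_rows = individuals * datapoints_per_individual
print(f"{total_rows:,} rows")

# Hypothetical normalized layout: datapoints refer to individuals by id.
individuals_by_id = {1: {"name": "individual_01"}}
datapoints = [
    {"individual_id": 1, "chrom": "1", "pos": 967433, "genotype": "1/0"},
    {"individual_id": 1, "chrom": "1", "pos": 970323, "genotype": "1/0"},
]


def denormalize(rows, lookup):
    """Copy the individual's name into every datapoint so no join is needed later."""
    flat = []
    for row in rows:
        record = {k: v for k, v in row.items() if k != "individual_id"}
        record["individual"] = lookup[row["individual_id"]]["name"]
        flat.append(record)
    return flat


flat = denormalize(datapoints, individuals_by_id)
print(flat[0])
```

The denormalized records trade storage for query simplicity: each one can be read on its own, at the cost of repeating the individual's details in every row.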

Page 27: Mongo db and_academia

4. AceDB - A Caenorhabditis elegans database

object-oriented

Author "Patel B"
Full_name "Bala Patel"
Laboratory CB
Paper [cgc1011]
Paper [cgc533]
Mail "Laboratory of Molecular Biology"
Mail "Hills Road, Cambridge"
Fax "050 3456789"

Paper [cgc533]
Title "Yet more of those Genes"
Journal "Cell Reports"
Volume 3
Year 1993
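The object-style record above maps naturally onto a nested document. As a rough Python sketch, assuming one tag per line with the first line giving the object's class and name, the Author object could be read as follows. The function name `parse_ace_object` and the decision to store every tag as a list (because tags such as Paper and Mail repeat) are assumptions of this example, not AceDB's own API.

```python
import shlex


def parse_ace_object(block):
    """Turn one object (class/name line followed by tag lines) into a dict."""
    lines = [line for line in block.splitlines() if line.strip()]

    # shlex keeps quoted values such as "Patel B" together as one token.
    object_class, name = shlex.split(lines[0])
    doc = {"class": object_class, "name": name}

    for line in lines[1:]:
        tag, *values = shlex.split(line)
        # Tags may repeat (Paper, Mail), so every tag collects a list of values.
        doc.setdefault(tag, []).append(" ".join(values))
    return doc


author_block = "\n".join([
    'Author "Patel B"',
    'Full_name "Bala Patel"',
    "Laboratory CB",
    "Paper [cgc1011]",
    "Paper [cgc533]",
    'Mail "Laboratory of Molecular Biology"',
    'Mail "Hills Road, Cambridge"',
    'Fax "050 3456789"',
])

author = parse_ace_object(author_block)
print(author["Paper"])
```

Repeated tags end up as arrays inside a single document, so the author and the references to their papers stay together without a separate link table.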

Page 28: Mongo db and_academia
Page 29: Mongo db and_academia

Challenges in *omics

Where can MongoDB play a role?

Page 30: Mongo db and_academia

explosion of data

every researcher must be able to handle data

Page 31: Mongo db and_academia

low-threshold stepping stone to big data for bench-based scientists

Page 32: Mongo db and_academia
Page 33: Mongo db and_academia

Takeoff within research community?

widespread?

Cannot manage all data in-house <= data exchange!
=> focus more on file formats than on technology

smaller scale

Implement MongoDB for

* local storage and querying (load file from standard file format into custom DB)

* encourage non-informaticians to use MongoDB
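The "load file from standard file format into custom DB" step can be sketched in a few lines of Python. This example assumes a tab-delimited file with a header row and converts it to one JSON document per line, a form that MongoDB's `mongoimport` tool can read; the function name `tsv_to_json_lines` and the column names are made up for the illustration, and all values are left as strings.

```python
import csv
import io
import json


def tsv_to_json_lines(tsv_text):
    """Convert tab-delimited text with a header row into newline-delimited JSON."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # One JSON object per input row; the header supplies the field names.
    return "\n".join(json.dumps(dict(row)) for row in reader)


tsv = "chrom\tpos\tref\talt\n1\t967433\tG\tA\n1\t970323\tG\tA\n"
json_lines = tsv_to_json_lines(tsv)
print(json_lines)
```

Because the field names come straight from the header line, a bench-based scientist would not need to design a schema first: the existing tab-delimited export is enough to start storing and querying locally.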

Page 34: Mongo db and_academia

Thank you!

Questions?

[email protected] @jandot

http://saaientist.blogspot.com