MongoDB and academia

Post on 01-Jul-2015

MongoDB and academia
Jan Aerts, PhD

Wellcome Trust Sanger Institute
Hinxton, UK

jan.aerts@gmail.com
@jandot

Disclaimer 1

Disclaimer 2

Acknowledgments

MongoDB community

Caren Brockington

10gen

transcriptomics

genomics

proteomics

*omics

instantiationomics

metabolomics

spliceomics

interactomics

metallomics

lipidomics

orfeomics

phenomics

histomics

Academia != industry

heterogeneous systems

transitory

little optimization

slow adoption of new technology

(don't break anything that works)

data management = afterthought

money

Who are the players?

large genome/data centers

genome hackers (lone bioinformaticians)

bench-based scientists

Drawings by Morag Ann Lewis

genome hackers (lone bioinformaticians)

bench-based scientists

heavy investment in infrastructure/pipelines

data exchange => standards!

large genome/data centers

genome hackers (lone bioinformaticians)

bench-based scientists

little investment in infrastructure

little time/effort for optimization

one-off

getting it done

creating legacy

need IT support for heavier work

large genome/data centers

often self-taught

large genome/data centers

genome hackers (lone bioinformaticians)

bench-based scientists

use whatever everyone else is using

"normalization?"

The data landscape

1. Flat text files

LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p (AXL2)
            and Rev7p (REV7) genes, complete cds.
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina;
            Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for DNA
            damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  PUBMED    7871890
FEATURES             Location/Qualifiers
     gene            687..3158
                     /gene="AXL2"
     gene            complement(3300..4037)
                     /gene="REV7"
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
      121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
      181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg
      241 ccacactgtc attattataa ttagaaacag aacgcaaaaa ttatccacta tataattcaa
      301 agacgcgaaa aaaaaagaac aacgcgtcat agaacttttg gcaattcgcg tcacaaataa
      361 attttggcaa cttatgtttc ctcttcgagc agtactcgag ccctgtctca agaatgtaat
      421 aatacccatc gtaggtatgg ttaaagatag catctccaca acctc...
//
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p (AXL2)
            and Rev7p (REV7) ...

1. Flat text files

##format=PCFv1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
#CHROM  POS      ID  REF  ALT  QUAL      FILTER  INFO                       FORMAT       NA00001
1       967433   .   G    A    151.43    0       AB=0.42;AC=1               GT:DP:GQ     1/0:11:99.00
1       970323   .   G    A    492.61    0       AB=0.41;AC=1;AF=0.50       GT:DP:GQ     1/0:28:99.00
1       970950   .   A    G    1287.90   0       AB=0.55;AC=1;AF=0.50       GT:DP:GQ     0/1:108:99.00
1       972804   .   T    C    210.56    0       AB=0.53;AC=1;AF=0.50       GT:DP:GQ     1/0:13:99.00
1       972857   .   T    C    846.18    0       AB=0.53;AC=1;AF=0.50;AN=2  GT:DP:GQ     1/0:58:99.00
1       974165   .   T    C    810.47    0       AB=0.38;AC=1;AF=0.50;AN=2  GT:DP:GQ     1/0:6:67.05
1       977063   .   C    T    1110.31   0       AB=0.50;AC=1;AF=0.50;AN=2  GT:DP:GQ     0/1:67:99.00
1       1006892  .   C    G    62.39     SF      AC=2;AF=1.00;AN=2          GT:DP:GQ     1/1:2:6.02
1       1148494  .   A    G    5237.88   0       AC=2;AF=1.00;AN=2          GT:DP:GQ     1/1:160:99.00
1       1149380  .   T    C    165.10    0       AC=2;AF=1.00;AN=2          GT:DP:GQ     1/1:6:18.05
1       1212553  .   C    T    426.61    0       AB=0.26;AC=1;AF=0.50;AN=2  GT:DP:GQ     0/1:18:99.00
1       1235867  .   A    G    1158.08   0       AC=2;AF=1.00;AN=2          GT:DP:GQ     1/1:30:90.28
1       1237357  .   T    C    142.01    0       AC=2;AF=1.00;AN=2          GT:DP:GQ     1/1:5:15.04
1       1239050  .   G    A    13952.03  0       AC=2;AF=1.00;AN=2          GT:DP:GQ     1/1:340:99.00
20      14370    .   G    A    29        0       NS=58;DP=258;AF=0.786      GT:GQ:DP:HQ  0|0:48:1:51,51
20      13330    .   T    A    3         q10     NS=55;DP=202;AF=0.024      GT:GQ:DP:HQ  0|0:49:3:58,50
20      1110696  .   A    G,T  67        0       AF=0.421,0.579;AA=T;DB     GT:GQ:DP:HQ  1|2:21:6:23,27
20      10237    .   T    .    47        0       NS=57;DP=257;AA=T          GT:GQ:DP:HQ  0|0:54:7:56,60
...

perl

java

python

ruby

“tab-delimited” is king
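
Part of the appeal of tab-delimited text is that every language listed above reads it with almost no code. A minimal sketch in Python, using only the standard library: the column names and the two rows are taken from the flat-file example, while the variable names are illustrative assumptions.

```python
import csv
import io

# Two rows from the flat-file example, written out as tab-delimited text.
text = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\n"
    "1\t967433\t.\tG\tA\t151.43\t0\tAB=0.42;AC=1\tGT:DP:GQ\t1/0:11:99.00\n"
    "1\t970323\t.\tG\tA\t492.61\t0\tAB=0.41;AC=1;AF=0.50\tGT:DP:GQ\t1/0:28:99.00\n"
)

# DictReader uses the first line as column names and splits on tabs.
rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))

for row in rows:
    print(row["#CHROM"], row["POS"], row["REF"], row["ALT"], row["QUAL"])
```

Every value comes back as a string, so type conversion and the nested INFO field are left to whoever reads the file.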

2. Binary compressed flat files

One experiment
=> One datafile as text: 40-70Gb
=> Compressed to 11-20Gb

Toolkits to access data (and generate tab-delimited)

C

java

3. MySQL and Oracle

Curated data
Meta-data
Raw data: BLOBs

Sequencing: >6 TB/week and growing…

Departmental project: 40 individuals x 42 million datapoints/individual
=> joins?

Denormalized copy
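
The size of the departmental project follows directly from the two numbers on the slide. The sketch below does the arithmetic and then shows one possible shape for a denormalized record, where the individual's details are copied into each datapoint so that no join is needed. The field names and values in the record are hypothetical, chosen only for illustration.

```python
individuals = 40
datapoints_per_individual = 42_000_000

# A fully normalized table holds one row per (individual, datapoint) pair.
total_rows = individuals * datapoints_per_individual
print(f"{total_rows:,} rows")  # 1,680,000,000 rows

# Hypothetical records: in a normalized schema these live in two tables
# and are combined with a join at query time.
individual = {"id": "ind01", "cohort": "pilot"}
datapoint = {"chrom": "1", "pos": 967433, "genotype": "1/0"}

# Denormalized copy: the individual's details travel with every datapoint.
denormalized = {**datapoint, "individual": individual}
print(denormalized)
```

The trade-off is the usual one: the copy is larger and must be regenerated when the source changes, but reads no longer pay for the join.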

4. AceDB - A Caenorhabditis elegans database

object-oriented

Author "Patel B"
Full_name "Bala Patel"
Laboratory CB
Paper [cgc1011]
Paper [cgc533]
Mail "Laboratory of Molecular Biology"
Mail "Hills Road, Cambridge"
Fax "050 3456789"

Paper [cgc533]
Title "Yet more of those Genes"
Journal "Cell Reports"
Volume 3
Year 1993
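
An object record like the Author entry is already close to a document: tags become keys, and repeated tags become lists. The following is a simplified sketch of that mapping in Python, not a parser for the real AceDB file format. The function name `to_document` and the choice to strip the square brackets from references are assumptions made for the example.

```python
import json
import shlex

# The Author object from the slide, one tag/value pair per line.
record = '''Author "Patel B"
Full_name "Bala Patel"
Laboratory CB
Paper [cgc1011]
Paper [cgc533]
Mail "Laboratory of Molecular Biology"
Mail "Hills Road, Cambridge"
Fax "050 3456789"'''


def to_document(text):
    """Turn tag/value lines into a dict; repeated tags become lists."""
    doc = {}
    for line in text.splitlines():
        # shlex keeps a quoted value such as "Patel B" together as one token.
        tag, value = shlex.split(line)
        value = value.strip("[]")  # [cgc533] -> cgc533
        if tag in doc:
            if not isinstance(doc[tag], list):
                doc[tag] = [doc[tag]]
            doc[tag].append(value)
        else:
            doc[tag] = value
    return doc


author = to_document(record)
print(json.dumps(author, indent=2))
```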

Challenges in *omics

Where can MongoDB play a role?

explosion of data

every researcher must be able to handle data

a low-threshold stepping stone into big data for bench-based scientists

Takeoff within research community?

widespread?

Cannot manage all data in-house <= data exchange!
=> focus more on file formats than on technology

smaller scale

Implement MongoDB for

* local storage and querying (load file from standard file format into custom DB)

* encouraging non-informaticians to use MongoDB
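
Loading a standard file format into a local document store can be sketched as a small conversion step. The Python below turns two rows of the tab-delimited variant example into nested documents and prints them as one JSON document per line. The lower-case field names, the `parse_info` and `to_document` helpers, and the quality threshold are assumptions for illustration. No database is contacted, so the sketch runs on its own.

```python
import json

# Illustrative field names for the ten tab-delimited columns.
HEADER = ["chrom", "pos", "id", "ref", "alt", "qual",
          "filter", "info", "format", "sample"]


def parse_info(field):
    """Split 'AF=0.5;AA=T;DB' into a dict; flags without '=' become True."""
    info = {}
    for item in field.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def to_document(line):
    """Convert one tab-delimited row into a nested, typed document."""
    doc = dict(zip(HEADER, line.rstrip("\n").split("\t")))
    doc["pos"] = int(doc["pos"])
    doc["qual"] = float(doc["qual"])
    doc["info"] = parse_info(doc["info"])
    return doc


lines = [
    "1\t967433\t.\tG\tA\t151.43\t0\tAB=0.42;AC=1\tGT:DP:GQ\t1/0:11:99.00",
    "20\t1110696\t.\tA\tG,T\t67\t0\tAF=0.421,0.579;AA=T;DB\tGT:GQ:DP:HQ\t1|2:21:6:23,27",
]
documents = [to_document(line) for line in lines]

# One JSON document per line, ready to be handed to a document store.
for doc in documents:
    print(json.dumps(doc))

# The kind of question a local query would answer, here in plain Python.
high_quality = [d for d in documents if d["qual"] > 100]
print(len(high_quality), "variant(s) with qual > 100")
```

With a running server and the pymongo driver, a list like `documents` could be passed to a collection's `insert_many`. That step is omitted here because it depends on a local installation.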

Thank you!
Questions?

jan.aerts@gmail.com
@jandot

http://saaientist.blogspot.com