mongo db and_academia
![Page 1: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/1.jpg)
MongoDB and academia
Jan Aerts, PhD
Wellcome Trust Sanger Institute
Hinxton, UK
[email protected]
@jandot
![Page 2: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/2.jpg)
Disclaimer 1
![Page 3: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/3.jpg)
Disclaimer 2
![Page 4: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/4.jpg)
Acknowledgments
MongoDB community
Caren Brockington
10gen
![Page 5: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/5.jpg)
![Page 6: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/6.jpg)
transcriptomics
genomics
proteomics
*omics
![Page 7: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/7.jpg)
transcriptomics
genomics
proteomics
*omics
instantiationomics
metabolomics
spliceomics
interactomics
metallomics
lipidomics
orfeomics
phenomics
histomics
![Page 8: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/8.jpg)
Academia != industry
![Page 9: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/9.jpg)
heterogeneous systems
![Page 10: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/10.jpg)
transitory
![Page 11: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/11.jpg)
little optimization
![Page 12: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/12.jpg)
slow adoption of new technology
(don't break anything that works)
![Page 13: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/13.jpg)
data management = afterthought
money
![Page 14: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/14.jpg)
Who are the players?
![Page 15: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/15.jpg)
large genome/data centers
genome hackers (lone bioinformaticians)
bench-based scientists
Drawings by Morag Ann Lewis
![Page 16: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/16.jpg)
genome hackers (lone bioinformaticians)
bench-based scientists
heavy investment in infrastructure/pipelines
data exchange => standards!
large genome/data centers
![Page 17: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/17.jpg)
genome hackers (lone bioinformaticians)
bench-based scientists
little investment in infrastructure
little time/effort for optimization
one-off
getting it done
creating legacy
need IT support for heavier work
large genome/data centers
often self-taught
![Page 18: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/18.jpg)
large genome/data centers
genome hackers (lone bioinformaticians)
bench-based scientists
use whatever everyone else is using
"normalization?"
![Page 19: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/19.jpg)
The data landscape
![Page 20: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/20.jpg)
1. Flat text files
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  PUBMED    7871890
FEATURES             Location/Qualifiers
     gene            687..3158
                     /gene="AXL2"
     gene            complement(3300..4037)
                     /gene="REV7"
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
      121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
      181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg
      241 ccacactgtc attattataa ttagaaacag aacgcaaaaa ttatccacta tataattcaa
      301 agacgcgaaa aaaaaagaac aacgcgtcat agaacttttg gcaattcgcg tcacaaataa
      361 attttggcaa cttatgtttc ctcttcgagc agtactcgag ccctgtctca agaatgtaat
      421 aatacccatc gtaggtatgg ttaaagatag catctccaca acctc...
//
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) ...
![Page 21: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/21.jpg)
![Page 22: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/22.jpg)
1. Flat text files
##format=PCFv1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
1 967433 . G A 151.43 0 AB=0.42;AC=1 GT:DP:GQ 1/0:11:99.00
1 970323 . G A 492.61 0 AB=0.41;AC=1;AF=0.50 GT:DP:GQ 1/0:28:99.00
1 970950 . A G 1287.90 0 AB=0.55;AC=1;AF=0.50 GT:DP:GQ 0/1:108:99.00
1 972804 . T C 210.56 0 AB=0.53;AC=1;AF=0.50 GT:DP:GQ 1/0:13:99.00
1 972857 . T C 846.18 0 AB=0.53;AC=1;AF=0.50;AN=2 GT:DP:GQ 1/0:58:99.00
1 974165 . T C 810.47 0 AB=0.38;AC=1;AF=0.50;AN=2 GT:DP:GQ 1/0:6:67.05
1 977063 . C T 1110.31 0 AB=0.50;AC=1;AF=0.50;AN=2 GT:DP:GQ 0/1:67:99.00
1 1006892 . C G 62.39 SF AC=2;AF=1.00;AN=2 GT:DP:GQ 1/1:2:6.02
1 1148494 . A G 5237.88 0 AC=2;AF=1.00;AN=2 GT:DP:GQ 1/1:160:99.00
1 1149380 . T C 165.10 0 AC=2;AF=1.00;AN=2 GT:DP:GQ 1/1:6:18.05
1 1212553 . C T 426.61 0 AB=0.26;AC=1;AF=0.50;AN=2 GT:DP:GQ 0/1:18:99.00
1 1235867 . A G 1158.08 0 AC=2;AF=1.00;AN=2 GT:DP:GQ 1/1:30:90.28
1 1237357 . T C 142.01 0 AC=2;AF=1.00;AN=2 GT:DP:GQ 1/1:5:15.04
1 1239050 . G A 13952.03 0 AC=2;AF=1.00;AN=2 GT:DP:GQ 1/1:340:99.00
20 14370 . G A 29 0 NS=58;DP=258;AF=0.786 GT:GQ:DP:HQ 0|0:48:1:51,51
20 13330 . T A 3 q10 NS=55;DP=202;AF=0.024 GT:GQ:DP:HQ 0|0:49:3:58,50
20 1110696 . A G,T 67 0 AF=0.421,0.579;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27
20 10237 . T . 47 0 NS=57;DP=257;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60
...
![Page 23: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/23.jpg)
![Page 24: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/24.jpg)
perl
java
python
ruby
“tab-delimited” is king
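Why tab-delimited text travels so well between Perl, Java, Python and Ruby can be shown with a few lines of standard-library Python. This is only a minimal sketch: the function name `parse_variant_line` and the output field names are assumptions made for illustration, and the column layout is simply read off the header row of the variant file shown on the slide.

```python
def parse_variant_line(line):
    """Turn one tab-delimited variant line into a nested dict (illustrative only)."""
    # Column names follow the '#CHROM POS ID ...' header on the slide
    columns = ["chrom", "pos", "id", "ref", "alt", "qual", "filter", "info", "format"]
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(columns, fields[:9]))
    record["pos"] = int(record["pos"])
    record["qual"] = float(record["qual"])

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags (e.g. DB)
    info = {}
    for item in record["info"].split(";"):
        key, _, value = item.partition("=")
        info[key] = value if value else True
    record["info"] = info

    # Each sample column is ':'-separated, in the order given by FORMAT
    keys = record.pop("format").split(":")
    record["samples"] = [dict(zip(keys, sample.split(":"))) for sample in fields[9:]]
    return record


line = "1\t967433\t.\tG\tA\t151.43\t0\tAB=0.42;AC=1\tGT:DP:GQ\t1/0:11:99.00"
variant = parse_variant_line(line)
print(variant["pos"], variant["info"], variant["samples"])
```

The nested result already has the shape of a document, which is part of why such files are easy to move into a document store.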
![Page 25: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/25.jpg)
2. Binary compressed flat files
One experiment
=> One datafile as text: 40-70Gb
=> Compressed to 11-20Gb
Toolkits to access data (and generate tab-delimited)
C
java
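The pattern of a toolkit that reads a compressed file and hands back tab-delimited rows can be sketched in Python. The slide does not name the binary format, so plain gzip from the standard library is used here as a stand-in; the function name `tab_delimited_rows` and the sample rows are hypothetical.

```python
import gzip
import io


def tab_delimited_rows(binary_stream):
    """Yield each line of a gzip-compressed text stream as a list of fields."""
    # Decompression happens while streaming, so the full text never sits in memory
    with gzip.open(binary_stream, mode="rt", encoding="ascii") as handle:
        for line in handle:
            yield line.rstrip("\n").split("\t")


# Tiny in-memory stand-in for a large compressed data file
raw = b"read1\tchr1\t100\nread2\tchr1\t250\n"
compressed = io.BytesIO(gzip.compress(raw))
rows = list(tab_delimited_rows(compressed))
print(rows)
```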
![Page 26: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/26.jpg)
3. MySQL and Oracle
Curated data
Meta-data
Raw data: BLOBs
Sequencing: >6 TB/week and growing…
Departmental project: 40 individuals x 42 million datapoints/individual
=> joins?
Denormalized copy
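The arithmetic behind the "joins?" question, and what a denormalized copy amounts to, can be sketched as follows. The table contents and field names (`label`, `cohort`, `position`, `value`) are invented for illustration; only the counts of 40 individuals and 42 million datapoints come from the slide.

```python
# Figures from the slide: 40 individuals x 42 million datapoints each
n_individuals = 40
datapoints_per_individual = 42_000_000
total_rows = n_individuals * datapoints_per_individual
print(f"{total_rows:,} rows")  # 1,680,000,000 rows

# Hypothetical normalized tables (names and fields are illustrative only)
individuals = {1: {"label": "individual_01", "cohort": "pilot"}}
datapoints = [
    {"individual_id": 1, "position": 967433, "value": 0.42},
    {"individual_id": 1, "position": 970323, "value": 0.41},
]


def denormalize(datapoints, individuals):
    """Copy the individual's attributes into every datapoint row."""
    for point in datapoints:
        row = {key: value for key, value in point.items() if key != "individual_id"}
        row["individual"] = dict(individuals[point["individual_id"]])
        yield row


flat_rows = list(denormalize(datapoints, individuals))
print(flat_rows[0])
```

The trade-off is the usual one: the join is paid once when the copy is built, at the cost of storing the individual's attributes on every row.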
![Page 27: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/27.jpg)
4. AceDB - A Caenorhabditis elegans database
object-oriented
Author "Patel B"
Full_name "Bala Patel"
Laboratory CB
Paper [cgc1011]
Paper [cgc533]
Mail "Laboratory of Molecular Biology"
Mail "Hills Road, Cambridge"
Fax "050 3456789"

Paper [cgc533]
Title "Yet more of those Genes"
Journal "Cell Reports"
Volume 3
Year 1993
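An AceDB-style object reads almost directly as a nested document, with repeated tags becoming lists. The sketch below assumes a simplified "tag value" line layout as it appears on the slide, not the full AceDB file grammar; `parse_ace_object` and the `class`/`name` keys are names chosen for this example.

```python
import shlex


def parse_ace_object(text):
    """Parse a simplified tag/value object into a dict; repeated tags become lists."""
    lines = [line for line in text.splitlines() if line.strip()]
    # The first line carries the object class and its name, e.g. Author "Patel B"
    class_name, name = shlex.split(lines[0])
    document = {"class": class_name, "name": name.strip("[]")}
    for line in lines[1:]:
        tag, *values = shlex.split(line)  # shlex keeps quoted strings together
        value = " ".join(values).strip("[]")
        if tag in document:
            if not isinstance(document[tag], list):
                document[tag] = [document[tag]]
            document[tag].append(value)
        else:
            document[tag] = value
    return document


ace_text = '''
Author "Patel B"
Full_name "Bala Patel"
Laboratory CB
Paper [cgc1011]
Paper [cgc533]
Mail "Laboratory of Molecular Biology"
Mail "Hills Road, Cambridge"
Fax "050 3456789"
'''
author = parse_ace_object(ace_text)
print(author)
```

All values stay strings in this sketch; converting fields such as a year or volume to numbers would be a separate, format-specific step.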
![Page 28: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/28.jpg)
![Page 29: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/29.jpg)
Challenges in *omics
Where can MongoDB play a role?
![Page 30: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/30.jpg)
explosion of data
every researcher must be able to handle data
![Page 31: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/31.jpg)
low stepping stone for bench-based scientists
big data
![Page 32: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/32.jpg)
![Page 33: Mongo db and_academia](https://reader035.vdocuments.net/reader035/viewer/2022062320/55933d031a28abe6748b45b2/html5/thumbnails/33.jpg)
Takeoff within research community?
widespread?
Cannot manage all data in-house <= data exchange!
=> focus more on file formats than on technology
smaller scale
Implement MongoDB for
* local storage and querying (load file from standard file format into custom DB)
* encourage non-informaticians to use MongoDB
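The "load file from standard file format into custom DB" step might look like the sketch below. It is written against any object that offers an `insert_many` method, which is the method a pymongo collection provides, so that it runs without a database server. `InMemoryCollection`, the function names, the column names and the `omics`/`variants` names in the comment are all assumptions for illustration.

```python
import csv
import io


def documents_from_tsv(handle):
    """Yield one dict per row of a tab-delimited file that has a header line."""
    yield from csv.DictReader(handle, delimiter="\t")


def load_into(collection, documents, batch_size=1000):
    """Insert documents in batches; return how many were inserted."""
    batch, total = [], 0
    for document in documents:
        batch.append(document)
        if len(batch) == batch_size:
            collection.insert_many(batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final, partially filled batch
        collection.insert_many(batch)
        total += len(batch)
    return total


class InMemoryCollection:
    """Stand-in used here so the example runs without a MongoDB server."""

    def __init__(self):
        self.documents = []

    def insert_many(self, documents):
        self.documents.extend(documents)


# With a running server one would instead use something like (names hypothetical):
#   from pymongo import MongoClient
#   collection = MongoClient()["omics"]["variants"]
tsv = io.StringIO("chrom\tpos\tref\talt\n1\t967433\tG\tA\n1\t970323\tG\tA\n")
collection = InMemoryCollection()
inserted = load_into(collection, documents_from_tsv(tsv), batch_size=1)
print(inserted, collection.documents[0])
```

Because `csv.DictReader` returns every value as a string, numeric columns such as `pos` would need converting before range queries on them make sense.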