Distributed Searching in
Biological Databases
by
Dominic Battré, B.Comp.Sc.
Thesis
Presented to the Faculty of the
School of Computer Science,
Telecommunications and Information Systems,
DePaul University
in Partial Fulfillment
of the Requirements
for the Degree of
Master of Science
DePaul University
October 2005
Distributed Searching in
Biological Databases
APPROVED BY
SUPERVISING COMMITTEE:
David Angulo
Massimo DiPierro
Ljubomir Perkovic
Acknowledgement
First, I want to thank my supervisor, David Angulo of DePaul University, for providing me with the support and the space to pursue my ideas and interests throughout my whole course of studies.
Furthermore, I want to express my gratitude to the Paderborn Center for Parallel Computing, especially Prof. Dr. Odej Kao and Axel Keller, for providing me access to a 96-node cluster and thereby facilitating this research. The help that Dr. Rajeev Thakur of the Argonne National Lab provided on the new multi–threading support of MPICH2 was very valuable. I am thankful for the ideas and comments of the researchers of the Illinois Biogrid, Dr. Gregor von Laszewski, and Dr. Nicholas Karonis of the Argonne National Lab.
I want to thank the Fulbright Commission and the German National Academic Foundation, which supported and facilitated my studies at DePaul University with scholarships.
Finally, and foremost, I want to thank my family for their love and constant support.
Chicago, IL, USA, October 2005
Dominic Battré
Distributed Searching in
Biological Databases
by
Dominic Battré, M.S.
DePaul University
SUPERVISOR: David Angulo
This thesis addresses the problem of searching huge biological databases, on the scale of several gigabytes, by means of parallel processing. Biological databases storing DNA sequences, protein sequences, or mass spectra are growing exponentially, and searches through them consume exponentially growing computational resources as well. The thesis presents and analyzes a general-purpose, MPI-based C++ framework for generically splitting databases among several computational nodes. The combined RAM of the nodes working in tandem is often sufficient to keep the entire database in memory and therefore to search it efficiently without paging to disk. The framework runs as a persistent service that processes all submitted queries, which allows for query reordering and better utilization of memory. As a result, we achieve superlinear speedups compared to single-processor implementations. We demonstrate the utility and speedup of the framework using a real biological database and an actual search algorithm for mass spectrometry.
Contents
1 Problem description
1.1 Applications
1.1.1 DNA databases
1.1.2 Protein databases
1.1.3 Mass spectrometry databases
1.1.4 Conclusion
1.2 The Master/Worker Paradigm
1.2.1 Analysis of speedups
1.2.2 Parallelizing sequence database searches
1.3 Goals
1.4 Organization of the thesis
2 Related research
2.1 Frameworks
2.1.1 Condor MW
2.1.2 AMWAT
2.1.3 Java Distributed System
2.1.4 The Organic Grid
2.1.5 Other Frameworks
2.1.6 Conclusion
2.2 Specific implementations
2.2.1 mpiBLAST
2.2.2 pioBLAST
2.2.3 Multi–Ring Buffer Management for BLAST
2.2.4 BRIDGES GT3 BLAST
2.2.5 NCBI BLAST website
2.3 Conclusion
3 Approaches to the problem
3.1 High level Design
3.1.1 Servicing style
3.1.2 Application topology
3.1.3 Result aggregation
3.1.4 Internal parallelism
3.1.5 External interface
3.1.6 Conclusion
3.2 Query Processing
3.2.1 Query structure
3.2.2 Query flow
3.2.3 Scheduling
3.3 Database Aspects
3.3.1 Database model
3.3.2 Partitioning of responsibilities
3.3.3 Refinement of responsibilities
3.3.4 Number of databases
3.4 Application Model
4 Implementation
4.1 Technologies
4.1.1 Boost
4.1.2 MPI
4.1.3 SOAP
4.1.4 Apache log4cxx
4.2 API and Internals of Modules
4.2.1 TF::Utils namespace
4.2.2 TF::Serialization namespace
4.2.3 TF::Thread namespace
4.2.4 TF::Messages namespace
4.2.5 TF::Query namespace
4.2.6 TF::Topology namespace
4.2.7 TF::Coordination namespace
4.2.8 TF::Master namespace
4.2.9 TF::Worker namespace
4.2.10 TF::Aggregation namespace
4.2.11 TF::Interfaces namespace
4.2.12 TF::ApplicationModel namespace
4.3 Sample Application (Spectral Comparison)
4.3.1 Problem
4.3.2 Algorithm
4.3.3 Implementation
4.3.4 Web application
4.4 Installation and dependencies
5 Analysis
5.1 Methodologies and Tools
5.1.1 Query submission
5.1.2 Logging
5.1.3 Execution control and report generation
5.2 Homogeneous Environments with One Database
5.2.1 Thrashing effects
5.2.2 Database loading time
5.2.3 Speedup
5.3 Refinement of Responsibilities
5.3.1 Slow node
5.3.2 Thermal problems
5.3.3 Inhomogeneous database partitions
5.4 Multiple Databases
5.5 Conclusion
6 Conclusion
List of Figures
1.1.1 FASTA Format [25]