  • Distributed Searching in

    Biological Databases

    by

    Dominic Battré, B.Comp.Sc.

    Thesis

    Presented to the Faculty of the

    School of Computer Science,

    Telecommunications and Information Systems,

    DePaul University

    in Partial Fulfillment

    of the Requirements

    for the Degree of

    Master of Science

    DePaul University

    October 2005

  • Distributed Searching in

    Biological Databases

    APPROVED BY

    SUPERVISING COMMITTEE:

    David Angulo

    Massimo DiPierro

    Ljubomir Perkovic

  • Acknowledgement

    First, I want to thank my supervisor, David Angulo of DePaul University, for providing me with the support and the space to pursue my ideas and interests throughout my whole course of studies.

    Furthermore, I want to express my gratitude to the Paderborn Center for Parallel Computing, especially Prof. Dr. Odej Kao and Axel Keller, for providing me with access to a 96-node cluster and thereby facilitating this research. The help that Dr. Rajeev Thakur of the Argonne National Lab provided on the new multithreading support of MPICH 2 was very valuable. I am thankful for the ideas and comments of the researchers of the Illinois Biogrid, Dr. Gregor von Laszewski, and Dr. Nicholas Karonis of the Argonne National Lab.

    I want to thank the Fulbright Commission and the German National Academic Foundation, which supported and facilitated my studies at DePaul University with scholarships.

    Finally, and foremost, I want to thank my family for their love and constant support.

    Chicago, IL, USA, October 2005

    Dominic Battré

  • Distributed Searching in

    Biological Databases

    by

    Dominic Battré, M.S. DePaul University

    SUPERVISOR: David Angulo

    This thesis addresses the problem of searching huge biological databases, on the scale of several gigabytes, by means of parallel processing. Biological databases storing DNA sequences, protein sequences, or mass spectra are growing exponentially, and searches through these databases consume exponentially growing computational resources as well. The thesis demonstrates and analyzes a general-purpose, MPI-based C++ framework that generically splits databases among several computational nodes. The combined RAM of the nodes working in tandem is often sufficient to keep the entire database in memory, and therefore to search it efficiently without paging to disk. The framework runs as a persistent service that processes all submitted queries, which allows for query reordering and better utilization of memory. As a result, we achieve superlinear speedups compared to single-processor implementations. We demonstrate the utility and speedup of the framework using a real biological database and an actual search algorithm for mass spectrometry.
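    The idea of splitting a database among nodes and combining their partial results can be illustrated with a minimal C++ sketch. It is not the framework's API: the names `partition` and `merge_top_k`, the contiguous-range split, and the "higher score is better" convention are assumptions made for illustration, and MPI communication is left out entirely so that the example stays self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Half-open range [begin, end) of database record indices owned by one worker.
using Range = std::pair<std::size_t, std::size_t>;

// Hypothetical helper: split `records` entries into `workers` contiguous,
// near-equal ranges. The first (records % workers) workers get one extra record.
std::vector<Range> partition(std::size_t records, std::size_t workers) {
    assert(workers > 0);
    std::vector<Range> ranges;
    ranges.reserve(workers);
    const std::size_t base = records / workers;
    const std::size_t extra = records % workers;
    std::size_t begin = 0;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t size = base + (w < extra ? 1 : 0);
        ranges.emplace_back(begin, begin + size);
        begin += size;
    }
    return ranges;
}

// A search hit: (score, record index). Assumption: a higher score is better.
using Hit = std::pair<double, std::size_t>;

// Hypothetical helper: merge the hit lists returned by all workers and keep
// the k best hits overall, sorted by descending score.
std::vector<Hit> merge_top_k(const std::vector<std::vector<Hit>>& per_worker,
                             std::size_t k) {
    std::vector<Hit> all;
    for (const auto& hits : per_worker) {
        all.insert(all.end(), hits.begin(), hits.end());
    }
    const std::size_t keep = std::min(k, all.size());
    std::partial_sort(all.begin(),
                      all.begin() + static_cast<std::ptrdiff_t>(keep),
                      all.end(),
                      [](const Hit& a, const Hit& b) { return a.first > b.first; });
    all.resize(keep);
    return all;
}
```

    In this sketch each worker would search only its own range, which is what lets the combined memory of all nodes hold the whole database; a master would then call something like `merge_top_k` on the partial results.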

  • Contents

    1 Problem description
        1.1 Applications
            1.1.1 DNA databases
            1.1.2 Protein databases
            1.1.3 Mass spectrometry databases
            1.1.4 Conclusion
        1.2 The Master/Worker Paradigm
            1.2.1 Analysis of speedups
            1.2.2 Parallelizing sequence database searches
        1.3 Goals
        1.4 Organization of the thesis

    2 Related research
        2.1 Frameworks
            2.1.1 Condor MW
            2.1.2 AMWAT
            2.1.3 Java Distributed System
            2.1.4 The Organic Grid
            2.1.5 Other Frameworks
            2.1.6 Conclusion
        2.2 Specific implementations
            2.2.1 mpiBLAST
            2.2.2 pioBLAST
            2.2.3 Multi-Ring Buffer Management for BLAST
            2.2.4 BRIDGES GT3 BLAST
            2.2.5 NCBI BLAST website
        2.3 Conclusion

    3 Approaches to the problem
        3.1 High level Design
            3.1.1 Servicing style
            3.1.2 Application topology
            3.1.3 Result aggregation
            3.1.4 Internal parallelism
            3.1.5 External interface
            3.1.6 Conclusion
        3.2 Query Processing
            3.2.1 Query structure
            3.2.2 Query flow
            3.2.3 Scheduling
        3.3 Database Aspects
            3.3.1 Database model
            3.3.2 Partitioning of responsibilities
            3.3.3 Refinement of responsibilities
            3.3.4 Number of databases
        3.4 Application Model

    4 Implementation
        4.1 Technologies
            4.1.1 Boost
            4.1.2 MPI
            4.1.3 SOAP
            4.1.4 Apache log4cxx
        4.2 API and Internals of Modules
            4.2.1 TF::Utils namespace
            4.2.2 TF::Serialization namespace
            4.2.3 TF::Thread namespace
            4.2.4 TF::Messages namespace
            4.2.5 TF::Query namespace
            4.2.6 TF::Topology namespace
            4.2.7 TF::Coordination namespace
            4.2.8 TF::Master namespace
            4.2.9 TF::Worker namespace
            4.2.10 TF::Aggregation namespace
            4.2.11 TF::Interfaces namespace
            4.2.12 TF::ApplicationModel namespace
        4.3 Sample Application (Spectral Comparison)
            4.3.1 Problem
            4.3.2 Algorithm
            4.3.3 Implementation
            4.3.4 Web application
        4.4 Installation and dependencies

    5 Analysis
        5.1 Methodologies and Tools
            5.1.1 Query submission
            5.1.2 Logging
            5.1.3 Execution control and report generation
        5.2 Homogeneous Environments with One Database
            5.2.1 Thrashing effects
            5.2.2 Database loading time
            5.2.3 Speedup
        5.3 Refinement of Responsibilities
            5.3.1 Slow node
            5.3.2 Thermal problems
            5.3.3 Inhomogeneous database partitions
        5.4 Multiple Databases
        5.5 Conclusion

    6 Conclusion

  • List of Figures

    1.1.1 FASTA Format [25]
