keyword search on external memory data bbd/vldb2008.pdf · pdf filekeyword search on...

Click here to load reader

Post on 13-Jul-2018




0 download

Embed Size (px)


  • Keyword Search on External Memory Data Graphs

    Bhavana Bharat Dalvi

    Meghana Kshirsagar

    S. SudarshanComputer Science and Engg. Dept., I.I.T. Bombay

    [email protected], [email protected], [email protected]

    ABSTRACTKeyword search on graph structured data has attracted alot of attention in recent years. Graphs are a natural low-est common denominator representation which can com-bine relational, XML and HTML data. Responses to key-word queries are usually modeled as trees that connect nodesmatching the keywords.

    In this paper we address the problem of keyword searchon graphs that may be significantly larger than memory.We propose a graph representation technique that combinesa condensed version of the graph (the supernode graph)which is always memory resident, along with whatever partsof the detailed graph are in a cache, to form a multi-granulargraph representation. We propose two alternative approacheswhich extend existing search algorithms to exploit multi-granular graphs; both approaches attempt to minimize IOby directing search towards areas of the graph that are likelyto give good results. We compare our algorithms with a vir-tual memory approach on several real data sets. Our experi-mental results show significant benefits in terms of reductionin IO due to our algorithms.

    1. INTRODUCTIONKeyword search on graph structured data has attracted a

    lot of attention in recent years. Graphs are a natural low-est common denominator representation which can com-bine relational, XML and HTML data. Graph representa-tions are particularly natural when integrating data frommultiple sources with different schemas, and when integrat-ing information extracted from unstructured data. Personalinformation networks (e.g. [7]) which combine informationfrom different sources available to a user, such as email, doc-uments, organizational data and social networks, can also benaturally represented as graphs. Keyword querying is par-ticularly important in such scenarios.

    Responses to keyword queries are usually modeled as trees

    Current affiliation: Google Inc., BangaloreCurrent affiliation: Yahoo! Labs, Bangalore

    Permission to copy without fee all or part of this material is granted providedthat the copies are not made or distributed for direct commercial advantage,the VLDB copyright notice and the title of the publication and its date appear,and notice is given that copying is by permission of the Very Large DataBase Endowment. To copy otherwise, or to republish, to post on serversor to redistribute to lists, requires a fee and/or special permission from thepublisher, ACM.VLDB 08, August 24-30, 2008, Auckland, New ZealandCopyright 2008 VLDB Endowment, ACM 000-0-00000-000-0/00/00.

    that connect nodes matching the keywords. Each answertree has an associated score based on node and edge weights,and the top K answers have to be retrieved.

    There has been a great deal of work on keyword query-ing of structured and semi-structured data in recent years.Several of these approaches are based on in-memory graphsearch. This category includes the backward expanding search[3], bidirectional search [16], dynamic programming tech-nique DPBF [9], and BLINKS [13]. All these algorithmsassume that the graph is in-memory.

    As discussed in [3, 16], since the graph representation doesnot have to store actual textual data, it is fairly compact andgraphs with millions of nodes, corresponding to hundreds ofmegabytes of data, can be stored in tens of megabytes ofmain memory. If we have a dedicated server for search,even larger graphs such as the English Wikipedia (whichcontained over 1.4 million nodes and 34 million links (edges)as of October 2006) can be handled. However, a graph ofthe size of the Web graph, with billions of nodes, will not fitin memory even on large servers.

    More relevantly, integrated search of a users personal in-formation, coupled with organizational and Web informa-tion is widely viewed as an important goal (e.g. [7]). Forreasons of privacy, such search has to be performed on ausers machine, where many applications contend for avail-able memory; it may not be feasible to dedicate hundreds ofmegabytes of memory (which will be needed with multipledata sets) for occassionally used search.

    A naive use of in-memory algorithms on very large graphsmapped to virtual memory would result in a very significantIO cost, as can be seen from the experimental results inSection 7. Related work is described in more detail inSection 2. None of the earlier work has addressed the caseof keyword search on very large graphs.

    In this paper we address the problem of keyword search ongraphs that may be significantly larger than memory. Ourcontributions are as follows:

    1. We propose (in Section 4) a multi-granular graph rep-resentation technique, which combines a condensed ver-sion of the graph (the supernode graph) which isalways memory resident, along with whatever partsof the detailed graph are in a cache. The supernodegraph is formed by clustering nodes in the full graphinto supernodes, with superedges created between su-pernodes that contain connected nodes. The multi-granular graph represents all information about thepart of the full graph that is currently available inmemory.


    Permission to make digital or hard copies of portions of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyright for components of this work owned by others than VLDB Endowment must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers or to redistribute to lists requires prior specific permission and/or a fee. Request permission to republish from: Publications Dept., ACM, Inc. Fax +1 (212) 869-0481 or [email protected] PVLDB '08, August 23-28, 2008, Auckland, New Zealand Copyright 2008 VLDB Endowment, ACM 978-1-60558-305-1/08/08

  • The idea of multi-level hierarchical graphs formed byclustering nodes or edges of a graph has been usedin many contexts, as outlined in Section 2, and hi-erarchical multi-granular graph views have been usedfor visualization (e.g. [6]). However, as far as we areaware, earlier work on search (including keyword andshortest path search search) does not exploit the cachepresence of parts of the detailed graph.

    2. We propose two alternative approaches which extendexisting search algorithms to exploit multi-granulargraphs; both approaches attempt to minimize IO bydirecting search towards areas of the graph that arelikely to give good results.

    The first approach, the iterative approach (Section 5),performs search on the multi-granular graph. The an-swers generated could contain supernodes. Such supernodes are expanded, and the search algorithm is ex-ecuted again on the new graph state, till the top-Kanswers are found. Any technique for keyword searchon graphs, such as backward expanding, bidirectionalor DPBF, that does not depend on precomputed in-dices can be used within an iteration.

    The second approach, described in Section 6, is anincremental approach, which expands supernodes asbefore, but instead of restarting search from scratch,adjusts the in-memory data structures to reflect thechanged state of the multi-granular graph. This savessignificantly on the CPU cost, and as it turns out,reduces IO effort also. We present an incremental ver-sion of the backward expanding search algorithm of [3],although in principle it should be possible to createincremental versions of other search algorithms suchas bidirectional search [16] or the search technique ofBLINKS [13].

    3. We compare our algorithms and heuristics with a vir-tual memory approach and the sparse approach of[14], on several real data sets. Our experimental re-sults (Section 7) show significant benefits in terms ofreduction in IO due to our algorithms, and our heuris-tics allow very efficient evaluation while giving verygood recall.

    2. RELATED WORKWork on keyword querying can be broadly classified into

    two categories based on how they use schema information:

    1. Schema-based approaches: In these approaches, a sch-ema-graph1of the database is used, along with thequery keywords and text-indices to first generate can-didate or probable answer trees. SQL queries cor-responding to each candidate tree are computed andeach query is executed against the database to getresults. Only some of the original candidate answertrees may produce results finally. DBXplorer [1], DIS-COVER [15], [14], [20] and [21] present algorithmsbased on this model.

    Schema-based approaches are only applicable to query-ing on relational data. Moreover, none of the algo-

    1A graph with relations/tables as nodes, and edges betweentwo nodes if there exists a foreign-key to primary-key rela-tionship between the corresponding tables.

    rithms listed above provides an effective way of gener-ating top-K results in the presence of ranking functionsbased on data-level node and edge weights.

    2. Schema-free approaches: These approaches are appli-cable to arbitrary graph data, not just to relationaldata. Graph data can include node and edge weights,and answers are ranked on these weights. The goal ofthese algorithms is to generate top-k answers in rankorder. Algorithms in this category include RIU [19],backward expanding search and bidirectional search[3, 16], the dynamic programming algorithm DPBF[9], and BLINKS [13]. All the above algorithms as-sume either implicitly or explicitly that the graph is in-memory, and would incur high random-IO overheadsif the graph (as well as the associated bi-level index inthe case of BLINKS) does not fit in memory.

    A graph search technique based on merging of postinglist