bidirectional expansion for keyword search on graph databases varun kacholia shashank pandit...

Post on 24-Dec-2015





  • Slide 1
  • Bidirectional Expansion for Keyword Search on Graph Databases. Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan, Rushi Desai, Hrishikesh Karambelkar
  • Slide 2
  • Keyword Search on Graph Representation of Data
      • Keyword search on relational, XML, HTML, etc. data: BANKS, Discover, DBXplorer, XRank, etc.
      • Need to find a (closely) connected set of nodes that together match all given keywords
      • Focus of our work: search algorithms to find connections between nodes
  • Slide 3
  • Outline
      • Data, Query and Response Models
      • Backward Search Algorithm
      • Bidirectional Search Algorithm
      • Experiments
      • Related Work
      • Conclusions
  • Slide 4
  • Graph Data Model
      • Data modeled as a directed weighted graph: BANKS [ICDE02]
      • Can model relational, XML, HTML, etc. data
      • E.g., DBLP database: node = tuple, edge = foreign key reference
      • [Figure: DBLP example graph. Node types: author, writes, paper. Node labels: "Sudarshan", "Prasan Roy", "Soumen", "Multi-Query Optimization", "BANKS: Keyword search"]
  • Slide 5
  • Graph Data Model (2)
      • E.g., XML data
      • [Figure: XML example. Element labels: proceedings, paper (@id = 1), paper (@id = 2), title, cite. Text labels: "Databases", "Keyword Search", "Databases"]
  • Slide 6
  • Response Model
      • Response: minimal, rooted tree connecting keyword nodes
          • Undirected: Discover, DBXplorer
          • Directed: BANKS
      • E.g., "Sudarshan Roy"
      • [Figure: answer tree. Node types: author, writes, paper. Node labels: "Sudarshan", "Prasan Roy", "Multi-Query Optimization"]
  • Slide 7
  • Response Ranking
      • Edge score = E_A
          • Smaller tree => higher score
          • E.g., BANKS: E_A = 1/(Σ edge weights)
      • Node score = N_A
          • Measure of authority of nodes in tree
          • E.g., BANKS: N_A = Σ (leaf and root node authorities)
      • Overall score = f(E_A, N_A)
          • E.g., BANKS: f(E_A, N_A) combines the two as a product, E_A · N_A
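As a rough illustration of the ranking scheme on the Response Ranking slide, the Python sketch below computes an edge score, a node score, and a combined score for an answer tree. The function names, the plain product used to combine the two scores, and the sample weights are assumptions made for illustration; the slide itself only states that smaller trees and more authoritative nodes should rank higher.

```python
def edge_score(edge_weights):
    """E_A: inverse of the total edge weight, so smaller trees score higher.

    Assumes the tree has at least one edge with positive weight.
    """
    return 1.0 / sum(edge_weights)


def node_score(authorities):
    """N_A: sum of the authority values of the leaf and root nodes."""
    return sum(authorities)


def overall_score(edge_weights, authorities):
    """Assumed combination: a plain product of edge score and node score."""
    return edge_score(edge_weights) * node_score(authorities)


# Two hypothetical answer trees with the same node authorities.
small_tree = overall_score([1.0, 1.0], [0.5, 0.5, 1.0])
large_tree = overall_score([1.0, 1.0, 2.0], [0.5, 0.5, 1.0])
```

With equal node authorities, the tree with less total edge weight ends up with the higher overall score.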
  • Slide 8
  • Finding Answer Trees
      • Backward Expanding Search: BANKS [ICDE02]
      • Intuition: travel backwards from keyword nodes till you hit a common node
      • [Figure: query "sudarshan roy". Node types: authors, writes, paper. Node labels: "Sudarshan", "Prasan Roy", "MultiQuery Optimization"]
  • Slide 9
  • Backward Search: Algorithm
      • Run concurrent single-source shortest-path iterators from each node matching a keyword
          • Traverse the graph edges in reverse direction
          • Output next nearest node on each get-next() call
      • Do best-first search across iterators
      • Output a node if it is in the intersection of the sets of nodes reached from each keyword
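The steps on the Backward Search: Algorithm slide can be sketched in Python roughly as follows. This is a minimal illustration, not the BANKS implementation: the edge-list input format, the function names, and the toy graph are assumptions, and the sketch reports only candidate root nodes with their per-keyword distances rather than building and ranking full answer trees.

```python
import heapq
from collections import defaultdict


def backward_iterator(rev, source):
    """Dijkstra over reversed edges; yields (distance, node), nearest first."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        yield d, u
        for v, w in rev.get(u, ()):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))


def backward_search(edges, keyword_nodes):
    """edges: (u, v, weight) triples; keyword_nodes: keyword -> matching nodes.

    Yields (root, {keyword: distance}) once a node has been reached
    from at least one matching node of every keyword.
    """
    rev = defaultdict(list)
    for u, v, w in edges:
        rev[v].append((u, w))  # follow edges in the reverse direction

    iterators, frontier = [], []
    for kw, nodes in keyword_nodes.items():
        for n in sorted(nodes):  # one iterator per matching node
            it = backward_iterator(rev, n)
            d, u = next(it)
            heapq.heappush(frontier, (d, len(iterators), u))
            iterators.append((kw, it))

    reached = defaultdict(dict)  # node -> {keyword: nearest distance}
    emitted = set()
    while frontier:
        # Best-first across iterators: take the globally nearest next node.
        d, idx, u = heapq.heappop(frontier)
        kw, it = iterators[idx]
        reached[u].setdefault(kw, d)
        if u not in emitted and len(reached[u]) == len(keyword_nodes):
            emitted.add(u)
            yield u, dict(reached[u])
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(frontier, (nxt[0], idx, nxt[1]))


# Toy graph (assumed): a paper node pointing to two "writes" tuples,
# each pointing to one author.
edges = [("paper", "w1", 1.0), ("w1", "sudarshan", 1.0),
         ("paper", "w2", 1.0), ("w2", "roy", 1.0)]
answers = list(backward_search(
    edges, {"sudarshan": {"sudarshan"}, "roy": {"roy"}}))
```

Because one iterator is created per matching node, a keyword that matches many nodes creates many iterators, which is the cost the later slides set out to reduce.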
  • Slide 10
  • Backward Search: Limitations
      • Wasteful exploration of graph:
          • Frequently occurring keywords
          • Hub nodes in the graph (high in-degree)
      • [Figure: example with keywords "Database", "Shashank", "Sudarshan". Schema legend: author, paper, writes]
  • Slide 11
  • Bidirectional Search: Motivation
  • Slide 12
  • Bidir Search: Intuition
      • First-cut solution:
          • Don't go backward if keyword matches many nodes
          • Don't go backward if node points to a hub
          • Instead explore forward from other keywords
  • Slide 13
  • Bidir Search: Example
      • [Figure: example with keywords "Database", "Shashank", "Sudarshan". Schema legend: author, paper, writes]
  • Slide 14
  • Bidir Search: Issues
      • What should the threshold for not expanding be?
          • Our solution: prioritize expansion of nodes based on spreading activation, to penalize frequent keywords and bushy trees
      • How to manage exploration in both directions?
  • Slide 15
  • Bidir Search: Spreading Activation
      • Node with highest activation explored first
      • Every node given an initial activation
          • Gives low activation to frequently occurring keywords
      • [Figure: node labelled "John" with activation 1/5]
  • Slide 16
  • Bidir Search: Spreading Activation
      • Node with highest activation explored first
      • Activation spread to neighbors (spreading parameter = 0.3)
          • Gives low activation to neighbors of hubs
      • [Figure: activation spreading outward from a node with activation 1/5; value labels include 0.3 × 1/5 and 0.7 × 1/5 × 1/4]
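The two spreading-activation slides can be illustrated with a small Python sketch. It rests on several assumptions that the transcript does not confirm: each node matching a keyword starts with activation 1/(number of matching nodes), a node passes on a fixed fraction of its activation split equally among its neighbors, and incoming activation is added to what a neighbor already holds. Which of the two fractions the slide's 0.3 denotes is not recoverable from the transcript, so `spread_fraction` is left as an explicit parameter. All names are hypothetical.

```python
def initial_activation(matching_nodes):
    """Give each node matching a keyword an equal share of one unit.

    A keyword matched by many nodes therefore yields a low activation per node.
    """
    share = 1.0 / len(matching_nodes)
    return {node: share for node in matching_nodes}


def spread(activation, node, neighbors, spread_fraction):
    """Return a new activation map after `node` spreads to `neighbors`.

    The node keeps (1 - spread_fraction) of its activation; the rest is
    split equally, so neighbors of a hub each receive very little.
    """
    updated = dict(activation)
    if not neighbors:
        return updated
    current = updated.get(node, 0.0)
    updated[node] = current * (1.0 - spread_fraction)
    per_neighbor = current * spread_fraction / len(neighbors)
    for nb in neighbors:
        updated[nb] = updated.get(nb, 0.0) + per_neighbor
    return updated


# Keyword matched by five nodes: each starts at 1/5.
act = initial_activation({"john1", "john2", "john3", "john4", "john5"})

# Spreading to 4 neighbors versus spreading to a 100-neighbor hub.
few = spread(act, "john1", ["a", "b", "c", "d"], spread_fraction=0.7)
hub = spread(act, "john1", [f"n{i}" for i in range(100)], spread_fraction=0.7)
```

Under these assumptions each of the four neighbors receives far more activation than any neighbor in the 100-way split, so the search would expand the former first.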
  • Slide 17
  • Bidir Search: Iterators
      • How to manage exploration in both directions?
          • Single backward iterator + single forward iterator with suitable data structures, e.g., to keep track of parents of nodes
          • Details in full paper
      • [Figure: example graph with keyword nodes A and B and numbered intermediate nodes, each annotated with a pair [Dist from A, Dist from B]]
  • Slide 18
  • Bidir Search: Algorithm
      • Activate matching nodes; insert into backward iterator
      • while (iterators are not empty)
          • Choose iterator for expansion in best-first manner
          • Explore node with highest activation
          • Spread activation to neighbors
          • Update path weights (and other data structures)
          • Propagate values to ancestors if necessary
          • Insert nodes explored in the backward direction into the forward iterator /* for future forward exploration */
      • Stop when top-k results are produced
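The loop on the Bidir Search: Algorithm slide might look roughly like the Python sketch below. It is a heavily simplified illustration under stated assumptions, not the authors' implementation: activation is split equally among neighbors and summed, the search stops after `k` candidate roots rather than after `k` ranked answer trees, and only roots with their best-known per-keyword distances are returned. The names `bidirectional_search`, `relax`, and `touch` are hypothetical.

```python
import heapq
from collections import defaultdict


def bidirectional_search(edges, keyword_nodes, k=10, spread_fraction=0.5):
    out_adj, in_adj = defaultdict(list), defaultdict(list)
    for u, v, w in edges:
        out_adj[u].append((v, w))
        in_adj[v].append((u, w))

    n_keywords = len(keyword_nodes)
    dist = defaultdict(dict)   # node -> {keyword: best known path length}
    act = defaultdict(float)   # node -> activation
    known, roots = set(), []
    q_in, q_out = [], []       # backward / forward queues (max-heaps)
    done_in, done_out = set(), set()

    def relax(u, kw, d):
        """Record a shorter path from u to kw and propagate to known ancestors."""
        if d >= dist[u].get(kw, float("inf")):
            return
        dist[u][kw] = d
        if len(dist[u]) == n_keywords and u not in roots:
            roots.append(u)
        for p, w in in_adj[u]:
            if p in known:
                relax(p, kw, d + w)

    def touch(node, extra, heap, done):
        """Add activation to a node and queue it unless already expanded."""
        known.add(node)
        act[node] += extra
        if node not in done:
            heapq.heappush(heap, (-act[node], node))

    # Activate matching nodes and insert them into the backward iterator.
    for kw, nodes in keyword_nodes.items():
        for n in nodes:
            touch(n, 1.0 / len(nodes), q_in, done_in)
    for kw, nodes in keyword_nodes.items():
        for n in nodes:
            relax(n, kw, 0.0)

    while len(roots) < k:
        for heap, done in ((q_in, done_in), (q_out, done_out)):
            while heap and heap[0][1] in done:   # drop stale entries
                heapq.heappop(heap)
        if not q_in and not q_out:
            break
        # Choose the iterator whose best node has the highest activation.
        backward = bool(q_in) and (not q_out or q_in[0][0] <= q_out[0][0])
        if backward:
            _, v = heapq.heappop(q_in)
            done_in.add(v)
            share = act[v] * spread_fraction / max(len(in_adj[v]), 1)
            for u, w in in_adj[v]:
                touch(u, share, q_in, done_in)
                for kw, d in list(dist[v].items()):
                    relax(u, kw, d + w)
            if v not in done_out:                # for future forward exploration
                heapq.heappush(q_out, (-act[v], v))
        else:
            _, v = heapq.heappop(q_out)
            done_out.add(v)
            share = act[v] * spread_fraction / max(len(out_adj[v]), 1)
            for x, w in out_adj[v]:
                touch(x, share, q_out, done_out)
                for kw, d in list(dist[x].items()):
                    relax(v, kw, d + w)
    return [(r, dict(dist[r])) for r in roots[:k]]


# Toy graph (assumed): paper -> writes tuples -> authors.
edges = [("paper", "w1", 1.0), ("w1", "sudarshan", 1.0),
         ("paper", "w2", 1.0), ("w2", "roy", 1.0)]
found = bidirectional_search(
    edges, {"sudarshan": {"sudarshan"}, "roy": {"roy"}})
```

Unlike the purely backward strategy, a single backward queue and a single forward queue are shared by all keywords, and activation rather than distance decides what is expanded next.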
  • Slide 19
  • Bidir Search: top-k results
      • Results need not be generated in order
      • Naive solution
          • Store results in an intermediate heap
          • Output top k results after m·k total results have been generated (m ~ 10)
      • Can do better
          • Compute upper bound on score of next result; output answers with a higher score
          • Similar to NRA algorithm (Fagin et al., PODS01)
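The naive buffering strategy on this slide can be sketched in a few lines of Python. The function name and the (score, answer) tuple format are assumptions for illustration.

```python
import heapq


def naive_top_k(results, k, m=10):
    """Buffer up to m*k (score, answer) pairs, then return the k best.

    `results` may arrive in any order; anything generated after the
    first m*k results is never seen, so the output is approximate.
    """
    buffered = []
    for item in results:
        buffered.append(item)
        if len(buffered) >= m * k:
            break
    return heapq.nlargest(k, buffered, key=lambda r: r[0])


# Out-of-order stream of hypothetical answers.
stream = [(0.2, "a"), (0.9, "b"), (0.5, "c"), (0.7, "d"), (0.1, "e"), (0.8, "f")]
top = naive_top_k(iter(stream), k=2, m=2)
```

In this example the buffer stops after four results, so the answer scored 0.8 is missed; that weakness is what motivates the upper-bound approach mentioned on the slide.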
  • Slide 20
  • Experiments
      • Datasets
          • DBLP, IMDB: ~2 million nodes, 9 million edges
          • US Patent DB: ~4 million nodes, 15 million edges
      • Workload
          • Keywords randomly picked from results of SQL join statements
      • Search algorithms
          • MI-Bkwd: original backward search; iterator for every node matching a keyword
          • SI-Bkwd: backward search with single backward iterator
          • Bidirec: bidirectional search
      • Time taken / nodes explored
          • Measured when the 10th answer is generated (or last answer if #answers < 10)
      • Origin size
          • #nodes matched by keywords in the query
  • Slide 21
  • Experiments (2)
      • MI-Bkwd versus SI-Bkwd
      • SI-Bkwd gain increases with origin size, # keywords
  • Slide 22
  • Experiments (3)
      • SI-Bkwd versus Bidirec
      • Bidirec gain increases with origin size, # keywords
  • Slide 23
  • Experiments (4)
      • Precision/recall experiments
          • Relevant answers are well-defined; can be generated through SQL statements
      • Both MI-Backward and Bidirectional show similar performance
          • Recall ~ 100%
          • Precision ~ 100% at near full recall
          • Few irrelevant answers produced before generating all relevant answers
      • Bidirectional runs faster, yet minimal loss of relevant results!
  • Slide 24
  • Experiments (5)
      • Comparison with Sparse: Hristidis et al. [VLDB03]
          • Generate join expressions leading to query results
          • Use DB-provided scores for ranking tuples and aggregate them to rank answer trees
          • For top-k results: automatically determine required number of join expressions
      • Sparse-LB
          • Manually generate required join expressions
          • Sparse needs to do at least this much (and usually a lot more!)
      • Bidirectional versus Sparse-LB
          • Bidirectional outperforms by a factor of ~3 (esp. when #joins is large)
  • Slide 25
  • Experiments (6)
      • SI-Bkwd versus Bidirec: by origin size
      • Bidirec gains more with unbalanced origin sizes
      • Origin-size combinations: A = (T,S,S,S); B = (M,M,M,M); C = (M,L,L,L); D = (M,M,L,L); E = (T,L,L,L); F = (T,S,M,L); G = (T,M,L,L); H = (T,T,T,L)
  • Slide 26
  • Discussion
      • Bidirectional search as dynamic per-tuple join ordering
          • Related work in this area: Eddies
      • Bidirectional search
          • Schema-less
          • Prioritization based on activation instead of selectivity
          • Generate answers in relevance order
  • Slide 27
  • Related Work
      • Keyword querying on relational data: Discover (UCSD), DBXplorer (Microsoft)
          • Use SQL generation, without in-memory data structures
          • Issues: generate join plans, re-use common sub-expressions, etc.
      • Keyword querying on XML: XRank (Cornell), Schema-Free XQuery (Michigan)
          • Tree model is too limited
      • ObjectRank
  • Slide 28
  • Conclusions
      • Graph model
          • Convenient common-denominator representation
          • Schema-free querying leads to graph search
      • Purely backward strategy inadequate
      • Bidirectional search with spreading activation performs much better
          • Dynamically choose join order on per-tuple basis
  • Slide 29
  • Thank You! Questions??
  • Slide 30
  • Future of Keyword Search in DBs
      • Next generation of intelligent search will require context information
          • E.g., search email, files, calendar, ...
          • Information integration will be important
          • Graph-structured data will be a key component
      • Is there a killer app? Deep web?
      • Display of answers
          • Users don't want to see schema details
          • Can we leverage off existing (Web) apps?
  • Slide 31
  • BANKS Future Work
      • Applications of BANKS
          • Soumen Chakrabarti, Sunita Sarawagi and students: exploit BANKS to integrate different sources of data; extract information, infer soft links
          • BANKS for personal information management: SPIN (Search Personal Information Networks)
      • Ongoing/future work on BANKS
          • More sysadmin/user control on ranking: one size does not fit all; BANKS provides infrastructure
          • Characterize bidirectional search better, and find other applications
          • Security
  • Slide 32
  • Bidir Search: top-k results (2)
      • Compute upper bound on score of next result; output answers with a higher score
      • Computing the bound
          • m_i = minimum path length explored backward from keyword i
          • Unseen answer node: 1/(m_1 + m_2 + ... + m_n)
          • Visited answer node: suppose it was reached from the first x keywords with distances d_i: 1/[(d_1 + d_2 + ... + d_x) + (m_{x+1} + m_{x+2} + ... + m_n)]
          • Combine this with max node prestige
      • We simply use 1/(m_1 + m_2 + ... + m_n)!
          • Experiments show no significant loss in using this heuristic
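The two bounds on this slide, and the rule of outputting buffered answers that score above the bound, can be written out as a short Python sketch. The function names, the list/dict encoding of m_i and d_i, and the sample numbers are assumptions for illustration; node prestige is left out, matching the simplified heuristic the slide says is used.

```python
def unseen_bound(m):
    """Upper bound on the edge score of an answer rooted at an unseen node.

    m[i] is the minimum path length explored backward from keyword i.
    """
    total = sum(m)
    return float("inf") if total == 0 else 1.0 / total


def visited_bound(d, m):
    """Bound for a node already reached from some keywords.

    d maps keyword index -> known distance; missing keywords fall back to m[i].
    """
    total = sum(d.get(i, m[i]) for i in range(len(m)))
    return float("inf") if total == 0 else 1.0 / total


def release(buffered, bound):
    """Split buffered (score, answer) pairs into those safe to output and the rest."""
    ready = sorted((a for a in buffered if a[0] > bound), reverse=True)
    waiting = [a for a in buffered if a[0] <= bound]
    return ready, waiting


m = [2.0, 3.0, 1.0]                       # hypothetical frontier distances
bound = unseen_bound(m)                   # 1 / (2 + 3 + 1)
partial = visited_bound({0: 4.0}, m)      # 1 / (4 + 3 + 1)
ready, waiting = release([(0.5, "t1"), (0.1, "t2"), (0.25, "t3")], bound)
```

Answers scoring above the bound cannot be overtaken by any result generated later, so they can be output immediately while the others stay buffered.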