keyword search on graph-structured data s. sudarshan iit bombay joint work with soumen chakrabarti,...
TRANSCRIPT
Keyword Search on Graph-Structured Data
S. Sudarshan IIT Bombay
Joint work withSoumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe,Rushi Desai, Hrishi K., Arvind Hulgeri, Bhavana Dalvi and Meghana Kshirsagar
Jan 2009
2
Outline
Motivation and Graph Data Model Query/Answer models
Tree answer model Proximity queries
Graph Search Algorithms Backward Expanding Search Bidirectional Search Search on external memory graphs
Conclusion
3
Keyword Search on Semi-Structured Data
Keyword search of documents on the Web has been enormously successful
Much data is resident in databases Organizational, government, scientific,
medical data Deep web
Goal: querying of data from multiple data sources, with different data models Often with no schema or partially
defined schema
4
Keyword Search on Structured/Semi-Structured Data
Key differences from IR/Web Search: Normalization (implicit/explicit) splits related
data across multiple tuples To answer a keyword query we need to find a
(closely) connected set of entities that together match all given keywords
soumen crawling or soumen byron
Focused Crawling …
Soumen C. Byron Dom
writes
author
paper
Sudarshan
BANKS: Keyword search…
5
Graph Data Model
Lowest common denominator across many data models Relational
node = tuple, edge = foreign key XML
Node = element, edge = containment/idref/keyref HTML
node = page, edge = hyperlink Documents
node = document, edge = links inferred by data extraction
Knowledge representation node = entity, edge = relationship
Network data e.g. social networks, communication networks
6
Graph Data Model (Cont)
Nodes can have labels
E.g. relation name, or XML tag textual or structured (attribute-value)
data Edges can have labels
7
Outline
Motivation and Graph Data Model Query/Answer models
Tree answer model Proximity queries
Graph Search Algorithms Backward Expanding Search Bidirectional Search Search on external memory graphs
Conclusion
8
Query/Answer Models
Basic query model: Keywords match node text/labels
Can extend query model with attribute specification, path specifications
e.g. paper(year < 2005, title:xquery),
Alternative answer models tree connecting nodes matching query
keywords nodes in proximity to (near) query
keywords
9
Tree Answer Model
Answer: Rooted, directed tree connecting keyword nodes In general, a Steiner
tree Multiple answers
possible Answer relevance
computed from answer edge score
combined with answer node score
Focused Crawling
Soumen C. Byron Dom
writes writes
author author
paper
Eg. “Soumen Byron”
10
Answer Ranking Naïve model: answers ranked by number
of edges Problem:
Some tuples are connected to many other tuples
E.g. highly cited papers, popular web sites Highly connected tuples create misleading
shortcuts six degrees of separation
Solution: use directed edges with edge weights allow answer tree to have edge uv if original
graph has vu, but at higher cost
11
Edge Weight Model-1
Forward edge weight (edge present in data) Default to 1, can be based on schema Lower weight closer connection
Create extra backward edges vu for each edge uv present in data
Edge weight log(1+#edges pointing to v) Overall Answer-tree Edge Score EA = 1/ ( edge weights)
Higher score better result
1
1
1
3
3
3
12
Edge Weight Model -2
Probabilistic edge scoring model Edge traversal probability (from a given node):
Forward 1/out-degree Backward 1/in-degree Can be weighted by edge type
Path weight = probability of following each edge in path
Edge score = log(edge traversal probability) Answer-tree Edge Score EA = (harmonic)
mean of path weights from root to each leaf
Note: other edge weight models possible our search algorithms are independent of how edge
weights are computed
13
Node Weight Node prestige based on indegree
More incoming edges higher prestige PageRank style transfer of prestige
Node weight computing using biased random walk model
Node weight: function of node prestige, other optional criteria such as TF/IDF
Answer-tree Node score NA = root node weight + leaf node weights
14
Overall Tree Answer Score
Overall score of answer tree A combine tree and node scores for details, and recall/precision metrics see
BANKS papers in ICDE 2002 and VLDB 2005 Anecdotal results on DBLP Bibliography
“Transaction”: Jim Gray’s classic paper and textbook at the top because of prestige (# of citations)
“soumen sudarshan”: several coauthored papers, links via common co-authors
“goldman shivakumar hector”: The VLDB 98 proximity search paper, followed by citation/co-author connections
15
Answer Models
Tree Answer Model Proximity (near query) model
16
Proximity Queries
Node weight by proximity author (near olap) (on DBLP) faculty (near earthquake) (on IITB thesis
database) Node prestige > if close to multiple
nodes matching near keywords Example applications
Finding experts on a particular area
RaghuWidomOLAP over uncertain ..
Computing sparse cubes…Overview of OLAP…Allocation in OLAP …
17
Proximity via Spreading Activation
Idea: Each “near” keyword has activation of 1
Divided among nodes matching keyword, proportional to their node prestige
Each node keeps fraction 1-μ of its received activation and spreads fraction μ amongst its neighbors
Combine activation ai received from neighbors
a = 1 – Π(1-ai) (belief function)
Graph may have cycles Iterate till convergence
18
Example Answers
Anecdotal results on DBLP Bibliography author (near recovery): Dave Lomet, C.
Mohan, etc sudarshan(near change): Sudarshan
Chawate sudarshan(near query): S. Sudarshan
Queries can combine proximity scores with tree scores hector sudarshan(near query) vs.
hector sudarshan author(near transactions) data integration
19
Related Work
Proximity Search Goldman, Shivakumar,
Venkatasubramanian and Garcia-Molina [VLDB98]
Considers only shortest path from each node, aggregates across nodes
Our version aggregates evidence from alternative paths
E.g. author (near “Surajit Chaudhuri”)
Object Rank [VLDB04] Similar idea to ours, precomputed
20
Related Work
Keyword querying on relational databases DBExplorer (Microsoft, ICDE02) Discover (UCSD,
VLDB02, VLDB03), Use SQL generation, not applicable to arbitrary graphs ranking based only on #nodes/edges
Keyword querying on XML : Tree Model XRank (Cornell, SIGMOD03), proximity in XML (AT&T
Research, VLDB03), Schema-Free XQuery (Michigan, VLDB04),
Tree model is too limited Keyword querying on XML: Graph Model
XKeyword (UCSD, ICDE03, VLDB03), SphereSearch (MaxPlanck, VLDB05)
ranking based only on #nodes/edges
21
Outline
Motivation and Graph Data Model Query/Answer models
Tree answer model Proximity queries
Graph Search Algorithms Backward Expanding Search Bidirectional Search Search on external memory graphs
Conclusion
22
Finding Answer Trees
Backward Expanding Search Algorithm (Bhalotia et al, ICDE02): Intuition: find vertices from which a forward
path exists to at least one node from each Si. Run concurrent single source shortest path
algorithm from each node matching a keyword Create an iterator for each node matching a keyword
Traverse the graph edges in reverse direction Output next nearest node on each get-next() call
Do best-first search across iterators Output an answer when its root has been reached
from each keyword Answer heap to collect and output results in score order
23
Backward Expanding Search
Soumen C. Byron Domauthors
Focused Crawlingpaper
Query: soumen byron
writes
24
Backward Exp. Search Limitations
Too many iterators (one per keyword node) Solution: single iterator per keyword (SI-Bkwd search)
tracks shortest path from node to keyword Changes answer set slightly
Different justifications for same root may be lost Not a big problem in practice
Nodes explored for different keywords can vary greatly E.g. “mining” or “query” vs “knuth” High fan-out when traversing backwards from some
nodes Connection with join ordering
Similar to traversing backwards from all relations that have selections
25
Bidirectional Search: Motivation
26
Bidirectional Search: Intuition
First cut solution: Don’t expand backward if keyword matches
many nodes Instead explore forward from other keywords
Problems Doesn’t deal with high fan-out during search What should cutoff for not expanding be?
Better solution: [Kacholia et al, VLDB 2005] Perform forward search from all nodes reached Prioritize expansion of nodes based on
path weight (as in backward expanding search) + spreading activation
to penalize frequent keyword and bushy trees
27
Bidirectional Search: Example
Query: harper divesh olap
DiveshHarper
OLAP
28
Bidirectional Search (1)
Spreading activation to prioritize backward search (Different from spreading activation for
near queries) Lower weight edges get higher share of
activation Nodes prioritized by sum of activations
Single forward iterator
29
Bidirectional Search (2)
Forward search iterator Forward search from all nodes reached by
backward search Track best forward path to each keyword
Initially infinite cost Whenever this changes, propagate cost change to
all affected ancestors
k2k1
2,∞
∞,∞∞,2
2,2
30
Bidirectional Search (3)
On each path length update (due to backward or forward search) Check if node can reach all keywords If so, add it to output heap
When to output nodes from heap For each keyword Ki, track Mi
Mi: minimum path length to Ki among all yet-to-be-explored nodes in backward search tree
Edge score bounds: What is the best possible edge score of a future
answer? Bounds similar to NRA algorithm (Fagin) Cheaper bounds (e.g. 1/Max(Mi)) or
heuristics (e.g. 1/Sum(Mi)) can be used Output answer if its score is > overall score
upper bound for future answers
31
Performance
Worst case complexity: polynomial in size of graph But for typical (average) case, even linear is
too expensive Intuition: typical query should access only
small part of graph Studied experimentally
Datasets: DBLP, IMDB, US Patent Queries: manually created Typical cases
< 1 second to generate answer 10K-100K nodes explored
32
Performance Results Two versions of backward search:
Iterator per node (MI-Bkwd) vs Iterator per keyword (SI-Bkwd)
Origin size: number of nodes matching keywords
Time ratio MI/SI
Very minor loss in recall
33
Performance Results
SI-Bkwd versus Bidirectional search
Bidirectional search gain increases with origin size, # keywords
34
Related Work (1)
Publish as document approach Gather related data into a (virtual) document
and index the document (Su/Widom, IDEAS05)
Positives Avoids run-time graph search Works well for a class of applications
E.g. Bibliographic data DBLP page per author
Negatives Not all connections can be captured Duplication of data across multiple documents High index space overhead
35
Related Work (2)
DPBF (Ding et al., ICDE07) dynamic programming technique
exact for top-1 answer, heuristic for top-k
BLINKS (He et al.,SIGMOD07) Round-robin expansion across iterators
Optimal within a factor of m, with m keywords Forward index: node to keyword distance
Used instead of searching forward single level index: impractically large space bi-level index: > main memory IO
efficiency?
36
Outline
Motivation and Graph Data Model Query/Answer models
Tree answer model Proximity queries
Graph Search Algorithms Backward Expanding Search Bidirectional Search Search on external memory graphs
Conclusion
37
External Memory Graph Search
Graph representation quite efficient Requires of < 20 bytes per node/edge
Problem: what if graph size > memory? Alternative 1: Virtual Memory
thrashing Alternative 2 (for relational data): SQL
not good for top-K answer generation across multiple SQL queries
Alternative 3: use compressed graph representation to reduce IO
[Dalvi et al, VLDB 2008]
38
Supernodes and Superedges
39
Multi-granular Graph
Dumb algorithm search on supernode graph get k*F answers, expand their supernodes
into memory, search on resultant graph no guarantees on answers
Better idea: use multi-granular graph Supernode graph in memory Some nodes expanded
Expanded nodes are part of cache Algorithms on multi-granular graph (coming
up)
40
Multi-granular Graph
41
Expanding Nodes
Key idea: Edge score of answer containing a supernode is lower bound on actual edge score of any corresponding real answer
42
Iterative Search
Iterative search on multi-granular graph Repeat
search on current multi-granular graph using any search algorithm, to find top results
expand super nodes in top results Until top K answers are all pure
Guarantees finding top-K answers Very good IO efficiency
But high CPU cost due to repeated work Details: nodes expanded above never
evicted from “virtual memory” cache
43
Incremental Search
Idea: when node expanded, incrementally update state of search algorithm to reflect change in multi-granular graph
Run search algorithm until top K answers are all pure
Currently implemented for backward search Modifies the state of the Dijkstra shortest
path algorithm used by backward search One shortest path search iterator per keyword SPI tree: shortest path iterator tree
44
Incremental Search (1)
SPI tree for k1
45
Incremental Search (2)
46
Incremental Search (3)
47
External Memory Search: Performance
Queries
48
External Memory Search: Performance
Supernode graph very effective at minimizing IO Cache misses with incremental often <
# nodes matching keywords Iterative has high CPU cost VM (backward search with cache as
virtual memory) has high IO cost Incremental combines low IO cost
with low CPU cost
49
Conclusions
Keyword search on graphs continues to grow in importance E.g. graph representation of extracted
knowledge in YAGO/NAGA (Max Planck) Ranking is critical
Edge and node weights, spreading activation
Efficient graph search is important In-memory and external-memory
50
Ongoing/Future Work
External memory graph search Compression ratios for supernode graph for
DBLP/IMDB: factor of 5 to 10 Ongoing work on graph clustering shows good
results
Graph search in a parallel cluster Goal: search integrated WWW/Wikipedia graph
New search algorithms Integration with existing applications
To provide more natural display of results, hiding schema details
Authorization
51
BANKS References
Keyword Searching and Browsing in databases using BANKS, Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan ICDE 2002
User Interaction in the BANKS System, Demo paper,B. Aditya, Soumen Chakrabarti, Rushi Desai, Arvind Hulgeri, Hrishikesh Karambelkar, Rupesh Nasre, Parag, S. Sudarshan ICDE 2003
Bidirectional Expansion For Keyword Search on Graph Databases, Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S Sudarshan, Rushi Desai and Hrishikesh Karambelkar,VLDB 2005
Keyword Search on External Memory Data GraphsBhavana Dalvi, Meghana Kshirsagar and S. Sudarshan,VLDB 2008
Thanks!
53
Time and Nodes Explored
BiDir Time
Bidir Nodes
54
Screenshots (1) author (near recovery)
55
Near Queries with Multiple Keywords
Spread activation from each keyword separately
Then combine the activations from different keywords
OR: use addition or belief combination
AND: take product of activations Gives better results
56
The BANKS System
Available on the web, with DBLP, IMDB and IITB ETD data http://www.cse.iitb.ac.in/banks/
No programming needed for customization Minimal preprocessing to create indices and give weights to links
Provides keyword search coupled with extensive browsing features Schema browsing + data browsing Hyperlinks are automatically added to all displayed results Browsing data by grouping and creating crosstabs Graphical display of data: bar charts, pie charts, etc
BANKSUserJDBCHTT
P
Web Server + Servlets
Database
57
BANKS Architecture
Data resident on disk Graph structure of data resident in
memory Nodes and edges with their types/counts
16x|V|+8x|E| bytes Search done in memory
Why Allows us to use interesting graph traversal
based algorithms without being constrained by SQL and related performance issues
With current memory sizes, database graphs for most applications will fit in memory
58
Probabilistic Edge Score Model (2)
Paths from root to leaves are considered separately, even if they share edges More efficient search algorithms with
this models (He et al., SIGMOD07)
0.5 0.5
0.167
1