
Keyword Search on Graph-Structured Data

S. Sudarshan IIT Bombay

Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi K., Arvind Hulgeri, Bhavana Dalvi and Meghana Kshirsagar

Jan 2009

2

Outline

Motivation and Graph Data Model
Query/Answer models
  Tree answer model
  Proximity queries
Graph Search Algorithms
  Backward Expanding Search
  Bidirectional Search
  Search on external memory graphs
Conclusion

3

Keyword Search on Semi-Structured Data

Keyword search of documents on the Web has been enormously successful

Much data is resident in databases
  Organizational, government, scientific, medical data
  Deep web
Goal: querying of data from multiple data sources, with different data models
  Often with no schema or a partially defined schema

4

Keyword Search on Structured/Semi-Structured Data

Key differences from IR/Web search:
  Normalization (implicit/explicit) splits related data across multiple tuples
  To answer a keyword query we need to find a (closely) connected set of entities that together match all given keywords
  E.g. the queries “soumen crawling” or “soumen byron”
[Figure: example data graph with author nodes “Soumen C.”, “Byron Dom” and “Sudarshan”, writes nodes, and paper nodes “Focused Crawling …” and “BANKS: Keyword search…”]

5

Graph Data Model

Lowest common denominator across many data models
  Relational: node = tuple, edge = foreign key
  XML: node = element, edge = containment/idref/keyref
  HTML: node = page, edge = hyperlink
  Documents: node = document, edge = links inferred by data extraction
  Knowledge representation: node = entity, edge = relationship
  Network data: e.g. social networks, communication networks

6

Graph Data Model (Cont)

Nodes can have
  labels, e.g. relation name or XML tag
  textual or structured (attribute-value) data
Edges can have labels
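
As a rough illustration of the relational mapping (node = tuple, edge = foreign key), the following Python sketch builds a small labeled graph and looks up the nodes matching a keyword. The tuple ids, the sample data, and the helpers `build_graph` and `keyword_nodes` are hypothetical and are not part of BANKS.

```python
from collections import defaultdict

def build_graph(tuples, foreign_keys):
    """tuples: {node_id: (relation_name, text)}
    foreign_keys: list of (source_id, target_id, edge_label)."""
    # Each tuple becomes a node labeled with its relation name.
    nodes = {nid: {"label": rel, "text": text} for nid, (rel, text) in tuples.items()}
    # Each foreign key reference becomes a labeled directed edge.
    out_edges = defaultdict(list)
    for src, dst, label in foreign_keys:
        out_edges[src].append((dst, label))
    return nodes, dict(out_edges)

def keyword_nodes(nodes, keyword):
    """Nodes whose text contains the keyword or whose label equals it."""
    kw = keyword.lower()
    return sorted(nid for nid, n in nodes.items()
                  if kw in n["text"].lower() or kw == n["label"].lower())

# Hypothetical sample data modeled on the author/writes/paper example.
tuples = {
    "a1": ("author", "Soumen C."),
    "a2": ("author", "Byron Dom"),
    "p1": ("paper", "Focused Crawling"),
    "w1": ("writes", ""),
    "w2": ("writes", ""),
}
foreign_keys = [
    ("w1", "a1", "author"), ("w1", "p1", "paper"),
    ("w2", "a2", "author"), ("w2", "p1", "paper"),
]
nodes, out_edges = build_graph(tuples, foreign_keys)
```

Matching a relation name (here `writes`) as well as node text mirrors the basic query model, where keywords match node text or labels.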


8

Query/Answer Models

Basic query model: keywords match node text/labels
  Can extend query model with attribute specifications, path specifications
  e.g. paper(year < 2005, title:xquery)
Alternative answer models
  tree connecting nodes matching query keywords
  nodes in proximity to (near) query keywords

9

Tree Answer Model

Answer: rooted, directed tree connecting keyword nodes
  In general, a Steiner tree
  Multiple answers possible
Answer relevance computed from
  answer edge score, combined with
  answer node score
E.g. “Soumen Byron”
[Figure: answer tree connecting the paper “Focused Crawling” through two writes nodes to the authors “Soumen C.” and “Byron Dom”]

10

Answer Ranking

Naïve model: answers ranked by number of edges
Problem:
  Some tuples are connected to many other tuples
    E.g. highly cited papers, popular web sites
  Highly connected tuples create misleading shortcuts
    six degrees of separation
Solution: use directed edges with edge weights
  allow answer tree to have edge u→v if original graph has v→u, but at higher cost

11

Edge Weight Model-1

Forward edge weight (edge present in data)
  Defaults to 1, can be based on schema
  Lower weight ⇒ closer connection
Create extra backward edge v→u for each edge u→v present in data
  Edge weight log(1 + #edges pointing to v)
Overall answer-tree edge score EA = 1 / (Σ edge weights)
  Higher score ⇒ better result
[Figure: example graph with edge weights 1, 1, 1 and 3, 3, 3]
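
The weighting scheme of Model-1 can be sketched in Python as follows. This is an illustration under assumptions, not the BANKS implementation: the slide does not state the base of the logarithm (base 2 is assumed here), the choice of keeping the smaller weight when an edge already exists in that direction is an assumption, and the node names and helper functions are hypothetical.

```python
import math
from collections import defaultdict

def add_backward_edges(forward_edges):
    """forward_edges: list of (u, v, weight) present in the data.
    Returns {(u, v): weight} including one backward edge v->u per forward edge u->v."""
    in_degree = defaultdict(int)
    for _, v, _ in forward_edges:
        in_degree[v] += 1
    weights = {}
    for u, v, w in forward_edges:
        weights[(u, v)] = min(w, weights.get((u, v), math.inf))
    for u, v, _ in forward_edges:
        # Backward edge cost grows with the number of edges pointing to v
        # (log base 2 is an assumption).
        back = math.log2(1 + in_degree[v])
        # Assumption: keep the smaller weight if the edge already exists.
        weights[(v, u)] = min(back, weights.get((v, u), math.inf))
    return weights

def edge_score(tree_edges, weights):
    """EA = 1 / (sum of edge weights); a higher score means a better answer."""
    return 1.0 / sum(weights[e] for e in tree_edges)

# Three hypothetical writes tuples pointing at one paper.
forward = [("w1", "p1", 1.0), ("w2", "p1", 1.0), ("w3", "p1", 1.0)]
weights = add_backward_edges(forward)
score = edge_score([("p1", "w1"), ("p1", "w2")], weights)
```

With three edges pointing to `p1`, each backward edge out of `p1` costs log2(4) = 2, so a tree using two of them has total weight 4 and edge score 0.25.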

12

Edge Weight Model -2

Probabilistic edge scoring model
  Edge traversal probability (from a given node):
    Forward: 1/out-degree
    Backward: 1/in-degree
    Can be weighted by edge type
  Path weight = probability of following each edge in path
  Edge score = log(edge traversal probability)
  Answer-tree edge score EA = (harmonic) mean of path weights from root to each leaf
Note: other edge weight models are possible
  our search algorithms are independent of how edge weights are computed
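
A small Python sketch of the probabilistic model follows. It assumes that the path weight is the product of the per-edge traversal probabilities and uses the harmonic mean over root-to-leaf paths; the degrees in the example and all function names are hypothetical.

```python
import math
from statistics import harmonic_mean

def traversal_probability(degree):
    """Probability of following one particular edge out of `degree` candidates
    (out-degree for forward edges, in-degree for backward edges)."""
    return 1.0 / degree

def path_weight(edge_probabilities):
    """Probability of following every edge on the path (product of probabilities)."""
    return math.prod(edge_probabilities)

def path_log_score(edge_probabilities):
    """Sum of per-edge log scores; equals the log of the path weight."""
    return sum(math.log(p) for p in edge_probabilities)

def tree_edge_score(root_to_leaf_paths):
    """EA as the harmonic mean of the root-to-leaf path weights."""
    return harmonic_mean([path_weight(p) for p in root_to_leaf_paths])

# Two root-to-leaf paths with made-up degrees.
paths = [
    [traversal_probability(2), traversal_probability(1)],  # path weight 0.5
    [traversal_probability(2), traversal_probability(4)],  # path weight 0.125
]
score = tree_edge_score(paths)
```

Working with log scores turns the product along a path into a sum, which is what lets shortest-path style algorithms operate on this model.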

13

Node Weight

Node prestige based on indegree
  More incoming edges ⇒ higher prestige
  PageRank-style transfer of prestige
    Node weight computed using biased random walk model
Node weight: function of node prestige, other optional criteria such as TF/IDF
Answer-tree node score NA = root node weight + leaf node weights
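
The slide does not spell out the bias used in the random walk, so the sketch below substitutes a plain PageRank-style power iteration as a stand-in for node prestige. The damping factor, iteration count, uniform handling of nodes without outgoing edges, and function names are all assumptions.

```python
def prestige(nodes, out_edges, damping=0.85, iterations=50):
    """PageRank-style prestige; out_edges maps a node to a list of target nodes."""
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            targets = out_edges.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    nxt[v] += share
            else:
                # Node without outgoing edges: spread its prestige uniformly.
                for v in nodes:
                    nxt[v] += damping * rank[u] / n
        rank = nxt
    return rank

def answer_node_score(rank, root, leaves):
    """NA = root node weight + leaf node weights (prestige used as node weight)."""
    return rank[root] + sum(rank[leaf] for leaf in leaves)

nodes = ["a", "b", "c"]
rank = prestige(nodes, {"a": ["c"], "b": ["c"]})
na = answer_node_score(rank, "c", ["a", "b"])
```

In this toy graph `c` has two incoming edges and ends up with the highest prestige, matching the rule that more incoming edges give higher prestige.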

14

Overall Tree Answer Score

Overall score of answer tree A
  combines tree (edge) score and node score
  for details, and recall/precision metrics, see BANKS papers in ICDE 2002 and VLDB 2005
Anecdotal results on DBLP Bibliography
  “Transaction”: Jim Gray’s classic paper and textbook at the top because of prestige (# of citations)
  “soumen sudarshan”: several coauthored papers, links via common co-authors
  “goldman shivakumar hector”: the VLDB 98 proximity search paper, followed by citation/co-author connections

15

Answer Models

Tree Answer Model
Proximity (near query) model

16

Proximity Queries

Node weight by proximity
  author (near olap) (on DBLP)
  faculty (near earthquake) (on IITB thesis database)
Node prestige is higher if close to multiple nodes matching near keywords
Example applications
  Finding experts on a particular area
[Figure: authors “Raghu” and “Widom” connected to papers “OLAP over uncertain ..”, “Computing sparse cubes…”, “Overview of OLAP…”, “Allocation in OLAP …”]

17

Proximity via Spreading Activation

Idea:
  Each “near” keyword has activation of 1
    Divided among nodes matching keyword, proportional to their node prestige
  Each node keeps fraction 1-μ of its received activation and spreads fraction μ amongst its neighbors
  Combine activation ai received from neighbors
    a = 1 – Π(1-ai) (belief function)
Graph may have cycles
  Iterate till convergence
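
One possible reading of the spreading activation steps above is sketched below in Python. It passes activation along as messages and stops forwarding once a share drops below a small threshold `eps`, which stands in for "iterate till convergence"; that cutoff, the equal split among neighbors, and all names are assumptions rather than the BANKS implementation.

```python
from collections import deque

def spread_activation(neighbors, prestige, matches, mu=0.5, eps=1e-4):
    """neighbors: {node: [adjacent nodes]}; matches: nodes matching the near keyword."""
    total = sum(prestige[n] for n in matches)
    activation = {}
    # The keyword's activation of 1 is divided in proportion to node prestige.
    queue = deque((n, prestige[n] / total) for n in matches)
    while queue:
        node, received = queue.popleft()
        kept = (1.0 - mu) * received
        prev = activation.get(node, 0.0)
        # Belief-function combination: a = 1 - (1 - a_prev) * (1 - a_new).
        activation[node] = 1.0 - (1.0 - prev) * (1.0 - kept)
        nbrs = neighbors.get(node, [])
        if nbrs:
            share = mu * received / len(nbrs)
            # Shares shrink by a factor of mu per hop, so this terminates on cycles.
            if share >= eps:
                queue.extend((nb, share) for nb in nbrs)
    return activation

# Two papers matching the near keyword, both adjacent to the same author.
neighbors = {"p1": ["a1"], "p2": ["a1"], "a1": []}
prestige_of = {"p1": 1.0, "p2": 1.0, "a1": 1.0}
act = spread_activation(neighbors, prestige_of, ["p1", "p2"])
```

Here each paper receives 0.5 and keeps 0.25; the author receives 0.25 from each paper, keeps 0.125 of each, and combines them to 1 – (0.875)² = 0.234375.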

18

Example Answers

Anecdotal results on DBLP Bibliography
  author (near recovery): Dave Lomet, C. Mohan, etc.
  sudarshan (near change): Sudarshan Chawathe
  sudarshan (near query): S. Sudarshan
Queries can combine proximity scores with tree scores
  hector sudarshan (near query) vs. hector sudarshan
  author (near transactions) data integration

19

Related Work

Proximity Search
  Goldman, Shivakumar, Venkatasubramanian and Garcia-Molina [VLDB98]
  Considers only shortest path from each node, aggregates across nodes
  Our version aggregates evidence from alternative paths
    E.g. author (near “Surajit Chaudhuri”)
Object Rank [VLDB04]
  Similar idea to ours, precomputed

20

Related Work

Keyword querying on relational databases
  DBExplorer (Microsoft, ICDE02), Discover (UCSD, VLDB02, VLDB03)
  Use SQL generation, not applicable to arbitrary graphs
  ranking based only on #nodes/edges
Keyword querying on XML: Tree Model
  XRank (Cornell, SIGMOD03), proximity in XML (AT&T Research, VLDB03), Schema-Free XQuery (Michigan, VLDB04)
  Tree model is too limited
Keyword querying on XML: Graph Model
  XKeyword (UCSD, ICDE03, VLDB03), SphereSearch (MaxPlanck, VLDB05)
  ranking based only on #nodes/edges


22

Finding Answer Trees

Backward Expanding Search Algorithm (Bhalotia et al, ICDE02):
  Intuition: find vertices from which a forward path exists to at least one node from each Si, where Si is the set of nodes matching keyword i
  Run concurrent single source shortest path algorithm from each node matching a keyword
    Create an iterator for each node matching a keyword
    Traverse the graph edges in reverse direction
    Output next nearest node on each get-next() call
  Do best-first search across iterators
    Output an answer when its root has been reached from each keyword
  Answer heap to collect and output results in score order

23

Backward Expanding Search

Query: soumen byron
[Figure: backward expansion from the author nodes “Soumen C.” and “Byron Dom” through writes nodes to the paper “Focused Crawling”]

24

Backward Exp. Search Limitations

Too many iterators (one per keyword node)
  Solution: single iterator per keyword (SI-Bkwd search)
    tracks shortest path from node to keyword
  Changes answer set slightly
    Different justifications for same root may be lost
    Not a big problem in practice
Nodes explored for different keywords can vary greatly
  E.g. “mining” or “query” vs “knuth”
High fan-out when traversing backwards from some nodes
Connection with join ordering
  Similar to traversing backwards from all relations that have selections

25

Bidirectional Search: Motivation

26

Bidirectional Search: Intuition

First cut solution:
  Don’t expand backward if keyword matches many nodes
  Instead explore forward from other keywords
Problems
  Doesn’t deal with high fan-out during search
  What should the cutoff for not expanding be?
Better solution: [Kacholia et al, VLDB 2005]
  Perform forward search from all nodes reached
  Prioritize expansion of nodes based on
    path weight (as in backward expanding search) +
    spreading activation, to penalize frequent keywords and bushy trees

27

Bidirectional Search: Example

Query: harper divesh olap
[Figure: data graph with nodes matching “Divesh”, “Harper” and “OLAP”]

28

Bidirectional Search (1)

Spreading activation to prioritize backward search
  (Different from spreading activation for near queries)
  Lower weight edges get higher share of activation
  Nodes prioritized by sum of activations
Single forward iterator

29


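Forward search iterator
  Forward search from all nodes reached by backward search
  Track best forward path to each keyword
    Initially infinite cost
    Whenever this changes, propagate cost change to all affected ancestors
[Figure: example graph for keywords k1 and k2; nodes are annotated with pairs of best path costs to the two keywords: 2,∞; ∞,∞; ∞,2; 2,2]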

30


On each path length update (due to backward or forward search)
  Check if node can reach all keywords
  If so, add it to output heap
When to output nodes from heap
  For each keyword Ki, track Mi
    Mi: minimum path length to Ki among all yet-to-be-explored nodes in backward search tree
  Edge score bounds:
    What is the best possible edge score of a future answer?
    Bounds similar to NRA algorithm (Fagin)
    Cheaper bounds (e.g. 1/Max(Mi)) or heuristics (e.g. 1/Sum(Mi)) can be used
  Output answer if its score is > overall score upper bound for future answers
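
The output rule can be illustrated with the cheaper 1/Max(Mi) bound. The Python sketch below considers only the edge score and ignores the node score component of the overall bound; the heap layout and the function name `releasable` are assumptions made for illustration.

```python
import heapq

def releasable(answer_heap, frontier_min):
    """answer_heap: heap of (-edge_score, answer) so the best answer is on top.
    frontier_min: M_i per keyword, the minimum path length to keyword i among
    nodes not yet explored. Pops and returns answers that beat the bound."""
    worst = max(frontier_min)
    # Any future answer needs path length >= max(M_i), so its edge score <= 1/max(M_i).
    bound = 1.0 / worst if worst > 0 else float("inf")
    released = []
    while answer_heap and -answer_heap[0][0] > bound:
        neg_score, answer = heapq.heappop(answer_heap)
        released.append((answer, -neg_score))
    return released

heap = []
heapq.heappush(heap, (-0.5, "A"))
heapq.heappush(heap, (-0.2, "B"))
out = releasable(heap, [2.0, 4.0])   # bound = 1 / 4 = 0.25
```

Answer `A` (score 0.5) is released because no future answer can exceed 0.25, while `B` (score 0.2) stays in the heap until the frontier moves further out.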

31

Performance

Worst case complexity: polynomial in size of graph
  But for typical (average) case, even linear is too expensive
  Intuition: typical query should access only a small part of graph
Studied experimentally
  Datasets: DBLP, IMDB, US Patent
  Queries: manually created
  Typical cases
    < 1 second to generate answer
    10K-100K nodes explored

32

Performance Results

Two versions of backward search:
  Iterator per node (MI-Bkwd) vs iterator per keyword (SI-Bkwd)
  Origin size: number of nodes matching keywords
  Time ratio MI/SI
Very minor loss in recall

33

Performance Results

SI-Bkwd versus Bidirectional search

Bidirectional search gain increases with origin size, # keywords

34

Related Work (1)

Publish as document approach
  Gather related data into a (virtual) document and index the document (Su/Widom, IDEAS05)
Positives
  Avoids run-time graph search
  Works well for a class of applications
    E.g. bibliographic data: DBLP page per author
Negatives
  Not all connections can be captured
  Duplication of data across multiple documents
  High index space overhead

35

Related Work (2)

DPBF (Ding et al., ICDE07)
  dynamic programming technique
  exact for top-1 answer, heuristic for top-k
BLINKS (He et al., SIGMOD07)
  Round-robin expansion across iterators
    Optimal within a factor of m, with m keywords
  Forward index: node to keyword distance
    Used instead of searching forward
    single level index: impractically large space
    bi-level index: > main memory, IO efficiency?


37

External Memory Graph Search

Graph representation quite efficient
  Requires < 20 bytes per node/edge
Problem: what if graph size > memory?
  Alternative 1: virtual memory
    thrashing
  Alternative 2 (for relational data): SQL
    not good for top-K answer generation across multiple SQL queries
  Alternative 3: use compressed graph representation to reduce IO
    [Dalvi et al, VLDB 2008]

38

Supernodes and Superedges

39

Multi-granular Graph

Dumb algorithm
  search on supernode graph
  get k*F answers, expand their supernodes into memory, search on resultant graph
  no guarantees on answers
Better idea: use multi-granular graph
  Supernode graph in memory
  Some nodes expanded
    Expanded nodes are part of cache
  Algorithms on multi-granular graph (coming up)

40

Multi-granular Graph

41

Expanding Nodes

Key idea: Edge score of answer containing a supernode is lower bound on actual edge score of any corresponding real answer

42

Iterative Search

Iterative search on multi-granular graph
  Repeat
    search on current multi-granular graph using any search algorithm, to find top results
    expand supernodes in top results
  Until top K answers are all pure (contain no supernodes)
Guarantees finding top-K answers
  Very good IO efficiency
  But high CPU cost due to repeated work
Details: nodes expanded above are never evicted from “virtual memory” cache
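
The repeat/until loop of iterative search can be sketched as a small driver in Python. The callbacks `search`, `expand` and `is_supernode` are hypothetical placeholders for the search algorithm and the multi-granular graph, and the toy stubs below exist only to exercise the loop.

```python
def iterative_search(search, expand, is_supernode, k):
    """Repeat: search the current multi-granular graph, expand the supernodes
    appearing in the top results, until all top-k answers are pure."""
    rounds = 0
    while True:
        rounds += 1
        results = search(k)                       # any search algorithm
        to_expand = {n for answer in results for n in answer if is_supernode(n)}
        if not to_expand:                         # every answer is pure
            return results, rounds
        for supernode in to_expand:
            expand(supernode)                     # expanded nodes stay cached

# Toy stand-ins: one supernode "S1" hides the real node "x".
state = {"expanded": set()}

def toy_search(k):
    answer = ["x", "a"] if "S1" in state["expanded"] else ["S1", "a"]
    return [answer][:k]

def toy_expand(supernode):
    state["expanded"].add(supernode)

def toy_is_supernode(node):
    return node.startswith("S") and node not in state["expanded"]

results, rounds = iterative_search(toy_search, toy_expand, toy_is_supernode, k=1)
```

Each round restarts the search from scratch, which is the source of the high CPU cost noted on the slide.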

43

Incremental Search

Idea: when node expanded, incrementally update state of search algorithm to reflect change in multi-granular graph

Run search algorithm until top K answers are all pure

Currently implemented for backward search
  Modifies the state of the Dijkstra shortest path algorithm used by backward search
  One shortest path search iterator per keyword
  SPI tree: shortest path iterator tree

44

Incremental Search (1)

SPI tree for k1

45


46


47

External Memory Search: Performance

Queries

48

External Memory Search: Performance

Supernode graph very effective at minimizing IO
  Cache misses with incremental often < # nodes matching keywords
Iterative has high CPU cost
VM (backward search with cache as virtual memory) has high IO cost
Incremental combines low IO cost with low CPU cost

49

Conclusions

Keyword search on graphs continues to grow in importance
  E.g. graph representation of extracted knowledge in YAGO/NAGA (Max Planck)
Ranking is critical
  Edge and node weights, spreading activation
Efficient graph search is important
  In-memory and external-memory

50

Ongoing/Future Work

External memory graph search
  Compression ratios for supernode graph for DBLP/IMDB: factor of 5 to 10
  Ongoing work on graph clustering shows good results
Graph search in a parallel cluster
  Goal: search integrated WWW/Wikipedia graph
New search algorithms
Integration with existing applications
  To provide more natural display of results, hiding schema details
Authorization

51

BANKS References

Keyword Searching and Browsing in Databases using BANKS, Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan, ICDE 2002

User Interaction in the BANKS System (demo paper), B. Aditya, Soumen Chakrabarti, Rushi Desai, Arvind Hulgeri, Hrishikesh Karambelkar, Rupesh Nasre, Parag, S. Sudarshan, ICDE 2003

Bidirectional Expansion For Keyword Search on Graph Databases, Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan, Rushi Desai and Hrishikesh Karambelkar, VLDB 2005

Keyword Search on External Memory Data Graphs, Bhavana Dalvi, Meghana Kshirsagar and S. Sudarshan, VLDB 2008

Thanks!

53

Time and Nodes Explored

[Figure: panels “BiDir Time” and “BiDir Nodes”]

54

Screenshots (1) author (near recovery)

55

Near Queries with Multiple Keywords

Spread activation from each keyword separately

Then combine the activations from different keywords

  OR: use addition or belief combination
  AND: take product of activations
    Gives better results
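
The two ways of combining per-keyword activations can be written down directly. In this Python sketch a node that received nothing from a keyword is treated as having activation 0 for it, and the OR case uses the belief combination rather than plain addition; the function name and the sample values are hypothetical.

```python
import math

def combine_activations(per_keyword, mode="and"):
    """per_keyword: list of {node: activation}, one dict per near keyword."""
    nodes = set().union(*per_keyword)
    combined = {}
    for node in nodes:
        values = [acts.get(node, 0.0) for acts in per_keyword]
        if mode == "and":
            # AND: product of activations; zero if any keyword is missing.
            combined[node] = math.prod(values)
        else:
            # OR: belief combination 1 - product of (1 - a_i).
            combined[node] = 1.0 - math.prod(1.0 - a for a in values)
    return combined

per_keyword = [{"n1": 0.5, "n2": 0.5}, {"n1": 0.5}]
and_scores = combine_activations(per_keyword, mode="and")
or_scores = combine_activations(per_keyword, mode="or")
```

With AND, node `n2` drops to 0 because it is near only one keyword, which is one way to see why the product favors nodes close to all the near keywords.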

56

The BANKS System

Available on the web, with DBLP, IMDB and IITB ETD data
  http://www.cse.iitb.ac.in/banks/
No programming needed for customization
  Minimal preprocessing to create indices and give weights to links
Provides keyword search coupled with extensive browsing features
  Schema browsing + data browsing
  Hyperlinks are automatically added to all displayed results
  Browsing data by grouping and creating crosstabs
  Graphical display of data: bar charts, pie charts, etc.
[Figure: architecture diagram with User, HTTP, Web Server + Servlets, BANKS, JDBC and Database]

57

BANKS Architecture

Data resident on disk
Graph structure of data resident in memory
  Nodes and edges with their types/counts
    16×|V| + 8×|E| bytes
Search done in memory
Why?
  Allows us to use interesting graph traversal based algorithms without being constrained by SQL and related performance issues
  With current memory sizes, database graphs for most applications will fit in memory
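
The 16×|V| + 8×|E| estimate is easy to evaluate for a given graph. The node and edge counts in this Python snippet are made-up round numbers for illustration, not measurements of any BANKS dataset.

```python
def graph_memory_bytes(num_nodes, num_edges, node_bytes=16, edge_bytes=8):
    """In-memory graph size following the slide's 16*|V| + 8*|E| estimate."""
    return node_bytes * num_nodes + edge_bytes * num_edges

# Hypothetical graph: 1 million nodes and 5 million edges.
size = graph_memory_bytes(1_000_000, 5_000_000)
size_mb = size / 1_000_000
```

For these hypothetical counts the structure needs 56,000,000 bytes, i.e. 56 MB.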

58

Probabilistic Edge Score Model (2)

Paths from root to leaves are considered separately, even if they share edges
More efficient search algorithms with this model (He et al., SIGMOD07)
[Figure: example tree with edge traversal probabilities 0.5, 0.5, 0.167 and 1]
