keyword search on graph-structured data s. sudarshan iit bombay joint work with soumen chakrabarti,...

58
Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Na Rushi Desai, Hrishi K., Arvind Hulgeri, Bhavana and Meghana Kshirsagar Jan 2009

Upload: catherine-armstrong

Post on 21-Jan-2016

221 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

Keyword Search on Graph-Structured Data

S. Sudarshan IIT Bombay

Joint work withSoumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe,Rushi Desai, Hrishi K., Arvind Hulgeri, Bhavana Dalvi and Meghana Kshirsagar

Jan 2009

Page 2: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

2

Outline

Motivation and Graph Data Model Query/Answer models

Tree answer model Proximity queries

Graph Search Algorithms Backward Expanding Search Bidirectional Search Search on external memory graphs

Conclusion

Page 3: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

3

Keyword Search on Semi-Structured Data

Keyword search of documents on the Web has been enormously successful

Much data is resident in databases Organizational, government, scientific,

medical data Deep web

Goal: querying of data from multiple data sources, with different data models Often with no schema or partially

defined schema

Page 4: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

4

Keyword Search on Structured/Semi-Structured Data

Key differences from IR/Web Search: Normalization (implicit/explicit) splits related

data across multiple tuples To answer a keyword query we need to find a

(closely) connected set of entities that together match all given keywords

soumen crawling or soumen byron

Focused Crawling …

Soumen C. Byron Dom

writes

author

paper

Sudarshan

BANKS: Keyword search…

Page 5: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

5

Graph Data Model

Lowest common denominator across many data models Relational

node = tuple, edge = foreign key XML

Node = element, edge = containment/idref/keyref HTML

node = page, edge = hyperlink Documents

node = document, edge = links inferred by data extraction

Knowledge representation node = entity, edge = relationship

Network data e.g. social networks, communication networks

Page 6: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

6

Graph Data Model (Cont)

Nodes can have labels

E.g. relation name, or XML tag textual or structured (attribute-value)

data Edges can have labels

Page 7: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

7

Outline

Motivation and Graph Data Model Query/Answer models

Tree answer model Proximity queries

Graph Search Algorithms Backward Expanding Search Bidirectional Search Search on external memory graphs

Conclusion

Page 8: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

8

Query/Answer Models

Basic query model: Keywords match node text/labels

Can extend query model with attribute specification, path specifications

e.g. paper(year < 2005, title:xquery),

Alternative answer models tree connecting nodes matching query

keywords nodes in proximity to (near) query

keywords

Page 9: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

9

Tree Answer Model

Answer: Rooted, directed tree connecting keyword nodes In general, a Steiner

tree Multiple answers

possible Answer relevance

computed from answer edge score

combined with answer node score

Focused Crawling

Soumen C. Byron Dom

writes writes

author author

paper

Eg. “Soumen Byron”

Page 10: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

10

Answer Ranking Naïve model: answers ranked by number

of edges Problem:

Some tuples are connected to many other tuples

E.g. highly cited papers, popular web sites Highly connected tuples create misleading

shortcuts six degrees of separation

Solution: use directed edges with edge weights allow answer tree to have edge uv if original

graph has vu, but at higher cost

Page 11: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

11

Edge Weight Model-1

Forward edge weight (edge present in data) Default to 1, can be based on schema Lower weight closer connection

Create extra backward edges vu for each edge uv present in data

Edge weight log(1+#edges pointing to v) Overall Answer-tree Edge Score EA = 1/ ( edge weights)

Higher score better result

1

1

1

3

3

3

Page 12: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

12

Edge Weight Model -2

Probabilistic edge scoring model Edge traversal probability (from a given node):

Forward 1/out-degree Backward 1/in-degree Can be weighted by edge type

Path weight = probability of following each edge in path

Edge score = log(edge traversal probability) Answer-tree Edge Score EA = (harmonic)

mean of path weights from root to each leaf

Note: other edge weight models possible our search algorithms are independent of how edge

weights are computed

Page 13: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

13

Node Weight Node prestige based on indegree

More incoming edges higher prestige PageRank style transfer of prestige

Node weight computing using biased random walk model

Node weight: function of node prestige, other optional criteria such as TF/IDF

Answer-tree Node score NA = root node weight + leaf node weights

Page 14: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

14

Overall Tree Answer Score

Overall score of answer tree A combine tree and node scores for details, and recall/precision metrics see

BANKS papers in ICDE 2002 and VLDB 2005 Anecdotal results on DBLP Bibliography

“Transaction”: Jim Gray’s classic paper and textbook at the top because of prestige (# of citations)

“soumen sudarshan”: several coauthored papers, links via common co-authors

“goldman shivakumar hector”: The VLDB 98 proximity search paper, followed by citation/co-author connections

Page 15: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

15

Answer Models

Tree Answer Model Proximity (near query) model

Page 16: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

16

Proximity Queries

Node weight by proximity author (near olap) (on DBLP) faculty (near earthquake) (on IITB thesis

database) Node prestige > if close to multiple

nodes matching near keywords Example applications

Finding experts on a particular area

RaghuWidomOLAP over uncertain ..

Computing sparse cubes…Overview of OLAP…Allocation in OLAP …

Page 17: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

17

Proximity via Spreading Activation

Idea: Each “near” keyword has activation of 1

Divided among nodes matching keyword, proportional to their node prestige

Each node keeps fraction 1-μ of its received activation and spreads fraction μ amongst its neighbors

Combine activation ai received from neighbors

a = 1 – Π(1-ai) (belief function)

Graph may have cycles Iterate till convergence

Page 18: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

18

Example Answers

Anecdotal results on DBLP Bibliography author (near recovery): Dave Lomet, C.

Mohan, etc sudarshan(near change): Sudarshan

Chawate sudarshan(near query): S. Sudarshan

Queries can combine proximity scores with tree scores hector sudarshan(near query) vs.

hector sudarshan author(near transactions) data integration

Page 19: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

19

Related Work

Proximity Search Goldman, Shivakumar,

Venkatasubramanian and Garcia-Molina [VLDB98]

Considers only shortest path from each node, aggregates across nodes

Our version aggregates evidence from alternative paths

E.g. author (near “Surajit Chaudhuri”)

Object Rank [VLDB04] Similar idea to ours, precomputed

Page 20: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

20

Related Work

Keyword querying on relational databases DBExplorer (Microsoft, ICDE02) Discover (UCSD,

VLDB02, VLDB03), Use SQL generation, not applicable to arbitrary graphs ranking based only on #nodes/edges

Keyword querying on XML : Tree Model XRank (Cornell, SIGMOD03), proximity in XML (AT&T

Research, VLDB03), Schema-Free XQuery (Michigan, VLDB04),

Tree model is too limited Keyword querying on XML: Graph Model

XKeyword (UCSD, ICDE03, VLDB03), SphereSearch (MaxPlanck, VLDB05)

ranking based only on #nodes/edges

Page 21: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

21

Outline

Motivation and Graph Data Model Query/Answer models

Tree answer model Proximity queries

Graph Search Algorithms Backward Expanding Search Bidirectional Search Search on external memory graphs

Conclusion

Page 22: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

22

Finding Answer Trees

Backward Expanding Search Algorithm (Bhalotia et al, ICDE02): Intuition: find vertices from which a forward

path exists to at least one node from each Si. Run concurrent single source shortest path

algorithm from each node matching a keyword Create an iterator for each node matching a keyword

Traverse the graph edges in reverse direction Output next nearest node on each get-next() call

Do best-first search across iterators Output an answer when its root has been reached

from each keyword Answer heap to collect and output results in score order

Page 23: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

23

Backward Expanding Search

Soumen C. Byron Domauthors

Focused Crawlingpaper

Query: soumen byron

writes

Page 24: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

24

Backward Exp. Search Limitations

Too many iterators (one per keyword node) Solution: single iterator per keyword (SI-Bkwd search)

tracks shortest path from node to keyword Changes answer set slightly

Different justifications for same root may be lost Not a big problem in practice

Nodes explored for different keywords can vary greatly E.g. “mining” or “query” vs “knuth” High fan-out when traversing backwards from some

nodes Connection with join ordering

Similar to traversing backwards from all relations that have selections

Page 25: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

25

Bidirectional Search: Motivation

Page 26: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

26

Bidirectional Search: Intuition

First cut solution: Don’t expand backward if keyword matches

many nodes Instead explore forward from other keywords

Problems Doesn’t deal with high fan-out during search What should cutoff for not expanding be?

Better solution: [Kacholia et al, VLDB 2005] Perform forward search from all nodes reached Prioritize expansion of nodes based on

path weight (as in backward expanding search) + spreading activation

to penalize frequent keyword and bushy trees

Page 27: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

27

Bidirectional Search: Example

Query: harper divesh olap

DiveshHarper

OLAP

Page 28: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

28

Bidirectional Search (1)

Spreading activation to prioritize backward search (Different from spreading activation for

near queries) Lower weight edges get higher share of

activation Nodes prioritized by sum of activations

Single forward iterator

Page 29: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

29

Bidirectional Search (2)

Forward search iterator Forward search from all nodes reached by

backward search Track best forward path to each keyword

Initially infinite cost Whenever this changes, propagate cost change to

all affected ancestors

k2k1

2,∞

∞,∞∞,2

2,2

Page 30: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

30

Bidirectional Search (3)

On each path length update (due to backward or forward search) Check if node can reach all keywords If so, add it to output heap

When to output nodes from heap For each keyword Ki, track Mi

Mi: minimum path length to Ki among all yet-to-be-explored nodes in backward search tree

Edge score bounds: What is the best possible edge score of a future

answer? Bounds similar to NRA algorithm (Fagin) Cheaper bounds (e.g. 1/Max(Mi)) or

heuristics (e.g. 1/Sum(Mi)) can be used Output answer if its score is > overall score

upper bound for future answers

Page 31: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

31

Performance

Worst case complexity: polynomial in size of graph But for typical (average) case, even linear is

too expensive Intuition: typical query should access only

small part of graph Studied experimentally

Datasets: DBLP, IMDB, US Patent Queries: manually created Typical cases

< 1 second to generate answer 10K-100K nodes explored

Page 32: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

32

Performance Results Two versions of backward search:

Iterator per node (MI-Bkwd) vs Iterator per keyword (SI-Bkwd)

Origin size: number of nodes matching keywords

Time ratio MI/SI

Very minor loss in recall

Page 33: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

33

Performance Results

SI-Bkwd versus Bidirectional search

Bidirectional search gain increases with origin size, # keywords

Page 34: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

34

Related Work (1)

Publish as document approach Gather related data into a (virtual) document

and index the document (Su/Widom, IDEAS05)

Positives Avoids run-time graph search Works well for a class of applications

E.g. Bibliographic data DBLP page per author

Negatives Not all connections can be captured Duplication of data across multiple documents High index space overhead

Page 35: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

35

Related Work (2)

DPBF (Ding et al., ICDE07) dynamic programming technique

exact for top-1 answer, heuristic for top-k

BLINKS (He et al.,SIGMOD07) Round-robin expansion across iterators

Optimal within a factor of m, with m keywords Forward index: node to keyword distance

Used instead of searching forward single level index: impractically large space bi-level index: > main memory IO

efficiency?

Page 36: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

36

Outline

Motivation and Graph Data Model Query/Answer models

Tree answer model Proximity queries

Graph Search Algorithms Backward Expanding Search Bidirectional Search Search on external memory graphs

Conclusion

Page 37: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

37

External Memory Graph Search

Graph representation quite efficient Requires of < 20 bytes per node/edge

Problem: what if graph size > memory? Alternative 1: Virtual Memory

thrashing Alternative 2 (for relational data): SQL

not good for top-K answer generation across multiple SQL queries

Alternative 3: use compressed graph representation to reduce IO

[Dalvi et al, VLDB 2008]

Page 38: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

38

Supernodes and Superedges

Page 39: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

39

Multi-granular Graph

Dumb algorithm search on supernode graph get k*F answers, expand their supernodes

into memory, search on resultant graph no guarantees on answers

Better idea: use multi-granular graph Supernode graph in memory Some nodes expanded

Expanded nodes are part of cache Algorithms on multi-granular graph (coming

up)

Page 40: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

40

Multi-granular Graph

Page 41: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

41

Expanding Nodes

Key idea: Edge score of answer containing a supernode is lower bound on actual edge score of any corresponding real answer

Page 42: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

42

Iterative Search

Iterative search on multi-granular graph Repeat

search on current multi-granular graph using any search algorithm, to find top results

expand super nodes in top results Until top K answers are all pure

Guarantees finding top-K answers Very good IO efficiency

But high CPU cost due to repeated work Details: nodes expanded above never

evicted from “virtual memory” cache

Page 43: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

43

Incremental Search

Idea: when node expanded, incrementally update state of search algorithm to reflect change in multi-granular graph

Run search algorithm until top K answers are all pure

Currently implemented for backward search Modifies the state of the Dijkstra shortest

path algorithm used by backward search One shortest path search iterator per keyword SPI tree: shortest path iterator tree

Page 44: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

44

Incremental Search (1)

SPI tree for k1

Page 45: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

45

Incremental Search (2)

Page 46: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

46

Incremental Search (3)

Page 47: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

47

External Memory Search: Performance

Queries

Page 48: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

48

External Memory Search: Performance

Supernode graph very effective at minimizing IO Cache misses with incremental often <

# nodes matching keywords Iterative has high CPU cost VM (backward search with cache as

virtual memory) has high IO cost Incremental combines low IO cost

with low CPU cost

Page 49: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

49

Conclusions

Keyword search on graphs continues to grow in importance E.g. graph representation of extracted

knowledge in YAGO/NAGA (Max Planck) Ranking is critical

Edge and node weights, spreading activation

Efficient graph search is important In-memory and external-memory

Page 50: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

50

Ongoing/Future Work

External memory graph search Compression ratios for supernode graph for

DBLP/IMDB: factor of 5 to 10 Ongoing work on graph clustering shows good

results

Graph search in a parallel cluster Goal: search integrated WWW/Wikipedia graph

New search algorithms Integration with existing applications

To provide more natural display of results, hiding schema details

Authorization

Page 51: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

51

BANKS References

Keyword Searching and Browsing in databases using BANKS, Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan ICDE 2002

User Interaction in the BANKS System, Demo paper,B. Aditya, Soumen Chakrabarti, Rushi Desai, Arvind Hulgeri, Hrishikesh Karambelkar, Rupesh Nasre, Parag, S. Sudarshan ICDE 2003

Bidirectional Expansion For Keyword Search on Graph Databases, Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S Sudarshan, Rushi Desai and Hrishikesh Karambelkar,VLDB 2005

Keyword Search on External Memory Data GraphsBhavana Dalvi, Meghana Kshirsagar and S. Sudarshan,VLDB 2008

Page 52: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

Thanks!

Page 53: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

53

Time and Nodes Explored

BiDir Time

Bidir Nodes

Page 54: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

54

Screenshots (1) author (near recovery)

Page 55: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

55

Near Queries with Multiple Keywords

Spread activation from each keyword separately

Then combine the activations from different keywords

OR: use addition or belief combination

AND: take product of activations Gives better results

Page 56: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

56

The BANKS System

Available on the web, with DBLP, IMDB and IITB ETD data http://www.cse.iitb.ac.in/banks/

No programming needed for customization Minimal preprocessing to create indices and give weights to links

Provides keyword search coupled with extensive browsing features Schema browsing + data browsing Hyperlinks are automatically added to all displayed results Browsing data by grouping and creating crosstabs Graphical display of data: bar charts, pie charts, etc

BANKSUserJDBCHTT

P

Web Server + Servlets

Database

Page 57: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

57

BANKS Architecture

Data resident on disk Graph structure of data resident in

memory Nodes and edges with their types/counts

16x|V|+8x|E| bytes Search done in memory

Why Allows us to use interesting graph traversal

based algorithms without being constrained by SQL and related performance issues

With current memory sizes, database graphs for most applications will fit in memory

Page 58: Keyword Search on Graph-Structured Data S. Sudarshan IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi

58

Probabilistic Edge Score Model (2)

Paths from root to leaves are considered separately, even if they share edges More efficient search algorithms with

this models (He et al., SIGMOD07)

0.5 0.5

0.167

1