Efficient IR-Style Keyword Search over Relational Databases Vagelis Hristidis University of California, San Diego • Luis Gravano Columbia University • Yannis Papakonstantinou University of California, San Diego


Efficient IR-Style Keyword Search over Relational Databases

• Vagelis Hristidis

University of California, San Diego

• Luis Gravano

Columbia University

• Yannis Papakonstantinou

University of California, San Diego


Motivation

• Keyword search is the dominant information discovery method in documents

• Increasing amount of data is stored in databases

• Plain text coexists with structured data

Motivation

• Up until recently, information discovery in databases required:
– Knowledge of schema
– Knowledge of a query language (e.g., SQL)
– Knowledge of the role of the keywords

• Goal: Enable IR-style keyword search over DBMSs without the above requirements


IR-Style Search over DBMSs

• IR keyword search well developed for document search

• Modern DBMSs offer IR-style keyword search over individual text attributes

• What is the equivalent of a document in a database?

Example – Complaints Database Schema

Complaints(prodId, custId, date, comments)

Products(prodId, manufacturer, model)

Customers(custId, name, occupation)

Example - Complaints Database Data

Complaints
tupleId  prodId  custId  date       comments
c1       p121    c3232   6-30-2002  “disk crashed after just one week of moderate use on an IBM Netvista X41”
c2       p131    c3131   7-3-2002   “lower-end IBM Netvista caught fire, starting apparently with disk”
c3       p131    c3143   8-3-2002   “IBM Netvista unstable with Maxtor HD”

Products
tupleId  prodId  manufacturer  model
p1       p121    “Maxtor”      “D540X”
p2       p131    “IBM”         “Netvista”
p3       p141    “Tripplite”   “Smart 700VA”

Customers
tupleId  custId  name          occupation
u1       c3232   “John Smith”  “Software Engineer”
u2       c3131   “Jack Lucas”  “Architect”
u3       c3143   “John Mayer”  “Student”

Example – Keyword Query [Maxtor Netvista]

The query [Maxtor Netvista] is evaluated over the Complaints, Products, and Customers data shown above.

Keyword Query Semantics (definition of “document” in databases)

Keywords are:
• in same tuple
• in same relation
• in tuples connected through primary-foreign key relationships

Score of result:
• distance of keywords within a tuple
• distance between keywords in terms of primary-foreign key connections
• IR-style score of result tree

Example – Keyword Query [Maxtor Netvista]

Results over the Complaints, Products, and Customers data shown above: (1) c3, (2) p2→c3, (3) p1→c1

Note: result (2) ranks higher than (3) because the attribute score of the comments value of c3 is higher than that of c1.

Result of Keyword Query

Result is tree T of tuples where:

• each edge corresponds to a primary-foreign key relationship

• no tuple of T is redundant (minimality); that is, no tuple with zero score can be a leaf

• query semantics:
– “AND” query semantics: Every query keyword appears in T
– “OR” query semantics: Some query keywords might be missing from T

Score of Result T

• Combining function Score combines scores of attribute values of T

• One reasonable choice:

$\mathrm{Score}(T) = \sum_{a \in T} \mathrm{Score}(a) \,/\, \mathrm{size}(T)$

• Attribute value scores Score(a) are calculated using the DBMS's IR “datablades”, which generally use well-known IR tf-idf ranking functions

Note: attribute values are a better scoring unit than tuples because they represent the minimal information unit. Attributes are grouped into tuples based on schema design decisions.

Shortcomings of Prior Work

• Simplistic ranking methods (e.g., based only on size of connecting tree), ignoring well-studied IR ranking strategies

• No straightforward extension to improve efficiency by returning just top-k results

• Poor handling of free-text attributes [DBXplorer, DISCOVER]

Note: for example, two attributes that contain a keyword are treated equally regardless of their total number of words.

Example – Keyword Query [Maxtor Netvista]

Attribute scores for the Complaints and Products data shown above:

Complaints
tupleId  score
c1       1/3
c2       1/3
c3       4/3

Products
tupleId  score
p1       1
p2       1
p3       0

Results: (1) c3, (2) p2→c3, (3) p1→c1

Score(c3) = 4/3

Score(p2→c3) = (1+4/3)/2 = 7/6

Score(p1→c1) = (1+1/3)/2 = 4/6

Note: result (2) ranks higher than (3) because the attribute score of the comments value of c3 is higher than that of c1.
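As a minimal sketch of the combining function Score(T) = (sum of attribute scores) / size(T), the following Python snippet recomputes the three result scores of the [Maxtor Netvista] example. The name `score_tree` and the dictionary layout are illustrative assumptions, and the sketch assumes one scored attribute per tuple, as in the example.

```python
from fractions import Fraction

# Attribute scores from the example (one scored attribute per tuple).
attr_score = {
    "c1": Fraction(1, 3), "c2": Fraction(1, 3), "c3": Fraction(4, 3),
    "p1": Fraction(1), "p2": Fraction(1), "p3": Fraction(0),
}

def score_tree(tuple_ids):
    """Hypothetical helper: sum of attribute scores divided by tree size."""
    return sum(attr_score[t] for t in tuple_ids) / len(tuple_ids)

# Candidate result trees, written as tuples of tuple ids.
results = [("c3",), ("p2", "c3"), ("p1", "c1")]
ranked = sorted(results, key=score_tree, reverse=True)

for tree in ranked:
    print("→".join(tree), score_tree(tree))
```

Exact fractions are used so that the values can be compared directly with the 4/3, 7/6, and 4/6 figures of the example.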

Architecture

Pipeline: User → Keywords → IR Engine (uses IR Index) → Tuple Sets → Candidate Network Generator (uses Database Schema) → Candidate Networks → Execution Engine (sends Parameterized, Prepared SQL Queries to the Database) → Top-k Joining Trees of Tuples → User

Example for query [Maxtor Netvista]:

Tuple sets:
ComplaintsQ = [(c3,comments,1.33), (c1,comments,0.33), (c2,comments,0.33)]
ProductsQ = [(p1,manufacturer,1), (p2,model,1)]

Candidate networks:
ComplaintsQ
ProductsQ
ComplaintsQ ⋈ ProductsQ
ComplaintsQ ⋈ Customers{} ⋈ ComplaintsQ
ComplaintsQ ⋈ Products{} ⋈ ComplaintsQ

Parameterized, prepared SQL queries:
... SELECT * FROM ComplaintsQ c, ProductsQ p WHERE c.prodId = p.prodId AND c.prodId = ? AND c.custId = ?; ...

Top-k joining trees of tuples: c3, p2→c3, p1→c1
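To make the last stage of the pipeline concrete, here is a minimal sketch of evaluating one candidate network (ComplaintsQ joined with ProductsQ) with a single SQL query, using Python's built-in `sqlite3` module. Storing each tuple set as a table with a `score` column, averaging the scores inside SQL, and the alias `tree_score` are assumptions made for illustration; the extra `?` parameters of the query shown on the slide are omitted here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ComplaintsQ (tupleId TEXT, prodId TEXT, custId TEXT, score REAL);
    CREATE TABLE ProductsQ (tupleId TEXT, prodId TEXT, score REAL);
""")
# Tuple sets for query [Maxtor Netvista], with the attribute scores of the example.
conn.executemany(
    "INSERT INTO ComplaintsQ VALUES (?, ?, ?, ?)",
    [("c3", "p131", "c3143", 4 / 3),
     ("c1", "p121", "c3232", 1 / 3),
     ("c2", "p131", "c3131", 1 / 3)],
)
conn.executemany(
    "INSERT INTO ProductsQ VALUES (?, ?, ?)",
    [("p1", "p121", 1.0), ("p2", "p131", 1.0)],
)

K = 10  # number of results requested
rows = conn.execute(
    """
    SELECT p.tupleId, c.tupleId, (p.score + c.score) / 2 AS tree_score
    FROM ComplaintsQ c, ProductsQ p
    WHERE c.prodId = p.prodId
    ORDER BY tree_score DESC, c.tupleId
    LIMIT ?
    """,
    (K,),
).fetchall()

for product_id, complaint_id, tree_score in rows:
    print(f"{product_id}→{complaint_id}", round(tree_score, 3))
```

This query computes every joining tree of the candidate network before ranking, which is the cost that top-k execution strategies try to avoid.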


Candidate Network Generator

• Find all trees of tuple sets (free or non-free) that may produce a result, based on DISCOVER's CN generator [VLDB 2002]

• Use single non-free tuple set for each relation
– allows “OR” semantics
– fewer CNs are generated
– extra filtering step required for “AND” semantics
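The extra filtering step for “AND” semantics can be sketched as a simple post-check on each result tree: keep the tree only if every query keyword occurs somewhere in its tuples. The function name `satisfies_and` and the whitespace tokenization are assumptions for illustration; a real system would rely on the DBMS text index rather than on string splitting.

```python
def satisfies_and(tree_texts, keywords):
    """Return True if every keyword occurs in at least one tuple's text."""
    words = set()
    for text in tree_texts:
        words.update(text.lower().split())
    return all(keyword.lower() in words for keyword in keywords)

query = ["Maxtor", "Netvista"]

# Text of the tuples from the Complaints and Products example data.
c1 = "disk crashed after just one week of moderate use on an IBM Netvista X41"
c3 = "IBM Netvista unstable with Maxtor HD"
p1 = "Maxtor D540X"

print(satisfies_and([c3], query))      # c3 alone mentions both keywords
print(satisfies_and([c1], query))      # c1 alone lacks "Maxtor"
print(satisfies_and([p1, c1], query))  # the tree joining p1 and c1 covers both
```

Under “OR” semantics the check is simply skipped, so a result such as c1 alone remains valid.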

Candidate Network Generator Example

For query [Maxtor Netvista], CNs:
• ComplaintsQ
• ProductsQ
• ComplaintsQ ⋈ ProductsQ
• ComplaintsQ ⋈ Customers{} ⋈ ComplaintsQ
• ComplaintsQ ⋈ Products{} ⋈ ComplaintsQ

Non-CNs:
• ComplaintsQ ⋈ Customers{} ⋈ Complaints{}
• ProductsQ ⋈ Complaints{} ⋈ ProductsQ


Execution Algorithms

• Users usually want top-k results.

• Hence, submitting to the DBMS a SQL query for each CN (Naïve algorithm), as prior works do, is inefficient.

• When queries produce at most very few results, the Naïve algorithm is efficient, since it fully exploits the DBMS.

• Monotonic combining functions: if results T, T' have the same schema and Score(ai) ≤ Score(a'i) for every attribute, then Score(T) ≤ Score(T')

Sparse Algorithm: Example Execution

ComplaintsQ
tupleId  score
c2       7
c1       5
c3       1

ProductsQ
tupleId  score
p1       9
p2       6
p3       1

CN                        results  score          MFS
ProductsQ                 p1       9              9
ComplaintsQ               c2       7              7
ComplaintsQ ⋈ ProductsQ   p1→c1    (9+5)/2 = 7    (9+7)/2 = 8

• Best when query produces at most a few results

Note: Sparse improves on the Naïve algorithm by discarding any CN with no chance of producing top-k results, given the results so far. It computes the Maximum Possible Score for each CN C by combining the top tuples of the non-free tuple sets of C.
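The pruning idea behind the Sparse algorithm can be sketched as follows: compute the maximum possible score of each CN from the top tuple of each of its non-free tuple sets, and skip any CN whose bound cannot beat the current k-th best result. This is a simplified in-memory sketch, not the authors' implementation: the `evaluate` callables stand in for the per-CN SQL queries, CNs are visited in the order given, and the set of joining pairs is an assumption taken from the results shown in the examples.

```python
import heapq

complaints = [("c2", 7), ("c1", 5), ("c3", 1)]  # sorted by score, descending
products = [("p1", 9), ("p2", 6), ("p3", 1)]
joins = {("c1", "p1"), ("c2", "p2")}  # assumed joining (complaint, product) pairs

def combine(scores):
    """Combining function: average of the attribute scores."""
    return sum(scores) / len(scores)

def eval_single(tuple_set):
    return [(tid, combine([s])) for tid, s in tuple_set]

def eval_join():
    return [(f"{p}→{c}", combine([cs, ps]))
            for c, cs in complaints for p, ps in products if (c, p) in joins]

# Each CN: (name, non-free tuple sets, callable standing in for its SQL query).
cns = [
    ("ProductsQ", [products], lambda: eval_single(products)),
    ("ComplaintsQ", [complaints], lambda: eval_single(complaints)),
    ("ComplaintsQ ⋈ ProductsQ", [complaints, products], eval_join),
]

def sparse(cns, k):
    top, evaluated = [], []  # top is a min-heap of (score, result)
    for name, tuple_sets, evaluate in cns:
        mps = combine([ts[0][1] for ts in tuple_sets])  # maximum possible score
        if len(top) == k and mps <= top[0][0]:
            continue  # this CN cannot improve the current top-k
        evaluated.append(name)
        for result, score in evaluate():
            if len(top) < k:
                heapq.heappush(top, (score, result))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, result))
    return sorted(top, reverse=True), evaluated

top1, evaluated1 = sparse(cns, k=1)
top2, evaluated2 = sparse(cns, k=2)
print(top1, evaluated1)
print(top2, evaluated2)
```

With k=1 only the first CN is evaluated, because its best result (score 9) already exceeds the bounds 7 and 8 of the other two CNs.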

Single Pipelined Algorithm: Example Execution

CN: ComplaintsQ ⋈ ProductsQ

ComplaintsQ
tupleId  score
c2       7
c1       5
c3       1

ProductsQ
tupleId  score
p1       9
p2       6
p3       1

At each step, get the next tuple from the most promising non-free tuple set. Successive MPFS values:

MPFS = Max[(5+9)/2, (7+6)/2] = 7
MPFS = Max[(1+9)/2, (7+6)/2] = 6.5
MPFS = Max[(1+9)/2, (7+1)/2] = 5

Results queue
result  score
p1→c1   7
p2→c2   6.5

Output: p1→c1, p2→c2

Note: the algorithm operates on a single CN C. It reads and processes a minimum prefix from each non-free tuple set of C (ordered by descending score) and keeps the Maximum Possible Future Score (MPFS) of C. For each newly retrieved tuple t, it submits all possible parameterized queries joining t to the already retrieved tuples of the other non-free tuple sets.
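The Single Pipelined steps can be sketched in Python for the two-tuple-set example. This is a simplified in-memory sketch under stated assumptions: the `joins` predicate replaces the parameterized SQL queries, the combining function is the average of tuple scores, and the set of joining pairs is taken from the results shown in the example. The function and variable names are illustrative.

```python
from itertools import product

def combine(scores):
    """Combining function: average of the attribute scores."""
    return sum(scores) / len(scores)

def single_pipelined(tuple_sets, joins, k):
    """tuple_sets: lists of (tuple_id, score), each sorted by descending score."""
    n = len(tuple_sets)
    top = [ts[0][1] for ts in tuple_sets]  # best score of each tuple set
    seen = [[] for _ in tuple_sets]        # prefix retrieved from each tuple set
    queue, output = [], []                 # pending results, emitted results

    def mpfs(i):
        # Best score a result using a not-yet-retrieved tuple of set i could reach.
        nxt = len(seen[i])
        if nxt == len(tuple_sets[i]):
            return float("-inf")
        scores = list(top)
        scores[i] = tuple_sets[i][nxt][1]
        return combine(scores)

    while len(output) < k:
        bounds = [mpfs(i) for i in range(n)]
        bound = max(bounds)
        # Emit queued results that no future result can beat.
        queue.sort(key=lambda r: -r[1])
        while queue and queue[0][1] >= bound and len(output) < k:
            output.append(queue.pop(0))
        if len(output) >= k or bound == float("-inf"):
            break
        i = bounds.index(bound)            # most promising tuple set
        t = tuple_sets[i][len(seen[i])]
        seen[i].append(t)
        # Join the new tuple with the retrieved prefixes of the other sets.
        candidates = [[t] if j == i else seen[j] for j in range(n)]
        for combo in product(*candidates):
            ids = tuple(c[0] for c in combo)
            if joins(ids):
                queue.append((ids, combine([c[1] for c in combo])))
    return output

complaints = [("c2", 7), ("c1", 5), ("c3", 1)]
products = [("p1", 9), ("p2", 6), ("p3", 1)]
joining_pairs = {("c1", "p1"), ("c2", "p2")}  # assumed (complaint, product) matches

top2 = single_pipelined([complaints, products], lambda ids: ids in joining_pairs, k=2)
print(top2)
```

In this run the bound drops to 7, then 6.5, then 5, and the two results are emitted as soon as their scores reach the current bound, without reading c3 or p3.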

Global Pipelined Algorithm : Example Execution

Diagram components: queue of CN processes ordered by ascending MPFS (C4, C5, C1, C3, C2), Processing Unit (currently running C3, with MPFS3 = 3.5), Results queue, Output.

Complaints
tupleId  score
c2       7
c1       5
c3       1

Products
tupleId  score
p1       9
p2       6
p3       1

Results queue
result  score
p1→c1   7
p2→c2   6.5

global MPFS = max(MPFSi) over all CNs Ci

• Best when query produces many results.

Note: the algorithm executes an instance of the Single Pipelined algorithm for each CN in a round-robin manner. It keeps a global MPFS = max(MPFSi) over all CNs Ci. In each step it retrieves the next tuple from the most promising tuple set of the most promising CN.

Hybrid Algorithm

• Estimate number of results.
– For “OR”-semantics, use DBMS estimator
– For “AND”-semantics, probabilistically adjust DBMS estimator.

• If at most a few query results expected, then use Sparse Algorithm.

• If many query results expected, then use Global Pipelined Algorithm.


Related Work

• DBXplorer [ICDE 2002], DISCOVER [VLDB 2002]
– Similar three-step architecture
– Score = 1/size(T)
– Only AND semantics
– No straightforward extension for efficient top-k execution (they submit a SQL query for each CN)

• BANKS [ICDE 2002], Goldman et al. [VLDB 1998]
– Database viewed as graph
– No use of schema

• Florescu et al. [WWW 2000], XQuery Full-Text

• Ilyas et al. [VLDB 2003], J* algorithm [VLDB 2001]
– Top-k algorithms for join queries

Experiments – DBLP Dataset

C(cid, name): Conference

Y(yid, year, cid): Year

P(pid, title, yid): Paper

A(aid, name): Author

PP(pid1, pid2)

PA(pid, aid)

DBLP contains few citation edges.

Synthetic citation edges were added such that the average number of citations is 20.

Final dataset is 56MB.

Experiments were run over a state-of-the-art commercial RDBMS.

OR Semantics: Effect of Maximum Allowed CN Size

[Figure: average execution time (msec, log scale from 10 to 1,000,000) of 100 2-keyword top-10 queries versus max CN size (2 to 7), for Naive, Sparse, SA, SASymmetric, GA, GASymmetric, and Hybrid.]

Note: only results for OR semantics are shown in the presentation.

OR Semantics: Effect of Number of Objects Requested k

[Figure: average execution time (msec, log scale from 10 to 1,000,000) of 100 2-keyword queries with maximum candidate-network size of 6, versus k (1 to 20), for Naive, Sparse, SA, SASymmetric, GA, GASymmetric, and Hybrid.]

OR Semantics: Effect of Number of Query Keywords

[Figure: average execution time (msec, log scale from 100 to 100,000) of 100 top-10 queries with maximum candidate-network size of 6, versus number of keywords (2 to 5), for Naive, Sparse, GA, GASymmetric, and Hybrid.]

Conclusions

• Extend IR-style ranking to databases.

• Exploit text-search capabilities of modern DBMSs to generate results of higher quality.

• Support both “AND” and “OR” semantics.

• Achieve substantial speedup over prior work via pipelined top-k query processing algorithms.

Questions?

Comparison of Algorithms with Respect to Result Size

[Figure, two panels: execution time (msec, log scale from 1 to 100,000) of GA and Sparse versus total # results. One panel spans 0 to 6000 results and the other 0 to 700 results; the panel labels are “OR-semantics” and “AND-semantics”. Caption: Max CN size = 6, top-10, 2 keywords, OR-semantics.]

Ranking Functions

• Proposed algorithms support tuple monotone combining functions

• That is, if results T, T' have same schema and for every attribute Score(ai)≤Score(a'i) then Score(T)≤Score(T')