Efficient Keyword Search across Heterogeneous Relational Databases

Mayssam Sayyadian, AnHai Doan, University of Wisconsin - Madison

Hieu LeKhac, University of Illinois - Urbana

Luis Gravano, Columbia University


Key Message of Paper

Precise data integration is expensive

But we can do IR-style data integration very cheaply, with no manual cost!
– just apply automatic schema/data matching
– then do keyword search across the databases
– no need to verify anything manually

Already very useful

Build upon keyword search over a single database ...

Keyword Search over a Single Relational Database

A growing field, numerous current works
– DBXplorer [ICDE02], BANKS [ICDE02]
– DISCOVER [VLDB02]
– Efficient IR-style keyword search in databases [VLDB03]
– VLDB-05, SIGMOD-06, etc.

Many related works over XML / other types of data
– XKeyword [ICDE03], XRank [Sigmod03]
– TeXQuery [WWW04]
– ObjectRank [Sigmod06]
– TopX [VLDB05], etc.

More are coming at SIGMOD-07 ...

A Typical Scenario

Customers
tid | custid | name | contact | addr
t1 | c124 | Cisco | Michael Jones | …
t2 | c533 | IBM | David Long | …
t3 | c333 | MSR | David Ross | …

Complaints
tid | id | emp-name | comments
u1 | c124 | Michael Smith | Repair didn’t work
u2 | c124 | John | Deferred work to John Smith

Foreign-Key Join between Customers and Complaints

Q = [Michael Smith Cisco]

Ranked list of answers
score=.8: t1 c124 Cisco Michael Jones … joined with u1 c124 Michael Smith Repair didn’t work
score=.7: t1 c124 Cisco Michael Jones … joined with u2 c124 John Deferred work to John Smith

Our Proposal: Keyword Search across Multiple Databases

Query: [Cisco Jack Lucas]

Customers
t1 | c124 | Cisco | Michael Jones | …
t2 | c533 | IBM | David Long | …
t3 | c333 | MSR | Joan Brown | …

Complaints
tid | id | emp-name | comments
u1 | c124 | Michael Smith | Repair didn’t work
u2 | c124 | John | Deferred work to John Smith

Employees
tid | empid | name
v1 | e23 | Mike D. Smith
v2 | e14 | John Brown
v3 | e37 | Jack Lucas

Groups
tid | eid | reports-to
x1 | e23 | e37
x2 | e14 | e37

Answer across databases (IR-style data integration):
t1 c124 Cisco Michael Jones …
u1 c124 Michael Smith Repair didn’t work
v1 e23 Mike D. Smith
x1 e23 e37
v3 e37 Jack Lucas

A Naive Solution

1. Manually identify FK joins across DBs

2. Manually identify matching data instances across DBs

3. Now treat the combination of DBs as a single DB, then apply current keyword search techniques

Just like in traditional data integration, this is too much manual work

Kite Solution

[Figure: the Customers, Complaints, Employees, and Groups tables from the previous slides]

Automatically find FK joins / matching data instances across databases
– no manual work is required from the user

Automatically Find FK Joins across Databases

Current solutions analyze data values (e.g., Bellman)

Limited accuracy
– e.g., “waterfront” with values yes/no vs. “electricity” with values yes/no: the value sets are identical, yet the attributes are not joinable

Our solution: data analysis + schema matching
– improves accuracy drastically (by as much as 50% F-1)

[Figure: the Employees and Complaints tables, linked by an automatically discovered match]

Automatic join/data matching can be wrong, so incorporate confidence scores into answer scores
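The waterfront/electricity point can be made concrete with a small sketch. This is not Kite's join finder: it stands in a Jaccard value overlap for "data analysis" and a plain string similarity on attribute names for "schema matching", and the function names and the equal weighting are assumptions made purely for illustration.

```python
from difflib import SequenceMatcher


def value_overlap(a_values, b_values):
    """Jaccard overlap of two attribute value sets (stand-in for data analysis)."""
    a, b = set(a_values), set(b_values)
    return len(a & b) / len(a | b) if a | b else 0.0


def name_similarity(a_name, b_name):
    """String similarity of attribute names (crude stand-in for schema matching)."""
    return SequenceMatcher(None, a_name.lower(), b_name.lower()).ratio()


def join_confidence(a_name, a_values, b_name, b_values, weight=0.5):
    """Hypothetical combination: weighted mean of value overlap and name similarity."""
    overlap = value_overlap(a_values, b_values)
    return weight * overlap + (1 - weight) * name_similarity(a_name, b_name)


# Identical yes/no value sets, but unrelated attribute names.
spurious = join_confidence("waterfront", ["yes", "no"], "electricity", ["yes", "no"])
# Overlapping employee ids and similar attribute names.
plausible = join_confidence("empid", ["e23", "e14", "e37"], "eid", ["e23", "e14"])
```

With value overlap alone, the yes/no pair would look like a perfect join; adding a name-based signal ranks the empid/eid pair above it in this toy setting.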

Incorporate Confidence Scores into Answer Scores

$$\mathrm{score}(A, Q) = \frac{\alpha \cdot \mathrm{score}_{kw}(A, Q) + \beta \cdot \mathrm{score}_{join}(A, Q) + \gamma \cdot \mathrm{score}_{data}(A, Q)}{\mathrm{size}(A)}$$

Recall: answer example in single-DB settings
t1 c124 Cisco Michael Jones … joined with u1 c124 Michael Smith Repair didn’t work (score=.8)

Recall: answer example in multiple-DB settings
t1 c124 Cisco Michael Jones …
u1 c124 Michael Smith Repair didn’t work
v1 e23 Mike D. Smith
x1 e23 e37
v3 e37 Jack Lucas
– score 0.9 for FK join
– score 0.7 for data matching
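The scoring formula can be sketched directly in code. The class and field names below are illustrative, the default weights of 1.0 are placeholders, and how the per-join and per-match confidences are aggregated into a single score_join and score_data is not stated on the slide, so each is taken here as one given number.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    """A candidate answer, i.e. a set of joined tuples (illustrative fields)."""
    size: int          # number of tuples joined in the answer
    score_kw: float    # keyword (IR) relevance of the answer to the query
    score_join: float  # confidence of the FK joins used (assumed pre-aggregated)
    score_data: float  # confidence of the data matches used (assumed pre-aggregated)


def answer_score(a: Answer, alpha: float = 1.0, beta: float = 1.0,
                 gamma: float = 1.0) -> float:
    """Weighted sum of the three component scores, normalized by answer size."""
    if a.size <= 0:
        raise ValueError("an answer must contain at least one tuple")
    weighted = alpha * a.score_kw + beta * a.score_join + gamma * a.score_data
    return weighted / a.size


# Same component scores, different sizes: the smaller answer ranks higher.
small = answer_score(Answer(size=2, score_kw=1.0, score_join=0.9, score_data=0.7))
large = answer_score(Answer(size=5, score_kw=1.0, score_join=0.9, score_data=0.7))
```

Dividing by size(A) penalizes answers that need many joins to connect the keywords, so with equal component scores the two-tuple answer outranks the five-tuple one.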

Summary of Trade-Offs

Precise data integration
– the holy grail
– SQL queries

IR-style data integration, naive way
– manually identify FK joins, matching data
– still too expensive

IR-style data integration, using Kite
– automatic FK join finding / data matching
– cheap
– only approximates the “ideal” ranked list found by naive

Kite Architecture

[Figure: Kite architecture]

Offline preprocessing (over databases D1 … Dn)
– Index Builder: IR index1 … IR indexn
– Foreign-Key Join Finder (Data-based Join Finder + Schema Matcher): foreign key joins

Online querying (e.g., Q = [ Smith Cisco ])
– Condensed CN Generator
– Top-k Searcher, with refinement rules (Partial, Full, Deep) and data instance matcher
– Distributed SQL queries over D1 … Dn

Online Querying

[Figure: Database 1 and Database 2, each with Relation 1 and Relation 2]

What current solutions do:
1. Create answer templates
2. Materialize answer templates to obtain answers

Create Answer Templates

Find tuples that contain query keywords
– use DB’s IR index
– example: Q = [Smith Cisco]

Databases:
Service-DB: Complaints (u1, u2), Customers (v1, v2, v3)
HR-DB: Employees (t1, t2, t3), Groups (x1, x2)

Tuple sets:
Service-DB: ComplaintsQ = {u1, u2}, CustomersQ = {v1}
HR-DB: EmployeesQ = {t1}, GroupsQ = {}

Create tuple-set graph

Schema graph: Customers -J1- Complaints -J4- Emps -J2, J3- Groups

Tuple set graph:
[Figure: nodes Customers{}, Complaints{}, Emps{}, Groups{}, CustomersQ, ComplaintsQ, EmpsQ, connected by edges labeled J1 to J4]
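The tuple-set step can be sketched as follows. A real system would consult each database's IR index; here a plain case-insensitive token match stands in for it, and the row texts are toy data assembled from the example tables on the earlier slides, using the tuple ids from this slide.

```python
import re

# Toy contents (assumed for illustration); tuple ids follow this slide.
DATABASES = {
    "Service-DB": {
        "Customers": {
            "v1": "c124 Cisco Michael Jones",
            "v2": "c533 IBM David Long",
            "v3": "c333 MSR Joan Brown",
        },
        "Complaints": {
            "u1": "c124 Michael Smith Repair didn't work",
            "u2": "c124 John Deferred work to John Smith",
        },
    },
    "HR-DB": {
        "Employees": {
            "t1": "e23 Mike D. Smith",
            "t2": "e14 John Brown",
            "t3": "e37 Jack Lucas",
        },
        "Groups": {"x1": "e23 e37", "x2": "e14 e37"},
    },
}


def tokenize(text):
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def tuple_sets(databases, query):
    """For every relation, collect ids of tuples containing at least one keyword."""
    keywords = tokenize(query)
    result = {}
    for db_name, tables in databases.items():
        for table, rows in tables.items():
            result[(db_name, table)] = {
                tid for tid, text in rows.items() if tokenize(text) & keywords
            }
    return result


sets_for_query = tuple_sets(DATABASES, "Smith Cisco")
```

On this toy data the result matches the tuple sets listed on the slide: Complaints yields {u1, u2}, Customers {v1}, Employees {t1}, and Groups the empty set.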

Create Answer Templates (cont.)

Search tuple-set graph to generate answer templates
– also called Candidate Networks (CNs)

Each answer template = one way to join tuples to form an answer

[Figure: sample tuple set graph, as on the previous slide]

Sample CNs:
CN1: CustomersQ
CN2: CustomersQ -J1- Complaints{Q}
CN3: EmpsQ -J2- Groups{} -J2- Emps{} -J4- Complaints{Q}
CN4: EmpsQ -J2- Groups{} -J3- Emps{} -J4- Complaints{Q}
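A simplified sketch of template generation over the tuple-set graph follows. It only enumerates path-shaped templates whose two ends are keyword tuple sets, up to a maximum number of nodes; real CN generators also produce tree-shaped templates and apply pruning and keyword-coverage rules that are not modeled here. The function names and the string encoding of nodes ("EmpsQ", "Emps{}") are assumptions for illustration.

```python
from itertools import product

# Schema graph read off the slides: (relation, relation, join label).
SCHEMA_EDGES = [
    ("Customers", "Complaints", "J1"),
    ("Emps", "Groups", "J2"),
    ("Emps", "Groups", "J3"),
    ("Complaints", "Emps", "J4"),
]
NON_EMPTY_Q = {"Customers", "Complaints", "Emps"}  # GroupsQ is empty


def tuple_set_graph(schema_edges, relations_with_q):
    """Adjacency lists over the tuple sets R{} (free) and RQ (keyword) of each relation."""
    def variants(rel):
        return [rel + "{}"] + ([rel + "Q"] if rel in relations_with_q else [])

    adj = {}
    for left, right, join in schema_edges:
        for a, b in product(variants(left), variants(right)):
            adj.setdefault(a, []).append((join, b))
            adj.setdefault(b, []).append((join, a))
    return adj


def candidate_paths(adj, max_size):
    """Paths with at most max_size nodes that start and end at keyword tuple sets."""
    found = set()

    def extend(path):
        # path alternates nodes and join labels: (n0, j01, n1, j12, n2, ...)
        if path[-1].endswith("Q"):
            found.add(min(path, path[::-1]))  # one canonical direction per path
        if (len(path) + 1) // 2 == max_size:
            return
        for join, nxt in adj.get(path[-1], []):
            if nxt not in path[::2]:  # do not revisit a tuple set
                extend(path + (join, nxt))

    for start in adj:
        if start.endswith("Q"):
            extend((start,))
    return sorted(found, key=lambda p: (len(p), p))


cns = candidate_paths(tuple_set_graph(SCHEMA_EDGES, NON_EMPTY_Q), max_size=4)
```

With max_size=4 the enumeration includes the four sample CNs above, possibly written in the reverse direction, along with other paths. The number of templates grows quickly with the size bound and with the number of join edges.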

Materialize Answer Templates to Generate Answers

By generating and executing a SQL query

CN: CustomersQ -J1- ComplaintsQ (CustomersQ = {v1}, ComplaintsQ = {u1, u2})

SQL: SELECT * FROM Customers C, Complaints P WHERE C.cust-id = P.id AND
(C.tuple-id = v1) AND (P.tuple-id = u1 OR P.tuple-id = u2)

Naive solution
– materialize all answer templates, score, rank, then return answers

Current solutions
– find only top-k answers
– materialize only certain answer templates
– make decisions using refinement rules + statistics
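Materializing the two-relation template above can be sketched with an in-memory SQLite database. This is a single-database illustration only; across databases the join would have to be executed as distributed queries. The table contents are toy data, and the column names use underscores (tuple_id, cust_id) instead of the hyphenated names on the slide so that they are valid SQL identifiers.

```python
import sqlite3


def toy_connection():
    """In-memory database with toy Customers and Complaints rows (assumed data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Customers (tuple_id TEXT, cust_id TEXT, name TEXT);
        CREATE TABLE Complaints (tuple_id TEXT, id TEXT, emp_name TEXT);
        INSERT INTO Customers VALUES ('v1', 'c124', 'Cisco'), ('v2', 'c533', 'IBM');
        INSERT INTO Complaints VALUES ('u1', 'c124', 'Michael Smith'),
                                      ('u2', 'c124', 'John');
    """)
    return conn


def materialize(conn, customers_q, complaints_q):
    """Run the join for the template CustomersQ -J1- ComplaintsQ on the given tuple sets."""
    c_marks = ", ".join("?" for _ in customers_q)
    p_marks = ", ".join("?" for _ in complaints_q)
    sql = (
        "SELECT C.tuple_id, P.tuple_id FROM Customers C, Complaints P "
        "WHERE C.cust_id = P.id "
        f"AND C.tuple_id IN ({c_marks}) AND P.tuple_id IN ({p_marks}) "
        "ORDER BY C.tuple_id, P.tuple_id"
    )
    return conn.execute(sql, [*customers_q, *complaints_q]).fetchall()


answers = materialize(toy_connection(), ["v1"], ["u1", "u2"])
```

Each returned pair of tuple ids is one answer for the template; scoring and ranking would be applied afterwards.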

Challenges for Kite Setting

More databases lead to way too many answer templates to generate
– can take hours on just 3-4 databases

Materializing an answer template takes way too long
– requires SQL query execution across multiple databases
– invoking each database incurs large overhead

Difficult to obtain reliable statistics across databases

See paper for our solutions

Empirical Evaluation

Domains

Domain | # DBs | Avg # tables per DB | Avg # attributes per schema | Avg # approximate FK joins: total | across DBs | per pair | Avg # tuples per table | Total size
DBLP | 2 | 3 | 3 | 11 | 6 | 11 | 500K | 400M
Inventory | 8 | 5.8 | 5.4 | 890 | 804 | 33.6 | 2K | 50M

The DBLP Schema (tables split across DBLP 1 and DBLP 2)
– CNF (id, name)
– CITE (id1, id2)
– AR (id, title)
– AU (id, name)
– AR (aid, biblo)
– PU (aid, uid)

Sample Inventory Schema (Inventory 1)
– WAREHOUSE
– AUTHOR
– BOOK
– WH2BOOK
– CD
– ARTIST
– WH2CD

Runtime Performance (1)

Algorithms compared:
– Hybrid algorithm adapted to run over multiple databases
– Kite without condensed CNs
– Kite without adaptive rule selection and without rule Deep
– Kite without rule Deep
– Full-fledged Kite algorithm

[Figure: runtime vs. maximum CCN size; x-axis: max CCN size, y-axis: time (sec)]
– DBLP: 2-keyword queries, k=10, 2 databases
– Inventory: 2-keyword queries, k=10, 5 databases

[Figure: runtime vs. # of databases; x-axis: # of DBs, y-axis: time (sec)]
– Inventory: maximum CCN size = 4, 2-keyword queries, k=10

Runtime Performance (2)

[Figure: runtime vs. # of keywords in the query; x-axis: |q|, y-axis: time (sec)]
– DBLP: max CCN=6, k=10, 2 databases
– Inventory: max CCN=4, k=10, 5 databases

[Figure: runtime vs. # of answers requested; x-axis: k, y-axis: time (sec)]
– Inventory: 2-keyword queries, max CCN=4, 5 databases
– second panel: 2-keyword queries, max CCN=4, |q|=2, 5 databases

Query Result Quality

[Figure: Pr@k vs. k, for k from 1 to 20; two panels: OR-semantic queries and AND-semantic queries]

Pr@k = the fraction of answers that appear in the “ideal” list
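The Pr@k metric can be sketched as below. The slide does not spell out the exact cutoff, so this version assumes that the top-k returned answers are compared against the top-k answers of the “ideal” list; the function name and the list-of-ids representation are also assumptions.

```python
def precision_at_k(returned, ideal, k):
    """Fraction of the first k returned answers that appear among the first k ideal answers."""
    if k <= 0:
        raise ValueError("k must be positive")
    ideal_top = set(ideal[:k])
    hits = sum(1 for answer in returned[:k] if answer in ideal_top)
    return hits / k


# Answers are represented by opaque ids for illustration.
returned_list = ["a", "b", "c", "d"]
ideal_list = ["a", "c", "e", "b"]
pr_at_2 = precision_at_k(returned_list, ideal_list, 2)
pr_at_4 = precision_at_k(returned_list, ideal_list, 4)
```

Because the hit count is divided by k, a list shorter than k is penalized for the answers it fails to return.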

Summary

Kite executes IR-style data integration
– performs some automatic preprocessing
– then immediately allows keyword querying

Relatively painless
– no manual work!
– no need to create global schema, to understand SQL

Can be very useful in many settings: e.g., on-the-fly, best-effort, for non-technical people
– enterprises, on the Web, need only a few answers
– emergency (e.g., hospital + police), need answers quickly

Future Directions

Incorporate user feedback
– interactive IR-style data integration

More efficient query processing
– large # of databases, network latency

Extend to other types of data
– XML, ontologies, extracted data, Web data

IR-style data integration is feasible and useful
– extends current works on keyword search over DB
– raises many opportunities for future work

BACKUP

Other Experiments

Join Discovery Accuracy
[Figure: accuracy (F1) on Inventory 1 to Inventory 5, comparing Join Discovery with Join Discovery + Schema Matching]

Schema matching helps improve join discovery algorithm drastically

[Figure: Kite over single database; x-axis: max CCN size, y-axis: time (sec)]

Kite also improves single-database keyword search algorithm mHybrid
