Post on 31-Jan-2016

Combining Keyword Search and Forms for Ad Hoc Querying of Databases

Eric Chu, Akanksha Baid, Xiaoyong Chai, AnHai Doan, Jeffrey Naughton

Computer Sciences Department, University of Wisconsin-Madison

{ericc, baid, xchai, anhai, naughton}@cs.wisc.edu

Contents

• Motivation
• Query Forms
• Generating Forms
• Keyword Search for Forms
• Displaying Returned Forms
• Experimental Analysis
• Related Work and References

Traditional Access Methods for Databases

Relational/XML databases are structured or semi-structured, with rich metadata. They are typically accessed through structured query languages such as SQL.

• Advantages: high-quality results

• Disadvantages:
– Query languages: long learning curves
– Schemas: complex

Result: small user population

“The usability of a database is as important as its capability.”

Motivation

Information discovery in databases requires:
• Knowledge of the schema
• Knowledge of a query language (for example, SQL)

Challenges?
• Hard for users who are uncomfortable with a formal query language.

What is the solution? Form-based interfaces and keyword search.

Approach
• The user submits a keyword query
• The system returns a ranked list of relevant forms
• The user selects one of the forms and builds a structured query

Relational Schema of DBLife

Entity tables:

person(id, name, homepage, title, group, organization, country)

publication(id, name, booktitle, year, pages, cites, clink, link)

topic(id, name)

organization(id, name)

conference(id, name)

Relationship tables:

related_people(rid, pid1, pid2, strength)

related_topic(rid, pid, tid, strength)

related_organization(rid, pid, oid, strength)

give_tutorial(rid, pid, cid)

give_conf_talk(rid, pid, cid)

give_org_talk(rid, pid, oid)

serve_conf(rid, pid, cid, assignment)

write_pub(rid, pid, pub_id, position)

co_author(rid, pid1, pid2, strength)

Query Forms

A query form is an interface for a query template.

Example: a completed form over the person relation of DBLife.

• The query represented is

SELECT * FROM person WHERE organization = 'Microsoft Research'

• The general template for the above form is

SELECT * FROM person WHERE name op value AND homepage op value AND title op value AND group op value AND organization op value AND country op value
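As a rough illustration of how a filled-in form could be turned into a query, the sketch below builds a parameterized SELECT from the fields the user completed. The function name `build_query`, the operator whitelist, and the sample rows are assumptions made for this example and do not come from DBLife or the paper. Identifiers are quoted because `group` is an SQL keyword.

```python
import sqlite3

# Attributes of the person relation that the form exposes (from the DBLife schema).
PERSON_ATTRS = ["name", "homepage", "title", "group", "organization", "country"]
ALLOWED_OPS = {"=", "<>", "<", ">", "<=", ">=", "LIKE"}  # assumed operator set


def build_query(filled):
    """Turn {attr: (op, value)} into a parameterized SELECT over person.

    Fields the user leaves blank are omitted from the WHERE clause.
    """
    clauses, params = [], []
    for attr in PERSON_ATTRS:
        if attr not in filled:
            continue
        op, value = filled[attr]
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op}")
        clauses.append(f'"{attr}" {op} ?')  # quote identifiers: group is a keyword
        params.append(value)
    sql = "SELECT * FROM person"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


# Hypothetical sample data, only to show the query running.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE person (id INTEGER, name TEXT, homepage TEXT, title TEXT, '
    '"group" TEXT, organization TEXT, country TEXT)'
)
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "Alice", None, None, None, "Microsoft Research", "USA"),
        (2, "Bob", None, None, None, "Example University", "USA"),
    ],
)

sql, params = build_query({"organization": ("=", "Microsoft Research")})
rows = conn.execute(sql, params).fetchall()
```

Passing values as parameters rather than splicing them into the SQL string keeps user input from being interpreted as SQL.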

How to generate forms?

Step 1: Specify a subset of SQL, called SQL', as the target language for the queries supported by forms.

SQL':

Let B be the basic query block

SELECT select-list
FROM from-list
WHERE qualification
[GROUP BY grouping-list
HAVING group-qualification]

A query in SQL' is either a single block B or two blocks combined with UNION or INTERSECT.

Note: Nested queries are not allowed in the FROM and WHERE clauses.

Step 2: Determine a set of skeleton templates specifying the main clauses and join conditions, based on the chosen subset of SQL and the database schema SD.

Let Ri be a relation following a relation schema Si ∈ SD.

Case 1: Ri does not reference other relations with foreign keys.

SELECT * FROM Ri WHERE predicate-list

Case 2: Ri references other relations with foreign keys.

SELECT * FROM <Ri and the relations it references>
WHERE <join conditions, plus an “attr op value” predicate for each attribute>

Example:

Relation: give_tutorial(rid, pid, cid)
Relations referenced: person and conference
person(id, name, homepage, title, group, organization, country)
conference(id, name)

Skeleton template:

SELECT * FROM give_tutorial t, person p, conference c WHERE t.pid = p.id AND t.cid = c.id AND p.name op expr AND … AND c.name op expr
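The two cases of Step 2 can be sketched as a small generator over a schema description. This is an illustrative sketch, not the authors' implementation: the `FOREIGN_KEYS` map is inferred from the join conditions in the example, the alias scheme (first letter of the referenced table) does not handle self-joins such as related_people, and key columns are assumed not to receive predicates.

```python
# Schema fragment from the slides: table -> attributes.
SCHEMA = {
    "person": ["id", "name", "homepage", "title", "group", "organization", "country"],
    "conference": ["id", "name"],
    "give_tutorial": ["rid", "pid", "cid"],
}
# Foreign keys: table -> {column: referenced table}. Assumed from the join
# conditions shown in the example; the slides do not list keys explicitly.
FOREIGN_KEYS = {"give_tutorial": {"pid": "person", "cid": "conference"}}
KEY_COLUMNS = {"id", "rid"}  # assumption: surrogate keys get no predicate


def skeleton_template(table):
    """Return the skeleton template string for one relation."""
    fks = FOREIGN_KEYS.get(table, {})
    if not fks:
        # Case 1: no foreign keys, one predicate per non-key attribute.
        preds = [f"{a} op value" for a in SCHEMA[table] if a not in KEY_COLUMNS]
        return f"SELECT * FROM {table} WHERE " + " AND ".join(preds)

    # Case 2: join the relation with every relation it references.
    from_list = [f"{table} t"]
    joins = []
    preds = [
        f"t.{a} op value"
        for a in SCHEMA[table]
        if a not in KEY_COLUMNS and a not in fks
    ]
    for col, ref in fks.items():
        alias = ref[0]  # simplistic alias; collides on self-joins
        from_list.append(f"{ref} {alias}")
        joins.append(f"t.{col} = {alias}.id")
        preds += [f"{alias}.{a} op value" for a in SCHEMA[ref] if a not in KEY_COLUMNS]
    return (
        "SELECT * FROM " + ", ".join(from_list)
        + " WHERE " + " AND ".join(joins + preds)
    )


tutorial_template = skeleton_template("give_tutorial")
conference_template = skeleton_template("conference")
```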

Step 3: Finalize templates by modifying skeleton templates based on form specificity.

How specific or general do we want the forms to be?

Form specificity has two dimensions: form complexity and data specificity.

[Figure: form complexity and data specificity, starting from the initial state of the form]

Adjusting form specificity:
• Increase its complexity by adding more parameters.
• Decrease its complexity by removing parameters.
• Increase data specificity by binding more existing parameters to constants.
• Decrease data specificity by unbinding parameters with fixed values.

Approach followed in this paper:

To adjust form complexity, divide SQL' into 4 query classes:
• SELECT: the basic SELECT-FROM-WHERE construct
• AGGR: SELECT with aggregation
• GROUP: AGGR with GROUP BY and HAVING clauses
• UNION-INTERSECT: a UNION or INTERSECT of two SELECT queries

To adjust data specificity:
• Bind the “value” fields of the “attr op value” predicates in the WHERE clause to data values.

Step 4: Map each template to a form

Standard form components:
• Label
• Drop-down list
• Input box
• Button

Keyword Search for Forms

Basic idea: keyword search is used to find relevant forms, which are then used to pose structured queries.

Basic approach:
• Naïve AND: returns the forms containing all the terms from the keyword query.
• Naïve OR: returns the forms containing at least one term from the keyword query.

Drawback? The keyword query must contain schema term(s).
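The two naïve strategies amount to set intersection and set union over an inverted index from terms to forms. The sketch below uses a hypothetical set of three forms and treats the text on each form as a bag of schema terms; the form ids and term sets are made up for illustration.

```python
# Hypothetical forms: form-id -> terms that appear on the form (schema terms only).
FORMS = {
    "f_person": {"person", "name", "organization", "country"},
    "f_conference": {"conference", "name"},
    "f_serve_conf": {"serve_conf", "person", "conference", "assignment", "name"},
}


def build_form_index(forms):
    """Invert form -> terms into term -> set of form-ids."""
    index = {}
    for form_id, terms in forms.items():
        for term in terms:
            index.setdefault(term, set()).add(form_id)
    return index


FORM_INDEX = build_form_index(FORMS)


def naive_or(query):
    """Forms containing at least one query term."""
    result = set()
    for term in query.lower().split():
        result |= FORM_INDEX.get(term, set())
    return result


def naive_and(query):
    """Forms containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(FORMS)
    for term in terms:
        result &= FORM_INDEX.get(term, set())
    return result
```

With these forms, a query that mixes a data term with a schema term, such as "Widom conference", returns nothing under AND, and under OR returns every form mentioning "conference". That is the drawback noted above.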

Approaches proposed in this paper:

Check whether data terms from the user query appear in the database.

If yes, modify the query with the relevant schema terms.

• Double Index OR: evaluation is done using OR semantics.

• Double Index AND: evaluation is done using AND semantics.

Example:

Information need: find the conferences for which a researcher named “Widom” has served on the program committee.

Keyword query: “Widom Conference”

Here, data term = “Widom” and schema term = “Conference”.

Results obtained:
• Naïve AND: no forms are returned, as “Widom” does not appear on any form.
• Naïve OR: ignores “Widom” and returns all forms that contain “Conference”.
• DI OR: the rewritten query is “Widom person conference”, as “Widom” appears in the person table, and it is evaluated with OR semantics.
• DI AND: two queries are generated, “person conference” and “widom conference”. Each is evaluated with AND semantics, and the union of the results is returned.

DBLife tables used in this example:

person(id, name, homepage, title, group, organization, country)

conference(id, name)

Double Index OR Implementation

Indexes used:
• DataIndex: takes a data term and returns a set of <table, tuple-id> pairs.
• FormIndex: takes a term and returns a set of form-ids.

Input: keyword query
Output: set of form-ids

Step 1:
• Probe DataIndex with each query term qi in a query Q.
• If qi is a data term, DataIndex will return a set of <table, tuple-id> pairs.
• Add each table to the set FormTerms.
• Add qi to FormTerms.

Step 2:
• Probe FormIndex with the terms in FormTerms.
• Return the forms containing at least one of these terms.

DI OR

Input: A keyword query Q = [q1 q2 ... qn]
Output: A set of form-ids F'
Algorithm:
FormTerms = {}, F' = {}
// Replace any data terms with table names
for each qi ∈ Q
    if DataIndex(qi) returns <table, tuple-id> pairs
        Add each table to FormTerms
    Add qi to FormTerms // qi could be a form term
// Get form-ids based on FormTerms
FormIndex(FormTerms) => F' // OR semantics
return F'
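The DI OR pseudocode can be sketched with two dictionaries standing in for the two indexes. The contents of `DATA_INDEX` and `FORM_INDEX`, including the tuple id 17 and the form ids, are hypothetical; a real system would back both with a full-text index.

```python
# Hypothetical DataIndex: data term -> set of (table, tuple_id) pairs.
DATA_INDEX = {"widom": {("person", 17)}}

# Hypothetical FormIndex: term -> set of form-ids whose text contains the term.
FORM_INDEX = {
    "person": {"f_person", "f_serve_conf"},
    "conference": {"f_conference", "f_serve_conf"},
}


def di_or(query):
    """Double Index OR: expand data terms to table names, then OR-lookup forms."""
    form_terms = set()
    for q in query.lower().split():
        # If q is a data term, add every table it occurs in.
        for table, _tuple_id in DATA_INDEX.get(q, ()):
            form_terms.add(table)
        # q itself could also be a form (schema) term.
        form_terms.add(q)

    forms = set()
    for term in form_terms:  # OR semantics: any matching term qualifies a form
        forms |= FORM_INDEX.get(term, set())
    return forms
```

For "Widom conference" the expanded term set is {widom, person, conference}, matching the rewritten query in the example.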

Double Index AND

• Generate all possible queries that result from replacing user-supplied data terms with schema terms.

• Use AND semantics and return the union of the query results.

Problem?
• Performing a single AND query with all the terms in FormTerms is wrong.

Why is this so?
• A data term may appear in multiple unrelated tables, so that no form would contain all of these tables.

Concept of a bucket
• The query “q1 AND q2” is evaluated as “a AND b” for each a ∈ Sq1 and b ∈ Sq2, where Sqi is a “bucket” containing the form terms associated with qi, and a and b are form terms from Sq1 and Sq2 respectively.

Double Index AND Implementation

Input: keyword query
Output: set of form-ids

Step 1:
• For each qi, the bucket Sqi is initially empty.
• If qi is a data term, DataIndex will return <table, tuple-id> pairs.
• For each table, add the table to Sqi and FormTerms.
• Add qi to Sqi and FormTerms.

Step 2:
• Generate and add to SQ' all distinct queries, each of which takes one term from each Sqi.
• For each query in SQ', probe the FormIndex and retrieve the forms that contain all terms in the query.

DI AND

Input: A keyword query Q = [q1 q2 ... qn]
Output: A set of form-ids F'
Algorithm:
FormTerms = {}, F' = {}
// Replace any data terms with table names
for each qi ∈ Q
    Sqi = {} // Bucket for qi
    if DataIndex(qi) returns <table, tuple-id> pairs
        for each table
            if table ∉ FormTerms
                Add table to Sqi and FormTerms
    if qi ∉ FormTerms
        Add qi to Sqi and FormTerms
// Get form-ids based on Sqi
SQ' = EnumQueries(∀Sqi) // Enumerate all unique queries,
                        // each having one term from each Sqi
for each Q' ∈ SQ'
    FormIndex(Q') => F' // AND semantics on FormIndex
return F'
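A minimal sketch of DI AND follows, with `itertools.product` playing the role of EnumQueries. The index contents are hypothetical. For brevity the sketch builds each bucket independently and omits the pseudocode's FormTerms de-duplication across buckets, so it is a simplification rather than a line-by-line transcription.

```python
from itertools import product

# Hypothetical DataIndex: data term -> set of (table, tuple_id) pairs.
DATA_INDEX = {"widom": {("person", 17)}}

# Hypothetical forms: form-id -> terms that appear on the form.
FORM_TERMS = {
    "f_person": {"person", "name"},
    "f_conference": {"conference", "name"},
    "f_serve_conf": {"serve_conf", "person", "conference"},
}


def forms_with_all(terms):
    """AND semantics: forms whose text contains every term."""
    wanted = set(terms)
    return {fid for fid, form_terms in FORM_TERMS.items() if wanted <= form_terms}


def di_and(query):
    """Double Index AND: one bucket per query term, union of AND lookups."""
    buckets = []
    for q in query.lower().split():
        # Bucket = tables in which q occurs as data, plus q itself.
        bucket = {table for table, _tuple_id in DATA_INDEX.get(q, ())}
        bucket.add(q)
        buckets.append(sorted(bucket))
    if not buckets:
        return set()

    forms = set()
    # Enumerate every rewritten query that takes one term from each bucket.
    for rewritten in product(*buckets):
        forms |= forms_with_all(rewritten)
    return forms
```

For "Widom conference" the buckets are {person, widom} and {conference}, giving the two rewritten queries "person conference" and "widom conference" from the example. Only the first matches a form.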

Example:
• The user wants to search for a person, “John Doe”.
• “John Doe” is present in the person table but is not involved in any relationship.

What will be the output?
The forms over the person table, plus the forms over tables that reference person, will be returned.

User action:
The user enters “John Doe” in the name field of a form that is a join of, say, the person and conference tables.

Output?
No results are returned. Such forms are DEAD FORMS.

Double Index Join

• Checks whether a form will return an answer if it is instantiated with the data terms in the user query.

How is the check performed?

Step 1:
• Given a keyword query Q, probe DataIndex with each query term qi.
• When qi is a data term that leads to a set of <table, tuple-id> pairs, look up each table T in a schema graph for SD and find the reference tables that reference T.
• For each reference table, check whether it contains any tuple-id of T.
• If not, retrieve the forms that contain both T and refTable, and record these “dead” forms in a set X.

Step 2:
• Return F' - X. This filters out the dead forms.

DI Join

Input: A keyword query Q = [q1 q2 ... qn]
Output: A set of form-ids F'
Algorithm:
FormTerms = {}, F' = {}, X = {}
for each qi ∈ Q
    Sqi = {}
    if DataIndex(qi) returns <table, tuple-id> pairs
        for each table T
            let I be the set of tuple-ids from T
            if T ∉ FormTerms
                Add T to Sqi and FormTerms
            SchemaGraph(T) returns refTables
            for each refTable
                if DataIndex(refTable:tid) is NULL for every tid ∈ I
                    FormIndex(T AND refTable) => X
    if qi ∉ FormTerms
        Add qi to Sqi and FormTerms
// Get form-ids based on form terms
SQ' = EnumQueries(∀Sqi)
for each Q' ∈ SQ'
    FormIndex(Q') => F'
return F' - X
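The dead-form check can be sketched by extending the DI AND idea with a schema graph and a lookup of which tuples each referencing table actually points to. Everything in the data structures below is hypothetical: the tuple ids, the form ids, and the `REFERENCED_IDS` map, which stands in for the `DataIndex(refTable:tid)` probe in the pseudocode. As in the DI AND sketch, buckets are built independently per term.

```python
from itertools import product

# Hypothetical DataIndex: data term -> set of (table, tuple_id) pairs.
DATA_INDEX = {"doe": {("person", 42)}, "widom": {("person", 17)}}

# Schema graph: table T -> tables that reference T via a foreign key.
SCHEMA_GRAPH = {"person": ["serve_conf", "give_tutorial"]}

# Stand-in for DataIndex(refTable:tid): (refTable, T) -> ids of T's tuples
# that refTable points to. Person 42 appears in no relationship.
REFERENCED_IDS = {
    ("serve_conf", "person"): {17},
    ("give_tutorial", "person"): {17},
}

# Hypothetical forms: form-id -> terms that appear on the form.
FORM_TERMS = {
    "f_person": {"person"},
    "f_serve_conf": {"serve_conf", "person", "conference"},
    "f_give_tutorial": {"give_tutorial", "person", "conference"},
}


def forms_with_all(terms):
    """AND semantics: forms whose text contains every term."""
    wanted = set(terms)
    return {fid for fid, form_terms in FORM_TERMS.items() if wanted <= form_terms}


def di_join(query):
    """Double Index Join: DI AND results minus forms that would come back empty."""
    buckets, dead = [], set()
    for q in query.lower().split():
        bucket = {q}
        ids_by_table = {}
        for table, tuple_id in DATA_INDEX.get(q, ()):
            ids_by_table.setdefault(table, set()).add(tuple_id)
        for table, ids in ids_by_table.items():
            bucket.add(table)
            for ref_table in SCHEMA_GRAPH.get(table, []):
                # No matching tuple is referenced: joining T with ref_table is empty.
                if not (REFERENCED_IDS.get((ref_table, table), set()) & ids):
                    dead |= forms_with_all([table, ref_table])
        buckets.append(sorted(bucket))
    if not buckets:
        return set()

    forms = set()
    for rewritten in product(*buckets):
        forms |= forms_with_all(rewritten)
    return forms - dead
```

With this data, a query for "Doe" keeps only the form over person, while "Widom" keeps all three forms because tuple 17 is referenced by both relationship tables.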

Displaying Returned Forms

How are the returned forms ranked?

• Based on the scoring function of the Lucene index.

• The Lucene score for a query Q and a document D is:

$$\mathrm{score}(Q, D) = \mathrm{coord}(Q, D) \cdot \mathrm{queryNorm}(Q) \cdot \sum_{t \in Q} \left( \mathrm{tf}(t \in D) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t, D) \right)$$
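To see how the factors of the score interact, here is a simplified sketch that computes the formula directly. It assumes the definitions of Lucene's classic TF-IDF similarity (tf as the square root of the term frequency, idf as 1 + ln(numDocs / (docFreq + 1)), coord as the fraction of query terms matched, queryNorm as the inverse square root of the summed squared idf values), sets all boosts to 1, and uses 1 / sqrt(document length) for norm without Lucene's lossy norm encoding. It is not Lucene code, and the two sample documents are made up.

```python
import math


def lucene_like_score(query_terms, doc_terms, doc_freq, num_docs):
    """Simplified classic TF-IDF score; boosts are fixed at 1.0."""
    if not query_terms or not doc_terms:
        return 0.0

    def idf(term):
        return 1.0 + math.log(num_docs / (doc_freq.get(term, 0) + 1))

    matched = [t for t in query_terms if t in doc_terms]
    coord = len(matched) / len(query_terms)          # fraction of query terms found
    query_norm = 1.0 / math.sqrt(sum(idf(t) ** 2 for t in query_terms))
    norm = 1.0 / math.sqrt(len(doc_terms))           # length normalization only

    total = 0.0
    for t in matched:
        tf = math.sqrt(doc_terms.count(t))
        total += tf * idf(t) ** 2 * 1.0 * norm       # 1.0 stands for t.getBoost()
    return coord * query_norm * total


# Two hypothetical forms treated as documents (lists of terms).
docs = {
    "f_serve_conf": ["person", "conference", "name"],
    "f_person": ["person", "name"],
}
num_docs = 3  # assumed collection size
doc_freq = {"person": 2, "conference": 1, "name": 2}

query = ["person", "conference"]
scores = {fid: lucene_like_score(query, terms, doc_freq, num_docs)
          for fid, terms in docs.items()}
```

In this toy collection the form that matches both query terms scores higher than the one that matches only "person", mainly because of the coord factor.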

Problem? “Sister forms”

Illustration: user query “Widom”. Result of the query: [screenshot not preserved]

It is impossible to find what the user is looking for.

What is the solution? Grouping forms.

Approach 1:
• Group consecutive sister forms with the same score into first-level groups
• Group forms by the four query classes
• Display the classes in the order SELECT, AGGR, GROUP, and UNION-INTERSECT

Result of the “Widom” query: [screenshot not preserved]

Problem? Non-consecutive sister forms end up in different first-level groups that have the same description.

Solution? Approach 2:
• First group the returned forms by their table
• Order the groups by the sum of their scores
• Advantage: no repetition
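Approach 2 can be sketched as a group-by followed by a sort on the summed scores. The input format (form id, table label, score) and the sample scores are assumptions for illustration. Sorting forms inside each group by score is also an assumption, since the slides only specify the order of the groups.

```python
from collections import defaultdict


def group_by_table(ranked_forms):
    """Group (form_id, table, score) triples by table, best group first."""
    groups = defaultdict(list)
    for form_id, table, score in ranked_forms:
        groups[table].append((form_id, score))

    # Order groups by the sum of their members' scores, highest first.
    ordered = sorted(
        groups.items(),
        key=lambda item: sum(score for _form_id, score in item[1]),
        reverse=True,
    )
    # Within a group, list higher-scoring forms first (assumed, not from the slides).
    return [
        (table, sorted(forms, key=lambda fs: fs[1], reverse=True))
        for table, forms in ordered
    ]


# Hypothetical ranked list returned for a keyword query.
results = [
    ("f1", "person", 0.9),
    ("f2", "serve_conf", 0.8),
    ("f3", "person", 0.7),
    ("f4", "serve_conf", 0.3),
    ("f5", "give_tutorial", 0.5),
]
grouped = group_by_table(results)
```

Because each table label appears exactly once in the output, sister forms can no longer be split across several groups with the same description.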

Experimental Analysis

Experimental setup
• Data set: DBLife
• Generated a set of forms, F1
• 14 skeleton templates, one for each of the 5 entity tables and 9 relationship tables
• Created templates per skeleton: 1 SELECT, 5 AGGR, 6 GROUP, and 2 UNION-INTERSECT (14 in total), so F1 had 14 × 14 = 196 forms
• A user study was conducted with 7 graduate students, who found answers for 6 information needs

Experimental Analysis

• Comparing Naïve, Double-Index, and Double-Index-Join
• Ranking and displaying forms
• Which is the best approach? Why? Let’s find out.

Related Work and References

• Jayapandian [11] proposed automatic form generation for a database based on a sample query workload.

[11] M. Jayapandian, H. V. Jagadish. Automating the Design and Construction of Query Forms. ICDE 2006.

• Liu [14] proposed to automatically distinguish between schema terms and value terms in a keyword query.

[14] F. Liu, C. Yu, W. Meng, A. Chowdhury. Effective Keyword Search in Relational Databases. SIGMOD 2006.

• BANKS [3] proposed supporting the “attribute = value” construct in keyword queries.

[3] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, S. Sudarshan. Keyword Searching and Browsing in Databases using BANKS. ICDE 2002.

• Luo [16] proposed to detect empty-result queries by “remembering” results from previously executed empty-result queries.

[16] G. Luo. Efficient Detection of Empty-Result Queries. VLDB 2006.

Thank You!
