keyword search on form results

Click here to load reader

Post on 08-Jan-2016

51 views

Category:

Documents

3 download

Embed Size (px)

DESCRIPTION

Keyword Search on Form Results. Aditya Ramesh (Stanford) * S Sudarshan (IIT Bombay) Purva Joshi (IIT Bombay). * Work done at IIT Bombay. Keyword Search on Structured Data. Allows queries to be specified without any knowledge of schema Lots of papers over the past 13 years - PowerPoint PPT Presentation

TRANSCRIPT

  • *Keyword Search on Form Results Aditya Ramesh (Stanford)*S Sudarshan (IIT Bombay) Purva Joshi (IIT Bombay)*Work done at IIT Bombay

  • *Keyword Search on Structured DataAllows queries to be specified without any knowledge of schemaLots of papers over the past 13 yearsTree as answers, Entities/virtual documents as answers, ranking, efficient searchBut why has adoption in the real world remained elusive?Answers are not an a human usable form Users forced to navigate through schema in the answers

  • *Search on Enterprise Web ApplicationsUsers interact with data through applicationsApplications hide complexities of underlying schemaAnd present information in a human friendly fashion Applications have large numbers of formsHard for users to find information, built in search often incompleteForms sometimes map information only in one directione.g. student ID to name, but not from name to student IDNice talk motivating keyword search on enterprise Web applications by Duda et al, CIDR 2007http://univ.edu/acadrecords/studentinfo?ID=12345678 grade, contact, and other information

  • *Problem StatementSystem Model:Set of forms, each taking 0 or more parametersResult of a form = union of results of one or more parametrized queriesE.g. studentinfo form with parameter $ID displays name and grades of the student select ID, name from student where ID = $IDselect * from grades where ID = $IDKeyword search on form resultsgiven set of keywords, return (form ID, parameter) combinations whose result contains given keywordsRanked in a meaningful order

  • *Related WorkLots of papers on search (BANKS, Discover, DBXplorer, )Dont address presentation of resultsPrecis, Qunits, Object summaries Address presentation of information related to entitiesBut dont address search Predicate-based indexing (Duda et al. [CIDR 2007]) Materializes and indexes form results for all possible parameter valuesBut materialized results must be maintainedSame problem with virtual documents (Su and Widom [IDEAS05]) Efficient maintenance not discussed in prior workOur experimental results show high cost even with efficient incremental view maintenanceFind potentially relevant forms from a pre-generated set of formsChu et al. (SIGMOD 2009, VLDB 2010) But do not generate parameter values

  • *System Model + AssumptionsForm queries take parameters which come directly from form parametersOnly mandatory parameters, no optional parametersParameters prefixed with $: e.g. $Id, $deptE.g. namedept = $dept (prof) Query Q: maps parameters P to resultsInverted query IQ: maps keywords K to parameters P, s.t. Q(P) contains KSafety: inverted query may have infinite # of resultsQ: namedept > $dept (prof) Q: namedept = $dept Id=$Id (prof)

  • *Sufficient Conditions for SafetyRestrictions on form queries to ensure safetyEach parameter must be equated to some attributeE.g. r.aj = $Pi; r.aj is a called a parameter attributeAbove must appear as a conjunct in overall selection predicateSee paper for a few more restrictions for outerjoins and NOT IN/NOT Exists subqueries (antijoins)In some cases queries can be rewritten to satisfy above conditionsE.g. if parameter values for $P must appear in R(A), rewrite Q to Q A=$P (R)We handle some unsafe cases by using a * answer representation e.g. (Form 1, $dept = CS and $Id = *)

  • *Query Inversion 1:1 Keyword Independent Inverted Query (KIIQ)Intuition: Output parameter value along with resultfor all possible parameter valuesHow?: Drop parameter predicate, e.g. Id = $Id and add parameter attribute, e.g. Id, to projection listExample:Q= name Id=$Id (prof) KIIQ= name, Id (prof) Issue: what if intermediate operation blocks parameter attribute from reaching top of query?Selection/join: not an issueProjection: Just add parameter attribute to projection listAggregation, etc: will see later. 1 Acknowledgement: Idea of inversion arose during discussions with Surajit Chaudhuri

  • *Query Inversion 2:Keyword Dependent Inverted Query (IQ) Add selection on keyword, and output only parameter values IQ= $params(keyword-sels(KIIQ)) E.g.: Q= name Id=$Id (prof ) Keyword query= {John}KIIQ= Id (prof )IQ= Id (Contains((name, Id), John)(prof ))Contains((R.A1,R.A2,..),K) efficiently supported using text indicesParameter attributes like Id included in Contains even though if not in projection list, Multiple keyword: use intersectionE.g. K = {John, Smith}Id (Contains((name, Id), John)(prof )) Id (Contains((name, Id), Smith)(prof ))

  • *Queries With Multiple RelationsQ= name, teaches.ctitle ^ Id=$Id (prof teaches) Id and Name attributes of profKIIQ= Id,name, teaches.ctitle (prof teaches) IQ= IdContains((Id,name,teaches.ctitle), John ) ( (prof teaches))BUT most databases wont support keyword indexes across multiple relations, so we split intoId (Contains((Id,name), John ) Contains((teaches.ctitle), John ) ( (prof teaches)))Alternative using union more efficient in practiceId (Contains((Id,name), John ) ( (prof teaches))) U Id (Contains((teaches.ctitle), John ) ( (prof teaches)))Note: Contains predicate will usually get pushed below join by query optimizer

  • *Complex QueriesWe focus on creating KIIQKey intuition: pull parameter attributes to top after removing parameter selectionUsual way of converting KIIQ to IQ Pulling Parameter Attribute above AggregationE.g. Q= Asum(B) ( Id=$Id ( E)) KIIQ(Q) = A,Idsum(B) ( ( E)) IntersectionQ= Q1 Q2KIIQ(Q) = KIIQ(Q1) KIIQ(Q2)Note that parameters may be different for Q1 and Q2

  • *Union Queries and Multiple Query FormsForms with multiple queriesForm result = union of query results Case of union queries is similarE.g. Given Id as parameter, print name of professor and titles of courses taught name Id=$Id (prof ) and ctitle Id=$Id (teaches) Case 1: Single keyword, same parameters for all queriesIQ = union of IQ for each query E.g. Id Contains((Id,name), John) (prof ) U Id Contains((Id,ctitle), John) , (teaches ) Does not work if different sets of parameters

  • *Multiple Query: Case 2Single keyword, different parameters across queriesE.g. name Id=$Id (prof ) and ctitle dept=$dept (teaches ) Define dont care value : * (matches all values) Id,* Contains((Id,name), John) (prof ) U *,dept Contains((dept,ctitle), John) (teaches )Multiple keyword, different parametersDo as above for each keyword: IQk1, IQk2 Intersect results: IQk1 IQk2 Intersection not trivial due to *Two approaches: KAT and QAT

  • *KAT: Keyword at a TimeGiven queries Qi, Keywords Kj, and parameters PkFor each Qi, Kj, let QiKj = result of inverted query for Qi on Kj, with * for each parameter Pk not in QiEg: Q1Kj: Id,Dept,* Q2Kj: Id, *, YearThen combine answers, but using binding patternsUsing joins on non-* parametersQ1K1-Q1K2: Join on Id, DeptQ1K1-Q2K1, Q1K2-Q2K1: Join on IdQ2K1-Q2K2: Join on Id, YearFurther details in paper

    Bug in our implementation generated huge SQL query (100K line 14 MB) which PostgreSQL executed in around 90 secs.

  • *QAT: Query at a TimeGiven queries Qi, and Keywords KjCreate result QiKj for each keyword/query combo.For each Qi combine results for all Kj, using bitmapE.g. R1: (Id, Dept, bitmap), Bitmap: 1 bit per keyword R2:(Id, Year, bitmap)Then combine answers, but using binding patternsCase 1: 2 queries: R = R1 R2, and merge bitmapsCase 2: All queries have same parametersAgain use full outerjoin and merge bitmapsGeneral case: R = R1 U+ R2 U + R1 R2U+ denotes outer union; merge bitmaps as before Finally, filter out results using bitmapDetails in paper

  • *Other CasesSubqueries:Trivial if subqueries dont have parametersIN/EXISTS/SOME subqueriesBasic approach: decorrelate subqueries where possibleNOT IN, NOT EXISTS, ALL subqueries (antijoin) disallow parameters in such subqueries (not safe)Static/application generated text in formsRemove from keyword query if present in form

  • *RankingMotivation for rankingForm 1: Courses taught by particular instructorForm 2: Courses in a particular departmentForm result size much largerForm 3: Courses taken by particular studentForm result is small, but many parameter valuesWe rank forms, and rank parameters within formsRanking of forms No rankingAvg: Average size of form result (precomputed) AvgMult: Avg form result size * Number of distinct result parameter values

    Ranking of parameters within form based on heuristicsE.g. current user ID/year/semester, department of current user, ...

  • *Performance StudyIIT-Bombay Database ApplicationReal application90 forms,1 GB of dataQueries used: model realistic goals for students and facultyBasic desktop machine with low end disk and generic 64 GB SATA MLC Flash disk

  • *Result/Ranking QualityFormulated several queries seeking information from academic databaseFound position of form returning desired answerAverage position: 2.42 for AVG, 1.83 for AVGMULTMax position: 6 for AVG, 3 for AVGMULTHeuristics for ranking parameters within form worked wellNeed to generalize heuristics: future work

  • *Scalability with #Keywords + Hard Disk vs FlashSet of 5 keywordsfor N < 5 keywords, avg of all subsets of size NCold cache: restart DB, flush file system cacheRecommend flash storage for best performance

  • *KAT vs QAT: QAT slightly faster

    Keyword Performance: KAT vs QAT

  • *Scalability With #Forms

    Sublinear scaling with #formsPruning optimization: eliminate query if some keyword is not present in any of its relationsWorks very well

  • *Form Result MaterializationOverheads of form materialization approachImplemented incremental view maintenance for form queries on updates to underlying relationsTime overhead of 1 second on flash for adding course registrations, which normall