Probabilistic Ranking of Database Query Results

Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International University
Gerhard Weikum, MPI Informatik

Presented by
Raghunath Ravi
Sivaramakrishnan Subramani
CSE@UTA

Roadmap

Motivation
Key Problems
System Architecture
Construction of Ranking Function
Implementation
Experiments
Conclusion and open problems

Motivation

Many-answers problem
Two alternative solutions:
◦ Query reformulation
◦ Automatic ranking
Apply probabilistic model in IR to DB tuple ranking

Example – Realtor Database

House Attributes: Price, City, Bedrooms, Bathrooms, SchoolDistrict, Waterfront, BoatDock, Year

Query: City = 'Seattle' AND Waterfront = TRUE

Too Many Results!

Intuitively, houses with lower Price, more Bedrooms, or a BoatDock are generally preferable

Rank According to Unspecified Attributes

Score of a Result Tuple t depends on:
Global Score: Global Importance of Unspecified Attribute Values [CIDR2003]
◦ E.g., Newer Houses are generally preferred
Conditional Score: Correlations between Specified and Unspecified Attribute Values
◦ E.g., Waterfront ⇒ BoatDock
◦ E.g., Many Bedrooms ⇒ Good School District

Key Problems

Given a Query Q, how to combine the Global and Conditional Scores into a Ranking Function?
◦ Use Probabilistic Information Retrieval (PIR).
How to calculate the Global and Conditional Scores?
◦ Use Query Workload and Data.

System Architecture

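PIR Review

Bayes' Rule:

$$p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}$$

Product Rule:

$$p(a, b \mid c) = p(a \mid c)\, p(b \mid a, c)$$

Document (Tuple) t, Query Q
R: Relevant Documents
$\bar{R} = D - R$: Irrelevant Documents

$$\mathrm{Score}(t) = \frac{p(R \mid t)}{p(\bar{R} \mid t)} = \frac{p(t \mid R)\, p(R) / p(t)}{p(t \mid \bar{R})\, p(\bar{R}) / p(t)} \propto \frac{p(t \mid R)}{p(t \mid \bar{R})}$$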

Vagelis Hristidis: Let's see how we can create a ranking function by adapting PIR techniques to our problem.

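Adaptation of PIR to DB

Tuple t is considered as a document
Partition t into t(X) and t(Y), the values of the attributes specified in the query (X) and of the unspecified attributes (Y)
t(X) and t(Y) are written as X and Y
Starting from the initial scoring function, derive step by step until the final ranking function is obtained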

Preliminary Derivation

Limited Independence Assumptions

Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed.

$$p(X \mid C) = \prod_{x \in X} p(x \mid C)$$

$$p(Y \mid C) = \prod_{y \in Y} p(y \mid C)$$

Continuing Derivation
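
One possible route from the PIR score to the ranking formula, sketched under two assumptions that the slide text does not state: the irrelevant set is approximated by the whole database ($\bar{R} \approx D$), and every relevant tuple satisfies the query, so that $p(Y \mid X, R) = p(Y \mid R)$. Factors that depend only on the query values X are the same for every result tuple and are dropped at each step.

```latex
\begin{align*}
\mathrm{Score}(t)
  &\propto \frac{p(t \mid R)}{p(t \mid D)}
   = \frac{p(X, Y \mid R)}{p(X, Y \mid D)}
   = \frac{p(X \mid R)\, p(Y \mid X, R)}{p(X \mid D)\, p(Y \mid X, D)}
   && \text{(product rule)} \\
  &\propto \frac{p(Y \mid R)}{p(Y \mid X, D)}
   && \text{(drop factors in $X$ only)} \\
  &= \frac{p(Y \mid R)\, p(X \mid D)}{p(X \mid Y, D)\, p(Y \mid D)}
   \propto \frac{p(Y \mid R)}{p(Y \mid D)} \cdot \frac{1}{p(X \mid Y, D)}
   && \text{(Bayes' rule)} \\
  &\propto \prod_{y \in Y} \frac{p(y \mid R)}{p(y \mid D)}
     \prod_{y \in Y} \prod_{x \in X} \frac{1}{p(x \mid y, D)}
   && \text{(limited independence, Bayes' rule)}
\end{align*}
```

In the last step, limited independence gives $p(X \mid Y, D) = \prod_{x \in X} p(x \mid Y, D)$, and applying Bayes' rule to each factor yields $p(x \mid Y, D) \propto \prod_{y \in Y} p(x \mid y, D)$ up to a term that depends on x alone.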

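Pre-computing Atomic Probabilities in Ranking Function

$$\mathrm{Score}(t) \propto \prod_{y \in Y} \frac{p(y \mid R)}{p(y \mid D)} \prod_{y \in Y} \prod_{x \in X} \frac{1}{p(x \mid y, D)}$$

Use Workload for the term conditioned on R; use Data for the terms conditioned on D.

Quantity       Estimate
p(y | W)       Relative frequency in W
p(y | D)       Relative frequency in D
p(x | y, W)    (# of tuples in W that contain x, y) / total # of tuples in W
p(x | y, D)    (# of tuples in D that contain x, y) / total # of tuples in D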
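
A minimal Python sketch of how these atomic quantities could be estimated and combined. Several things here are assumptions rather than content of the slides: the function names `atomic_probabilities` and `score`, treating each workload query as the set of attribute values it specifies, estimating p(x | y) as count(x, y) / count(y) (the usual conditional relative frequency, whereas the slide states the total number of tuples as the denominator), the workload-based form of the score in which R is approximated by W (a product of p(y | W) / p(y | D) and p(x | y, W) / p(x | y, D) ratios), and the small constant `eps` substituted for values never seen together.

```python
from collections import Counter
from itertools import permutations


def atomic_probabilities(rows):
    """Estimate p(v) and p(x | y) by relative frequency over `rows`.

    Each row is a dict {attribute: value}. Returns (p, p_cond) where the
    keys of p are (attribute, value) pairs and the keys of p_cond are
    ((attr_x, val_x), (attr_y, val_y)) pairs.
    """
    n = len(rows)
    single = Counter()  # number of rows containing a given (attr, value)
    pair = Counter()    # number of rows containing both x and y
    for row in rows:
        items = list(row.items())
        single.update(items)
        pair.update(permutations(items, 2))
    p = {v: c / n for v, c in single.items()}
    # Assumption: conditional estimate = count(x, y) / count(y).
    p_cond = {(x, y): c / single[y] for (x, y), c in pair.items()}
    return p, p_cond


def score(row, query, workload_stats, data_stats, eps=1e-6):
    """Assumed workload-based score: global ratios times conditional ratios."""
    p_w, p_w_cond = workload_stats
    p_d, p_d_cond = data_stats
    specified = list(query.items())                                 # X values
    unspecified = [(a, v) for a, v in row.items() if a not in query]  # Y values
    s = 1.0
    for y in unspecified:
        s *= p_w.get(y, eps) / p_d.get(y, eps)            # global part
        for x in specified:
            s *= p_w_cond.get((x, y), eps) / p_d_cond.get((x, y), eps)
    return s


# Hypothetical toy data, loosely following the realtor example.
database = [
    {"City": "Seattle", "Waterfront": True, "BoatDock": True},
    {"City": "Seattle", "Waterfront": True, "BoatDock": False},
    {"City": "Seattle", "Waterfront": False, "BoatDock": False},
    {"City": "Kirkland", "Waterfront": False, "BoatDock": False},
]
workload = [
    {"City": "Seattle", "Waterfront": True, "BoatDock": True},
    {"City": "Seattle", "Waterfront": True, "BoatDock": True},
    {"City": "Seattle", "Waterfront": True},
    {"City": "Seattle", "BoatDock": False},
]
query = {"City": "Seattle", "Waterfront": True}
w_stats = atomic_probabilities(workload)
d_stats = atomic_probabilities(database)
for house in database[:2]:
    print(house, score(house, query, w_stats, d_stats))
```

On this toy data the house with a boat dock receives the higher score, because boat docks are requested together with waterfront far more often in the workload than they occur in the data.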

Architecture of Ranking Systems

Scan Algorithm

Preprocessing - Atomic Probabilities Module
◦ Computes and Indexes the Quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for All Distinct Values x and y
Execution
◦ Select Tuples that Satisfy the Query
◦ Scan and Compute Score for Each Result-Tuple
◦ Return Top-K Tuples
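
The execution phase of the Scan algorithm can be sketched in a few lines of Python. This is an in-memory illustration only: the name `scan_top_k`, the dict-based tuple representation, and the `score_fn` callable (standing in for the ranking function built from the pre-computed atomic probabilities) are assumptions, and the toy scoring rule in the example is a placeholder, not the probabilistic score.

```python
import heapq


def scan_top_k(tuples, query, score_fn, k):
    """Select the tuples satisfying a conjunctive query, score each, keep top k.

    tuples   : iterable of dicts {attribute: value}
    query    : dict {attribute: required value}
    score_fn : callable mapping a tuple to a numeric score (hypothetical)
    """
    matches = (
        t for t in tuples
        if all(t.get(attr) == val for attr, val in query.items())
    )
    # nlargest scans every matching tuple once, which is why the cost of
    # this approach grows with the size of the answer set.
    return heapq.nlargest(k, matches, key=score_fn)


# Hypothetical data; cheaper houses score higher in this toy rule.
houses = [
    {"TID": 1, "City": "Seattle", "Waterfront": True, "Price": 900},
    {"TID": 2, "City": "Seattle", "Waterfront": True, "Price": 700},
    {"TID": 3, "City": "Seattle", "Waterfront": False, "Price": 300},
    {"TID": 4, "City": "Seattle", "Waterfront": True, "Price": 800},
]
top = scan_top_k(houses, {"City": "Seattle", "Waterfront": True},
                 lambda t: 1.0 / t["Price"], k=2)
print([t["TID"] for t in top])
```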

Beyond Scan Algorithm

Scan algorithm is inefficient
◦ Many tuples in the answer set
Another extreme
◦ Pre-compute top-K tuples for all possible queries
◦ Still infeasible in practice
Trade-off solution
◦ Pre-compute ranked lists of tuples for all possible atomic queries
◦ At query time, merge ranked lists to get top-K tuples

Output from Index Module

CondList Cx
◦ {AttName, AttVal, TID, CondScore}
◦ B+ tree index on (AttName, AttVal, CondScore)
GlobList Gx
◦ {AttName, AttVal, TID, GlobScore}
◦ B+ tree index on (AttName, AttVal, GlobScore)

Index Module

Preprocessing Component

Preprocessing
◦ For Each Distinct Value x of the Database, Calculate and Store the Conditional (Cx) and the Global (Gx) Lists as follows:
   ◦ For Each Tuple t Containing x, Calculate its CondScore and GlobScore and add them to Cx and Gx respectively
◦ Sort Cx, Gx by decreasing scores
Execution
◦ Query Q: X1=x1 AND … AND Xs=xs
◦ Execute Threshold Algorithm [Fag01] on the following lists: Cx1, …, Cxs, and Gxb, where Gxb is the shortest list among Gx1, …, Gxs
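
A simplified in-memory sketch of a threshold-style merge over such ranked lists, written in Python. It is not the system's implementation: the B+ tree indexes are replaced by plain lists (sorted access) and dicts (random access by TID), the name `threshold_top_k` is hypothetical, and combining the per-list scores by multiplication is an assumption suggested by the product form of the ranking function. Scores are assumed to be non-negative so that the product is monotone.

```python
import heapq


def threshold_top_k(lists, k):
    """Merge ranked lists and return the k best (score, tid) pairs.

    lists : list of [(tid, score), ...], each sorted by decreasing score.
    A tuple missing from any list does not satisfy the query and is skipped.
    """
    lookup = [dict(lst) for lst in lists]  # random access by TID
    best = []                              # min-heap of (total score, tid)
    seen = set()
    max_depth = max((len(lst) for lst in lists), default=0)
    for depth in range(max_depth):
        threshold = 1.0
        for lst in lists:
            if depth >= len(lst):
                # This list is exhausted, so every tuple present in all
                # lists has already been seen.
                threshold = 0.0
                continue
            tid, s = lst[depth]            # sorted access
            threshold *= s
            if tid in seen:
                continue
            seen.add(tid)
            if all(tid in lk for lk in lookup):
                total = 1.0
                for lk in lookup:
                    total *= lk[tid]
                if len(best) < k:
                    heapq.heappush(best, (total, tid))
                elif total > best[0][0]:
                    heapq.heapreplace(best, (total, tid))
        # No unseen tuple can score above the threshold, so stop as soon as
        # the current k-th best score reaches it.
        if len(best) == k and best[0][0] >= threshold:
            break
    return sorted(best, reverse=True)


# Hypothetical lists for a query with two conditions, plus one global list.
c_city = [(1, 0.9), (2, 0.6), (3, 0.5)]
c_water = [(2, 0.8), (1, 0.7), (3, 0.1)]
g_short = [(3, 2.0), (1, 1.5), (2, 1.0)]
print(threshold_top_k([c_city, c_water, g_short], k=2))
```

The early stop is what distinguishes this from the Scan algorithm: when high-scoring tuples appear near the top of every list, only a prefix of each list is read.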

List Merge Algorithm

Experimental Setup

Datasets:
◦ MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
◦ Internet Movie Database (http://www.imdb.com)
Software and Hardware:
◦ Microsoft SQL Server 2000 RDBMS
◦ P4 2.8-GHz PC, 1 GB RAM
◦ C#, Connected to RDBMS through DAO

Quality Experiments

Conducted on Seattle Homes and Movies tables
Collect a workload from users
Compare Conditional Ranking Method in the paper with the Global Method [CIDR03]

Quality Experiment - Average Precision

For each query Qi, generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples

Let each user mark 10 tuples in Hi as most relevant to Qi

Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm
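
The slide does not spell out how the match is measured. One simple possibility, shown here purely as an assumption, is the fraction of the algorithm's returned tuples that the user also marked; the function name `overlap_precision` and the TID values are hypothetical.

```python
def overlap_precision(user_marked, algorithm_top):
    """Fraction of the algorithm's returned TIDs that the user also marked."""
    returned = list(algorithm_top)
    if not returned:
        return 0.0
    marked = set(user_marked)
    hits = sum(1 for tid in returned if tid in marked)
    return hits / len(returned)


# Hypothetical example: 10 tuples marked by a user out of the 30 in Hi,
# and the 10 tuples an algorithm ranked highest.
user_choice = [3, 5, 8, 11, 12, 17, 20, 21, 26, 29]
algo_choice = [3, 4, 8, 11, 13, 17, 19, 21, 26, 30]
print(overlap_precision(user_choice, algo_choice))
```

Averaging this value over all queries and users would give one number per algorithm to compare.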

Quality Experiment - Fraction of Users Preferring Each Algorithm

5 new queries
Users were given the top-5 results

Performance Experiments

Datasets

Table            NumTuples    Database Size (MB)
Seattle Homes    17463        1.936
US Homes         1380762      140.432

Compare 2 Algorithms:
◦ Scan algorithm
◦ List Merge algorithm

Performance Experiments – Pre-computation Time

Performance Experiments – Execution Time

Conclusions – Future Work

Conclusions
◦ Completely Automated Approach for the Many-Answers Problem which Leverages Data and Workload Statistics and Correlations
◦ Based on PIR
Drawbacks
◦ Multiple-table queries
◦ Non-categorical attributes
Future Work
◦ Empty-Answer Problem
◦ Handle Plain Text Attributes

Questions?