presented by chandrashekar vijayarenu anirban maiti

18
EFFICIENT TOP-K QUERY EVALUATION ON PROBABILISTIC DATA PAPER BY CHRISTOPHER R´E NILESH DALVI DAN SUCIU Presented By Chandrashekar Vijayarenu Anirban Maiti

Upload: abra-hyde

Post on 30-Dec-2015

20 views

Category:

Documents


1 download

DESCRIPTION

Efficient Top-k Query Evaluation on Probabilistic Data Paper By Christopher R´e Nilesh Dalvi Dan Suciu. Presented By Chandrashekar Vijayarenu Anirban Maiti. Agenda. Overview DNF Evaluation Query Evaluation Problem definition Monte Carlo (MC) Simulation Multi simulation(MS) - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Presented By Chandrashekar Vijayarenu Anirban Maiti

EFFICIENT TOP-K QUERY EVALUATION ON PROBABILISTIC DATA

PAPER BY

CHRISTOPHER R´E NILESH DALVI

DAN SUCIU

Presented By

Chandrashekar Vijayarenu

Anirban Maiti

Page 2: Presented By Chandrashekar Vijayarenu Anirban Maiti

AGENDA

Overview DNF Evaluation Query Evaluation Problem definition Monte Carlo (MC) Simulation Multi simulation(MS) Experiments

Page 3: Presented By Chandrashekar Vijayarenu Anirban Maiti

PROBABILISTIC DATABASES Probabilistic database is an uncertain database in which the

possible worlds have associated probabilities. The simplistic definition is that every tuple belongs to the database with some probability (between 0 - 1)

Example:

Page 4: Presented By Chandrashekar Vijayarenu Anirban Maiti

SELECT QUERY ON PROBABILISTIC DATABASE Find directors with a highly rated Drama and low rated

comedy

SELECT

DISTINCT d.dirName AS Director

FROM

AMZNReviews a, AMZNReviews b,

TitleMatch ax, TitleMatch by,

IMDBMovie x, IMDBMovie y,

IMDBDirector d

WHERE

a.asin=ax.asin and b.asin=by.asin

and ax.mid=x.mid and by.mid=y.mid

and x.did=y.did and y.did=d.did

and x.genre=’comedy’ and y.genre=’drama’

and abs(x.year - y.year) <= 5

and a.rating<2 and b.rating>4

Major challenge in probabilistic database is query evaluation.

Dalvi and Suciu have shown that most SQL queries are #P Complete.

This paper propose a new approach to query evaluation on probabilistic databases, by combining top-k style queries.

User specifies a SQL Query and a number k, and the

system returns the highest ranked k answers.

Page 5: Presented By Chandrashekar Vijayarenu Anirban Maiti

QUERY PROCESSING CHALLENGES

Compute exact output probabilities is computationally hard. Meaning, any algorithm computing the probabilities need to iterate through all possible subsets of TitleMatch.

Potential answers for which we need to calculate the probability is large. User is likely to end up inspecting just the first few of them.

This paper introduces the multisimulation algorithm which enables effective processing of probabilistic queries with some error guarantees.

Page 6: Presented By Chandrashekar Vijayarenu Anirban Maiti

OVERVIEW

Page 7: Presented By Chandrashekar Vijayarenu Anirban Maiti

POSSIBLE WORLDS A possible world is thus any subset of the tuples in the database and its

probability can be computed as a product of the probabilities of the tuples in it, and the respective probabilities of the tuples that are not in that world. Consider the following probabilistic database containing two relations S and T:

A probabilistic database over schema S is a pair (W,P) where W =

{W1, . . . ,Wn} is a set of database instances over S, and P : W ->[0, 1] is a probability distribution (i.e. P j=1,n P(Wj) = 1). Each instance Wj for which P(Wj) > 0 is called a possible world.

Let Jp be a database instance over schema Sp. Then Mod(Jp) is the probabilistic database (W,P) over the schema S obtained as described below

Page 8: Presented By Chandrashekar Vijayarenu Anirban Maiti

DNF FORMULAIn boolean logic, a disjunctive normal form

(DNF) is a standardization of a logical formula which is a disjunction of conjunctive clauses.

In our ExampleE = (t1 Λ t5) V t2 = true

in the possible worlds W3,W7,W10,W11, and its probability is thus

P(E) = P(W3) + P(W7) + P(W10) + P(W11).

Page 9: Presented By Chandrashekar Vijayarenu Anirban Maiti

QUERY EVALUATION DNF EVALUATIONqe = SELECT *

FROM

AMZNReviews a, AMZNReviews b,

TitleMatch ax, TitleMatch by,

IMDBMovie x, IMDBMovie y,

IMDBDirector d

WHERE ...

Each answer returned by qe will have 7 tuple variables defined in where clause: (a, b, axp, byp, x, y, d)

axp and byp are probabilistic tuples. (From TupleMatch table)

Thus, every row returned by qe defines a boolean expression t.E = axp Λ byp.

Page 10: Presented By Chandrashekar Vijayarenu Anirban Maiti

INTRODUCE GROUP BY

Next we group the rows by their directors, and for each group G = {(axp

1, byp1), . . . , (axp

m, bypm)}

DNF formula: G.E = (axp1 Λ byp

1) V . . . V (axpm Λ bxp

m)

The director’s probability give by P(G.E).

How to calculate the director’s probability?Brute Force Approach: Choose every possible world and

calculate the truth value of the boolean expression.

p = P(G.E) is the frequency with which G.E = true

#P Hard problem

Monte Carlo Simulation: Choose the possible world at random and calculate the truth value.

Page 11: Presented By Chandrashekar Vijayarenu Anirban Maiti

PROBLEM DEFINITION

G = {G1, . . . ,Gn} of n objects, with unknown Probabilities p1, . . . , pn, and a number k <= n.

Our goal is to find a set of k objects with the highest probabilities, denoted Top-K Subset of G.

Solution:The idea in our algorithm is to run in parallel

several Monte-Carlo simulations, one for each candidate answer, and approximate each probability only to the extent needed to compute correctly the top-k answers

Page 12: Presented By Chandrashekar Vijayarenu Anirban Maiti

MONTE CARLO (MC) SIMULATION

An MC algorithm repeatedly chooses at random a possible world, and computes the truth value of the Boolean expression G.E (Eq.(3)); the probability p = P(G.E) is approximated by the frequency ˜p with which G.E was true.

Luby and Karp have described the variant shown in Algorithm

fix an order on the disjuncts: t1, t2, . . . , tmC := 0repeat

Choose a random disjunct ti Є GChoose a random truth assignment s.t. ti.E = trueif forall j < i tj.E = false then C := C + 1

until N timesreturn ˜p = C/N

Page 13: Presented By Chandrashekar Vijayarenu Anirban Maiti

MULTISIMULATION (MS)

Assumptions:

Intervals:

Page 14: Presented By Chandrashekar Vijayarenu Anirban Maiti

MULTISIMULATION (MS)

Critical Region:The critical region, top objects, and bottom objects are:

(c, d) = (topk(a1, . . . , an), topk+1(b1, . . . , bn))……Eq (5)

T = {Gi | d <= ai}

B = {Gi | bi <= c}

Algorithm:MS TopK(G, k) : /* G = {G1, . . . ,Gn} */Let [a1, b1] = . . . = [an, bn] = [0, 1], (c, d) = (0, 1)while c <= d do

Case 1: choose a double crosser to simulateCase 2: choose upper and lower crosser to simulateCase 3: choose a maximal crosser to simulateUpdate (c, d) using Eq.(5)

end whilereturn TopK = T = {Gi | d <= ai}

Page 15: Presented By Chandrashekar Vijayarenu Anirban Maiti

EXAMPLE : LET US SELECT TOP 2.

0.1

G1

G2

G3

G4

G5

c d

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

Page 16: Presented By Chandrashekar Vijayarenu Anirban Maiti

EXPERIMENTS

Page 17: Presented By Chandrashekar Vijayarenu Anirban Maiti

EXPERIMENTS RESULTS

Page 18: Presented By Chandrashekar Vijayarenu Anirban Maiti

THANK YOU