answering why-not questions on top-k queries
TRANSCRIPT
Answering Why-Not Questions on Top-K Queries
Andy He and Eric Lo
The Hong Kong Polytechnic University
Background
The database community has focused on performance issues for decades.
Recently, more people have turned their focus to usability issues:
Supporting keyword search
Query auto-completion
Explaining your query result (a.k.a. Why and Why-Not questions)
2/33
Why-Not Questions
You pose a query Q; the database returns you a result R.
R gives you a "surprise": e.g., a tuple m that you were expecting in the result is missing, and you ask "WHY??!"
You pose a why-not question (Q, R, m), and the database returns you an explanation E.
3/33
The (short) history of Why-Not
Chapman and Jagadish, "Why Not?" [SIGMOD 09]
Select-Project-Join (SPJ) queries
Explanation E = "tell you which operator excludes the expected tuple"
Huang, Chen, Doan, and Naughton, "On the Provenance of Non-Answers to Queries Over Extracted Data" [PVLDB 09]
SPJ queries
Explanation E = "tell you how to modify the data"
4/33
The (short) history of Why-Not
Herschel and Hernandez, "Explaining Missing Answers to SPJUA Queries" [PVLDB 10]
SPJUA queries
Explanation E = "tell you how to modify the data"
Tran and Chan, "How to ConQueR Why-Not Questions" [SIGMOD 10]
SPJA queries
Explanation E = "tell you how to modify your query"
5/33
About this work
Why-Not questions on top-k queries.
Hotel <Price, Distance to City Center>
Top-3 hotels, weighting woriginal = <0.5, 0.5>
Result: Rank 1: Sheraton; Rank 2: Westin; Rank 3: InterContinental
"WHY is my favorite Renaissance NOT in the top-3 result?"
Is my value of k too small? Should I revise my weighting? Or do I need to modify both k and the weighting?
Explanation E = "tell you how to refine your top-k query in order to get your favorites back into the result"
6/33
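The queries in the talk are linearly weighted top-k queries. A minimal sketch of that scoring model, using hypothetical attribute values normalized so that larger is better (the hotel scores below are illustrative, not from the talk's dataset):

```python
def top_k(hotels, weights, k):
    """Rank items by a weighted sum of their (pre-normalized) attribute
    'goodness' values and return the names of the k best."""
    scored = sorted(
        hotels.items(),
        key=lambda kv: sum(w * v for w, v in zip(weights, kv[1])),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]

# Hypothetical (price goodness, distance goodness) values.
hotels = {
    "Sheraton":         (0.9, 0.9),
    "Westin":           (0.8, 0.9),
    "InterContinental": (0.9, 0.7),
    "Hilton":           (0.7, 0.8),
    "Renaissance":      (0.5, 0.9),
}
print(top_k(hotels, (0.5, 0.5), 3))  # Renaissance misses the top-3
```

With these made-up numbers, w = <0.5, 0.5> reproduces the slide's result (Sheraton, Westin, InterContinental) and Renaissance is left out.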
One possible answer: only modify k
Original query Q(koriginal=3, woriginal=<0.5,0.5>)
The ranking of Renaissance under the original weighting woriginal=<0.5,0.5>:
Rank 1: Sheraton; Rank 2: Westin; Rank 3: InterContinental; Rank 4: Hilton; Rank 5: Renaissance
Refined query #1: Q1(k=5, w=<0.5,0.5>)
7/33
Another possible answer: only modify the weighting
Original query Q(k=3, woriginal=<0.5,0.5>)
Refined query #1: Q1(k=5, w=<0.5,0.5>)
If we set weighting w=<0.1,0.9>:
Rank 1: Hotel E; Rank 2: Hotel F; Rank 3: Renaissance
Refined query #2: Q2(k=3, w=<0.1,0.9>)
8/33
Yet another possible answer: modify both
Original query Q(k=3, w=<0.5,0.5>)
Refined query #1: Q1(k=5, w=<0.5,0.5>)
Refined query #2: Q2(k=3, w=<0.1,0.9>)
If we set weighting w=<0.9,0.1>:
Rank 1: Hotel A; Rank 2: Hotel B; Rank 3: Hotel C; ...; Rank 10000: Renaissance
Refined query #3: Q3(k=10000, w=<0.9,0.1>)
9/33
Our objective
Find the refined query that minimizes a penalty function while including the missing tuple m in its top-k results.
Penalty options:
Prefer Modify K (PMK)
Prefer Modify Weighting (PMW)
Never Mind (Default) (NM)
10/33
Basic idea
For each weighting wi ∈ W:
Run PROGRESS(wi, UNTIL-SEE-m)
Obtain the ranking ri of m under the weighting wi
Form a refined query Qi(k=ri, w=wi)
Return the refined query with the least penalty.
Problem: W is infinite!
11/33
Our approach: sampling
For each weighting wi ∈ W:
Run PROGRESS(wi, UNTIL-SEE-m)
Obtain the ranking ri of m under the weighting wi
Form a refined query Qi(k=ri, w=wi)
Return the refined query with the least penalty.
Here W is a set of weightings drawn from a restricted weighting space.
Key theorem: the optimal refined query Qbest is either Q1, or else Qbest has a weighting wbest in a restricted weighting space.
12/33
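The sampling loop above can be sketched as follows. This is a simplified stand-in, assuming 2-attribute tuples where larger values are better; `rank_of` models PROGRESS(w, UNTIL-SEE-m) by counting tuples that score above m, and the penalty function is a placeholder:

```python
def rank_of(m, data, w):
    """Rank of tuple m under weighting w (rank 1 = best score). Stands in
    for PROGRESS(w, UNTIL-SEE-m), which scans in score order and stops as
    soon as m is seen."""
    score = lambda t: sum(wi * vi for wi, vi in zip(w, t))
    return 1 + sum(1 for t in data if score(t) > score(m))

def best_refined_query(data, m, sampled_weightings, penalty):
    """For each sampled weighting w, obtain m's rank r, form the refined
    query (k=r, w=w), and return the candidate with the least penalty."""
    candidates = [(rank_of(m, data, w), w) for w in sampled_weightings]
    return min(candidates, key=penalty)

# Hypothetical data: 2-attribute tuples, larger values are better.
data = [(0.9, 0.9), (0.8, 0.9), (0.9, 0.7), (0.7, 0.8)]
m = (0.5, 0.9)                      # the missing tuple
samples = [(0.5, 0.5), (0.1, 0.9)]  # sampled weightings
# Placeholder penalty: prefer the smallest k.
print(best_refined_query(data, m, samples, lambda q: q[0]))  # (3, (0.1, 0.9))
```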
How large should the sample size be?
We say a refined query is a best-T% refined query if its penalty is smaller than that of the other (1-T%) of refined queries.
We hope to get such a query with a probability larger than a threshold Pr.
13/33
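One standard way to bound the sample size (an assumed derivation, not quoted from the talk): if weightings are drawn uniformly, the chance that none of n samples lands in the best-T fraction is (1-T)^n, so requiring (1-T)^n <= 1-Pr gives the smallest sufficient n:

```python
import math

def sample_size(T, Pr):
    """Smallest n such that at least one of n uniform samples falls in the
    best-T fraction with probability >= Pr, from (1-T)^n <= 1-Pr."""
    return math.ceil(math.log(1 - Pr) / math.log(1 - T))

print(sample_size(0.05, 0.95))  # 59 samples suffice for best-5% with Pr=0.95
```

Notably, the bound depends only on T and Pr, not on the size of the weighting space or the dataset.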
The PROGRESS operation can be expensive
Original query Q(k=3, woriginal=<0.5,0.5>)
Refined query #1: Q1(k=5, w=<0.5,0.5>)
If we set weighting w=<0.9,0.1>:
Rank 1: Hotel A; Rank 2: Hotel B; Rank 3: Hotel C; ...; Rank 10000: Renaissance
Refined query: Q2(k=10000, w=<0.9,0.1>)
Very slow!!!
14/33
Two optimization techniques
Stop each PROGRESS operation early Skip some PROGRESS operations
15/33
Stop earlier
The original query Q(k=3, worigin=<0.5,0.5>)
Refined query #1: Q1(k=5, w=<0.5,0.5>)
If we set weighting w=<0.9,0.1>:
Rank 1: Hotel A; Rank 2: Hotel B; Rank 3: Hotel C; ...; Rank 5: Hotel D; ...
Once m's rank can only yield a penalty worse than the best refined query found so far, the PROGRESS operation can stop.
16/33
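A minimal sketch of the early-stop idea. PROGRESS is modeled here as a scan over a pre-sorted ranking (the real operation is an incremental top-k algorithm); `stop_rank` is whatever rank bound the current best penalty implies:

```python
def progress_until_see(tuples_in_score_order, m, stop_rank):
    """Scan tuples in descending score order, but give up as soon as the
    rank under examination exceeds stop_rank: past that point the refined
    query's penalty can no longer beat the best one found so far."""
    for rank, t in enumerate(tuples_in_score_order, start=1):
        if rank > stop_rank:
            return None  # early stop: m ranks too low to matter
        if t == m:
            return rank
    return None

ranking = ["Hotel A", "Hotel B", "Hotel C", "Hotel D", "Renaissance"]
print(progress_until_see(ranking, "Renaissance", 10))  # 5
print(progress_until_see(ranking, "Renaissance", 4))   # None (aborted early)
```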
Skip PROGRESS operations (a)
Similar weightings may lead to similar rankings (based on the "Reverse Top-k" paper, ICDE '10).
Therefore, the query result of PROGRESS(wx, UNTIL-SEE-m) could be used to deduce the query result of PROGRESS(wy, UNTIL-SEE-m), provided that wx and wy are similar.
17/33
Skip PROGRESS operations (a)
E.g., original query Q(k=3, worigin=<0.5,0.5>); refined query #1: Q1(k=5, w=<0.5,0.5>)

Scores under w=<0.5,0.5>:
Hotel             Score
Sheraton          10
Westin            9
InterContinental  8
Hilton            7
Renaissance       6

How do the scores look if we set w=<0.6,0.4>?

Scores under w=<0.6,0.4>:
Hotel             Score
Sheraton          9
Westin            10
InterContinental  7
Hilton            8
Renaissance       5

18/33
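One way to read this deduction (a heuristic sketch, not the paper's exact rule): since similar weightings rank similarly, the tuples already retrieved by PROGRESS(wx, UNTIL-SEE-m) can be re-scored under a nearby wy to estimate m's rank there, avoiding a fresh PROGRESS pass. The attribute values below are hypothetical:

```python
def deduce_rank(seen_under_wx, m, wy):
    """Estimate m's rank under a weighting wy close to wx by re-scoring
    only the tuples PROGRESS(wx, UNTIL-SEE-m) already retrieved."""
    score = lambda t: sum(w * v for w, v in zip(wy, t))
    return 1 + sum(1 for t in seen_under_wx if score(t) > score(m))

# Tuples seen before m under wx = <0.5, 0.5> (hypothetical values).
seen = [(0.9, 0.9), (0.8, 0.9), (0.9, 0.7), (0.7, 0.8)]
m = (0.5, 0.9)
print(deduce_rank(seen, m, (0.6, 0.4)))  # 5: same rank as under wx
```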
Skip PROGRESS operations (b)
We can skip a weighting w if we find that its change ∆w from the original weighting worigin is too large.
E.g., if we already have a refined query with penalty 0.5, a weighting w whose change ∆w is 1 can be skipped entirely.
19/33
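The check behind this skip can be sketched as follows (a reading of the slide's example; the exact bound in the paper may differ). Since Penalty = λk·∆k + λw·∆w with ∆k >= 0, the weighting term alone is a lower bound on the penalty:

```python
def should_skip(delta_w, lambda_w, best_penalty):
    """Skip a candidate weighting when its weighting-change term alone
    already reaches the best penalty found so far, so no value of k
    could make this weighting win."""
    return lambda_w * delta_w >= best_penalty

# Slide's example with an assumed lambda_w = 0.5: best penalty 0.5,
# candidate with delta_w = 1 is skipped without running PROGRESS.
print(should_skip(1.0, 0.5, 0.5))  # True
```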
Experiments Case Study on NBA data Experiments on Synthetic Data
20/33
Case study on NBA data
Compare with a pure random sampling version, which draws samples not from the restricted weighting space but from the complete weighting space.
21/33
Find the top-3 centers in NBA history
5 attributes (each weighting = 1/5): POINTS, REBOUND, BLOCKING, FIELD GOAL, FREE THROW
Initial result: Rank 1: Chamberlain; Rank 2: Abdul-Jabbar; Rank 3: O'Neal
22/33
Find the top-3 centers in NBA history
"Why not?!" We choose "Prefer Modify Weighting".

                Sampling on the          Sampling on the
                restricted space         whole weighting space
Refined query   Top-3                    Top-7
∆k              0                        4
Time (ms)       156                      154
Penalty         0.069                    0.28

23/33
Synthetic Data Uniform, Anti-correlated, Correlated Scalability
24/33
Varying query dimensions
25/33
Varying koriginal
26/33
Varying the ranking of the missing object
27/33
Varying the number of missing objects
28/33
Varying T%
29/33
Varying Pr
30/33
Optimization effectiveness
31/33
Conclusions
We are the first to answer why-not questions on top-k queries.
We prove that finding the optimal answer is computationally expensive.
A sampling-based method is proposed; the optimal answer is proved to be in a restricted sample space.
Two optimization techniques are proposed:
Stop each PROGRESS operation early
Skip some PROGRESS operations
32/33
Thanks! Q&A
Dealing with multiple missing objects M
We have to modify the algorithm a little bit: do a simple filtering on the set of missing objects.
If mi dominates mj in the data space, remove mi from M, because every time mj shows up in a top-k result, mi must be there too.
The condition UNTIL-SEE-m becomes UNTIL-SEE-ALL-OBJECTS-IN-M.
34/33
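The filtering step can be sketched as below, assuming larger attribute values are better (so a dominating object scores at least as high under any non-negative weighting, and appears in any top-k that contains the dominated one):

```python
def dominates(a, b):
    """a dominates b: at least as good on every attribute (larger = better,
    an assumption here) and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def filter_missing(M):
    """Drop every missing object that dominates another missing object:
    whenever the dominated one reaches the top-k, the dominating one is
    already there, so only the dominated object needs to be tracked."""
    return [mi for mi in M if not any(dominates(mi, mj) for mj in M if mj is not mi)]

# (2, 2) dominates (1, 1), so only (1, 1) and (3, 0) need tracking.
print(filter_missing([(2, 2), (1, 1), (3, 0)]))  # [(1, 1), (3, 0)]
```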
Penalty Model
Original query Q(3, worigin); refined query Q1(5, worigin)
Penalty of changing k: ∆k = 5 - 3 = 2
Penalty of changing w: ∆w = ||worigin - worigin||2 = 0
Basic penalty model: Penalty(5, worigin) = λk·∆k + λw·∆w, where λk + λw = 1
35/33
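The basic penalty model translates directly into code. The λ values below are illustrative defaults, not values from the talk:

```python
def penalty(k_orig, w_orig, k, w, lam_k=0.5, lam_w=0.5):
    """Basic penalty of a refined query (k, w) against the original
    (k_orig, w_orig): Penalty = lam_k * dk + lam_w * dw, with
    lam_k + lam_w = 1, dk = k - k_orig, and dw the L2 distance
    between the weight vectors."""
    dk = k - k_orig
    dw = sum((a - b) ** 2 for a, b in zip(w_orig, w)) ** 0.5
    return lam_k * dk + lam_w * dw

# The slide's example: Q(3, worigin) refined to Q1(5, worigin),
# so dk = 2 and dw = 0.
print(penalty(3, (0.5, 0.5), 5, (0.5, 0.5)))  # 1.0 when lam_k = lam_w = 0.5
```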
Normalized penalty function
36/33