answering why-not questions on top-k queries
TRANSCRIPT
Answering Why-Not Questions on Top-K Queries
Andy He and Eric Lo
The Hong Kong Polytechnic University
Background
The database community has focused on performance issues for decades.
Recently, more people have turned their focus to usability issues:
Supporting keyword search
Query auto-completion
Explaining your query result (a.k.a. Why and Why-Not questions)
2/33
Why-Not Questions
You pose a query Q; the database returns you a result R.
R gives you a "surprise": e.g., a tuple m that you were expecting in the result is missing, and you ask "WHY??!"
You pose a why-not question (Q, R, m), and the database returns you an explanation E.
3/33
The (short) history of Why-Not
Chapman and Jagadish, "Why Not?" [SIGMOD 09]
Select-Project-Join (SPJ) queries
Explanation E = "tell you which operator excludes the expected tuple"
Huang, Chen, Doan, and Naughton, "On the Provenance of Non-Answers to Queries Over Extracted Data" [PVLDB 09]
SPJ queries
Explanation E = "tell you how to modify the data"
4/33
The (short) history of Why-Not
Herschel and Hernandez, "Explaining Missing Answers to SPJUA Queries" [PVLDB 10]
SPJUA queries
Explanation E = "tell you how to modify the data"
Tran and Chan, "How to ConQueR Why-Not Questions" [SIGMOD 10]
SPJA queries
Explanation E = "tell you how to modify your query"
5/33
About this work
Why-Not questions on top-k queries.
Hotel <Price, Distance to City Center>
Top-3 hotels, weighting woriginal = <0.5, 0.5>
Result: Rank 1: Sheraton; Rank 2: Westin; Rank 3: InterContinental
"WHY is my favorite Renaissance NOT in the top-3 result?"
Is my value of k too small? Should I revise my weighting? Or do I need to modify both k and the weighting?
Explanation E = "tell you how to refine your top-k query in order to get your favorites back into the result"
6/33
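The queries in the talk are linearly weighted top-k queries. A minimal sketch of that scoring model, using hypothetical attribute values normalized so that larger is better (the hotel scores below are illustrative, not from the talk's dataset):

```python
def top_k(hotels, weights, k):
    """Rank items by a weighted sum of their (pre-normalized) attribute
    'goodness' values and return the names of the k best."""
    scored = sorted(
        hotels.items(),
        key=lambda kv: sum(w * v for w, v in zip(weights, kv[1])),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]

# Hypothetical (price goodness, distance goodness) values.
hotels = {
    "Sheraton":         (0.9, 0.9),
    "Westin":           (0.8, 0.9),
    "InterContinental": (0.9, 0.7),
    "Hilton":           (0.7, 0.8),
    "Renaissance":      (0.5, 0.9),
}
print(top_k(hotels, (0.5, 0.5), 3))  # Renaissance misses the top-3
```

With these made-up numbers, w = <0.5, 0.5> reproduces the slide's result (Sheraton, Westin, InterContinental) and Renaissance is left out.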
One possible answer: only modify k
Original query Q(koriginal=3, woriginal=<0.5,0.5>)
The ranking of Renaissance under the original weighting woriginal=<0.5,0.5>:
Rank 1: Sheraton; Rank 2: Westin; Rank 3: InterContinental; Rank 4: Hilton; Rank 5: Renaissance
Refined query #1: Q1(k=5, w=<0.5,0.5>)
7/33
Another possible answer: only modify the weighting
Original query Q(k=3, woriginal=<0.5,0.5>)
Refined query #1: Q1(k=5, w=<0.5,0.5>)
If we set weighting w=<0.1,0.9>:
Rank 1: Hotel E; Rank 2: Hotel F; Rank 3: Renaissance
Refined query #2: Q2(k=3, w=<0.1,0.9>)
8/33
Yet another possible answer: modify both
Original query Q(k=3, w=<0.5,0.5>)
Refined query #1: Q1(k=5, w=<0.5,0.5>)
Refined query #2: Q2(k=3, w=<0.1,0.9>)
If we set weighting w=<0.9,0.1>:
Rank 1: Hotel A; Rank 2: Hotel B; Rank 3: Hotel C; ...; Rank 10000: Renaissance
Refined query #3: Q3(k=10000, w=<0.9,0.1>)
9/33
Our objective
Find the refined query that minimizes a penalty function while including the missing tuple m in its top-k results.
Penalty options:
Prefer Modify K (PMK)
Prefer Modify Weighting (PMW)
Never Mind (Default) (NM)
10/33
Basic idea
For each weighting wi ∈ W:
Run PROGRESS(wi, UNTIL-SEE-m)
Obtain the ranking ri of m under the weighting wi
Form a refined query Qi(k=ri, w=wi)
Return the refined query with the least penalty.
Problem: W is infinite!
11/33
Our approach: sampling
For each weighting wi ∈ W:
Run PROGRESS(wi, UNTIL-SEE-m)
Obtain the ranking ri of m under the weighting wi
Form a refined query Qi(k=ri, w=wi)
Return the refined query with the least penalty.
Here W is a set of weightings drawn from a restricted weighting space.
Key theorem: the optimal refined query Qbest is either Q1, or else Qbest has a weighting wbest in a restricted weighting space.
12/33
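The sampling loop above can be sketched as follows. This is a simplified stand-in, assuming 2-attribute tuples where larger values are better; `rank_of` models PROGRESS(w, UNTIL-SEE-m) by counting tuples that score above m, and the penalty function is a placeholder:

```python
def rank_of(m, data, w):
    """Rank of tuple m under weighting w (rank 1 = best score). Stands in
    for PROGRESS(w, UNTIL-SEE-m), which scans in score order and stops as
    soon as m is seen."""
    score = lambda t: sum(wi * vi for wi, vi in zip(w, t))
    return 1 + sum(1 for t in data if score(t) > score(m))

def best_refined_query(data, m, sampled_weightings, penalty):
    """For each sampled weighting w, obtain m's rank r, form the refined
    query (k=r, w=w), and return the candidate with the least penalty."""
    candidates = [(rank_of(m, data, w), w) for w in sampled_weightings]
    return min(candidates, key=penalty)

# Hypothetical data: 2-attribute tuples, larger values are better.
data = [(0.9, 0.9), (0.8, 0.9), (0.9, 0.7), (0.7, 0.8)]
m = (0.5, 0.9)                      # the missing tuple
samples = [(0.5, 0.5), (0.1, 0.9)]  # sampled weightings
# Placeholder penalty: prefer the smallest k.
print(best_refined_query(data, m, samples, lambda q: q[0]))  # (3, (0.1, 0.9))
```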
How large should the sample size be?
We say a refined query is a best-T% refined query if its penalty is smaller than that of the other (1-T%) of refined queries.
We hope to get such a query with a probability larger than a threshold Pr.
13/33
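One standard way to bound the sample size (an assumed derivation, not quoted from the talk): if weightings are drawn uniformly, the chance that none of n samples lands in the best-T fraction is (1-T)^n, so requiring (1-T)^n <= 1-Pr gives the smallest sufficient n:

```python
import math

def sample_size(T, Pr):
    """Smallest n such that at least one of n uniform samples falls in the
    best-T fraction with probability >= Pr, from (1-T)^n <= 1-Pr."""
    return math.ceil(math.log(1 - Pr) / math.log(1 - T))

print(sample_size(0.05, 0.95))  # 59 samples suffice for best-5% with Pr=0.95
```

Notably, the bound depends only on T and Pr, not on the size of the weighting space or the dataset.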
The PROGRESS operation can be expensive
Original query Q(k=3, woriginal=<0.5,0.5>)
Refined query #1: Q1(k=5, w=<0.5,0.5>)
If we set weighting w=<0.9,0.1>:
Rank 1: Hotel A; Rank 2: Hotel B; Rank 3: Hotel C; ...; Rank 10000: Renaissance
Refined query: Q2(k=10000, w=<0.9,0.1>)
Very slow!!!
14/33
Two optimization techniques
Stop each PROGRESS operation early Skip some PROGRESS operations
15/33
Stop earlier
The original query Q(k=3, worigin=<0.5,0.5>)
Refined query #1: Q1(k=5, w=<0.5,0.5>)
If we set weighting w=<0.9,0.1>:
Rank 1: Hotel A; Rank 2: Hotel B; Rank 3: Hotel C; ...; Rank 5: Hotel D; ...
Once m's rank can only yield a penalty worse than the best refined query found so far, the PROGRESS operation can stop.
16/33
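A minimal sketch of the early-stop idea. PROGRESS is modeled here as a scan over a pre-sorted ranking (the real operation is an incremental top-k algorithm); `stop_rank` is whatever rank bound the current best penalty implies:

```python
def progress_until_see(tuples_in_score_order, m, stop_rank):
    """Scan tuples in descending score order, but give up as soon as the
    rank under examination exceeds stop_rank: past that point the refined
    query's penalty can no longer beat the best one found so far."""
    for rank, t in enumerate(tuples_in_score_order, start=1):
        if rank > stop_rank:
            return None  # early stop: m ranks too low to matter
        if t == m:
            return rank
    return None

ranking = ["Hotel A", "Hotel B", "Hotel C", "Hotel D", "Renaissance"]
print(progress_until_see(ranking, "Renaissance", 10))  # 5
print(progress_until_see(ranking, "Renaissance", 4))   # None (aborted early)
```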
Skip PROGRESS operations (a)
Similar weightings may lead to similar rankings (based on the "Reverse Top-k" paper, ICDE '10).
Therefore, the query result of PROGRESS(wx, UNTIL-SEE-m) could be used to deduce the query result of PROGRESS(wy, UNTIL-SEE-m), provided that wx and wy are similar.
17/33
Skip PROGRESS operations (a)
E.g., original query Q(k=3, worigin=<0.5,0.5>); refined query #1: Q1(k=5, w=<0.5,0.5>)

Scores under w=<0.5,0.5>:
Hotel             Score
Sheraton          10
Westin            9
InterContinental  8
Hilton            7
Renaissance       6

How do the scores look if we set w=<0.6,0.4>?

Scores under w=<0.6,0.4>:
Hotel             Score
Sheraton          9
Westin            10
InterContinental  7
Hilton            8
Renaissance       5

18/33
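One way to read this deduction (a heuristic sketch, not the paper's exact rule): since similar weightings rank similarly, the tuples already retrieved by PROGRESS(wx, UNTIL-SEE-m) can be re-scored under a nearby wy to estimate m's rank there, avoiding a fresh PROGRESS pass. The attribute values below are hypothetical:

```python
def deduce_rank(seen_under_wx, m, wy):
    """Estimate m's rank under a weighting wy close to wx by re-scoring
    only the tuples PROGRESS(wx, UNTIL-SEE-m) already retrieved."""
    score = lambda t: sum(w * v for w, v in zip(wy, t))
    return 1 + sum(1 for t in seen_under_wx if score(t) > score(m))

# Tuples seen before m under wx = <0.5, 0.5> (hypothetical values).
seen = [(0.9, 0.9), (0.8, 0.9), (0.9, 0.7), (0.7, 0.8)]
m = (0.5, 0.9)
print(deduce_rank(seen, m, (0.6, 0.4)))  # 5: same rank as under wx
```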
Skip PROGRESS operations (b)
We can skip a weighting w if we find that its change ∆w from the original weighting worigin is too large.
E.g., if we already have a refined query with penalty 0.5, a weighting w whose change ∆w is 1 can be skipped entirely.
19/33
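The check behind this skip can be sketched as follows (a reading of the slide's example; the exact bound in the paper may differ). Since Penalty = λk·∆k + λw·∆w with ∆k >= 0, the weighting term alone is a lower bound on the penalty:

```python
def should_skip(delta_w, lambda_w, best_penalty):
    """Skip a candidate weighting when its weighting-change term alone
    already reaches the best penalty found so far, so no value of k
    could make this weighting win."""
    return lambda_w * delta_w >= best_penalty

# Slide's example with an assumed lambda_w = 0.5: best penalty 0.5,
# candidate with delta_w = 1 is skipped without running PROGRESS.
print(should_skip(1.0, 0.5, 0.5))  # True
```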
Experiments Case Study on NBA data Experiments on Synthetic Data
20/33
Case study on NBA data
Compare with a pure random sampling version, which draws samples not from the restricted weighting space but from the complete weighting space.
21/33
Find the top-3 centers in NBA history
5 attributes (each weighting = 1/5): POINTS, REBOUND, BLOCKING, FIELD GOAL, FREE THROW
Initial result: Rank 1: Chamberlain; Rank 2: Abdul-Jabbar; Rank 3: O'Neal
22/33
Find the top-3 centers in NBA history
"Why not?!" We choose "Prefer Modify Weighting".

                Sampling on the          Sampling on the
                restricted space         whole weighting space
Refined query   Top-3                    Top-7
∆k              0                        4
Time (ms)       156                      154
Penalty         0.069                    0.28

23/33
Synthetic Data Uniform, Anti-correlated, Correlated Scalability
24/33
Varying query dimensions
25/33
Varying koriginal
26/33
Varying the ranking of the missing object
27/33
Varying the number of missing objects
28/33
Varying T%
29/33
Varying Pr
30/33
Optimization effectiveness
31/33
Conclusions
We are the first to answer why-not questions on top-k queries.
We prove that finding the optimal answer is computationally expensive.
A sampling-based method is proposed; the optimal answer is proved to be in a restricted sample space.
Two optimization techniques are proposed:
Stop each PROGRESS operation early
Skip some PROGRESS operations
32/33
Thanks! Q&A
Dealing with multiple missing objects M
We have to modify the algorithm a little bit: do a simple filtering on the set of missing objects.
If mi dominates mj in the data space, remove mi from M, because every time mj shows up in a top-k result, mi must be there too.
The condition UNTIL-SEE-m becomes UNTIL-SEE-ALL-OBJECTS-IN-M.
34/33
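The filtering step can be sketched as below, assuming larger attribute values are better (so a dominating object scores at least as high under any non-negative weighting, and appears in any top-k that contains the dominated one):

```python
def dominates(a, b):
    """a dominates b: at least as good on every attribute (larger = better,
    an assumption here) and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def filter_missing(M):
    """Drop every missing object that dominates another missing object:
    whenever the dominated one reaches the top-k, the dominating one is
    already there, so only the dominated object needs to be tracked."""
    return [mi for mi in M if not any(dominates(mi, mj) for mj in M if mj is not mi)]

# (2, 2) dominates (1, 1), so only (1, 1) and (3, 0) need tracking.
print(filter_missing([(2, 2), (1, 1), (3, 0)]))  # [(1, 1), (3, 0)]
```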
Penalty Model
Original query Q(3, worigin); refined query Q1(5, worigin)
Penalty of changing k: ∆k = 5 - 3 = 2
Penalty of changing w: ∆w = ||worigin - worigin||2 = 0
Basic penalty model: Penalty(5, worigin) = λk·∆k + λw·∆w, where λk + λw = 1
35/33
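The basic penalty model translates directly into code. The λ values below are illustrative defaults, not values from the talk:

```python
def penalty(k_orig, w_orig, k, w, lam_k=0.5, lam_w=0.5):
    """Basic penalty of a refined query (k, w) against the original
    (k_orig, w_orig): Penalty = lam_k * dk + lam_w * dw, with
    lam_k + lam_w = 1, dk = k - k_orig, and dw the L2 distance
    between the weight vectors."""
    dk = k - k_orig
    dw = sum((a - b) ** 2 for a, b in zip(w_orig, w)) ** 0.5
    return lam_k * dk + lam_w * dw

# The slide's example: Q(3, worigin) refined to Q1(5, worigin),
# so dk = 2 and dw = 0.
print(penalty(3, (0.5, 0.5), 5, (0.5, 0.5)))  # 1.0 when lam_k = lam_w = 0.5
```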
Normalized penalty function
36/33