Top-K Query Processing
D. Gunopulos
Multimedia Top-K Queries
The IBM QBIC project (90's):
How to store and index multimedia objects?
• Multimedia objects can have many attributes, with numerical, "fuzzy" values, e.g. Figure 1: 0.7 red, Figure 2: 0.4 red, same for blue, etc.
• How to find similar objects?
Retrieving Multimedia Objects
Must address the similarity question.
• The user specifies the query as a function of the attributes:
Find the objects most similar to: (Red = 50), (Green = 20), (Blue = 30), with Red twice as important as Green or Blue.
How to rank the objects?
How to rank the objects?
• Must find the best (most similar) objects without accessing everything.
The right model: Top-K Queries!

Data Management & Query Processing Today
We are living in a world where data is generated All The Time & Everywhere.
Characteristics of the Applications
• "Data is generated in a distributed fashion" (e.g. sensor data, file-sharing data, geographically distributed clusters)
• "Distributed data is often outdated before it is ever used" (e.g. CCTV video traces, Internet ping data, sensor readings, weblogs, RFID tags, ...)
• "Transferring the data to a centralized repository is usually more expensive than storing it locally"
Motivating Problems
• "In-situ Data Storage & Retrieval"
– Data remains in-situ (at the generating site).
– When users want to search/retrieve some information, they perform on-demand queries.
• Challenges:
– Combine different attributes and data sources.
– Minimize the utilization of the communication medium.
– Exploit the network and the inherent parallelism of a distributed environment. Focus on ubiquitous hierarchical networks (e.g. P2P and sensor-nets).
– The number of answers might be very large. Focus on Top-K queries.
Top-K Query Example
• Assume that we have a cluster of n=5 webservers.
• Each server maintains locally the same m=5 webpages.
• When a webpage is accessed by a client, a server increases a local hit counter by one.
Top-K Query Example
• TOP-1 Query: "Which webpage has the highest number of hits across all servers (i.e. highest Score(oi))?"
• Score(oi) can only be calculated if we combine the hit count from all 5 servers.
(Figure: an n × m grid of local scores, one row per server and one column per URL, combined into a TOTAL SCORE per URL.)
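The setting above can be sketched in a few lines of Python. The hit counts below are illustrative placeholders, not values from the talk:

```python
# Toy version of the webserver example: each server keeps local hit
# counters for the same set of pages; Score(o_i) is the sum of o_i's
# hits across all servers, and TOP-1 is the page with the highest sum.
hits = [
    {"a.html": 12, "b.html": 7,  "c.html": 3},   # server 1
    {"a.html": 5,  "b.html": 20, "c.html": 1},   # server 2
    {"a.html": 9,  "b.html": 2,  "c.html": 8},   # server 3
]

def top_k(servers, k=1):
    total = {}
    for local in servers:                  # combine hit counts of all servers
        for url, n in local.items():
            total[url] = total.get(url, 0) + n
    return sorted(total.items(), key=lambda p: -p[1])[:k]

print(top_k(hits, k=1))   # b.html: 7 + 20 + 2 = 29
```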
Top-K Query Processing: Other Applications
• Collaborative Spam Detection Networks
• Content Distribution Networks
• Information Retrieval
• Sensor Networks
Top-K Query Processing Setting
Setting for this talk:
• Vertical partitioning: independent access for each attribute (or sets of attributes)
• Assume an index or sorted access per attribute
• Centralized or distributed setting
Top-K Query Processing Setting
What kinds of queries?
• Monotone functions of the attributes:
– f(x11, x21, x31) >= f(x12, x22, x32) if x11 >= x12 and x21 >= x22 and x31 >= x32
• Typically assume linear functions, e.g. fQ = 3X1 + 2X2
(Figure: the unit square in (X1, X2) space with corners O=(0,0), P=(1,0), T=(0,1), R=(1,1) and the query direction Q.)
Presentation Outline
• Introduction to Top-K Query Processing
• Centralized techniques
– Fagin's Algorithm
– Optimal Algorithms: TA (Threshold Algorithm)
– Restricted Access Models: TA-Sorted
– Probabilistic TA-Sorted
– Using previous query instantiations: LPTA
• Distributed techniques
• Online Algorithms for Monitoring Top-K results
• Future Work
Fagin’s Algorithm
[Fagin, PODS’98], [Fagin, Lotem, Naor, PODS’01]
The first efficient algorithm
Assumes an index per attribute.
FA Algorithm:
1) Access the n lists in parallel.
2) Stop after the values of K objects have been found in all lists.
3) While some object oi has been seen but not resolved, perform a random access to the other lists to find the complete score for oi.
4) Return the K objects with the highest score.
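The steps above can be sketched as a centralized simulation in Python. The five sorted lists are the running example used in the slides that follow:

```python
# Sketch of FA over the five sorted lists of the running example.
LISTS = [
    [("o3", 99), ("o1", 66), ("o0", 63), ("o2", 48), ("o4", 44)],  # v1
    [("o1", 91), ("o3", 90), ("o0", 61), ("o4", 7),  ("o2", 1)],   # v2
    [("o1", 92), ("o3", 75), ("o4", 70), ("o2", 16), ("o0", 1)],   # v3
    [("o3", 74), ("o1", 56), ("o2", 56), ("o0", 28), ("o4", 19)],  # v4
    [("o3", 67), ("o4", 67), ("o1", 58), ("o2", 54), ("o0", 35)],  # v5
]

def fagin(lists, k):
    n = len(lists)
    seen = {}      # object -> set of list indices where it was seen
    depth = 0
    # Steps 1-2: sorted accesses, row by row, until k objects have
    # been seen in every list.
    while sum(1 for s in seen.values() if len(s) == n) < k:
        for i, lst in enumerate(lists):
            obj, _ = lst[depth]
            seen.setdefault(obj, set()).add(i)
        depth += 1
    # Step 3: random accesses resolve every partially seen object.
    score = {obj: sum(dict(lst)[obj] for lst in lists) for obj in seen}
    # Step 4: return the k objects with the highest score.
    return sorted(score.items(), key=lambda p: -p[1])[:k]

print(fagin(LISTS, 1))   # o3 wins with total score 405
```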
Fagin's Algorithm (Example)

v1        v2        v3        v4        v5
o3, 99    o1, 91    o1, 92    o3, 74    o3, 67
o1, 66    o3, 90    o3, 75    o1, 56    o4, 67
o0, 63    o0, 61    o4, 70    o2, 56    o1, 58
o2, 48    o4, 07    o2, 16    o0, 28    o2, 54
o4, 44    o2, 01    o0, 01    o4, 19    o0, 35

Iteration 1: partial scores for o3, o1.
Iteration 2: o3 is resolved (seen in all lists); partial scores for o1, o4.
For Top-1, we resolve o1 and o4 with random accesses.
For Top-2, we continue with Iteration 3.
TOP-K: o3, 405 (405/5 = 81); o1, 363; o4, 207
Fagin's* Threshold Algorithm
[Fagin, Lotem, Naor, PODS'01], [Guntzer, Balke, Kiessling, VLDB'00], [Nepal, Ramakrishna, ICDE'99]
Long studied and well understood.
* Concurrently developed by 3 groups.
TA Algorithm:
1) Access the n lists in parallel.
2) While some object oi is seen, perform a random access to the other lists to find the complete score for oi.
3) Do the same for all objects in the current row.
4) Now compute the threshold τ as the sum of scores in the current row.
5) The algorithm stops after K objects have been found with a score above τ.
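A minimal Python sketch of the five steps, again simulated centrally over the example lists:

```python
# TA over the five sorted lists of the running example.
LISTS = [
    [("o3", 99), ("o1", 66), ("o0", 63), ("o2", 48), ("o4", 44)],  # v1
    [("o1", 91), ("o3", 90), ("o0", 61), ("o4", 7),  ("o2", 1)],   # v2
    [("o1", 92), ("o3", 75), ("o4", 70), ("o2", 16), ("o0", 1)],   # v3
    [("o3", 74), ("o1", 56), ("o2", 56), ("o0", 28), ("o4", 19)],  # v4
    [("o3", 67), ("o4", 67), ("o1", 58), ("o2", 54), ("o0", 35)],  # v5
]

def threshold_algorithm(lists, k):
    score = {}                                   # complete scores so far
    for depth in range(len(lists[0])):
        for lst in lists:                        # one row of sorted accesses
            obj, _ = lst[depth]
            if obj not in score:                 # random-access the rest
                score[obj] = sum(dict(l)[obj] for l in lists)
        # Threshold tau: sum of the scores in the current row.
        tau = sum(lst[depth][1] for lst in lists)
        top = sorted(score.items(), key=lambda p: -p[1])[:k]
        if len(top) == k and all(s >= tau for _, s in top):
            return top                 # k objects found at or above tau
    return sorted(score.items(), key=lambda p: -p[1])[:k]

print(threshold_algorithm(LISTS, 1))   # stops in iteration 2 (tau = 354)
```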
The Threshold Algorithm (Example)

v1        v2        v3        v4        v5
o3, 99    o1, 91    o1, 92    o3, 74    o3, 67
o1, 66    o3, 90    o3, 75    o1, 56    o4, 67
o0, 63    o0, 61    o4, 70    o2, 56    o1, 58
o2, 48    o4, 07    o2, 16    o0, 28    o2, 54
o4, 44    o2, 01    o0, 01    o4, 19    o0, 35

Iteration 1 threshold: τ = 99 + 91 + 92 + 74 + 67 => τ = 423
Have we found K=1 objects with a score above τ? => NO
Iteration 2 threshold: τ (2nd row) = 66 + 90 + 75 + 56 + 67 => τ = 354
Have we found K=1 objects with a score above τ? => YES!
TOP-K: o3, 405; o1, 363; o4, 207
Comparison of Fagin's and the Threshold Algorithm
• TA sees fewer objects than FA.
• TA may perform more random accesses than FA.
• TA requires only bounded buffer space (K), at the expense of more random seeks.
• FA makes use of unbounded buffers.
Optimal Algorithms
Algorithm B is instance optimal over a set of algorithms A and a set of inputs D if:
B ∈ A and Cost(B, D) = O(Cost(A, D)) for all A ∈ A, D ∈ D
Which means that:
Cost(B, D) ≤ c · Cost(A, D) + c', for all A ∈ A, D ∈ D
Theorem [Fagin et al. 2003]: TA is instance optimal for every monotone aggregation function, over every database (excluding wild guesses).
TA-Sorted
TA makes random accesses, and assumes they are possible and inexpensive.
In many situations, random accesses are much more expensive than sequential accesses, or may be difficult to implement.
TA-Sorted uses sequential access only. Assumes sorted access for each attribute.
TA-Sorted Algorithm:
1) Access the n lists in parallel.
2) When some object oi is seen, update its upper and lower bound.
3) The algorithm stops after K objects have been resolved and all other objects have a lower upper bound.
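The bound bookkeeping can be sketched as follows (a simplified NRA-style simulation over the example lists; for brevity it bounds only objects seen so far, ignoring entirely unseen ones):

```python
# TA-Sorted sketch: sequential accesses only, with lower/upper bounds.
LISTS = [
    [("o3", 99), ("o1", 66), ("o0", 63), ("o2", 48), ("o4", 44)],  # v1
    [("o1", 91), ("o3", 90), ("o0", 61), ("o4", 7),  ("o2", 1)],   # v2
    [("o1", 92), ("o3", 75), ("o4", 70), ("o2", 16), ("o0", 1)],   # v3
    [("o3", 74), ("o1", 56), ("o2", 56), ("o0", 28), ("o4", 19)],  # v4
    [("o3", 67), ("o4", 67), ("o1", 58), ("o2", 54), ("o0", 35)],  # v5
]

def ta_sorted(lists, k):
    n = len(lists)
    partial = {}                      # obj -> {list index: seen score}
    for depth in range(len(lists[0])):
        for i, lst in enumerate(lists):
            obj, s = lst[depth]
            partial.setdefault(obj, {})[i] = s
        row = [lst[depth][1] for lst in lists]    # last value read per list
        # Lower bound: sum of seen scores. Upper bound: fill each unseen
        # list with the last (i.e. highest remaining) value read there.
        bounds = {o: (sum(ps.values()),
                      sum(ps.values()) + sum(row[i] for i in range(n)
                                             if i not in ps))
                  for o, ps in partial.items()}
        ranked = sorted(bounds, key=lambda o: -bounds[o][0])
        topk, rest = ranked[:k], ranked[k:]
        # Stop when the top-k are resolved and dominate all other
        # objects' upper bounds.
        if all(len(partial[o]) == n for o in topk) and \
           all(bounds[o][1] <= bounds[topk[-1]][0] for o in rest):
            return [(o, bounds[o][0]) for o in topk]
    return [(o, bounds[o][0]) for o in ranked[:k]]

print(ta_sorted(LISTS, 1))   # o3 is resolved with score 405
```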
The TA-Sorted Algorithm (Example)

v1        v2        v3        v4        v5
o3, 99    o1, 91    o1, 92    o3, 74    o3, 67
o1, 66    o3, 90    o3, 75    o1, 56    o4, 67
o0, 63    o0, 61    o4, 70    o2, 56    o1, 58
o2, 48    o4, 07    o2, 16    o0, 28    o2, 54
o4, 44    o2, 01    o0, 01    o4, 19    o0, 35

Iteration 1: o1 = [183, 423], o3 = [240, 423]
Iteration 2: o1 = [305, 372], o3 = 405, o4 = [67, 354]
Iteration 3: o1 = 363, o3 = 405, o4 = [137, 317]
TOP-K: o3, 405; o1, 363
TA-Sorted and Relational Systems
[Ilyas, Aref, Elmagarmid, VLDB'03], [Tsaparas, Palpanas, Kotidis, Koudas, Srivastava, ICDE'03], [Bruno, Chaudhuri, Gravano, ICDE'01]
The sorted access makes it easier and conceptually simpler to integrate into relational Database Management Systems:
• Ilyas et al. present the NRA-RJ Rank-Join query operator.
• Bruno et al. show that Top-K queries can be reduced to multidimensional range queries (using histograms to model the data distribution).
Probabilistic TA-Sorted
[Theobald, Weikum, Schenkel, VLDB'04]
TA-Sorted can keep large intermediary results. A smart idea: use information about the distribution of the values to eliminate objects that are unlikely to be in the Top-K result.
The key: compute probabilistic guarantees.
Probabilistic TA-Sorted:
1) Access the n lists in parallel.
2) When some object oi is seen, update its upper and lower bound.
3) The algorithm stops after K objects have been resolved and all other objects have a lower upper bound.
Probabilistic TA-Sorted
How to compute probabilistic guarantees? We need an estimate for the score of unseen objects:
• Assume attribute independence.
• Compute a model of the distribution in each attribute Xi (use histograms to model the data distributions).
• Use convolution to estimate the distribution of X1 + X2.
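The convolution step can be illustrated with two small discrete histograms (the probabilities below are illustrative, not from the talk):

```python
# Estimating the distribution of X1 + X2 by convolving two discrete
# score histograms, under the attribute-independence assumption.
def convolve(h1, h2):
    """h[i] = P(X = i); returns the distribution of X1 + X2."""
    out = [0.0] * (len(h1) + len(h2) - 1)
    for i, p1 in enumerate(h1):
        for j, p2 in enumerate(h2):
            out[i + j] += p1 * p2
    return out

hX1 = [0.2, 0.5, 0.3]          # P(X1 = 0, 1, 2)
hX2 = [0.1, 0.6, 0.3]          # P(X2 = 0, 1, 2)
hSum = convolve(hX1, hX2)      # distribution of X1 + X2 over 0..4

# The tail P(X1 + X2 >= t) bounds the probability that an unseen
# object's total score exceeds a candidate threshold t.
print(sum(hSum[3:]))
```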
Probabilistic TA-Sorted (Example)

v1        v2        v3        v4        v5
o3, 99    o1, 91    o1, 92    o3, 74    o3, 67
o1, 66    o3, 90    o3, 75    o1, 56    o4, 67
o0, 63    o0, 61    o4, 70    o2, 56    o1, 58
o2, 48    o4, 07    o2, 16    o0, 28    o2, 54
o4, 44    o2, 01    o0, 01    o4, 19    o0, 35

Iteration 1: o1 = [183, 423], o3 = [240, 423]
Iteration 2: o1 = [305, 372], o3 = 405, o4 = [67, 354]
Iteration 3: o1 = 363, o3 = 405
o4 can be removed from the candidate list if it is unlikely to be in the Top-2.
TOP-K: o3, 405; o1, 363
Top-K Processing Using Views
[Das, Gunopulos, Koudas, Tsirogiannis, VLDB'06]
• Query answering using views
• Improved efficiency:
– Use similar, previously instantiated queries.
– Use previous queries to model the correlations between attributes.
Top-K Processing Using Views
Ranking views: materialized results of previously asked top-k queries.
Q: fQ = 3X1 + 2X2 + 5X3

Base table R:
tid  X1  X2  X3
 1   82   1  59
 2   53  19  83
 3   29  99  15
 4   80  45   8
 5   28  32  39

V1 (fV1 = 2X1 + 5X2), tid/Score: 3, 553; 4, 385; 5, 216; 2, 201; 1, 169
V2 (fV2 = X2 + 4X3), tid/Score: 2, 351; 1, 237; 5, 177; 3, 159; 4, 88

Problem: can we answer new top-k queries efficiently using ranking views?
LPTA - Setting
• Linear additive scoring functions, e.g. fQ = 3X1 + 2X2 + 5X3
• Set of views:
– Materialized result of the tuples of a previously executed top-k query
– Arbitrary subset of attributes
– Sorted access on pairs (tid, scoreQ(tid))
• Random access on the base table R
• Extends PREFER [Hristidis et al., 2001]
LPTA - Example
(Figure: a Top-1 query Q over R(X1, X2), answered by scanning two ranking views V1 and V2 in lock step; each view is a sorted list of (tid, score) pairs, and the scan continues until the stopping condition, derived from the query direction Q in the unit square, is met.)
LPTA - Example (cont'd)
(Figure: the scan of V1 and V2 continues one entry at a time; for the Top-1 query, the stopping condition is satisfied after the second entry of each view has been read.)
LPTA
Linear Programming adaptation of TA.
Base table R(X1, X2); query Q: fQ = 3X1 + 10X2; views fV1 = 2X1 + 5X2 and fV2 = X1 + 2X2.
At iteration d, having read entries (tidd1, sd1) from V1 and (tidd2, sd2) from V2, solve:
max( fQ )
subject to:
0 ≤ X1, X2 ≤ 100
2X1 + 5X2 ≤ sd1
X1 + 2X2 ≤ sd2
Stop when unseenmax ≤ topkmin.
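The per-iteration LP is tiny: two variables and a handful of linear constraints. As a sketch of the stopping test, the two-variable case can be solved exactly by enumerating the vertices of the feasible polygon (intersections of pairs of constraint boundaries) rather than calling an LP library; the bounds and coefficients are the ones from the slide:

```python
# unseen_max: the best score any unseen tuple could still achieve,
# given the last scores s1, s2 read from views V1 and V2.
from itertools import combinations

def unseen_max(s1, s2):
    # Each constraint written as a*X1 + b*X2 <= c.
    cons = [(-1, 0, 0), (0, -1, 0), (1, 0, 100), (0, 1, 100),
            (2, 5, s1),            # fV1 = 2*X1 + 5*X2 <= s1
            (1, 2, s2)]            # fV2 =   X1 + 2*X2 <= s2
    best = None
    for (a1, b1, c1), (a2, b2, c2) in combinations(cons, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue               # parallel boundaries, no vertex
        x = (c1 * b2 - c2 * b1) / det        # Cramer's rule
        y = (a1 * c2 - a2 * c1) / det
        if all(a * x + b * y <= c + 1e-9 for a, b, c in cons):
            v = 3 * x + 10 * y               # fQ = 3*X1 + 10*X2
            best = v if best is None else max(best, v)
    return best

# LPTA stops once unseen_max(sd1, sd2) <= topkmin (kth best seen score).
print(unseen_max(200, 90))   # optimum at (0, 40): 3*0 + 10*40 = 400.0
```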
View Selection Heuristics
Select Views by Angle (SVA): sort the views by increasing angle with respect to Q.
(Figure: in the unit square, the views whose vectors form the smallest angles with the query vector Q are selected.)
View Selection: Cost Estimation Framework
• What is the cost of running LPTA on a specific set of views?
• We need a precise indicator of sequential and random accesses.
• Cost = number of sequential accesses.
Simulation of LPTA Using Histograms
HQ approximates the score distribution with respect to Q.
1. Estimate the score of the kth highest tuple (topkmin).
2. Run LPTA on HQ, HV1, HV2 bucket by bucket, in lock step.
3. Estimate the cost.
(b buckets, n/b tuples per bucket)
Presentation Outline
• Introduction to Top-K Query Processing
• Centralized techniques
• Distributed techniques
– Exact algorithms
– Exact algorithms with fixed rounds of communication: TPUT, TJA, TPAT
– Approximate algorithms using data distribution information: KLEE
– Exact algorithms using upper/lower bounds: LBK
• Online Algorithms for Monitoring Top-K results
• Future Work
Distributed Top-K Query Processing
Example: sensor monitoring
• Consider n sensors S = {s1, s2, ..., sn}, each of which maintains a sliding window of m readings {o1, o2, ..., om}. Note: oij denotes the ith reading of the jth sensor.
• Given an n-dimensional query point Q = {q1, q2, ..., qn}.
• Objective: find the K timestamps with the maximum value of Score(oi) = Σj wj · Sim(qj, oij), where:
– wj: sensor weight. The readings of some sensors might be more important than those of other sensors.
– Sim(qj, oij): a monotone similarity function.
Distributed Top-K Query Processing
Cost metric in a distributed environment:
A) Utilization of the communication medium
– Transmitting less data conserves resources and energy, and minimizes failures.
– e.g. in a sensor network, sending 1 byte ≈ 1120 CPU instructions. Source: the RISE (Riverside Sensor) platform (NetDB'05, IPSN'05 Demo, IEEE SECON'05).
B) Query response time
– The number of bytes transmitted is not the only parameter.
– Minimize the time to execute a query.
Communication Topologies
• Assume that the distributed sites are interconnected in a graph topology.
– Examples: peer-to-peer or sensor networks.
(Figure: a star topology, with the querying node QN directly connected to v1, v2, ..., vn; a general graph topology; and a spanning tree over the same nodes.)
Naïve Solution: Centralized Join (CJA)
• Each node sends all its local scores (its full list).
• Each intermediate node forwards all received lists.
• The Gnutella approach.
(Figure: a tree of nodes v1-v5; every node's complete list travels unchanged up the tree toward the query node.)
Drawbacks:
• Overwhelming number of messages.
• Huge query response time.
Simple Solution: Staged Join (SJA)
[Madden, Franklin, Hellerstein, Hong, OSDI'02]
• Aggregate the lists before they are forwarded to the parent.
• This is essentially the TAG approach (Madden et al., OSDI '02).
• Advantage: only (n-1) messages.
• Drawback: still sending everything!
(Figure: each intermediate node merges the lists received from its children with its own list before forwarding the aggregate.)
The Threshold Algorithm
Advantages:
• The number of objects accessed is minimized.
• Marian et al. show how to minimize random accesses [Marian, Bruno, Gravano, TODS'04].
Why not TA in a distributed environment? Disadvantages:
• Each object is accessed individually (random accesses).
• A huge number of round trips (phases).
• Unpredictable latency (phases are sequential).
• In-network aggregation is not possible.
The TPUT Algorithm
[Cao and Wang, PODC'04]
TPUT (Three-Phase Uniform Threshold) is a 3-round algorithm that improves query response time:
1) Fetch the K first entries from all n lists. Define the threshold τ as τ = (Kth highest partial score) / n. τ (the uniform threshold) is then disseminated to all nodes.
2) Each node sends any pair which has a score above τ.
3) If we found the complete score for fewer than K objects, then we perform a random access for all incomplete objects.
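The three rounds can be sketched as a centralized simulation (the lists are the running example; ties at τ are sent, matching the worked example on the next slide):

```python
# TPUT sketch over the five example lists (one list per node).
LISTS = [
    [("o3", 99), ("o1", 66), ("o0", 63), ("o2", 48), ("o4", 44)],  # v1
    [("o1", 91), ("o3", 90), ("o0", 61), ("o4", 7),  ("o2", 1)],   # v2
    [("o1", 92), ("o3", 75), ("o4", 70), ("o2", 16), ("o0", 1)],   # v3
    [("o3", 74), ("o1", 56), ("o2", 56), ("o0", 28), ("o4", 19)],  # v4
    [("o3", 67), ("o4", 67), ("o1", 58), ("o2", 54), ("o0", 35)],  # v5
]

def tput(lists, k):
    n = len(lists)
    # Round 1: fetch the first k entries from every list; compute
    # partial sums and the uniform threshold tau.
    partial = {}
    for lst in lists:
        for obj, s in lst[:k]:
            partial[obj] = partial.get(obj, 0) + s
    kth = sorted(partial.values(), reverse=True)[k - 1]
    tau = kth / n
    # Round 2: every node sends all entries scoring at or above tau.
    candidates = set()
    for lst in lists:
        candidates |= {obj for obj, s in lst if s >= tau}
    # Round 3: random accesses complete the candidates' scores.
    total = {obj: sum(dict(lst).get(obj, 0) for lst in lists)
             for obj in candidates}
    return sorted(total.items(), key=lambda p: -p[1])[:k]

print(tput(LISTS, 1))   # tau = 240/5 = 48; o3 wins with 405
```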
The TPUT Algorithm: Example
Q: TOP-1

v1        v2        v3        v4        v5
o3, 99    o1, 91    o1, 92    o3, 74    o3, 67
o1, 66    o3, 90    o3, 75    o1, 56    o4, 67
o0, 63    o0, 61    o4, 70    o2, 56    o1, 58
o2, 48    o4, 07    o2, 16    o0, 28    o2, 54
o4, 44    o2, 01    o0, 01    o4, 19    o0, 35

Phase 1: o1 = 91 + 92 = 183, o3 = 99 + 67 + 74 = 240
τ = (Kth highest partial score) / n => 240 / 5 => τ = 48
Phase 2: have we computed K exact scores?
Computed exactly: o3 = 405, o1 = 363. Incompletely computed (upper bounds): o2' = 158, o4' = 137, o0' = 124.
Drawback: the threshold is too coarse (uniform).
Optimality?
Fagin et al. (2003): TA is an instance optimal algorithm.
Cao et al. (PODC'04): no fixed-round algorithm can be instance optimal (TPUT, TJA, etc.).
Fixed rounds => constant communication overhead.
Improving the TPUT Algorithm
[Yu, Li, Wu, Agrawal, El Abbadi, DEXA'05]
In TPUT the threshold is uniform and too coarse. One approach is to use statistics to set different thresholds per attribute. We need statistical information a priori.
TPAT:
1) Fetch the K first entries from all n lists. Define the threshold τ as the Kth highest partial score.
2) Partition τ (the uniform threshold) based on the data distribution, then disseminate the per-attribute thresholds to all nodes.
3) Each node sends any pair which has a score above its threshold.
4) If we found the complete score for fewer than K objects, then we perform a random access for all incomplete objects.
Threshold Join Algorithm (TJA)
[Zeinalipour-Yazti, Vagena, Gunopulos, Kalogeraki, Tsotras, Vlachos, Koudas, Srivastava, DMSN'05]
TJA is a 3-round algorithm: it minimizes the number of transmitted objects, performs in-network aggregation, and optimizes the utilization of the communication channel.
1. LB Phase: ask each node to send its K (locally) highest-ranked results. The union of these results defines a threshold τ.
2. HJ Phase: ask each node to transmit everything above this threshold τ.
3. CL Phase: identify the complete score of all incompletely calculated objects.
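A flat (non-hierarchical) simulation of the three phases; the real algorithm unions and aggregates these messages in-network, up a spanning tree:

```python
# TJA sketch over the five example lists (one list per node).
LISTS = [
    [("o3", 99), ("o1", 66), ("o0", 63), ("o2", 48), ("o4", 44)],  # v1
    [("o1", 91), ("o3", 90), ("o0", 61), ("o4", 7),  ("o2", 1)],   # v2
    [("o1", 92), ("o3", 75), ("o4", 70), ("o2", 16), ("o0", 1)],   # v3
    [("o3", 74), ("o1", 56), ("o2", 56), ("o0", 28), ("o4", 19)],  # v4
    [("o3", 67), ("o4", 67), ("o1", 58), ("o2", 54), ("o0", 35)],  # v5
]

def tja(lists, k):
    n = len(lists)
    # 1) LB phase: union of every node's local top-k object ids.
    tau = set()
    for lst in lists:
        tau |= {obj for obj, _ in lst[:k]}
    # 2) HJ phase: each node sends every entry ranked no lower than
    #    its lowest-ranked tau object.
    partial, seen_in = {}, {}
    for i, lst in enumerate(lists):
        cut = max(j for j, (obj, _) in enumerate(lst) if obj in tau)
        for obj, s in lst[:cut + 1]:
            partial[obj] = partial.get(obj, 0) + s
            seen_in.setdefault(obj, set()).add(i)
    # 3) CL phase: batch random accesses complete any partial scores.
    for obj, where in seen_in.items():
        if len(where) < n:
            partial[obj] += sum(dict(lst).get(obj, 0)
                                for j, lst in enumerate(lists)
                                if j not in where)
    return sorted(partial.items(), key=lambda p: -p[1])[:k]

print(tja(LISTS, 1))   # LB set {o3, o1}; o3 wins with 405
```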
Step 1 - LB (Lower Bound) Phase
• Each node sends its top-k results to its parent.
• Each intermediate node performs a union of all received lists (denoted as τ).

v1        v2        v3        v4        v5
o3, 99    o1, 91    o1, 92    o3, 74    o3, 67
o1, 66    o3, 90    o3, 75    o1, 56    o4, 67
o0, 63    o0, 61    o4, 70    o2, 56    o1, 58
o2, 48    o4, 07    o2, 16    o0, 28    o2, 54
o4, 44    o2, 01    o0, 01    o4, 19    o0, 35

(Figure: for the TOP-1 query over the lists above, the local top-1 ids are unioned up the tree, yielding LB = {o3, o1}.)
Step 2 – HJ (Hierarchical Join) Phase
• Disseminate τ to all nodes.
• Each node sends back everything with a score above all objectIDs in τ.
• Before sending the objects, each node tags as incomplete the scores that could not be computed exactly (upper bound).
(Figure: the tagged lists are aggregated up the tree; the result is o3, 405 and o1, 363 (complete) and o4', 354 (incomplete upper bound).)
Step 3 – CL (Cleanup) Phase
Have we found K objects with a complete score?
Yes: the answer has been found!
No: find the complete score for each incomplete object (all in a single batch phase).
• CL ensures correctness.
• This phase is rarely required in practice.
(Figure: the final TOP-5 result: o3, 405; o1, 363; o4, 207; o0, 188; o2, 175.)
TJA vs. TPUT
(Figure: bytes required for distributed Top-K algorithms (star topology, K=5, m=25K); y-axis: bytes required, 10^3 to 10^9 on a log scale; x-axis: n = number of nodes, 20 to 100; SJA requires the most bytes, followed by TPUT, with TJA requiring the fewest.)
Approximate Distributed Algorithms
• TA performs many communication rounds• TPUT may retrieve a lot of data in Phase 2• TPUT, TJA perform random accesses
All these characteristics hurt performance!
Approximate Algorithms: KLEE
[Michel, Triantafillou, Weikum, VLDB'05]
• TPUT may retrieve a lot of data in Phase 2.
• TPUT, TJA perform random accesses.
KLEE is an improvement on both counts:
• Focuses on approximate answers.
• Uses information about the data distribution to reduce data transfers.
• Does not do random accesses at each peer.
The KLEE Algorithm
KLEE is a 2- or 3-round algorithm:
1. Exploration Step: finds an approximation of the min-k score threshold using histograms and Bloom filters.
2. Optimization Step: decides if step 3 will be executed (NO communication).
3. Candidate Filtering: a docID is a good candidate if it is high-scored in many peers.
4. Candidate Retrieval: get all good docID candidates.
Histogram Bloom Structure
• Each node pre-computes per attribute:
1. an equi-width histogram,
2. a Bloom filter for each histogram cell,
3. the average score per cell,
4. upper/lower scores per cell.
(Figure: a histogram of #docs per score range, with a Bloom filter bitmap attached to each cell. From the VLDB'05 KLEE presentation.)
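The four per-attribute components can be sketched together. The filter size, hash scheme, and example documents below are illustrative choices, not KLEE's actual parameters:

```python
# Sketch of a per-attribute "Histogram Bloom" structure: an equi-width
# histogram over scores where each cell also keeps a small Bloom
# filter over the docIDs falling in it, plus per-cell statistics.
import hashlib

class Cell:
    def __init__(self, m=64, hashes=3):
        self.bits, self.m, self.k = 0, m, hashes   # Bloom filter state
        self.scores = []                           # per-cell statistics

    def _positions(self, doc_id):
        # Derive k bit positions from one SHA-256 digest (illustrative).
        h = hashlib.sha256(doc_id.encode()).digest()
        return [int.from_bytes(h[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def add(self, doc_id, score):
        for p in self._positions(doc_id):
            self.bits |= 1 << p
        self.scores.append(score)

    def might_contain(self, doc_id):
        # Bloom semantics: no false negatives, possible false positives.
        return all(self.bits >> p & 1 for p in self._positions(doc_id))

    def stats(self):
        # Average, upper, and lower score per cell.
        return (sum(self.scores) / len(self.scores),
                max(self.scores), min(self.scores))

def build(docs, lo=0, hi=100, cells=10):
    width = (hi - lo) / cells                      # equi-width cells
    hist = [Cell() for _ in range(cells)]
    for doc_id, score in docs:
        idx = min(int((score - lo) / width), cells - 1)
        hist[idx].add(doc_id, score)
    return hist

hist = build([("d1", 95), ("d2", 91), ("d3", 12)])
print(hist[9].stats())    # cell [90, 100) holds d1 and d2
```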
Top-K Algorithms that Use Score Bounds
[Marian, Bruno, Gravano, TODS'04], [Zeinalipour-Yazti, Lin, Gunopulos, CIKM'06]
• Suppose that each node can only return lower and upper bounds rather than exact scores.
• e.g. instead of 16, it tells us that the similarity is in the range [11..19].
(Figure: an n × m grid of cells over moving-object trajectories; each access point keeps METADATA tuples (id, lb, ub) per cell, e.g. A4,10,18 meaning object A4 has a score in [10, 18] with respect to the query trajectory Q.)
LB-K Algorithm
[Zeinalipour-Yazti, Lin, Gunopulos, CIKM'06]
• An iterative algorithm for finding the K highest-ranked DATA objects using lower bounds (METADATA objects).
• Strategy: utilize the METADATA objects in order to decide which DATA objects have to be transferred.
LB-K: Example
Query: find the K=2 highest-ranked answers.

METADATA (id, lb): A4,10; A2,13; A0,15; A3,20; A9,22; A7,30; ...
DATA (id, exact): A4,17; A2,18; A0,24; A3,22; A9,25; A7,33; ...

(Figure: METADATA entries are fetched in batches of K with TJA; after each batch, the Kth exact score obtained so far is compared (≥?) against the next lower bounds, i.e. entries K+1, 2K+1, ..., to decide whether further DATA objects must be transferred.)
UBLB-K Algorithm
• Also an iterative algorithm with the same objectives as LB-K.
• Differences:
– It uses both a lower (LB) and an upper (UB) bound on the distributed DATA.
– It transfers the candidate DATA objects in a final (bulk) phase rather than incrementally.
UBLB-K: Example

METADATA (id, lb, ub): A4,10,18; A2,13,19; A0,15,25; A3,20,27; A9,22,26; A7,30,35; ...
DATA (id, exact): A4,17; A2,18; A0,24; A3,22; A9,25; A7,33; ...

(Figure: as in LB-K, METADATA is fetched in batches with TJA and compared (≥?) against the exact scores.)
Note: the Kth lowest UB is 19; therefore A3 (LB: 20) and the entries below it are not necessary.
Presentation Outline
• Introduction to Top-K Query Processing
• Centralized techniques
– Fagin's Algorithm
– Optimal Algorithms: TA (Threshold Algorithm)
– Restricted Access Models: TA-Sorted
– Probabilistic TA-Sorted
– Using previous query instantiations: LPTA
• Distributed techniques
– Exact algorithms
– Exact algorithms with fixed rounds of communication: TPUT, TJA, TPAT
– Approximate algorithms using data distribution information: KLEE
– Exact algorithms using upper/lower bounds: LBK
• Online Algorithms for Monitoring Top-K results: BABOLS, TMA
• Future Work
Online Algorithms
[Mouratidis, Bakiras, Papadias, SIGMOD'06]
• Top-K monitoring for stream data: the TMA/SMA algorithms.
• Monitor multiple Top-K queries simultaneously.
– Efficiently identify the effect of changes: only some Top-K results change.
(Figure: two queries Q1 and Q2 over the unit square of tuples; an update affects only the queries whose regions it intersects.)
Monitoring Top-K Results in a Distributed Setting
[Babcock, Olston, SIGMOD'03]
• The setting: changes come over time.
• Use any efficient algorithm for finding the top-K.
• Monitor changes: only if large changes happen do you have to recompute.
– Approximate algorithm: top-K results are correct within ε.
– Need an algorithm that decides how big a change we can tolerate per node:
Query: 2X1 + X2
Tuples: t1 = (3, 9), t2 = (5, 1)
Slack per attribute is: ((Score(t1) - Score(t2)) + ε) / 2 = 2 + ε/2
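The slide's arithmetic, written out as code (splitting the score gap plus the error budget ε evenly over the two attributes, as in the formula above):

```python
# Slack computation for the monitoring example: query 2*X1 + X2,
# current top tuple t1 = (3, 9) vs. runner-up t2 = (5, 1).
def score(t):
    return 2 * t[0] + t[1]

def slack_per_attribute(t1, t2, eps, n_attrs=2):
    # How much each attribute may drift before the top-1 can change.
    return (score(t1) - score(t2) + eps) / n_attrs

t1, t2 = (3, 9), (5, 1)
print(score(t1), score(t2))                    # 15 11
print(slack_per_attribute(t1, t2, eps=0.0))    # (15 - 11 + 0)/2 = 2.0
```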
Conclusions
• Top-K Query Processing is an area with
– Many applications in practical problems
– Many challenges and opportunities!
• Privacy issues
• Approximate algorithms
• Online algorithms
• Modeling and exploiting correlations
References
– Amato G., Rabitti F., Savino P. and Zezula P., "Region proximity in metric spaces and its use for approximate similarity search", In TOIS, 2003.
– Babcock B. and Olston C., "Distributed Top-K Monitoring", In Proceedings of the ACM SIGMOD International Conference on Management of Data, San Diego, CA, USA, Pages 28-39, 2003.
– Balke W.-T., Nejdl W., Siberski W., Thaden U., "Progressive Distributed Top-K Retrieval in Peer-to-Peer Networks", In Proceedings of the 21st International Conference on Data Engineering, April 5-8, Tokyo, Japan, 2005.
– Banerjee A., Mitra A., Najjar W., Zeinalipour-Yazti D., Kalogeraki V. and Gunopulos D., "RISE Co-S: High Performance Sensor Storage and Co-Processing Architecture", Second Annual IEEE Communications Society Conference on Sensor and Ad Hoc Communications and Networks (SECON'2005), Santa Clara, California, USA, 2005.
References
– Bloom B. H., "Space/Time Trade-Offs in Hash Coding with Allowable Errors", Communications of the ACM, 13(7):422-426, 1970.
– Bruno N., Gravano L. and Marian A., "Evaluating Top-K Queries Over Web-Accessible Databases", In Proceedings of the 18th International Conference on Data Engineering, San Jose, CA, USA, Page 369, 2002.
– Cao P. and Wang Z., "Efficient Top-K Query Calculation in Distributed Networks", In Proceedings of the Twenty-Third Annual ACM Symposium on Principles of Distributed Computing, St. John's, Newfoundland, Canada, Pages 206-215, 2004.
– Chun B. N., Culler D. E., Roscoe T., Bavier A. C., Peterson L. L., Wawrzoniak M., Bowman M., "PlanetLab: an overlay testbed for broad-coverage services", Computer Communication Review, Volume 33, Issue 3, Pages 3-12, 2003.
– Claffy K., Tracie E., McRobb D., "Internet tomography", 1999.
References
– Considine J., Li F., Kollios G., and Byers J., "Approximate Aggregation Techniques for Sensor Databases", In Proceedings of the 20th International Conference on Data Engineering, Boston, MA, USA, Page 449, 2004.
– Deligiannakis A., Kotidis Y., Roussopoulos N., "Hierarchical In-Network Data Aggregation with Quality Guarantees", In 9th International Conference on Extending Database Technology, Heraklion, Greece, March 14-18, Pages 658-675, 2004.
– Donjerkovic D. and Ramakrishnan R., "Probabilistic Optimization of Top-N Queries", In Proceedings of the 25th International Conference on Very Large Data Bases, Edinburgh, Scotland, UK, Pages 411-422, 1999.
– Fagin R., "Combining Fuzzy Information from Multiple Systems", In Proceedings of the Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Montreal, Quebec, Canada, Pages 216-226, 1996.
References
– Fagin R., "Fuzzy Queries in Multimedia Database Systems", In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Seattle, WA, USA, pp. 1-10, 1998.
– Fagin R., Lotem A. and Naor M., "Optimal Aggregation Algorithms for Middleware", In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Santa Barbara, CA, USA, Pages 102-113, 2001.
– Gravano L. and Chaudhuri S., "Evaluating Top-K Selection Queries", In Proceedings of the 25th International Conference on Very Large Data Bases, Edinburgh, Scotland, UK, Pages 397-410, 1999.
– Guntzer U., Balke W., Kiessling W., "Optimizing Multi-Feature Queries for Image Databases", In VLDB 2000.
– Hansen T., Otero J., McGregor A., Braun H-W., "Active measurement data analysis techniques", In Proceedings of the International Conference on Communications in Computing, Las Vegas, Nevada, pp. 105, 2000.
References
– Ilyas I. F., Aref W. G. and Elmagarmid A. K., "Supporting Top-k Join Queries in Relational Databases", In The VLDB Journal - The International Journal on Very Large Data Bases, Vol. 13, Iss. 3, pp. 207-221, 2003.
– Kalnis P., Ng W-S., Ooi B-C., Tan K-L., "Answering similarity queries in peer-to-peer networks", In Proceedings of the 14th International World Wide Web Conference, Pages 482-483, New York City, NY, USA, 2004.
– Kiessling W., "Foundations of Preferences in Database Systems", In Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong, China, Pages 311-322, 2002.
– Lv Q., Cao P., Cohen E., Lai K., Shenker S., "Search and Replication in Unstructured Peer-to-Peer Networks", In Proceedings of the 16th International Conference on Supercomputing, New York, NY, USA, Pages 84-95, 2002.
– Nepal S., Ramakrishna M. V., "Query Processing Issues in Image (Multimedia) Databases", In ICDE 1999.
References
– Madden S. R., Franklin M. J., Hellerstein J. M., Hong W., "TAG: a Tiny AGgregation Service for Ad-Hoc Sensor Networks", In Proceedings of the 5th Symposium on Operating Systems Design and Implementation, Boston, MA, pp. 131-146, 2002.
– Madden S. R., Franklin M. J., Hellerstein J. M., Hong W., "The Design of an Acquisitional Query Processor for Sensor Networks", In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, CA, USA, Pages 491-502, 2003.
– Marian A., Gravano L., Bruno N., "Evaluating Top-k Queries over Web-Accessible Databases", In TODS 2004.
– Michel S., Triantafillou P., Weikum G., "KLEE: A Framework for Distributed Top-K Query Algorithms", In 31st Conference on Very Large Data Bases, Trondheim, Norway, 2005.
– Nejdl W., Siberski W., Thaden U. and Balke W., "Top-k Query Evaluation for Schema-Based Peer-to-Peer Networks", In ISWC 2004.
References
– Szewczyk R., Osterweil E., Polastre J., Hamilton M., Mainwaring A. M., Estrin D., "Habitat monitoring with sensor networks", Communications of the ACM, 47(6):34-40, 2004.
– Theobald M., Schenkel R., Weikum G., "Top-k Query Evaluation with Probabilistic Guarantees", In VLDB 2004.
– Tsoumakos D. and Roussopoulos N., "Adaptive Probabilistic Search for Peer-to-Peer Networks", In Proceedings of the Third International Conference on Peer-to-Peer Computing, Linkoping, Sweden, Pages 102-110, 2003.
– Xiong L., Chitti S., Liu L., "Top-k Queries across Multiple Private Databases", In ICDCS 2005.
– Yang B. and Garcia-Molina H., "Efficient Search in Peer-to-Peer Networks", In Proceedings of the 22nd International Conference on Distributed Computing Systems, Vienna, Austria, Pages 5-14, 2002.
References
– Zeinalipour-Yazti D., Vagena Z., Gunopulos D., Kalogeraki V., Tsotras V., Vlachos M., Koudas N., Srivastava D., "The Threshold Join Algorithm for Top-K Queries in Distributed Sensor Networks", In Proceedings of the 2nd International Workshop on Data Management for Sensor Networks, collocated with VLDB 2005, Trondheim, Norway, 2005.
– Zeinalipour-Yazti D., Lin S., Kalogeraki V., Gunopulos D., Najjar W., "MicroHash: An Efficient Index Structure for Flash-Based Sensor Devices", In Proceedings of the 4th USENIX Conference on File and Storage Technologies (FAST'2005), San Francisco, CA, December 14-16, pp. 31-44, 2005.