Top-K Query Processing · 2016. 10. 23.



Page 1: Top-K Query Processing · 2016. 10. 23. · 4 Top-K Query Example • Assume that we have a cluster of n=5 webservers. • Each server maintains locally the same m=5 webpages. •


Top-K Query Processing

D. Gunopulos

Multimedia Top-K Queries

The IBM QBIC project (90’s): how to store and index multimedia objects?

• Multimedia objects can have many attributes with numerical, “fuzzy” values, e.g. Figure1: 0.7 red, Figure2: 0.4 red, and similarly for blue, etc.

• How to find similar objects?


Retrieving Multimedia Objects

Must address the similarity question.

• The user specifies the query as a function of the attributes. Find the objects most similar to: (Red = 50), (Green = 20), (Blue = 30), with Red twice as important as Green or Blue.

• How to rank the objects?

• Must find the best (most similar) objects without accessing everything.

The right model: Top-K Queries!

Data Management & Query Processing Today

We are living in a world where data is generated All The Time & Everywhere.


Characteristics of the Applications

• “Data is generated in a distributed fashion” (e.g. sensor data, file-sharing data, geographically distributed clusters).

• “Distributed data is often outdated before it is ever used” (e.g. CCTV video traces, Internet ping data, sensor readings, weblogs, RFID tags, …).

• “Transferring the data to a centralized repository is usually more expensive than storing it locally.”

Motivating Problems

• “In-situ Data Storage & Retrieval”
  – Data remains in-situ (at the generating site).
  – When users want to search/retrieve some information, they perform on-demand queries.

• Challenges:
  – Combine different attributes and data sources.
  – Minimize the utilization of the communication medium.
  – Exploit the network and the inherent parallelism of a distributed environment. Focus on ubiquitous hierarchical networks (e.g. P2P and sensor-nets).
  – The number of answers might be very large. Focus on Top-K queries.


Top-K Query Example

• Assume that we have a cluster of n=5 webservers.
• Each server maintains locally the same m=5 webpages.
• When a web page is accessed by a client, a server increases a local hit counter by one.

Top-K Query Example

• TOP-1 Query: “Which webpage has the highest number of hits across all servers (i.e. the highest Score(oi))?”

• Score(oi) can only be calculated if we combine the hit count (the local score) from all 5 servers: conceptually, an n×m table of URLs and per-server hit counts, aggregated into each URL’s TOTAL SCORE.


Top-K Query Processing: Other Applications

• Collaborative Spam Detection Networks
• Content Distribution Networks
• Information Retrieval
• Sensor Networks

Top-K Query Processing Setting

Setting for this talk:
• Vertical partitioning: independent access for each attribute (or sets of attributes).
• Assume index or sorted access per attribute.
• Centralized or distributed setting.


Top-K Query Processing Setting

What kinds of queries?
• Monotone functions of the attributes:
  f(x11, x21, x31) >= f(x12, x22, x32) if x11 >= x12 and x21 >= x22 and x31 >= x32.
• Typically assume linear functions, e.g. fQ = 3X1 + 2X2.

(Figure: the tuples of R plotted in the unit square [0,1]² of X1, X2, with the query direction Q.)

Presentation Outline

• Introduction to Top-K Query Processing
• Centralized techniques
  – Fagin’s Algorithm
  – Optimal Algorithms: TA (Threshold Algorithm)
  – Restricted Access Models: TA-Sorted
  – Probabilistic TA-Sorted
  – Using previous query instantiations: LPTA
• Distributed techniques
• Online Algorithms for Monitoring Top-K results
• Future Work


Fagin’s Algorithm

[Fagin, PODS’98], [Fagin, Lotem, Naor, PODS’01]

The first efficient algorithm. Assumes an index per attribute.

FA Algorithm:
1) Access the n lists in parallel.
2) Stop after the values of K objects have been found.
3) While some object oi is seen but not resolved, perform a random access to the other lists to find the complete score for oi.
4) Return the K objects with the highest score.
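The four FA steps above can be sketched in Python over the five sorted lists of the running example (the function and variable names here are illustrative, not from the original):

```python
# Each list holds (object, local score) pairs sorted by descending score.
LISTS = [
    [("o3", 99), ("o1", 66), ("o0", 63), ("o2", 48), ("o4", 44)],
    [("o1", 91), ("o3", 90), ("o0", 61), ("o4", 7),  ("o2", 1)],
    [("o1", 92), ("o3", 75), ("o4", 70), ("o2", 16), ("o0", 1)],
    [("o3", 74), ("o1", 56), ("o2", 56), ("o0", 28), ("o4", 19)],
    [("o3", 67), ("o4", 67), ("o1", 58), ("o2", 54), ("o0", 35)],
]

def random_access(obj, lists):
    """Complete score of obj via random access to every list."""
    return sum(s for lst in lists for o, s in lst if o == obj)

def fagin(lists, k):
    seen = {}          # obj -> set of list indexes it was seen in
    depth = 0
    # 1-2) Sorted access in parallel until k objects are fully seen.
    while sum(len(v) == len(lists) for v in seen.values()) < k:
        for i, lst in enumerate(lists):
            obj, _ = lst[depth]
            seen.setdefault(obj, set()).add(i)
        depth += 1
    # 3) Resolve every partially seen object by random access.
    scores = {obj: random_access(obj, lists) for obj in seen}
    # 4) Return the k objects with the highest score.
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

print(fagin(LISTS, 1))   # [('o3', 405)]
```

With k=1 the loop stops after two rows of sorted accesses, once o3 has been seen in all five lists, matching the example on the next slide.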

Fagin’s Algorithm (Example)

v1      v2      v3      v4      v5
o3, 99  o1, 91  o1, 92  o3, 74  o3, 67
o1, 66  o3, 90  o3, 75  o1, 56  o4, 67
o0, 63  o0, 61  o4, 70  o2, 56  o1, 58
o2, 48  o4, 07  o2, 16  o0, 28  o2, 54
o4, 44  o2, 01  o0, 01  o4, 19  o0, 35

Iteration 1: partial scores for o1, o3.
Iteration 2: resolve o3 (total 405, normalized 4.05/5 = .81); partial scores for o1, o4.
For Top-1, we resolve o1 and o4 with random accesses. TOP-K result: o3, 405; o1, 363; o4, 207.
For Top-2, we continue with Iteration 3.


Fagin’s* Threshold Algorithm

[Fagin, Lotem, Naor, PODS’01], [Guntzer, Balke, Kiessling, VLDB’00], [Nepal, Ramakrishna, ICDE’99]

Long studied and well understood. *Concurrently developed by 3 groups.

TA Algorithm:
1) Access the n lists in parallel.
2) While some object oi is seen, perform a random access to the other lists to find the complete score for oi.
3) Do the same for all objects in the current row.
4) Now compute the threshold τ as the sum of scores in the current row.
5) The algorithm stops after K objects have been found with a score above τ.
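A matching sketch of TA on the same five lists as the running example (names are illustrative; one full row of sorted accesses per iteration):

```python
LISTS = [
    [("o3", 99), ("o1", 66), ("o0", 63), ("o2", 48), ("o4", 44)],
    [("o1", 91), ("o3", 90), ("o0", 61), ("o4", 7),  ("o2", 1)],
    [("o1", 92), ("o3", 75), ("o4", 70), ("o2", 16), ("o0", 1)],
    [("o3", 74), ("o1", 56), ("o2", 56), ("o0", 28), ("o4", 19)],
    [("o3", 67), ("o4", 67), ("o1", 58), ("o2", 54), ("o0", 35)],
]

def random_access(obj, lists):
    return sum(s for lst in lists for o, s in lst if o == obj)

def threshold_algorithm(lists, k):
    scores = {}
    for depth in range(len(lists[0])):
        # 1-3) Sorted access on every list; resolve each newly seen
        # object immediately with random accesses to the other lists.
        for lst in lists:
            obj = lst[depth][0]
            if obj not in scores:
                scores[obj] = random_access(obj, lists)
        # 4) Threshold tau = sum of the scores in the current row.
        tau = sum(lst[depth][1] for lst in lists)
        # 5) Stop once k resolved objects score at least tau.
        top = sorted(scores.items(), key=lambda kv: -kv[1])[:k]
        if len(top) == k and top[-1][1] >= tau:
            return top
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

print(threshold_algorithm(LISTS, 1))   # [('o3', 405)]
```

For K=1 the first row gives τ = 423 (no winner yet); the second row lowers τ to 354, so o3 with score 405 is returned, exactly as in the example slide.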

The Threshold Algorithm (Example)

v1      v2      v3      v4      v5
o3, 99  o1, 91  o1, 92  o3, 74  o3, 67
o1, 66  o3, 90  o3, 75  o1, 56  o4, 67
o0, 63  o0, 61  o4, 70  o2, 56  o1, 58
o2, 48  o4, 07  o2, 16  o0, 28  o2, 54
o4, 44  o2, 01  o0, 01  o4, 19  o0, 35

Iteration 1 threshold: τ = 99 + 91 + 92 + 74 + 67 = 423.
Have we found K=1 objects with a score above τ? => NO (scores so far: o3, 405; o1, 363).

Iteration 2 threshold (2nd row): τ = 66 + 90 + 75 + 56 + 67 = 354.
Have we found K=1 objects with a score above τ? => YES! TOP-K: o3, 405 (also computed: o1, 363; o4, 207).


Comparison of Fagin’s and the Threshold Algorithm

• TA sees fewer objects than FA.
• TA may perform more random accesses than FA.
• TA requires only bounded buffer space (k), at the expense of more random seeks.
• FA makes use of unbounded buffers.

Optimal Algorithms

Algorithm B is instance optimal over a set of algorithms A and a set of inputs D if:

B ∈ A and Cost(B, D) = O(Cost(A, D)) for all A ∈ A, D ∈ D.

Which means that:

Cost(B, D) ≤ c · Cost(A, D) + c′, for all A ∈ A, D ∈ D.

Theorem [Fagin et al. 2003]: TA is instance optimal for every monotone aggregation function, over every database (excluding wild guesses).


TA-Sorted

TA makes random accesses, and assumes they are possible and inexpensive. In many situations, random accesses are much more expensive than sequential accesses, or may be difficult to implement.

TA-Sorted uses sequential access only. Assumes sorted access for each attribute.

TA-Sorted Algorithm:
1) Access the n lists in parallel.
2) When some object oi is seen, update its upper and lower bound.
3) The algorithm stops after K objects have been resolved and all other objects have a lower upper bound.
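The bound bookkeeping can be sketched as follows on the same five lists (names are illustrative, and the stopping rule is simplified: stop when no other object’s upper bound can beat the k-th best lower bound, including objects never seen):

```python
LISTS = [
    [("o3", 99), ("o1", 66), ("o0", 63), ("o2", 48), ("o4", 44)],
    [("o1", 91), ("o3", 90), ("o0", 61), ("o4", 7),  ("o2", 1)],
    [("o1", 92), ("o3", 75), ("o4", 70), ("o2", 16), ("o0", 1)],
    [("o3", 74), ("o1", 56), ("o2", 56), ("o0", 28), ("o4", 19)],
    [("o3", 67), ("o4", 67), ("o1", 58), ("o2", 54), ("o0", 35)],
]

def ta_sorted(lists, k):
    seen = {}    # obj -> {list index: score seen via sorted access}
    for depth in range(len(lists[0])):
        for i, lst in enumerate(lists):
            obj, s = lst[depth]
            seen.setdefault(obj, {})[i] = s
        bottoms = [lst[depth][1] for lst in lists]  # last value per list
        # Lower bound: sum of seen scores.  Upper bound: fill each
        # missing list with its current bottom value.
        bounds = {}
        for obj, parts in seen.items():
            lb = sum(parts.values())
            ub = lb + sum(b for i, b in enumerate(bottoms) if i not in parts)
            bounds[obj] = (lb, ub)
        best = sorted(bounds, key=lambda o: -bounds[o][0])[:k]
        kth_lb = bounds[best[-1]][0]
        others_done = all(bounds[o][1] <= kth_lb
                          for o in bounds if o not in best)
        # An object never seen can score at most sum(bottoms).
        if len(best) == k and others_done and sum(bottoms) <= kth_lb:
            return [(o, bounds[o][0]) for o in best]
    return None

print(ta_sorted(LISTS, 1))   # [('o3', 405)]
```

Running this reproduces the bounds of the example slide: after iteration 1, o1 = [183, 423] and o3 = [240, 423]; after iteration 2, o3 is resolved at 405 and the Top-1 query terminates.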

The TA-Sorted Algorithm (Example)

v1      v2      v3      v4      v5
o3, 99  o1, 91  o1, 92  o3, 74  o3, 67
o1, 66  o3, 90  o3, 75  o1, 56  o4, 67
o0, 63  o0, 61  o4, 70  o2, 56  o1, 58
o2, 48  o4, 07  o2, 16  o0, 28  o2, 54
o4, 44  o2, 01  o0, 01  o4, 19  o0, 35

Iteration 1: o1 = [183, 423], o3 = [240, 423]
Iteration 2: o1 = [305, 372], o3 = 405, o4 = [67, 354]
Iteration 3: o1 = 363, o3 = 405, o4 = [137, 317]

TOP-K: o3, 405; o1, 363.


TA-Sorted and Relational Systems

[Ilyas, Aref, Elmagarmid, VLDB’03], [Tsaparas, Palpanas, Kotidis, Koudas, Srivastava, ICDE’03], [Bruno, Chaudhuri, Gravano, ICDE’01]

The sorted access makes it easier and conceptually simpler to integrate into relational Database Management Systems:

• Ilyas et al. present the NRA-RJ Rank-Join query operator.
• Bruno et al. show that Top-K queries can be reduced to multidimensional range queries (using histograms to model the data distribution).

Probabilistic TA-Sorted

[Theobald, Weikum, Schenkel, VLDB’04]

TA-Sorted can keep large intermediary results. A smart idea: use information about the distribution of the values to eliminate objects that are unlikely to be in the Top-K result. The key: compute probabilistic guarantees.

Probabilistic TA-Sorted:
1) Access the n lists in parallel.
2) When some object oi is seen, update its upper and lower bound.
3) The algorithm stops after K objects have been resolved and all other objects have a lower upper bound.


Probabilistic TA-Sorted

How to compute probabilistic guarantees? We need an estimate for the score of unseen objects:
• Assume attribute independence.
• Compute a model of the distribution of each attribute Xi (use histograms to model the data distributions).
• Use convolution to estimate the distribution of X1 + X2.
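A toy sketch of the convolution step; the two 4-bucket histograms below are illustrative values, not taken from the slides:

```python
def convolve(p, q):
    """Distribution of the sum of two independent bucketed variables."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b   # bucket sums add under independence
    return out

# P(X1 in bucket i) and P(X2 in bucket j), equi-width buckets 0..3.
h1 = [0.1, 0.2, 0.3, 0.4]
h2 = [0.4, 0.3, 0.2, 0.1]
h12 = convolve(h1, h2)            # distribution over bucket sums 0..6

# Probability that X1 + X2 lands in the two highest sum-buckets: this
# is the kind of estimate used to drop an object (like o4 in the next
# example) that is unlikely to make the Top-K.
p_high = sum(h12[5:])
```

Under attribute independence the convolution is exact for the bucketed model; the quality of the guarantee then depends only on the histogram resolution.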

Probabilistic TA-Sorted (Example)

v1      v2      v3      v4      v5
o3, 99  o1, 91  o1, 92  o3, 74  o3, 67
o1, 66  o3, 90  o3, 75  o1, 56  o4, 67
o0, 63  o0, 61  o4, 70  o2, 56  o1, 58
o2, 48  o4, 07  o2, 16  o0, 28  o2, 54
o4, 44  o2, 01  o0, 01  o4, 19  o0, 35

Iteration 1: o1 = [183, 423], o3 = [240, 423]
Iteration 2: o1 = [305, 372], o3 = 405, o4 = [67, 354]
Iteration 3: o1 = 363, o3 = 405

o4 can be removed from the candidate list if it is unlikely to be in the Top-2.


Top-K Processing Using Views

[Das, Gunopulos, Koudas, Tsirogiannis, VLDB’06]

• Query answering using views.
• Improved efficiency:
  – Use similar, previously instantiated queries.
  – Use previous queries to model the correlations between attributes.

Top-K Processing Using Views

Ranking views: materialized results of previously asked top-k queries.

Query: fQ = 3X1 + 2X2 + 5X3

V1 (fV1 = 2X1 + 5X2):    V2 (fV2 = X2 + 4X3):
tid  Score               tid  Score
3    553                 2    351
4    385                 1    237
5    216                 5    177
2    201                 3    159
1    169                 4    88

Base table R:
tid  X1  X2  X3
1    82   1  59
2    53  19  83
3    29  99  15
4    80  45   8
5    28  32  39

Problem: can we answer new top-k queries efficiently using ranking views?


LPTA - Setting

• Linear additive scoring functions, e.g. fQ = 3X1 + 2X2 + 5X3.
• Set of views:
  – Materialized result of the tuples of a previously executed top-k query, as (tid, scoreQ(tid)) pairs.
  – Arbitrary subset of attributes.
  – Sorted access on the (tid, score) pairs.
• Random access on the base table R.
• Extends PREFER [Hristidis et al., 2001].

LPTA - Example

(Figure: Top-1 query over R(X1, X2) with two views V1 and V2, each a list of (tid, score) pairs sorted by score. LPTA reads the two views in lockstep, and checks a Top-1 stopping condition against the query direction Q in the unit square.)


LPTA - Example (cont’)

(Figure: after reading the first entries tid11 of V1 and tid12 of V2, the candidate tids seen so far are completed by random access to R; sorted access on both views continues until the stopping condition holds.)

LPTA

Linear Programming adaptation of TA. For R(X1, X2), query Q: fQ = 3X1 + 10X2, and views fV1 = 2X1 + 5X2, fV2 = X1 + 2X2, at iteration d (with sd1, sd2 the last scores read from V1, V2):

maximize fQ subject to
  0 ≤ X1, X2 ≤ 100
  2X1 + 5X2 ≤ sd1
  X1 + 2X2 ≤ sd2

The LP optimum bounds the score of any unseen tuple; the algorithm stops when unseen_max ≤ topk_min.
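With only two attributes the LP above can be solved without a solver: the maximum of a linear function over a polygon lies at a vertex, so enumerating intersections of constraint boundaries suffices. In this sketch, sd1 = 150 and sd2 = 70 are assumed values for the last scores read from the views:

```python
from itertools import combinations

def lp_max_2d(c, constraints):
    """Maximize c.x over {x : a.x <= b for every (a, b) in constraints}."""
    best = None
    # Each candidate vertex is the intersection of two boundaries.
    for (a1, b1), (a2, b2) in combinations(constraints, 2):
        det = a1[0] * a2[1] - a1[1] * a2[0]
        if abs(det) < 1e-12:
            continue                      # parallel boundaries
        x = (b1 * a2[1] - b2 * a1[1]) / det
        y = (a1[0] * b2 - a2[0] * b1) / det
        if all(a[0] * x + a[1] * y <= b + 1e-9 for a, b in constraints):
            val = c[0] * x + c[1] * y
            if best is None or val > best:
                best = val
    return best

cons = [
    ((2, 5), 150),                    # view V1: 2*X1 + 5*X2 <= sd1
    ((1, 2), 70),                     # view V2:   X1 + 2*X2 <= sd2
    ((1, 0), 100), ((-1, 0), 0),      # 0 <= X1 <= 100
    ((0, 1), 100), ((0, -1), 0),      # 0 <= X2 <= 100
]
unseen_max = lp_max_2d((3, 10), cons)   # fQ = 3*X1 + 10*X2
# LPTA stops once unseen_max <= topk_min (the k-th best score so far).
```

For these assumed sd values the maximum is 300, attained at X1 = 0, X2 = 30; as the views are read deeper, sd1 and sd2 shrink and unseen_max falls until the stopping test fires.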


View Selection Heuristics

Select Views by Angle (SVA): sort the views by increasing angle with respect to Q, and select the closest ones.

View Selection: Cost Estimation Framework

• What is the cost of running LPTA on a specific set of views?
• We need a precise indicator of sequential and random accesses.
• Cost = number of sequential accesses.


Simulation of LPTA Using Histograms

HQ approximates the score distribution with respect to Q (b buckets, n/b tuples per bucket); HV1, HV2 do the same for the views.

1. Estimate the score of the k-th highest tuple (topkmin).
2. Run LPTA on the histograms, bucket by bucket, in lockstep.
3. Estimate the cost.

Presentation Outline

• Introduction to Top-K Query Processing
• Centralized techniques
• Distributed techniques
  – Exact algorithms
  – Exact algorithms with fixed rounds of communication: TPUT, TJA, TPAT
  – Approximate algorithms using data distribution information: KLEE
  – Exact algorithms using upper/lower bounds: LB-K
• Online Algorithms for Monitoring Top-K results
• Future Work


Distributed Top-K Query Processing

Example: sensor monitoring.
• Consider n sensors S = {s1, s2, …, sn}, each of which maintains a sliding window of m readings {o1, o2, …, om}. Note: oij denotes the i-th reading of the j-th sensor.
• Given an n-dimensional query point Q = {q1, q2, …, qn}.
• Objective: find the K timestamps with the maximum value of Score(oi) = Σj wj · Sim(qj, oij), where:
  – wj: sensor weight. The readings of some sensors might be more important than those of other sensors.
  – Sim(qj, oij): a monotone similarity function.
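The weighted score of one timestamp can be written directly from the objective. In this sketch, Sim is an assumed monotone similarity (1 / (1 + |q − o|)), and the query point, readings, and weights are illustrative values:

```python
def sim(q, o):
    """Assumed monotone similarity: closer readings score higher."""
    return 1.0 / (1.0 + abs(q - o))

def score(q, readings, weights):
    """Score(o_i) = sum_j w_j * Sim(q_j, o_ij) for one timestamp."""
    return sum(w * sim(qj, oj) for qj, oj, w in zip(q, readings, weights))

q = [20.0, 21.0, 19.5]           # query point, one value per sensor
readings = [20.0, 23.0, 19.5]    # the i-th reading of each sensor
weights = [1.0, 0.5, 2.0]        # sensor weights
s = score(q, readings, weights)
```

Since Sim is monotone and the weights are non-negative, Score is a monotone aggregation function, which is what lets TA-style algorithms apply in this setting.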

Distributed Top-K Query Processing

Cost metric in a distributed environment:

A) Utilization of the communication medium
  – Transmitting less data conserves resources and energy and minimizes failures.
  – E.g., in a sensor network, sending 1 byte ≈ 1120 CPU instructions. Source: the RISE (Riverside Sensor) platform (NetDB’05, IPSN’05 Demo, IEEE SECON’05).

B) Query response time
  – The number of bytes transmitted is not the only parameter.
  – Minimize the time to execute a query.


Communication Topologies

• Assume that the distributed sites are interconnected in a graph topology.
  – Example: peer-to-peer or sensor networks.

(Figure: a star topology, a graph topology, and a spanning tree, each with a query node QN and sites v1 … vn.)

Naïve Solution: Centralized Join (CJA)

• Each node sends all its local scores (its full list).
• Each intermediate node forwards all received lists.
• The Gnutella approach.

Drawbacks:
• Overwhelming amount of messages.
• Huge query response time.


Simple Solution: Staged Join (SJA)

[Madden, Franklin, Hellerstein, Wong, OSDI’02]

• Aggregate the lists before they are forwarded to the parent.
• This is essentially the TAG approach (Madden et al., OSDI’02).
• Advantage: only (n-1) messages.
• Drawback: still sending everything!

Why Not TA in a Distributed Environment?

Advantages:
• The number of objects accessed is minimized.
• Marian et al. show how to minimize random accesses [Marian, Bruno, Gravano, TODS’04].

Disadvantages:
• Each object is accessed individually (random accesses).
• A huge number of round trips (phases).
• Unpredictable latency (phases are sequential).
• In-network aggregation is not possible.


The TPUT Algorithm

[Cao and Wang, PODC’04]

TPUT is a 3-round algorithm: it improves query response time.

TPUT (Three-Phase Uniform Threshold):
1) Fetch the K first entries from all n lists. Define the threshold τ as τ = (Kth highest partial score) / n. τ (the uniform threshold) is then disseminated to all nodes.
2) Each node sends any pair which has a score above τ.
3) If we found the complete score for fewer than K objects, then we perform a random access for all incomplete objects.
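The three phases can be sketched on the running example’s five lists (names are illustrative, and the sketch is simplified: it skips TPUT’s upper-bound pruning of candidates before the random-access phase):

```python
LISTS = [
    [("o3", 99), ("o1", 66), ("o0", 63), ("o2", 48), ("o4", 44)],
    [("o1", 91), ("o3", 90), ("o0", 61), ("o4", 7),  ("o2", 1)],
    [("o1", 92), ("o3", 75), ("o4", 70), ("o2", 16), ("o0", 1)],
    [("o3", 74), ("o1", 56), ("o2", 56), ("o0", 28), ("o4", 19)],
    [("o3", 67), ("o4", 67), ("o1", 58), ("o2", 54), ("o0", 35)],
]

def random_access(obj, lists):
    return sum(s for lst in lists for o, s in lst if o == obj)

def tput(lists, k):
    n = len(lists)
    # Phase 1: top-k entries per list; tau = (kth highest partial) / n.
    partial = {}
    for lst in lists:
        for obj, s in lst[:k]:
            partial[obj] = partial.get(obj, 0) + s
    tau = sorted(partial.values(), reverse=True)[k - 1] / n
    # Phase 2: every node reports its pairs scoring at least tau.
    sums, seen_in = {}, {}
    for i, lst in enumerate(lists):
        for obj, s in lst:
            if s >= tau:
                sums[obj] = sums.get(obj, 0) + s
                seen_in.setdefault(obj, set()).add(i)
    # Phase 3: random accesses for candidates still incomplete.
    scores = {obj: (sums[obj] if len(seen_in[obj]) == n
                    else random_access(obj, lists))
              for obj in sums}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

print(tput(LISTS, 1))   # tau = 240 / 5 = 48; answer [('o3', 405)]
```

For K=1, Phase 1 yields partial scores o3 = 240 and o1 = 183, so τ = 48, matching the example on the next slide.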

The TPUT Algorithm: Example (Q: TOP-1)

v1      v2      v3      v4      v5
o3, 99  o1, 91  o1, 92  o3, 74  o3, 67
o1, 66  o3, 90  o3, 75  o1, 56  o4, 67
o0, 63  o0, 61  o4, 70  o2, 56  o1, 58
o2, 48  o4, 07  o2, 16  o0, 28  o2, 54
o4, 44  o2, 01  o0, 01  o4, 19  o0, 35

Phase 1: o1 = 91 + 92 = 183, o3 = 99 + 67 + 74 = 240.
τ = (Kth highest partial score) / n = 240 / 5 = 48.
Phase 2: have we computed K exact scores?
Exactly computed: [o3 = 405, o1 = 363]. Incompletely computed: [o2’ = 158, o4’ = 137, o0’ = 124].
TOP-1: o3, 405.

Drawback: the threshold is too coarse (uniform).


Optimality?

Fagin et al. (2003): TA is an instance-optimal algorithm.

Cao et al. (PODC’04): no fixed-round algorithm can be instance optimal (TPUT, TJA, etc.).

Fixed rounds => constant communication overhead.

Improving the TPUT Algorithm

[Yu, Li, Wu, Agrawal, El Abbadi, DEXA’05]

In TPUT the threshold is uniform and too coarse. One approach is to use statistics to set different thresholds per attribute. We need statistical information a priori.

TPAT:
1) Fetch the K first entries from all n lists. Define the threshold τ as the Kth highest partial score.
2) Partition τ based on the data distribution, then disseminate the per-attribute values to all nodes.
3) Each node sends any pair which has a score above its threshold.
4) If we found the complete score for fewer than K objects, then we perform a random access for all incomplete objects.


Threshold Join Algorithm (TJA)

[Zeinalipour-Yazti, Vagena, Gunopulos, Kalogeraki, Tsotras, Vlachos, Koudas, Srivastava, DMSN’05]

TJA is a 3-round algorithm: it minimizes the number of transmitted objects, performs in-network aggregation, and optimizes the utilization of the communication channel.

1. LB Phase: ask each node to send its K (locally) highest-ranked results. The union of these results defines a threshold τ.
2. HJ Phase: ask each node to transmit everything above this threshold τ.
3. CL Phase: identify the complete score of all incompletely calculated scores.
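A flat (non-hierarchical) sketch of the three rounds on the running example’s lists; names are illustrative, and the HJ rule is simplified to "each node reports every entry it ranks at least as high as its lowest-ranked member of τ":

```python
LISTS = [
    [("o3", 99), ("o1", 66), ("o0", 63), ("o2", 48), ("o4", 44)],
    [("o1", 91), ("o3", 90), ("o0", 61), ("o4", 7),  ("o2", 1)],
    [("o1", 92), ("o3", 75), ("o4", 70), ("o2", 16), ("o0", 1)],
    [("o3", 74), ("o1", 56), ("o2", 56), ("o0", 28), ("o4", 19)],
    [("o3", 67), ("o4", 67), ("o1", 58), ("o2", 54), ("o0", 35)],
]

def random_access(obj, lists):
    return sum(s for lst in lists for o, s in lst if o == obj)

def tja(lists, k):
    n = len(lists)
    # LB phase: union of every node's local top-k object IDs.
    tau = set()
    for lst in lists:
        tau.update(obj for obj, _ in lst[:k])
    # HJ phase: each node reports every entry scored at least as high
    # as its lowest-scoring member of tau.
    sums, seen_in = {}, {}
    for i, lst in enumerate(lists):
        cutoff = min(s for obj, s in lst if obj in tau)
        for obj, s in lst:
            if s >= cutoff:
                sums[obj] = sums.get(obj, 0) + s
                seen_in.setdefault(obj, set()).add(i)
    # CL phase: one batch of random accesses for incomplete candidates.
    scores = {obj: (sums[obj] if len(seen_in[obj]) == n
                    else random_access(obj, lists))
              for obj in sums}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

print(tja(LISTS, 1))   # tau = {o3, o1}; answer [('o3', 405)]
```

For K=1 the LB phase yields τ = {o3, o1}, and the HJ phase completes both of their scores (405 and 363), matching the phase-by-phase example that follows.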

Step 1 - LB (Lower Bound) Phase

• Each node sends its local top-k results to its parent.
• Each intermediate node performs a union of all received lists (the union defines τ).

Query: TOP-1. Local lists:

v1      v2      v3      v4      v5
o3, 99  o1, 91  o1, 92  o3, 74  o3, 67
o1, 66  o3, 90  o3, 75  o1, 56  o4, 67
o0, 63  o0, 61  o4, 70  o2, 56  o1, 58
o2, 48  o4, 07  o2, 16  o0, 28  o2, 54
o4, 44  o2, 01  o0, 01  o4, 19  o0, 35

LB result: τ = {o3, o1}.


Step 2 – HJ (Hierarchical Join) Phase

• Disseminate τ to all nodes.
• Each node sends back everything with a score above all objectIDs in τ.
• Before sending the objects, each node tags as incomplete any score that could not be computed exactly (an upper bound).

HJ result: o3, 405 and o1, 363 (complete); o4’, 354 (incomplete).

Step 3 – CL (Cleanup) Phase

Have we found K objects with a complete score?
• Yes: the answer has been found!
• No: find the complete score for each incomplete object (all in a single batch phase).

• CL ensures correctness.
• This phase is rarely required in practice.

Full ranking: o3, 405; o1, 363; o4, 207; o0, 188; o2, 175. TOP-1: o3, 405.


TJA vs. TPUT

(Figure: bytes required for the distributed Top-K algorithms SJA, TPUT, and TJA on a star topology, K=5, m=25K, as n grows from 20 to 100 nodes; log-scale y-axis from 10^3 to 10^9 bytes.)

Approximate Distributed Algorithms

• TA performs many communication rounds.
• TPUT may retrieve a lot of data in Phase 2.
• TPUT and TJA perform random accesses.

All these characteristics hurt performance!


Approximate Algorithms: KLEE

[Michel, Triantafillou, Weikum, VLDB’05]

• TPUT may retrieve a lot of data in Phase 2.
• TPUT and TJA perform random accesses.

KLEE is an improvement on both counts:
• Focuses on approximate answers.
• Uses information about the data distribution to reduce data transfers.
• Does not do random accesses at each peer.

The KLEE Algorithm

KLEE is a 2- or 3-round algorithm:
1. Exploration step: finds an approximation of the min-k score threshold using histograms and Bloom filters.
2. Optimization step: decides if step 3 will be executed (NO communication).
3. Candidate filtering: a docID is a good candidate if it is high-scored in many peers.
4. Candidate retrieval: get all good docID candidates.


Histogram Bloom Structure

Each node pre-computes per attribute:
1. an equi-width histogram (score vs. number of docs),
2. a Bloom filter for each histogram cell,
3. the average score per cell,
4. upper/lower scores per cell.

*From the VLDB’05 KLEE presentation.

Top-K Algorithms that Use Score Bounds

[Marian, Bruno, Gravano, TODS’04], [Zeinalipour-Yazti, Lin, Gunopulos, CIKM’06]

• Suppose that each node can only return lower and upper bounds rather than exact scores.
• E.g., instead of 16 it tells us that the similarity is in the range [11..19].

(Figure: a grid of cells over moving-object trajectories; each access point vi stores METADATA triples (id, lb, ub) per object, e.g. A4,10,18; A2,13,19; A0,15,25; ….)


LB-K Algorithm

[Zeinalipour-Yazti, Lin, Gunopulos, CIKM’06]

• An iterative algorithm for finding the K highest-ranked DATA objects using lower bounds (METADATA objects).
• Strategy: utilize the METADATA objects in order to decide which DATA objects have to be transferred.

LB-K: Example

Query: find the K=2 highest-ranked answers.

METADATA (id, lb): A4,10; A2,13; A0,15; A3,20; A9,22; A7,30; …
DATA (id, exact):  A4,17; A2,18; A0,24; A3,22; A9,25; A7,33; …

(Figure: successive TJA rounds fetch batches of K+1, 2K+1, … METADATA lower bounds and compare them against the exact DATA scores retrieved so far.)


UBLB-K Algorithm

• Also an iterative algorithm with the same objectives as LB-K.
• Differences:
  – It uses both a lower (LB) and an upper (UB) bound on the distributed DATA.
  – It transfers the candidate DATA objects in a final (bulk) phase rather than incrementally.

UBLB-K: Example

METADATA (id, lb, ub): A4,10,18; A2,13,19; A0,15,25; A3,20,27; A9,22,26; A7,30,35; …
DATA (id, exact):      A4,17; A2,18; A0,24; A3,22; A9,25; A7,33; …

Note: the Kth lowest UB is 19; therefore A3 (LB: 20) and everything ranked below it are not necessary.
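The pruning test in this example is a couple of lines of Python (scores here behave like distances, so lower is better; the metadata values are the ones from the slide):

```python
metadata = {                 # id -> (lower bound, upper bound)
    "A4": (10, 18), "A2": (13, 19), "A0": (15, 25),
    "A3": (20, 27), "A9": (22, 26), "A7": (30, 35),
}
K = 2

# K-th lowest upper bound among the metadata entries.
kth_ub = sorted(ub for _, ub in metadata.values())[K - 1]

# Any object whose lower bound already reaches kth_ub cannot make
# the top-K; only the rest need their exact DATA transferred.
candidates = [oid for oid, (lb, _) in metadata.items() if lb < kth_ub]
```

Here kth_ub = 19, so A3 (LB 20), A9, and A7 are pruned, and only A4, A2, and A0 are fetched in the final bulk phase.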


Presentation Outline

• Introduction to Top-K Query Processing
• Centralized techniques
  – Fagin’s Algorithm
  – Optimal Algorithms: TA (Threshold Algorithm)
  – Restricted Access Models: TA-Sorted
  – Probabilistic TA-Sorted
  – Using previous query instantiations: LPTA
• Distributed techniques
  – Exact algorithms
  – Exact algorithms with fixed rounds of communication: TPUT, TJA, TPAT
  – Approximate algorithms using data distribution information: KLEE
  – Exact algorithms using upper/lower bounds: LB-K
• Online Algorithms for Monitoring Top-K results: BABOLS, TMA
• Future Work

Online Algorithms

[Mouratidis, Bakiras, Papadias, SIGMOD’06]

• Top-K monitoring for stream data: the TMA/SMA algorithms.
• Monitor multiple Top-K queries simultaneously.
  – Efficiently identify the effect of changes: only some Top-K results change.

(Figure: two query directions Q1 and Q2 over the tuples of R in the unit square.)


Monitoring Top-K Results in a Distributed Setting

[Babcock, Olston, SIGMOD’03]

• The setting: changes come over time.
• Use any efficient algorithm for finding the top-K.
• Monitor changes: only if large changes happen do you have to recompute.
  – Approximate algorithm: top-K results are correct within ε.
  – Need an algorithm that decides how big a change we can tolerate per node:

Query: 2X1 + X2. Tuples: t1 = (3, 9), t2 = (5, 1).
Slack per attribute is: ((Score(t1) - Score(t2)) + ε) / 2 = 2 + ε/2.
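The slack arithmetic above checks out directly (a tiny sketch; the concrete ε value is illustrative):

```python
def score(t):
    """The slide's query 2*X1 + X2 applied to a tuple (x1, x2)."""
    return 2 * t[0] + t[1]

def slack(t1, t2, eps):
    """Per-attribute change tolerance: ((Score(t1) - Score(t2)) + eps) / 2."""
    return ((score(t1) - score(t2)) + eps) / 2

# Score(t1) = 15, Score(t2) = 11, so slack = 2 + eps/2.
print(slack((3, 9), (5, 1), 0.5))   # 2.25
```

As long as every node's local attribute values drift by less than this slack, the cached top-K answer stays correct within ε and no recomputation is triggered.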

Conclusions

• Top-K Query Processing is an area with many applications in practical problems, and many challenges and opportunities!
  – Privacy issues
  – Approximate algorithms
  – Online algorithms
  – Modeling and exploiting correlations


References

– Amato G., Rabitti F., Savino P. and Zezula P., “Region Proximity in Metric Spaces and Its Use for Approximate Similarity Search”, In TOIS, 2003.
– Babcock B. and Olston C., “Distributed Top-K Monitoring”, In Proceedings of the ACM SIGMOD International Conference on Management of Data, San Diego, CA, USA, Pages 28-39, 2003.
– Balke W.-T., Nejdl W., Siberski W., Thaden U., “Progressive Distributed Top-K Retrieval in Peer-to-Peer Networks”, In Proceedings of the 21st International Conference on Data Engineering, April 5-8, Tokyo, Japan, 2005.
– Banerjee A., Mitra A., Najjar W., Zeinalipour-Yazti D., Kalogeraki V. and Gunopulos D., “RISE Co-S: High Performance Sensor Storage and Co-Processing Architecture”, Second Annual IEEE Communications Society Conference on Sensor and Ad Hoc Communications and Networks (SECON’2005), Santa Clara, California, USA, 2005.

References

– Bloom B. H., “Space/Time Trade-Offs in Hash Coding with Allowable Errors”, Communication of the ACM, 13(7):422-426, 1970.

– Bruno N.,Gravano L. and Marian A., “EvaluatingTop-K Queries Over Web Accessible Databases”, In Proceedings of the 18th International Conference on Data Engineering, San Jose, CA, USA, Page 369, 2002.

– Cao P. and Wang Z., “Efficient Top-K Query Calculation in Distributed Networks”, In Proceedings of the twenty-third annual ACM symposium on Principles of distributed computing,St.John’s,Newfoundland, Canada, Pages 206-215, 2004.

– Chun B.N., Culler D.E., Roscoe T., Bavier A.C., Peterson L.L., Wawrzoniak M., Bowman M., "PlanetLab: an overlay testbed for broad-coverage services", Computer Communication Review, Volume 33, Issue 3, Pages 3-12, 2003.

– Claffy K., Tracie E., McRobb D., "Internet tomography", 1999.


– Considine J., Li F., Kollios G. and Byers J., "Approximate Aggregation Techniques for Sensor Databases", In Proceedings of the 20th International Conference on Data Engineering, Boston, MA, USA, Page 449, 2004.

– Deligiannakis A., Kotidis Y. Roussopoulos N., “Hierarchical in-Network Data Aggregation with Quality Guarantees”, In 9th International Conference on Extending Database Technology, Heraklion, Greece, March 14-18, Pages 658-675, 2004.

– Donjerkovic D. and Ramakrishnan R., "Probabilistic Optimization of Top-N Queries", In Proceedings of the 25th International Conference on Very Large Data Bases, Edinburgh, Scotland, UK, Pages 411-422, 1999.

– Fagin R., "Combining Fuzzy Information from Multiple Systems", In Proceedings of the fifteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, Montreal, Quebec, Canada, Pages 216-226, 1996.

– Fagin R., "Fuzzy Queries In Multimedia Database Systems", In Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, Seattle, WA, USA, pp. 1-10, 1998.

– Fagin R., Lotem A. and Naor M., “Optimal Aggregation Algorithms For Middleware”, In Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, Santa Barbara, CA, USA, Pages 102-113, 2001.

– Gravano L. and Chaudhuri S., "Evaluating Top-K Selection Queries", In Proceedings of the 25th International Conference on Very Large Data Bases, Edinburgh, Scotland, UK, Pages 397-410, 1999.

– Guntzer U., Balke W., Kiessling W., "Optimizing Multi-Feature Queries for Image Databases", In VLDB 2000.

– Hansen T., Otero J., McGregor A., Braun H-W., "Active measurement data analysis techniques", In Proceedings of the International Conference on Communications in Computing, Las Vegas, Nevada, pp. 105, 2000.


– Ilyas I.F., Aref W.G. and Elmagarmid A.K., "Supporting Top-k Join Queries in Relational Databases", In The VLDB Journal – The International Journal on Very Large Data Bases, Vol. 13, Iss. 3, pp. 207-221, 2003.

– Kalnis P., Ng W-S., Ooi B-C., Tan K-L., “Answering similarity queries in peer-to-peer networks”, In Proceedings of the 14th International World Wide Web Conference, Pages 482-483, New York City, NY, USA, 2004.

– Kiessling W., "Foundations of Preferences in Database Systems", In Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong, China, Pages 311-322, 2002.

– Lv Q., Cao P., Cohen E., Lai K., Shenker S., "Search and Replication in Unstructured Peer-to-Peer Networks", In Proceedings of the 16th international conference on Supercomputing, New York, NY, USA, Pages 84-95, 2002.

– Nepal S., Ramakrishna M. V., "Query Processing Issues in Image (Multimedia) Databases", In ICDE 1999.

– Madden S.R., Franklin M.J., Hellerstein J.M., Hong W., "TAG: a Tiny AGgregation Service for Ad-Hoc Sensor Networks", In Proceedings of the 5th symposium on Operating systems design and implementation, Boston, MA, pp. 131-146, 2002.

– Madden S.R., Franklin M.J., Hellerstein J.M., Hong W., ”The Design of an Acquisitional Query Processor for Sensor Networks”, In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, San Diego, CA, USA, Pages 491-502, 2003.

– Marian A., Gravano L., Bruno N., "Evaluating Top-k Queries over Web-Accessible Databases", In TODS 2004.

– Michel S., Triantafillou P., Weikum G., "KLEE: A Framework for Distributed Top-K Query Algorithms", In 31st conference in the series of the Very Large Data Bases, Trondheim, Norway, 2005.

– Nejdl W., Siberski W., Thaden U. and Balke W., “Top-k Query Evaluation for Schema-Based Peer-to-Peer Networks”, In ISWC 2004.


– Szewczyk R., Osterweil E., Polastre J., Hamilton M., Mainwaring A.M., Estrin D., "Habitat monitoring with sensor networks", Commun. ACM 47(6):34-40, 2004.

– Theobald M., Schenkel R., Weikum G., "Top-k Query Evaluation with Probabilistic Guarantees", In VLDB 2004.

– Tsoumakos D. and Roussopoulos N., “Adaptive Probabilistic Search for Peer-to-Peer Networks”, In Proceedings of the Third International Conference on Peer-to-Peer Computing, Linkoping, Sweden, Pages 102-110, 2003.

– Xiong L., Chitti S., Liu L., “Top-k Queries across Multiple Private Databases”, In ICDCS 2005.


– Yang B. and Garcia-Molina H., "Efficient Search in Peer-to-Peer Networks", In Proceedings of the 22nd International Conference on Distributed Computing Systems, Vienna, Austria, Pages 5-14, 2002.

– Zeinalipour-Yazti D., Vagena Z., Gunopulos D., Kalogeraki V., Tsotras V., Vlachos M., Koudas N., Srivastava D., "The Threshold Join Algorithm for Top-K Queries in Distributed Sensor Networks", In Proceedings of the 2nd International Workshop on Data Management for Sensor Networks, collocated with VLDB 2005, Trondheim, Norway, 2005.

– Zeinalipour-Yazti D., Lin S., Kalogeraki V., Gunopulos D., Najjar W., "MicroHash: An Efficient Index Structure for Flash-Based Sensor Devices", In Proceedings of the 4th USENIX Conference on File and Storage Technologies (FAST'2005), San Francisco, CA, December 14-16, pp. 31-44, 2005.
