A Unified Approach for Computing Top- k Pairs in Multidimensional Space Presented By: Muhammad Aamir Cheema 1 Joint work with Xuemin Lin 1 , Haixun Wang 2 , Jianmin Wang 3 , Wenjie Zhang 1 1 University of New South Wales, Australia 2 Microsoft Research Asia 3 Tsinghua University, China

Post on 09-Jan-2016

Page 1: A Unified Approach for Computing Top-k Pairs in Multidimensional Space

A Unified Approach for Computing Top-k Pairs in Multidimensional Space

Presented By: Muhammad Aamir Cheema (1)

Joint work with Xuemin Lin (1), Haixun Wang (2), Jianmin Wang (3), Wenjie Zhang (1)

(1) University of New South Wales, Australia
(2) Microsoft Research Asia
(3) Tsinghua University, China

Page 2: A Unified Approach for Computing Top-k Pairs in Multidimensional Space

Introduction

Top-k Pairs Query: Given a scoring function f() that computes the score of a pair of objects, return the k pairs of objects with the smallest scores.

[Figure: five objects o1 to o5 plotted against an x-axis and a y-axis]

Examples:

• k-closest pairs: f(ou,ov) = dist(ou,ov); Answer (k=1) = (o1,o2)
• k-furthest pairs: f(ou,ov) = -dist(ou,ov); Answer (k=1) = (o2,o4)
• f(ou,ov) = (ou.x + ov.x) + (ou.y + ov.y); Answer (k=1) = (o4,o5)
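The three example queries differ only in the scoring function, which a brute-force sketch makes concrete. The coordinates below are hypothetical (the slide's figure does not give them) and were chosen so that the answers match the slide; `top_k_pairs` is an illustrative name, not part of the presented algorithm.

```python
import heapq
from itertools import combinations
from math import dist

def top_k_pairs(objects, score, k):
    """Return the k pairs with the smallest score by scoring every pair."""
    pairs = combinations(sorted(objects), 2)
    return heapq.nsmallest(
        k, pairs, key=lambda p: score(objects[p[0]], objects[p[1]])
    )

# Hypothetical coordinates, picked only to reproduce the slide's answers.
points = {
    "o1": (3, 6),
    "o2": (4, 7),
    "o3": (6, 5),
    "o4": (1, 1),
    "o5": (4, 2),
}

closest = top_k_pairs(points, lambda u, v: dist(u, v), 1)
furthest = top_k_pairs(points, lambda u, v: -dist(u, v), 1)
smallest_sum = top_k_pairs(
    points, lambda u, v: (u[0] + v[0]) + (u[1] + v[1]), 1
)
print(closest, furthest, smallest_sum)
```

This scores all N(N-1)/2 pairs, so it only serves as a reference point for the definition.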

Page 3: A Unified Approach for Computing Top-k Pairs in Multidimensional Space

Related Work

K-Closest Pairs Queries
• Computational geometry [M. Smid, Handbook on Comp. Geometry]
• Database community [Hjaltason et al., SIGMOD 1998] [Corral et al., SIGMOD 2000] [Yang et al., IDEAS 2002] [Shan et al., SSTD 2003]

K-Furthest Pairs Queries
[Supowit, SODA 1990] [Katoh et al., IJCGA 1995] [Corral et al., DKE 2004]

Top-k Queries
• Fagin's Algorithm [Fagin, PODS 1996]
• Threshold Algorithm [Fagin, JCSS 1999], [Nepal et al., ICDE 1999], [Güntzer et al., VLDB 2000]
• No Random Access Algorithm [Fagin, JCSS 1999], [Mamoulis et al., TODS 2007]

Page 4: A Unified Approach for Computing Top-k Pairs in Multidimensional Space

Motivation

SELECT a.id, b.id
FROM AGENT a, AGENT b
WHERE a.id < b.id
ORDER BY |a.sold - b.sold| - |a.salary - b.salary|
LIMIT k;

• No existing work for more general queries
  • Other Lp distances (e.g., Manhattan distance)?
  • More general scoring functions
  • Chromatic queries

SELECT a.id, b.id
FROM AGENT a, AGENT b
WHERE a.id < b.id AND a.manager <> b.manager
ORDER BY |a.sold - b.sold| - |a.salary - b.salary|
LIMIT k;

• No existing unified algorithm
  • One framework that answers a broad class of top-k pairs queries

Page 5: A Unified Approach for Computing Top-k Pairs in Multidimensional Space

Problem Definition (Preliminaries)

• Monotonic function

f() is monotonic if f(x1,…,xN) ≤ f(y1,…,yN) whenever xi ≤ yi for every 1 ≤ i ≤ N.

Examples:
f(x1,…,xN) = x1 + x2 + … + xN (summation)
f(x1,…,xN) = (x1 + x2 + … + xN) / N (average)

Page 6: A Unified Approach for Computing Top-k Pairs in Multidimensional Space

Problem Definition (Preliminaries)

• Loose monotonic function

s() takes two parameters and is loose monotonic if both of the following hold for every fixed value x:
1. for every y > x, s(x,y) either monotonically increases or monotonically decreases as y increases
2. for every y < x, s(x,y) either monotonically increases or monotonically decreases as y decreases

Example (x = 1):
s1(x,y) = |x - y|: y = 2 gives 1, y = 5 gives 4
s2(x,y) = (x + y): y = 2 gives 3, y = 5 gives 6

[Figure: number line from -∞ to ∞ marking 0, x = 1, sample values y = 2 and y = 5 to the right of x, and y = -3 and y = -2 to the left]

Loose monotonic functions are more general than monotonic functions.
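The two conditions can be spot-checked numerically on a handful of sample values. The sketch below is a sampled check, not a proof; the helper names and the counterexample `bump` are illustrative assumptions, not from the slides.

```python
def is_monotone(values):
    """True if the sequence is non-decreasing or non-increasing."""
    steps = list(zip(values, values[1:]))
    return all(a <= b for a, b in steps) or all(a >= b for a, b in steps)

def looks_loose_monotonic(s, x, ys):
    """Check both loose-monotonicity conditions for a fixed x on samples ys."""
    right = sorted(y for y in ys if y > x)                 # y increases away from x
    left = sorted((y for y in ys if y < x), reverse=True)  # y decreases away from x
    return (is_monotone([s(x, y) for y in right])
            and is_monotone([s(x, y) for y in left]))

samples = [-3, -2, 0, 2, 3, 5]

def abs_diff(x, y):
    return abs(x - y)

def total(x, y):
    return x + y

def bump(x, y):
    # Hypothetical counterexample: for y > 1 it falls until y = 3, then rises.
    return (y - 3) ** 2

print(looks_loose_monotonic(abs_diff, 1, samples))  # True
print(looks_loose_monotonic(total, 1, samples))     # True
print(looks_loose_monotonic(bump, 1, samples))      # False
```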

Page 7: A Unified Approach for Computing Top-k Pairs in Multidimensional Space

Problem Definition

SELECT a.id, b.id
FROM AGENT a, AGENT b
WHERE a.id < b.id
ORDER BY |a.sold - b.sold| - |a.salary - b.salary|
LIMIT k;

• Return k pairs of objects with smallest scores.

SCORE(a,b) = f( s1(a,b), …, sd(a,b) )

si( ) is called a local scoring function and can be any loose monotonic function of the user's choice.
f( ) is called the global scoring function and can be any monotonic function that involves an arbitrary set of attributes.

For the query above:
s1(a,b) = |a.sold - b.sold|
s2(a,b) = -|a.salary - b.salary|
f( ) = s1(a,b) + s2(a,b)
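As a small illustration of how the local and global scoring functions compose, the sketch below mirrors the AGENT example. The `Agent` class and the sample values are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    id: int
    sold: float
    salary: float

def s1(a, b):
    """Local score on the 'sold' attribute."""
    return abs(a.sold - b.sold)

def s2(a, b):
    """Local score on the 'salary' attribute (negated difference)."""
    return -abs(a.salary - b.salary)

def f(local_scores):
    """Global scoring function: summation, which is monotonic."""
    return sum(local_scores)

def score(a, b):
    return f([s1(a, b), s2(a, b)])

# Hypothetical agents.
alice = Agent(id=1, sold=10, salary=50)
bob = Agent(id=2, sold=12, salary=80)
print(score(alice, bob))  # 2 + (-30) = -28
```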

Page 8: A Unified Approach for Computing Top-k Pairs in Multidimensional Space

Problem Definition

SELECT a.id, b.id
FROM AGENT a, AGENT b
WHERE a.id < b.id
ORDER BY |a.sold - b.sold| - |a.salary - b.salary|
LIMIT k;

• Return k pairs of objects with smallest scores among the valid pairs.

Let each object be assigned a color.

Chromatic Queries:
• Homochromatic Queries: pairs containing objects of the same color
• Heterochromatic Queries: pairs containing objects of different colors

Heterochromatic example (color = manager):

SELECT a.id, b.id
FROM AGENT a, AGENT b
WHERE a.id < b.id AND a.manager <> b.manager
ORDER BY |a.sold - b.sold| - |a.salary - b.salary|
LIMIT k;

Homochromatic example (color = manager):

SELECT a.id, b.id
FROM AGENT a, AGENT b
WHERE a.id < b.id AND a.manager = b.manager
ORDER BY |a.sold - b.sold| - |a.salary - b.salary|
LIMIT k;
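The chromatic variants only change which pairs count as valid. A brute-force sketch, with hypothetical rows and with the manager playing the role of the color, might look like this:

```python
import heapq
from itertools import combinations

# Hypothetical rows: (id, manager, sold, salary), listed in id order.
agents = [
    (1, "A", 10, 50),
    (2, "A", 12, 80),
    (3, "B", 11, 52),
    (4, "B", 30, 51),
]

def score(a, b):
    """|a.sold - b.sold| - |a.salary - b.salary|"""
    return abs(a[2] - b[2]) - abs(a[3] - b[3])

def top_k(k, valid):
    """Top-k pairs (smallest score) among the pairs accepted by valid()."""
    pairs = [(a, b) for a, b in combinations(agents, 2) if valid(a, b)]
    best = heapq.nsmallest(k, pairs, key=lambda p: score(*p))
    return [(a[0], b[0]) for a, b in best]

homochromatic = top_k(1, lambda a, b: a[1] == b[1])
heterochromatic = top_k(1, lambda a, b: a[1] != b[1])
print(homochromatic, heterochromatic)
```

Because `combinations` walks the rows in id order, every emitted pair satisfies a.id < b.id, matching the WHERE clause.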

Page 9: A Unified Approach for Computing Top-k Pairs in Multidimensional Space

Contributions

Unified algorithm (internal and external)
• k-closest pairs, k-furthest pairs and variants (any Lp distance)
• queries involving any arbitrary subset of attributes
• chromatic and non-chromatic queries
• skyline pairs queries and rank based top-k pairs queries

No pre-built indexes required
• efficiently builds a simple data structure on-the-fly
• can answer queries involving filtering conditions on objects

SELECT a.id, b.id
FROM AGENT a, AGENT b
WHERE a.id < b.id AND a.age > 40 AND b.age > 40
ORDER BY |a.sold - b.sold| - |a.salary - b.salary|
LIMIT k;

Known memory requirement
• existing R-tree based approaches may require arbitrarily large heaps
• our algorithm requires O(k) space + 2d buffer pages

Efficient
• Theoretically optimal for d ≤ 2
• Experimentally

Page 10: A Unified Approach for Computing Top-k Pairs in Multidimensional Space

Framework

[Figure: d sources, one per local scoring function s1(a,b), s2(a,b), …, sd(a,b), feed a top-k algorithm (e.g., FA, TA, NRA) that combines them with f( s1(a,b), s2(a,b), …, sd(a,b) )]

How to efficiently create and maintain these sources?
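To make the framework concrete, here is a rough sketch of a threshold-style (TA-like) top-k algorithm running over d sorted sources. It is an illustration under assumptions, not the presented implementation: the sources are materialized naively by sorting all pairs, whereas the point of the talk is to produce them incrementally. The function and variable names are hypothetical.

```python
import heapq
from itertools import combinations

def threshold_top_k(objects, local_scores, f, k):
    """Return the k (score, pair) entries with the smallest global score."""
    d = len(local_scores)
    pairs = list(combinations(range(len(objects)), 2))

    def local(i, p):
        return local_scores[i](objects[p[0]], objects[p[1]])

    # One source per local scoring function, in ascending local score.
    # Built naively here (all pairs sorted); an assumption of this sketch.
    sources = [sorted(pairs, key=lambda p, i=i: local(i, p)) for i in range(d)]

    seen = set()
    best = []  # heap of (-score, pair); root is the worst of the kept pairs
    for depth in range(len(pairs)):
        for src in sources:                      # sorted access
            p = src[depth]
            if p not in seen:
                seen.add(p)
                total = f([local(i, p) for i in range(d)])  # random access
                heapq.heappush(best, (-total, p))
                if len(best) > k:
                    heapq.heappop(best)          # drop the current worst
        # No unseen pair can score below f of the last local scores read.
        threshold = f([local(i, src[depth]) for i, src in enumerate(sources)])
        if len(best) == k and -best[0][0] <= threshold:
            break
    return sorted((-neg, p) for neg, p in best)

# Hypothetical (sold, salary) rows.
rows = [(10, 50), (12, 80), (11, 52), (30, 51)]
result = threshold_top_k(
    rows,
    [lambda a, b: abs(a[0] - b[0]), lambda a, b: -abs(a[1] - b[1])],
    sum,
    2,
)
print(result)
```

The early stop relies on f being monotonic: every pair not yet read from any source has local scores at least as large as those at the current depth.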

Page 11: A Unified Approach for Computing Top-k Pairs in Multidimensional Space

Creating/maintaining sources

Naïve approach

• Create all possible pairs: O(N^2)
• Sort them according to their local scores: O(N^2 log N)
• Space requirement: O(N^2)

Features of our approach

• Optimal internal memory algorithm
  • requires O(N) space
  • returns first pair in O(N log N)
  • each next best pair is returned in O(log N)

• Optimal external memory algorithm
  • B = number of elements that can be stored in one disk page
  • M = used internal memory (minimum M = 2B)
  • returns first pair in O((N/B) log_{M/B} (N/B))
  • each next best pair is returned in O(log_{M/B} (N/B))

Page 12: A Unified Approach for Computing Top-k Pairs in Multidimensional Space

Creating/maintaining sources

[Figure: objects o1 to o6 sorted on one attribute, with values 6, 12, 14, 15, 20, 30 and gaps 6, 2, 1, 5, 10 between adjacent values]

Initialize
• sort the objects
• for each object ou
  • create its best pair (ou,ov)
  • insert (ou,ov) in heap

getNextPair()
• report the top pair (ou,ov) of heap
• create next best pair of ou
• enheap the new pair and delete (ou,ov)

Example local scoring function: s(x,y) = |x - y|
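The Initialize and getNextPair() steps can be sketched for s(x,y) = |x - y| as a generator that keeps one heap entry per object. One assumption is made here to avoid reporting a pair twice: each object is only paired with objects to its right in sorted order, so its "next best pair" is simply the next object to the right. The slides do not spell out this detail, and the function name is illustrative.

```python
import heapq
from itertools import islice

def sorted_pairs_abs_diff(values):
    """Yield (score, i, j) over all pairs in ascending |values[i] - values[j]|."""
    # Initialize: sort the objects, then enheap each object's best pair,
    # which is the pair with its right-hand neighbour in sorted order.
    order = sorted(range(len(values)), key=lambda i: values[i])
    heap = []
    for pos in range(len(order) - 1):
        gap = values[order[pos + 1]] - values[order[pos]]
        heapq.heappush(heap, (gap, pos, pos + 1))

    # getNextPair: report the top pair, then enheap the next best pair
    # of the same object (one step further to the right).
    while heap:
        score, pos, nxt = heapq.heappop(heap)
        yield score, order[pos], order[nxt]
        if nxt + 1 < len(order):
            gap = values[order[nxt + 1]] - values[order[pos]]
            heapq.heappush(heap, (gap, pos, nxt + 1))

values = [6, 12, 14, 15, 20, 30]   # o1..o6 from the slide (indices 0..5)
first_three = list(islice(sorted_pairs_abs_diff(values), 3))
print(first_three)  # [(1, 2, 3), (2, 1, 2), (3, 1, 3)]
```

The heap never holds more than one entry per object, which is consistent with the O(N) space, O(N log N) first-pair, and O(log N) next-pair figures quoted for the internal memory algorithm.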

Page 13: A Unified Approach for Computing Top-k Pairs in Multidimensional Space

Homochromatic Queries

[Figure: objects o1 to o6 at values 6, 12, 14, 15, 20, 30]

Page 14: A Unified Approach for Computing Top-k Pairs in Multidimensional Space

Heterochromatic Queries

[Figure: objects o1 to o6 at values 6, 12, 14, 15, 20, 30]

• Let (ou,ov) be the pair
• ox = the object next to ov
• If ou and ox have different colors
  • (ou,ox) is the next best pair
• else
  • oy = the adjacent object of ox
  • (ou,oy) is the next best pair
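One plausible reading of the rule above, sketched for s(x,y) = |x - y|: each position stores a pointer to the nearest object on its right with a different color, so that when ox shares ou's color, the pointer of ox leads directly to a valid partner oy. The pointer array `next_diff`, the right-only pairing, and the colors below are assumptions of this sketch, not details confirmed by the slides.

```python
import heapq

def hetero_pairs_abs_diff(values, colors):
    """Yield (score, i, j) over different-colored pairs, ascending |difference|."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)

    # next_diff[p]: nearest position right of p whose color differs from
    # the color at p (n if there is none). Filled from right to left.
    next_diff = [n] * n
    for p in range(n - 2, -1, -1):
        q = p + 1
        if colors[order[q]] != colors[order[p]]:
            next_diff[p] = q
        else:
            next_diff[p] = next_diff[q]

    def partner(p, start):
        """First position >= start whose color differs from position p's."""
        if start >= n:
            return n
        if colors[order[start]] != colors[order[p]]:
            return start              # (ou, ox) is the next best pair
        return next_diff[start]       # same color: jump to oy via ox's pointer

    heap = []
    for p in range(n):
        q = partner(p, p + 1)
        if q < n:
            heapq.heappush(heap, (values[order[q]] - values[order[p]], p, q))
    while heap:
        score, p, q = heapq.heappop(heap)
        yield score, order[p], order[q]
        q = partner(p, q + 1)
        if q < n:
            heapq.heappush(heap, (values[order[q]] - values[order[p]], p, q))

values = [6, 12, 14, 15, 20, 30]
colors = ["r", "r", "b", "b", "r", "b"]   # hypothetical colors
hetero = list(hetero_pairs_abs_diff(values, colors))
print(hetero[0])  # (2, 1, 2): values 12 and 14 have different colors
```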

Page 15: A Unified Approach for Computing Top-k Pairs in Multidimensional Space

Experiments

K-closest pairs queries [Corral et al., SIGMOD 2000]
• Data size: two datasets, each containing 100K objects
• k: 10

Page 16: A Unified Approach for Computing Top-k Pairs in Multidimensional Space

Experiments

• Naive: join the dataset with itself using a nested loop (block nested loop for the external memory algorithm)

• Scoring function:
  • Local scoring function is either sum or absolute difference (chosen randomly)
  • Global scoring function is a weighted aggregate (weights are chosen randomly and negative weights are allowed)

Page 17: A Unified Approach for Computing Top-k Pairs in Multidimensional Space

Number of Objects

Page 18: A Unified Approach for Computing Top-k Pairs in Multidimensional Space

Number of attributes (d)

Page 19: A Unified Approach for Computing Top-k Pairs in Multidimensional Space

Value of k

Page 20: A Unified Approach for Computing Top-k Pairs in Multidimensional Space

Number of colors

Page 21: A Unified Approach for Computing Top-k Pairs in Multidimensional Space

Thanks

Page 22: A Unified Approach for Computing Top-k Pairs in Multidimensional Space

Complexity

Internal memory algorithm =

External memory algorithm =

d = number of local scoring functions involved
N = total number of objects
V = total number of valid pairs (at most N^2)
M = internal memory used by the algorithm
B = the number of entries one disk page can store