© 2013 IBM FASTER UPPER BOUNDING OF INTERSECTION SIZES IBM Research Tokyo - Japan Daisuke Takuma Hiroki Yanagisawa

Page 1

© 2013 IBM

FASTER UPPER BOUNDING OF

INTERSECTION SIZES

IBM Research Tokyo - Japan

Daisuke Takuma

Hiroki Yanagisawa

Page 2

Outline

• Motivation of upper bounding

• Contributions

  – Data structure

    • Proposal

    • Evaluation (introduce related work here)

  – Application

    • Evaluation

• Conclusion

Page 3

Set intersection

Computation of the common elements in given sets.

[Figure: Venn diagram of two sets A and B over the elements 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 18]

Intersection: A∩B = {2, 5, 7}

Intersection size: |A∩B| = 3

Applications:

• Multi-word search

• JOIN operation

⋮

Recently, the intersection operation has been used in a different way.

Page 4

Some applications compute intersections only to obtain their sizes.

Top-k term query

Input: the hit docs for a search query

Output: the k most frequent terms in the hit docs

[Figure: the document sets of the terms MILES, GEAR, HIGHWAY, DRIVING, and DOOR overlap the hit docs for a search query (e.g., "ENGINE"). The top-k terms (k=3), MILES, DRIVING, and HIGHWAY, are determined by the intersection sizes.]
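As a minimal sketch of the top-k term query described above, the following Python ranks terms by the size of the intersection between each term's document set and the hit docs. The names `top_k_terms`, `postings`, and `hits`, and the toy doc IDs, are illustrative assumptions and do not come from the slides.

```python
import heapq

def top_k_terms(postings, hits, k):
    """Return the k terms whose document sets overlap `hits` the most.

    postings: dict mapping term -> set of doc IDs (hypothetical toy index)
    hits:     set of doc IDs matching the search query
    """
    # The score of a term is the intersection size |P[term] ∩ hits|.
    scored = ((len(docs & hits), term) for term, docs in postings.items())
    # nlargest keeps only the k best (size, term) pairs, largest first.
    return [(term, size) for size, term in heapq.nlargest(k, scored)]

postings = {
    "MILES": {1, 2, 3, 4},
    "GEAR": {5},
    "HIGHWAY": {2, 6, 10},
    "DRIVING": {7, 8},
    "DOOR": {9},
}
hits = {1, 2, 3, 6, 8}  # hypothetical hit docs, e.g. for the query "ENGINE"
result = top_k_terms(postings, hits, 3)
```

Only the sizes of the intersections are used here; the common elements themselves are never needed.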

Page 5

Intersection sizes are evaluated in inequalities:

|A∩B| > threshold, but in most cases |A∩B| ≪ threshold.

In 80% of the intersections in top-k term queries:

|A∩B| < 0.05 × threshold

|A∩B| ≦ (upper bound of |A∩B|) ≦ threshold

If the second inequality holds, we can skip the computation of the exact intersection.

Page 6

Our contributions:

• We proposed data structures called Cardinality Filters (CFs) to quickly compute upper bounds of intersection sizes.

• We accelerated the top-k term query using CFs by a factor of approximately 2.

• CFs are applicable to any algorithm that evaluates

|A∩B| ≧ threshold

by adding a condition on the upper bound:

upperbound(|A∩B|) ≧ threshold AND |A∩B| ≧ threshold

It does not change the logic.
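The guarded condition above can be sketched as follows. This is only an illustration: `passes_threshold` and `trivial_bound` are hypothetical names, and the bound used here (the size of the smaller set) is a stand-in for a CF, chosen because it is obviously a valid upper bound.

```python
def passes_threshold(a, b, threshold, upper_bound):
    """Evaluate |a ∩ b| >= threshold, skipping the exact intersection
    when a cheap upper bound already rules it out."""
    # Cheap test first: if even the upper bound is below the threshold,
    # the exact size cannot reach it.
    if upper_bound(a, b) < threshold:
        return False
    # Otherwise compute the exact size; the overall result is unchanged.
    return len(a & b) >= threshold

def trivial_bound(a, b):
    # Stand-in upper bound (an assumption for this sketch, not a CF):
    # an intersection is never larger than the smaller of the two sets.
    return min(len(a), len(b))

a = {1, 2, 3, 4, 5}
b = {4, 5, 6}
skipped = passes_threshold(a, b, 4, trivial_bound)  # bound 3 < 4, exact step skipped
checked = passes_threshold(a, b, 2, trivial_bound)  # bound 3 >= 2, exact size is 2
```

Because the upper bound is never smaller than the exact size, the added condition can only short-circuit cases that would have failed anyway.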

Page 7

Cardinality Filters

Page 8

Requirements for CFs

Universe: X

Sets A, B ∈ 2^X, with the set intersection A∩B and the set size function |・|

The CF conversion Φ maps sets to cardinality filters (CFs) Φ(A), Φ(B) ∈ Φ(2^X), with the CF intersection Φ(A)∩Φ(B) and the CF size function |・|

Requirement: |A∩B| ≦ |Φ(A)∩Φ(B)|

Page 9

SCF: Single Cardinality Filter

Φ(A) = (h(A), c(A)) where

h(A): hash values, c(A): non-smallest values in collisions

Example:

A = {7, 8, 10, 12, 14}: h(A) = {0, 2, 4}, c(A) = {10, 14}

B = {0, 2, 3, 5, 7, 10, 11, 14}: h(B) = {0, 1, 2, 4}, c(B) = {7, 10, 11, 14}

|Φ(A)∩Φ(B)| = |h(A)∩h(B)| + |c(A)∩c(B)| = 3 + 2 = 5

|h(A)∩h(B)| is computed by bit-wise AND; c(A) and c(B) are small.

Page 10

Formal definition of SCF

Let X be a universe set (A, B ⊂ X), let H be a hash value space H = {0, 1, …, [|X|/N] − 1}, and let h: X → H be a hash function.

We define the SCF Φ: 2^X → (2^H, 2^X) by

Φ(A) = (h(A), c(A)) (A ⊂ X)

where

h(A) = { h(x) | x ∈ A }

c(A) = { y ∈ A | (∃x ∈ A) (h(x) = h(y), x < y) }

Thm:

|A∩B| ≦ |Φ(A)∩Φ(B)| := |h(A)∩h(B)| + |c(A)∩c(B)|

• Directly used in the proof.

• The proof has no conditional branches.
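The definition above can be sketched in Python as follows, under stated assumptions: h(A) is stored as an integer bitmask so that h(A)∩h(B) becomes a single bit-wise AND, and the hash `h(x) = x mod 5` is an arbitrary illustrative choice (the slides do not specify h, so the filter contents differ from the example on the previous slide). The function names are hypothetical.

```python
N_HASH = 5  # size of the hash value space H; arbitrary for this sketch

def h(x):
    # Illustrative hash function; the slides do not fix a concrete h.
    return x % N_HASH

def build_scf(s):
    """Return (mask, c): bit v of mask is set iff v is in h(s);
    c holds the non-smallest elements of each hash collision."""
    mask, c = 0, set()
    for x in sorted(s):      # ascending, so the first element per hash value is the smallest
        bit = 1 << h(x)
        if mask & bit:
            c.add(x)         # a smaller element already has this hash value
        else:
            mask |= bit
    return mask, c

def scf_upper_bound(fa, fb):
    """|h(A) ∩ h(B)| + |c(A) ∩ c(B)|: popcount of the ANDed masks plus a small set intersection."""
    (ha, ca), (hb, cb) = fa, fb
    return bin(ha & hb).count("1") + len(ca & cb)

A = {7, 8, 10, 12, 14}
B = {0, 2, 3, 5, 7, 10, 11, 14}
bound = scf_upper_bound(build_scf(A), build_scf(B))
exact = len(A & B)
```

With this particular hash, `bound` comes out at least as large as `exact`, as the theorem requires; a different h changes the bound but not that property.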

Page 11

RCF: Recursive Cardinality Filter

The RCF recursively applies the SCF to c(A).

Example with L = 3 levels:

Level 1: H1(A) = h1(A), C1(A) = c1(A)

Level 2: H2(A) = h2(c1(A)), C2(A) = c2(c1(A))

Level 3: H3(A) = h3(c2(c1(A))), C3(A) = c3(c2(c1(A)))

|Φ(A)∩Φ(B)| = Σ_{i=1..L} |Hi(A)∩Hi(B)| + |CL(A)∩CL(B)|
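A minimal sketch of the recursion, assuming three illustrative hash functions (the moduli 5, 3, and 2 are arbitrary choices, not taken from the slides) and hypothetical function names. Each level applies the SCF step to the collision set left over by the previous level.

```python
def build_rcf(s, hashes):
    """Return ([H1, ..., HL] as bitmasks, CL as a set)."""
    masks = []
    rest = set(s)
    for h in hashes:
        mask, collisions = 0, set()
        for x in sorted(rest):       # ascending: the smallest element per hash value comes first
            bit = 1 << h(x)
            if mask & bit:
                collisions.add(x)    # non-smallest element of a collision
            else:
                mask |= bit
        masks.append(mask)
        rest = collisions            # the next level filters only the collisions
    return masks, rest

def rcf_upper_bound(fa, fb):
    """Sum over levels of |Hi(A) ∩ Hi(B)|, plus |CL(A) ∩ CL(B)|."""
    (ma, ca), (mb, cb) = fa, fb
    return sum(bin(x & y).count("1") for x, y in zip(ma, mb)) + len(ca & cb)

# Three illustrative hash functions, one per level (L = 3).
hashes = [lambda x: x % 5, lambda x: x % 3, lambda x: x % 2]

A = {7, 8, 10, 12, 14}
B = {0, 2, 3, 5, 7, 10, 11, 14}
bound = rcf_upper_bound(build_rcf(A, hashes), build_rcf(B, hashes))
exact = len(A & B)
```

Both filters must be built with the same sequence of hash functions, otherwise the level-wise AND is meaningless.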

Page 12

We compared CFs with the following algorithms:

• Exact intersections

– Linear merge: comparison between sorted lists.

– Binary search: search for each x ∈ A in B by binary search (|A| < |B|).

– DK algorithm (VLDB 2011):

    • Intersect h(A) and h(B) by bit-wise AND (h: hash).

    • For all x ∈ h(A)∩h(B) (the candidates), intersect h⁻¹(x)∩A and h⁻¹(x)∩B.

• Naïve upper bounding

– Bloom filter: test each x ∈ A against the Bloom filter of B, h(B).
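For orientation, three of the compared approaches can be sketched as below. These are simplified illustrations, not the implementations that were benchmarked: the function names, the SHA-256 based Bloom hashing, and the parameters `m` (bits) and `k` (hash functions) are all assumptions made for this sketch.

```python
import hashlib
from bisect import bisect_left

def linear_merge_size(a, b):
    """|a ∩ b| for two sorted lists by a simultaneous scan."""
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count

def binary_search_size(a, b):
    """|a ∩ b| by looking up each x of the smaller list a in the sorted list b."""
    count = 0
    for x in a:
        pos = bisect_left(b, x)
        if pos < len(b) and b[pos] == x:
            count += 1
    return count

def bloom_positions(x, m, k):
    # k bit positions derived from one SHA-256 digest (arbitrary choice for this sketch).
    digest = hashlib.sha256(str(x).encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % m for i in range(k)]

def bloom_build(s, m, k):
    bits = 0
    for x in s:
        for p in bloom_positions(x, m, k):
            bits |= 1 << p
    return bits

def bloom_upper_bound(a, bloom_b, m, k):
    """Count the x in a that the Bloom filter of b accepts.
    A Bloom filter has no false negatives, so the count is >= |a ∩ b|."""
    return sum(all((bloom_b >> p) & 1 for p in bloom_positions(x, m, k)) for x in a)

A = [7, 8, 10, 12, 14]
B = [0, 2, 3, 5, 7, 10, 11, 14]
exact_lm = linear_merge_size(A, B)
exact_bs = binary_search_size(A, B)
bf_bound = bloom_upper_bound(A, bloom_build(B, 64, 3), 64, 3)
```

Note that the Bloom filter bound still tests every element of A one by one, whereas the CF bound relies mainly on word-wide bit-wise AND.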

Page 13

Performance studies:

[Figure: six bar charts showing time [microsec] (left axis) and approximation ratio (right axis) for LM, BS (traditional), DK1H, DK2H (DK), BF (naïve), and SCF, RCF (ours).]

Size:

(A) Large (1M x 1M) - no correlation

(B) Middle (100K x 100K) - no correlation

(C) Small (10K x 10K) - no correlation

Asymmetry:

(D) Asymmetric (1M x 10K) - no correlation

Correlation:

(E) Middle (100K x 100K) - positive correlation

(F) Middle (100K x 100K) - negative correlation

Page 14

Time complexity and performance breakdown

[Figure: horizontal bars of time [microsec] (0 to 1000) for the algorithms DK1H, DK2H, SCF, and RCF, each split into bit-wise AND and other computations.]

DK: O( (|A|+|B|)/√w + |A∩B| )

RCF: O( (|A|+|B|)/w )

(With a practical setting of h; w is the machine word size.)

The |A∩B| term is the barrier for all the exact algorithms.

CFs maximize the utilization of bit-wise AND.

This is the benefit of considering the upper bounds.
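To get a feel for the two leading terms, the short calculation below plugs in |A| = |B| = 1M and a word size of w = 64. The word size and the helper name `leading_terms` are assumptions for this sketch, and O-notation hides constant factors, so the numbers only indicate the ratio between the terms, not running times.

```python
import math

def leading_terms(size_a, size_b, w=64):
    """Return ((|A|+|B|)/sqrt(w), (|A|+|B|)/w), ignoring constants."""
    n = size_a + size_b
    return n / math.sqrt(w), n / w

dk_term, rcf_term = leading_terms(1_000_000, 1_000_000)
ratio = dk_term / rcf_term  # equals sqrt(w)
```

The DK bound additionally contains the |A∩B| term, which this calculation leaves out and which no exact algorithm can avoid.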

Page 15

Improvement of the top-k term query

Page 16

S: the doc IDs for the search query

P[i]: the doc IDs for term i (i = 0, 1, …)

[Figure: the index holds the term CFs, each built with one of the hash functions h1, h2, h5; the query CFs (candidate CFs) for S are built with the same hash functions h1, h2, h5.]

The intersection sizes of CFs are evaluated before evaluation of the exact intersection sizes.

Intersect the CFs using the same hash functions.

The compression ratio for the n-th term:

Safe factor × |P[k-1]| / |P[n]|
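The sketch below shows, under simplifying assumptions, how a CF check can sit in front of the exact intersection in a top-k loop. It uses a single fixed hash `x mod 64` for every term instead of a per-term compression ratio, takes the current k-th best size as the threshold, and all names and toy data are hypothetical.

```python
import heapq

N_HASH = 64  # hash space size; arbitrary for this sketch

def build_scf(s):
    mask, c = 0, set()
    for x in sorted(s):
        bit = 1 << (x % N_HASH)   # illustrative hash h(x) = x mod N_HASH
        if mask & bit:
            c.add(x)              # non-smallest element of a collision
        else:
            mask |= bit
    return mask, c

def cf_bound(fa, fb):
    return bin(fa[0] & fb[0]).count("1") + len(fa[1] & fb[1])

def top_k_terms(postings, term_cfs, hits, k):
    query_cf = build_scf(hits)    # built once per query with the same hash
    heap = []                     # min-heap of (size, term); heap[0] is the k-th best
    skipped = 0
    for term, docs in postings.items():
        # Threshold test on the upper bound: if it cannot beat the current
        # k-th best size, the exact intersection is not computed at all.
        if len(heap) == k and cf_bound(term_cfs[term], query_cf) <= heap[0][0]:
            skipped += 1
            continue
        size = len(docs & hits)
        if len(heap) < k:
            heapq.heappush(heap, (size, term))
        elif size > heap[0][0]:
            heapq.heapreplace(heap, (size, term))
    return sorted(heap, reverse=True), skipped

postings = {
    "MILES": {1, 2, 3, 4},
    "HIGHWAY": {2, 6, 10},
    "DRIVING": {7, 8},
    "GEAR": {5},
    "DOOR": {9},
}
term_cfs = {term: build_scf(docs) for term, docs in postings.items()}  # precomputed index
hits = {1, 2, 3, 6, 8}
result, skipped = top_k_terms(postings, term_cfs, hits, 2)
```

In this toy run the first two terms fill the heap, and the remaining three are rejected by the CF bound alone.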

Page 17

Improvement of top-k term queries

[Figure: computation time [millisec] (0 to 800) versus the number of hit documents (100 to 100000) for LM, LM+BS, DK1H, SCF, and RCF; SCF and RCF are ours.]

Page 18

More than 80% of the exact intersections were skipped.

[Figure: skip ratio (0.5 to 1) versus the number of hit documents (100 to 100000) for SCF and RCF.]

Page 19

Summary

• We proposed Cardinality Filters (SCF, RCF) to quickly compute upper bounds of intersection sizes.

– Property: |A∩B| ≦ |Φ(A)∩Φ(B)|

– Performance improvement over the exact algorithms:

DK: O( (|A|+|B|)/√w + |A∩B| )

RCF: O( (|A|+|B|)/w )

• We accelerated the top-k term query using the CFs by a factor of approximately 2.

Page 20

Future work

• Using other algebraic properties of SCF and RCF:

– Monotonicity: A⊂B ⇒ |Φ(A)| ≦ |Φ(B)|

– Commutative and associative laws of Φ(A)∩Φ(B)

• Implementation of CFs that support a union-like operation

• Non-parametric CFs (compression ratio of h)

Page 21

Faster Upper Bounding of Intersection Sizes

Thank you !!