faster upper bounding of intersection sizes2500 5000 7500 10000 12500 15000 lm bs dk1h dk2h bf scf...
Post on 10-Nov-2020
1 Views
Preview:
TRANSCRIPT
© 2013 IBM
FASTER UPPER BOUNDING OF
INTERSECTION SIZES
IBM Research Tokyo - Japan
Daisuke Takuma
Hiroki Yanagisawa
Outline
• Motivation of upper bounding
• Contributions
– Data structure
• Proposal
• Evaluation (introduce related work here)
– Application
• Evaluation
• Conclusion
Set intersection
2
7
5
10
3
8
1
9
18
6
11
Intersection: A∩B = {2, 5, 7}
Intersection size: |A∩B| = 3
AB
Applications:
•Multi-word search
•JOIN operation
:
Computation of the common elements in given sets:
Recently the intersection
operation is used in a
different way.
Some apps compute intersections only for their sizes.
Top-k term query
Input: the hit docs for a search query
Output: the k most frequent terms in the hit docs
MILES
DRIVING
HIGHWAY
Document sets Top-k terms (k=3)
MILES
GEAR
Hit docs
for a search query
Ex. “ENGINE”HIGHWAY
DRIVINGDOOR
Determined by the
intersection sizes
Intersection sizes are evaluated in inequalities:
|A∩B| > threshold, but mostly |A∩B| ≪ threshold.
In 80% intersections in top-k term queries:
|A∩B| < 0.05 × threshold
|A∩B| ≦ threshold
Upper bound of |A∩B|
≦
≦ � If this holds, we can
skip the computation of
the exact intersection
Our contributions:
• We proposed data structures Cardinality Filters (CF) to
quickly compute upper bounds of intersection sizes.
• We accelerated the top-k term query using CFs
approximately by factor of 2.
•� CFs are applicable to all the algorithm evaluating
|A∩B| ≧ threshold
by adding the condition for the upper bound.
upperbound(|A∩B|) ≧ threshold AND
|A∩B| ≧ thresholdIt does not change the logic.
Cardinality Filters
CF size function: |・|
|Φ(A) ∩ Φ(B)||A∩B|
X
Sets A, B ∈ 2X
A B A∩B (set intersection)
Set size function: |・|
Requirements for CFs
≦
Cardinality filters (CFs) Φ(A), Φ(B) ∈ Φ(2X)
Φ(A)
Φ(B)
CF conversion Φ
Φ(A)∩ Φ(B) (CF intersection)
SCF: Single Cardinality Filter
c(A): 10 14
c(B): 7 10 11 14
A
h(A)
h(B)
B
7 8 10 12 14
0 2 3 5 7 10 11 14
Φ(A)
Φ(B)
0 2 4
0 1 2 4
Φ(A) = (h(A), c(A)) where
h(A) hash values, c(A): non-smallest values in collisions
|h(A)∩h(B)|=3 + |c(A)∩c(B)|=2|Φ(A)∩Φ(B)| =
Bit-wise AND Small
Formal definition of SCF
Let X be a universe set (A, B ⊂ X),
and H be a hash value space H={ 0, 1, N, [|X|/N] - 1},
we define SCF Φ: 2X � (2H, 2X) by
Φ(A) = (h(A), c(A)) (A ⊂ X)
where
h(A) = { h(x) | x ∈ A }
c(A) = { y∈A | (∃x∈A) (h(x) = h(y), x < y) }
Thm:
|A∩B| ≦ |Φ(A)∩Φ(B)| := |h(A)∩h(B)| + |c(A)∩c(B)|
•Directly used in the proof.
•The proof has no conditional branches.
RCF: Recursive Cardinality Filter
The RCF recursively applies the SCF to c(A).
h1(A) c1(A)
h2(c1(A)) c2(c1(A))
A
h3(c2(c1(A))) c3(c2(c1(A)))
H1(A), C1(A)
H2(A), C2(A)
H3(A), C3(A)
|Φ(A)∩Φ(B)|=Σ1~L |Hi(A)∩Hi(B)| + |CL(A)∩CL(B)|
L = 3
We compared CFs with the following algorithms:
• Exact intersections
– Linear merge: comparison between sorted lists.
– Binary search: binary search x∈A in B (|A| < |B|).
– DK algorithm (VLDB 2011):
• Intersect h(A) and h(B) by bit-wise AND (h:hash)
• For all x ∈ h (A)∩h(B),
intersect h-1(x)∩A and h-1(x)∩B.
• Naïve upper bounding
– Bloom filter: Test x∈A in the Bloom filter of B: h(B).
Candidates
(A) Large (1M x 1M) - no correlation
0
2500
5000
7500
10000
12500
15000
LM BS
DK
1H
DK
2H
BF
SC
F
RC
F
Time[microsec]
0
0.5
1
1.5
2
2.5
3
Approx. ratio
TimeApproximation ratio
(B) Middle (100K x 100K) - no correation
0
250
500
750
1000
1250
1500
LM BS
DK1H
DK2H
BF
SC
F
RC
F
0
2
4
6
8
10
12(C) Small (10K x 10K) - no correlation
0
25
50
75
100
125
150
LM BS
DK1H
DK2H BF
SC
F
RC
F
0
5
10
15
20
25
30
(D) Asymmetric (1M x 10K) - no correlation
0
250
500
750
1000
1250
1500
LM BS
DK
1H
DK
2H
BF
SC
F
RC
F
0
1
2
3
4
5
6(E) Middle (100K x 100K) - positive correlation
0500
10001500200025003000
LM BS
DK
1H
DK
2H
BF
SC
F
RC
F00.511.522.53
(F) Middle (100K x 100K) - negative correlation
0
500
1000
1500
2000
2500
3000
LM
BS
DK1H
DK2H
BF
SC
F
RC
F
0
20
40
60
80
100
120
Performance studies:
Size
Asymmetry Correlation
traditional DK naïve ours
Time complexity and performance break down
0 500 1000
RCF
SCF
DK2H
DK1H
Alg
orithm
s
Time [microsec]
Bit-wise AND
Other computations
DK: O( (|A|+|B|)/√w + |A∩B| )
RCF: O( (|A|+|B|)/w )
(With a practical setting of h)
|A∩B| is the barrier for all
the exact algorithms.
CFs maximize the utilization of bit-wise AND.
The benefit of considering
the upper bounds.
Improvement of the top-k term query
S: the doc IDs for
the search query
P[i] The doc IDs for term i (i=0, 1, N)Term CFs
h1
h1
h2
h2
h2
h5
h5
h1
h2
h5
Query CF
(Candidate CFs)
Index
The intersection sizes of CFs are evaluated before
evaluation of the exact intersection sizes.
Intersect the CFs using the same hash functions.
The compression ratio
for the n-th term:
Safe factor ×|P[k-1]| / |P[n]|
Improvement of top-k term queries
0
200
400
600
800
100 1000 10000 100000
The number of hit documents
Com
puta
tion t
ime
[mill
isec]
LM LM+BS DK1H SCF RCF
Ours
More than 80% of the exact intersections were skipped.
0.5
0.6
0.7
0.8
0.9
1
100
200
500
1000
2000
5000
10000
20000
50000
100000
The number of hit documents
Ski
p ra
tio
SCF RCF
Summary
• We proposed Cardinality Filters (SCF, RCF) to quickly
compute upper bounds of intersection sizes.
– Property: |A∩B| ≦ |Φ(A)∩Φ(B)|
– Performance improvement from the exact algorithms.
DK: O( (|A|+|B|)/√w + |A∩B| )
RCF: O( (|A|+|B|)/w )
• We accelerated the top-k term query using the CFs by
factor of 2.
Future work
• Using other algebraic properties of SCF and RCF:
– Monotonicity: A⊂B ⇒ |Φ(A)| ≦ |Φ(B)|
– Commutative and associative laws of Φ(A)∩Φ(B)
• Implementation of CFs having Union-like operation
• Non-parametric CFs (compression ratio of h)
Faster Upper Bounding of Intersection Sizes
Thank you !!
top related