Anchor Points Algorithms for Hamming and Edit Distance
Foto Afrati, National Technical University of Athens
Anish Das Sarma, ClearList Inc.
Anand Rajaraman, Cambrian Ventures
Pokey Rule, Stanford University
Semih Salihoglu, Stanford University
Jeff Ullman, Stanford University

Fuzzy Joins
Input: a set of records R
Output: all pairs <rec_i, rec_j> such that dist(rec_i, rec_j) ≤ d
Example input: rec1, rec2, …, recm
Example output: <rec1, rec5>, <rec7, rec9>, …, <rec3, reck>
Example applications: entity resolution, clustering, collaborative filtering
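
As a baseline for the definition above, a fuzzy join can be sketched by comparing every pair of records. This is only an illustration: the function names and the choice of Hamming distance as `dist` are assumptions, not something the slides prescribe.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def naive_fuzzy_join(records, dist, d):
    """Compare every pair of records and keep those within distance d."""
    return [(r, s) for r, s in combinations(records, 2) if dist(r, s) <= d]


records = ["00000", "00001", "10011", "10010"]
pairs = naive_fuzzy_join(records, hamming, d=1)
print(pairs)  # [('00000', '00001'), ('10011', '10010')]
```

The quadratic number of comparisons is what the MapReduce algorithms on the following slides try to distribute.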

Two Specific Distance Measures
1. Hamming Distance. Input: bit strings R of length n
   Example input: 00000, 00001, …, 10011
   Example output: <00000, 00001>, …, <10011, 10010>
2. Edit Distance. Input: strings R of length n over alphabet A
   Example input: abcd, eabc, …, dddd
   Example output: <abcd, eabc>, …, <dddd, dadd>
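
A minimal sketch of the second measure, assuming that edit distance counts insertions and deletions only (which matches the insertion-1 and deletion-1 codes that appear later in the talk). Under that assumption the distance equals len(a) + len(b) − 2·LCS(a, b); the function name is illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Insert/delete edit distance, computed via the longest common subsequence."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the DP table
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return len(a) + len(b) - 2 * prev[-1]


print(edit_distance("abcd", "eabc"))  # 2: delete 'd', insert 'e'
print(edit_distance("dddd", "dadd"))  # 2: delete one 'd', insert 'a'
```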

Fuzzy Joins In One-Round MapReduce
Input records: rec1, rec2, rec3, …, recm-1, recm
Map: sends each record to one or more reducers, producing (key, values) groups:

| key | values |
| --- | --- |
| reducer1 | rec1, rec5, rec7 |
| reducer2 | rec2, rec7, recm |
| … | … |
| reducerp | rec2, recm |

Reduce: each reducer compares the records it received and outputs the matching pairs.
Per-reducer memory cost: the number of records sent to a single reducer.
Communication cost: the total number of records sent to all reducers.
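
The two cost measures can be illustrated with a tiny in-memory simulation of the map phase. Everything here is a sketch under assumptions: `one_round` and both mappers are hypothetical, and the second mapper is only meant to show the trade-off, not to be a correct fuzzy-join strategy.

```python
from collections import defaultdict
from itertools import product


def one_round(records, mapper):
    """Route each record to the reducer keys returned by mapper(record)."""
    reducers = defaultdict(list)
    for rec in records:
        for key in mapper(rec):
            reducers[key].append(rec)
    communication = sum(len(v) for v in reducers.values())
    per_reducer_memory = max(len(v) for v in reducers.values())
    return communication, per_reducer_memory


records = ["".join(bits) for bits in product("01", repeat=3)]

# One reducer receives everything: minimal communication, maximal memory.
single = one_round(records, lambda rec: ["all"])

# Each record goes to two reducers: more communication, less memory each.
split = one_round(records, lambda rec: [("first", rec[0]), ("last", rec[-1])])

print(single, split)  # (8, 8) (16, 4)
```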

Communication Cost vs Per-reducer Memory
[Figure: communication plotted against per-reducer memory for the algorithms Grouping (naïve), Ball-Hashing, Splitting, and Anchor Points. The input is taken to be all bit strings of length n, so |R| = 2^n.]

Outline
1. Anchor Points Algorithm
   • Covering Code
2. Explicit Construction of Hamming Distance Covering Codes
3. Explicit Construction of Edit Distance Covering Codes

Covering Code
Given a set of strings R of length n and a radius k.
Definition: C is an <n, k> covering code if for each s ∈ R there is a c ∈ C such that dist(c, s) ≤ k.
Notation:
   n: length of strings
   d: distance of pairs (the join threshold)
   k: radius of code
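
The definition can be checked by brute force for small n. The sketch below assumes Hamming distance and takes R to be all bit strings of length n; `is_covering_code` is an illustrative name.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_covering_code(code, n, k):
    """True if every n-bit string is within Hamming distance k of a code word."""
    return all(
        any(hamming(s, c) <= k for c in code)
        for s in map("".join, product("01", repeat=n))
    )


print(is_covering_code({"00000", "11111"}, n=5, k=2))  # True
print(is_covering_code({"00000", "11111"}, n=5, k=1))  # False
```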

Example Covering Code
Example: Hamming distance, n = 5, k = 2. R is the set of all bit strings of length 5, arranged here by number of 1s:
   11111
   01111 … 11101 11110
   00111 … 10011 … 11100
   00011 00101 … 10001 11000
   00001 … 01000 10000
   00000
The two code words 00000 and 11111 cover R: every string with at most two 1s is within distance 2 of 00000, and every string with at least three 1s is within distance 2 of 11111.

Anchor Points Algorithm (1)
Let C be an <n, k> covering code (e.g., n = 5, k = 2).
Use one reducer for each code word.
Map each string s to all code words at distance ≤ k + d/2 (e.g., d = 2 gives 2 + 2/2 = 3).
Example: Map sends the inputs 00000, 01000, 01011, 01100, …, 11110, 11111 to the reducers r00000 and r11111, which perform Reduce.
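
The steps above can be sketched in a few lines for Hamming distance. This is an illustration, not the authors' implementation: the function name is hypothetical, the radius is rounded up to k + ⌈d/2⌉ as an assumption for odd d, and the reducer simply compares all pairs it receives and de-duplicates the output.

```python
from collections import defaultdict
from itertools import combinations, product


def hamming(a: str, b: str) -> int:
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def anchor_points_join(records, code, k, d):
    radius = k + (d + 1) // 2  # k + d/2, rounded up when d is odd (assumption)
    reducers = defaultdict(list)
    for s in records:  # map phase: one reducer per code word
        for c in code:
            if hamming(s, c) <= radius:
                reducers[c].append(s)
    pairs = set()
    for group in reducers.values():  # reduce phase: compare within each reducer
        for u, w in combinations(group, 2):
            if hamming(u, w) <= d:
                pairs.add((min(u, w), max(u, w)))
    return pairs, reducers


records = ["".join(bits) for bits in product("01", repeat=5)]
pairs, reducers = anchor_points_join(records, ["00000", "11111"], k=2, d=2)
print(len(pairs), [len(g) for g in reducers.values()])  # 240 [26, 26]
```

With all 32 strings of length 5 as input, each reducer receives 26 strings, and the 240 output pairs match what an all-pairs comparison would find.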

Anchor Points Algorithm (2)
Correctness follows from the triangle inequality. Let u and w be two strings with dist(u, w) ≤ d.
   There is a string v between them with dist(u, v) ≤ d/2 and dist(v, w) ≤ d/2.
   There is a code word c with dist(c, v) ≤ k.
   Hence dist(c, u) ≤ k + d/2 and dist(c, w) ≤ k + d/2, so u and w both reach the reducer for c.

Cost of Anchor Points Algorithm
B(n, r): size of the ball of radius r
Per-reducer memory: B(n, k + d/2)
Communication: |C| · B(n, k + d/2)
[Figure: the reducer for code word c receives every string within distance k + d/2 of c, e.g. s1, s4, s5, s6, s7, s9, s11, s17.]
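
For Hamming distance the ball size is a sum of binomial coefficients, so the two costs are easy to evaluate. A small sketch, assuming d is even and R contains all 2^n bit strings; the function names are illustrative.

```python
from math import comb


def ball_size(n: int, r: int) -> int:
    """B(n, r): number of n-bit strings within Hamming distance r of a fixed string."""
    return sum(comb(n, i) for i in range(min(r, n) + 1))


def anchor_points_cost(n, k, d, code_size):
    """Return (per-reducer memory, communication) for an <n, k> code of the given size."""
    per_reducer = ball_size(n, k + d // 2)
    return per_reducer, code_size * per_reducer


print(anchor_points_cost(n=5, k=2, d=2, code_size=2))  # (26, 52)
```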

Communication Cost vs Per-reducer Memory
[Figure: the same plot of communication against per-reducer memory for Grouping (naïve), Ball-Hashing, and Splitting, now with Anchor Points shown for k = 0, k = 1, k = 2, and k = n.]

Some Known Hamming Distance Codes

| k | n | Code size | Note |
| --- | --- | --- | --- |
| 0 | any | 2^n | |
| n | any | 1 | |
| 1 | n = 2^r − 1 | 2^n/(n+1) | Hamming codes |

Perfect <n, k> code (i.e., smallest possible): size 2^n/B(n, k)
For any k: existence of codes of size n·2^n/B(n, k) => not perfect
Problem: no explicit construction
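
The Hamming-code row can be checked by brute force for r = 3 (n = 7). The sketch assumes the usual syndrome description of the Hamming code: a word is a code word when the XOR of the 1-based positions of its 1 bits is zero.

```python
from itertools import product


def syndrome(word):
    """XOR of the 1-based positions that hold a 1 bit."""
    s = 0
    for position, bit in enumerate(word, start=1):
        if bit:
            s ^= position
    return s


n = 7
words = list(product((0, 1), repeat=n))
code = [w for w in words if syndrome(w) == 0]


def distance_to_code(word):
    """Hamming distance from word to the nearest code word."""
    return min(sum(a != b for a, b in zip(word, c)) for c in code)


print(len(code), 2**n // (n + 1))               # 16 16
print(max(distance_to_code(w) for w in words))  # 1
```

Every 7-bit word is within distance 1 of one of the 16 code words, and 16 · B(7, 1) = 16 · 8 = 2^7, which is what makes the code perfect.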

Cross Product Method (Explicit HD <n, k> Codes)
Start with an <n/t, k/t> code D.
Let C = D × D × … × D (t times).
Claim: C is an <n, k> covering code.
Proof: split s into t blocks of length n/t, s = s1 s2 s3 … st. For each block si pick di ∈ D with dist(si, di) ≤ k/t, and let c = d1 d2 d3 … dt. Then dist(s, c) ≤ t · (k/t) = k.
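
The construction and the block-by-block argument in the proof can be sketched as follows, assuming Hamming distance and the <5, 2> code D = {00000, 11111}. The helper names are illustrative.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def cross_product_code(base_code, t):
    """C = D x D x ... x D (t times), as concatenated strings."""
    return ["".join(parts) for parts in product(base_code, repeat=t)]


def covering_code_word(s, base_code, t):
    """Cover s block by block: pick the closest base code word for each block."""
    m = len(s) // t
    blocks = [s[i * m:(i + 1) * m] for i in range(t)]
    return "".join(min(base_code, key=lambda c: hamming(b, c)) for b in blocks)


D = ["00000", "11111"]
C = cross_product_code(D, t=2)
c = covering_code_word("1100011100", D, t=2)
print(C)
print(c, hamming("1100011100", c))  # 0000011111 4
```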

Example of Cross Product Method
n = 10, k = 4, t = 2 => use a <5, 2> code D, D = {00000, 11111}
C = {00000--00000, 00000--11111, 11111--00000, 11111--11111}
s = 1100011100 = 11000--11100 is covered by 00000--11111 at distance ≤ 2 + 2 = 4
s = 1110000001 = 11100--00001 is covered by 11111--00000 at distance ≤ 2 + 1 = 3

Size of Cross Product Codes: |C| = |D|^t
Assume D is perfect (e.g., Hamming code): |C| = (2^(n/t) / B(n/t, k/t))^t = 2^n / B(n/t, k/t)^t
Perfect <n, k> code: 2^n / B(n, k)
For large n, small t => same asymptotic size
Example: k = 2, t = 2: 2^n / (n/2 + 1)^2 vs 2^n / B(n, 2)
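
A quick numeric comparison for k = 2, t = 2, assuming D is a Hamming code on blocks of length n/2 = 2^r − 1 (so |D| = 2^(n/2) / (n/2 + 1)). The function names are illustrative; the ratio of the cross product size to the perfect-code bound 2^n / B(n, 2) stays below 2.

```python
from math import comb


def ball_size(n: int, r: int) -> int:
    """B(n, r) for Hamming distance."""
    return sum(comb(n, i) for i in range(r + 1))


def size_ratio(r: int) -> float:
    """|D|^2 divided by the perfect-code bound 2^n / B(n, 2)."""
    m = 2**r - 1                     # block length n/2, where a Hamming code exists
    n = 2 * m
    cross = (2**m // (m + 1)) ** 2   # |D|^t with |D| = 2^m / (m + 1)
    return cross * ball_size(n, 2) / 2**n


for r in (3, 5, 8):
    print(r, round(size_ratio(r), 3))
```

For r = 3 (n = 14) the cross product code has 256 code words against a bound of about 154.6, a ratio of 1.656; the ratio grows toward 2 as n increases, so the two sizes agree up to a constant factor.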

Edit Distance Fuzzy Joins
Input: strings of length n over alphabet A (i.e., |A|^n strings), e.g. abcd, eabc, cadb, …, dadd, dddd
Output: <abcd, eabc>, …, <dddd, dadd>
The covering codes algorithm works in the same way:
If C is an <n, k> edit distance code, send s to all code words at distance ≤ k + d/2.

Differences with Hamming Distance
1. Length of code words might be different.
   E.g. 1 insertion, |c| = n+1 => insertion-1 code
   E.g. 1 deletion, |c| = n-1 => deletion-1 code
2. Different code words might have different ball sizes.
3. No known perfect codes or explicit construction.
[Figure for point 2: deleting one symbol from ababa…a (length n+1) yields many strings of length n (aaba…a, abba…a, abaa…a, baba…a, …), while deleting one symbol from aaaaa…a (length n+1) yields only aaaa…a (length n).]
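
The difference in ball sizes is easy to see on short strings. A minimal sketch, where `deletion_ball` is an illustrative name for the set of strings reachable by exactly one deletion:

```python
def deletion_ball(word: str) -> set:
    """All distinct strings obtained by deleting exactly one symbol from word."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}


print(sorted(deletion_ball("ababa")))  # ['aaba', 'abaa', 'abab', 'abba', 'baba']
print(sorted(deletion_ball("aaaaa")))  # ['aaaa']
```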

Insertion-1 Codes
Let n = 5, |A| = a = 4; code words are of length n + 1 = 6.
Letters as integers from 0 to a − 1: e.g. 0230, 1123, …
Let s_i be the i-th digit of s.
1. sum(s) = Σ_i s_i × (|s| − i + 1), i.e., each digit is weighted by its position counted from the right
2. score(s) = sum(s) % ((n+1)(a−1)) (e.g., the modulus is 6 × 3 = 18)
3. R = any a − 1 consecutive residues, e.g. {0, 1, 2}, {12, 13, 14}, {16, 17, 0}
C = all strings of length n + 1 whose score is in R, e.g. for R = {12, 13, 14}: C = {003000, 303000, 003001, 003002, 200000, …}
|C| ≈ a^(n+1)/(n+1)
**factor a worse than best possible**
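
The construction can be verified by brute force for n = 5, a = 4. The sketch below assumes the digit weighting that reproduces the slide's examples (the rightmost digit has weight 1, so sum(23010) = 24) and uses R = {12, 13, 14}; the function names are illustrative.

```python
from itertools import product


def weighted_sum(s: str) -> int:
    """Sum of digits, each weighted by its position counted from the right."""
    return sum(int(ch) * (len(s) - i) for i, ch in enumerate(s))


def in_code(word: str, n: int, a: int, residues) -> bool:
    """A word of length n + 1 is a code word if its score falls in the residue set."""
    return weighted_sum(word) % ((n + 1) * (a - 1)) in residues


def insertions(s: str, a: int) -> set:
    """All strings obtained from s by inserting one letter 0..a-1 at any position."""
    return {s[:i] + str(x) + s[i:] for i in range(len(s) + 1) for x in range(a)}


n, a, R = 5, 4, {12, 13, 14}
covered = all(
    any(in_code(w, n, a, R) for w in insertions("".join(s), a))
    for s in product("0123", repeat=n)
)
print(weighted_sum("23010"), covered)  # 24 True
```

Every string of length 5 reaches a code word with a single insertion, which is the covering property an insertion-1 code needs.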

Example: s = 23010, sum(s) = 24, score(s) = 6
Inserting the smallest letter (0, giving X) and the largest letter (3, giving Y) at each position of s. Sums use digit weights 6, 5, …, 1 from left to right:

| Insertion position | X | sum(X) | Y | sum(Y) |
| --- | --- | --- | --- | --- |
| 1 | 023010 | 24 | 323010 | 42 |
| 2 | 203010 | 26 | 233010 | 41 |
| 3 | 230010 | 29 | 233010 | 41 |
| 4 | 230010 | 29 | 230310 | 38 |
| 5 | 230100 | 30 | 230130 | 36 |

[Figure: number line of sums 23 to 43 with their scores (sum % 18): 5, 6, …, 17, 0, 1, …, 7, with X and Y marked for each insertion position.]

Edit Distance Codes

| Insertion/Deletion | Size | Explicit/Existence |
| --- | --- | --- |
| Insertion-1 | | explicit |
| Deletion-1 | | explicit |
| Deletion-2 | | explicit |
| Deletion-1 | | existence |

Summary
1. Fuzzy joins for Hamming and edit distance in one-round MR
2. Anchor Points algorithm
   • Covering codes
   • Flexible parallelism
   • Better communication cost than naive
3. Explicit construction of Hamming distance covering codes
4. Explicit construction of edit distance covering codes

Open Questions
Fuzzy Joins in MR
   • Minimum communication for a given per-reducer memory for 1-round MR algorithms? The answer is known only for Hamming distance 1.
   • How about multi-round MR algorithms?
Covering Codes
   • Are there smaller codes?
   • Can we construct smaller codes explicitly?
   • What is the size of the smallest codes?

Related Work
Fuzzy Joins in MR
   • Fuzzy Joins Using MapReduce, Afrati et al., ICDE 2012
   • Document Similarity Self-Join with MapReduce, Baraglia et al., ICDM 2010
   • Efficient Parallel Set-similarity Joins Using MapReduce, Vernica et al., SIGMOD 2010
   • Efficient Similarity Joins for Near Duplicate Detection, Xiao et al., WWW 2008
Covering Codes
   • Covering Codes, Cohen et al.
   • On Asymmetric Coverings and Covering Numbers, Applegate et al., Comb. Designs 2003
   • Asymmetric Binary Covering Codes, Cooper et al., Comb. Theory 2002
Questions?