Page 1: Set Similarity · 2019-08-05 · Before we start… Let’s consider three internet technologies launched around 20 years ago!5

Rasmus Pagh IT University of Copenhagen

Google Research, BARC

WADS, Edmonton, August 5, 2019

Scalable Similarity Search

Set Similarity – a Survey

Set of Q&A

Page 2:

Before we start…

Let’s consider three internet technologies launched around 20 years ago

Page 3:

Recommendations

Page 4:

Advanced search

Page 5:

Wildcard operator

Edm?nt?n map

Page 6:

Before we start…

What happened to wildcard search and to boolean expressions?

Page 7:

It’s all about sets

• Set of shopping carts with Canadian Train Ride; set of shopping carts with Trans-American Train Ride

• (1,E) (2,d) (3,m) (5,n) (6,t) (8,n), with (4,o) and (7,o) shown separately

• Web pages containing “ballroom”; web pages containing “dance”; web pages containing “salsa”

Page 8:

Setting of this talk

• We are given a collection of sets $S_1, \dots, S_n \subseteq U$ that we are allowed to preprocess.

• Seek answer to queries such as:

- Given $i, j$, what is the size of $S_i \cup S_j$? $S_i \cap S_j$? (Similarity computation)

- Given a set $Q$, is there an $i$ such that $Q \subseteq S_i$? $Q \supseteq S_i$? (Similarity search)

- Given a set $Q$ and an integer $t$, is there an $i$ such that $|Q \cap S_i| \geq t$? (Similarity search)
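As a baseline for the queries listed on this slide, the following is a minimal brute-force sketch with no preprocessing at all. The function names are illustrative and not from the talk; each search query scans all $n$ sets, which is exactly the cost the rest of the talk tries to beat.

```python
from typing import List, Optional, Set

def intersection_size(sets: List[Set[int]], i: int, j: int) -> int:
    """Similarity computation: |S_i ∩ S_j|."""
    return len(sets[i] & sets[j])

def union_size(sets: List[Set[int]], i: int, j: int) -> int:
    """Similarity computation: |S_i ∪ S_j|."""
    return len(sets[i] | sets[j])

def find_superset(sets: List[Set[int]], q: Set[int]) -> Optional[int]:
    """Similarity search: first i with Q ⊆ S_i, or None."""
    return next((i for i, s in enumerate(sets) if q <= s), None)

def find_subset(sets: List[Set[int]], q: Set[int]) -> Optional[int]:
    """Similarity search: first i with Q ⊇ S_i, or None."""
    return next((i for i, s in enumerate(sets) if q >= s), None)

def find_overlap(sets: List[Set[int]], q: Set[int], t: int) -> Optional[int]:
    """Similarity search: first i with |Q ∩ S_i| >= t, or None."""
    return next((i for i, s in enumerate(sets) if len(q & s) >= t), None)
```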

Page 9:

                Similarity computation    Similarity search
Bad news                  1                       2
Good news                 3                       4

Page 10:

Bad news

• Query: Given $i, j$, what is the size of $S_i \cap S_j$?

• [Pǎtraşcu ’10], [Kopelowitz et al. ’14]:

- Assume we can preprocess sets $S_1, \dots, S_n \subseteq [n]$, each of size $\sqrt{n}$, in time $O(n^{1.99})$ such that it is possible to determine if $S_i \cap S_j = \emptyset$ in time $O(n^{0.49})$.

- Then integer 3SUM can be solved in time $O(n^{1.991})$.

Suggests that $\mathrm{polylog}(n)$ query time is not possible without essentially precomputing all answers.

Similarity computation

Page 11:

Bad news

• Given a set $Q$, is there an $i$ such that $Q \subseteq S_i$?

• [Williams ’04], [Alman & Williams ’15]:

- Assume we can preprocess sets $S_1, \dots, S_n \subseteq [n^{0.01}]$ in time $\mathrm{poly}(n)$ such that it is possible to determine if $\exists i : Q \subseteq S_i$ in time $O(n^{0.99})$.

- Then $\exists c < 2$ such that k-SAT with $n$ variables can be solved in time $c^n$.

Under the strong exponential time hypothesis, this is not possible!

Similarity search

Page 12:

The good news…

• We can now explain why nearly no progress on basic set processing problems has been made since the 1970s.

• More constructively, it justifies looking at $c$-approximate versions of these problems:

- Given $i, j$, what is the approximate size of $S_i \cup S_j$ and $S_i \cap S_j$, up to a multiplicative error $c > 1$?

- Given a set $Q$ and an integer $t$, is there an $i$ such that $|Q \cap S_i| \geq t$, or is $|Q \cap S_i| \leq t/c$ for all $i$?

Page 13:

                Similarity computation    Similarity search
Bad news                  1                       2
Good news                 3                       4

Up next: 3

Page 14:

Similarity estimation attempt 2: Coordinated sampling

[Brewer et al. ’72]

• Sample $U' \subseteq U$ where $x \in U'$ independently with probability $\alpha$, and let $S'_i = S_i \cap U'$.

• Observe that $\mu = E[|S'_i \cap S'_j|] = \alpha |S_i \cap S_j|$, and by Chernoff bounds $|S'_i \cap S'_j| \approx \mu$ with probability $1 - e^{-\Omega(\mu)}$.

• Can estimate $|S_i \cap S_j| \approx |S'_i \cap S'_j| / \alpha$ if sampling rate $\alpha \gg 1/|S_i \cap S_j|$.

• Time to compute estimate is $|S'_i| + |S'_j| \approx \alpha(|S_i| + |S_j|)$.

Drawbacks:

- Need to store set $U'$.

- Sample size is variable.
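The coordinated-sampling estimate can be sketched as follows. This is a minimal illustration rather than the construction from the cited work: the function names and the use of Python's `random.Random` with a fixed seed are assumptions. The sample $U'$ is stored explicitly, which is one of the two drawbacks the slide points out.

```python
import random
from typing import Iterable, Set

def sample_universe(universe: Iterable[int], alpha: float, seed: int = 0) -> Set[int]:
    """Draw U': each x in U is kept independently with probability alpha."""
    rng = random.Random(seed)
    # sorted() only makes the draw reproducible for a given seed
    return {x for x in sorted(universe) if rng.random() < alpha}

def restrict(S: Set[int], u_prime: Set[int]) -> Set[int]:
    """S' = S ∩ U'. All sets share the same U', hence 'coordinated'."""
    return S & u_prime

def estimate_intersection(si_prime: Set[int], sj_prime: Set[int], alpha: float) -> float:
    """Estimate |S_i ∩ S_j| as |S'_i ∩ S'_j| / alpha."""
    return len(si_prime & sj_prime) / alpha
```

The estimate only touches the two samples, so its cost is roughly $\alpha(|S_i| + |S_j|)$, but the size of each $S'_i$ varies with the random draw.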

Page 15:

A mystery of alpine flowers

[Scanned title page: Dr Paul Jaccard, professeur, “Distribution de la flore alpine dans le Bassin des Dranses et dans quelques régions voisines” (Distribution of the alpine flora in the Dranses basin and in some neighbouring regions), Bulletin de la Société Vaudoise des Sciences Naturelles, Vol. XXXVII, No. 140, 1901]

1901-1996: 41 citations (Google Scholar)

1997-2019: ~2800 citations (Google Scholar)

Page 16:

Min-wise hashing (aka. minhash)

[Broder ’97]

• Pick random hash function $h : U \to [n^{10}]$ and define: $\mathrm{minhash}_h(S_i) = \arg\min_{x \in S_i} h(x)$

• $\Pr[\mathrm{minhash}_h(S_i) = \mathrm{minhash}_h(S_j)] \approx |S_i \cap S_j| / |S_i \cup S_j|$

• Repeat $s$ times to get sample of size $s$. Advantages:

- Coordinated samples without storing a set $U'$.

- Storage requirement is fixed.
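The collision probability of minhash can be checked empirically with a small sketch. As a simplifying assumption, the random hash function is modelled here as a uniformly random ranking of an explicit universe list, so there are no hash collisions at all; the slide's $h : U \to [n^{10}]$ only makes collisions unlikely. All names are illustrative.

```python
import random
from typing import Dict, List, Set

def random_hash(universe: List[int], rng: random.Random) -> Dict[int, int]:
    """Idealised random hash: a uniformly random ranking of the universe."""
    ranks = list(range(len(universe)))
    rng.shuffle(ranks)
    return dict(zip(universe, ranks))

def minhash(h: Dict[int, int], S: Set[int]) -> int:
    """The element of the (non-empty) set S with the smallest hash value."""
    return min(S, key=h.__getitem__)

def collision_rate(si: Set[int], sj: Set[int], universe: List[int],
                   trials: int, seed: int = 0) -> float:
    """Fraction of independent hash functions under which the minhashes agree."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = random_hash(universe, rng)
        hits += minhash(h, si) == minhash(h, sj)
    return hits / trials
```

With many trials, `collision_rate` should concentrate around $|S_i \cap S_j| / |S_i \cup S_j|$: the two minhashes agree exactly when the smallest-ranked element of the union lies in the intersection.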

Page 17:

Minhash estimation

[Broder ’97]

• Pick random hash functions $h_t : U \to [n^{10}]$, $t = 1, \dots, s$.

• Create sketch vectors $v(S_i)$, where $v(S_i)_t = \mathrm{minhash}_{h_t}(S_i)$.

• Estimator: $X = \frac{1}{s} \sum_t \mathbf{1}_{v(S_i)_t = v(S_j)_t}$

• $E[X] \approx \frac{|S_i \cap S_j|}{|S_i \cup S_j|} = J(S_i, S_j)$

$\mathrm{Var}[X] = J(1 - J)/s$

[Figure: sets $S_i$ and $S_j$ are mapped to sketch vectors $v(S_i)$ and $v(S_j)$, which are compared coordinate by coordinate]
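A minimal sketch of the minhash estimator follows. The hash functions $h_t$ are simulated by hashing the pair $(t, x)$ with BLAKE2b from Python's `hashlib`; that choice, and every function name, is an assumption made for illustration, not something the talk prescribes.

```python
import hashlib
from typing import Hashable, List, Set

def h(t: int, x: Hashable) -> int:
    """Stand-in for the t-th random hash function h_t (64-bit output)."""
    data = f"{t}:{x!r}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def sketch(S: Set[Hashable], s: int) -> List[Hashable]:
    """v(S)_t = argmin over x in S of h_t(x), for t = 0, ..., s-1 (S non-empty)."""
    return [min(S, key=lambda x: h(t, x)) for t in range(s)]

def estimate_jaccard(vi: List[Hashable], vj: List[Hashable]) -> float:
    """X = (1/s) * number of coordinates where the two sketches agree."""
    return sum(a == b for a, b in zip(vi, vj)) / len(vi)
```

Because the sketch stores the minimising element itself, two disjoint sets can never agree in any coordinate, and identical sets agree in all of them; in between, the standard deviation of the estimate shrinks like $\sqrt{J(1-J)/s}$.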

Page 18:

1-bit minhash

[Li and König ’09]

• Idea: Compress the vector $v(S_i) \in U^s$ to $v'(S_i) \in \{0,1\}^s$.

• Use hash functions $g_t : U \to \{0,1\}$, and define: $v'(S_i)_t = g_t(v(S_i)_t)$

• Estimator for Jaccard sim.: $X' = \frac{2}{s} \left( \sum_t \mathbf{1}_{v'(S_i)_t = v'(S_j)_t} \right) - 1$.

$\mathrm{Var}[X'] = (1 + J)(1 - J)/s$

Factor $(1 + J)/J$ larger than minhash
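The 1-bit compression and its estimator can be sketched as below. Both $h_t$ and $g_t$ are simulated with BLAKE2b from `hashlib`, which is an illustrative assumption, as are the function names. Two coordinates agree with probability $J + (1 - J)/2 = (1 + J)/2$, which is why the estimator rescales the agreement rate by $2(\cdot) - 1$.

```python
import hashlib
from typing import Hashable, List, Set

def _digest(tag: str, t: int, x: Hashable, size: int) -> bytes:
    return hashlib.blake2b(f"{tag}:{t}:{x!r}".encode(), digest_size=size).digest()

def h(t: int, x: Hashable) -> int:
    """Stand-in for the t-th minhash function h_t."""
    return int.from_bytes(_digest("h", t, x, 8), "big")

def g(t: int, x: Hashable) -> int:
    """Stand-in for g_t : U -> {0, 1}."""
    return _digest("g", t, x, 1)[0] & 1

def one_bit_sketch(S: Set[Hashable], s: int) -> List[int]:
    """v'(S)_t = g_t(minhash_{h_t}(S)) for a non-empty set S."""
    return [g(t, min(S, key=lambda x: h(t, x))) for t in range(s)]

def estimate_jaccard(vi: List[int], vj: List[int]) -> float:
    """X' = (2/s) * (number of agreeing bits) - 1."""
    agree = sum(a == b for a, b in zip(vi, vj))
    return 2 * agree / len(vi) - 1
```

Note that the estimate can be negative for sets with low similarity, since unrelated bits agree only about half of the time.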

Page 19:

Optimality of 1-bit minhash

• [P.-Stöckel-Woodruff ’14]: The variance of any estimator for Jaccard similarity based on $s$-bit summaries must be $\Omega_\varepsilon(1/s)$ for $J \in (\varepsilon, 1 - \varepsilon)$.

• What happens when $J$ is close to zero or one?

- Not much seems to be known about $J \approx 0$.

- Experiments in [Li and König ’09] suggest that using $b$-bit minwise hashing is better for $J \approx 0$.

Page 20:

Lower variance for low similarities

[Christiani ’18]

[Plot: Hamming distance / $s$ as a function of $|S_i \cap S_j|/w$, comparing 1-bit minwise and 1-bit CP]

“CP hash”:

• Choose $I_t \subseteq U$, where $\Pr[k \in I_t] = p$ indep.

• Parameter $p$ is chosen s.t. $\Pr[S \cap I_t = \emptyset] = \frac{1}{2}$.

• Define $v'''(S)_t = \mathbf{1}_{S \cap I_t \neq \emptyset}$.

• Variance improves by factor almost 2 for small $J$.

Assume for simplicity that all sets have size $w$.
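The “CP hash” bits can be sketched as follows, assuming all sets have the same size $w$ as on the slide. Membership of $k$ in $I_t$ is simulated by hashing $(t, k)$ to a number in $[0, 1)$. The inversion formula in `estimate_overlap` is not taken from the talk: it is derived here from $\Pr[\text{both bits are } 0] = (1 - p)^{2w - |S_i \cap S_j|}$, which gives an expected Hamming fraction of $1 - 2^{x - 1}$ with $x = |S_i \cap S_j| / w$. Treat it, and all names, as assumptions.

```python
import hashlib
import math
from typing import Hashable, List, Set

def unit(t: int, k: Hashable) -> float:
    """Pseudo-random number in [0, 1) determined by (t, k)."""
    d = hashlib.blake2b(f"{t}:{k!r}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big") / 2**64

def inclusion_probability(w: int) -> float:
    """p with (1 - p)^w = 1/2, so a set of size w misses I_t half of the time."""
    return 1 - 0.5 ** (1 / w)

def cp_sketch(S: Set[Hashable], s: int, p: float) -> List[int]:
    """Bit t is 1 iff S intersects I_t = {k : unit(t, k) < p}."""
    return [int(any(unit(t, k) < p for k in S)) for t in range(s)]

def estimate_overlap(vi: List[int], vj: List[int]) -> float:
    """Estimate x = |S_i ∩ S_j| / w from the Hamming fraction d via x = 1 + log2(1 - d)."""
    d = sum(a != b for a, b in zip(vi, vj)) / len(vi)
    d = min(d, 0.5)  # the expected value never exceeds 1/2; clamp sampling noise
    return 1 + math.log2(1 - d)
```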

Page 21:

Lower variance for high similarities

[Mitzenmacher, P., Pham ’14]

• Start with minhash $v(S_i) \in U^{\alpha s}$.

• 1-bit minhash: $v'(S_i)_t = g_t(v(S_i)_t)$

• Alternative binarization, “odd sketch”: Use hash function $g : U \to \{1, \dots, s\}$, define $v''(S)_t = \sum_j \mathbf{1}_{g(v(S)_j) = t} \bmod 2$.

• Can estimate $|S_i \triangle S_j| = |S_i \setminus S_j| + |S_j \setminus S_i|$ from $v''(S_i) \oplus v''(S_j)$, error proportional to $1 - J(S_i, S_j)$.
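The parity construction behind the odd sketch can be illustrated on plain sets of items. In the slide the items are the minhash values $v(S)_j$; here any hashable items are accepted, positions are numbered $0, \dots, s-1$ rather than $1, \dots, s$, and $g$ is simulated with BLAKE2b, all of which are assumptions. The estimator inverts the approximation $E[\text{fraction of ones}] \approx (1 - e^{-2m/s})/2$ for a symmetric difference of size $m$.

```python
import hashlib
import math
from typing import Hashable, Iterable, List

def g(x: Hashable, s: int) -> int:
    """Stand-in for g : U -> {0, ..., s-1}."""
    d = hashlib.blake2b(repr(x).encode(), digest_size=8).digest()
    return int.from_bytes(d, "big") % s

def odd_sketch(items: Iterable[Hashable], s: int) -> List[int]:
    """Bit t is the parity of the number of items that g maps to position t."""
    bits = [0] * s
    for x in items:
        bits[g(x, s)] ^= 1
    return bits

def xor(a: List[int], b: List[int]) -> List[int]:
    return [x ^ y for x, y in zip(a, b)]

def estimate_symmetric_difference(a: List[int], b: List[int]) -> float:
    """Estimate |A △ B| from the number z of ones in the XOR of the sketches."""
    s, z = len(a), sum(xor(a, b))
    if 2 * z >= s:
        return math.inf  # sketch saturated: difference too large for s bits
    return -(s / 2) * math.log(1 - 2 * z / s)
```

Items common to both sets cancel in the XOR, so `xor(odd_sketch(A, s), odd_sketch(B, s))` equals `odd_sketch(A ^ B, s)` exactly. This is why the error depends on the size of the difference, which is small for highly similar sets.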

Page 22:

                Similarity estimation     Similarity search
Bad news                  1                       2
Good news                 3                       4

Up next: 4

Page 23:

Minhash for searching

[Indyk & Motwani ’98]

• Fix reals $1 > j_1 > j_2 > 0$.

• Query: Given $Q \subseteq U$, find $i$ such that $J(Q, S_i) \geq j_1$, assuming that for all $j \neq i$ we have $J(Q, S_j) \leq j_2$.

• Data structure: Choose $s$ such that $j_2^s \approx 1/n$. For each set $S_i$ store $v(S_i)$ in a hash table, with pointer to $S_i$.

• Query: Look up $v(Q)$ in hash table, inspect linked set(s).

• Analysis:

- Expected number of matching sets, $E\left[\sum_i \mathbf{1}_{v(Q) = v(S_i)}\right] \leq 2$.

- Success probability $j_1^s \approx n^{-\log(j_1)/\log(j_2)}$; repeat until success.
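The data structure on this slide can be sketched as a list of hash tables keyed by the full sketch $v(S_i)$. “Repeat until success” is modelled here by `reps` independent tables built up front, which is one possible reading and an assumption, as are the BLAKE2b-based hash functions and all names. Sets are assumed non-empty.

```python
import hashlib
import math
from collections import defaultdict
from typing import Hashable, List, Optional, Set, Tuple

def h(r: int, t: int, x: Hashable) -> int:
    """Stand-in for hash function h_t of repetition r."""
    d = hashlib.blake2b(f"{r}:{t}:{x!r}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big")

def sketch(S: Set[Hashable], s: int, r: int) -> Tuple[Hashable, ...]:
    return tuple(min(S, key=lambda x: h(r, t, x)) for t in range(s))

def choose_s(n: int, j2: float) -> int:
    """Smallest s with j2^s <= 1/n (at least 1)."""
    return max(1, math.ceil(math.log(n) / math.log(1 / j2)))

def jaccard(a: Set[Hashable], b: Set[Hashable]) -> float:
    return len(a & b) / len(a | b)

def build(sets: List[Set[Hashable]], j2: float, reps: int):
    s = choose_s(len(sets), j2)
    tables = [defaultdict(list) for _ in range(reps)]
    for r, table in enumerate(tables):
        for i, S in enumerate(sets):
            table[sketch(S, s, r)].append(i)  # pointer to S_i
    return tables, s

def query(tables, s: int, sets: List[Set[Hashable]], Q: Set[Hashable],
          j1: float) -> Optional[int]:
    for r, table in enumerate(tables):
        for i in table.get(sketch(Q, s, r), []):  # inspect linked sets
            if jaccard(Q, sets[i]) >= j1:
                return i
    return None
```

A single table finds a set with $J(Q, S_i) \geq j_1$ with probability about $j_1^s$, so on the order of $n^{\log(j_1)/\log(j_2)}$ repetitions are needed for a constant success probability.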

Page 24:

Is min-hash search optimal?

Can we hope to beat $O\left(n^{\log(j_1)/\log(j_2)}\right)$?

• [Christiani-P. ’17], [Ahle ’19]: Improvement of the exponent is possible!

• [Chen-Williams ’19], [Stausholm-P.-Thorup ’19]: Assuming the Strong Exponential Time Hypothesis, time $n^{1 - \Omega(1)}$ requires that $\log(j_1)/\log(j_2) < 1 - \Omega(1)$.

Assume for simplicity that all sets have size $w$.

Page 25:

ChosenPath algorithm

[Christiani-P. ’17]

• Choose $I \subseteq U$, where $\Pr[k \in I] = \frac{1 + j_1}{2 j_1 w}$.

• Create recursive data structures for sets $X_k = \{S_i \mid k \in S_i\}$ for $k \in I$ until recursion depth $\left\lceil \log(n) / \log\left(\frac{1 + j_2}{2 j_2}\right) \right\rceil$.

• Queries: For each $k \in Q$, recurse in subtree $X_k$ (if it exists), perform exhaustive search at leaves.

Root of the recursion: $X = \{S_i \mid i = 1, \dots, n\}$
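The recursion can be sketched by representing each node of the tree as the path of chosen elements leading to it. This sketch reads the slide's sampling probability as $(1 + j_1)/(2 j_1 w)$ and the depth as $\lceil \log n / \log((1 + j_2)/(2 j_2)) \rceil$. The choice of $I$ at a node is simulated by hashing `(rep, path, k)`, so that indexing and querying see the same random choices. That mechanism, the flat dictionary in place of an explicit tree, and all names are assumptions, not the authors' implementation.

```python
import hashlib
import math
from collections import defaultdict
from typing import List, Optional, Set, Tuple

def unit(rep: int, path: Tuple[int, ...], k: int) -> float:
    """Pseudo-random number in [0, 1) fixed by the node (rep, path) and element k."""
    d = hashlib.blake2b(repr((rep, path, k)).encode(), digest_size=8).digest()
    return int.from_bytes(d, "big") / 2**64

def parameters(n: int, w: int, j1: float, j2: float) -> Tuple[float, int]:
    p = min(1.0, (1 + j1) / (2 * j1 * w))  # Pr[k in I]
    depth = math.ceil(math.log(n) / math.log((1 + j2) / (2 * j2)))
    return p, max(1, depth)

def chosen_paths(S: Set[int], p: float, depth: int, rep: int) -> List[Tuple[int, ...]]:
    """All root-to-leaf paths that S follows: extend a path by k in S whenever k is in I."""
    frontier: List[Tuple[int, ...]] = [()]
    for _ in range(depth):
        frontier = [path + (k,) for path in frontier for k in S
                    if unit(rep, path, k) < p]
    return frontier

def build(sets: List[Set[int]], p: float, depth: int, reps: int):
    index = defaultdict(set)  # leaf (rep, path) -> indices of sets reaching it
    for rep in range(reps):
        for i, S in enumerate(sets):
            for path in chosen_paths(S, p, depth, rep):
                index[(rep, path)].add(i)
    return index

def query(index, sets: List[Set[int]], Q: Set[int], p: float, depth: int,
          reps: int, j1: float) -> Optional[int]:
    for rep in range(reps):
        for path in chosen_paths(Q, p, depth, rep):
            for i in index.get((rep, path), ()):  # exhaustive search at the leaf
                if len(Q & sets[i]) / len(Q | sets[i]) >= j1:
                    return i
    return None
```

A set $S_i$ and a query $Q$ reach the same leaf only if every element on the path lies in $Q \cap S_i$, which is what makes highly overlapping pairs far more likely to meet than weakly overlapping ones.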

Page 26:

ChosenPath analysis

• Suppose $|S_i \cap Q| / |S_i \cup Q| \geq j_1$. Then the set of “good” recursive calls $k \in I \cap Q \cap S_i$ has expected size at least 1.

• In branching process terminology: expected number of offspring is at least 1 at each level of the recursion.

• Theory of branching processes [Agresti ’74] implies success probability $1/(\lambda + 1)$ at level $\lambda$.

• Repeat $\lambda$ times for constant success probability.
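The first claim of the analysis can be verified in two lines, assuming as before that $|S_i| = |Q| = w$ and reading the ChosenPath sampling probability as $\Pr[k \in I] = \frac{1 + j_1}{2 j_1 w}$:

```latex
\begin{align*}
\frac{|S_i \cap Q|}{|S_i \cup Q|} = \frac{|S_i \cap Q|}{2w - |S_i \cap Q|} \geq j_1
  &\;\Longrightarrow\; |S_i \cap Q| \geq \frac{2 j_1 w}{1 + j_1}, \\
E\bigl[\,|I \cap Q \cap S_i|\,\bigr] = \Pr[k \in I] \cdot |Q \cap S_i|
  &\;\geq\; \frac{1 + j_1}{2 j_1 w} \cdot \frac{2 j_1 w}{1 + j_1} = 1 .
\end{align*}
```

So the sampling probability is calibrated to make the branching process of good calls critical or supercritical for any set that meets the similarity threshold $j_1$.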

Page 27:

Partial match

• Combines ChosenPath with an idea of “supermajorities” inspired by angular LSH to get improved results for asymmetric sets, $|Q| \neq |S_i|$.

• Special case is “partial match” queries, $|Q| = j_1 |S_i|$.

Page 28:

Supermajorities for partial match

Consider case where minhash leads to search time $n$.

Page 29:

Beyond set similarity

In many research communities: Hashing = mapping to $\{0,1\}^s$.

Page 30:

Some open problems

Similarity estimation

1. Is there a single sketch that is simultaneously space/variance optimal for low and high Jaccard similarity?

2. Known $s$-bit sketches and estimators for Jaccard similarity are symmetric. Can asymmetry improve precision?

3. How many bits are needed to estimate Jaccard similarity up to factor $1 + \varepsilon$ when $J \to 0$?

bit.ly/2T3laP0

Page 31:

More open problems

Similarity search

4. We wish to choose $h$ from an explicit family of functions such that $\Pr[\mathrm{minhash}_h(S_i) = \mathrm{minhash}_h(S_j)] = (1 \pm \varepsilon) \frac{|S_i \cap S_j|}{|S_i \cup S_j|}$. Is there an explicit such family of size $O(\mathrm{poly}(1/\varepsilon) \log |U|)$?

5. Similarity search in Euclidean/Hamming space can be made faster using data dependent LSH. What kind of speedup can be achieved for set similarity (maybe via embedding)?

6. Is the performance of Ahle’s supermajorities algorithm the best possible for LSH-based partial match?

bit.ly/2T3laP0

Page 32:

That’s not all Folks!

Timothy Chan, Saladi Rahul and Jie Xue. Range closest-pair search in higher dimensions

Boris Aronov, Omrit Filtser, Michael Horton, Matthew Katz and Khadijeh Sheikhan. Efficient Nearest-Neighbor Query and Clustering of Planar Curves

Timothy M. Chan, Yakov Nekrich and Michiel Smid. Orthogonal Range Reporting and Rectangle Stabbing for Fat Rectangles

Matteo Ceccarello, Anne Driemel and Francesco Silvestri. FRESH: Fréchet Similarity with Hashing