cobinatorial algorithms for nearest neighbors, near-duplicates and small world design - yury...

Post on 29-Aug-2014

2.178 Views

Category:

Technology

1 Downloads

Preview:

Click to see full reader

DESCRIPTION

 

TRANSCRIPT

Combinatorial Algorithmsfor Nearest Neighbors, Near-Duplicates

and Small-World Design

Yury LifshitsYahoo! Research

Shengyu ZhangThe Chinese University of Hong Kong

SODA 2009

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 1 / 29

Similarity Search: an ExampleInput: Set of objects

Task: Preprocess it

Query: New object

Task: Find the most

similar one in the dataset

Most similar

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 2 / 29

Similarity Search: an ExampleInput: Set of objects

Task: Preprocess it

Query: New object

Task: Find the most

similar one in the dataset

Most similar

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 2 / 29

Similarity Search: an ExampleInput: Set of objects

Task: Preprocess it

Query: New object

Task: Find the most

similar one in the dataset

Most similar

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 2 / 29

Similarity Search

Search space: object domain U, similarityfunction σ

Input: database S = {p1, . . . ,pn} ⊆ UQuery: q ∈ UTask: find argmax σ(pi,q)

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 3 / 29

Nearest Neighbors in TheorySphere Rectangle Tree Orchard’s Algorithm k-d-B tree

Geometric near-neighbor access tree Excludedmiddle vantage point forest mvp-tree Fixed-height

fixed-queries tree AESA Vantage-pointtree LAESA R∗-tree Burkhard-Keller tree BBD tree

Navigating Nets Voronoi tree Balanced aspect ratio

tree Metric tree vps-tree M-treeLocality-Sensitive Hashing SS-tree

R-tree Spatial approximation treeMulti-vantage point tree Bisector tree mb-tree Cover

tree Hybrid tree Generalized hyperplane tree Slim tree

Spill Tree Fixed queries tree X-tree k-d tree Balltree

Quadtree Octree Post-office treeYury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 4 / 29

Revision: Basic Assumptions

In theory:Triangle inequalityDoubling dimension is o(logn)

Typical web dataset has separation effect

For almost all i, j : 1/2 ≤ d(pi,pj) ≤ 1

Classic methods fail:Branch and bound algorithms visit every objectDoubling dimension is at least logn/2

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 5 / 29

Revision: Basic Assumptions

In theory:Triangle inequalityDoubling dimension is o(logn)

Typical web dataset has separation effect

For almost all i, j : 1/2 ≤ d(pi,pj) ≤ 1

Classic methods fail:Branch and bound algorithms visit every objectDoubling dimension is at least logn/2

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 5 / 29

Revision: Basic Assumptions

In theory:Triangle inequalityDoubling dimension is o(logn)

Typical web dataset has separation effect

For almost all i, j : 1/2 ≤ d(pi,pj) ≤ 1

Classic methods fail:Branch and bound algorithms visit every objectDoubling dimension is at least logn/2

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 5 / 29

Contribution

Navin Goyal, YL, Hinrich Schütze, WSDM 2008:

Combinatorial framework: new approach to datamining problems that does not require triangleinequality

Nearest neighbor algorithm

This work:

Better nearest neighbor search

Detecting near-duplicates

Navigability design for small worlds

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 6 / 29

Contribution

Navin Goyal, YL, Hinrich Schütze, WSDM 2008:

Combinatorial framework: new approach to datamining problems that does not require triangleinequality

Nearest neighbor algorithm

This work:

Better nearest neighbor search

Detecting near-duplicates

Navigability design for small worlds

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 6 / 29

Outline

1 Combinatorial Framework

2 New Algorithms

3 Combinatorial Nets

4 Directions for Further Research

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 7 / 29

1Combinatorial Framework

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 8 / 29

Comparison Oracle

Dataset p1, . . . ,pn

Objects and distance (or similarity)function are NOT given

Instead, there is a comparison oracleanswering queries of the form:

Who is closer to A: B or C?

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 9 / 29

Disorder InequalitySort all objects by their similarity to p:

p r s

rankp(r)

rankp(s)

Then by similarity to r:

r s

rankr(s)

Dataset has disorder D if∀p, r, s : rankr(s) ≤ D(rankp(r) + rankp(s))

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 10 / 29

Disorder InequalitySort all objects by their similarity to p:

p r s

rankp(r)

rankp(s)

Then by similarity to r:

r s

rankr(s)

Dataset has disorder D if∀p, r, s : rankr(s) ≤ D(rankp(r) + rankp(s))

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 10 / 29

Disorder InequalitySort all objects by their similarity to p:

p r s

rankp(r)

rankp(s)

Then by similarity to r:

r s

rankr(s)

Dataset has disorder D if∀p, r, s : rankr(s) ≤ D(rankp(r) + rankp(s))

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 10 / 29

Combinatorial Framework

=

Comparison oracleWho is closer to A: B or C?

+

Disorder inequalityrankr(s) ≤ D(rankp(r) + rankp(s))

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 11 / 29

Combinatorial Framework: FAQ

Disorder of a metric space? Disorder ofRk?

In what cases disorder is relatively small?

Experimental values of D for somepractical datasets?

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 12 / 29

Disorder vs. Others

If expansion rate is c, disorder constant isat most c2

Doubling dimension and disorderdimension are incomparable

Disorder inequality implies combinatorialform of “doubling effect”

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 13 / 29

Combinatorial Framework: Pro & Contra

Advantages:

Does not require triangle inequality for distances

Applicable to any data model and any similarityfunction

Require only comparative training information

Limitation: worst-case form of disorder inequality

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 14 / 29

Combinatorial Framework: Pro & Contra

Advantages:

Does not require triangle inequality for distances

Applicable to any data model and any similarityfunction

Require only comparative training information

Limitation: worst-case form of disorder inequality

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 14 / 29

2New Algorithms

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 15 / 29

Nearest Neighbor Search

Assume S ∪ {q} has disorder constant D

TheoremThere is a deterministic and exact algorithmfor nearest neighbor search:

Preprocessing: O(D7n log2n)

Search: O(D4 logn)

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 16 / 29

Near-Duplicates

Assume, comparison oracle can also tell uswhether σ(x,y) > T for some similaritythreshold T

TheoremAll pairs with over-T similarity can be founddeterministically in time

poly(D)(n log2n+ |Output|)

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 17 / 29

Visibility Graph

TheoremFor any dataset S with disorder D there existsa visibility graph:

poly(D)n log2n construction time

O(D4 logn) out-degrees

Naïve greedy routingdeterministically reachesexact nearest neighbor of the given target qin at most logn steps

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 18 / 29

qp1

p2

p3

p4

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 19 / 29

qp1

p2

p3

p4

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 19 / 29

qp1

p2

p3

p4

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 19 / 29

qp1

p2

p3

p4

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 19 / 29

qp1

p2

p3

p4

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 19 / 29

qp1

p2

p3

p4

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 19 / 29

3Combinatorial Nets

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 20 / 29

Combinatorial Ball

B(x, r) = {y : rankx(y) < r}

In other words, it is a subset of dataset S: theobject x itself and r − 1 its nearest neighbors

xB(x,10)

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 21 / 29

Combinatorial NetA subset R ⊆ S is called a combinatorialr-net iff the following two properties holds:Covering: ∀y ∈ S,∃x ∈ R, s.t. rankx(y) < r.Separation: ∀xi,xj ∈ R, rankxi(xj) ≥ r OR rankxj(xi) ≥ r

How to construct a combinatorial net?What upper bound on its size can we guarantee?

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 22 / 29

Combinatorial NetA subset R ⊆ S is called a combinatorialr-net iff the following two properties holds:Covering: ∀y ∈ S,∃x ∈ R, s.t. rankx(y) < r.Separation: ∀xi,xj ∈ R, rankxi(xj) ≥ r OR rankxj(xi) ≥ r

How to construct a combinatorial net?What upper bound on its size can we guarantee?

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 22 / 29

Basic Data Structure

Combinatorial nets:For every 0 ≤ i ≤ logn, construct a n

2i-net

Pointers, pointers, pointers:

Direct & inverted indices: links between centersand members of their balls

Cousin links: for every center keep pointers toclose centers on the same level

Navigation links: for every center keep pointers toclose centers on the next level

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 23 / 29

Basic Data Structure

Combinatorial nets:For every 0 ≤ i ≤ logn, construct a n

2i-net

Pointers, pointers, pointers:

Direct & inverted indices: links between centersand members of their balls

Cousin links: for every center keep pointers toclose centers on the same level

Navigation links: for every center keep pointers toclose centers on the next level

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 23 / 29

Fast Net Construction

TheoremCombinatorial nets can be constructed inO(D7n log2n) time

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 24 / 29

Up’n’Down Trick

Assume your have 2r-net for the dataset

To compute an r-ball around some object p:

1 Take a center p′ of 2r ball that is covering p

2 Take all centers of 2r-balls nearby p′

3 For all of them write down all members of theirs2r-balls

4 Sort all written objects with respect to p and keep rmost similar ones.

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 25 / 29

4Directions for Further Research

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 26 / 29

Future of Combinatorial FrameworkOther problems in combinatorial framework:

Low-distortion embeddingsClosest pairsCommunity discoveryLinear arrangementDistance labellingDimensionality reduction

What if disorder inequality has exceptions?

Insertions, deletions, changing metric

Experiments & implementation

Unification challenge: disorder + doubling = ?

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 27 / 29

Summary

Combinatorial framework:

comparison oracle + disorder inequality

New algorithms:

Nearest neighbor search

Deterministic detection of near-duplicates

Navigability design

Thanks for your attention!Questions?

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 28 / 29

Summary

Combinatorial framework:

comparison oracle + disorder inequality

New algorithms:

Nearest neighbor search

Deterministic detection of near-duplicates

Navigability design

Thanks for your attention!Questions?

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 28 / 29

Linkshttp://yury.name

http://simsearch.yury.nameTutorial, bibliography, people, links, open problems

Yury Lifshits and Shengyu Zhang

Combinatorial Algorithms for Nearest Neighbors, Near-Duplicates andSmall-World Design

http://yury.name/papers/lifshits2008similarity.pdf

Navin Goyal, Yury Lifshits, Hinrich Schütze

Disorder Inequality: A Combinatorial Approach to Nearest Neighbor Search

http://yury.name/papers/goyal2008disorder.pdf

Benjamin Hoffmann, Yury Lifshits, Dirk Novotka

Maximal Intersection Queries in Randomized Graph Models

http://yury.name/papers/hoffmann2007maximal.pdf

Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 29 / 29

top related