approximating edit distance in near-linear time

Approximating Edit Distance in Near-Linear Time. Alexandr Andoni (MIT). Joint work with Krzysztof Onak (MIT).


Page 1: Approximating Edit Distance in Near-Linear Time

Approximating Edit Distance in Near-Linear Time

Alexandr Andoni (MIT)

Joint work with Krzysztof Onak (MIT)

Page 2: Approximating Edit Distance in Near-Linear Time

Edit Distance

For two strings $x, y \in \Sigma^n$:

$\mathrm{ed}(x,y)$ = minimum number of edit operations to transform $x$ into $y$

Edit operations = insertion/deletion/substitution

Important in: computational biology, text processing, etc.

Example:

ed(0101010, 1010101) = 2
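The definition and the example can be checked with the textbook quadratic dynamic program. This is a minimal sketch for illustration only; the function name `edit_distance` is an assumption of this example, and it is the exact baseline, not the near-linear algorithm of the talk.

```python
def edit_distance(x: str, y: str) -> int:
    """Textbook O(|x| * |y|) dynamic program for edit distance."""
    # prev[j] = distance between the current prefix of x and y[:j]
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match for free
            ))
        prev = curr
    return prev[-1]


print(edit_distance("0101010", "1010101"))  # 2: delete the leading 0, append a 1
```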

Page 3: Approximating Edit Distance in Near-Linear Time

Computing Edit Distance

Problem: compute $\mathrm{ed}(x,y)$ for given $x, y \in \{0,1\}^n$

Exactly: $O(n^2)$ [Levenshtein’65]; $O(n^2/\log^2 n)$ for $|\Sigma| = O(1)$ [Masek-Paterson’80]

Approximately in $n^{1+o(1)}$ time: $n^{1/3+o(1)}$ approximation [Batu-Ergun-Sahinalp’06], improving over [Sahinalp-Vishkin’96, Cole-Hariharan’02, BarYossef-Jayram-Krauthgamer-Kumar’04]

Sublinear time: distinguish $\mathrm{ed} \le n^{1-\varepsilon}$ vs $\mathrm{ed} \ge n/100$ in $n^{1-2\varepsilon}$ time [Batu-Ergun-Kilian-Magen-Raskhodnikova-Rubinfeld-Sami’03]

Page 4: Approximating Edit Distance in Near-Linear Time

Computing via embedding into ℓ1

Embedding: $f: \{0,1\}^n \to \ell_1$ such that $\mathrm{ed}(x,y) \approx \|f(x) - f(y)\|_1$ up to some distortion (= approximation)

Can compute $\mathrm{ed}(x,y)$ in the time it takes to compute $f(x)$

Best embedding by [Ostrovsky-Rabani’05]: distortion = $2^{O(\sqrt{\log n})}$

Computation time: ~$n^2$ randomized (and similar dimension)

Helps for nearest neighbor search, sketching, but not computation…

Page 5: Approximating Edit Distance in Near-Linear Time

Our result

Theorem: Can compute $\mathrm{ed}(x,y)$ in $n \cdot 2^{O(\sqrt{\log n})}$ time with $2^{O(\sqrt{\log n})}$ approximation

While the algorithm uses some ideas from the [OR’05] embedding, it is not an algorithm for computing the [OR’05] embedding

Page 6: Approximating Edit Distance in Near-Linear Time

Review of Ostrovsky-Rabani embedding

$\varphi_m$ = embedding of strings of length $m$

$\delta(m)$ = distortion of $\varphi_m$

Embedding is recursive:

Partition into $b$ blocks ($b$ later chosen to be $\exp(\sqrt{\log m})$)

Use embeddings $\varphi_k$ for $k \le m/b$

Embed each block separately as follows…

[Figure: string X partitioned into blocks of length m/b]

Page 7: Approximating Edit Distance in Near-Linear Time

Ostrovsky-Rabani embedding (II)

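[Figure: string X split into blocks 1..b; block $i$ is associated with the set $E_i^s$ (shown as $E_1^s, E_2^s, E_3^s, \dots, E_b^s$), with a window of length $s$ marked]

$E_i^s$ = recursive embedding of the length-$s$ substrings of block $i$

Want to approximate $\mathrm{ed}(x,y)$ by $\sum_{i=1}^{b} \sum_{s \in S} \mathrm{TEMD}_s(E_i^s(x), E_i^s(y))$

EMD(A,B) = min-cost bipartite matching; the T in TEMD stands for "thresholded"

Finish by embedding TEMD into $\ell_1$ with small distortion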
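To make "min-cost bipartite matching" concrete, here is a brute-force sketch for tiny point sets. The slide does not spell out how the threshold enters; this example assumes the simplest reading, where each pairwise $\ell_1$ distance is capped at a threshold `t`. The names `l1` and `thresholded_emd` are illustrative only.

```python
from itertools import permutations


def l1(u, v):
    """l1 distance between two equal-length tuples of numbers."""
    return sum(abs(a - b) for a, b in zip(u, v))


def thresholded_emd(A, B, t):
    """Min-cost perfect matching between equal-size point lists A and B.

    Assumption of this sketch: a matched pair costs min(l1 distance, t).
    Tries every matching, so it is only usable for very small sets.
    """
    if len(A) != len(B):
        raise ValueError("A and B must have the same number of points")
    best = float("inf")
    for perm in permutations(range(len(B))):
        cost = sum(min(l1(a, B[j]), t) for a, j in zip(A, perm))
        best = min(best, cost)
    return best


A = [(0,), (10,)]
B = [(1,), (12,)]
print(thresholded_emd(A, B, t=100))  # 3: match 0 with 1 and 10 with 12
print(thresholded_emd(A, B, t=1))    # 2: every pair is capped at 1
```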

Page 8: Approximating Edit Distance in Near-Linear Time

Distortion of [OR] embedding

Suppose we can embed TEMD into $\ell_1$ with distortion $(\log m)^{O(1)}$

Then [Ostrovsky-Rabani’05] show that the distortion of $\varphi_m$ is $\delta(m) \le (\log m)^{O(1)} \cdot [\delta(m/b) + b]$

For $b = \exp[\sqrt{\log m}]$: $\delta(m) \le \exp[O(\sqrt{\log m})]$
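The recurrence can be unrolled as a rough sanity check. The sketch below makes two simplifying assumptions that are not on the slide: the polylogarithmic factor is written as $(\log m)^c$ for a hypothetical constant $c$, and $b$ is held fixed at its top-level value through the recursion.

```latex
\begin{align*}
\delta(m) &\le (\log m)^{c}\,\bigl[\delta(m/b) + b\bigr] \\
          &\le (\log m)^{2c}\,\delta(m/b^{2}) + \bigl[(\log m)^{c} + (\log m)^{2c}\bigr]\, b \\
          &\le \dots \le (\log m)^{ct}\,\bigl[\delta(m/b^{t}) + t\,b\bigr],
\qquad t = \frac{\log m}{\log b} = \sqrt{\log m}
\quad\text{for } b = e^{\sqrt{\log m}} .
\end{align*}
```

With $\delta(1) = O(1)$ this gives $\delta(m) \le (\log m)^{c\sqrt{\log m}} \cdot O(\sqrt{\log m}\, e^{\sqrt{\log m}}) = \exp[O(\sqrt{\log m} \cdot \log\log m)]$. That matches the bound on the slide up to a $\log\log m$ factor in the exponent, which the slide presumably suppresses.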

Page 9: Approximating Edit Distance in Near-Linear Time

Why it is expensive to compute [OR] embedding

In first step, need to compute recursive embedding for ~n/b strings of length ~n/b

The dimension blows up

[Figure: string X with a window of length $s$ marked; $E_1^s$ = recursive embedding of the length-$s$ substrings]

Page 10: Approximating Edit Distance in Near-Linear Time

Our Algorithm

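For each length $m$ in some fixed set $L \subseteq [n]$, compute vectors $v_i^m \in \ell_1$ such that $\|v_i^m - v_j^m\|_1 \approx \mathrm{ed}(z[i:i+m], z[j:j+m])$ up to distortion $\delta(m)$

Dimension of $v_i^m$ is only $O(\log^2 n)$

Vectors $v_i^m$ are computed inductively from $v_i^k$ for $k \le m/b$ ($k \in L$)

Output: $\mathrm{ed}(x,y) \approx \|v_1^{n/2} - v_{n/2+1}^{n/2}\|_1$ (i.e., for $m = n/2 = |x| = |y|$)

[Figure: z = xy, the concatenation of x and y, with the substring z[i:i+m] starting at position i]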

Page 11: Approximating Edit Distance in Near-Linear Time

Idea: intuition

For each $m \in L$, compute $\varphi_m(z[i:i+m])$ as in the O-R recursive step, except we use vectors $v_i^k$, $k \le m/b$ and $k \in L$, in place of recursive embeddings of shorter substrings (sets $E_i^s$)

Resulting $\varphi_m(z[i:i+m])$ have high dimension, $> m/b$…

Apply Bourgain’s Lemma to the vectors $\varphi_m(z[i:i+m])$, $i = 1..n-m$

[Bourgain]: given $n$ vectors $q_i$, construct $n$ vectors $\tilde q_i$ of $O(\log^2 n)$ dimension such that $\|q_i - q_j\|_1 \approx \|\tilde q_i - \tilde q_j\|_1$ up to $O(\log n)$ distortion

Apply to vectors $\varphi_m(z[i:i+m])$ to obtain vectors $v_i^m$ of polylogarithmic dimension

This incurs $O(\log n)$ distortion at each step of the recursion, but that is OK as there are only ~$\sqrt{\log n}$ steps, giving an additional distortion of only $\exp[O(\sqrt{\log n})]$

$\|v_i^m - v_j^m\|_1 \approx \mathrm{ed}(z[i:i+m], z[j:j+m])$

Page 12: Approximating Edit Distance in Near-Linear Time

Idea: implementation

Essential step is the Main Lemma:

Main Lemma: fix $n$ vectors $v_i \in \ell_1$ of dimension $p = O(\log^2 n)$. Let $s < n$. Define $A_i = \{v_i, v_{i+1}, \dots, v_{i+s-1}\}$.

Then we can compute vectors $q_i \in \ell_1^k$ for $k = O(\log^2 n)$ such that $\|q_i - q_j\|_1 \approx \mathrm{TEMD}(A_i, A_j)$ up to distortion $\log^{O(1)} n$

Computing the $q_i$’s takes $O(n)$ time.

Page 13: Approximating Edit Distance in Near-Linear Time

Proof of Main Lemma

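Graph-metric: shortest path on a weighted graph

Sparse: $O(n)$ edges

“low” = $\log^{O(1)} n$

$\min^k M$ is a semi-metric on $M^k$ with “distance” $d_{\min,M}(x,y) = \min_{i=1..k} d_M(x_i, y_i)$

[Diagram: chain of embeddings, with the label on each arrow]

TEMD over $n$ sets $A_i$ → $\min^{\text{low}} \ell_1^{\text{high}}$: $O(\log^2 n)$

$\min^{\text{low}} \ell_1^{\text{high}}$ → $\min^{\text{low}} \ell_1^{\text{low}}$: $O(1)$

$\min^{\text{low}} \ell_1^{\text{low}}$ → $\min^{\text{low}}$ tree-metric: $O(\log n)$

$\min^{\text{low}}$ tree-metric → sparse graph-metric: $O(\log^3 n)$

sparse graph-metric → $\ell_1^{\text{low}}$: $O(\log n)$ [Bourgain] (efficient)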
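The min-product "distance" defined on this slide is easy to state in code, and a three-point example shows why it is only a semi-metric: the triangle inequality can fail. This is a small illustrative sketch; `min_product_distance` and `abs_diff` are names made up for the example, and the base metric is taken to be the real line.

```python
def min_product_distance(x, y, d):
    """d_min(x, y) = min over coordinates i of d(x[i], y[i])."""
    return min(d(xi, yi) for xi, yi in zip(x, y))


def abs_diff(a, b):
    """Base metric for the example: distance on the real line."""
    return abs(a - b)


x, y, z = (0, 0), (0, 10), (10, 10)
d_xy = min_product_distance(x, y, abs_diff)  # 0: first coordinates agree
d_yz = min_product_distance(y, z, abs_diff)  # 0: second coordinates agree
d_xz = min_product_distance(x, z, abs_diff)  # 10: both coordinates differ by 10
print(d_xz > d_xy + d_yz)  # True: the triangle inequality is violated
```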

Page 14: Approximating Edit Distance in Near-Linear Time

Step 1

Lemma 1: can embed TEMD over $n$ sets in $(\{0..M\}^p, \ell_1)$ into $\min^{O(\log n)} \ell_1^{M^p}$ with $O(\log^2 n)$ distortion, w.h.p.

Use [A-Indyk-Krauthgamer’08] (similar to Ostrovsky-Rabani embedding)

Embedding: for each Δ = power of 2, impose a randomly-shifted grid; one coordinate per cell, equal to the # of points in the cell

Theorem [AIK]: no contraction w.h.p.; expected expansion = $O(\log^2 n)$

Just repeat $O(\log n)$ times

[Diagram: TEMD over $n$ sets $A_i$ → $\min^{\text{low}} \ell_1^{\text{high}}$, labelled $O(\log^2 n)$]
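The grid construction can be sketched as follows: one randomly shifted grid per scale Δ, one coordinate per nonempty cell, holding the number of points in that cell. This covers a single repetition only (the min over $O(\log n)$ repetitions is not shown), and any per-scale weighting of the coordinates is not stated on the slide, so it is left out. The names `grid_embedding` and `l1_sparse`, the sparse `Counter` representation, and the range of scales are assumptions of this example.

```python
import random
from collections import Counter


def grid_embedding(points, p, max_coord, seed=0):
    """Sparse cell-count vector for a multiset of points in {0..max_coord}^p.

    Keys are (delta, cell) pairs; values are the number of points in the cell.
    The same seed must be used for all sets that will be compared.
    """
    rng = random.Random(seed)
    counts = Counter()
    delta = 1
    while delta <= 2 * max_coord:
        # One random shift per scale, shared by all points.
        shift = [rng.randrange(delta) for _ in range(p)]
        for pt in points:
            cell = tuple((c + s) // delta for c, s in zip(pt, shift))
            counts[(delta, cell)] += 1
        delta *= 2
    return counts


def l1_sparse(f, g):
    """l1 distance between two sparse count vectors."""
    return sum(abs(f[k] - g[k]) for k in set(f) | set(g))


A = [(0, 0), (3, 5)]
B = [(7, 1)]
fA = grid_embedding(A, p=2, max_coord=8, seed=1)
fB = grid_embedding(B, p=2, max_coord=8, seed=1)
print(l1_sparse(fA, fB))
```

Because each point contributes independently to the counts, the map is additive over sets: the vector of `A + B` is the sum of the vectors of `A` and `B`, as long as the shifts are shared.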

Page 15: Approximating Edit Distance in Near-Linear Time

Step 2

Lemma 2: can embed an $n$ point set from $\ell_1^M$ into $\min^{O(\log n)} \ell_1^k$, for $k = O(\log^3 n)$, with $O(1)$ distortion.

Use (weak) dimensionality reduction in $\ell_1$

Thm [Indyk’06]: Let $A$ be a matrix of size $M$ by $k = O(\log^3 n)$ with each element chosen from the Cauchy distribution. Then for any $\tilde x = Ax$, $\tilde y = Ay$:

no contraction: $\|\tilde x - \tilde y\|_1 \ge \|x - y\|_1$ (w.h.p.)

5-expansion: $\|\tilde x - \tilde y\|_1 \le 5 \cdot \|x - y\|_1$ (with 0.01 probability)

Just use $O(\log n)$ of such embeddings

[Diagram: $\min^{\text{low}} \ell_1^{\text{high}}$ → $\min^{\text{low}} \ell_1^{\text{low}}$, labelled $O(1)$]
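A minimal sketch of the projection in the theorem: a matrix of independent standard Cauchy entries applied to a vector, repeated several times with the minimum taken over repetitions. Since the 5-expansion holds only with probability 0.01 per matrix, repeating $O(\log n)$ times makes it likely that at least one repetition does not expand much, and the min picks it up. Any normalisation needed to get exactly the stated constants is not given on the slide and is omitted here; the function names are illustrative.

```python
import math
import random


def cauchy_matrix(k, M, seed=0):
    """k x M matrix of independent standard Cauchy samples."""
    rng = random.Random(seed)
    # Inverse-CDF sampling: tan(pi * (u - 1/2)) for u uniform in [0, 1).
    return [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(M)]
            for _ in range(k)]


def project(A, x):
    """Matrix-vector product A x, written out with plain lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def min_over_repetitions(matrices, x, y):
    """Min-product distance: smallest projected l1 distance over all matrices."""
    return min(l1(project(A, x), project(A, y)) for A in matrices)


x = [1.0, 2.0, 3.0]
y = [0.0, 2.0, 5.0]
matrices = [cauchy_matrix(k=5, M=3, seed=r) for r in range(4)]
print(min_over_repetitions(matrices, x, y))
```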

Page 16: Approximating Edit Distance in Near-Linear Time

Efficiency of Step 1+2

From steps 1+2, we get some embedding $f()$ of sets $A_i = \{v_i, v_{i+1}, \dots, v_{i+s-1}\}$ into $\min^{\text{low}} \ell_1^{\text{low}}$

Naively, it would take $\Omega(n \cdot s)$ time to compute all $f(A_i)$, which is $\Omega(n^2)$ for $s = \Omega(n)$

More efficiently:

Note that $f()$ is linear: $f(A) = \sum_{a \in A} f(a)$

Then $f(A_i) = f(A_{i-1}) - f(v_{i-1}) + f(v_{i+s-1})$

Compute $f(A_i)$ in order, for a total of $O(n)$ time
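The sliding-window update can be sketched directly. The example assumes the per-vector images $f(v_i)$ are already available as equal-length lists `fv[i]` (0-indexed), and that $1 \le s \le n$; the function name is illustrative. Each step costs time proportional to the dimension, which is polylogarithmic in the setting of the talk.

```python
def sliding_window_sums(fv, s):
    """Return f(A_i) = fv[i] + ... + fv[i+s-1] for every window start i.

    Uses f(A_i) = f(A_{i-1}) - f(v_{i-1}) + f(v_{i+s-1}), so each new
    window costs O(dim) instead of O(s * dim).
    """
    n, dim = len(fv), len(fv[0])
    # First window is summed directly.
    current = [sum(fv[j][c] for j in range(s)) for c in range(dim)]
    out = [current[:]]
    for i in range(1, n - s + 1):
        for c in range(dim):
            current[c] += fv[i + s - 1][c] - fv[i - 1][c]
        out.append(current[:])
    return out


fv = [[1, 0], [2, 1], [3, 5], [4, 2]]
print(sliding_window_sums(fv, s=2))  # [[3, 1], [5, 6], [7, 7]]
```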

Page 17: Approximating Edit Distance in Near-Linear Time

Step 3

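Lemma 3: can embed $\ell_1$ over $\{0..M\}^p$ into $\min^{O(\log^2 n)}$ tree-metric, with $O(\log n)$ distortion.

For each Δ = a power of 2, take $O(\log n)$ random grids. Each grid gives a min-coordinate

[Diagram: $\min^{\text{low}} \ell_1^{\text{low}}$ → $\min^{\text{low}}$ tree-metric, labelled $O(\log n)$; a grid with cell size Δ]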

Page 18: Approximating Edit Distance in Near-Linear Time

Step 4

Lemma 4: suppose we have $n$ points in $\min^{\text{low}}$ tree-metric, which approximates a metric up to distortion $D$. Can embed into a graph-metric of size $O(n)$ with distortion $D$.

[Diagram: $\min^{\text{low}}$ tree-metric → sparse graph-metric, labelled $O(\log^3 n)$]

Page 19: Approximating Edit Distance in Near-Linear Time

Step 5

Lemma 5: Given a graph with $m$ edges, can embed the graph-metric into $\ell_1^{\text{low}}$ with $O(\log n)$ distortion in $O(m)$ time.

Just implement Bourgain’s embedding:

Choose $O(\log^2 n)$ sets $B_i$

Need to compute the distance from each node to each $B_i$

For each $B_i$ can compute its distance to each node using Dijkstra’s algorithm in $O(m)$ time

[Diagram: sparse graph-metric → $\ell_1^{\text{low}}$, labelled $O(\log n)$]
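The steps above can be sketched with a multi-source Dijkstra: for each random set $B_i$, one run gives the distance from every node to the nearest member of $B_i$, and that distance becomes one coordinate. The sampling scheme below (sets of sizes 1, 2, 4, … with `reps` sets per size) is the usual textbook form of Bourgain's embedding and is an assumption here, as are the function names; coordinate normalisation is omitted. With a binary heap each run costs $O(m \log n)$ rather than the $O(m)$ quoted on the slide, and the graph is assumed connected with integer node labels.

```python
import heapq
import random


def dist_to_set(adj, sources):
    """Multi-source Dijkstra: distance from every node to the nearest source.

    adj maps each node to a list of (neighbour, weight) pairs.
    """
    dist = {u: float("inf") for u in adj}
    heap = []
    for s in sources:
        dist[s] = 0.0
        heap.append((0.0, s))
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def bourgain_embedding(adj, reps, seed=0):
    """One coordinate per random set B: the distance from the node to B."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    coords = {u: [] for u in nodes}
    size = 1
    while size <= len(nodes):
        for _ in range(reps):
            B = rng.sample(nodes, size)
            d = dist_to_set(adj, B)
            for u in nodes:
                coords[u].append(d[u])
        size *= 2
    return coords


# Path graph 0 - 1 - 2 - 3 with unit weights.
adj = {0: [(1, 1.0)], 1: [(0, 1.0), (2, 1.0)], 2: [(1, 1.0), (3, 1.0)], 3: [(2, 1.0)]}
emb = bourgain_embedding(adj, reps=2, seed=0)
print(len(emb[0]))  # 6 coordinates: set sizes 1, 2, 4 with 2 sets each
```

Each coordinate is 1-Lipschitz with respect to the graph metric, so no single coordinate ever expands a distance.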

Page 20: Approximating Edit Distance in Near-Linear Time

Summary of Main Lemma

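Min-product helps to get low dimension (~small-size sketch); bypasses impossibility of dim-reduction in $\ell_1$

Ok that it is not a metric, as long as it is close to a metric

[Diagram: chain of embeddings, with the label on each arrow; the steps are additionally marked as oblivious or non-oblivious]

TEMD over $n$ sets $A_i$ → $\min^{\text{low}} \ell_1^{\text{high}}$: $O(\log^2 n)$

$\min^{\text{low}} \ell_1^{\text{high}}$ → $\min^{\text{low}} \ell_1^{\text{low}}$: $O(1)$

$\min^{\text{low}} \ell_1^{\text{low}}$ → $\min^{\text{low}}$ tree-metric: $O(\log n)$

$\min^{\text{low}}$ tree-metric → sparse graph-metric: $O(\log^3 n)$

sparse graph-metric → $\ell_1^{\text{low}}$: $O(\log n)$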

Page 21: Approximating Edit Distance in Near-Linear Time

Conclusion + a question

Theorem: can compute $\mathrm{ed}(x,y)$ in $n \cdot 2^{O(\sqrt{\log n})}$ time with $2^{O(\sqrt{\log n})}$ approximation

Question: can we do the following “oblivious” dimensionality reduction in $\ell_1$?

Given $n$, construct a randomized embedding $\varphi: \ell_1^M \to \ell_1^{\mathrm{polylog}\, n}$ such that for any $v_1, \dots, v_n \in \ell_1^M$, with high probability, $\varphi$ has distortion $\log^{O(1)} n$ on these vectors?

If $\varphi$ exists, it cannot be linear [Charikar-Sahai’02]
