Approximating Edit Distance in Near-Linear Time

Alexandr Andoni (MIT)

Joint work with Krzysztof Onak (MIT)

Edit Distance

For two strings x, y ∈ Σ^n

ed(x,y) = minimum number of edit operations to transform x into y

Edit operations = insertion/deletion/substitution

Important in: computational biology, text processing, etc.

Example:

ed(0101010, 1010101) = 2
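The example can be checked with the textbook quadratic-time dynamic program. The sketch below is the classical exact algorithm, not the near-linear approximation this talk presents; the function name `edit_distance` is an illustrative choice.

```python
def edit_distance(x, y):
    """Exact edit distance by the classical O(|x| * |y|) dynamic program."""
    # prev[j] = edit distance between the current prefix of x and y[:j]
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]  # transforming x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, 1):
            cur.append(min(
                prev[j] + 1,               # delete cx
                cur[j - 1] + 1,            # insert cy
                prev[j - 1] + (cx != cy),  # substitute, free if characters match
            ))
        prev = cur
    return prev[-1]


print(edit_distance("0101010", "1010101"))  # delete the leading 0, append a 1
```

Only two rows of the table are kept at a time, so memory is linear in |y| while time stays quadratic.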

Computing Edit Distance

Problem: compute ed(x,y) for given x, y ∈ {0,1}^n

Exactly:

O(n^2) [Levenshtein’65]

O(n^2/log^2 n) for |Σ| = O(1) [Masek-Paterson’80]

Approximately in n^{1+o(1)} time:

n^{1/3+o(1)} approximation [Batu-Ergun-Sahinalp’06], improving over [Sahinalp-Vishkin’96, Cole-Hariharan’02, BarYossef-Jayram-Krauthgamer-Kumar’04]

Sublinear time:

≤ n^{1-ε} vs ≥ n/100 in n^{1-2ε} time [Batu-Ergun-Kilian-Magen-Raskhodnikova-Rubinfeld-Sami’03]

Computing via embedding into ℓ_1

Embedding: f: {0,1}^n → ℓ_1

such that ed(x,y) ≈ ||f(x) - f(y)||_1 up to some distortion (= approximation)

Can compute ed(x,y) in time to compute f(x)

Best embedding by [Ostrovsky-Rabani’05]: distortion = 2^{O(√log n)}

Computation time: ~n^2 randomized (and similar dimension)

Helps for nearest neighbor search, sketching, but not computation…

Our result

Theorem: Can compute ed(x,y) in n*2^{O(√log n)} time with 2^{O(√log n)} approximation

While the algorithm uses some ideas of the [OR’05] embedding, it is not an algorithm for computing the [OR’05] embedding

Review of Ostrovsky-Rabani embedding

φ_m = embedding of strings of length m

δ(m) = distortion of φ_m

Embedding is recursive

Partition into b blocks (b later chosen to be exp(√log m))

Use embeddings φ_k for k ≤ m/b

Embed each block separately as follows…

[Figure: string x partitioned into blocks of length m/b]

Ostrovsky-Rabani embedding (II)

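[Figure: string x split into b blocks, with sets E_1^s, E_2^s, E_3^s, …, E_b^s of substrings of length s]

E_i^s = recursive embeddings of the substrings of length s of block i

Want to approximate ed(x,y) by ∑_{i=1..b} ∑_{s∈S} TEMD_s(E_i^s(x), E_i^s(y))

EMD(A,B) = min-cost bipartite matching; the T in TEMD stands for thresholded

Finish by embedding TEMD into ℓ_1 with small distortion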
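To make EMD as min-cost bipartite matching concrete, here is a minimal brute-force Python sketch for tiny point sets. The thresholding rule used here (each matched pair's ℓ_1 distance is capped at a threshold) is an assumption about what "thresholded" means on the slide, and the names `l1` and `emd` are illustrative.

```python
from itertools import permutations


def l1(a, b):
    """l_1 distance between two points given as equal-length tuples."""
    return sum(abs(p - q) for p, q in zip(a, b))


def emd(A, B, threshold=None):
    """Min-cost perfect matching between equal-size point lists A and B.

    Brute force over all matchings, so only suitable for very small sets.
    If threshold is given, each pair's cost is capped at it (assumed
    reading of the thresholded variant, TEMD).
    """
    assert len(A) == len(B)
    best = None
    for perm in permutations(range(len(B))):
        cost = 0
        for i, j in enumerate(perm):
            d = l1(A[i], B[j])
            cost += d if threshold is None else min(d, threshold)
        if best is None or cost < best:
            best = cost
    return best


A = [(0, 0), (0, 0)]
B = [(0, 0), (100, 100)]
print(emd(A, B), emd(A, B, threshold=3))
```

Capping each pair's cost keeps a single far-away outlier from dominating the sum, which is the effect the threshold is meant to illustrate here.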

Distortion of [OR] embedding

Suppose we can embed TEMD into ℓ_1 with distortion (log m)^{O(1)}

Then [Ostrovsky-Rabani’05] show that distortion of φ_m is δ(m) ≤ (log m)^{O(1)} * [δ(m/b) + b]

For b = exp[√log m]: δ(m) ≤ exp[O(√log m)]

Why it is expensive to compute [OR] embedding

In first step, need to compute recursive embedding for ~n/b strings of length ~n/b

The dimension blows up

[Figure: string x with the set E_1^s, the recursive embeddings of the substrings of length s]

Our Algorithm

For each length m in some fixed set L ⊆ [n], compute vectors v_i^m ∈ ℓ_1 such that ||v_i^m - v_j^m||_1 ≈ ed( z[i:i+m], z[j:j+m] ) up to distortion δ(m)

Dimension of v_i^m is only O(log^2 n)

Vectors v_i^m are computed inductively from v_i^k for k ≤ m/b (k ∈ L)

Output: ed(x,y) ≈ ||v_1^{n/2} - v_{n/2+1}^{n/2}||_1 (i.e., for m = n/2 = |x| = |y|)

[Figure: z = xy, the concatenation of x and y, with the substring z[i:i+m] starting at position i]

Idea: intuition

For each m ∈ L, compute φ_m(z[i:i+m]) as in the O-R recursive step, except we use vectors v_i^k, k < m/b & k ∈ L, in place of recursive embeddings of shorter substrings (sets E_i^s)

Resulting φ_m(z[i:i+m]) have high dimension, > m/b…

Apply Bourgain’s Lemma to vectors φ_m(z[i:i+m]), i = 1..n-m

[Bourgain]: given n vectors q_i, construct n vectors q'_i of O(log^2 n) dimension such that ||q_i - q_j||_1 ≈ ||q'_i - q'_j||_1 up to O(log n) distortion.

Apply to vectors φ_m(z[i:i+m]) to obtain vectors v_i^m of polylogarithmic dimension

Incurs O(log n) distortion at each step of recursion, but OK as there are only ~√log n steps, giving an additional distortion of only exp[O(√log n)]

||v_i^m - v_j^m||_1 ≈ ed( z[i:i+m], z[j:j+m] )

Idea: implementation

Essential step is:

Main Lemma: fix n vectors v_i ∈ ℓ_1, of dimension p = O(log^2 n). Let s < n. Define A_i = {v_i, v_{i+1}, …, v_{i+s-1}}.

Then we can compute vectors q_i ∈ ℓ_1^k for k = O(log^2 n) such that ||q_i - q_j||_1 ≈ TEMD(A_i, A_j) up to distortion log^{O(1)} n

Computing q_i’s takes O(n) time.

Proof of Main Lemma

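Graph-metric: shortest path on a weighted graph

Sparse: O(n) edges

“low” = log^{O(1)} n

min^k M is a semi-metric on M^k with “distance” d_{min,M}(x,y) = min_{i=1..k} d_M(x_i, y_i)

Chain of embeddings, with the distortion noted at each step:

TEMD over n sets A_i → min^{low} ℓ_1^{high}: O(log^2 n)

min^{low} ℓ_1^{high} → min^{low} ℓ_1^{low}: O(1)

min^{low} ℓ_1^{low} → min^{low} tree-metric: O(log n)

min^{low} tree-metric → sparse graph-metric: O(log^3 n)

sparse graph-metric → ℓ_1^{low}: O(log n) [Bourgain] (efficient)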
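The min-product distance used in this proof can be illustrated with a short Python sketch. The function name `min_product_distance` and the choice of the real line as the base metric M are assumptions made for the example.

```python
def min_product_distance(x, y, d):
    """Distance in the min-product of k copies of a metric with distance d.

    x and y are k-tuples of points; the result is the smallest
    coordinate-wise distance.
    """
    return min(d(xi, yi) for xi, yi in zip(x, y))


def line(a, b):
    """Base metric for the example: absolute difference on the real line."""
    return abs(a - b)


x, y, z = (0, 0), (0, 10), (10, 10)
print(min_product_distance(x, y, line))  # 0: the first coordinates agree
print(min_product_distance(y, z, line))  # 0: the second coordinates agree
print(min_product_distance(x, z, line))  # 10: both coordinates differ by 10
```

The three points show why this is only a semi-metric: the distance from x to z is 10 although x to y and y to z are both 0, so the triangle inequality fails.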

Step 1

Lemma 1: can embed TEMD over n sets in ({0..M}^p, ℓ_1) into min^{O(log n)} ℓ_1^{M^p} with O(log^2 n) distortion, w.h.p.

Use [A-Indyk-Krauthgamer’08] (similar to Ostrovsky-Rabani embedding)

Embedding: for each Δ = power of 2, impose a randomly-shifted grid; one coordinate per cell, equal to # of points in the cell

Theorem [AIK]:

no contraction w.h.p.

expected expansion = O(log^2 n)

Just repeat O(log n) times

TEMD over n sets A_i → min^{low} ℓ_1^{high}: O(log^2 n)
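The grid construction can be sketched in Python as follows. This is a simplified illustration and not the [AIK] embedding itself: it draws one random shift per scale, ignores thresholding, and scales each cell count by Δ, which is an assumption borrowed from standard grid embeddings of EMD (the slide only states that the coordinate is the number of points in the cell). All function names are hypothetical.

```python
import random
from collections import Counter


def sample_shifts(p, max_coord, rng):
    """One random shift vector per grid side length delta = 1, 2, 4, ..."""
    shifts = {}
    delta = 1
    while delta <= 2 * (max_coord + 1):
        shifts[delta] = tuple(rng.randrange(delta) for _ in range(p))
        delta *= 2
    return shifts


def grid_embedding(points, shifts):
    """Sparse vector with one coordinate per (scale, grid cell).

    Each point adds delta to the coordinate of the cell that contains it
    (the factor delta is an assumed scaling, see the lead-in).
    """
    vec = Counter()
    for delta, shift in shifts.items():
        for pt in points:
            cell = tuple((c + s) // delta for c, s in zip(pt, shift))
            vec[(delta, cell)] += delta
    return vec


def l1(u, v):
    """l_1 distance between two sparse vectors stored as Counters."""
    return sum(abs(u[key] - v[key]) for key in set(u) | set(v))


shifts = sample_shifts(p=2, max_coord=7, rng=random.Random(0))
A = [(0, 0), (3, 4)]
B = [(0, 0), (3, 5)]
print(l1(grid_embedding(A, shifts), grid_embedding(B, shifts)))
```

Both sets must be embedded with the same shifts for the ℓ_1 distance to be meaningful. The map is additive over points: the vector of a set is the sum of the vectors of its elements.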

Step 2

Lemma 2: can embed an n point set from ℓ_1^M into min^{O(log n)} ℓ_1^k, for k = O(log^3 n), with O(1) distortion.

Use (weak) dimensionality reduction in ℓ_1

Thm [Indyk’06]: Let A be a matrix of size M by k = O(log^3 n) with each element chosen from the Cauchy distribution. Then for any x, y with x' = Ax, y' = Ay:

no contraction: ||x' - y'||_1 ≥ ||x - y||_1 (w.h.p.)

5-expansion: ||x' - y'||_1 ≤ 5*||x - y||_1 (with 0.01 probability)

Just use O(log n) of such embeddings

min^{low} ℓ_1^{high} → min^{low} ℓ_1^{low}: O(1)
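Cauchy projections work for ℓ_1 because the Cauchy distribution is 1-stable: every coordinate of A(x - y) is distributed as ||x - y||_1 times a standard Cauchy variable. The Python sketch below illustrates that property only. It reads off the distance with a median estimator, which is a swapped-in technique chosen for illustration and not the norm-based guarantee with a minimum over repetitions stated on the slide. The function names, the number of rows, and the row-times-vector orientation are assumptions.

```python
import math
import random
import statistics


def cauchy_matrix(k, M, rng):
    """k rows of M independent standard Cauchy entries (inverse-CDF sampling)."""
    return [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(M)]
            for _ in range(k)]


def project(A, x):
    """Linear map x -> Ax for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def median_estimate(px, py):
    """Estimate ||x - y||_1 from projections; the median of |Cauchy| is 1."""
    return statistics.median(abs(a - b) for a, b in zip(px, py))


rng = random.Random(1)
A = cauchy_matrix(1001, 6, rng)
x = [3, 0, 1, 4, 0, 2]
y = [0, 0, 1, 0, 5, 2]  # true l_1 distance is 3 + 4 + 5 = 12
print(median_estimate(project(A, x), project(A, y)))
```

The estimate concentrates around the true distance as the number of rows grows; the heavy tails of the Cauchy distribution are the reason a plain average of the coordinates would not.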

Efficiency of Step 1+2

From steps 1+2, we get some embedding f() of sets A_i = {v_i, v_{i+1}, …, v_{i+s-1}} into min^{low} ℓ_1^{low}

Naively would take Ω(n*s) = Ω(n^2) time to compute all f(A_i)

More efficiently:

Note that f() is linear: f(A) = ∑_{a∈A} f(a)

Then f(A_i) = f(A_{i-1}) - f(v_{i-1}) + f(v_{i+s-1})

Compute f(A_i) in order, for a total of O(n) time
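The sliding-window update can be sketched in Python as follows, assuming the per-vector images f(v_i) are already available as equal-length lists of numbers. The function name `sliding_window_sums` is hypothetical, and indices are 0-based here.

```python
def sliding_window_sums(fv, s):
    """Return f(A_i) = fv[i] + ... + fv[i+s-1] for every window start i.

    Uses f(A_i) = f(A_{i-1}) - f(v_{i-1}) + f(v_{i+s-1}), so each window
    costs one vector update instead of s vector additions.
    """
    n = len(fv)
    if s > n:
        return []
    dim = len(fv[0])
    # First window is summed directly.
    current = [sum(fv[j][d] for j in range(s)) for d in range(dim)]
    out = [current[:]]
    for i in range(1, n - s + 1):
        for d in range(dim):
            current[d] += fv[i + s - 1][d] - fv[i - 1][d]
        out.append(current[:])
    return out


fv = [[1, 0], [2, 1], [3, 5], [4, 2]]
print(sliding_window_sums(fv, 2))  # [[3, 1], [5, 6], [7, 7]]
```

With vectors of low (polylogarithmic) dimension, the total work is the number of windows times the dimension, which matches the near-linear bound on the slide up to that factor.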

Step 3

Lemma 3: can embed ℓ_1 over {0..M}^p into min^{O(log^2 n)} tree-metric, with O(log n) distortion.

For each Δ = a power of 2, take O(log n) random grids. Each grid gives a min-coordinate

min^{low} ℓ_1^{low} → min^{low} tree-metric: O(log n)

[Figure: tree for one grid, with labels ∞ and Δ]

Step 4

Lemma 4: suppose we have n points in min^{low} tree-metric, which approximates a metric up to distortion D. Can embed into a graph-metric of size O(n) with distortion D.

min^{low} tree-metric → sparse graph-metric: O(log^3 n)

Step 5

Lemma 5: Given a graph with m edges, can embed the graph-metric into ℓ_1^{low} with O(log n) distortion in O(m) time.

Just implement Bourgain’s embedding:

Choose O(log^2 n) sets B_i

Need to compute the distance from each node to each B_i

For each B_i can compute its distance to each node using Dijkstra’s algorithm in O(m) time

sparse graph-metric → ℓ_1^{low}: O(log n)
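The two ingredients of this step, distances from every node to a set B_i and the random choice of the sets, can be sketched in Python as below. The graph representation (a dict of adjacency lists), the sampling probabilities 2^{-j}, the parameter `reps`, and the function names are assumptions for illustration; coordinate normalization is omitted, and a binary-heap Dijkstra costs O(m log n) per set.

```python
import heapq
import math
import random


def distances_to_set(adj, sources):
    """Multi-source Dijkstra: shortest-path distance from each node to the set.

    adj maps a node to a list of (neighbor, weight) pairs.
    """
    dist = {u: math.inf for u in adj}
    heap = []
    for s in sources:
        dist[s] = 0
        heap.append((0, s))
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def bourgain_coordinates(adj, rng, reps):
    """One coordinate per random set B: the distance from the node to B."""
    nodes = sorted(adj)
    coords = {u: [] for u in nodes}
    levels = max(1, math.ceil(math.log2(len(nodes))))
    for j in range(1, levels + 1):
        for _ in range(reps):
            # Assumed sampling rule: keep each node with probability 2^-j.
            B = [u for u in nodes if rng.random() < 2.0 ** -j]
            if not B:
                continue
            dist = distances_to_set(adj, B)
            for u in nodes:
                coords[u].append(dist[u])
    return coords


adj = {0: [(1, 1)], 1: [(0, 1), (2, 2)], 2: [(1, 2)]}
print(distances_to_set(adj, [0, 2]))  # {0: 0, 1: 1, 2: 0}
```

Each coordinate is 1-Lipschitz with respect to the graph metric, since the distance to a set changes by at most d(u, v) when moving from u to v; this is why the embedding never expands a distance in any single coordinate.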

Summary of Main Lemma

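Min-product helps to get low dimension (~small-size sketch)

Bypasses impossibility of dim-reduction in ℓ_1

OK that it is not a metric, as long as it is close to a metric

Chain of embeddings, with the distortion noted at each step:

TEMD over n sets A_i → min^{low} ℓ_1^{high}: O(log^2 n)

min^{low} ℓ_1^{high} → min^{low} ℓ_1^{low}: O(1)

min^{low} ℓ_1^{low} → min^{low} tree-metric: O(log n)

min^{low} tree-metric → sparse graph-metric: O(log^3 n)

sparse graph-metric → ℓ_1^{low}: O(log n)

[Figure annotations: “oblivious” and “non-oblivious” label parts of the chain]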

Conclusion + a question

Theorem: can compute ed(x,y) in n*2^{O(√log n)} time with 2^{O(√log n)} approximation

Question: can we do the following “oblivious” dimensionality reduction in ℓ_1?

Given n, construct a randomized embedding φ: ℓ_1^M → ℓ_1^{polylog n} such that for any v_1…v_n ∈ ℓ_1^M, with high probability, φ has distortion log^{O(1)} n on these vectors?

If φ exists, it cannot be linear [Charikar-Sahai’02]
