Polylogarithmic Approximation for Edit Distance (and the Asymmetric Query Complexity) Robert Krauthgamer [Weizmann Institute] Joint with: Alexandr Andoni [Microsoft SVC] Krzysztof Onak [CMU]


Page 3: Edit Distance (Levenshtein distance)

Given two strings x, y ∈ Σ^n:

ed(x,y) = minimum number of character operations (insertion/deletion/substitution) that transform x into y.

Example: ed( banana , ananas ) = 2

Applications:

• Computational Biology

• Text processing

• Web search

Page 4: Basic task

Compute ed(x,y) for input x, y ∈ Σ^n

O(n^2) time [WF’74]

[Figure: dynamic-programming table D(i,j) for x = banana (columns) and y = ananas (rows); the final entry gives ed(x,y) = 2]

D(i,j) = ed( x[1:i], y[1:j] ), computed as the minimum of:

D(i-1, j-1), if x[i] = y[j]

D(i-1, j-1) + 1, if x[i] ≠ y[j] (substitution)

D(i, j-1) + 1 (insertion)

D(i-1, j) + 1 (deletion)
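The table-filling computation for D(i,j) can be sketched as follows. This is a minimal illustration of the quadratic dynamic program, including the substitution case from the definition of ed; the function name `edit_distance` and the 0-based indexing are choices made here, not taken from the slides.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance via the quadratic dynamic program D(i, j)."""
    n, m = len(x), len(y)
    # D[i][j] = ed(x[:i], y[:j]); row 0 and column 0 are pure insertions/deletions.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j - 1] + mismatch,  # match or substitution
                D[i][j - 1] + 1,             # insertion of y[j]
                D[i - 1][j] + 1,             # deletion of x[i]
            )
    return D[n][m]


print(edit_distance("banana", "ananas"))  # 2
```

The two nested loops visit every cell once, which is where the O(n^2) running time comes from.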

Faster algorithms?

Page 5: Faster Algorithms?

Compute ed(x,y) for given x, y ∈ Σ^n

• O(n^2) time [WF’74]

• O(n^2 / log^2 n) time [MP’80]

• Linear time (or near-linear)?

• Specific cases (average, smoothed, restricted input) and variants (block edit distance, etc.) [U’83, LV’85, M’86, GG’88, GP’89, UW’90, CL’90, CH’98, LMS’98, U’85, CL’92, N’99, CPSV’00, MS’00, CM’02, AK’08, BF’08…]

• 2^{O(√log n)} approximation [OR’05, AO’09], improving the earlier n^c-approximation [BEKMRRS’03, BJKK’04, BES’06]

• Same “barrier” of 2^{O(√log n)}-approximation also for related tasks: nearest neighbor search (text indexing), embedding into normed spaces, sketching [OR’05]

Page 6: Results I

Theorem 1: Can approximate ed(x,y) within a (log n)^{O(1/ε)} factor in time n^{1+ε} (for any ε>0).

• Exponential improvement over the previous factor 2^{O(√log n)}

• Fallout from the study of the asymmetric query model …
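One way to read the exponential improvement in Theorem 1 is to write both approximation factors as powers of 2. The comparison below is a worked restatement only, assuming base-2 logarithms, a fixed ε, and suppressed hidden constants:

```latex
\[
(\log n)^{O(1/\varepsilon)}
  \;=\; 2^{\,O\left(\frac{1}{\varepsilon}\,\log\log n\right)}
\qquad\text{versus}\qquad
2^{\,O(\sqrt{\log n})}
  \;=\; 2^{\,O\left(2^{\frac{1}{2}\log\log n}\right)} .
\]
```

For fixed ε the exponent drops from about √log n to about log log n, and √log n is itself exponential in log log n.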

Page 7: Approach: asymmetric query model

• “Compress” one string, x, to n^ε information

• Use dynamic programming to compute ed(x,y) in n^{1+ε} time

• How to compress? Carefully subsample x…

• Focus on sample-size (number of queried positions) in x, for fixed y ⇒ Obtain near-tight bounds

[Figure: the two strings x and y]

Page 8: Results II: Asymmetric Query Complexity

Problem: Decide ed(x,y) ≥ n/10 vs ed(x,y) ≤ n/A

Complexity = # queries into x (unlimited access to y)

Approximation A               # Queries (upper bound)   # Queries (lower bound)
(log n)^{O(1/ε)}              O(n^ε)                    Ω(n^{ε/loglog n})
[n^{1/(t+1)}, n^{1/t-ε}]      O(log^t n)                Ω(log^t n)

[Figure: axis of approximation factors A, marked n^{1-ε}, n^{1/2}, n^{1/2-ε}, n^{1/3}, n^{1/4}, …, n^{1/t-ε}, n^{1/(t+1)}, with # queries Θ(log n), Θ(log^2 n), Θ(log^3 n), …, Θ(log^t n) on the successive intervals]

Page 9: Upper bound

Theorem 2: Can distinguish ed(x,y) ≥ n/10 vs ed(x,y) ≤ n/A for A = (log n)^{O(1/ε)} approximation with n^ε queries into x (for any ε>0).

Proof structure:

1. Characterize edit distance by “tree-distance” T_xy, with parameter b ≥ 2 (degree): T_xy ≈ ed(x,y) up to a 6b·log n factor

2. Prune the tree to subsample x

[Figure: tree of degree b over the positions x1, x2, …, xn, with the sampled positions in x marked]

Page 10: Step 1: Tree distance

Partition x into b blocks, recursively, for h = log_b n levels

[Figure: the tree for b=3. The root x[1:n] has children x[1:⅓n], x[⅓n:⅔n], x[⅔n:n], down to the leaves x[1], x[2], x[3], …; alongside, y[1:n] with a block y[u:u+⅓n] compared against x[u:u+⅓n]]

T_i(s,u) = T-distance between x[s:s+ℓ_i] and y[u:u+ℓ_i], where ℓ_i is the block length at level i

Page 11: Tree distance: recursive definition

Recall T_i(s,u) = distance between x[s:s+ℓ_i] and y[u:u+ℓ_i]

Base case: T_h(s,u) = Hamming(x[s], y[u])

Output: T_xy = T_0(s=1, u=1)

[Figure: block x[s:s+ℓ_i] of x aligned with block y[u:u+ℓ_i] of y, with a displacement labeled r0]
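The recursive step itself did not survive in this transcript; only the base case and the output are stated. The sketch below therefore ASSUMES a recursion of the form T_i(s,u) = Σ_j min_r [ T_{i+1}(s + j·ℓ_{i+1}, u + j·ℓ_{i+1} + r) + |r| ], where each of the b child blocks may be displaced by r positions at cost |r| (suggested by the displacement r0 in the figure). The name `tree_distance`, the 0-based indexing, and the treatment of positions outside y as mismatches are also assumptions made for illustration.

```python
from functools import lru_cache


def tree_distance(x: str, y: str, b: int = 2) -> int:
    """T_0 at the root, under the ASSUMED recursion described in the lead-in."""
    n = len(x)
    if n < 1:
        raise ValueError("x must be non-empty")
    # Block lengths l_0 = n, l_1 = n/b, ..., l_h = 1; n must be a power of b.
    lengths = [n]
    while lengths[-1] > 1:
        if lengths[-1] % b != 0:
            raise ValueError("len(x) must be a power of b")
        lengths.append(lengths[-1] // b)
    h = len(lengths) - 1

    @lru_cache(maxsize=None)
    def T(i: int, s: int, u: int) -> int:
        if i == h:
            # Base case from the slide: Hamming distance of single characters.
            # ASSUMPTION: a position outside y never matches.
            return 0 if 0 <= u < len(y) and x[s] == y[u] else 1
        child = lengths[i + 1]
        total = 0
        for j in range(b):
            # ASSUMED step: displace the j-th child block by r, paying |r|.
            # T at level i+1 is at most `child`, so |r| > child never helps.
            total += min(
                T(i + 1, s + j * child, u + j * child + r) + abs(r)
                for r in range(-child, child + 1)
            )
        return total

    return T(0, 0, 0)


print(tree_distance("0110", "0110"))  # 0
print(tree_distance("0000", "1111"))  # 4
```

The memoized recursion is meant only for small inputs; it makes no attempt at the pruning discussed later in the talk.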

Page 12: T-distance approximates edit distance

Lemma: T_xy ≈ ed(x,y) up to a 6b·log_b n factor.

• Hierarchical decomposition inspired by earlier approaches [BEKMRRS’03, OR’05]

• All had an approximation recurrence of the type A(n) = c·A(n/b) + b for c ≥ 2

• Solves to an A(n) ≥ 2^{√log n} factor for every choice of b

• Our characterization has no multiplicative loss (c=1): A(n) = A(n/b) + b

• Analysis inspired by algorithms for smoothed edit distance [AK’08]
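The two recurrences can be unrolled explicitly. The derivation below is a worked sketch under assumptions not stated on the slide: n is a power of b with n ≥ b, A(1) ≥ 0, and logarithms are base 2.

```latex
\begin{align*}
A(n) &= c\,A(n/b) + b
      \;=\; c^{h} A(1) + b \sum_{i=0}^{h-1} c^{i},
      \qquad h = \log_b n = \frac{\log n}{\log b}, \\[4pt]
c \ge 2:\quad
A(n) &\ge b\,c^{h-1} \;\ge\; 2^{\,\log b + \frac{\log n}{\log b} - 1}
      \;\ge\; 2^{\,2\sqrt{\log n} - 1} \;\ge\; 2^{\sqrt{\log n}}
      \quad (n \ge 2), \\[4pt]
c = 1:\quad
A(n) &= A(1) + b \log_b n .
\end{align*}
```

The middle line uses log b + (log n)/(log b) ≥ 2√log n, which holds for every b ≥ 2, so no choice of b escapes the 2^{√log n} factor when c ≥ 2. With c = 1 the loss is only additive, b per level over log_b n levels.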

Page 13: Step 2: Compute the tree distance

• For b=2, T-distance gives an O(log n) approximation!

• BUT we know only how to compute the T-distance in O(n^2) time

• Instead, for b = (log n)^{1/ε}, can prune the tree to n^{O(ε)} nodes, and get a 1+ε approximation

• Pruning: subsample (log n)^{O(1)} children out of each node. This works only when ed(x,y) ≥ Ω(n); generally, must subsample the tree non-uniformly, using the Precision Sampling Lemma

[Figure: tree of degree b with the sampled positions in x marked]

Page 14: Key tool: non-uniform sampling

Goal: For unknown a_1, a_2, …, a_n ∈ [0,1], estimate their sum, up to an additive constant error, using only “weak” estimates ã_1, ã_2, …, ã_n

Game between a Sum Estimator and an Adversary:

0. Sum Estimator: fix distribution U

1. Adversary: fix a_1, a_2, …, a_n (unknown)

2. Sum Estimator: pick “precisions” u_i (our algorithm: u_i ~ U i.i.d.)

3. Adversary: provide ã_1, ã_2, …, ã_n s.t. |a_i − ã_i| < 1/u_i

4. Sum Estimator: report S = S(ã_1, …, ã_n, u_1, …, u_n) with |S − ∑ a_i| < 1
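To make the game concrete, the toy simulation below plays it with a naive estimator that simply adds the weak estimates, using the same precision u for every i. It does NOT implement the precision sampling estimator from the talk; it only illustrates why uniform precision needs u ≈ n to keep the additive error below 1. The function names and the particular adversary (which uses almost all of the allowed slack 1/u_i) are assumptions made for illustration.

```python
def worst_case_estimates(a, u):
    """Adversary: shift every a_i by almost the full allowed slack 1/u_i."""
    return [a_i + 0.999 / u_i for a_i, u_i in zip(a, u)]


def naive_sum_error(a, u):
    """Estimator that just adds the weak estimates; returns S - sum(a)."""
    estimates = worst_case_estimates(a, u)
    return sum(estimates) - sum(a)


n = 1024
a = [(i % 7) / 7 for i in range(n)]   # arbitrary values in [0, 1]
print(naive_sum_error(a, [n] * n))    # just under 1: precision n suffices
print(naive_sum_error(a, [10] * n))   # about 102: precision near log n does not
```

With precisions drawn non-uniformly, the estimator has more freedom than this baseline, which is the setting the game above describes.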

Page 15: Precision Sampling

Goal: estimate ∑ a_i from {ã_i} s.t. |a_i − ã_i| < 1/u_i.

Precision Sampling Lemma: Can achieve WHP additive error 1 and multiplicative error 1.5 with expected precision E_{u_i~U}[u_i] = O(log n).

• Inspired by a technique from [IW’05] for streaming (F_k moments)

• In fact, PSL gives simple & improved algorithms for F_k moments, cascaded (mixed) norms, ℓ_p-sampling problems [AKO’10]

• Also a distant relative of Priority Sampling [DLT’07]

Page 16: Precision Sampling for Edit Distance

• Apply Precision Sampling to the tree from the characterization, recursively at each node

• If a node has very weak precision, can trim the entire sub-tree

Page 17: Lower Bound Theorem

Theorem 3: Achieving approximation A = O(log^7 n) for edit distance requires asymmetric query complexity n^{Ω(1/loglog n)}. I.e., distinguishing ed(x,y) > n/10 vs ed(x,y) < n/(10A)

Implications:

• First lower bound to expose hardness from repetitiveness in edit distance

• Contrast with edit distance on non-repetitive strings (Ulam’s distance): empirically easier (better algorithms are known for it), yet all previous lower bounds are essentially equivalent for the two variants [BEKMRRS’03, AN’10, KN’05, KR’06, AK’07, AJP’10]

• But in asymmetric query complexity they differ. Ulam: 2-approx. with O(log n) queries [ACCL’04, SS’10]. Edit: requires n^{Ω(1/loglog n)} queries

Page 18: Lower Bound Techniques

Core gadget: σ(·) = cyclic shift operation

Observation: ed(x, σ^j(x)) ≤ 2j

Lower bound outline:

• Exhibit lower bound via shifts

• Amplification by “composing” the hard instance recursively

We will see here: Theorem 4: The asymmetric query complexity of approximation n^{1/2} to edit distance is Ω(log^2 n)
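The observation ed(x, σ^j(x)) ≤ 2j holds because deleting the first j characters and appending them at the end costs 2j operations. The sketch below checks it on one string; the left-rotation direction of `cyclic_shift`, the function names, and the sample string are assumptions made for illustration.

```python
def cyclic_shift(x: str, j: int) -> str:
    """sigma^j: rotate a non-empty x left by j positions (direction assumed)."""
    j %= len(x)
    return x[j:] + x[:j]


def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance, keeping only one row of the DP table."""
    prev = list(range(len(y) + 1))
    for i, xc in enumerate(x, start=1):
        cur = [i]
        for j, yc in enumerate(y, start=1):
            cur.append(min(prev[j - 1] + (xc != yc),  # match / substitution
                           cur[j - 1] + 1,            # insertion
                           prev[j] + 1))              # deletion
        prev = cur
    return prev[-1]


x = "0010111010011100"
for j in range(len(x)):
    # Delete the first j characters, then append them: at most 2j operations.
    assert edit_distance(x, cyclic_shift(x, j)) <= 2 * j
```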

Page 19: The Shift Gadget

Lemma: Ω(log n) query lower bound for approximation A = n^{0.5}.

Hard distribution (x,y):

• Fix specific z_1, z_2 ∈ {0,1}^n (random-looking)

• Set y = z_1 (e.g. 00101), and either x = σ^j(z_1) (e.g. σ^j(00101)) ⇒ ed(x,y) ≤ 2n^{0.5} [close], or x = σ^j(z_2) (e.g. σ^j(01101)) ⇒ ed(x,y) ≥ n/10 [far]

• Formally: y = z_1 and x = σ^j(z_1 OR z_2) for random j ∈ [n^{0.5}]

• An algorithm is a set of queried positions: Q ⊂ [n], |Q| << log n

• It “reads” (z_1 OR z_2) at positions Q+j

• Claim: Both z_1|_{Q+j} and z_2|_{Q+j} are close to the uniform distribution on {0,1}^{|Q|}, up to ~2^{|Q|}/n^{0.5} statistical distance

• Hence |Q| ≥ Ω(log n), even for approximation A = n^{0.99}
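The quantity in the claim, the statistical distance between what the algorithm reads at positions Q+j and a uniform bit string, can be computed directly for a concrete z. The sketch below is illustrative only: the function name, the wrap-around of Q+j modulo n, and the restriction to strings over {0,1} are assumptions, and it measures the distance rather than proving any bound.

```python
from collections import Counter


def view_distance_from_uniform(z: str, Q, num_shifts: int) -> float:
    """Statistical distance between z restricted to Q+j, for j uniform in
    range(num_shifts), and the uniform distribution on {0,1}^|Q|."""
    n, k = len(z), len(Q)
    # Empirical distribution of the |Q| characters the algorithm reads.
    counts = Counter(
        tuple(z[(q + j) % n] for q in Q) for j in range(num_shifts)
    )
    uniform = 2.0 ** -k
    seen = sum(abs(c / num_shifts - uniform) for c in counts.values())
    unseen = (2 ** k - len(counts)) * uniform  # patterns that never occur
    return (seen + unseen) / 2


print(view_distance_from_uniform("0" * 16, [0], 4))   # 0.5
print(view_distance_from_uniform("01" * 8, [0], 4))   # 0.0
```

To mirror the slide, one would call it with `num_shifts = int(len(z) ** 0.5)` and a random-looking z.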

Page 20: Amplification via Substitution Product

Ω(log^2 n) lower bound by amplification: “compose” two shift instances

Hard distribution (x,y):

• Fix z_1, z_2 ∈ {0,1}^{√n}, w_0, w_1 ∈ {0,1}^{√n} and y = z_1(w_0,w_1) (substitution)

• Choose either z = z_1 (close) or z = z_2 (far)

• x = z(w_0,w_1) but with random shifts j ∈ [n^{1/3}] inside each block and between blocks

Intuition: must distinguish z = z_1 from z = z_2

• Must “learn” Ω(log n) positions i of z, and each requires reading Ω(log n) further positions in the corresponding blocks w_{z[i]}

[Figure: example with z_1 = 00101, w_0 = 11011, w_1 = 00111, and the resulting x built from shifted copies of w_0 and w_1]
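The substitution z(w_0,w_1) replaces each bit of z by the block w_0 or w_1. The sketch below builds y for the example values on this slide. `shifted_substitute` is only one ASSUMED reading of "shifts inside each block and between blocks" (a cyclic shift of z at the block level, plus a cyclic shift inside each block), with the shift amounts passed in explicitly instead of being drawn at random; all function names are hypothetical.

```python
def cyclic_shift(x: str, j: int) -> str:
    """Rotate a non-empty string left by j positions."""
    j %= len(x)
    return x[j:] + x[:j]


def substitute(z: str, w0: str, w1: str) -> str:
    """z(w0, w1): replace each bit of z by block w0 (for '0') or w1 (for '1')."""
    return "".join(w1 if bit == "1" else w0 for bit in z)


def shifted_substitute(z, w0, w1, outer_shift, inner_shifts):
    """ASSUMED form of x: shift z between blocks, then shift inside each block."""
    blocks = [w1 if bit == "1" else w0 for bit in cyclic_shift(z, outer_shift)]
    return "".join(cyclic_shift(blk, j) for blk, j in zip(blocks, inner_shifts))


y = substitute("00101", "11011", "00111")
print(y)  # 1101111011001111101100111
x = shifted_substitute("00101", "11011", "00111", 1, [2, 0, 1, 0, 3])
print(len(x) == len(y))  # True
```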

Page 21: Towards the Full Theorem

For the full theorem: recursive composition

Proof overview:

1. Define α-similarity of k distributions (α ≈ information per query)

2. α-similarity ⇒ query lower bound 1/α (for adaptive algorithms)

3. Initial “Shift metric” has high α-similarity (induction basis)

4. α-similarity amplified under substitution product (inductive step)

5. Prove edit distance concentrates well (requires large alphabet)

6. Can reduce large alphabet to binary (lossy, but done once)

Page 22: Conclusion

• We compute ed(x,y) up to a (log n)^{O(1/ε)} approximation in n^{1+ε} time

• Via Asymmetric Query Complexity (new model)

Open questions:

• Do faster / limitations: e.g., O(log^2 n) approximation in n^{1+o(1)} time?

• Use these insights for related problems: Nearest Neighbor Search? Sublinear-time algorithms (symmetric queries)? Embeddings? Communication complexity?

Further thoughts:

• Practical ramifications?

• Asymmetric queries model?

• Paradigm for “fast dynamic programming”?