String Edit Distance Matching Problem With Moves

String Edit Distance Matching Problem With Moves. Graham Cormode, S. Muthukrishnan. November 2001.


Page 1: String Edit Distance Matching Problem With Moves

String Edit Distance Matching Problem With Moves

Graham Cormode, S. Muthukrishnan, November 2001

Page 2: String Edit Distance Matching Problem With Moves

Pattern Matching

Text T of length n, pattern P of length m.

Our goal: find “good” matches of P in T, as measured by some edit distance function d(_,_).

For every i = 1, 2, …, n find:

$D[i] = \min_{j} d(T[i:j], P)$

Page 3: String Edit Distance Matching Problem With Moves

Pattern Matching example

T = “assasin”, P = “ssi”

- D[1] = 2:

  assasin    or    assasin
  _ssi             ss i

- D[3] = 1:

  assasin
    s_si

- Naively, computing every D[i] takes O(mn^3) time.
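The naive bound can be made concrete with a short sketch. It assumes the plain Levenshtein distance (insertions, deletions, replacements; no moves) as d and uses 0-based indices; the helper names `levenshtein` and `naive_D` are illustrative and not from the talk. Trying all O(n^2) substrings that start at each position, with an O(mn) dynamic program for each, gives the O(mn^3) total.

```python
def levenshtein(x, y):
    """Classic dynamic program: insert, delete, replace (no moves)."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                # delete cx
                           cur[j - 1] + 1,             # insert cy
                           prev[j - 1] + (cx != cy)))  # replace or match
        prev = cur
    return prev[-1]


def naive_D(text, pattern):
    """D[i] = min over all non-empty substrings text[i:j] of d(text[i:j], pattern)."""
    n = len(text)
    return [min(levenshtein(text[i:j], pattern) for j in range(i + 1, n + 1))
            for i in range(n)]


D = naive_D("assasin", "ssi")
print(D[0], D[2])  # 2 1
```

Entries 0 and 2 of the returned list correspond to D[1] and D[3] in the 1-based numbering of the slide.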

Page 4: String Edit Distance Matching Problem With Moves

The main idea

d(X,Y) = smallest number of operations needed to turn X into Y.

The operations are insertion, deletion and replacement of a character, and moves of a substring.

The idea is to approximate D[i] up to a factor of O(log n log* n).
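To illustrate the extra operation, the sketch below applies one substring move. The function name `move_substring` and its argument convention (cut `s[i:j]`, then reinsert it at index `k` of what remains) are assumptions made for this example only. A single move turns “abcdef” into “defabc”, so their distance with moves is 1, whereas character edits alone would need several operations.

```python
def move_substring(s, i, j, k):
    """Cut s[i:j] out of s and reinsert it at index k of the remaining string."""
    piece = s[i:j]
    rest = s[:i] + s[j:]
    return rest[:k] + piece + rest[k:]


print(move_substring("abcdef", 0, 3, 3))  # defabc
```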

Page 5: String Edit Distance Matching Problem With Moves

General algorithm

Embed the string distance into the L1 vector distance, up to an O(log n log* n) factor:

compute the vector with a single pass over the string.

Find the vector representations of O(n) substrings.

Do all this in O(n log n) time (a deterministic algorithm with an approximate result).

Page 6: String Edit Distance Matching Problem With Moves

Edit Sensitive Parsing (ESP) for the embedding

We want to parse the string so that edit operations have only a limited effect on the parsing (an edit operation on character i changes the parsing only in the “neighborhood” of i).

For example:

“abcbbabcbdfgj” → “a bc bb abc b dfgj”

“xyzbbabcbdfgj” → “xy z bb abc b dfgj”

Page 7: String Edit Distance Matching Problem With Moves

Edit Sensitive Parsing (ESP) for the embedding

In practice, find landmarks in the string that depend only on their local neighborhood, and parse around them.

Local maxima are good landmarks, but in large alphabets they may be far apart (e.g. “abcegiklmrtabc”), so we will reduce the alphabet.

Page 8: String Edit Distance Matching Problem With Moves

Edit Sensitive Parsing (ESP): choosing the landmarks

We use a technique called alphabet reduction:

Text:    c   a   b   a   g   e   f   a   c   e   d
Binary: 010 000 001 000 110 100 101 000 010 100 011
Label:  00   2   1   0   3   2   1   0   3   2   3

Label(A[i]) = 2l + bit(l, A[i]), where l is the position of the first bit (counting from the least significant bit, starting at 0) in which A[i] differs from A[i-1], and bit(l, A[i]) is the value of bit l of A[i].
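One round of the labelling rule can be sketched as follows. The function name `reduce_labels` and the encoding a = 0, b = 1, … are assumptions for illustration; bit positions are counted from the least significant bit. The sketch runs on the first ten characters of the slide's text: for the final character d = 011 (left neighbour e = 100), the formula as stated yields 1 rather than the 3 listed on the slide, so the example stops one character earlier.

```python
def reduce_labels(codes):
    """One round of alphabet reduction over a list of integer character codes.

    The first character has no left neighbour, so no label is produced for it.
    """
    labels = []
    for prev, cur in zip(codes, codes[1:]):
        if prev == cur:
            raise ValueError("adjacent characters must differ")
        diff = prev ^ cur
        l = (diff & -diff).bit_length() - 1   # index of the lowest differing bit
        labels.append(2 * l + ((cur >> l) & 1))
    return labels


codes = [ord(ch) - ord("a") for ch in "cabageface"]
print(reduce_labels(codes))  # [2, 1, 0, 3, 2, 1, 0, 3, 2]
```

Because two neighbours either differ in position l or, at the same l, differ in the bit value, adjacent labels produced this way are never equal.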

Page 9: String Edit Distance Matching Problem With Moves

Edit Sensitive Parsing (ESP): alphabet reduction (cont.)

- In each iteration the alphabet size is reduced from |Σ| to 2 log|Σ|.
- After log*|Σ| iterations we get |Σ| < 7.
- Then we reduce the alphabet size from 6 to 3, ensuring no adjacent pairs are identical (start from 6 then 5 then 3):

Final iteration: 2 1 0 3 2 1 0 3 2 3

After reduction: 2 1 0 1 2 1 0 1 2 0
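The last step can be sketched in a few lines. It assumes that the labels 5, 4 and 3 are eliminated in turn, each occurrence being replaced by the smallest value in {0, 1, 2} that differs from both neighbours; the slide does not spell out this rule, and the name `reduce_to_three` is illustrative.

```python
def reduce_to_three(labels):
    """Map labels from {0..5} to {0,1,2}, keeping adjacent labels distinct."""
    out = list(labels)
    for big in (5, 4, 3):                      # eliminate one large label at a time
        for i, value in enumerate(out):
            if value == big:
                neighbours = set()
                if i > 0:
                    neighbours.add(out[i - 1])
                if i + 1 < len(out):
                    neighbours.add(out[i + 1])
                # At most two values are excluded, so one of 0, 1, 2 is always free.
                out[i] = min(x for x in (0, 1, 2) if x not in neighbours)
    return out


print(reduce_to_three([2, 1, 0, 3, 2, 1, 0, 3, 2, 3]))
# [2, 1, 0, 1, 2, 1, 0, 1, 2, 0]
```

On the slide's sequence this reproduces the "After reduction" row. Since no two adjacent labels are equal on input, two occurrences of the same large label are never neighbours, so the in-place updates do not interfere with one another.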

Page 10: String Edit Distance Matching Problem With Moves

Edit Sensitive Parsing (ESP): alphabet reduction (cont.)

Properties of the final labels:
1) The final alphabet is {0,1,2}.
2) No adjacent pair is identical.
3) The reduction takes log*|Σ| iterations.
4) Each label depends on O(log*|Σ|) characters to its left.

Page 11: String Edit Distance Matching Problem With Moves

Edit Sensitive Parsing (ESP): so how do we choose the landmarks?

For repeats, parse in a regular way: aaaaaaa -> (aaa)(aa)(aa)

For varying substrings, use alphabet reduction then mark landmarks as follows:

Page 12: String Edit Distance Matching Problem With Moves

Edit Sensitive Parsing (ESP): so how do we choose the landmarks?

Consider the final labels. Mark any character that is a local maximum (greater than its left and right neighbors):

Text:   c   a  b  a  g  e  f  a  c  e  d
Final: 010  2  1  0  1  2  1  0  1  2  0

Page 13: String Edit Distance Matching Problem With Moves

Edit Sensitive Parsing (ESP): so how do we choose the landmarks? (cont.)

Then mark any local minimum that is not adjacent to a marked character.

Page 14: String Edit Distance Matching Problem With Moves

Edit Sensitive Parsing (ESP): so how do we choose the landmarks? (cont.)

Clearly, the distance between consecutive marked labels is 2 or 3.
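The two marking passes can be sketched as below. The function name `landmarks` is illustrative, and the treatment of the two end positions (they are never marked, since they lack one neighbour) is an assumption; the slides do not say how the boundaries are handled.

```python
def landmarks(labels):
    """Return indices marked as landmarks: local maxima first, then isolated local minima."""
    n = len(labels)
    marked = [False] * n
    # Pass 1: interior local maxima.
    for i in range(1, n - 1):
        if labels[i - 1] < labels[i] > labels[i + 1]:
            marked[i] = True
    # Pass 2: interior local minima with no marked neighbour.
    for i in range(1, n - 1):
        if labels[i - 1] > labels[i] < labels[i + 1]:
            if not marked[i - 1] and not marked[i + 1]:
                marked[i] = True
    return [i for i, is_marked in enumerate(marked) if is_marked]


print(landmarks([2, 1, 0, 1, 2, 1, 0, 1, 2, 0]))  # [2, 4, 6, 8]
```

On the slide's final labels (without the leading 010 entry) the landmarks fall two positions apart, consistent with the claim that consecutive landmarks are 2 or 3 apart.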

Page 15: String Edit Distance Matching Problem With Moves

Edit Sensitive Parsing (ESP): what did we achieve so far?

By now the whole string has been arranged in pairs and triples.

The important outcome is that two strings with a small edit distance will be parsed into “very close” arrangements (the parsing of each character depends only on a log* n neighborhood).

Page 16: String Edit Distance Matching Problem With Moves

Edit Sensitive Parsing (ESP): constructing the ESP tree

We now re-label each pair or triple. This can be done by building a dictionary (Karp-Miller-Rosenberg) or by hashing (Karp-Rabin 1987):

Hash(w[0…m-1]) = (integer value of w[0…m-1]) mod q, for some large q.
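A minimal sketch of the hashing option, under stated assumptions: the block w is given as a byte string, its "integer value" is read in base 256, and q defaults to the prime 2^31 − 1. None of these choices come from the slides, and the name `karp_rabin_hash` is illustrative.

```python
def karp_rabin_hash(w, q=2_147_483_647, base=256):
    """Integer value of the byte string w in the given base, reduced mod q."""
    h = 0
    for byte in w:
        h = (h * base + byte) % q   # Horner's rule, reducing at every step
    return h


# Identical blocks always receive the same name; distinct blocks collide
# only with small probability when q is large.
print(karp_rabin_hash(b"ab") == karp_rabin_hash(b"ab"))  # True
```

Reducing modulo q at every step keeps the intermediate values small, so a pair or triple is renamed in constant time.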

Page 17: String Edit Distance Matching Problem With Moves

Edit Sensitive Parsing (ESP): constructing the ESP tree

In O(n log n) construction time we get a 2-3 tree.

Page 18: String Edit Distance Matching Problem With Moves

How do we represent a 2-3 tree as a vector?

The vector holds the frequency of occurrence of each (level, label) pair:

(0,a) (0,b) (0,c) (0,d) (0,e) (0,f) (0,g) (0,_)
  8     7     1     4     6     1     4     5

(1,2) (1,3) (1,6) (1,7) (1,8) (1,10) (1,12) (1,14) (1,16) (1,20) (1,21)
  2     1     1     1     1     1      2      1      3      1      2

(2,5) (2,7) (2,10) (2,13) (2,17) (2,20) (3,3) (3,15) (3,23) (4,10)
  1     1     1      2      1      1      1     1      1      1
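Such sparse frequency vectors are conveniently held as counters keyed by (level, label), and the L1 distance between two of them is then a sum over the union of their keys. The sketch below uses made-up toy vectors, not the ones from the slide, and the name `l1_distance` is illustrative.

```python
from collections import Counter


def l1_distance(vx, vy):
    """Sum of absolute count differences over every (level, label) key."""
    return sum(abs(vx[key] - vy[key]) for key in vx.keys() | vy.keys())


# Hypothetical toy vectors: keys are (level, label), values are frequencies.
vx = Counter({(0, "a"): 2, (0, "b"): 1, (1, 7): 1})
vy = Counter({(0, "a"): 1, (0, "b"): 1, (0, "c"): 1, (1, 9): 1})
print(l1_distance(vx, vy))  # 4
```

A `Counter` returns 0 for a missing key, so pairs that occur in only one tree contribute their full count.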

Page 19: String Edit Distance Matching Problem With Moves

Proof of correctness

Theorem:

$\tfrac{1}{2}\, d(X,Y) \le \|V(X) - V(Y)\|_1 \le O(\log n \log^* n)\, d(X,Y)$

Page 20: String Edit Distance Matching Problem With Moves

Upper bound proof:

$\|V(X) - V(Y)\|_1 \le O(\log n \log^* n)\, d(X,Y)$

Insert/change/delete a character: at most log* n + c nodes change per tree level.

Move a substring: the only changes are at the fringes, that is, at most 4(log* n + c) nodes change per tree level.

Since there are log n levels in the tree, each operation changes V by O(log n log* n), and there are d(X,Y) operations.

Page 21: String Edit Distance Matching Problem With Moves

Lower bound proof:

$d(X,Y) \le 2\,\|V(X) - V(Y)\|_1$

The idea is to transform X into Y using at most $2\,\|V(X) - V(Y)\|_1$ operations:

We want to keep hold of large pieces of the string that are common to both X and Y. So we go through and protect enough pieces of X that will be needed in Y.

Page 22: String Edit Distance Matching Problem With Moves

Lower bound proof: (cont.)

We avoid making any changes in the protected pieces.

At the first level of the tree we add or remove characters as needed (if a character appears in Y and not in X, we add it to the end of X).

So we get $\|V_0(X) - V_0(Y)\|_1 = 0$.

We proceed inductively up the tree: to make any node at level i, we need to move at most 2 nodes from level i-1 (we know that at level i-1 we have enough nodes, i.e. $\|V_{i-1}(X) - V_{i-1}(Y)\|_1 = 0$).

Page 23: String Edit Distance Matching Problem With Moves

Lower bound proof: example

Mark protected pieces:

[Figure: the ESP trees of Y and X, with protected pieces marked. Counter = 0.]

Page 24: String Edit Distance Matching Problem With Moves

Lower bound proof: example

Remove and add characters as needed:

[Figure: the ESP trees of Y and X. Counter = 0, then Counter = 1 (deletion), then Counter = 2 (insertion).]

Page 25: String Edit Distance Matching Problem With Moves

Lower bound proof: example

Move to level 2, move nodes in level 1 as needed:

[Figure: the ESP trees of Y and X. Counter = 2 (insertion), then Counter = 3 (1 move).]

Page 26: String Edit Distance Matching Problem With Moves

Lower bound proof: example

Move to level 3, move nodes in level 2 as needed:

[Figure: the ESP trees of Y and X, with one node annotated “This node will not move”. Counter = 3 (1 move), then Counter = 5 (2 moves).]

Page 27: String Edit Distance Matching Problem With Moves

Lower bound proof: example

[Figure: the final ESP trees of Y and X. Counter = 5 (2 moves).]

That’s it!!

Page 28: String Edit Distance Matching Problem With Moves

Lower bound proof: (example with d(X,Y)=1)

Did we achieve what we wanted?

[Figure: the ESP trees of Y and X, with the differing nodes shown in red and green.]

$2\,\|V(Y) - V(X)\|_1 = 2(\#\text{red nodes} + \#\text{green nodes}) = 18$

And we surely created X from Y in fewer than 18 moves.

Page 29: String Edit Distance Matching Problem With Moves

Application to string matching

To find D[i], we need to compare P against all possible substrings of T, of which there are O(n^2). We can reduce this to O(n) substrings:

$d(T[1,m], P) \le d(T[1,m], T[1,r]) + d(T[1,r], P) = |r - m| + d(T[1,r], P) \le 2\, d(T[1,r], P)$

since we need at least |r − m| operations to make T[1,r] the same length as P.

So we only need to consider the O(n) substrings of length m, and we get a 2-approximation of the optimal matching!

Page 30: String Edit Distance Matching Problem With Moves

Application to string matching – Final algorithm

Create ESP trees for T and P.

Find $\|V(T[1:m]) - V(P)\|_1$.

Iteratively compute $D[i] \approx \|V(T[i:i+m-1]) - V(P)\|_1$.

Overall we get O(n log n) time cost for the whole algorithm, and compute every D[i] up to a factor of O(log n log* n).
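A full implementation would maintain the vectors from the ESP trees of T and P. As a much-simplified sketch of only the iterative update, the code below keeps just the level-0 entries of V (plain character counts) and slides a window of length m across T, adjusting the L1 distance in constant time per shift. Restricting V to level 0 is an assumption made to keep the example short, so the output is not the approximation of D[i] described above; the name `sliding_l1` is illustrative.

```python
from collections import Counter


def sliding_l1(text, pattern):
    """L1 distance between character counts of each length-m window and the pattern."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    diff = Counter(text[:m])
    diff.subtract(Counter(pattern))           # diff = counts(window) - counts(pattern)
    dist = sum(abs(v) for v in diff.values())
    out = [dist]
    for i in range(m, len(text)):
        # The window loses text[i - m] and gains text[i].
        for ch, delta in ((text[i - m], -1), (text[i], +1)):
            dist -= abs(diff[ch])
            diff[ch] += delta
            dist += abs(diff[ch])
        out.append(dist)
    return out


print(sliding_l1("abcab", "ab"))  # [0, 2, 2, 0]
```

The same pattern of removing the contribution of what leaves the window and adding what enters it is what lets all n distances be obtained without recomputing each vector from scratch.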

Page 31: String Edit Distance Matching Problem With Moves


Application to string matching – Final algorithm example:
