Dynamic Edit Distance Table under a General Weighted Cost Function Heikki Hyyrö (University of Tampere, Finland) Kazuyuki Narisawa (Kyushu University, Japan) and Shunsuke Inenaga (Kyushu University, Japan)


Contents

•Edit Distance

• Left Increment/Decrement Edit Distance Problem

•Related Work

•Our Algorithm

•Experiments

•Summary

Edit Distance

Minimum total cost d for transforming string x[1:n] into y[1:m].

Edit Operation    Cost
Insertion         Ins. = δ(ε, b)
Deletion          Del. = δ(a, ε)
Substitution      Sub. = δ(a, b)

Example

x = prague, y = passage, Ins. = Del. = Sub. = 1

p r - - a g u e
  ⇓ ⇓ ⇓     ⇓
p a s s a g - e

Edit Distance = Sub. + Ins. + Ins. + Del. = 1 + 1 + 1 + 1 = 4

D

         p  r  a  g  u  e
      0  1  2  3  4  5  6
   0  0  1  2  3  4  5  6
p  1  1  0  1  2  3  4  5
a  2  2  1  1  1  2  3  4
s  3  3  2  2  2  2  3  4
s  4  4  3  3  3  3  3  4
a  5  5  4  4  3  4  4  4
g  6  6  5  5  4  3  4  5
e  7  7  6  6  5  4  4  4

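Dynamic Programming

     a  a
b    1  2
a    1  1

$$
D[i, 0] = \sum_{h=1}^{i} \delta(a_h, \varepsilon) \quad (0 \le i \le m)
$$

$$
D[0, j] = \sum_{h=1}^{j} \delta(\varepsilon, b_h) \quad (0 \le j \le n)
$$

$$
D[i, j] = \min \begin{cases}
D[i, j-1] + \delta(\varepsilon, b_j) \\
D[i-1, j] + \delta(a_i, \varepsilon) \\
D[i-1, j-1] + \delta(a_i, b_j)
\end{cases} \quad (1 \le i \le m,\ 1 \le j \le n)
$$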
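The recurrence can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name `edit_distance_table` and the cost callbacks `ins`, `dele` and `sub` are assumptions introduced here, with A indexing the rows and B the columns as in the table D.

```python
def edit_distance_table(a, b, ins, dele, sub):
    """Fill D[i][j] = minimum cost of transforming a[:i] into b[:j].

    ins(c)    : cost delta(eps, c) of inserting c      (hypothetical callback)
    dele(c)   : cost delta(c, eps) of deleting c       (hypothetical callback)
    sub(c, d) : cost delta(c, d) of replacing c by d   (hypothetical callback)
    """
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):            # first column: delete a[:i]
        D[i][0] = D[i - 1][0] + dele(a[i - 1])
    for j in range(1, n + 1):            # first row: insert b[:j]
        D[0][j] = D[0][j - 1] + ins(b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i][j - 1] + ins(b[j - 1]),
                D[i - 1][j] + dele(a[i - 1]),
                D[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
            )
    return D


# Unit costs as in the slide example; identical characters cost 0.
unit = dict(ins=lambda c: 1, dele=lambda c: 1,
            sub=lambda c, d: 0 if c == d else 1)
D = edit_distance_table("passage", "prague", **unit)
print(D[-1][-1])  # 4
```

Passing the costs as callbacks keeps the sketch usable for any weighted cost function δ, not only the unit-cost case.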

     


Right Increment/Decrement

• Right I/D of Edit Distance
  ▫ input: D of strings A and B
  ▫ output: D' of strings A and B', where B = B'a (decrement) or B' = Ba (increment)
  ▫ easy to compute
  ▫ insert or delete the rightmost column of D → D': O(m)

D (B = prague)                 D' (B' = pragu)
         p  r  a  g  u  e               p  r  a  g  u
      0  1  2  3  4  5  6            0  1  2  3  4  5
   0  0  1  2  3  4  5  6         0  0  1  2  3  4  5
p  1  1  0  1  2  3  4  5      p  1  1  0  1  2  3  4
a  2  2  1  1  1  2  3  4      a  2  2  1  1  1  2  3
s  3  3  2  2  2  2  3  4      s  3  3  2  2  2  2  3
s  4  4  3  3  3  3  3  4      s  4  4  3  3  3  3  3
a  5  5  4  4  3  4  4  4      a  5  5  4  4  3  4  4
g  6  6  5  5  4  3  4  5      g  6  6  5  5  4  3  4
e  7  7  6  6  5  4  4  4      e  7  7  6  6  5  4  4
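A right increment only needs the current last column, which is why it costs O(m). The sketch below assumes a hypothetical helper `append_column` and the same style of cost callbacks as the recurrence; a right decrement simply drops the last column.

```python
def append_column(prev_col, a, c, ins, dele, sub):
    """Return the column of D for B + c, given prev_col, the last column of D for B.

    prev_col[i] = D[i][n]; the result holds D[i][n + 1] for i = 0..m.
    """
    new_col = [prev_col[0] + ins(c)]
    for i in range(1, len(a) + 1):
        new_col.append(min(
            prev_col[i] + ins(c),                  # insert c
            new_col[i - 1] + dele(a[i - 1]),       # delete a_i
            prev_col[i - 1] + sub(a[i - 1], c),    # substitute a_i by c
        ))
    return new_col


# Column "u" of the table for A = passage, B = pragu (read off the slide).
col_u = [5, 4, 3, 3, 3, 4, 4, 4]
col_e = append_column(col_u, "passage", "e",
                      ins=lambda c: 1, dele=lambda c: 1,
                      sub=lambda x, y: 0 if x == y else 1)
print(col_e)  # [6, 5, 4, 4, 4, 4, 5, 4]
```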

Left Increment/Decrement

• Left I/D of ED
  ▫ input: D of strings A and B
  ▫ output: D' of strings A and B', where B = aB' (decrement) or B' = aB (increment)
  ▫ difficult to compute: values on the left side affect the values on the right side

D' (B' = rague)             D (B = prague)
         r  a  g  u  e               p  r  a  g  u  e
      0  2  3  4  5  6            0  1  2  3  4  5  6
   0  0  1  2  3  4  5         0  0  1  2  3  4  5  6
p  1  1  1  2  3  4  5      p  1  1  0  1  2  3  4  5
a  2  2  2  1  2  3  4      a  2  2  1  1  1  2  3  4
s  3  3  3  2  2  3  4      s  3  3  2  2  2  2  3  4
s  4  4  4  3  3  3  4      s  4  4  3  3  3  3  3  4
a  5  5  5  4  4  4  4      a  5  5  4  4  3  4  4  4
g  6  6  6  5  4  5  5      g  6  6  5  5  4  3  4  5
e  7  7  7  6  5  5  5      e  7  7  6  6  5  4  4  4

Contribution

• Propose an efficient algorithm for the Left I/D problem with arbitrary nonnegative integer costs

• Left I/D problem
  ▫ input: ED table D of strings A and B
  ▫ output: ED table D' of strings A and B', where B = aB' (decrement) or B' = aB (increment)
  ▫ costs of operations are nonnegative integers

Applications

• Cyclic String Comparison [Landau et al. 1998]

• Computing Approximate Periods [Schmidt 1998]

• Edit distance for sliding window

• String Kernel based on Edit distance
  ▫ a kernel is a mapping to a high-dimensional feature space, used in Support Vector Machines (classifiers)

Related Work

• naïve method
  ▫ compute D' from scratch
  ▫ O(nm) time

• Kim & Park algorithm [2004]
  ▫ each operation has cost 1
  ▫ computes the difference representation DR of table D using the Change Table Ch
  ▫ O(n+m) time

Definition

• Left Increment/Decrement Problem
  ▫ input: DR table of strings A and B
  ▫ output: DR' table of strings A and B'
  ▫ B = aB' (decrement)
  ▫ B' = aB (increment)

• Each cost (Ins., Del., Sub.) is a nonnegative integer
  ▫ Kim & Park algorithm: each cost is 1

Difference Representation

$$
DR[i, j].U = D[i, j] - D[i-1, j] \quad \text{(under minus upper)}
$$

$$
DR[i, j].L = D[i, j] - D[i, j-1] \quad \text{(right minus left)}
$$

For the table D of A = passage and B = prague:

DR.U                        DR.L
      p  r  a  g  u  e            p  r  a  g  u  e
      1  2  3  4  5  6            1  2  3  4  5  6
p  1 -1 -1 -1 -1 -1 -1      p  1 -1  1  1  1  1  1
a  2  1  0 -1 -1 -1 -1      a  2 -1  0  0  1  1  1
s  3  1  1  1  0  0  0      s  3 -1  0  0  0  1  1
s  4  1  1  1  1  0  0      s  4 -1  0  0  0  0  1
a  5  1  1  0  1  1  0      a  5 -1  0 -1  1  0  0
g  6  1  1  1 -1  0  1      g  6 -1  0 -1 -1  1  1
e  7  1  1  1  1  0 -1      e  7 -1  0 -1 -1  0  0
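The two difference tables can be derived from D in a few lines. The sketch below is illustrative only: the names `difference_representation` and `distance_from_dr` are assumptions, and the literal table is the unit-cost D for A = passage and B = prague shown on the slides. Since D[0][0] = 0, the edit distance is recovered by summing U down column 0 and L along the last row, which takes O(n+m) additions.

```python
# Unit-cost table D for A = passage (rows) and B = prague (columns).
D = [
    [0, 1, 2, 3, 4, 5, 6],
    [1, 0, 1, 2, 3, 4, 5],
    [2, 1, 1, 1, 2, 3, 4],
    [3, 2, 2, 2, 2, 3, 4],
    [4, 3, 3, 3, 3, 3, 4],
    [5, 4, 4, 3, 4, 4, 4],
    [6, 5, 5, 4, 3, 4, 5],
    [7, 6, 6, 5, 4, 4, 4],
]


def difference_representation(D):
    """Return dicts U and L keyed by (i, j): U = under minus upper, L = right minus left."""
    m, n = len(D) - 1, len(D[0]) - 1
    U = {(i, j): D[i][j] - D[i - 1][j] for i in range(1, m + 1) for j in range(n + 1)}
    L = {(i, j): D[i][j] - D[i][j - 1] for i in range(m + 1) for j in range(1, n + 1)}
    return U, L


def distance_from_dr(U, L, m, n):
    """Recover D[m][n] by walking down column 0, then along row m."""
    return sum(U[i, 0] for i in range(1, m + 1)) + sum(L[m, j] for j in range(1, n + 1))


U, L = difference_representation(D)
print([U[2, j] for j in range(1, 7)])  # row "a" of DR.U
print(distance_from_dr(U, L, 7, 6))    # 4
```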

DR' – DR

We need not update all cells.

           DR'.U               DR.U             DR'.U – DR.U
      r  a  g  u  e     p  r  a  g  u  e     r  a  g  u  e
p  1  0  0  0  0  0    -1 -1 -1 -1 -1 -1     1  1  1  1  1
a  2  1 -1 -1 -1 -1     1  0 -1 -1 -1 -1     1  0  0  0  0
s  3  1  1  0  0  0     1  1  1  0  0  0     0  0  0  0  0
s  4  1  1  1  0  0     1  1  1  1  0  0     0  0  0  0  0
a  5  1  1  1  1  0     1  1  0  1  1  0     0  1  0  0  0
g  6  1  1  0  1  1     1  1  1 -1  0  1     0  0  1  1  0
e  7  1  1  1  0  0     1  1  1  1  0 -1     0  0  0  0  1

           DR'.L               DR.L             DR'.L – DR.L
      r  a  g  u  e     p  r  a  g  u  e     r  a  g  u  e
p  1  0  1  1  1  1    -1  1  1  1  1  1    -1  0  0  0  0
a  2  0 -1  1  1  1    -1  0  0  1  1  1     0 -1  0  0  0
s  3  0 -1  0  1  1    -1  0  0  0  1  1     0 -1  0  0  0
s  4  0 -1  0  0  1    -1  0  0  0  0  1     0 -1  0  0  0
a  5  0 -1  0  0  0    -1  0 -1  1  0  0     0  0 -1  0  0
g  6  0 -1 -1  1  0    -1  0 -1 -1  1  1     0  0  0  0 -1
e  7  0 -1 -1  0  0    -1  0 -1 -1  0  0     0  0  0  0  0

Change Table

• Ch[i, j] = D'[i, j] – D[i, j]
• cost = 1
  ▫ values in Ch: –1, 0, 1
  ▫ Ch is separated into three areas

              D'                       D                       Ch = D' – D
         r  a  g  u  e            p  r  a  g  u  e         p  r  a  g  u  e
   0  0  1  2  3  4  5      0  1  2  3  4  5  6       -1 -1 -1 -1 -1 -1
p  1  1  1  2  3  4  5      1  0  1  2  3  4  5        1  0  0  0  0  0
a  2  2  2  1  2  3  4      2  1  1  1  2  3  4        1  1  0  0  0  0
s  3  3  3  2  2  3  4      3  2  2  2  2  3  4        1  1  0  0  0  0
s  4  4  4  3  3  3  4      4  3  3  3  3  3  4        1  1  0  0  0  0
a  5  5  5  4  4  4  4      5  4  4  3  4  4  4        1  1  1  0  0  0
g  6  6  6  5  4  5  5      6  5  5  4  3  4  5        1  1  1  1  1  0
e  7  7  7  6  5  5  5      7  6  6  5  4  4  4        1  1  1  1  1  1

Affected Entries

• entries where DR'[i, j] ≠ DR[i, j]
  ▫ they must be updated
  ▫ affected entries are along the borders of the three areas in Ch

       DR'.U – DR.U      DR'.L – DR.L      Ch = D' – D
      r  a  g  u  e     r  a  g  u  e     r  a  g  u  e
   0                                     -1 -1 -1 -1 -1
p  1  1  1  1  1  1    -1  0  0  0  0     0  0  0  0  0
a  2  1  0  0  0  0     0 -1  0  0  0     1  0  0  0  0
s  3  0  0  0  0  0     0 -1  0  0  0     1  0  0  0  0
s  4  0  0  0  0  0     0 -1  0  0  0     1  0  0  0  0
a  5  0  1  0  0  0     0  0 -1  0  0     1  1  0  0  0
g  6  0  0  1  1  0     0  0  0  0 -1     1  1  1  1  0
e  7  0  0  0  0  1     0  0  0  0  0     1  1  1  1  1

Sketch of Kim & Park Algorithm

• Update affected entries
  ▫ scan the borders in Ch, computing Ch and DR'

•Time Complexity : O(n+m)

General Costs

• Ch can be separated into more than three areas
  ▫ the number of areas depends on the costs
  ▫ the values are not limited to –1, 0, 1

• Kim & Park algorithm
  ▫ is specialized to the three-area case
  ▫ cannot be applied with general costs

Example: Ins. = 2, Del. = 2, Sub. = 1

Ch
      r  a  g  u  e
   0 -2 -2 -2 -2 -2
p  1 -1 -1 -1 -1 -1
a  2  2 -1 -1 -1 -1
s  3  2  1 -1 -1 -1
s  4  2  1  1 -1 -1
a  5  2  2  1  1 -1
g  6  2  2  2  1  1
e  7  2  2  2  2  1
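How the value range of Ch depends on the costs can be checked with a small experiment. The sketch below recomputes both tables from scratch (the naïve O(nm) way) and subtracts them. The helper names `table` and `change_table` are assumptions, the costs are constants with matching characters costing 0, and column j of D' is lined up with column j+1 of D because the first character of B has been removed.

```python
def table(a, b, ins, dele, sub):
    """D[i][j] for constant operation costs; matching characters cost 0."""
    D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i == 0 or j == 0:
                D[i][j] = i * dele + j * ins
            else:
                D[i][j] = min(D[i][j - 1] + ins,
                              D[i - 1][j] + dele,
                              D[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else sub))
    return D


def change_table(a, b, ins, dele, sub):
    """Ch for the left decrement b -> b[1:], including the column of the removed character."""
    D = table(a, b, ins, dele, sub)
    D2 = table(a, b[1:], ins, dele, sub)
    return [[D2[i][j] - D[i][j + 1] for j in range(len(b))]
            for i in range(len(a) + 1)]


unit = change_table("passage", "prague", 1, 1, 1)
weighted = change_table("passage", "prague", 2, 2, 1)
print(sorted({v for row in unit for v in row}))      # distinct values, unit costs
print(sorted({v for row in weighted for v in row}))  # distinct values, Ins = Del = 2, Sub = 1
```

With unit costs the distinct values are -1, 0 and 1; with the weighted costs of the example the range widens, which is what breaks the three-area assumption.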

Our Algorithm

• Update only affected entries
  ▫ without Ch
  ▫ compute only DR'.U and DR'.L

• Time complexity: O(min{c(n+m), nm})
  ▫ c is the maximum cost

       DR'.U – DR.U      DR'.L – DR.L      Ch = D' – D
      r  a  g  u  e     r  a  g  u  e     r  a  g  u  e
   0                                     -2 -2 -2 -2 -2
p  1  1  1  1  1  1    -3  0  0  0  0    -1 -1 -1 -1 -1
a  2  3  0  0  0  0     0 -3  0  0  0     2 -1 -1 -1 -1
s  3  0  2  0  0  0     0 -1 -2  0  0     2  1 -1 -1 -1
s  4  0  0  2  0  0     0 -1  0 -2  0     2  1  1 -1 -1
a  5  0  1  0  2  0     0  0 -1  0 -2     2  2  1  1 -1
g  6  0  0  1  0  2     0  0  0 -1  0     2  2  2  1  1
e  7  0  0  0  1  0     0  0  0  0 -1     2  2  2  2  1

Affected Entry

• DR'[i, j] ≠ DR[i, j]

• Kim & Park algorithm
  ▫ computes DR' and Ch to find the affected entries

• Our algorithm
  ▫ finds the affected entries using only the DR table
  ▫ uses the following lemma

DR'[i, j] is an affected entry ⇔ DR'[i–1, j].L ≠ DR[i–1, j].L or DR'[i, j–1].U ≠ DR[i, j–1].U
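The lemma suggests a column-by-column propagation: a cell is recomputed only when the U value to its left or the L value above it has changed. The Python sketch below is a simplified illustration of that idea for the left decrement, not the authors' implementation. The names `build_dr` and `left_decrement`, the column-major layout and the cost callbacks are all assumptions made for this example.

```python
def build_dr(a, b, ins, dele, sub):
    """Column-major DR computed from scratch: cols[j][i] = [U, L]
    (U is None in row 0, L is None in column 0)."""
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + dele(a[i - 1])
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + ins(b[j - 1])
        for i in range(1, m + 1):
            D[i][j] = min(D[i][j - 1] + ins(b[j - 1]),
                          D[i - 1][j] + dele(a[i - 1]),
                          D[i - 1][j - 1] + sub(a[i - 1], b[j - 1]))
    return [[[D[i][j] - D[i - 1][j] if i else None,
              D[i][j] - D[i][j - 1] if j else None]
             for i in range(m + 1)] for j in range(n + 1)]


def left_decrement(cols, a, b, ins, dele, sub):
    """Turn the DR of (a, b) into the DR of (a, b[1:]) in place.
    Returns the number of recomputed cells."""
    m = len(a)
    cols.pop(0)                 # old column 1 becomes the new boundary column 0
    prev = []                   # rows whose U changed in the previous column
    for i in range(1, m + 1):
        new_u = dele(a[i - 1])
        if cols[0][i][0] != new_u:
            prev.append(i)
        cols[0][i] = [new_u, None]
    cols[0][0] = [None, None]
    work = 0
    for j in range(1, len(cols)):
        if not prev:            # nothing changed to the left: the rest is unaffected
            break
        c = b[j]                # character of column j in the new table
        curr, k, i = [], 0, prev[0]
        while i <= m:
            x = cols[j][i - 1][1]          # DR'[i-1, j].L
            y = cols[j - 1][i][0]          # DR'[i, j-1].U
            z = min(x + dele(a[i - 1]), y + ins(c), sub(a[i - 1], c))
            old_u, old_l = cols[j][i]
            cols[j][i] = [z - x, z - y]
            work += 1
            if z - x != old_u:
                curr.append(i)
            i += 1
            if z - y == old_l:             # L unchanged: jump to the next row listed in prev
                while k < len(prev) and prev[k] < i:
                    k += 1
                i = prev[k] if k < len(prev) else m + 1
        prev = curr
    return work


costs = dict(ins=lambda c: 2, dele=lambda c: 2, sub=lambda x, y: 0 if x == y else 1)
cols = build_dr("passage", "prague", **costs)
work = left_decrement(cols, "passage", "prague", **costs)
print(work, "cells recomputed out of", 7 * 5)
```

Storing the table by columns lets the decrement discard the first column with a single `pop(0)`; a production version would more likely use an offset or a circular buffer, which is a design choice outside the slides.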

comparison of pseudo codes

Our Algorithm

for i = 1 to m do
    Δprev[i] = i + 1; DR[i, 1].U = δ(a_i, ε);
i = 1; j = 1; DR[0, j].L = δ(ε, b_j); currIdx = 1; prevIdx = 1;
while i ≤ m and j ≤ n do
    while i ≤ m do
        x = DR[i-1, j].L; y = DR[i, j-1].U;
        z = min{x + δ(a_i, ε), y + δ(ε, b_j), δ(a_i, b_j)};
        new.L = z - y; new.U = z - x;
        old.L = DR[i, j].L; old.U = DR[i, j].U;
        DR[i, j].L = new.L; DR[i, j].U = new.U;
        if old.U ≠ new.U then
            Δcurr[currIdx] = i; currIdx = currIdx + 1;
        i = i + 1;
        if old.L = new.L then
            now = i;
            repeat
                i = Δprev[prevIdx]; prevIdx = prevIdx + 1;
            until i ≥ now
    Δcurr[currIdx] = m + 1;
    Interchange the roles of the tables Δcurr and Δprev;
    currIdx = 1; i = Δprev[1]; prevIdx = 2; j = j + 1;

Kim & Park Algorithm

Let k be the smallest index in A such that A[k] = B[1]
i_{-1} = 0; j_{-1} = 1; i_1 = k; j_1 = 0; f(0) = 0; g(0) = k;
finished_{-1} = false; finished_1 = false;
while (finished_{-1} = false) or (finished_1 = false) do
    if i_{-1} < i_1 – 1 then /* case 1 */
        if j_{-1} > j_1 + 1 then
            if j_{-1} > j_1 + 1 then X = -1;
            else X = 0;
            Y = 0;
        else
            if f(i_{-1}) < j_{-1} then X = -1;
            else if g(j_1) ≤ i_{-1} then X = 1;
            else X = 0;
            if g(j_1) ≤ i_{-1} + 1 then Y = 1;
            else Y = 0;
        Z = -1;
        Ch[i_{-1}+1, j_{-1}] = min{ -DR[i_{-1}+1, j_{-1}+1].UL + X + δ_{i_{-1}+1, j_{-1}+1},
                                    -DR[i_{-1}+1, j_{-1}+1].U + Z + 1,
                                    -DR[i_{-1}+1, j_{-1}+1].L + Y + 1 };
        DR'[i_{-1}+1, j_{-1}].U = DR[i_{-1}+1, j_{-1}+1].U – Ch[i_{-1}+1, j_{-1}] + Z;
        DR'[i_{-1}+1, j_{-1}].L = DR[i_{-1}+1, j_{-1}+1].L – Ch[i_{-1}+1, j_{-1}] + Y;
        if Ch[i_{-1}+1, j_{-1}] = -1 then i_{-1} = i_{-1} + 1; f(i_{-1}) = j_{-1};
        else j_{-1} = j_{-1} + 1;
    else if j_1 < j_{-1} – 1 then /* case 2 */
        if i_1 > i_{-1} + 1 then
            if g(j_1) < i_1 then X = 1;
            else X = 0;
            Y = 0;
        else
            if g(j_1) < i_1 then X = 1;
            else if f(i_{-1}) ≤ j_1 then X = 0;
            else X = 0;
            if f(i_1 – 1) ≤ j_1 + 1 then Y = -1;
            else Y = 0;
        Z = 1;
        Ch[i_1, j_1+1] = min{ -DR[i_1, j_1+2].UL + X + δ_{i_1, j_1+2},
                              -DR[i_1, j_1+2].U + Y + 1,
                              -DR[i_1, j_1+2].L + Z + 1 };
        DR'[i_1, j_1+1].U = DR[i_1, j_{-1}+2].U – Ch[i_1, j_1+1] + Y;
        DR'[i_1, j_1+1].L = DR[i_1, j_{-1}+2].L – Ch[i_1, j_1+1] + Z;
        if Ch[i_1, j_1+1] = 1 then j_1 = j_1 + 1; g(j_1) = i_1;
        else i_1 = i_1 + 1;
    else /* case 3 */
        if f(i_{-1}) < j_{-1} then X = -1;
        else if g(j_1) ≤ i_{-1} then X = 1;
        else X = 0;
        Y = -1; Z = 1;
        Ch[i_{-1}+1, j_{-1}] = min{ -DR[i_{-1}+1, j_{-1}+1].UL + X + δ_{i_{-1}+1, j_{-1}+1},
                                    -DR[i_{-1}+1, j_{-1}+1].U + Y + 1,
                                    -DR[i_{-1}+1, j_{-1}+1].L + Z + 1 };
        DR'[i_{-1}+1, j_{-1}].U = DR[i_{-1}+1, j_{-1}+1].U – Ch[i_{-1}+1, j_{-1}] + Y;
        DR'[i_{-1}+1, j_{-1}].L = DR[i_{-1}+1, j_{-1}+1].L – Ch[i_{-1}+1, j_{-1}] + Z;
        if Ch[i_{-1}+1, j_{-1}] = 1 then j_{-1} = j_{-1} + 1; j_1 = j_1 + 1; g(j_1) = i_1;
        else if Ch[i_{-1}+1, j_{-1}] = 1 then j_{-1} = j_{-1} + 1; j_1 = j_1 + 1; g(j_1) = i_1;
        else j_{-1} = j_{-1} + 1; i_1 = i_1 + 1;
    if (i_{-1} = m) or (j_{-1} = n) then
        finished_{-1} = true;
    if (i_1 = m+1) or (j_1 = n–1) then
        finished_1 = true;

comparison of behaviors

[Figure: table grids with axes 0 … m and 0 … n, contrasting the cells visited by our algorithm and by the Kim & Park algorithm]

Experiments

• strings A[1:m] and B[1:m]
  ▫ total time of computing the representations of the edit distance between A and B[j:m] for j = m, m–1, …, 1 (left incremental computation)

• Machine specifications
  ▫ CentOS Linux
  ▫ Xeon 3.0 GHz
  ▫ 16 GB memory

Experiment 1

• Time comparison with naïve algorithm

• costs
  ▫ chosen randomly: Insertion = 137, Deletion = 116, Substitution = 242

• Random data
  ▫ alphabet size 2, 3, …, 52
  ▫ string length 100, 200, …, 5000

Result 1

[Figure: plots for Experiment 1, two slides]

Experiment 2

• Time comparison with Kim & Park algorithm

• costs
  ▫ Insertion = Deletion = Substitution = 1

• Random data
  ▫ alphabet size 2, 3, …, 52
  ▫ string length 100, 200, …, 5000

Result 2

[Figure: plots for Experiment 2, two slides]

Experiment 3

• Time comparison with naïve algorithm

• Corpus
  ▫ English (Reuters news)
    costs: Insertion = 137, Deletion = 116, Substitution = 242
    string length: 1000, 2000, 3000, 4000, 5000
  ▫ Protein data (Canterbury corpus: E.coli)
    costs: proposed in [Kurtz 1996]
    string length: 1000, 2000, 3000, 4000, 5000

δ   ε  A  C  G  T
ε   0  3  3  3  3
A   3  0  2  1  3
C   3  2  0  2  1
G   3  1  2  0  2
T   3  2  1  2  0
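A cost matrix like this plugs directly into the general δ of the recurrence. The sketch below copies the table into a dictionary and evaluates the weighted distance with a one-row dynamic program. The names `DELTA` and `weighted_distance` are assumptions, as is reading rows as the source symbol and columns as the target symbol, with the empty string standing in for ε.

```python
EPS = ""  # stands for the empty string epsilon (an assumption of this sketch)

SYMBOLS = [EPS, "A", "C", "G", "T"]
ROWS = [            # rows and columns in the order eps, A, C, G, T
    [0, 3, 3, 3, 3],
    [3, 0, 2, 1, 3],
    [3, 2, 0, 2, 1],
    [3, 1, 2, 0, 2],
    [3, 2, 1, 2, 0],
]
# DELTA[x][y]: cost of turning symbol x into symbol y
DELTA = {x: dict(zip(SYMBOLS, row)) for x, row in zip(SYMBOLS, ROWS)}


def weighted_distance(a, b, delta=DELTA):
    """Edit distance from a to b under delta, keeping one row of D at a time."""
    prev = [0]
    for c in b:                                   # row 0: insert b[:j]
        prev.append(prev[-1] + delta[EPS][c])
    for x in a:
        curr = [prev[0] + delta[x][EPS]]          # column 0: delete a[:i]
        for j, c in enumerate(b, start=1):
            curr.append(min(curr[j - 1] + delta[EPS][c],
                            prev[j] + delta[x][EPS],
                            prev[j - 1] + delta[x][c]))
        prev = curr
    return prev[-1]


print(weighted_distance("A", "G"))   # 1: one substitution A -> G
print(weighted_distance("AC", "A"))  # 3: keep A, delete C
```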

Result 3

English News

Length    Time (seconds)
          Our algorithm    Naïve algorithm
1000      0.04             1.50
2000      0.27             12.0
3000      0.71             40.4
4000      1.36             97.1
5000      2.29             189

Protein Data

Length    Time (seconds)
          Our algorithm    Naïve algorithm
1000      0.01             1.43
2000      0.09             11.5
3000      0.23             38.8
4000      0.43             92.8
5000      0.70             181

Summary

• Algorithm for the Left I/D problem
  ▫ nonnegative integer costs
  ▫ O(min{c(n+m), nm}), where c is the maximum cost
  ▫ experimentally fast

                    Our Algorithm         Naïve Algorithm    Kim & Park Algorithm
Costs               Nonnegative integer   Real number        1
Tables to compute   DR                    D                  DR and Ch
Time Complexity     O(min{c(n+m), nm})    O(nm)              O(n+m)
Source code         Simple                Simple             Cumbersome
Speed               Fast                  Very slow          Slower

Related Work

• naïve method
  ▫ compute D' from scratch
  ▫ O(nm) time

• Kim & Park algorithm [2004]
  ▫ each operation has cost 1
  ▫ compute difference representation DR → DR' using Change Table Ch
  ▫ O(n+m) time

[Figure: naïve method (D, D') versus Kim & Park algorithm (DR, Ch → DR', Ch), each leading to Edit Distance; arrow labels: O(nm), O(nm), O(1), O(n+m), O(n+m)]