1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of...

52
1 Bit-parallel approximate string matching algorithm s with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor: Prof. R. C. T. Lee Reporter: Z. H. Pan

Transcript of 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of...

Page 1: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

1

Bit-parallel approximate string matching algorithms with transposition

Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229

Advisor: Prof. R. C. T. Lee

Reporter: Z. H. Pan

Page 2: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

2

This paper is based on “Fast text searching allowing errors” which there are three operation which are insertion, deletion and substitution. This paper uses a new operation transposition. We will use those operations to solve the approximate string matching problem under bit-parallel.

Transposition: It is transposing two adjacent characters in the string.

Example:

Insertion: cat → cast

Deletion: cat → at

Substitution: cat → car

Transposition: cta → cat

Page 3: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

3

But now, first we will use the three operations insertion, deletion and substitution to apply in the approximate string matching problem. Then the transposition will be introduced after the three operations.

Page 4: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

4

Our approximate string matching problem is defined as follows:

We are given a text T1,n, a pattern P1,m and an error bound k.

We want to find all locations Ti when 1≦i≦n such that there exists a suffix A of T1,i and d(A,P)≦k where d(x,y) is the edit distance (or difference) between x and y.

Example: T=deaabeg, P=aabac and k=2.

For i=5. T1,5= deaab.

We note that there exists a suffix A=aab of T1,5 such that d(A,P)=d(aab,aabac)=2.

Page 5: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

5

T

P

S2

Let S be a substring of T.

If there exists a suffix S2 of S and a suffix P2 of P such that

d(S2, P2) = 0, and d(S1, P1) ≦k,

we have d(S, P) ≦ k.

S1

S

P1 P2

Our approach is based upon the following observation:

Case 1 :

Page 6: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

6

Example:

A=addcd and B=abcd. k=2. We may decompose A and B as follows:A=add+cd.B=ab+cd.d(add,ab)=2.

Thus d(A,B)=2.

Page 7: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

7

T

P

S2

Let S be a substring of T.

If there exists a suffix S2 of S and a suffix P2 of P such that

d(S2, P2) = 1, and d(S1, P1) ≦k-1,

we have d(S, P) ≦ k.

S1

S

P1 P2

Case 2 :

Page 8: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

8

Example:

A=addb and B=dc. k=3. We may decompose A and B as follows:A=add+b.B=d+c.d(b,c)=1d(add,d)=2.

Thus d(A,B)=3.

Page 9: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

9

To solve our approximate string matching problem, we use a table, called .

where 1≦i≦n and 1≦j≦m.

11000

a a b a a c a a b a c a b 11100

11110

11111

11111

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

11101

11010

11100

11110

11111

11011

11001

11100

Example:

T:aabaacaabacab, P:aabac and k=1.

Consider i=9, j=4.

T1,9=aabaacaab

P1,4=aaba

A=T7,9=aab

d(A,P1,4)=d(aab,aaba)=1

= 1 if there exists a suffix A of T1,i such that d(A,P1,j)≦k.= 0 otherwise.

),( jiRk

],[ mnRk

1)4,9(1 R

]5,13[1R

Page 10: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

10

It will be shown later that is obtained from . Thus, it is important for us to know how to find . will be obtained by Shift-And algorithm later under bit-parallel. But now, we will explain it in dynamic programming [S80](String Matching with Errors).

can be found by dynamic programming.

for all i. for all j.

= 1 if and ti=pj .

= 0 if otherwise.

From the above, we can see that the ith column of can be obtained from the (i-1)th column of , ti and pj.

1 1 1 1 1 1 1 1 1

1 2 3 4 5 6 7 8 9

aabac

12345

a a b a a c a a b 100000

Example:

Text = aabaacaab

pattern = aabac

kR0R 0R

1kR

),(0 jiR

1)0,(0 iR 0),0(0 jR

)1,1(0 jiR

0R

0R

),(0 jiR

Page 11: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

11

For instance, the 9th-colume is

0

0

0

1

1

0

0

1

0

012345

Example: T : aabaacaabacab, P : aabac and k=0.

10000

a a b a a c a a b a c a b 11000

00100

10010

11000

00000

10000

11000

00100

10010

00001

10000

00000

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

The 8th-colume is

12345

because and t9= p3 =‘ b’.

because and t9≠p2.

But, we do not use dynamic programming approach because we want to use the bit-parallel approach.

]5,13[0R

1)3,9(0 R

0)2,9(0 R

1)2,8(0 R

1)1,8(0 R

Page 12: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

12

Let us denote the ith column of by whose length equals to m.

For instance

=( 1,0,0,1,0 )

=( 1,1,0,0,0 )

To find we consider two cases:

Case 1: j=1

If ti=pj,

otherwise,

Example:

T : aabaacaabacab,

P : aabac and k=0.

10000

a a b a a c a a b a c a b 11000

00100

10010

11000

00000

10000

11000

00100

10010

00001

10000

00000

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

Case 2: j>1

If and ti=pj,

otherwise,

]5,13[0R

kR kiR

))5,4(),4,4(),3,4(),2,4(),1,4(( 0000004 RRRRRR

05R

),,(0 jiR

;1),(0 jiR ;1),(0 jiR

0),(0 jiR 0),(0 jiR

1)1,1(0 jiR

Page 13: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

13

We now define a right shift operation >>i on a vector A=(a1,a2,…,a

m) as follows:

That is, we delete the last i elements and add i 0’s at the front.

For instance, A=(1,0,0,1,1,1). Then A>>1=(0,1,0,0,1,1).

We use & and v to represent AND and OR operation. We further use to denote

For example

nmba ).,...,,...(nm

bbaa

)0,0,0,1(103

),...,,0,...0( 1 im

i

aaiA

Page 14: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

14

Consider , m=5.

=( 1,0,0,1,0 )

Example:

T : aabaacaabacab,

P : aabac and k=0.

10000

a a b a a c a a b a c a b 11000

00100

10010

11000

00000

10000

11000

00100

10010

00001

10000

00000

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

]5,13[0R

09R

)0,0,1,0,0(09 R

)0,0,0,0,1()0,1,0,0,0(10)1( 409 R

Page 15: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

15

Given an alphabet α and a pattern P, we need to know where α appears in P.

This information is contained in a vector Σ(α) which is defined as follows:Σ(α)[i]=1 if pi=α .Σ(α)[i]=0 if otherwise.

Example: P : aabacΣ(a) =( 1,1,0,1,0 )Σ(b) =( 0,0,1,0,0 )Σ(c) =( 0,0,0,0,1 )

Page 16: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

16

Σ(a) =( 1,1,0,1,0 ) and p4=t10=‘a’

Since and p4=t10, we have .

Since , we have .

Using the following rule:

Example:

T : aabaacaabacab,

P : aabac and k=0.

10000

a a b a a c a a b a c a b 11000

00100

10010

11000

00000

10000

11000

00100

10010

00001

10000

00000

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

]5,13[0R

1)3,9(0 R 1)4,10(0 R

0)2,9(0 R 0)3,10(0 R

Case 1: j=1

If ti=pj,

otherwise,

Case 2: j>1

If and ti=pj,

otherwise,

;1),(0 jiR ;1),(0 jiR

0),(0 jiR 0),(0 jiR

1)1,1(0 jiR

Page 17: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

17

Similarly, we can conclude that .1)1,10(0 R

]5,13[0RExample:

T : aabaacaabacab,

P : aabac and k=0.

10000

a a b a a c a a b a c a b 11000

00100

10010

11000

00000

10000

11000

00100

10010

00001

10000

00000

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

&)10)1(( 409

010 RR

=((0,0,0,1,0)v(1,0,0,0,0))&(1,1,0,1,0)

=(1,0,0,1,0)

Σ[t10]

Σ(a)Σ(b)Σ(c)

*

11010001000000100000

Page 18: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

18

Shift-And

In general we will update the i-th column of , which is , by using the formula for each new character of i . If , report a match at position i.

],[0 mnR0iR )(&)10)1(( 10

10

im

ii tRR

1),(0 miR

Page 19: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

19

j=4

a a b a a c a1 2 3 4 5 6 7

a b a c8 9

a b. . .

a

a a

a a b a

a a b a c

j=1

j=2

j=3

j=5

a a b

Example:

Text = aabaacaabacab

pattern = aabac

when i=1

10000

11010

10000

&

Σ(a)Σ(b)Σ(c)

*

11010001000000100000

110000

1 1 1 1 1 1 1 1 1 1 1 1

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

a a b a a c a a b a c a b 100000

The Shift-And formula

Σ(a)

The operation “&” will examine whether the Σ(ti) equals to the pj or not currently.

100 10)1( mR

)(&)10)1(( 101

0i

mi tRR

01R

00R 0

1R

01R

0)2,1(0 R

0)3,1(0 R

0)4,1(0 R

0)5,1(0 R

1)1,1(0 R

Page 20: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

20

j=4

a a b a a c a1 2 3 4 5 6 7

a b a c8 9

a b. . .

a

a a

a a b a

a a b a c

j=1

j=2

j=3

j=5

a a b

110000

1 1 1 1 1 1 1 1 1 1 1

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

a a b a a c a a b a c a b 100000

111000

11000

11010

11000

&

The Shift-And formula

Example:

Text = aabaacaabacab

pattern = aabac

when i=2

Σ(a)Σ(b)Σ(c)

*

11010001000000100000

Σ(a)

02R

02R0

1R

02R

1)2,2(0 R

0)3,2(0 R

0)4,2(0 R

0)5,2(0 R

1)1,2(0 R

)(&)10)1(( 101

0i

mi tRR

100 10)1( mR

Page 21: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

21

j=4

a a b a a c a1 2 3 4 5 6 7

a b a c8 9

a b. . .

a

a a

a a b a

a a b a c

j=1

j=2

j=3

j=5

a a b

111000

1 1 1 1 1 1 1 1 1 1

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

a a b a a c a a b a c a b 100000

100100

11100

00100

00100

&

110000

The Shift-And formula

Example:

Text = aabaacaabacab

pattern = aabac

when i=3

Σ(a)Σ(b)Σ(c)

*

11010001000000100000

Σ(b)

03R

03R0

2R

03R

0)2,3(0 R

1)3,3(0 R

0)4,3(0 R

0)5,3(0 R

0)1,3(0 R

)(&)10)1(( 101

0i

mi tRR

100 10)1( mR

Page 22: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

22

j=4

a a b a a c a1 2 3 4 5 6 7

a b a c8 9

a b. . .

a

a a

a a b a

a a b a c

j=1

j=2

j=3

j=5

a a b

100100

1 1 1 1 1 1 1 1 1

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

a a b a a c a a b a c a b 100000

110010

10010

11010

10010

&

110000

111000

The Shift-And formula

Example:

Text = aabaacaabacab

pattern = aabac

when i=4

Σ(a)Σ(b)Σ(c)

*

11010001000000100000

Σ(a)

04R

04R0

3R

04R

0)2,4(0 R

0)3,4(0 R

1)4,4(0 R

0)5,4(0 R

1)1,4(0 R

)(&)10)1(( 101

0i

mi tRR

100 10)1( mR

Page 23: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

23

j=4

a a b a a c a1 2 3 4 5 6 7

a b a c8 9

a b. . .

a

a a

a a b a

a a b a c

j=1

j=2

j=3

j=5

a a b

110010

1 1 1 1 1 1 1 1

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

a a b a a c a a b a c a b 100000

111000

11001

11010

11000

&

110000

111000

100100

The Shift-And formula

Example:

Text = aabaacaabacab

pattern = aabac

when i=5

Σ(a)Σ(b)Σ(c)

*

11010001000000100000

Σ(a)

05R

04R 0

5R

05R

1)2,5(0 R

0)3,5(0 R

0)4,5(0 R

0)5,5(0 R

1)1,5(0 R

)(&)10)1(( 101

0i

mi tRR

100 10)1( mR

Page 24: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

24

Example:

Text = aabaacaabacab

pattern = aabac 110000

111000

100100

110010

111000

100000

110000

111000

100100

110010

100001

110000

100000

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

a a b a a c a a b a c a b

aabaacaabacab aabac

100000

Σ(a)Σ(b)Σ(c)

*

11010001000000100000

Page 25: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

25

Approximate String Matching

The approximate string matching problem is exact string matching problem with errors. We can use like Shift-And to achieve the

table. Here, we will introduce three operation which are insertion, deletion and substitution in approximate string matching.

Let , and denote the related to insertion, deletion and substitution respectively.

, and indicate that we can perform an insertion, deletion and substitution respectively without violating the error bound which is k.

],[ mnRk

),( jiRkI ),( jiRk

S),( jiRkD ),( jiRk

1),( jiRkI 1),( jiRk

S1),( jiRkD

where 1≦i≦n and 1≦j≦m.

= 1 if there exists a suffix A of T1,i such that d(A,P1,j)≦k.= 0 otherwise.

),( mnRk

Page 26: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

26

If ti=pj , and

Otherwise,

Example:

T:aabaacaabacab, P:aabac and k=1.

Consider i=6 and j=5.

S=T1,6=aabaac and P1,5=aabac

∵t6=p5=‘c’ and R1(5,4)=1.

There exists a suffix A of S such that d(A,P) 1.≦

T:

P:

aabaa

aaba

i

j

i-1

c

c

j-1

.1),(,1)1,1( jiRjiR kk

).,(),(),(),( jiRjiRjiRjiR kS

kD

kI

k

1)5,6(1 R

Page 27: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

27

if ti≠pj and

otherwise.

1),( jiRkI

0),( jiRkI

T:

P:

aabac

aabac

b

i

j

i-1

Example: when k=1.

T:

P:

aabac

aabac

b

binsertion

i

j

i-1

Insertion

.1),1(1 jiRk

Page 28: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

28

Example: Text = aabaacaabacab. Pattern = aabac. k=1.

10000

a a b a a c a a b a c a b 11000

00100

10010

11000

00000

10000

11000

00100

10010

00001

10000

00000

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

(1) When i=6 and j=4.

t6=‘c’≠p4=‘a’,

(2) When i=11 and j=4.

t11=‘c’≠p4=‘a’,

10000

a a b a a c a a b a c a b 11000

11100

11110

11010

11001

11000

11000

11100

11110

10011

11001

10100

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

if ti≠pj and

otherwise.

1),( jiRkI

0),( jiRkI

.1),1(1 jiRk

]5,13[0R

0)4,5(0 R

0)4,6(1 IR

1)4,10(0 R

0)4,11(1 IR]5,13[IR

Page 29: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

29

Example: Text = aabaacaabacab, Pattern = aabac and k=1

10000

a a b a a c a a b a c a b 11000

00100

10010

11000

00000

10000

11000

00100

10010

00001

10000

00000

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

The insertion formula:

11101

00001

00001

11000

11001

Σ(c)&

v

aabaacaabacabaabaac insertion

11110

11010

11010

00100

11110

Σ(a)&

v

aabaacaabacabaaba insertion

10000

a a b a a c a a b a c a b 11000

11100

11110

11010

11001

11000

11000

11100

11110

10011

11001

10100

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

11

11 ))(&)10)1(((

kii

mki

ki RtRR

113 10)1( mR 1

6R14R

03R 0

5R

Σ(a)Σ(b)Σ(c)

*

11010001000000100000

]5,13[0R ]5,13[1IR

115 10)1( mR

Page 30: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

30

T:

P:

aabac

aabac bdeletion

i

jj-1

Example: when k=1.

T:

P:

aabac

aabac b

i

jj-1

Deletion

if ti≠pj and

otherwise.

1),( jiRkD

0),( jiRkD

.1)1,(1 jiRk

Page 31: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

31

Example: Text = aabaacaabacab. Pattern = aabac. k=1.

10000

a a b a a c a a b a c a b 11000

00100

10010

11000

00000

10000

11000

00100

10010

00001

10000

00000

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

(1) When i=6 and j=4.

t6=‘c’≠p4=‘a’,

(2) When i=3 and j=4.

t3=‘b’≠p4=‘a’,

11000

a a b a a c a a b a c a b 11100

10110

11011

11100

10000

11000

11100

10110

11011

10001

11000

10100

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

if ti≠pj and

otherwise.

1),( jiRkD

0),( jiRkD

.1)1,(1 jiRk

0)4,6(1 DR

0)3,6(0 R

1)4,3(1 DR

1)3,3(0 R

]5,13[0R

]5,13[1DR

Page 32: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

32

Example: Text = aabaacaabacab, Pattern = aabac and k=1.

10000

a a b a a c a a b a c a b 11000

00100

10010

11000

00000

10000

11000

00100

10010

00001

10000

00000

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

The deletion formula:

11000

a a b a a c a a b a c a b 11100

10110

11011

11100

10000

11000

11100

10110

11011

10001

11000

10100

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

11100

00100

00100

10000

10100

Σ(b)&

v

aabaacaabacab

aabdeletion

11110

00100

00100

10010

10110

Σ(b)&

v

aabaacaabacabaaba deletion

))10()1(())(&)10)1((( 1111

mk

iimk

iki RtRR

)10(1 112

mR113R

)10(1 103

mR )10(1 1013

mR

)10(1 1112

mR13R

Σ(a)Σ(b)Σ(c)

*

11010001000000100000

]5,13[0R ]5,13[1DR

Page 33: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

33

T:

P:

aabac

aabac a

b T:

P:

aabac

aabac bsubstitution

b

i

j

i-1

j-1

i

j

i-1

j-1

Example: when k=1.

Substitution

if ti≠pj and

otherwise.

1),( jiRkS

0),( jiRkS

.1)1,1(1 jiRk

Page 34: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

34

Example: Text = aabaacaabacab. Pattern = aabac. k=1.

10000

a a b a a c a a b a c a b 11000

00100

10010

11000

00000

10000

11000

00100

10010

00001

10000

00000

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

(1) When i=6 and j=4.

t6=‘c’≠p4=‘a’,

(2) When i=5 and j=5.

t3=‘b’≠p4=‘a’,

10000

a a b a a c a a b a c a b 11000

11100

11010

11001

11100

11010

11000

11100

11010

11001

11000

11100

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

]5,13[0R

]5,13[1SR

if ti≠pj and

otherwise.

1),( jiRkS

0),( jiRkS

.1)1,1(1 jiRk

0)3,5(0 R

1)4,4(0 R

1)5,5(1 DR

0)4,6(1 DR

Page 35: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

35

Example: Text = aabaacaabacab, Pattern = aabac and k=1.

10000

a a b a a c a a b a c a b 11000

00100

10010

11000

00000

10000

11000

00100

10010

00001

10000

00000

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

The substitution formula:

10000

a a b a a c a a b a c a b 11000

11100

11010

11001

11100

11010

11000

11100

11010

11001

11000

11100

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

11100

00100

00100

11000

11100

&

v

aabaacaabacab

aabaabaacaabacab

cab11101

11010

11000

11001

11001

&

v

aabaacaabacabaabac aabaacaabacab

aabaa

Σ(a)Σ(b)Σ(c)

*

11010001000000100000

]5,13[0R ]5,13[1sR

))10()1(())(&)10)1((( 111

11

mk

iimk

iki RtRR

1112 10)1( mR

104 10)1( mR

Σ(b)Σ(a)

114 10)1( mR

113R1

5R

1012 10)1( mR

Page 36: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

36

The algorithm which contains insertion, deletion and substitution allows k error. In the physical meaning, we just combine the formulas of insertion, deletion and substitution. So now, we combine the three formula using the “or” operation as follows:

Initially,

insertion deletion substitution

The insertion :1

11

1 ))(&)10)1(((

k

iimk

iki RtRR

The deletion : ))10()1(())(&)10)1((( 1111

mk

iimk

iki RtRR

The substitution : ))10()1(())(&)10)1((( 111

11

mk

iimk

iki RTRR

kmkkR 010

111

1111 10)1()1()())(&)1((

mki

ki

kii

ki

ki RRRtRR

ti=pj

and 1)1,1( jiRk

Page 37: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

37

Example: Text = aabaacaabacab pattern = aabac k=1

10000

11000

00100

10010

11000

00000

10000

11000

00100

10010

00001

10000

00000

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

a a b a a c a a b a c a b 01111

11010

01010

00100

00010

01001

10000

11111

&

v

11000

a a b a a c a a b a c a b 11100

11110

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

11111

aabaacaabacabaabac

00001:

00000:10

00

R

R

]5,13[0R

]5,13[1R

← for deletion

14R 11

3 R

03R

103 R

104 R

Σ(a)Σ(b)Σ(c)

*

11010001000000100000

110 m

Σ(a)

Page 38: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

38

Example: Text = aabaacaabacab pattern = aabac k=1

10000

11000

00100

10010

11000

00000

10000

11000

00100

10010

00001

10000

00000

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

a a b a a c a a b a c a b 01111

11010

01010

10010

01001

01100

10000

11111

&

v

aabaacaabacabaabaa

11000

a a b a a c a a b a c a b 11100

11110

11111

11111

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

00001:

00000:10

00

R

R

]5,13[0R

]5,13[1R

← for insertion

15R 1

4R

04R

04R

05R

Σ(a)Σ(b)Σ(c)

*

11010001000000100000

110 m

Σ(a)

Page 39: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

39

Example: Text = aabaacaabacab pattern = aabac k=1

10000

11000

00100

10010

11000

00000

10000

11000

00100

10010

00001

10000

00000

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

a a b a a c a a b a c a b 01111

00001

00001

11000

01100

00000

10000

11101

&

v

aabaacaabacab aab

11000

a a b a a c a a b a c a b 11100

11110

11111

11111

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

11101

aabaacaabacab aac

00001:

00000:10

00

R

R

]5,13[0R

]5,13[1R

← for substitution

16R 1

5R

05R

05R

06R

Σ(a)Σ(b)Σ(c)

*

11010001000000100000

110 m

Σ(c)

Page 40: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

40

11000

a a b a a c a a b a c a b 11100

11110

11111

11111

1 2 3 4 5 6 7 8 9 10111213

aabac

12345

11101

11010

11100

11110

11111

11011

11001

11100

Example: Text = aabaacaabacab pattern = aabac k=1

00001:

00000:10

00

R

RΣ(a)Σ(b)Σ(c)

*

11010001000000100000

]5,13[1R

Page 41: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

41

Transposition

Example: A=badec and B=bdace.

If transposition is allowed, we may transpose B as follow:

B=bdace → A=badec

Define for transposition:

if & ti=pj-1 & ti-1=pj

otherwise.

kTR

1),( jiRkT

0),( jiRkT

)2,2(1 jiRkT

Page 42: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

42

Initially,

We combine the operations of insertion, deletion, substitution and transposition for k error. The formula is as follows:

)1)((&)(&)2( 12 iiki ttR

.0,01 10mkkmkk RR

1

11

1

11

1

10

)1(

)1(

)(

))(&)1((

m

ki

ki

ki

iki

R

R

R

tR

insertion

deletion

substitution

ti=pj and 1)1,1( jiRk

transposition

V

V

V

V

V

kiR

Page 43: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

43

Example: T=abcbcaccbb, P=acbcb and k=1.

10000

00000

00000

00000

00000

10000

01000

00000

1 2 3 4 5 6 7 8 9 10

acbcb

12345

a b c b c a c c b b00000

00000

11000

11100

11110

10101

11010

11000

11100

11110

1 2 3 4 5 6 7 8 9 10

acbcb

12345

a b c b c a c c b b10111

10001

011100101001010000000000000000100000010011110

&

v

the original partial text:

a b c

a c b

transpose “bc” to be “cb”

b

a

c

00100001010010100100

&

a bc bcaccbba cb

a bc bcaccbba bc

k 1≦k 0≦

So we can tackle this within k 1.≦

+

0000000 R

1000010 R

0000001 R

Σ(c)

Σ(b)Σ(c)>>1

13R

112 R

02R

102 R

103 R

201 R

Σ(a)Σ(b)Σ(c)

*

10000001010101000000

110 m

]5,13[0R ]5,13[1R

Page 44: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

44

Example: T=abcbcaccbb, P=acbcb and k=1.

10000

00000

00000

00000

00000

10000

01000

00000

1 2 3 4 5 6 7 8 9 10

acbcb

12345

a b c b c a c c b b00000

00000

11000

11100

11110

10101

11010

11000

11100

11110

1 2 3 4 5 6 7 8 9 10

acbcb

12345

a b c b c a c c b b10111

10001

011110010100101000000000000000100000001010111

&

v

the original partial text:

transpose “cb” to be “bc”

00010010100001000010

&

c c b

c b c

c

c

b

abcbcac cb bac bc

a

a

a

k 1≦k 0≦ +

So we can tackle this within k 1.≦

abcbcac cb bac cb

Σ(b)Σ(c)Σ(b)>>1

19R

118 R

08R

108 R

109 R

207 R

Σ(a)Σ(b)Σ(c)

*

10000001010101000000

110 m

]5,13[0R ]5,13[1R

0000000 R

1000010 R

0000001 R

Page 45: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

45

Example: T=abcbcaccbb, P=acbcb and k=2.

11000

11100

11110

10101

11010

11000

11100

11110

1 2 3 4 5 6 7 8 9 10

acbcb

12345

a b c b c a c c b b10111

10001

11100

11110

11111

11111

11111

11111

11110

11111

1 2 3 4 5 6 7 8 9 10

acbcb

12345

a b c b c a c c b b11111

11111

011110101001010111000111001111100000010011111

&

v

00110001010010100100

&

a bc bcaccbba cb

the original partial text:a b c

a c b

transpose “bc” to be “cb”

b

a

c

k 1≦k 1≦

So we can tackle this within k 2.≦

+

a bc bcaccbba bc

1000010 R

1100020 R

0000021 R

Σ(c)

Σ(b)Σ(c)>>1

23R

122 R

12R

112 R

113 R

211 R

Σ(a)Σ(b)Σ(c)

*

10000001010101000000

110 m

]5,13[1R ]5,13[2R

Page 46: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

46

Example: T=abcbcaccbb, P=acbcb and k=2.

11000

11100

11110

10101

11010

11000

11100

11110

1 2 3 4 5 6 7 8 9 10

acbcb

12345

a b c b c a c c b b10111

10001

11100

11110

11111

11111

11111

11111

11110

11111

1 2 3 4 5 6 7 8 9 10

acbcb

12345

a b c b c a c c b b11111

11111

011110010100101111100111101010100000001011111

&

v

the original partial text:

b c b

b b c

transpose “cb” to be “bc”c

b

b

00111010100001000010

&

a

a

a

ab cb caccbbac bc

ab cb caccbbab bc

k 1≦k 1≦ +

So we can tackle this within k 2.≦

1000010 R

1100020 R

0000021 R

Σ(b)Σ(c)Σ(b)>>1

24R

123 R

13R

113 R

114 R

212 R

Σ(a)Σ(b)Σ(c)

*

10000001010101000000

110 m

]5,13[1R ]5,13[2R

Page 47: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

47

Example: T=abcbcaccbb, P=acbcb and k=2.

11000

11100

11110

10101

11010

11000

11100

11110

1 2 3 4 5 6 7 8 9 10

acbcb

12345

a b c b c a c c b b10111

10001

11100

11110

11111

11111

11111

11111

11110

11111

1 2 3 4 5 6 7 8 9 10

acbcb

12345

a b c b c a c c b b11111

11111

011110101001010101010101001101100000010111111

&

v

00111001010010100101

&

the original partial text:

c b c

c c b

transpose “cb” to be “bc”b

c

c

abc bc accbbc bc

abc bc accbba cb

k 1≦k 1≦ +

So we can tackle this within k 2.≦

1000010 R

1100020 R

0000021 R

Σ(c)Σ(b)Σ(c)>>1

25R

124 R

14R

114 R

115 R

213 R

Σ(a)Σ(b)Σ(c)

*

10000001010101000000

110 m

]5,13[1R ]5,13[2R

Page 48: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

48

Example: T=abcbcaccbb, P=acbcb and k=2.

11000

11100

11110

10101

11010

11000

11100

11110

1 2 3 4 5 6 7 8 9 10

acbcb

12345

a b c b c a c c b b10111

10001

11100

11110

11111

11111

11111

11111

11110

11111

1 2 3 4 5 6 7 8 9 10

acbcb

12345

a b c b c a c c b b11111

11111

011110101001010101010101001101100000010111111

&

v

00111001010010100101

&

the original partial text:

c b c

c c b

transpose “cb” to be “bc”b

c

c

abc bc accbbabc bc

abc bc accbbacb cb

k 1≦k 1≦ +

So we can tackle this within k 2.≦

b

b

b

a

a

a

1000010 R

1100020 R

0000021 R

Σ(c)Σ(b)Σ(c)>>1

25R

124 R

14R

114 R

115 R

213 R

Σ(a)Σ(b)Σ(c)

*

10000001010101000000

110 m

]5,13[1R ]5,13[2R

Page 49: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

49

Example: T=abcbcaccbb, P=acbcb and k=2.

11000

11100

11110

10101

11010

11000

11100

11110

1 2 3 4 5 6 7 8 9 10

acbcb

12345

a b c b c a c c b b10111

10001

11100

11110

11111

11111

11111

11111

11110

11111

1 2 3 4 5 6 7 8 9 10

acbcb

12345

a b c b c a c c b b11111

11111

011110010100101101010101001101100000001011111

&

v

00111010100001000010

&

the original partial text:

c b c

c c b

transpose “cb” to be “bc”b

c

c

abcbcac cb bac bc

k 1≦k 1≦ +

So we can tackle this within k 2.≦

b

b

b

a

a

a

1000010 R

1100020 R

0000021 R

Σ(b)Σ(c)Σ(b)>>1

29R

124 R

14R

114 R

115 R

217 R

Σ(a)Σ(b)Σ(c)

*

10000001010101000000

110 m

]5,13[1R ]5,13[2R

Page 50: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

50

Time complexity

)( nw

mkO

k: error value m: the length of patternn : the length of textw : the size of word of computer

Page 51: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

51

Reference

[WM92] Fast text searching: allowing errors, Wu and Udi Manber, Communications of the ACM, Vol. 35, 1992, pp. 83-91

Page 52: 1 Bit-parallel approximate string matching algorithms with transposition Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp. 215-229 Advisor:

52

~ Thanks for your attention ~