Advisor: Prof. R. C. T. Lee Speaker: C. W. Lu

Post on 22-Jan-2016


1

Approximate String Matching Using Compressed Suffix Arrays

Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol. 352, 2006, pp. 240-249

Advisor: Prof. R. C. T. Lee

Speaker: C. W. Lu

2

• Let x and y be two strings. The edit distance d(x, y) is the minimum number of character insertions, deletions, and replacements needed to convert string x into y.

• k-difference string matching problem:

– Given a text T of length n, a pattern P of length m, and an error bound k.

– Find all positions i of T such that there exists a suffix S of T(1, i) with d(S, P) ≦ k.
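The edit distance defined above can be computed with the standard dynamic-programming recurrence. The following is a minimal sketch only; the function name `edit_distance` is a hypothetical choice and is not taken from the paper.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and replacements turning x into y."""
    # prev[j] holds the distance between the current prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string needs i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # replace cx by cy, or match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("aba", "ba"))         # one deletion
print(edit_distance("AAAACA", "AGABCA"))  # two replacements
```

The table is filled row by row, so only two rows are kept in memory at any time.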

3

• The approach of this paper is as follows:

• Given a pattern P and an error bound k, we generate every string P’ that can be derived from P with at most k errors.

• Then we conduct an exact match of each such P’ against T.

4

• Example:

T=abbaaa,

P=aba and k=1.

From P and k, we generate the following P’s:

ba, aaba, baba, bba, aa, abba, aaa, ab, abaa, abb, abab, aba.
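The generate-then-match idea of this example can be sketched by brute force. The sketch below assumes the alphabet is {a, b}; the helper names `one_edit` and `neighborhood` are hypothetical, and the plain substring test merely stands in for the index-based exact matching described later.

```python
def one_edit(p: str, alphabet: str) -> set[str]:
    """All strings obtained from p by at most one insertion, deletion or replacement."""
    out = set()
    for i in range(len(p) + 1):
        for c in alphabet:
            out.add(p[:i] + c + p[i:])          # insertion before position i
        if i < len(p):
            out.add(p[:i] + p[i + 1:])          # deletion of p[i]
            for c in alphabet:
                out.add(p[:i] + c + p[i + 1:])  # replacement (c == p[i] gives p itself)
    return out


def neighborhood(p: str, k: int, alphabet: str) -> set[str]:
    """All strings within edit distance k of p."""
    seen, frontier = {p}, {p}
    for _ in range(k):
        frontier = {q for s in frontier for q in one_edit(s, alphabet)} - seen
        seen |= frontier
    return seen


T, P = "abbaaa", "aba"
candidates = neighborhood(P, 1, "ab")
matches = sorted(q for q in candidates if q in T)  # naive exact matching
print(sorted(candidates))
print(matches)
```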

5

• Then we conduct an exact match of every P’ against T. Any success indicates that there is a substring S in T such that d(S, P) ≦ k.

• How can we generate all P’s which we want?

• We use the following observation.

6

Let S be a substring of T with S = S1S2, and let P = P1P2.

If d(S1, P1) ≦ k and Dist(S2, P2) = 0, then d(S, P) ≦ k.

[Figure: text T containing the substring S = S1S2, aligned with the pattern P = P1P2.]

7

Example:

T = A C A C A A A A A C A C C (positions 1 to 13)

P = A G A B C A (positions 1 to 6), with P1 = P(1, 4) = AGAB and P2 = P(5, 6) = CA

k = 2

Consider the substring S = T(6, 11) = AAAACA.

Let S1 = T(6, 9) = AAAA and S2 = T(10, 11) = CA.

Dist(S1, P1) = 2 ≦ k, and Dist(S2, P2) = 0.

We have Dist(S, P) = 2 ≦ k.

8

Example:

T = A C A C A A A A A C A C C (positions 1 to 13)

P = A G A B C A (positions 1 to 6), with P1 = P(1, 4) = AGAB and P2 = P(5, 6) = CA

k = 2

Consider the substring S = T(8, 11) = AACA.

Let S1 = T(8, 9) = AA and S2 = T(10, 11) = CA.

Dist(S1, P1) = 2 ≦ k, and Dist(S2, P2) = 0.

We have Dist(S, P) = 2 ≦ k.

9

• Based upon the above observation, we can generate every edited pattern P’ by editing a prefix of P and keeping the remaining suffix untouched.

• Consider P=aba, k=1.

10

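• P=aba, k=1. Each string below is obtained from P = aba by one operation at position i; the k value beside it is the number of errors used so far.

i = 1: ba (Deletion) k = 1; aaba (Insertion) k = 1; baba (Insertion) k = 1; bba (Substitution) k = 1; aba (no operation) k = 0

i = 2: aa (Deletion) k = 1; aaba (Insertion) k = 1; abba (Insertion) k = 1; aaa (Substitution) k = 1; aba (no operation) k = 0

i = 3: ab (Deletion) k = 1; abaa (Insertion) k = 1; abba (Insertion) k = 1; abb (Substitution) k = 1; aba (no operation) k = 0

i = 4: abaa (Insertion) k = 1; abab (Insertion) k = 1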

11

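• P=aba, k=2. The first operation at position i gives the following strings; the k value beside each string is the number of errors used so far.

i = 1: ba (Deletion) k = 1; aaba (Insertion) k = 1; baba (Insertion) k = 1; bba (Substitution) k = 1; aba (no operation) k = 0

i = 2: aa (Deletion) k = 1; aaba (Insertion) k = 1; abba (Insertion) k = 1; aaa (Substitution) k = 1; aba (no operation) k = 0

i = 3: ab (Deletion) k = 1; abaa (Insertion) k = 1; abba (Insertion) k = 1; abb (Substitution) k = 1; aba (no operation) k = 0

i = 4: abaa (Insertion) k = 1; abab (Insertion) k = 1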

12

• P=aba, k=2. Continuing from ba (k = 1), the second operation at position i gives:

i = 2: a (Deletion) k = 2; aba (Insertion) k = 2; bba (Insertion) k = 2; aa (Substitution) k = 2; ba (no operation) k = 1

i = 3: b (Deletion) k = 2; baa (Insertion) k = 2; bba (Insertion) k = 2; bb (Substitution) k = 2; ba (no operation) k = 1

i = 4: baa (Insertion) k = 2; bab (Insertion) k = 2

13

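For i = 1 to m+1:

Let P = PL PR, where PR starts at position i, and let P’ = PL’ PR’ be the current edited pattern, with

k’ = Dist(PL’, PL) ≦ k,

Dist(PR’, PR) = 0.

At position i, one of the following operations is applied to P’:

• Deletion, k’++

• Replacement, k’++

• Insertion, k’++

• No operation.

[Figure: P = PL PR and P’ = PL’ PR’ split at position i; the panels illustrate deletion, replacement, and insertion at position i.]

Terminate if k’ > k.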
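The loop above can be sketched as a recursive generator that edits a prefix and always keeps the remaining suffix of P untouched. This is an illustrative sketch, not the paper's code: the name `edited_patterns` is hypothetical, positions are 0-based, and replacing a character by itself is skipped because it coincides with "no operation".

```python
def edited_patterns(p: str, k: int, alphabet: str):
    """Yield (pattern, errors_used) pairs; the same pattern may be yielded more than once."""
    m = len(p)

    def rec(left: str, i: int, used: int):
        # left is the edited prefix PL'; p[i:] is the untouched suffix PR'.
        yield left + p[i:], used
        if used == k:
            return
        for j in range(i, m + 1):
            if j < m:
                yield from rec(left, j + 1, used + 1)              # deletion of p[j]
                for c in alphabet:
                    if c != p[j]:
                        yield from rec(left + c, j + 1, used + 1)  # replacement of p[j]
            for c in alphabet:
                yield from rec(left + c, j, used + 1)              # insertion before p[j]
            if j < m:
                left += p[j]                                       # no operation at j

    yield from rec("", 0, 0)


one_error = {q for q, _ in edited_patterns("aba", 1, "ab")}
two_errors = {q for q, _ in edited_patterns("aba", 2, "ab")}
print(sorted(one_error))
print(sorted(two_errors - one_error))
```

For P = aba over {a, b}, the strings produced with k = 1 are those of the tree for k = 1, and the k = 2 run additionally produces strings such as a, b, bb and bab.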

14

• Our problem now becomes the following: given a pattern P, we produce a modified pattern P’. Our job is to determine whether P’ exactly matches some substring of T.

• For example, suppose P=aba. We have ba as one of the modified patterns, so we would like to find out whether ba exactly matches a substring of T.

15

• This exact matching can be performed using the suffix array and the inverse suffix array.

16

Suffix Array

• Let T = t0 t1 … tn-1 tn, where t0, t1, …, tn-1 are characters from an alphabet A and tn = $ is a special symbol that is not in A and is smaller than any symbol in A.

• The jth suffix of T is defined as T(j, n) = tj…tn and is denoted by Tj.

• The suffix array SA[0..n] of T is an array of the integers j that represent the suffixes Tj, sorted in the lexicographic order of the corresponding suffixes.

17

Example:

T G A C A G T T C G $

0 1 2 3 4 5 6 7 8 9

Suffixes of T:

{GACAGTTCG$, ACAGTTCG$, CAGTTCG$, AGTTCG$, GTTCG$, TTCG$, TCG$, CG$, G$, $}

Lexicographic order:

$, ACAGTTCG$, AGTTCG$, CAGTTCG$, CG$, G$, GACAGTTCG$, GTTCG$, TCG$, TTCG$.

= T9, T1, T3, T2, T7, T8, T0, T4, T6, T5

i        0 1 2 3 4 5 6 7 8 9

SA[i]    9 1 3 2 7 8 0 4 6 5
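The suffix array of this example can be reproduced by sorting the suffix start positions directly. This is a naive sketch for illustration only (it is not one of the linear-time constructions); `suffix_array` is a hypothetical name, and the code relies on '$' comparing smaller than the letters, which holds for ASCII.

```python
def suffix_array(t: str) -> list[int]:
    """Start positions of the suffixes of t in lexicographic order (naive sort)."""
    return sorted(range(len(t)), key=lambda j: t[j:])


T = "GACAGTTCG$"
SA = suffix_array(T)
for i, j in enumerate(SA):
    print(i, j, T[j:])
```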

18

Inverse Suffix Array

• The inverse suffix array of T is denoted as SA-1[i].

• SA-1[i] equals the number of suffixes that are lexicographically smaller than Ti.

19

Example:

T G A C A G T T C G $

0 1 2 3 4 5 6 7 8 9

Suffixes in lexicographic order:

i    SA[i]    Suffix
0    9        $ (T9)
1    1        ACAGTTCG$ (T1)
2    3        AGTTCG$ (T3)
3    2        CAGTTCG$ (T2)
4    7        CG$ (T7)
5    8        G$ (T8)
6    0        GACAGTTCG$ (T0)
7    4        GTTCG$ (T4)
8    6        TCG$ (T6)
9    5        TTCG$ (T5)

i          0 1 2 3 4 5 6 7 8 9

SA-1[i]    6 1 3 2 7 9 8 4 5 0

SA-1[SA[x]] = x.

SA-1[0] = 6 because there are 6 suffixes smaller than T0 = GACAGTTCG$.
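The inverse suffix array follows from the suffix array in a single pass, since SA-1[SA[x]] = x. A minimal sketch, with the hypothetical name `inverse_suffix_array` and the suffix array of the example written out literally:

```python
def inverse_suffix_array(sa: list[int]) -> list[int]:
    """inv[j] is the rank of suffix T_j, i.e. the number of suffixes smaller than T_j."""
    inv = [0] * len(sa)
    for rank, j in enumerate(sa):
        inv[j] = rank
    return inv


SA = [9, 1, 3, 2, 7, 8, 0, 4, 6, 5]   # suffix array of GACAGTTCG$
INV = inverse_suffix_array(SA)
print(INV)
```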

20

• SA and SA-1 each occupy O(n log n) bits. Both data structures can be constructed in linear time [13, 15, 17].

21

• In this paper, an interval [st..ed] is called the range of the suffix array of T corresponding to a string P if [st..ed] is the largest interval such that P is a prefix of every suffix Tj for j = SA[st], SA[st+1], …, SA[ed].

We write [st..ed ] = range(T, P).

22

Example:

T G A C A G T T C G $

0 1 2 3 4 5 6 7 8 9

Suffixes in lexicographic order:

i    SA[i]    Suffix
0    9        $ (T9)
1    1        ACAGTTCG$ (T1)
2    3        AGTTCG$ (T3)
3    2        CAGTTCG$ (T2)
4    7        CG$ (T7)
5    8        G$ (T8)
6    0        GACAGTTCG$ (T0)
7    4        GTTCG$ (T4)
8    6        TCG$ (T6)
9    5        TTCG$ (T5)

P = G.

G is a prefix of T8, T0 and T4.

T8 = T_SA[5]

T0 = T_SA[6]

T4 = T_SA[7]

st = 5, ed = 7,

range(T, P) = [5..7].
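One way to compute range(T, P) from scratch is a pair of binary searches over the suffix array, comparing P with the first |P| characters of each suffix. This is a sketch under that assumption; `pattern_range` is a hypothetical name, and it returns `None` when P does not occur.

```python
def pattern_range(t: str, sa: list[int], p: str):
    """Return (st, ed), inclusive, such that p is a prefix of T_SA[st..ed], or None."""
    def head(i):
        # First |p| characters of the suffix at rank i; these are sorted along sa.
        return t[sa[i]:sa[i] + len(p)]

    lo, hi = 0, len(sa)
    while lo < hi:                 # first rank whose head is >= p
        mid = (lo + hi) // 2
        if head(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    st, hi = lo, len(sa)
    while lo < hi:                 # first rank whose head is > p
        mid = (lo + hi) // 2
        if head(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    return (st, lo - 1) if st < lo else None


T = "GACAGTTCG$"
SA = sorted(range(len(T)), key=lambda j: T[j:])  # naive suffix array
print(pattern_range(T, SA, "G"))
```

Each comparison costs up to |P| character steps, so this direct search takes O(|P| log n) time.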

23

Lemma 1 (Gusfield [12])

Given a text T together with its suffix array, assume [st..ed] = range(T, P). Then, for any character c, the interval [st’..ed’] = range(T, Pc) can be computed in O(log n) time.
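A plausible way to realise Lemma 1 is to binary-search, inside [st..ed], on the single character that follows P in each suffix; those characters are sorted because the suffixes are. The sketch below assumes this approach; `extend_right` and its parameter `plen` (the length of P) are hypothetical names.

```python
def extend_right(t: str, sa: list[int], st: int, ed: int, plen: int, c: str):
    """Given [st..ed] = range(T, P) with |P| = plen, return range(T, Pc) or None."""
    def nxt(i):
        # Character right after P in the suffix at rank i. The index is valid
        # because every suffix ends with '$', which never occurs inside P.
        return t[sa[i] + plen]

    lo, hi = st, ed + 1
    while lo < hi:                 # first rank whose next character is >= c
        mid = (lo + hi) // 2
        if nxt(mid) < c:
            lo = mid + 1
        else:
            hi = mid
    new_st, hi = lo, ed + 1
    while lo < hi:                 # first rank whose next character is > c
        mid = (lo + hi) // 2
        if nxt(mid) <= c:
            lo = mid + 1
        else:
            hi = mid
    return (new_st, lo - 1) if new_st < lo else None


T = "GACAGTTCG$"
SA = sorted(range(len(T)), key=lambda j: T[j:])  # naive suffix array
print(extend_right(T, SA, 5, 7, 1, "A"))         # from range(T, G) to range(T, GA)
```

Only two binary searches with constant-time comparisons are needed, which gives the O(log n) bound.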

24

Lemma 2

Given the interval [st1..ed1] = range(T , P1) and the interval [st2..ed2] = range(T , P2), we can find the interval [st..ed] = range(T , P1P2) in O(logn) time using the suffix array and the inverse suffix array of T.

25

Let [st1..ed1] = range(T , P1),

[st2..ed2] = range(T , P2),

[st..ed] = range(T , P1P2).

[st..ed] is a subinterval of [st1..ed1].

26

Example:

T G A C A G T T C G $

0 1 2 3 4 5 6 7 8 9

Lexicographic order: $ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3), CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0), GTTCG$ (T4), TCG$ (T6), TTCG$ (T5).

i        0 1 2 3 4 5 6 7 8 9

SA[i]    9 1 3 2 7 8 0 4 6 5

P1 = G. P2 = A.

range(T, P1) = [5..7].

range(T, P1P2) must be within [5..7].

How can we find the exact interval within [5..7]?

27

• By the definition of the suffix array, the suffixes

\[ T_{SA[st_1]},\ T_{SA[st_1+1]},\ \ldots,\ T_{SA[ed_1]} \]

are in increasing lexicographic order.

• The suffixes

\[ T_{SA[st_1]+|P_1|},\ T_{SA[st_1+1]+|P_1|},\ \ldots,\ T_{SA[ed_1]+|P_1|} \]

are also in increasing lexicographic order.

28

Lexicographic order: $ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3), CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0), GTTCG$ (T4), TCG$ (T6), TTCG$ (T5).

T2 = CAGTTCG$

T2+1 = T3 = AGTTCG$

T2+1 is obtained by deleting the prefix with length 1 from T2.

In general, Ti+1 can be obtained by deleting the prefix with length 1 from Ti.

29

Example:

T G A C A G T T C G $

0 1 2 3 4 5 6 7 8 9

Lexicographic order: $ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3), CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0), GTTCG$ (T4), TCG$ (T6), TTCG$ (T5).

i        0 1 2 3 4 5 6 7 8 9

SA[i]    9 1 3 2 7 8 0 4 6 5

P1 = G. P2 = A.

range(T, P1) = [5..7].

The suffixes \( T_{SA[st_1]}, T_{SA[st_1+1]}, \ldots, T_{SA[ed_1]} \) are T8 < T0 < T4.

The suffixes \( T_{SA[st_1]+|P_1|}, T_{SA[st_1+1]+|P_1|}, \ldots, T_{SA[ed_1]+|P_1|} \) are T8+1, T0+1, T4+1, that is, T9 < T1 < T5.

30

• The suffixes

\[ T_{SA[st_1]+|P_1|},\ T_{SA[st_1+1]+|P_1|},\ \ldots,\ T_{SA[ed_1]+|P_1|} \]

are also in increasing lexicographic order.

• Thus

\[ SA^{-1}[SA[st_1]+|P_1|] < SA^{-1}[SA[st_1+1]+|P_1|] < \cdots < SA^{-1}[SA[ed_1]+|P_1|]. \]

• To find st and ed, we find the smallest st such that

\[ st_2 \le SA^{-1}[SA[st]+|P_1|] \le ed_2 \]

and the largest ed such that

\[ st_2 \le SA^{-1}[SA[ed]+|P_1|] \le ed_2. \]

31

Example:

T G A C A G A T C G $

0 1 2 3 4 5 6 7 8 9

Suffixes in lexicographic order:

i    SA[i]    Suffix
0    9        $ (T9)
1    1        ACAGATCG$ (T1)
2    3        AGATCG$ (T3)
3    5        ATCG$ (T5)
4    2        CAGATCG$ (T2)
5    7        CG$ (T7)
6    8        G$ (T8)
7    0        GACAGATCG$ (T0)
8    4        GATCG$ (T4)
9    6        TCG$ (T6)

i          0 1 2 3 4 5 6 7 8 9

SA-1[i]    7 1 4 2 8 3 9 5 6 0

P1 = G. P2 = A.

range(T, P1) = [6..8], so 6 ≦ st, ed ≦ 8.

range(T, P2) = [1..3].

range(T, P1P2) = [st..ed].

SA-1[SA[6]+1] = SA-1[9] = 0, and 0 < 1, so index 6 is excluded.

SA-1[SA[7]+1] = SA-1[1] = 1, and 1 ≦ 1 ≦ 3.

SA-1[SA[8]+1] = SA-1[5] = 3, and 1 ≦ 3 ≦ 3.

Therefore st = 7 and ed = 8.
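The search of Lemma 2 can be sketched as two binary searches over [st1..ed1], using the rank SA-1[SA[i] + |P1|] as the key. The function name `concat_range` is hypothetical and the suffix array is built naively; the example reproduces the computation for T = GACAGATCG$, P1 = G and P2 = A.

```python
def concat_range(sa, inv, r1, len1, r2):
    """Given r1 = range(T, P1), len1 = |P1| and r2 = range(T, P2), return range(T, P1P2) or None."""
    (st1, ed1), (st2, ed2) = r1, r2

    def rank_after(i):
        # Rank of the suffix left after removing P1 from the suffix at rank i.
        return inv[sa[i] + len1]

    lo, hi = st1, ed1 + 1
    while lo < hi:                 # smallest index whose rank is >= st2
        mid = (lo + hi) // 2
        if rank_after(mid) < st2:
            lo = mid + 1
        else:
            hi = mid
    st, hi = lo, ed1 + 1
    while lo < hi:                 # smallest index whose rank is > ed2
        mid = (lo + hi) // 2
        if rank_after(mid) <= ed2:
            lo = mid + 1
        else:
            hi = mid
    return (st, lo - 1) if st < lo else None


T = "GACAGATCG$"
SA = sorted(range(len(T)), key=lambda j: T[j:])  # naive suffix array
INV = [0] * len(SA)
for rank, j in enumerate(SA):
    INV[j] = rank

print(concat_range(SA, INV, (6, 8), 1, (1, 3)))  # range(T, GA)
```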

32

• To find the interval of the first character of P:

We construct an array C such that, for any c in A, C[c] stores the total number of occurrences in T of all characters c’ in A with c’ ≦ c.

range(T, p1) = [C[c2]+1 … C[c]], where c = p1 is the first character of P and c2 is the character immediately before c in A (if c is the smallest character of A, C[c2] is taken as 0).

33

Example:

T G A C A G T T C G $

0 1 2 3 4 5 6 7 8 9

Lexicographic order: $ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3), CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0), GTTCG$ (T4), TCG$ (T6), TTCG$ (T5).

i        0 1 2 3 4 5 6 7 8 9

SA[i]    9 1 3 2 7 8 0 4 6 5

P = GACAGCA

C[A] = 2

C[C] = 4

C[G] = 7

C[T] = 9

range(T, p1)

= [C[C]+1…C[G]]

= [5…7].
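The array C and the resulting single-character range can be sketched as follows. The names `build_c` and `first_char_range` are hypothetical; the sentinel '$' is not counted in C, which is why the interval starts at C[c2]+1 (index 0 of the suffix array is taken by the suffix $).

```python
from collections import Counter


def build_c(t: str, alphabet: str) -> dict[str, int]:
    """C[c] = number of occurrences in t of all alphabet characters c' <= c."""
    counts, total, c_table = Counter(t), 0, {}
    for ch in sorted(alphabet):
        total += counts[ch]
        c_table[ch] = total
    return c_table


def first_char_range(c_table: dict[str, int], alphabet: str, c: str):
    """range(T, c) = [C[c2]+1 .. C[c]], where c2 precedes c in the alphabet."""
    order = sorted(alphabet)
    k = order.index(c)
    before = c_table[order[k - 1]] if k > 0 else 0
    return (before + 1, c_table[c]) if c_table[c] > before else None


C = build_c("GACAGTTCG$", "ACGT")
print(C)
print(first_char_range(C, "ACGT", "G"))
```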

34

• Lemma 3

Given the suffix array and the inverse suffix array of T, assume [st..ed] = range(T, P). For any character c, if the array C has been computed in advance, we can find the interval [st’..ed’] = range(T, cP) in O(log n) time.
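One way to obtain Lemma 3, assuming it is derived by combining the array C with the search of Lemma 2, is to take range(T, c) from C and then keep the ranks i in it for which SA-1[SA[i]+1] lies in [st..ed]. The sketch below follows that assumption; `build_tables` and `prepend_char` are hypothetical names and the suffix array is built naively.

```python
from collections import Counter


def build_tables(t: str, alphabet: str):
    """Naive suffix array, inverse suffix array and array C for t (t ends with '$')."""
    sa = sorted(range(len(t)), key=lambda j: t[j:])
    inv = [0] * len(sa)
    for rank, j in enumerate(sa):
        inv[j] = rank
    counts, total, c_table = Counter(t), 0, {}
    for ch in sorted(alphabet):
        total += counts[ch]
        c_table[ch] = total
    return sa, inv, c_table


def prepend_char(sa, inv, c_table, alphabet, rng, c):
    """Given rng = range(T, P), return range(T, cP) or None."""
    st, ed = rng
    order = sorted(alphabet)
    k = order.index(c)
    lo = (c_table[order[k - 1]] if k > 0 else 0) + 1   # range(T, c) = [lo..last]
    last = c_table[c]
    hi = last + 1
    while lo < hi:                 # first rank whose suffix, minus c, has rank >= st
        mid = (lo + hi) // 2
        if inv[sa[mid] + 1] < st:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, last + 1
    while lo < hi:                 # first rank whose suffix, minus c, has rank > ed
        mid = (lo + hi) // 2
        if inv[sa[mid] + 1] <= ed:
            lo = mid + 1
        else:
            hi = mid
    return (first, lo - 1) if first < lo else None


SA, INV, C = build_tables("GACAGTTCG$", "ACGT")
print(prepend_char(SA, INV, C, "ACGT", (1, 2), "C"))   # range(T, A) -> range(T, CA)
```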

35

I. Construct Fst[1..m+1] and Fed[1..m+1] such that [Fst[i]..Fed[i]] = range(T, P[i..m]).

II. Call kapproximate([0..n], 1, 0, ε, ε).

kapproximate([s’..e’], i, k’, PL’, Υ)
begin
  1. Given [Fst[i]..Fed[i]] = range(T, P[i..m]) and [s’..e’] = range(T, PL’), by Lemma 2 find [st..ed] = range(T, PL’P[i..m]).
  2. Report occurrences of P∗ = PL’P[i..m] in [st..ed] if the interval exists.
  3. If (k’ = k) return.
  4. For j := i to m+1
     (a) (when j ≦ m, deletion at j)
         Call kapproximate([s’..e’], j+1, k’+1, PL’, dΥ).
     (b) (when j ≦ m, replacement at j) for each c in A
         i. Given [s’..e’] = range(T, PL’), by Lemma 1 find [s’’..e’’] = range(T, PL’c).
         ii. Call kapproximate([s’’..e’’], j+1, k’+1, PL’c, rΥ).
     (c) (insertion at j) for each c in A
         i. Given [s’..e’] = range(T, PL’), by Lemma 1 find [s’’..e’’] = range(T, PL’c).
         ii. Call kapproximate([s’’..e’’], j, k’+1, PL’c, iΥ).
     (d) (when j ≦ m) Given [s’..e’] = range(T, PL’), by Lemma 1 find [s’’..e’’] = range(T, PL’P[j]).
         s’ := s’’; e’ := e’’; PL’ := PL’P[j]; Υ := uΥ;
end
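The procedure above can be sketched in Python as follows. This is an illustrative approximation rather than the authors' implementation: the suffix array is built by naive sorting, the ranges of the suffixes of P are obtained by chaining Lemma 2, a replacement by the same character is skipped, recursion stops as soon as the edited prefix no longer occurs in T, and the edit trace Υ is omitted. The names `narrow` and `k_difference` are hypothetical, positions are 0-based, and each result is a (start, length) pair.

```python
def narrow(st, ed, key, low, high):
    """Largest sub-interval of [st..ed] whose keys fall in [low, high].
    Assumes key(i) is non-decreasing over the interval."""
    lo, hi = st, ed + 1
    while lo < hi:                       # first index with key >= low
        mid = (lo + hi) // 2
        if key(mid) < low:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, ed + 1
    while lo < hi:                       # first index with key > high
        mid = (lo + hi) // 2
        if key(mid) <= high:
            lo = mid + 1
        else:
            hi = mid
    return (first, lo - 1) if first < lo else None


def k_difference(t, p, k, alphabet):
    """Sorted (start, length) pairs of substrings of t found within k edits of p.
    t must end with '$', which must not occur in p or in alphabet."""
    sa = sorted(range(len(t)), key=lambda j: t[j:])      # naive suffix array
    inv = [0] * len(sa)
    for rank, j in enumerate(sa):
        inv[j] = rank
    m, whole = len(p), (0, len(t) - 1)

    def append(rng, plen, c):                            # Lemma 1: range(T, P'c)
        if rng is None:
            return None
        return narrow(rng[0], rng[1], lambda i: t[sa[i] + plen], c, c)

    def concat(r1, len1, r2):                            # Lemma 2: range(T, P1P2)
        if r1 is None or r2 is None:
            return None
        return narrow(r1[0], r1[1], lambda i: inv[sa[i] + len1], r2[0], r2[1])

    # f[i] = range(T, p[i:]); f[m] is the range of the empty string.
    f = [None] * m + [whole]
    for i in range(m - 1, -1, -1):
        f[i] = concat(append(whole, 0, p[i]), 1, f[i + 1])

    hits = set()

    def rec(rng, i, used, left_len):
        if rng is None:                                  # edited prefix absent from T
            return
        found = concat(rng, left_len, f[i])              # step 1
        if found:                                        # step 2
            length = left_len + m - i
            hits.update((sa[x], length) for x in range(found[0], found[1] + 1))
        if used == k:                                    # step 3
            return
        for j in range(i, m + 1):                        # step 4
            if j < m:
                rec(rng, j + 1, used + 1, left_len)      # (a) deletion of p[j]
            for c in alphabet:
                ext = append(rng, left_len, c)
                if j < m and c != p[j]:
                    rec(ext, j + 1, used + 1, left_len + 1)   # (b) replacement
                rec(ext, j, used + 1, left_len + 1)           # (c) insertion
            if j < m:                                    # (d) keep p[j] unchanged
                rng = append(rng, left_len, p[j])
                left_len += 1
                if rng is None:
                    return

    rec(whole, 0, 0, 0)
    return sorted(hits)


print(k_difference("abbaaa$", "aba", 1, "ab"))
```

For T = abbaaa, P = aba and k = 1, the reported substrings are ab, abb, abba, bba, ba, aa (twice) and aaa. An end position in the sense of the k-difference problem is start + length.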

36

• After O(n)-time preprocessing of the text T into an O(n log n)-bit data structure, the algorithm solves the k-difference problem in O(|A|^k m^k log n + output time) time.

37

References

• [1] A. Amir, D. Keselman, G.M. Landau, M. Lewenstein, N. Lewenstein, M. Rodeh, Indexing and dictionary matching with one error, in: Proc. Sixth WADS, Lecture Notes in Computer Science, vol. 1663, Springer, Berlin, 1999, pp. 181–192.

• [2] A. Amir, M. Lewenstein, Ely. Porat, Faster algorithms for string matching with k mismatches, in: Proc. 11th Ann. ACM-SIAM Symp. on Discrete Algorithms, 2000, pp. 794–803.

• [3] R.A. Baeza-Yates, G. Navarro, A faster algorithm for approximate string matching, in: Proc. Seventh Ann. Symp. on Combinatorial Pattern Matching (CPM’96), pp. 1–23.

• [4] R.A. Baeza-Yates, G. Navarro, A practical index for text retrieval allowing errors, in: CLEI, vol. 1, November 1997, pp. 273–282.

• [5] R. Boyer, S. Moore, A fast string matching algorithm, CACM 20 (1977) 762–772.

• [6] A.L. Buchsbaum, M.T. Goodrich, J. Westbrook, Range searching over tree cross products, in: ESA 2000, pp. 120–131.

• [7] A. Cobbs, Fast approximate matching using suffix trees, in: Proc. Sixth Ann. Symp. on Combinatorial Pattern Matching (CPM’95), Lecture Notes in Computer Science, vol. 807, Springer, Berlin, 1995, pp. 41–54.

• [8] R. Cole, L.A. Gottlieb, M. Lewenstein, Dictionary matching and indexing with errors and don’t cares, in: Proc. 36th Ann. ACM Symp. on Theory of Computing, 2004, pp. 91–100.

• [9] P. Ferragina, G. Manzini, Opportunistic data structures with applications, in: Proc. 41st IEEE Symp. on Foundations of Computer Science (FOCS’00), 2000, pp. 390–398.

38

• [10] G. Gonnet, A tutorial introduction to computational biochemistry using Darwin, Technical Report, Informatik E.T.H., Zurich, Switzerland, 1992.

• [11] R. Grossi, J.S. Vitter, Compressed suffix arrays and suffix trees with applications to text indexing and string matching, in: Proc. 32nd ACM Symp. on Theory of Computing, 2000, pp. 397–406.

• [12] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge, 1997.

• [13] W.K. Hon, K. Sadakane, W.K. Sung, Breaking a time-and-space barrier in constructing full-text indices, in: Proc. IEEE Symp. on Foundations of Computer Science, 2003.

• [14] P. Jokinen, E. Ukkonen, Two algorithms for approximate string matching in static texts, in: Proc. MFCS’91, Lecture Notes in Computer Science, vol. 520, Springer, Berlin, 1991, pp. 240–248.

• [15] D.K. Kim, J.S. Sim, H. Park, K. Park, Linear-time construction of suffix arrays, in: CPM 2003, pp. 186–199.

• [16] D.E. Knuth, J. Morris, V. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (1977) 323–350.

• [17] P. Ko, S. Aluru, Space efficient linear time construction of suffix arrays, in: CPM 2003, pp. 200–210.

• [18] G.M. Landau, U. Vishkin, Fast parallel and serial approximate string matching, J. Algorithms 10 (1989) 157–169.

• [19] U. Manber, G. Myers, Suffix arrays: a new method for on-line string searches, SIAM J. Comput. 22 (5) (1993) 935–948.

39

• [20] E.M. McCreight, A space economical suffix tree construction algorithm, J. ACM 23 (2) (1976) 262–272.

• [21] G. Navarro, A guided tour to approximate string matching, ACM Comput. Surveys 33 (1) (2001) 31–88.

• [22] G. Navarro, R.A. Baeza-Yates, A new indexing method for approximate string matching, in: Proc. 10th Ann. Symp. on Combinatorial Pattern Matching (CPM’99), pp. 163–185.

• [23] G. Navarro, R.A. Baeza-Yates, A hybrid indexing method for approximate string matching, J. Discrete Algorithms 1 (1) (2000) 205–239 18.

• [24] G. Navarro, R. Baeza-Yates, E. Sutinen, J. Tarhio, Indexing methods for approximate string matching, IEEE Data Eng. Bull. 24 (4) (2001) 19–27.

• [25] G. Navarro, E. Sutinen, J. Tanninen, J. Tarhio, Indexing text with approximate q-grams, in: Proc. 11th Ann. Symp. on Combinatorial Pattern Matching, Lecture Notes in Computer Science, vol. 1848, Springer, Berlin, 2000.

• [26] K. Sadakane, T. Shibuya, Indexing huge genome sequences for solving various problems, Genome Informatics 12 (2001) 175–183.

• [27] F. Shi, Fast approximate string matching with q-blocks sequences, in: Proc. Third South American Workshop on String Processing (WSP’96), Carleton University Press, 1996.

• [28] E. Sutinen, J. Tarhio, Filtration with q-samples in approximate string matching, in: Proc. Seventh Ann. Symp. on Combinatorial Pattern Matching (CPM’96), pp. 50–63.

• [29] E. Ukkonen, Approximate matching over suffix trees, in: Proc. Combinatorial Pattern Matching 1993, vol. 4, Springer, Berlin, June 1993, pp. 228–242.

• [30] R.A. Wagner, M.J. Fischer, The string-to-string correction problem, J. ACM 21 (1974) 168–173.

40

Thank you!