Advisor: Prof. R. C. T. Lee Speaker: C. W. Lu

Approximate String Matching Using Compressed Suffix Arrays

Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol. 352, 2006, pp. 240-249


• Let x and y be two strings. The edit distance d(x, y) is the minimum number of character insertions, deletions, and replacements needed to convert x into y.
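The definition above can be checked with the classical dynamic-programming recurrence. The following is a minimal sketch, not taken from the paper; the function name `edit_distance` is chosen here for illustration only.

```python
def edit_distance(x, y):
    """Minimum number of insertions, deletions and replacements turning x into y."""
    prev = list(range(len(y) + 1))          # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        cur = [i]                           # deleting i characters yields the empty string
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,                 # delete cx
                           cur[j - 1] + 1,              # insert cy
                           prev[j - 1] + (cx != cy)))   # replace cx by cy, or match
        prev = cur
    return prev[-1]

print(edit_distance("aba", "abba"))   # 1
print(edit_distance("AAAA", "AGAB"))  # 2
```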

• k-difference string matching problem:

– Given a text T of length n, a pattern P of length m, and an error bound k.

– Find all positions i of T such that there exists a suffix S of T(1, i) with d(S, P) ≦ k.
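To make the problem statement concrete, here is a classical dynamic-programming baseline that scans T once in O(nm) time. It is not the index-based method of the paper; it only serves as a reference for what the output should be. The name `k_difference_ends` is hypothetical.

```python
def k_difference_ends(text, pattern, k):
    """Return all 1-based positions i such that some suffix S of text[:i] has d(S, pattern) <= k."""
    m = len(pattern)
    col = list(range(m + 1))      # column for the empty text prefix: d("", P(1, j)) = j
    ends = []
    for i, t in enumerate(text, start=1):
        new = [0]                 # the empty suffix matches the empty pattern prefix for free
        for j in range(1, m + 1):
            new.append(min(col[j - 1] + (pattern[j - 1] != t),  # match or replace
                           col[j] + 1,                          # text character t is extra
                           new[j - 1] + 1))                     # pattern character is missing
        col = new
        if col[m] <= k:
            ends.append(i)
    return ends

print(k_difference_ends("abbaaa", "aba", 1))  # [2, 3, 4, 5, 6]
```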

• The approach of this paper is as follows:

• Given a pattern P and an error bound k, we generate every string P’ that can be derived from P with at most k errors.

• Then we match each such P’ exactly against T.

• Example:

T=abbaaa,

P=aba and k=1.

From P and k, we generate the following P’s:

ba, aaba, baba, bba, aa, abba, aaa, ab, abaa, abab, abb, aba.

• Then we conduct an exact matching of all P’s against T. Any success indicates that there is a substring S in T such that d(S, P) ≦ k.
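The generate-then-match idea can be sketched by brute force. The helper names `one_edit` and `neighborhood` are illustrative, the alphabet {a, b} is an assumption read off the example, and Python's substring test stands in for the suffix-array search used later.

```python
def one_edit(s, alphabet):
    """All strings obtained from s by exactly one insertion, deletion or replacement."""
    out = set()
    for i in range(len(s) + 1):
        for c in alphabet:
            out.add(s[:i] + c + s[i:])              # insert c before position i
        if i < len(s):
            out.add(s[:i] + s[i + 1:])              # delete s[i]
            for c in alphabet:
                out.add(s[:i] + c + s[i + 1:])      # replace s[i] by c (c == s[i] gives s back)
    return out

def neighborhood(p, k, alphabet):
    """All strings within edit distance k of p."""
    seen = {p}
    frontier = {p}
    for _ in range(k):
        frontier = {q for s in frontier for q in one_edit(s, alphabet)} - seen
        seen |= frontier
    return seen

T = "abbaaa"
variants = neighborhood("aba", 1, "ab")
hits = sorted(v for v in variants if v in T)        # exact match of each P' against T
print(len(variants))  # 12
print(hits)           # ['aa', 'aaa', 'ab', 'abb', 'abba', 'ba', 'bba']
```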

• How can we generate all the P’s that we want?

• We use the following observation.

Let S be a substring of T with S = S1S2, and let P = P1P2.

If Dist(S1, P1) ≦ k and Dist(S2, P2) = 0, then Dist(S, P) ≦ k. (Dist denotes the edit distance d.)

Example:

i:  1  2  3  4  5  6  7  8  9 10 11 12 13
T:  A  C  A  C  A  A  A  A  A  C  A  C  C

i:  1  2  3  4  5  6
P:  A  G  A  B  C  A

Consider the substring S = T(6, 11) = AAAACA.

Let S1 = T(6, 9) = AAAA and S2 = T(10, 11) = CA, and split P into P1 = P(1, 4) = AGAB and P2 = P(5, 6) = CA.

Dist(S1, P1) = 2 ≦ k, and Dist(S2, P2) = 0.

We have Dist(S, P) = 2 ≦ k.

Example (same T and P, with P1 = AGAB and P2 = CA):

Consider the substring S = T(8, 11) = AACA.

Let S1 = T(8, 9) = AA and S2 = T(10, 11) = CA.

Dist(S1, P1) = 2 ≦ k, and Dist(S2, P2) = 0.

We have Dist(S, P) = 2 ≦ k.

• Based upon the above observation, we can generate all edited pattern P’s by editing the prefix and keeping the suffix untouched, in some manner.

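• Consider P = aba and k = 1, over the alphabet {a, b}. At each position i we either apply one edit or leave the character unchanged; the number after each pattern is the number of errors used so far.

i = 1:
  ba (Deletion), k = 1
  aaba (Insertion), k = 1
  baba (Insertion), k = 1
  bba (Substitution), k = 1
  aba (no operation), k = 0

i = 2:
  aa (Deletion), k = 1
  aaba (Insertion), k = 1
  abba (Insertion), k = 1
  aaa (Substitution), k = 1
  aba (no operation), k = 0

i = 3:
  ab (Deletion), k = 1
  abaa (Insertion), k = 1
  abba (Insertion), k = 1
  abb (Substitution), k = 1
  aba (no operation), k = 0

i = 4:
  abaa (Insertion), k = 1
  abab (Insertion), k = 1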

• P = aba, k = 2.

The first level of the tree is the same as for k = 1. Every pattern that has used one error is then expanded again in the same way, starting from its current position. For example, the pattern ba (deletion at i = 1, one error used) continues from i = 2:

i = 2:
  a (Deletion), k = 2
  aba (Insertion), k = 2
  bba (Insertion), k = 2
  aa (Substitution), k = 2
  ba (no operation), k = 1

i = 3:
  b (Deletion), k = 2
  baa (Insertion), k = 2
  bba (Insertion), k = 2
  bb (Substitution), k = 2
  ba (no operation), k = 1

i = 4:
  baa (Insertion), k = 2
  bab (Insertion), k = 2

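• In general, for i = 1 to m+1, the current pattern is P’ = PL’PR’, where PL’ is the edited part to the left of position i and PR’ is the untouched part from position i on:

k’ = Dist(PL’, PL) ≦ k,

Dist(PR’, PR) = 0.

At position i one of the following operations is applied:

– Deletion: the character at position i is removed; k’++.

– Replacement: the character at position i is replaced by a character c; k’++.

– Insertion: a character c is inserted before position i; k’++.

– No operation: the character at position i is kept and appended to PL’.

Terminate if k’ > k.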
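The left-to-right scheme above can be sketched as a recursion over the position i. This is an illustrative reading of the slides, not the authors' code; `edited_patterns` is a hypothetical name and positions are 0-based here.

```python
def edited_patterns(p, k, alphabet):
    """Every P' = PL' + p[i:] reachable with at most k edits applied from left to right."""
    results = set()

    def expand(left, i, used):
        # left is the edited prefix PL'; p[i:] is the untouched suffix PR'.
        results.add(left + p[i:])
        if used == k:
            return
        for j in range(i, len(p) + 1):
            if j < len(p):
                expand(left, j + 1, used + 1)               # delete p[j]
                for c in alphabet:
                    if c != p[j]:
                        expand(left + c, j + 1, used + 1)   # replace p[j] by c
            for c in alphabet:
                expand(left + c, j, used + 1)               # insert c before p[j]
            if j < len(p):
                left += p[j]                                # no operation at j

    expand("", 0, 0)
    return results

print(sorted(edited_patterns("aba", 1, "ab"), key=lambda s: (len(s), s)))
# ['aa', 'ab', 'ba', 'aaa', 'aba', 'abb', 'bba', 'aaba', 'abaa', 'abab', 'abba', 'baba']
```

With k = 2 the same call also yields the second-level patterns listed for ba, such as a, b, bb, baa and bab.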

• Our problem now becomes the following: Given a pattern P, we produce a modified pattern P’. Our job is to determine whether P’ exactly matches some substring of T or not.

• For example, suppose P = aba. One of the modified patterns is ba, so we would like to find out whether ba matches some substring of T exactly.

• This exact matching can be found by using the suffix array and the inverse suffix array.

Suffix Array

• Let T = t0 t1 … tn-1 tn, where t0, t1, …, tn-1 are characters from an alphabet A and tn = $ is a special symbol that is not in A and is smaller than any symbol in A.

• The jth suffix of T is defined as T(j, n) = tj…tn and is denoted by Tj.

• The suffix array SA[0..n] of T is an array of the integers j, each representing the suffix Tj, sorted in lexicographic order of the corresponding suffixes.

Example:

i:  0 1 2 3 4 5 6 7 8 9
T:  G A C A G T T C G $

Suffixes of T:

{GACAGTTCG$, ACAGTTCG$, CAGTTCG$, AGTTCG$, GTTCG$, TTCG$, TCG$, CG$, G$, $}

Lexicographic order:

$, ACAGTTCG$, AGTTCG$, CAGTTCG$, CG$, G$, GACAGTTCG$, GTTCG$, TCG$, TTCG$.

= T9, T1, T3, T2, T7, T8, T0, T4, T6, T5

i:      0 1 2 3 4 5 6 7 8 9
SA[i]:  9 1 3 2 7 8 0 4 6 5
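The example can be reproduced with a naive construction that simply sorts the suffixes. This sketch is far slower than the linear-time constructions the slides cite later and is meant only to check the table; it relies on '$' comparing smaller than the letters, which holds for ASCII.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda j: text[j:])

T = "GACAGTTCG$"
SA = suffix_array(T)
print(SA)  # [9, 1, 3, 2, 7, 8, 0, 4, 6, 5]
```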

Inverse Suffix Array

• The inverse suffix array of T is denoted by SA⁻¹.

• SA⁻¹[i] equals the number of suffixes that are lexicographically smaller than Ti.

Example:

i    SA[i]   suffix T_{SA[i]}   SA⁻¹[i]
0    9       $                  6
1    1       ACAGTTCG$          1
2    3       AGTTCG$            3
3    2       CAGTTCG$           2
4    7       CG$                7
5    8       G$                 9
6    0       GACAGTTCG$         8
7    4       GTTCG$             4
8    6       TCG$               5
9    5       TTCG$              0

SA⁻¹[SA[x]] = x.

SA⁻¹[0] = 6 because there are 6 suffixes smaller than T0 = GACAGTTCG$.
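Since SA⁻¹[SA[x]] = x, the inverse array can be filled in one pass over SA. A small sketch, with `inverse_suffix_array` as an illustrative name and SA copied from the example:

```python
def inverse_suffix_array(sa):
    """isa[j] = rank of suffix T_j, i.e. the number of suffixes smaller than T_j."""
    isa = [0] * len(sa)
    for rank, j in enumerate(sa):
        isa[j] = rank
    return isa

SA = [9, 1, 3, 2, 7, 8, 0, 4, 6, 5]   # suffix array of GACAGTTCG$
ISA = inverse_suffix_array(SA)
print(ISA)     # [6, 1, 3, 2, 7, 9, 8, 4, 5, 0]
print(ISA[0])  # 6
```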

• SA and SA⁻¹ each occupy O(n log n) bits. Both data structures can be constructed in linear time [13, 15, 17].

• In this paper, an interval [st..ed] is called the range of the suffix array of T corresponding to a string P if [st..ed] is the largest interval such that P is a prefix of every suffix Tj for j = SA[st], SA[st+1], …, SA[ed].

We write [st..ed ] = range(T, P).

Example: T = GACAGTTCG$, P = G.

i    SA[i]   suffix T_{SA[i]}
5    8       G$
6    0       GACAGTTCG$
7    4       GTTCG$

G is a prefix of T8, T0 and T4, where

T8 = T_{SA[5]},

T0 = T_{SA[6]},

T4 = T_{SA[7]}.

Hence st = 5, ed = 7, and

range(T, P) = [5..7].
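One straightforward way to compute range(T, P) from scratch is a pair of binary searches over the suffix array, comparing P with the first |P| characters of each suffix. The sketch below takes O(|P| log n) time and is an illustration only; `find_range` is a hypothetical name and P is assumed not to contain '$'.

```python
def suffix_array(text):
    return sorted(range(len(text)), key=lambda j: text[j:])   # naive construction

def find_range(text, sa, p):
    """Return (st, ed) = range(T, p), or None if p does not occur in text."""
    m = len(p)
    lo, hi = 0, len(sa)
    while lo < hi:                       # first suffix whose m-character prefix is >= p
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    st = lo
    hi = len(sa)
    while lo < hi:                       # first suffix whose m-character prefix is > p
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return (st, lo - 1) if st < lo else None

T = "GACAGTTCG$"
SA = suffix_array(T)
print(find_range(T, SA, "G"))   # (5, 7)
print(find_range(T, SA, "GA"))  # (6, 6)
```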

Lemma 1 (Gusfield [12])

Given a text T together with its suffix array, assume [st..ed] = range(T, P). Then, for any character c, the interval [st’..ed’] = range(T, Pc) can be computed in O(log n) time.
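One way to meet the O(log n) bound of Lemma 1 is to binary search inside [st..ed] on the single character that follows P in each suffix, since those characters appear in non-decreasing order within the range. This is a sketch under the assumption that neither P nor c contains '$'; `extend_right` is an illustrative name, and SA is copied from the example.

```python
def extend_right(text, sa, st, ed, plen, c):
    """Given (st, ed) = range(T, P) with |P| = plen, return range(T, Pc) or None."""
    lo, hi = st, ed + 1
    while lo < hi:                       # first j with T[SA[j] + plen] >= c
        mid = (lo + hi) // 2
        if text[sa[mid] + plen] < c:
            lo = mid + 1
        else:
            hi = mid
    new_st = lo
    hi = ed + 1
    while lo < hi:                       # first j with T[SA[j] + plen] > c
        mid = (lo + hi) // 2
        if text[sa[mid] + plen] <= c:
            lo = mid + 1
        else:
            hi = mid
    return (new_st, lo - 1) if new_st < lo else None

T = "GACAGTTCG$"
SA = [9, 1, 3, 2, 7, 8, 0, 4, 6, 5]
print(extend_right(T, SA, 0, 9, 0, "G"))  # (5, 7): range(T, G) from the empty pattern
print(extend_right(T, SA, 5, 7, 1, "A"))  # (6, 6): range(T, GA)
```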

Lemma 2

Given the interval [st1..ed1] = range(T, P1) and the interval [st2..ed2] = range(T, P2), we can find the interval [st..ed] = range(T, P1P2) in O(log n) time using the suffix array and the inverse suffix array of T.

Let [st1..ed1] = range(T , P1),

[st2..ed2] = range(T , P2),

[st..ed] = range(T , P1P2).

[st..ed] is a subinterval of [st1..ed1].

Example: T = GACAGTTCG$, P1 = G, P2 = A.

range(T, P1) = [5..7].

range(T, P1P2) must be within [5..7].

How can we find the exact interval within [5..7]?

• By the definition of the suffix array, the suffixes T_{SA[st1]}, T_{SA[st1+1]}, …, T_{SA[ed1]} are in increasing lexicographic order.

• All of these suffixes start with P1, so after removing P1 the suffixes T_{SA[st1]+|P1|}, T_{SA[st1+1]+|P1|}, …, T_{SA[ed1]+|P1|} are also in increasing lexicographic order.

For example, T2 = CAGTTCG$ and T2+1 = T3 = AGTTCG$.

T2+1 is obtained by deleting the prefix of length 1 from T2.

In general, Ti+1 can be obtained by deleting the prefix of length 1 from Ti.

Example: T = GACAGTTCG$, P1 = G, P2 = A.

range(T, P1) = [5..7], with T_{SA[5]} = T8, T_{SA[6]} = T0 and T_{SA[7]} = T4.

T8 < T0 < T4.

Deleting the first character (|P1| = 1) from each of them gives T8+1, T0+1, T4+1, that is, T9, T1, T5.

T9 < T1 < T5.

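• The lexicographic order of T_{SA[st1]+|P1|}, T_{SA[st1+1]+|P1|}, …, T_{SA[ed1]+|P1|} is also increasing.

• Thus SA⁻¹[SA[st1]+|P1|] < SA⁻¹[SA[st1+1]+|P1|] < … < SA⁻¹[SA[ed1]+|P1|].

• To find st and ed, we find the smallest st in [st1..ed1] such that st2 ≦ SA⁻¹[SA[st]+|P1|] ≦ ed2 and the largest ed in [st1..ed1] such that st2 ≦ SA⁻¹[SA[ed]+|P1|] ≦ ed2. Because the values are increasing, both can be found by binary search.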

Example:

i:  0 1 2 3 4 5 6 7 8 9
T:  G A C A G A T C G $

i    SA[i]   suffix T_{SA[i]}   SA⁻¹[i]
0    9       $                  7
1    1       ACAGATCG$          1
2    3       AGATCG$            4
3    5       ATCG$              2
4    2       CAGATCG$           8
5    7       CG$                3
6    8       G$                 9
7    0       GACAGATCG$         5
8    4       GATCG$             6
9    6       TCG$               0

P1 = G, P2 = A.

range(T, P1) = [6..8], so 6 ≦ st, ed ≦ 8.

range(T, P2) = [1..3].

SA⁻¹[SA[6]+1] = SA⁻¹[9] = 0, and 0 < 1.

SA⁻¹[SA[7]+1] = SA⁻¹[1] = 1, and 1 ≦ 1 ≦ 3.

SA⁻¹[SA[8]+1] = SA⁻¹[5] = 3, and 1 ≦ 3 ≦ 3.

Hence range(T, P1P2) = [st..ed] with st = 7 and ed = 8.
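The search described for Lemma 2 can be sketched as two binary searches over [st1..ed1] on the value SA⁻¹[SA[j] + |P1|]. The function name `concat_range` is illustrative, the suffix array is built naively here, and P1 is assumed not to contain '$' so that SA[j] + |P1| stays inside the text.

```python
def concat_range(sa, isa, r1, len1, r2):
    """Given r1 = range(T, P1), len1 = |P1| and r2 = range(T, P2), return range(T, P1P2) or None."""
    (st1, ed1), (st2, ed2) = r1, r2

    def rank(j):                         # rank of the suffix that remains after stripping P1
        return isa[sa[j] + len1]

    lo, hi = st1, ed1 + 1
    while lo < hi:                       # smallest j in [st1..ed1] with rank(j) >= st2
        mid = (lo + hi) // 2
        if rank(mid) < st2:
            lo = mid + 1
        else:
            hi = mid
    st = lo
    hi = ed1 + 1
    while lo < hi:                       # smallest j in [st1..ed1] with rank(j) > ed2
        mid = (lo + hi) // 2
        if rank(mid) <= ed2:
            lo = mid + 1
        else:
            hi = mid
    return (st, lo - 1) if st < lo else None

T = "GACAGATCG$"
SA = sorted(range(len(T)), key=lambda j: T[j:])
ISA = [0] * len(SA)
for r, j in enumerate(SA):
    ISA[j] = r
print(concat_range(SA, ISA, (6, 8), 1, (1, 3)))  # (7, 8): range(T, GA)
```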

• To find the interval of the first character of P:

We construct an array C such that, for any c in A, C[c] stores the total number of occurrences in T of all characters c’ in A with c’ ≦ c.

If the first character of P is p1 = c, then range(T, p1) = [C[c2]+1 .. C[c]], where c2 is the character immediately before c in A (C[c2] is taken as 0 when c is the smallest character of A).

Example: T = GACAGTTCG$, P = GACAGCA, so p1 = G.

C[A] = 2

C[C] = 4

C[G] = 7

C[T] = 9

range(T, p1)

= [C[C]+1 .. C[G]]

= [5..7].
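The array C and the first-character interval can be computed as follows. This is a sketch; `build_C` and `first_char_range` are illustrative names, and the "+1" relies on '$' occupying index 0 of the suffix array.

```python
from collections import Counter

def build_C(text, alphabet):
    """C[c] = number of characters c' <= c in text, for c' in the alphabet ('$' excluded)."""
    counts = Counter(text)
    C, total = {}, 0
    for c in sorted(alphabet):
        total += counts[c]
        C[c] = total
    return C

def first_char_range(C, alphabet, c):
    """range(T, c) = [C[c2]+1 .. C[c]], with C[c2] = 0 if c is the smallest character."""
    order = sorted(alphabet)
    k = order.index(c)
    lo = C[order[k - 1]] + 1 if k > 0 else 1
    return (lo, C[c]) if lo <= C[c] else None

T = "GACAGTTCG$"
C = build_C(T, "ACGT")
print(C)                                  # {'A': 2, 'C': 4, 'G': 7, 'T': 9}
print(first_char_range(C, "ACGT", "G"))   # (5, 7)
```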

• Lemma 3

Given the suffix array and the inverse suffix array of T, assume [st..ed] = range(T, P). For any character c, if the array C is available in advance, the interval [st’..ed’] = range(T, cP) can be found in O(log n) time.

I. Construct Fst[1..m+1] and Fed[1..m+1] such that [Fst[i]..Fed[i]] = range(T, P[i..m]).

II. Call kapproximate([0..n], 1, 0, ε, ε).

kapproximate([s’..e’], i, k’, PL’, Υ)
begin
  1. Given [Fst[i]..Fed[i]] = range(T, P[i..m]) and [s’..e’] = range(T, PL’), by Lemma 2 find [st..ed] = range(T, PL’P[i..m]).
  2. Report occurrences of P∗ = PL’P[i..m] in [st..ed] if the interval exists.
  3. If (k’ = k) return.
  4. For j := i to m+1
     (a) (when j ≦ m, deletion at j)
         Call kapproximate([s’..e’], j+1, k’+1, PL’, dΥ).
     (b) (when j ≦ m, replacement at j) for each c in A
         i. Given [s’..e’] = range(T, PL’), by Lemma 1 find [s’’..e’’] = range(T, PL’c).
         ii. Call kapproximate([s’’..e’’], j+1, k’+1, PL’c, rΥ).
     (c) (insertion at j) for each c in A
         i. Given [s’..e’] = range(T, PL’), by Lemma 1 find [s’’..e’’] = range(T, PL’c).
         ii. Call kapproximate([s’’..e’’], j, k’+1, PL’c, iΥ).
     (d) (when j ≦ m) Given [s’..e’] = range(T, PL’), by Lemma 1 find [s’’..e’’] = range(T, PL’P[j]).
         s’ := s’’; e’ := e’’; PL’ := PL’P[j]; Υ := uΥ;
end
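The procedure above can be sketched end to end as follows. This is an illustrative reading under several assumptions: the suffix array is built by naive sorting rather than in linear time, F is filled by repeated use of Lemma 1 instead of Lemma 3, positions are 0-based internally, the edit trace Υ is omitted, k < m, and the text ends with '$'. The names `build_index`, `narrow` and `k_difference_index` are hypothetical. The result is reported as 1-based end positions, as in the problem statement.

```python
def build_index(text):
    sa = sorted(range(len(text)), key=lambda j: text[j:])   # naive suffix array
    isa = [0] * len(sa)
    for rank, j in enumerate(sa):
        isa[j] = rank
    return sa, isa

def narrow(lo, hi, key, a, b):
    """Sub-interval of [lo..hi] where a <= key(j) <= b, for non-decreasing key; None if empty."""
    l, h = lo, hi + 1
    while l < h:
        mid = (l + h) // 2
        if key(mid) < a:
            l = mid + 1
        else:
            h = mid
    st, h = l, hi + 1
    while l < h:
        mid = (l + h) // 2
        if key(mid) <= b:
            l = mid + 1
        else:
            h = mid
    return (st, l - 1) if st < l else None

def k_difference_index(text, p, k, alphabet):
    """1-based end positions of substrings of text within edit distance k of p (assumes k < len(p))."""
    sa, isa = build_index(text)
    m = len(p)

    def add_char(r, length, c):          # Lemma 1: range(T, Xc) from r = range(T, X), |X| = length
        return narrow(r[0], r[1], lambda j: text[sa[j] + length], c, c)

    def exact(s):                        # range(T, s) by repeated use of Lemma 1
        r = (0, len(text) - 1)
        for length, c in enumerate(s):
            r = add_char(r, length, c)
            if r is None:
                return None
        return r

    F = [exact(p[i:]) for i in range(m + 1)]    # F[i] = range(T, p[i:])
    ends = set()

    def rec(r, length, i, used):         # r = range(T, PL'), length = |PL'|
        if F[i] is not None:             # Lemma 2: range(T, PL' + p[i:])
            hit = narrow(r[0], r[1], lambda j: isa[sa[j] + length], F[i][0], F[i][1])
            if hit:
                ends.update(sa[j] + length + m - i for j in range(hit[0], hit[1] + 1))
        if used == k:
            return
        for j in range(i, m + 1):
            if j < m:
                rec(r, length, j + 1, used + 1)                 # deletion of p[j]
            for c in alphabet:
                ext = add_char(r, length, c)
                if ext is None:
                    continue
                if j < m and c != p[j]:
                    rec(ext, length + 1, j + 1, used + 1)       # replacement of p[j] by c
                rec(ext, length + 1, j, used + 1)               # insertion of c before p[j]
            if j < m:
                r = add_char(r, length, p[j])                   # no edit at j
                if r is None:
                    return                                      # PL' no longer occurs in T
                length += 1

    rec((0, len(text) - 1), 0, 0, 0)
    return sorted(ends)

print(k_difference_index("abbaaa$", "aba", 1, "ab"))  # [2, 3, 4, 5, 6]
```

Unlike the brute-force enumeration of all P’, this sketch abandons a branch as soon as the edited prefix PL’ stops occurring in T.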

• After O(n)-time preprocessing of the text T into an O(n log n)-bit data structure, the algorithm solves the k-difference problem in O(|A|^k m^k log n + output time) time.

References

• [1] A. Amir, D. Keselman, G.M. Landau, M. Lewenstein, N. Lewenstein, M. Rodeh, Indexing and dictionary matching with one error, in: Proc. Sixth WADS, Lecture Notes in Computer Science, vol. 1663, Springer, Berlin, 1999, pp. 181–192.

• [2] A. Amir, M. Lewenstein, Ely. Porat, Faster algorithms for string matching with k mismatches, in: Proc. 11th Ann. ACM-SIAM Symp. on Discrete Algorithms, 2000, pp. 794–803.

• [3] R.A. Baeza-Yates, G. Navarro, A faster algorithm for approximate string matching, in: Proc. Seventh Ann. Symp. on Combinatorial Pattern Matching (CPM’96), pp. 1–23.

• [4] R.A. Baeza-Yates, G. Navarro, A practical index for text retrieval allowing errors, in: CLEI, vol. 1, November 1997, pp. 273–282.

• [5] R. Boyer, S. Moore, A fast string matching algorithm, CACM 20 (1977) 762–772.

• [6] A.L. Buchsbaum, M.T. Goodrich, J. Westbrook, Range searching over tree cross products, in: ESA 2000, pp. 120–131.

• [7] A. Cobbs, Fast approximate matching using suffix trees, in: Proc. Sixth Ann. Symp. on Combinatorial Pattern Matching (CPM’95), Lecture Notes in Computer Science, vol. 807, Springer, Berlin, 1995, pp. 41–54.

• [8] R. Cole, L.A. Gottlieb, M. Lewenstein, Dictionary matching and indexing with errors and don’t cares, in: Proc. 36th Ann. ACM Symp. on Theory of Computing, 2004, pp. 91–100.

• [9] P. Ferragina, G. Manzini, Opportunistic data structures with applications, in: Proc. 41st IEEE Symp. on Foundations of Computer Science (FOCS’00), 2000, pp. 390–398.

• [10] G. Gonnet, A tutorial introduction to computational biochemistry using Darwin, Technical Report, Informatik E.T.H., Zurich, Switzerland, 1992.

• [11] R. Grossi, J.S. Vitter, Compressed suffix arrays and suffix trees with applications to text indexing and string matching, in: Proc. 32nd ACM Symp. on Theory of Computing, 2000, pp. 397–406.

• [12] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge, 1997.

• [13] W.K. Hon, K. Sadakane, W.K. Sung, Breaking a time-and-space barrier in constructing full-text indices, in: Proc. IEEE Symp. on Foundations of Computer Science, 2003.

• [14] P. Jokinen, E. Ukkonen, Two algorithms for approximate string matching in static texts, in: Proc. MFCS’91, Lecture Notes in Computer Science, vol. 520, Springer, Berlin, 1991, pp. 240–248.

• [15] D.K. Kim, J.S. Sim, H. Park, K. Park, Linear-time construction of suffix arrays, in: CPM 2003, pp. 186–199.

• [16] D.E. Knuth, J. Morris, V. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (1977) 323–350.

• [17] P. Ko, S. Aluru, Space efficient linear time construction of suffix arrays, in: CPM 2003, pp. 200–210.

• [18] G.M. Landau, U. Vishkin, Fast parallel and serial approximate string matching, J. Algorithms 10 (1989) 157–169.

• [19] U. Manber, G. Myers, Suffix arrays: a new method for on-line string searches, SIAM J. Comput. 22 (5) (1993) 935–948.

• [20] E.M. McCreight, A space economical suffix tree construction algorithm, J. ACM 23 (2) (1976) 262–272.

• [21] G. Navarro, A guided tour to approximate string matching, ACM Comput. Surveys 33 (1) (2001) 31–88.

• [22] G. Navarro, R.A. Baeza-Yates, A new indexing method for approximate string matching, in: Proc. 10th Ann. Symp. on Combinatorial Pattern Matching (CPM’99), pp. 163–185.

• [23] G. Navarro, R.A. Baeza-Yates, A hybrid indexing method for approximate string matching, J. Discrete Algorithms 1 (1) (2000) 205–239.

• [24] G. Navarro, R. Baeza-Yates, E. Sutinen, J. Tarhio, Indexing methods for approximate string matching, IEEE Data Eng. Bull. 24 (4) (2001) 19–27.

• [25] G. Navarro, E. Sutinen, J. Tanninen, J. Tarhio, Indexing text with approximate q-grams, in: Proc. 11th Ann. Symp. on Combinatorial Pattern Matching, Lecture Notes in Computer Science, vol. 1848, Springer, Berlin, 2000.

• [26] K. Sadakane, T. Shibuya, Indexing huge genome sequences for solving various problems, Genome Informatics 12 (2001) 175–183.

• [27] F. Shi, Fast approximate string matching with q-blocks sequences, in: Proc. Third South American Workshop on String Processing (WSP’96), Carleton University Press, 1996.

• [28] E. Sutinen, J. Tarhio, Filtration with q-samples in approximate string matching, in: Proc. Seventh Ann. Symp. on Combinatorial Pattern Matching (CPM’96), pp. 50–63.

• [29] E. Ukkonen, Approximate matching over suffix trees, in: Proc. Combinatorial Pattern Matching 1993, vol. 4, Springer, Berlin, June 1993, pp. 228–242.

• [30] R.A. Wagner, M.J. Fischer, The string-to-string correction problem, J. ACM 21 (1974) 168–173.

Thank you!
