Improved Approximate String Matching and Regular ...

Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts

Philip BilleIT University of Copenhagen

Rolf FagerbergUniversity of Southern Denmark

Inge Li GørtzTechnical University of Denmark

Approximate String Matching

• The edit distance between two strings is the minimum number of insertions, deletions, and substitutions needed to convert one string to the other. E.g., edit-distance(“cocoa”, “cola”) = 2.

• Let and be strings and let (integer ) be an error threshold.

• The approximate string matching problem is to find all ending positions of substrings in whose edit distance to is at most .

P Q k

P kQ

> 0

Results

[Sellers1980]

[LV1989]

[CH2002]|Q| = u

Time Space

and

O(um)

O(uk)

O(m)

O(m)

O

�uk4

m+ u

⇥O(m)

|P | = m

Reference

Q = ananas

Z = (0,a)(0,n)(1,n)(1,s)

Z =

a

s n

n0

1 2

34z0 z1 z2 z3 z4

Ziv-Lempel 1978 compression

Approximate String Matching on ZL78 compressed texts

• Let be a string and be a ZL78 compressed representation of a string .

• Given and , the compressed approximate string matching problem is to solve the approximate string matching for and without decompressing .

• Goal: Do it more efficiently than decompressing and using the best (uncompressed) approximate string matching algorithm.

QP Z

P ZP Q Z

Z

Applications

• Textual data bases (e.g. DNA sequence collections) issues:

• Save space = keep data in compressed form.

• Search efficiently.

• Solution: Compressed string matching algorithms.

|Z| = n|P | = m

Results

• Let and .

• Kärkkäinen, Navarro, and Ukkonen [KNU2003]:

• time and space.

• Our result (Theorem 1): For any parameter :

• expected time and

• space.

O(n(� +m + t(m, 2m + 2k, k)) + occ)

O(n/� +m + s(m, 2m + 2k, k)) + occ)

O(nmk + occ) O(nmk)

� � 1

[KNU2003][KNU2003]

LV +

This paperCH +

This paper

� = mk

� = k4 +mO(nk4 + nm + occ)

O� nmk+m + occ

⇥

O

�n

k4 +m+m + occ

⇥

Example Results

Time Space

and

O(nmk + occ)

O(nmk + occ)

O(nmk)

|P | = m |Z| = n

Reference

Z =

Selecting Compression Elements

• For parameter , select a subset of the compression elements of Z such that:

• .

• From any compression element , the distance (minimum number of references) to any compression element in is at most .

� � 1 C

|C| = O(n/�)

ziC 2�

Z =


• Maintain using dynamic perfect hashing while scanning from left-to-right.

• Initially, set .

• To process element follow references until we encounter :

• If the distance from to is less than we are done.

• Otherwise ( ), insert element the element at distance into .

C Z

C = {z0}

y � Czi+1

l zi+1 y 2�

l = 2� � C

• Lemma: For any parameter , is constructed in


• space.

Z =


� � 1 C

O(n�)

O(n/�)

Q =

Computing Matches

• Strategy:

• Process from left-to-right.

• At we compute all matches ending in the substring encoded by .

Z

zi zi

phrase(zi)phrase(zi�1) . . .. . .

phrase(zi)phrase(zi�1)

rsuf(zi�1)

. . .. . .

rpre(zi)

Q =

Computing Overlapping Matches

• Let be the positions in of .

• Goal: Find all overlapping matches for , i.e., the matches starting before and ending in .

• Decompress substrings and of length around .

• Run favorite (uncompressed) approximate string matching algorithm to find matches of P in . Add offset to these to get the overlapping matches for .

[ui , ui + li � 1] Q phrase(zi)

zi[ui , ui + li � 1]

rpre(zi) rsuf(zi�1) m + k ui

rsuf(zi�1) · rpre(zi)zi

ui

m + k

zi

Computing the Relevant Prefix and Suffix

• For parameter , select a subset of the compression elements of according to Lemma 1.

• For each element in at distance more than from add “shortcut” to element at distance .

� � 1 C Z

C m + k z0m + k

Computing the Relevant Prefix

• Follow references to nearest element in .

• Follow shortcut if present.

• Compute the relevant prefix by decompressing length substring.

m + k

C

m + k

zi

Computing the Relevant Suffix

• Follow references to decompress substring of length .

• If the phrase is shorter than , recursively apply to until we have characters.

m + k

m + k

m + k zi�1 m + k

zi

Analysis

• Time = preprocess + n(find nearest element + decompress + match) =

• Space = preprocess + decompress + match =

O(n� + n(� +m + t(m, 2m + 2k, k))

O(n/� +m + s(m, 2m + 2k, k))

phrase(zi)phrase(zi�1) . . .. . .

rsuf �(zi)

Computing Internal Matches

• Goal: Find all internal matches for , i.e., all matches starting and ending within .

• Compute and store all the internal match sets indexed by compression elements using dynamic perfect hashing.

• Decompress substring of length ending at .

• Internal matches for =

[ui , ui + li � 1]zi

rsuf �(zi) min(li , m + k) ui + li � 1

zireference(zi)

�P rsuf �(zi)(internal matches for ) (matches of in )

Analysis

• Time = n (decompress + match + internal matches) =

• Space = decompress + match + total number of internal matches =

O(n(m + t(m,m + k, k)) + occ)

O(m + s(m,m + k, k) + occ)

Putting the Pieces Together

• Merging overlapping and internal matches we get all matches for ending within .

• Implies Theorem 1: For any parameter :


• space.

• Does not hold for ZLW compressed texts, unless space is used.

• For space the bounds hold in the worst-case and work for both ZL78 and ZLW.

[ui , ui + li � 1]zi

O(n(� +m + t(m, 2m + 2k, k)) + occ)

O(n/� +m + s(m, 2m + 2k, k)) + occ)

� � 1

�(n)

�(n)

Regular Expression Matching

• A regular expression is a generalized pattern composed from simple characters using union, concatenation, and Kleene star.

• Given a regular expression and a string the regular expression matching problem is to find all ending positions of substrings in that matches a string in the language generated by .

R Q

Q

R

Regular Expression Matching

• Let and .

• Classic solution [Thompson1968]: time and space.

• Several improvements based on the Four-Russian technique or word-level parallelism [Myers1992, NR2004, BFC2005, Bille2006].

|R| = m |Q| = u

O(um) O(m)

Compressed Regular Expression Matching

• Let and .

• Navarro [Navarro2003] simplified and without word-level parallel techniques:

• time and space.

• Our result (Theorem 2): For any parameter :

• time and

• space.

• E.g. gives time and space.

O(nm2 + occ ·m logm) O(nm2)

� � 1

O(nm(m + �) + occ ·m logm)

O(nm2/� + nm)

� = m O(nm2 + occ ·m logm) O(nm)

|R| = m |Z| = n

Remarks

• Compressed strings are large and therefore space space may not be feasible for large texts.

• Our result for compressed approximate string matching is one of the very few algorithms for compressed matching that uses space.

• More sublinear space compressed string matching algorithms are needed!

�(n)

o(n)

Improved Approximate String Matching and Regular ...

Documents

Transcript of Improved Approximate String Matching and Regular ...