Parallel approximate string matching applied to occluded ...
Improved Approximate String Matching and Regular ...
Transcript of Improved Approximate String Matching and Regular ...
Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts
Philip BilleIT University of Copenhagen
Rolf FagerbergUniversity of Southern Denmark
Inge Li GørtzTechnical University of Denmark
Approximate String Matching
• The edit distance between two strings is the minimum number of insertions, deletions, and substitutions needed to convert one string to the other. E.g., edit-distance(“cocoa”, “cola”) = 2.
• Let and be strings and let (integer ) be an error threshold.
• The approximate string matching problem is to find all ending positions of substrings in whose edit distance to is at most .
P Q k
P kQ
> 0
Results
[Sellers1980]
[LV1989]
[CH2002]|Q| = u
Time Space
and
O(um)
O(uk)
O(m)
O(m)
O
�uk4
m+ u
⇥O(m)
|P | = m
Reference
Q = ananas
Z = (0,a)(0,n)(1,n)(1,s)
Z =
a
s n
n0
1 2
34z0 z1 z2 z3 z4
Ziv-Lempel 1978 compression
Approximate String Matching on ZL78 compressed texts
• Let be a string and be a ZL78 compressed representation of a string .
• Given and , the compressed approximate string matching problem is to solve the approximate string matching for and without decompressing .
• Goal: Do it more efficiently than decompressing and using the best (uncompressed) approximate string matching algorithm.
QP Z
P ZP Q Z
Z
Applications
• Textual data bases (e.g. DNA sequence collections) issues:
• Save space = keep data in compressed form.
• Search efficiently.
• Solution: Compressed string matching algorithms.
|Z| = n|P | = m
Results
• Let and .
• Kärkkäinen, Navarro, and Ukkonen [KNU2003]:
• time and space.
• Our result (Theorem 1): For any parameter :
• expected time and
• space.
O(n(� +m + t(m, 2m + 2k, k)) + occ)
O(n/� +m + s(m, 2m + 2k, k)) + occ)
O(nmk + occ) O(nmk)
� � 1
[KNU2003][KNU2003]
LV +
This paperCH +
This paper
� = mk
� = k4 +mO(nk4 + nm + occ)
O� nmk+m + occ
⇥
O
�n
k4 +m+m + occ
⇥
Example Results
Time Space
and
O(nmk + occ)
O(nmk + occ)
O(nmk)
|P | = m |Z| = n
Reference
Z =
Selecting Compression Elements
• For parameter , select a subset of the compression elements of Z such that:
• .
• From any compression element , the distance (minimum number of references) to any compression element in is at most .
� � 1 C
|C| = O(n/�)
ziC 2�
Z =
Selecting Compression Elements
• Maintain using dynamic perfect hashing while scanning from left-to-right.
• Initially, set .
• To process element follow references until we encounter :
• If the distance from to is less than we are done.
• Otherwise ( ), insert element the element at distance into .
C Z
C = {z0}
y � Czi+1
l zi+1 y 2�
l = 2� � C
• Lemma: For any parameter , is constructed in
• expected time and
• space.
Z =
Selecting Compression Elements
� � 1 C
O(n�)
O(n/�)
Q =
Computing Matches
• Strategy:
• Process from left-to-right.
• At we compute all matches ending in the substring encoded by .
Z
zi zi
phrase(zi)phrase(zi�1) . . .. . .
phrase(zi)phrase(zi�1)
rsuf(zi�1)
. . .. . .
rpre(zi)
Q =
Computing Overlapping Matches
• Let be the positions in of .
• Goal: Find all overlapping matches for , i.e., the matches starting before and ending in .
• Decompress substrings and of length around .
• Run favorite (uncompressed) approximate string matching algorithm to find matches of P in . Add offset to these to get the overlapping matches for .
[ui , ui + li � 1] Q phrase(zi)
zi[ui , ui + li � 1]
rpre(zi) rsuf(zi�1) m + k ui
rsuf(zi�1) · rpre(zi)zi
ui
m + k
zi
Computing the Relevant Prefix and Suffix
• For parameter , select a subset of the compression elements of according to Lemma 1.
• For each element in at distance more than from add “shortcut” to element at distance .
� � 1 C Z
C m + k z0m + k
Computing the Relevant Prefix
• Follow references to nearest element in .
• Follow shortcut if present.
• Compute the relevant prefix by decompressing length substring.
m + k
C
m + k
zi
Computing the Relevant Suffix
• Follow references to decompress substring of length .
• If the phrase is shorter than , recursively apply to until we have characters.
m + k
m + k
m + k zi�1 m + k
zi
Analysis
• Time = preprocess + n(find nearest element + decompress + match) =
• Space = preprocess + decompress + match =
O(n� + n(� +m + t(m, 2m + 2k, k))
O(n/� +m + s(m, 2m + 2k, k))
phrase(zi)phrase(zi�1) . . .. . .
rsuf �(zi)
Computing Internal Matches
• Goal: Find all internal matches for , i.e., all matches starting and ending within .
• Compute and store all the internal match sets indexed by compression elements using dynamic perfect hashing.
• Decompress substring of length ending at .
• Internal matches for =
[ui , ui + li � 1]zi
rsuf �(zi) min(li , m + k) ui + li � 1
zireference(zi)
�P rsuf �(zi)(internal matches for ) (matches of in )
Analysis
• Time = n (decompress + match + internal matches) =
• Space = decompress + match + total number of internal matches =
O(n(m + t(m,m + k, k)) + occ)
O(m + s(m,m + k, k) + occ)
Putting the Pieces Together
• Merging overlapping and internal matches we get all matches for ending within .
• Implies Theorem 1: For any parameter :
• expected time and
• space.
• Does not hold for ZLW compressed texts, unless space is used.
• For space the bounds hold in the worst-case and work for both ZL78 and ZLW.
[ui , ui + li � 1]zi
O(n(� +m + t(m, 2m + 2k, k)) + occ)
O(n/� +m + s(m, 2m + 2k, k)) + occ)
� � 1
�(n)
�(n)
Regular Expression Matching
• A regular expression is a generalized pattern composed from simple characters using union, concatenation, and Kleene star.
• Given a regular expression and a string the regular expression matching problem is to find all ending positions of substrings in that matches a string in the language generated by .
R Q
Q
R
Regular Expression Matching
• Let and .
• Classic solution [Thompson1968]: time and space.
• Several improvements based on the Four-Russian technique or word-level parallelism [Myers1992, NR2004, BFC2005, Bille2006].
|R| = m |Q| = u
O(um) O(m)
Compressed Regular Expression Matching
• Let and .
• Navarro [Navarro2003] simplified and without word-level parallel techniques:
• time and space.
• Our result (Theorem 2): For any parameter :
• time and
• space.
• E.g. gives time and space.
O(nm2 + occ ·m logm) O(nm2)
� � 1
O(nm(m + �) + occ ·m logm)
O(nm2/� + nm)
� = m O(nm2 + occ ·m logm) O(nm)
|R| = m |Z| = n
Remarks
• Compressed strings are large and therefore space space may not be feasible for large texts.
• Our result for compressed approximate string matching is one of the very few algorithms for compressed matching that uses space.
• More sublinear space compressed string matching algorithms are needed!
�(n)
o(n)