Improved Approximate String Matching and Regular...
Transcript of Improved Approximate String Matching and Regular...
Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts
Philip BilleIT University of Copenhagen
Rolf FagerbergUniversity of Southern Denmark
Inge Li GørtzTechnical University of Denmark
Approximate String Matching
• The edit distance between two strings is the minimum number of insertions, deletions, and substitutions needed to convert one string to the other. E.g., edit-distance(“cocoa”, “cola”) = 2.
• Let and be strings and let (integer ) be an error threshold.
• The approximate string matching problem is to find all ending positions of substrings in whose edit distance to is at most .
P Q k
P kQ
> 0
Results
[Sellers1980]
[LV1989]
[CH2002]
|Q| = u
Time Space
and
O(um)
O(uk)
O(m)
O(m)
O
!uk4
m+ u
"O(m)
|P | = m
Reference
Q = ananas
Z = (0,a)(0,n)(1,n)(1,s)
Z =
a
s n
n0
1 2
34z0 z1 z2 z3 z4
Ziv-Lempel 1978 compression
Approximate String Matching on ZL78 compressed texts
• Let be a string and be a ZL78 compressed representation of a string .
• Given and , the compressed approximate string matching problem is to solve the approximate string matching for and without decompressing .
• Goal: Do it more efficiently than decompressing and using the best (uncompressed) approximate string matching algorithm.
QP Z
P ZP Q Z
Z
Applications
• Textual data bases (e.g. DNA sequence collections) issues:
• Save space = keep data in compressed form.
• Search efficiently.
• Solution: Compressed string matching algorithms.
|Z| = n|P | = m
Results
• Let and .
• Kärkkäinen, Navarro, and Ukkonen [KNU2003]:
• time and space.
• Our result (Theorem 1): For any parameter :
• expected time and
• space.
O(n(! +m + t(m, 2m + 2k, k)) + occ)
O(n/! +m + s(m, 2m + 2k, k)) + occ)
O(nmk + occ) O(nmk)
! ! 1
[KNU2003]
LV +
This paperCH +
τ = mk
! = k4 +mO(nk4 + nm + occ)
O! nmk+m + occ
"
O
(n
k4 +m+m + occ
)
Example Results
Time Space
and
O(nmk + occ)
O(nmk + occ)
O(nmk)
|P | = m |Z| = n
Reference
Z =
Selecting Compression Elements
• For parameter , select a subset of the compression elements of Z such that:
• .
• From any compression element , the distance (minimum number of references) to any compression element in is at most .
! ! 1 C
|C| = O(n/!)
ziC 2!
Z =
Selecting Compression Elements
• Maintain using dynamic perfect hashing while scanning from left-to-right.
• Initially, set .
• To process element follow references until we encounter :
• If the distance from to is less than we are done.
• Otherwise ( ), insert element the element at distance into .
C Z
C = {z0}
y ! Czi+1
l zi+1 y 2τ
l = 2! ! C
• Lemma: For any parameter , is constructed in
• expected time and
• space.
Z =
Selecting Compression Elements
! ! 1 C
O(n!)
O(n/τ)
Q =
Computing Matches
• Strategy:
• Process from left-to-right.
• At we compute all matches ending in the substring encoded by .
Z
zi zi
phrase(zi)phrase(zi!1) . . .. . .
phrase(zi)phrase(zi!1)
rsuf(zi−1)
. . .. . .
rpre(zi)
Q =
Computing Overlapping Matches
• Let be the positions in of .
• Goal: Find all overlapping matches for , i.e., the matches starting before and ending in .
• Decompress substrings and of length around .
• Run favorite (uncompressed) approximate string matching algorithm to find matches of P in . Add offset to these to get the overlapping matches for .
[ui , ui + li − 1] Q phrase(zi)
zi[ui , ui + li − 1]
rpre(zi) rsuf(zi!1) m + k ui
rsuf(zi!1) · rpre(zi)zi
ui
m + k
zi
Computing the Relevant Prefix and Suffix
• For parameter , select a subset of the compression elements of according to Lemma 1.
• For each element in at distance more than from add “shortcut” to element at distance .
! ! 1 C Z
C m + k z0m + k
Computing the Relevant Prefix
• Follow references to nearest element in .
• Follow shortcut if present.
• Compute the relevant prefix by decompressing length substring.
m + k
C
m + k
zi
Computing the Relevant Suffix
• Follow references to decompress substring of length .
• If the phrase is shorter than , recursively apply to until we have characters.
m + k
m + k
m + k zi−1 m + k
zi
Analysis
• Time = preprocess + n(find nearest element + decompress + match) =
• Space = preprocess + decompress + match =
O(n! + n(! +m + t(m, 2m + 2k, k))
O(n/! +m + s(m, 2m + 2k, k))
phrase(zi)phrase(zi!1) . . .. . .
rsuf !(zi)
Computing Internal Matches
• Goal: Find all internal matches for , i.e., all matches starting and ending within .
• Compute and store all the internal match sets indexed by compression elements using dynamic perfect hashing.
• Decompress substring of length ending at .
• Internal matches for =
[ui , ui + li − 1]zi
rsuf !(zi) min(li , m + k) ui + li ! 1
zireference(zi)
!P rsuf !(zi)(internal matches for ) (matches of in )
Analysis
• Time = n (decompress + match + internal matches) =
• Space = decompress + match + total number of internal matches =
O(n(m + t(m,m + k, k)) + occ)
O(m + s(m,m + k, k) + occ)
Putting the Pieces Together
• Merging overlapping and internal matches we get all matches for ending within .
• Implies Theorem 1: For any parameter :
• expected time and
• space.
• Does not hold for ZLW compressed texts, unless space is used.
• For space the bounds hold in the worst-case and work for both ZL78 and ZLW.
[ui , ui + li − 1]zi
O(n(! +m + t(m, 2m + 2k, k)) + occ)
O(n/! +m + s(m, 2m + 2k, k)) + occ)
! ! 1
!(n)
!(n)
Regular Expression Matching
• A regular expression is a generalized pattern composed from simple characters using union, concatenation, and Kleene star.
• Given a regular expression and a string the regular expression matching problem is to find all ending positions of substrings in that matches a string in the language generated by .
R Q
Q
R
Regular Expression Matching
• Let and .
• Classic solution [Thompson1968]: time and space.
• Several improvements based on the Four-Russian technique or word-level parallelism [Myers1992, NR2004, BFC2005, Bille2006].
|R| = m |Q| = u
O(um) O(m)
Compressed Regular Expression Matching
• Let and .
• Navarro [Navarro2003] simplified and without word-level parallel techniques:
• time and space.
• Our result (Theorem 2): For any parameter :
• time and
• space.
• E.g. gives time and space.
O(nm2 + occ ·m logm) O(nm2)
! ! 1
O(nm(m + !) + occ ·m logm)
O(nm2/! + nm)
! = m O(nm2 + occ ·m logm) O(nm)
|R| = m |Z| = n
Remarks
• Compressed strings are large and therefore space space may not be feasible for large texts.
• Our result for compressed approximate string matching is one of the very few algorithms for compressed matching that uses space.
• More sublinear space compressed string matching algorithms are needed!
!(n)
o(n)