Improved Approximate String Matching and Regular ...

24
Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts Philip Bille IT University of Copenhagen Rolf Fagerberg University of Southern Denmark Inge Li Gørtz Technical University of Denmark

Transcript of Improved Approximate String Matching and Regular ...

Page 1: Improved Approximate String Matching and Regular ...

Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts

Philip BilleIT University of Copenhagen

Rolf FagerbergUniversity of Southern Denmark

Inge Li GørtzTechnical University of Denmark

Page 2: Improved Approximate String Matching and Regular ...

Approximate String Matching

• The edit distance between two strings is the minimum number of insertions, deletions, and substitutions needed to convert one string to the other. E.g., edit-distance(“cocoa”, “cola”) = 2.

• Let and be strings and let (integer ) be an error threshold.

• The approximate string matching problem is to find all ending positions of substrings in whose edit distance to is at most .

P Q k

P kQ

> 0

Page 3: Improved Approximate String Matching and Regular ...

Results

[Sellers1980]

[LV1989]

[CH2002]|Q| = u

Time Space

and

O(um)

O(uk)

O(m)

O(m)

O

�uk4

m+ u

⇥O(m)

|P | = m

Reference

Page 4: Improved Approximate String Matching and Regular ...

Q = ananas

Z = (0,a)(0,n)(1,n)(1,s)

Z =

a

s n

n0

1 2

34z0 z1 z2 z3 z4

Ziv-Lempel 1978 compression

Page 5: Improved Approximate String Matching and Regular ...

Approximate String Matching on ZL78 compressed texts

• Let be a string and be a ZL78 compressed representation of a string .

• Given and , the compressed approximate string matching problem is to solve the approximate string matching for and without decompressing .

• Goal: Do it more efficiently than decompressing and using the best (uncompressed) approximate string matching algorithm.

QP Z

P ZP Q Z

Z

Page 6: Improved Approximate String Matching and Regular ...

Applications

• Textual data bases (e.g. DNA sequence collections) issues:

• Save space = keep data in compressed form.

• Search efficiently.

• Solution: Compressed string matching algorithms.

Page 7: Improved Approximate String Matching and Regular ...

|Z| = n|P | = m

Results

• Let and .

• Kärkkäinen, Navarro, and Ukkonen [KNU2003]:

• time and space.

• Our result (Theorem 1): For any parameter :

• expected time and

• space.

O(n(� +m + t(m, 2m + 2k, k)) + occ)

O(n/� +m + s(m, 2m + 2k, k)) + occ)

O(nmk + occ) O(nmk)

� � 1

Page 8: Improved Approximate String Matching and Regular ...

[KNU2003][KNU2003]

LV +

This paperCH +

This paper

� = mk

� = k4 +mO(nk4 + nm + occ)

O� nmk+m + occ

O

�n

k4 +m+m + occ

Example Results

Time Space

and

O(nmk + occ)

O(nmk + occ)

O(nmk)

|P | = m |Z| = n

Reference

Page 9: Improved Approximate String Matching and Regular ...

Z =

Selecting Compression Elements

• For parameter , select a subset of the compression elements of Z such that:

• .

• From any compression element , the distance (minimum number of references) to any compression element in is at most .

� � 1 C

|C| = O(n/�)

ziC 2�

Page 10: Improved Approximate String Matching and Regular ...

Z =

Selecting Compression Elements

• Maintain using dynamic perfect hashing while scanning from left-to-right.

• Initially, set .

• To process element follow references until we encounter :

• If the distance from to is less than we are done.

• Otherwise ( ), insert element the element at distance into .

C Z

C = {z0}

y � Czi+1

l zi+1 y 2�

l = 2� � C

Page 11: Improved Approximate String Matching and Regular ...

• Lemma: For any parameter , is constructed in

• expected time and

• space.

Z =

Selecting Compression Elements

� � 1 C

O(n�)

O(n/�)

Page 12: Improved Approximate String Matching and Regular ...

Q =

Computing Matches

• Strategy:

• Process from left-to-right.

• At we compute all matches ending in the substring encoded by .

Z

zi zi

phrase(zi)phrase(zi�1) . . .. . .

Page 13: Improved Approximate String Matching and Regular ...

phrase(zi)phrase(zi�1)

rsuf(zi�1)

. . .. . .

rpre(zi)

Q =

Computing Overlapping Matches

• Let be the positions in of .

• Goal: Find all overlapping matches for , i.e., the matches starting before and ending in .

• Decompress substrings and of length around .

• Run favorite (uncompressed) approximate string matching algorithm to find matches of P in . Add offset to these to get the overlapping matches for .

[ui , ui + li � 1] Q phrase(zi)

zi[ui , ui + li � 1]

rpre(zi) rsuf(zi�1) m + k ui

rsuf(zi�1) · rpre(zi)zi

ui

Page 14: Improved Approximate String Matching and Regular ...

m + k

zi

Computing the Relevant Prefix and Suffix

• For parameter , select a subset of the compression elements of according to Lemma 1.

• For each element in at distance more than from add “shortcut” to element at distance .

� � 1 C Z

C m + k z0m + k

Page 15: Improved Approximate String Matching and Regular ...

Computing the Relevant Prefix

• Follow references to nearest element in .

• Follow shortcut if present.

• Compute the relevant prefix by decompressing length substring.

m + k

C

m + k

zi

Page 16: Improved Approximate String Matching and Regular ...

Computing the Relevant Suffix

• Follow references to decompress substring of length .

• If the phrase is shorter than , recursively apply to until we have characters.

m + k

m + k

m + k zi�1 m + k

zi

Page 17: Improved Approximate String Matching and Regular ...

Analysis

• Time = preprocess + n(find nearest element + decompress + match) =

• Space = preprocess + decompress + match =

O(n� + n(� +m + t(m, 2m + 2k, k))

O(n/� +m + s(m, 2m + 2k, k))

Page 18: Improved Approximate String Matching and Regular ...

phrase(zi)phrase(zi�1) . . .. . .

rsuf �(zi)

Computing Internal Matches

• Goal: Find all internal matches for , i.e., all matches starting and ending within .

• Compute and store all the internal match sets indexed by compression elements using dynamic perfect hashing.

• Decompress substring of length ending at .

• Internal matches for =

[ui , ui + li � 1]zi

rsuf �(zi) min(li , m + k) ui + li � 1

zireference(zi)

�P rsuf �(zi)(internal matches for ) (matches of in )

Page 19: Improved Approximate String Matching and Regular ...

Analysis

• Time = n (decompress + match + internal matches) =

• Space = decompress + match + total number of internal matches =

O(n(m + t(m,m + k, k)) + occ)

O(m + s(m,m + k, k) + occ)

Page 20: Improved Approximate String Matching and Regular ...

Putting the Pieces Together

• Merging overlapping and internal matches we get all matches for ending within .

• Implies Theorem 1: For any parameter :

• expected time and

• space.

• Does not hold for ZLW compressed texts, unless space is used.

• For space the bounds hold in the worst-case and work for both ZL78 and ZLW.

[ui , ui + li � 1]zi

O(n(� +m + t(m, 2m + 2k, k)) + occ)

O(n/� +m + s(m, 2m + 2k, k)) + occ)

� � 1

�(n)

�(n)

Page 21: Improved Approximate String Matching and Regular ...

Regular Expression Matching

• A regular expression is a generalized pattern composed from simple characters using union, concatenation, and Kleene star.

• Given a regular expression and a string the regular expression matching problem is to find all ending positions of substrings in that matches a string in the language generated by .

R Q

Q

R

Page 22: Improved Approximate String Matching and Regular ...

Regular Expression Matching

• Let and .

• Classic solution [Thompson1968]: time and space.

• Several improvements based on the Four-Russian technique or word-level parallelism [Myers1992, NR2004, BFC2005, Bille2006].

|R| = m |Q| = u

O(um) O(m)

Page 23: Improved Approximate String Matching and Regular ...

Compressed Regular Expression Matching

• Let and .

• Navarro [Navarro2003] simplified and without word-level parallel techniques:

• time and space.

• Our result (Theorem 2): For any parameter :

• time and

• space.

• E.g. gives time and space.

O(nm2 + occ ·m logm) O(nm2)

� � 1

O(nm(m + �) + occ ·m logm)

O(nm2/� + nm)

� = m O(nm2 + occ ·m logm) O(nm)

|R| = m |Z| = n

Page 24: Improved Approximate String Matching and Regular ...

Remarks

• Compressed strings are large and therefore space space may not be feasible for large texts.

• Our result for compressed approximate string matching is one of the very few algorithms for compressed matching that uses space.

• More sublinear space compressed string matching algorithms are needed!

�(n)

o(n)