Knuth-Morris-Pratt | Boyer-Moore | Rabin-Karp | BITAP | overview


I. Statement of the problem

Finding all the occurrences of a substring in a string is a common operation with a number of uses in computer science, such as:

- text search within a document

- search of patterns in DNA sequences

- plagiarism detection

- normal usage within a programming framework, including detection of regular expressions

To put it formally, the string-matching problem takes as input a text T[1..n], an array of length n, and a pattern P[1..m] of length m ≤ n. The elements of T and P are characters drawn from a finite alphabet ∑. We say that pattern P occurs with shift s in text T if 0 ≤ s ≤ n - m and T[s+1..s+m] = P[1..m]. The string-matching problem asks for all valid shifts at which the pattern P occurs in text T.

II. Proposed algorithms to solve the problem

Programmers rarely need to write code for detecting substrings in a string, since this functionality is already built into many libraries and applications. However, if we take a closer look at how it is implemented, we may be surprised by the ingenuity of the algorithms and the good performance they achieve. In this survey, I propose for analysis a few of the most famous algorithms for finding substrings in a string, which are used today in programming languages such as C++ and Java, or implemented in Unix utilities such as grep. As we will see later, these algorithms make use of special data structures, and we will focus especially on:

- Knuth-Morris-Pratt algorithm (aka KMP), first appeared in June 1977

- Boyer-Moore algorithm, appeared in October 1977

- Bitap (aka Shift-Add) algorithm, appeared in 1992

- Rabin-Karp algorithm, a string-matching method based on hashing, published in 1987

III. Naïve method (brute force)

The naïve algorithm acts as a sliding window over the text T, at each position comparing the current length-m substring of T against the pattern P. The pseudo-code [1] looks as follows:

NAÏVE-STRING-MATCHER(T, P)
  n ← length(T)
  m ← length(P)
  for s ← 0 to n-m do
    if P[1..m] == T[s+1..s+m] then print "Pattern occurs with shift", s
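A direct C++ rendering of this pseudo-code might look as follows (a minimal sketch using 0-based indexing; the function name is mine):

#include <iostream>
#include <string>

// Naive matcher: slide a length-m window over T and compare it to P directly.
void naiveStringMatcher(const std::string& T, const std::string& P) {
    int n = T.size(), m = P.size();
    for (int s = 0; s + m <= n; ++s)      // shift s runs from 0 to n-m
        if (T.compare(s, m, P) == 0)      // one O(m) comparison per shift
            std::cout << "Pattern occurs with shift " << s << '\n';
}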


This simple algorithm has a time complexity of O((n-m+1)·m), which can be greatly improved: the naïve string matcher involves no preprocessing and is inefficient because "the information gained about the text for one value of s is entirely ignored in considering other values of s" [1].

IV. The Knuth-Morris-Pratt algorithm

KMP was the first algorithm to achieve linear time complexity, obtained by improving the naïve string matcher. In contrast to the naïve approach, KMP solves the inefficiency issue by keeping the information that would otherwise be wasted and using it to advance through the string. It also has a preprocessing part that indicates how much of a previous failed comparison can be reused as it continues.

The algorithm has two parts:

1. A preprocessing part, which runs at initialization and builds a prefix function π. The pattern is matched against shifts of itself, in order to find the size of the largest prefix of P[1..j] that is also a suffix of P[2..j+1]. This information helps avoid testing useless shifts, as happened in the naïve algorithm. The pseudo-code for this part can be written as follows:

Compute-Prefix-Function(P)
  m ← length(P)
  π[1] ← 0
  k ← 0
  for q ← 2 to m do
    while k > 0 and P[k+1] ≠ P[q] do
      k ← π[k]
    if P[k+1] == P[q] then k ← k + 1
    π[q] ← k
  return π

To illustrate this function, I will take the example P = "ababaa" and compute the prefix array step by step. Before the loop, the table holds only π[1] = 0:

P: a b a b a a
π: 0


Initially, m = 6 and k = 0.

For q = 2 and k = 0:

k == 0 (no while)

P[k+1] == P[q]: P[1] == P[2] (false), so π[2] = 0

For q = 3 and k = 0:

k == 0 (no while)

P[1] == P[3] (true) => k = 1 and π[3] = 1

For q = 4 and k = 1:

k > 0 but P[2] == P[4] (no while)

P[2] == P[4] (true) => k = 2 and π[4] = 2

For q = 5 and k = 2:

k > 0 but P[3] == P[5] (no while)

P[3] == P[5] => k = 3 and π[5] = 3

For q = 6 and k = 3:

k > 0 and P[4] ≠ P[6] => k = π[3] = 1

k > 0 and P[2] ≠ P[6] => k = π[1] = 0 (end while)

P[1] == P[6] => k = 1 and π[6] = 1

The completed π will look like:

P: a b a b a a
π: 0 0 1 2 3 1

The meaning of π at position 4, for example, is that the size of the largest prefix of P[1..3] that is also a suffix of P[2..4] is 2. Indeed, the largest such prefix in this case is "ab": at position 4, the last 2 characters processed form the largest prefix of the current sub-sequence.

The prefix function has a running time of O(m).
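In C++, the prefix function might be written as the sketch below (0-based indexing, so pi[q] here corresponds to π[q+1] in the pseudo-code; the function name is mine). For P = "ababaa" it returns {0, 0, 1, 2, 3, 1}, matching the table above:

#include <string>
#include <vector>

// pi[q] = length of the longest proper prefix of P[0..q] that is also
// a suffix of P[0..q].
std::vector<int> computePrefixFunction(const std::string& P) {
    int m = P.size();
    std::vector<int> pi(m, 0);
    int k = 0;
    for (int q = 1; q < m; ++q) {
        while (k > 0 && P[k] != P[q])
            k = pi[k - 1];               // mismatch: fall back to a shorter prefix
        if (P[k] == P[q])
            ++k;                         // match: extend the current prefix
        pi[q] = k;
    }
    return pi;
}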

2. The string-matching computation part, which scans the text from left to right in search of all possible matches.

KMP-match(T, P)
  n ← length(T)
  m ← length(P)
  π ← Compute-Prefix-Function(P)
  q ← 0
  for i ← 1 to n do                   // scan the text left to right
    while q > 0 and P[q+1] ≠ T[i] do
      q ← π[q]                        // next char does not match
    if P[q+1] == T[i] then q ← q + 1  // next char matches
    if q == m then                    // if all of P is matched
      print "Pattern occurs with shift", i-m
      q ← π[q]                        // look for the next match

To illustrate with an example, we use the same pattern "ababaa" and the text T = "abbababaaa".

T: a b b a b a b a a a

n = 10, m = 6, π = [0, 0, 1, 2, 3, 1], q = 0, P = "ababaa"

i = 1

P[1] == T[1] (true) q = 1

i = 2

q > 0 and P[2] ≠ T[2] (false, no while)

P[2] == T[2] q = 2 (last two characters matched)

i = 3

q > 0 and P[3] ≠ T[3] (true) q = π[2] = 0 (reset the matching)

P[1] == T[3] (false)

i = 4

P[1] == T[4] q = 1

i = 5


q > 0 and P[2] ≠ T[5] (false, no while)

P[2] == T[5] q = 2

i = 6

q > 0 and P[3] ≠ T[6] (false, no while)

P[3] == T[6] q = 3 (advance)

i = 7

P[4] == T[7] q = 4

i = 8

P[5] == T[8] q = 5

i = 9

P[6] == T[9] q = 6 and q == m "Pattern occurs with shift 3" (the match starts at position 4)

Look for the next match: q = π[6] = 1

i = 10

q > 0 and P[2] ≠ T[10] (true) q = π[1] = 0 (end while)

P[1] == T[10] (true) q = 1

End.

This algorithm has a complexity of O(n), so the total complexity for the two parts adds up to O(m+n), which is better than that of the naïve algorithm. The secret of the KMP algorithm is that whenever it needs to "back up" in the pattern string, it does so by taking into account what has already been matched from the current sub-pattern against the text.
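Put together in C++, the search part might look like the sketch below (0-based indexing, reusing computePrefixFunction from the earlier sketch):

#include <iostream>
#include <string>
#include <vector>

// KMP search: scan T once, using pi to avoid re-examining matched characters.
void kmpMatch(const std::string& T, const std::string& P) {
    int n = T.size(), m = P.size();
    if (m == 0 || m > n) return;
    std::vector<int> pi = computePrefixFunction(P);   // preprocessing, O(m)
    int q = 0;                                        // characters of P matched so far
    for (int i = 0; i < n; ++i) {                     // scan the text left to right
        while (q > 0 && P[q] != T[i])
            q = pi[q - 1];                            // next char does not match
        if (P[q] == T[i])
            ++q;                                      // next char matches
        if (q == m) {                                 // all of P is matched
            std::cout << "Pattern occurs with shift " << i - m + 1 << '\n';
            q = pi[q - 1];                            // look for the next match
        }
    }
}

Called with T = "abbababaaa" and P = "ababaa", this prints "Pattern occurs with shift 3", in agreement with the trace above.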

Advantages:

- Optimal running time O(n+m), which is very fast
- No need to back up to the first element of the input pattern when a mismatch occurs

Disadvantages:

- It does not run well as |∑| (the size of the alphabet) increases

As for the data structures and variables it uses, the KMP algorithm only makes use of an additional array

π of size m, and two state variables q and k.


V. The Boyer-Moore algorithm

“The fast string searching algorithm” published in the same year with the KMP algorithm has a different

approach by matching the text against the last characters of the pattern, all the way until the first

characters [3]. This algorithm is suitable for matching when either the alphabet is reasonably large or

the pattern is very long (as it happens in bioinformatics applications). Along with the right-to-left

approach, Boyer-Moore has two other specific rules, which can be used either alone or, for better

performance, together: the “bad character shift rule” and the “good suffix shift rule”. Usually the

running time is sub-linear, because it generally looks at fewer characters than it passes. It was proved

that the longer the pattern is, the faster the Boyer-Moore algorithm goes.

- The bad character shift rule: as we start matching at the end of pattern P, if we find a mismatch after a series of k matches in the text, we can increase the shift by up to k+1 positions without the risk of skipping a potential match.

- The good suffix shift rule: if t is the longest suffix of P that matches T at the current position, then P can be shifted so that the previous occurrence of t in P lines up with T.

1. Pseudo-code for the bad character shift rule:

function Compute-Bad-Char-Shift-Rule()
  for k ← 1 to length(last) do
    last[k] ← -1
  for j ← length(P) downto 1 do
    if last[P[j]] < 0 then last[P[j]] ← j

This function computes the last occurrence of the character P[j] in the pattern P, where the array last has the size of the alphabet.

I will illustrate this function on our given pattern "ababaa". Here last has the size of the alphabet, which is 2, with ∑ = {a, b}, and it is initialized with -1. Scanning the pattern backwards, we find last['a'] = 6 and last['b'] = 4:

P:        a b a b a a
position: 1 2 3 4 5 6

This function has a running time of O(m) in the worst case.
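In C++, the table might be built as follows (a sketch assuming a byte alphabet; scanning left to right and overwriting is equivalent to the right-to-left scan in the pseudo-code, since later occurrences simply win):

#include <array>
#include <string>

// last[c] = 0-based index of the rightmost occurrence of c in P, or -1.
std::array<int, 256> computeLastOccurrence(const std::string& P) {
    std::array<int, 256> last;
    last.fill(-1);
    for (int j = 0; j < (int)P.size(); ++j)
        last[(unsigned char)P[j]] = j;   // later occurrences overwrite earlier ones
    return last;
}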

2. Pseudo-code for computing the good suffix rule:

function Compute-Suffix()
  suffix[length(suffix)] ← length(suffix)
  j ← length(suffix)
  for i ← length(suffix) - 1 downto 1 do
    while j < length(suffix) and P[j] ≠ P[i] do
      j ← suffix[j+1] - 1
    if P[j] == P[i] then j ← j - 1
    suffix[i] ← j + 1

Here suffix is an auxiliary array of size m, and the function that computes it is similar to the KMP steps at failure, except that it works backwards. suffix[i] is the smallest j > i such that P[j..m-1] is a prefix of P[i..m-1]; if there is no such j, then suffix[i] = m. The complexity of this operation is O(m).

3. Pseudo-code for computing the matching. It stores the results into a new array match, such that the following property holds:

match[j] = min { s | 0 < s ≤ j and P[j-s] ≠ P[j] and P[j-s+1..m-s-1] is a suffix of P[j+1..m-1] }, if such an s exists; otherwise
match[j] = min { s | j+1 ≤ s ≤ m and P[0..m-s-1] is a suffix of P[j+1..m-1] }, if such an s exists; otherwise
match[j] = m

function Compute-Match()
  initialize match as an array with length(match) as elements
  Compute-Suffix()
  // try to compute match using the first criterion
  for i ← 1 to length(match) do
    j ← suffix[i+1] - 1
    if suffix[i] > j then match[j] ← j - i
    else match[j] ← min(j - i + match[i], match[j])
  // compute the remaining positions in match using the second criterion
  if suffix[1] < length(P) then
    for j ← suffix[1] downto 1 do
      if suffix[0] < match[j] then match[j] ← suffix[0]
  j ← suffix[1]
  k ← suffix[j]
  while k ≤ length(P) do
    while j < k do
      if match[j] > k then match[j] ← k
      j ← j + 1
    k ← suffix[k]

Finally, after having processed the match array, we can start the main searching part of the algorithm:

i ← j ← length(P)
while i ≤ length(T) do
  if P[j] == T[i] then
    if j == 1 then return i
    j ← j - 1
    i ← i - 1
  else
    i ← i + length(P) - j + max(j - last[T[i]], match[j])
    j ← length(P)

In the worst case the Boyer-Moore algorithm has a complexity of O(n+m), but only if the pattern does not appear in the text. When the pattern does occur, the worst-case running time is O(nm).
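To make the right-to-left scan concrete, here is a minimal C++ sketch that uses only the bad character rule, reusing computeLastOccurrence from above (the good suffix table is omitted for brevity, so this is a simplified variant rather than the full Boyer-Moore):

#include <algorithm>
#include <array>
#include <string>

// Simplified Boyer-Moore: compare right to left, shift by the bad character rule.
// Returns the shift of the first occurrence of P in T, or -1 if there is none.
int boyerMooreBadChar(const std::string& T, const std::string& P) {
    int n = T.size(), m = P.size();
    if (m == 0 || m > n) return -1;
    std::array<int, 256> last = computeLastOccurrence(P);
    int s = 0;                                  // current shift
    while (s <= n - m) {
        int j = m - 1;
        while (j >= 0 && P[j] == T[s + j])
            --j;                                // compare right to left
        if (j < 0)
            return s;                           // full match at shift s
        // Align the rightmost occurrence of the mismatched text character
        // with position j, but never shift by less than one position.
        s += std::max(1, j - last[(unsigned char)T[s + j]]);
    }
    return -1;
}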

VI. Rabin-Karp algorithm

In 1987 M. Rabin and R. Karp came with the idea of hashing the pattern and check it against a hashed

substring of the text. We notate with ts the hashed value of the length m substring T*s+1…s+m+ and with

p the hashed value of the pattern, therefore ts = p pattern P is a substring of T from position s+1. A

popular and efficient hash function treats each substring as a number in a certain radix. For example, if

the substring is “hi” and the radix is 101, the hash value is 104*101+105 = 10609 (if we consider the

ASCII values of each letter).
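As a quick check of this arithmetic, a small (hypothetical) helper in C++ computes the same value via Horner's rule:

#include <iostream>
#include <string>

// Interpret a string as a number written in the given radix.
// No modulus yet; the pseudo-code below adds "mod q" to keep values bounded.
long long radixHash(const std::string& s, long long radix) {
    long long h = 0;
    for (unsigned char c : s)
        h = h * radix + c;        // shift left by one digit, add the next character
    return h;
}

int main() {
    std::cout << radixHash("hi", 101) << '\n';   // prints 10609 = 104*101 + 105
}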

In the pseudo-code below, we use the following additional notations:

- q is a large prime number, used to take the modulus whenever a value would exceed the allowed upper limit
- d is the radix, typically taken as the size of the alphabet if we write it as ∑ = {0, 1, .., d-1}; a string x is then interpreted as a number in radix d

n ← length(T)
m ← length(P)
h ← d^(m-1) mod q
p ← t_0 ← 0
for i ← 1 to m do      // preprocessing
  p ← (d·p + P[i]) mod q
  t_0 ← (d·t_0 + T[i]) mod q
for s ← 0 to n-m do
  if p == t_s then
    if P[1..m] == T[s+1..s+m] then print "Pattern occurs with shift", s
  if s < n-m then t_(s+1) ← (d·(t_s - T[s+1]·h) + T[s+m+1]) mod q
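A C++ rendering of this matcher might look like the sketch below (0-based indexing; the extra +q before taking the modulus keeps the rolling hash non-negative):

#include <iostream>
#include <string>

// Rabin-Karp: compare hashes first, verify candidate windows character by character.
void rabinKarpMatcher(const std::string& T, const std::string& P,
                      long long d, long long q) {
    int n = T.size(), m = P.size();
    if (m == 0 || m > n) return;
    long long h = 1;                             // h = d^(m-1) mod q
    for (int i = 0; i < m - 1; ++i)
        h = (h * d) % q;
    long long p = 0, t = 0;
    for (int i = 0; i < m; ++i) {                // preprocessing: hash P and T[0..m-1]
        p = (d * p + P[i]) % q;
        t = (d * t + T[i]) % q;
    }
    for (int s = 0; s <= n - m; ++s) {
        if (p == t && T.compare(s, m, P) == 0)   // verify to rule out spurious hits
            std::cout << "Pattern occurs with shift " << s << '\n';
        if (s < n - m)                           // roll the hash to the next window
            t = (d * (t - T[s] * h % q + q) + T[s + m]) % q;
    }
}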

To provide an example, take P = "cab" and T = "aabbcaba". The radix is d = 26 (for the 26 letters of the alphabet, with values a = 1, b = 2, c = 3, ...) and the prime number is q = 3.

m = 3, n = 8, h = 26^2 mod 3 = 676 mod 3 = 1

p = Hash("cab") = (3·26^2 + 1·26 + 2) mod 3 = 2056 mod 3 = 1

[a a b] b c a b a

Hash("aab") = (1·26^2 + 1·26 + 2) mod 3 = 704 mod 3 = 2 ≠ 1, so we shift right to the next position and compute its hash value from the previous one

a [a b b] c a b a

Hash("abb") = (d·(Hash("aab") - Hash("a")·h) + Hash("b")) mod 3
            = (26·(2 - 1·1) + 2) mod 3
            = 28 mod 3 = 1 == p, a potential match; direct comparison shows "abb" ≠ "cab", so this is a spurious hit and we shift right

a a [b b c] a b a

Hash("bbc") = (26·(1 - 1·1) + 3) mod 3 = 3 mod 3 = 0 ≠ 1, so we shift right again

a a b [b c a] b a

Hash("bca") = (26·(0 - 2·1) + 1) mod 3 = -51 mod 3 = 0 ≠ 1

a a b b [c a b] a

Hash("cab") = (26·(0 - 2·1) + 2) mod 3 = -50 mod 3 = 1 == p, so we found a potential match: we verify it by direct character comparison and see that it is a true occurrence. As at the window "abb", a spurious hit can occur, in which case the hash values coincide but the substring differs from the pattern.

The Rabin-Karp algorithm has a running time of O((n-m+1)·m) in the worst case, when it obtains many valid or spurious shifts that need to be verified. Usually, however, it performs far fewer character comparisons than the naïve algorithm would. In practice, the prime number q is taken large enough (q ≥ m), and the expected matching time is only O(n+m). Since m ≤ n, we can expect O(n) in the best cases.

As for the data structures employed, the algorithm is not very demanding: it is enough to keep the rolling hash values t_s and a couple of other variables whose meaning I have described above.

VII. BITAP

In the “A new approach to text searching” *4+, appeared in 1992, it is described an approximate string

matching algorithm which is comparable with the KMP applied for any pattern length and with the

Boyer-Moore when it is applied to short patterns. However, in terms of patterns with don’t care

symbols, this is the first suitable algorithm which can also be applied at the hardware level.

The algorithm tells whether a given text contains a substring “approximately equal” to a given pattern,

where the approximation is calculated using the Levenstein distance. All the computation is done in

terms of bitmasks and bitwise operations which give the algorithm an increased speed. Perhaps the

most famous application of BITAP is found in the agrep (approximate grep) Unix utility, licensed by the

University of Arizona.

The short C implementation given in the publication uses register unsigned integer variables and char arrays for defining the pattern. The bitwise operations used are complement, left and right shifts, and the AND and OR operations. In the following lines I will examine exact matching.

Suppose we have a pattern and a text to search in. We determine the size S of the alphabet and create a matrix of S rows and 31 columns, where the 31 positions are bit values {0, 1} forming an integer, and each row is dedicated to one symbol of the alphabet. We represent the occurrences of each symbol in the pattern by assigning M[s][pos] = 0, where s is the symbol and pos is a position in the pattern where s occurs; all remaining matrix values are set to 1.

We use a new variable named state, of 31 bits, initialized with ~1 = 11...10. The process runs by iterating through the characters of the text and performing the two operations below:

1) select the current character of the text and bitwise-OR its corresponding pattern row (from the matrix) with the state

2) left-shift the state

For example, taking the pattern string “aba”, the matrix will look like:

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 | Pattern(a)

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 | Pattern(b)

The initial state will be: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0

The text we want to search in is "baba". Iterating through the first character "b", we perform Pattern(b) | State, then State << 1, updating the State after each operation, which results in the following new state:

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0

Repeating the process for the remaining characters and going through the successive states, we can check whether the pattern was found by testing the (m+1)-th lowest bit of the state against 0.
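For exact matching, the whole process fits in a few lines of C++. The sketch below uses the standard shift-or formulation, which shifts first and then ORs; the bit positions are therefore off by one relative to the walk-through above, and the match test looks at bit m-1 (the function name is mine; patterns are limited to 31 characters so the state fits in a 32-bit word):

#include <cstdint>
#include <string>

// Shift-or (bitap) exact matching: masks mark pattern positions with a 0 bit.
// Returns the shift of the first occurrence of P in T, or -1 if there is none.
int bitapSearch(const std::string& T, const std::string& P) {
    int m = P.size();
    if (m == 0 || m > 31) return -1;              // pattern must fit in the word
    uint32_t mask[256];
    for (uint32_t& x : mask) x = ~0u;             // 1 everywhere = "no occurrence"
    for (int i = 0; i < m; ++i)
        mask[(unsigned char)P[i]] &= ~(1u << i);  // 0 where P has this symbol
    uint32_t state = ~0u;                         // nothing matched yet
    for (int i = 0; i < (int)T.size(); ++i) {
        state = (state << 1) | mask[(unsigned char)T[i]];
        if ((state & (1u << (m - 1))) == 0)       // bit m-1 clear: full match
            return i - m + 1;
    }
    return -1;
}

For P = "aba" and T = "baba", bitapSearch returns 1, the 0-based shift of the single occurrence.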

This algorithm has a preprocessing time of O(⌈mb/w⌉·(n + S)), where ⌈mb/w⌉ denotes the ceiling of mb/w and represents the time needed to compute a constant number of operations on integers of mb bits using a word of size w. As for the search time, the complexity in both the worst and the average case is O(⌈mb/w⌉·n).

VIII. Comparison and contrast; advantages and disadvantages

To summarize and quickly compare the approaches offered by the five algorithms discussed, I put together the following comparison (complexity, additional data structures/techniques, best use cases, disadvantages, applications):

Naïve
- Complexity: O((n-m+1)·m)
- Additional data structures/techniques: none
- Most suitable for: very short texts and patterns
- Disadvantages: slow and inefficient for large strings
- Applications: educational purposes

KMP
- Complexity: O(n+m)
- Additional data structures/techniques: array π, state variables k and q
- Most suitable for: binary strings
- Disadvantages: does not run fast as |∑| increases
- Applications: -

Boyer-Moore
- Complexity: O(n+m), but sub-linear performance in practice
- Additional data structures/techniques: 3 new arrays of size m
- Most suitable for: moderately sized ∑ and relatively long patterns
- Disadvantages: does not work well with binary strings or very short patterns
- Applications: in C++, the Boost library; text editors, commands

Rabin-Karp
- Complexity: worst case O((n-m+1)·m), but O(n) in the best case
- Additional data structures/techniques: hashing with radix d and modulus q
- Most suitable for: finding multiple pattern matches
- Disadvantages: can be as slow as the naïve algorithm while requiring more space
- Applications: text processing, bioinformatics, compression, plagiarism detection

BITAP
- Complexity: O(⌈mb/w⌉·n), usually sub-linear
- Additional data structures/techniques: matrix of alphabet occurrences in the pattern; bitwise operations
- Most suitable for: long patterns (it speeds up)
- Disadvantages: does not perform well for large alphabets
- Applications: agrep

IX. REFERENCES

1. Thomas Cormen et al., Introduction to Algorithms, 2nd edition, chapter 32, "String Matching"

2. D. Knuth, J. Morris, V. Pratt, "Fast pattern matching in strings", SIAM Journal on Computing, June 1977

3. R. Boyer, J. Moore, "A fast string searching algorithm", Communications of the ACM, 1977

4. R. Karp, M. Rabin, "Efficient randomized pattern-matching algorithms", 1987

5. R. Baeza-Yates, G. Gonnet, "A new approach to text searching", 1992