Regular Expression Matching with Multi-Strings and Intervals

Philip Bille∗

[email protected]

Mikkel Thorup

[email protected]

Abstract

Regular expression matching is a key task (and often computational bottleneck) in a variety of software tools and applications. For instance, the standard grep and sed utilities, scripting languages such as perl, internet traffic analysis, XML querying, and protein searching. The basic definition of a regular expression is that we combine characters with union, concatenation, and Kleene star operators. The length m is proportional to the number of characters. However, often the initial operation is to concatenate characters in fairly long strings, e.g., if we search for certain combinations of words in a firewall. As a result, the number k of strings in the regular expression is significantly smaller than m. Our main result is a new algorithm that essentially replaces m with k in the complexity bounds for regular expression matching. More precisely, after an O(m log k) time and O(m) space preprocessing of the expression, we can match it in a string presented as a stream of characters in O(k(log w)/w + log k) time per character, where w is the number of bits in a memory word. For large w, this corresponds to the previous best bound of O(m(log w)/w + log m). Prior to this work no O(k) bound per character was known. We further extend our solution to efficiently handle character class interval operators C{x, y}. Here, C is a set of characters and C{x, y}, where x and y are integers such that 0 ≤ x ≤ y, represents a string of length between x and y from C. These character class intervals generalize variable length gaps which are frequently used for pattern matching in computational biology applications.

1 Introduction

1.1 The Problem A regular expression specifies a set of strings formed by characters combined with concatenation, union (|), and Kleene star (*) operators, e.g., (a|(ba))* is a string of as and bs where every b is followed by an a. Given a regular expression R and a string Q the regular expression matching problem is to decide if Q matches any of the strings specified by R. This problem is a key primitive in a wide variety of software tools and applications. Standard tools such as grep and sed provide direct support for regular expression matching in files. The scripting language perl [30] is a full programming language that is designed to easily support regular expression matching. In large scale data processing applications, such as internet traffic analysis [14, 31], XML querying [18, 21], and protein searching [25], regular expression matching is often the main computational bottleneck.

∗Supported by the Danish Agency for Science, Technology, and Innovation.

Historically, regular expression matching goes back to Kleene in the 1950s [16]. It became popular for practical use in text editors in the 1960s [28]. Compilers use regular expression matching for separating tokens in the lexical analysis phase [2].

1.2 Previous Work Let R be a regular expression of length m and let Q be a string of length n. Typically, we have that m ≪ n. The alphabet is denoted Σ. All of the bounds mentioned below and presented in this paper hold for a standard unit-cost RAM with w-bit words with standard word operations including arithmetic and logical operations. This means that the algorithms can be implemented directly in standard imperative programming languages such as C [15] or C++ [27]. We note that most open source implementations of algorithms for regular expression matching, e.g., grep, sed and perl, are written in C. An index into Q can be stored in a single word and therefore w ≥ log n. The space complexity is the number of words used by the algorithm, not counting the input which is assumed to be read-only.

The classical textbook solution to the problem by Thompson [28] from 1968 takes O(nm) time. It uses a standard state-set simulation of a non-deterministic finite automaton (NFA) with O(m) states and transitions produced from R. Using the NFA, we scan Q. Each of the n characters is processed in O(m) time by a standard state-set simulation of the NFA.

1297 Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.

In 1985 Galil [10] asked if a faster algorithm could be derived. Myers [23] met this challenge in 1992 with an O(nm/log n + (n + m) log n) time solution, thus improving the time complexity of Thompson's algorithm by a log n factor for most values of n and m. Bille and Farach-Colton [5] improved the space usage of Myers' algorithm. Recently, the authors obtained an algorithm using O(nm(log log n)/log^1.5 n + n + m) time [6], further improving Thompson's algorithm by a (log^1.5 n)/(log log n) factor. The space complexity is O(n^ε + m). All of these improvements decompose the NFA from Thompson's algorithm into micro NFAs and tabulate information for these to speed up the simulation of the NFA on Q. In [6] the tabulation is 2-dimensional, covering micro NFAs together with string segments. All these tabulation-based approaches lead to a time-space tradeoff depending on the amount of space we can use for tables.

An incomparable approach of Bille [4] is to simulate the micro NFAs using standard word operations to work on several states in parallel. This is ideal for a small space streaming algorithm since it only uses O(m) space. After an O(m log m) time and O(m) space preprocessing, he can process each character of the input string in O(m(log w)/w + log m) time. Note that this is better than the bound from [6] if w ≥ log^1.5 n. Bille also has a constant bound per character for very small patterns with m ≤ √w.

1.3 Our Results We identify and address two basic questions for regular expression matching.

1.3.1 Multi-Strings In the above applications, we often see that regular expressions are based on a comparatively small set of strings, directly concatenated from characters. As an example, consider the following regular expression used to detect a Gnutella data download signature in a stream [26]:

(Server:|User-Agent:)( |\t)*(LimeWire|BearShare|Gnucleus|Morpheus|XoloX|gtk-gnutella|Mutella|MyNapster|Qtella|AquaLime|NapShare|Comback|PHEX|SwapNut|FreeWire|Openext|Toadnode)

Here, the total size of the regular expression is m = 174 (the \t is a single tab character) but the number of strings is only k = 21.

Our first question is if we can design a more efficient algorithm that exploits k ≪ m. No previous paper has addressed this issue, so it was not even known if we can improve Thompson's [28] classic O(nm) bound. We show the following result.

Theorem 1.1. Let R be a regular expression of length m containing k strings and let Q be a string of length n. On a unit-cost RAM we can solve regular expression matching in time

O(n(k(log w)/w + log k) + m log k) = O(nk + m log k).

This works as a streaming algorithm that after an O(m log k) time and O(m) space preprocessing can process each input character in O(k(log w)/w + log k) time.

Compared to the algorithm by Bille [4] we replace m with k.

1.3.2 Character Class Intervals We consider an extension of the standard set of operators in regular expressions (concatenation, union, and Kleene star). Given a regular expression R and integers x and y, 0 ≤ x ≤ y, the interval operator R{x, y} is shorthand for the expression

R · · · R (x copies) · (R|ε) · · · (R|ε) (y − x copies),

that is, at least x and at most y concatenated copies of R.

We are interested in the important special case of character class intervals defined as follows. Given a set of characters α1, . . . , αc the character class C = [α1, . . . , αc] is shorthand for the expression (α1| . . . |αc). Thus, the character class interval C{x, y} is shorthand for

C{x, y} = (α1| . . . |αc) · · · (α1| . . . |αc) (x copies) · (α1| . . . |αc|ε) · · · (α1| . . . |αc|ε) (y − x copies).

Hence, the length of the short representation of C{x, y} is O(|C|) whereas the length of C{x, y} translated to standard operators is Ω(|C|y).
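The translation and its blow-up are easy to check directly. Below is a small Python sketch (the function name expand is ours, not the paper's) that performs exactly the expansion displayed above and feeds the result to Python's re engine:

```python
import re

def expand(chars, x, y):
    """Translate the character class interval C{x,y} into standard operators:
    x mandatory copies of (a1|...|ac) followed by y-x optional copies."""
    mandatory = "(" + "|".join(chars) + ")"
    optional = "(" + "|".join(chars) + "|)"   # the trailing | encodes ε
    return mandatory * x + optional * (y - x)

# C = [a, b], x = 1, y = 3: between 1 and 3 characters from {a, b}.
pattern = expand(["a", "b"], 1, 3)
print(pattern)                               # (a|b)(a|b|)(a|b|)
print(bool(re.fullmatch(pattern, "ab")))     # True: length 2 is in [1, 3]
print(bool(re.fullmatch(pattern, "")))       # False: below the lower bound x
print(bool(re.fullmatch(pattern, "abab")))   # False: above the upper bound y
```

The expanded pattern has Θ(|C|·y) characters, matching the Ω(|C|y) lower bound above, while the C{x, y} form stays at O(|C|).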

Our second question is if we can design an algorithm for regular expression matching with character class intervals that is more efficient than simply using the above translation with an algorithm for the standard regular expression matching problem. Again, no previous paper has addressed this issue. We show the following generalization of Theorem 1.1.

Theorem 1.2. Let R be a regular expression of length m containing k strings and character class intervals and let Q be a string of length n. On a unit-cost RAM we can solve regular expression matching in time

O(n(k(log w)/w + log k) + X + m log m).

Here, X is the sum of the lower bounds on the lengths of the character class intervals. This works as a streaming algorithm that after an O(X + m log m) time and O(X + m) space preprocessing can process each input character in O(k(log w)/w + log k) time.

Even without multi-strings this is the first algorithm to efficiently support character class intervals in regular expression matching.

The special case Σ{x, y} (we identify character classes by their corresponding sets) is called a variable length gap since it specifies an arbitrary string of length at least x and at most y. Variable length gaps are frequently used in computational biology applications [8, 9, 22, 24, 25]. For instance, the PROSITE database [7, 13] supports searching for proteins specified by patterns formed by concatenation of characters, character classes, and variable length gaps.

In general the variable length gaps may be long compared to the length of the expression. For instance, the pattern

MT·Σ{115, 136}·MTNTAYGG·Σ{121, 151}·GTNGAYGAY

appears in Morgante et al. [20].

For various restricted cases of regular expressions

some results for variable length gaps are known. For patterns composed using only concatenation, character classes, and variable length gaps Navarro and Raffinot [25] gave an algorithm using O((m + Y)/w + 1) time per character, where Y is the sum of the upper bounds on the length of the variable length gaps. This algorithm encodes each variable length gap using a number of bits proportional to the upper bound on the length of the gap. Fredriksson and Grabowski [8, 9] improved this for the case when all variable length gaps have lower bound 0 and identical upper bound y. They showed how to encode each such gap using O(log y) bits [8] and subsequently O(log log y) bits [9], leading to an algorithm using O(m(log log y)/w + 1) time per character. The latter results are based on maintaining multiple counters efficiently in parallel. We introduce an improved algorithm for maintaining counters in parallel as a key component in Theorem 1.2. Plugging in these counters in the algorithms of Fredriksson and Grabowski [8, 9] we obtain an algorithm using O(m/w + 1) time per character for this problem. With our counters we can even remove the restriction that each variable length gap should have the same upper bound. However, when k ≪ m the bound of Theorem 1.2 for the more general problem is better.

1.4 Technical Outline We will first show how to get an O(nk + m log k) bound for regular expression matching using the classic multi-string matching algorithm of Aho and Corasick [1] inside Thompson's [28] classic regular expression matching algorithm. In order to achieve the speed up we show how to efficiently handle multiple bit queues of non-uniform length and combine these with the algorithm of Bille [4]. For the character class intervals we further extend our algorithm to efficiently handle multiple non-uniform counters. We believe that our techniques for non-uniform bit queues and counters are of independent interest. We do not see a way of benefiting from a small k ≪ m in tabulation-based methods like that in [6].

2 Basic Concepts

We briefly review the classical concepts used in the paper.

2.1 Regular Expressions and Finite Automata The set of regular expressions over Σ is defined recursively as follows: A character α ∈ Σ is a regular expression, and if S and T are regular expressions then so is the concatenation, (S) · (T), the union, (S)|(T), and the star, (S)∗. The language L(R) generated by R is defined as follows: L(α) = {α}; L(S · T) = L(S) · L(T), that is, any string formed by the concatenation of a string in L(S) with a string in L(T); L(S|T) = L(S) ∪ L(T); and L(S∗) = ⋃_{i≥0} L(S)^i, where L(S)^0 = {ε} and L(S)^i = L(S)^{i−1} · L(S), for i > 0. Here ε denotes the empty string. The parse tree T(R) for R is the unique rooted binary tree representing the hierarchical structure of R. The leaves of T(R) are labeled by a character from Σ and internal nodes are labeled by either ·, |, or ∗.

A finite automaton is a tuple A = (V, E, Σ, θ, φ), where V is a set of nodes called states, E is a set of directed edges between states called transitions, each labeled by a character from Σ ∪ {ε}, θ ∈ V is a start state, and φ ∈ V is an accepting state¹. In short, A is an edge-labeled directed graph with a special start and accepting node. A is a deterministic finite automaton (DFA) if A does not contain any ε-transitions, and all outgoing transitions of any state have different labels. Otherwise, A is a non-deterministic finite automaton (NFA). When we deal with multiple automatons, we use a subscript A to indicate information associated with automaton A, e.g., θ_A denotes the start state of automaton A.

¹Sometimes NFAs are allowed a set of accepting states, but this is not necessary for our purposes.

Given a string Q and a path p in A we say that p and Q match if the concatenation of the labels on the transitions in p is Q. We say that A accepts a string Q if there is a path in A from θ to φ that matches Q. Otherwise A rejects Q. Let S be a state-set in A and let α be a character from Σ, and define the following operations.

Move(S, α): Return the set of states reachable from S through a single transition labeled α.

Close(S): Return the set of states reachable from S through a path of 0 or more ε-transitions.

One may use a sequence of Move and Close operations to test if A accepts a string Q of length n as follows. First, set S0 := Close({θ}). For i = 1, . . . , n compute Si := Close(Move(Si−1, Q[i])). It follows inductively that Si is the set of states in A reachable by a path from θ matching the ith prefix of Q. Hence, Q is accepted by A if and only if φ ∈ Sn.

Given a regular expression R, an NFA A accepting precisely the strings in L(R) can be obtained by several classic methods [11, 19, 28]. In particular, Thompson [28] gave the simple well-known construction in Figure 1. We will call an automaton constructed with these rules a Thompson NFA (TNFA). Fig. 2(a) shows the TNFA for the regular expression R = (aba|a)∗ba.

A TNFA N(R) for R has at most 2m states, at most 4m transitions, and can be computed in O(m) time. We can implement the Move operation on N(R) in O(m) time by inspecting all transitions. With a breadth-first search of N(R) we can also implement the Close operation in O(m) time. Hence, we can test acceptance of a string Q of length n in O(nm) time. This is Thompson's algorithm [28].
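The state-set simulation is short enough to sketch directly. The Python fragment below hand-codes a TNFA for the running example R = (aba|a)∗ba (the state numbering is ours, an illustration rather than the paper's implementation) and runs the Move/Close loop:

```python
from collections import deque

# Thompson-style NFA for R = (aba|a)*ba, hand-coded with our own state
# numbering: state 0 is the start state θ, state 11 the accept state φ.
LABELED = {(2, 'a'): 3, (3, 'b'): 4, (4, 'a'): 5,   # the string aba
           (6, 'a'): 7,                             # the string a
           (9, 'b'): 10, (10, 'a'): 11}             # the string ba
EPSILON = {0: [1, 9], 1: [2, 6], 5: [8], 7: [8], 8: [9, 1]}

def move(S, alpha):
    """States reachable from S through a single transition labeled alpha."""
    return {t for (s, a), t in LABELED.items() if s in S and a == alpha}

def close(S):
    """States reachable from S through 0 or more ε-transitions (BFS)."""
    queue, seen = deque(S), set(S)
    while queue:
        s = queue.popleft()
        for t in EPSILON.get(s, []):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen

def accepts(Q, start=0, accept=11):
    S = close({start})
    for c in Q:                      # S_i := Close(Move(S_{i-1}, Q[i]))
        S = close(move(S, c))
    return accept in S

print([accepts(q) for q in ["ba", "aba", "ababa", "ab", ""]])
# [True, True, True, False, False]
```

Each character costs one pass over the O(m) transitions, which is exactly the O(nm) total of Thompson's algorithm.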

2.2 Multi-String Matching Given a set of pattern strings P = {P1, . . . , Pk} of total length m and a text Q of length n the multi-string matching problem is to report all occurrences of each pattern string in Q. Aho and Corasick [1] generalized the classical Knuth-Morris-Pratt algorithm [17] for single string matching to multiple strings. The Aho-Corasick automaton (AC-automaton) for P, denoted AC(P), consists of the trie of the patterns in P. Hence, any path from the root of the trie to a state s corresponds to a prefix of a pattern in P. We denote this prefix by path(s). For each state s there is also a special failure transition pointing to the unique state s′ such that path(s′) is the longest prefix of a pattern in P matching a proper suffix of path(s). Note that for non-root states s the depth of s′ in the trie is always strictly smaller than the depth of s.

Finally, for each state s we store the subset occ(s) ⊆ P of patterns that match a suffix of path(s). Since the patterns in occ(s) share suffixes we can represent occ(s) compactly by storing for s the index of the longest string in occ(s) and a pointer to the state s′ such that path(s′) is the second longest string, if any. In this way we can report occ(s) in O(|occ(s)|) time.

The maximum outdegree of any state is bounded by the number of leaves in the trie which is at most k. Hence, using a standard comparison-based balanced search tree to index the trie transitions out of each state we can construct AC(P) in O(m log k) time and O(m) space. Fig. 2(c) shows the AC-automaton for the set of patterns {aba, a, ba}. Here, the failure transitions are shown by dashed arrows and the non-empty sets of occurrences are listed by indices to the three strings.

For a state s and character α define the state transition ∆(s, α) to be the unique state s′ such that path(s′) is the longest prefix of a pattern in P matching a suffix of path(s) · α. To compute a state transition ∆(s, α) first test if α matches the label of a trie transition t from s. If so we return the child endpoint of t. Otherwise, we recursively follow failure transitions from s until we find a state s′ with a trie transition t′ labeled α and return the child endpoint of t′. If no such state exists we return the root of the trie. Each lookup among the trie transitions uses O(log k) time. One may use a sequence of state transitions to find all occurrences of P in Q as follows. First, set s0 to be the root of AC(P). For i = 1, . . . , n compute si := ∆(si−1, Q[i]) and report the set occ(si). Inductively, it holds that path(si) is the longest prefix of a pattern in P matching a suffix of the ith prefix of Q. For each failure transition traversed in the algorithm we must traverse at least as many trie transitions. Therefore, the total time to traverse AC(P) and report occurrences is O(n log k + occ), where occ is the total number of occurrences.

Hence, the Aho-Corasick algorithm solves multi-string matching in O((n + m) log k + occ) time and O(m) space.
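The construction and traversal can be sketched as follows in Python. Note that this sketch indexes trie transitions with hash tables rather than the comparison-based balanced search trees assumed in the bounds above, and all function names are ours:

```python
from collections import deque

def build_ac(patterns):
    """Build the trie, failure transitions, and occurrence sets for P."""
    trie = [{}]                # trie[s] maps a character to a child state
    fail = [0]                 # failure transition of each state
    occ = [set()]              # indices of patterns matching a suffix of path(s)
    for i, p in enumerate(patterns):
        s = 0
        for c in p:
            if c not in trie[s]:
                trie.append({}); fail.append(0); occ.append(set())
                trie[s][c] = len(trie) - 1
            s = trie[s][c]
        occ[s].add(i)
    # BFS: a state's failure target is strictly shallower, so it is final
    # by the time the state is processed.
    q = deque(trie[0].values())
    while q:
        s = q.popleft()
        for c, t in trie[s].items():
            f = fail[s]
            while f and c not in trie[f]:
                f = fail[f]
            fail[t] = trie[f].get(c, 0)
            occ[t] |= occ[fail[t]]     # patterns shared with the failure state
            q.append(t)
    return trie, fail, occ

def delta(trie, fail, s, c):
    """∆(s, c): longest pattern prefix matching a suffix of path(s)·c."""
    while s and c not in trie[s]:
        s = fail[s]
    return trie[s].get(c, 0)

def matches(patterns, text):
    trie, fail, occ = build_ac(patterns)
    s, out = 0, []
    for i, c in enumerate(text):
        s = delta(trie, fail, s, c)
        out += [(i, patterns[j]) for j in occ[s]]  # occurrences ending at i
    return out
```

For example, matches(["aba", "a", "ba"], "ababa") reports all seven occurrences of the three patterns, mirroring the occurrence sets of Fig. 2(c).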



Figure 1: Thompson's recursive NFA construction. The regular expression for a character α ∈ Σ corresponds to NFA (a). If S and T are regular expressions then N(ST), N(S|T), and N(S∗) correspond to NFAs (b), (c), and (d), respectively. In each of these figures, the leftmost node θ and rightmost node φ are the start and the accept nodes, respectively. For the top recursive calls, these are the start and accept nodes of the overall automaton. In the recursions indicated, e.g., for N(ST) in (b), we take the start node of the subautomaton N(S) and identify it with the state immediately to the left of N(S) in (b). Similarly the accept node of N(S) is identified with the state immediately to the right of N(S) in (b).

3 A Simple Algorithm

In this section we present a simple comparison-based algorithm for regular expression matching using O(nk + m log k) time. In the following section we give a faster implementation of the algorithm leading to Theorem 1.1.

3.1 Preprocessing Recall that R is a regular expression of length m containing k strings L = {L1, . . . , Lk}. We will preprocess R in O(m) space and O(m log k) time as described below. First, define the pruned regular expression R̃ obtained by replacing each string Li by its index. Thus, R̃ is a regular expression over the alphabet {1, . . . , k}. The length of R̃ is O(k).

Given R we compute R̃, the TNFA N(R̃), and AC(L) using O(m log k) time and O(m) space. For each string Li ∈ L define start(Li) and end(Li) to be the startpoint and endpoint, respectively, of the transition labeled i in N(R̃). For any subset of strings L′ ⊆ L define start(L′) = ⋃_{Li∈L′} start(Li) and end(L′) = ⋃_{Li∈L′} end(Li). Fig. 2(b) shows R̃ and N(R̃) for the regular expression R = (aba|a)∗ba. The corresponding AC-automaton for {aba, a, ba} is shown in Fig. 2(c).

To represent the interaction between N(R̃) and AC(L), we construct and maintain a set of FIFO queues F = {F1, . . . , Fk} associated with the strings in L. Queue Fi stores a sequence of |Li| bits. Let S ⊆ start(L) and define the following operations on F.

Enqueue(S): For each queue Fi enqueue into the back of Fi a 1-bit if start(Li) ∈ S and otherwise a 0-bit.

Front(F): Return the set of states S ⊆ end(L) such that end(Li) ∈ S iff the front bit of Fi is 1.

Using a standard cyclic list for each queue we can support the operations in constant time per queue. Both of these queue operations take O(k) time.
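A minimal sketch of one such cyclic-list queue in Python (the class name is ours): the slot just vacated by the front bit is reused for the newly enqueued bit, so only a head index moves and no bits are ever copied.

```python
class BitQueue:
    """FIFO queue holding exactly `length` bits, stored as a cyclic list.

    Both operations are O(1): front() reads one slot, and enqueue()
    overwrites that slot and advances the head index."""
    def __init__(self, length):
        self.bits = [0] * length       # initialized with all 0-bits
        self.head = 0                  # position of the current front bit

    def front(self):
        return self.bits[self.head]

    def enqueue(self, bit):
        # The front bit is consumed; its slot becomes the new back slot.
        self.bits[self.head] = bit
        self.head = (self.head + 1) % len(self.bits)

q = BitQueue(3)                # e.g. the queue F_i for a string with |L_i| = 3
outputs = []
for b in [1, 0, 1, 1, 0, 0]:
    outputs.append(q.front())  # read the bit enqueued |L_i| steps earlier
    q.enqueue(b)
print(outputs)                 # [0, 0, 0, 1, 0, 1]: input delayed by 3 steps
```

The queue acts as a fixed delay line: after the initial zeros drain, the input sequence re-emerges exactly |Li| steps later.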

All in all, we use O(m) space and O(m log k) time for preprocessing.

3.2 Matching We show how to match R to Q in O(k) time per character of Q. We compute a sequence of state-sets S0, . . . , Sn in N(R̃) and a sequence of states s0, . . . , sn in AC(L) as follows. Initially, set S0 := Close({θ}), set s0 to be the root of AC(L), and initialize the queues with all 0-bits. At step i of the algorithm, 1 ≤ i ≤ n, we perform the following steps.

1. Compute Enqueue(Si−1 ∩ start(L)).

2. Set si := ∆(si−1, Q[i]).

3. Set S′ := Front(F ) ∩ end(occ(si))

4. Set Si := Close(S′).

Finally, we have that Q ∈ L(R) if and only if φ ∈ Sn. Note the correspondence between the sequence of bits in the queues and the states representing the strings in R within N(R̃). The queue and set operations take O(k) time and traversing the AC-automaton takes O(n log k) time in total. Hence, with preprocessing the entire algorithm uses O(nk + (n + m) log k) = O(nk + m log k) time. Note that all computation only
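To make steps 1–4 concrete, the following self-contained Python sketch runs the algorithm on the example R = (aba|a)∗ba with L = {aba, a, ba} and pruned expression (1|2)∗3. All naming and state numbering is ours, and the queues are initialized with |Li| − 1 zero-bits so that a bit enqueued in step i is read exactly when an occurrence of Li can end; the text above leaves this off-by-one convention implicit.

```python
from collections import deque

L = ["aba", "a", "ba"]                 # the k = 3 strings of R = (aba|a)*ba

# Hand-coded TNFA for the pruned expression (1|2)*3 (our own numbering).
# The transition labeled i runs from start(L_i) to end(L_i).
START = {0: 2, 1: 4, 2: 7}
END = {0: 3, 1: 5, 2: 8}
EPSILON = {0: [1, 7], 1: [2, 4], 3: [6], 5: [6], 6: [7, 1]}
THETA, PHI = 0, 8

def close(S):
    """States reachable from S through 0 or more ε-transitions."""
    stack, seen = list(S), set(S)
    while stack:
        for t in EPSILON.get(stack.pop(), []):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

# AC(L) with dictionaries; occ[s] holds indices of strings in L that
# match a suffix of path(s).
trie, fail, occ = [{}], [0], [set()]
for i, p in enumerate(L):
    s = 0
    for c in p:
        if c not in trie[s]:
            trie.append({}); fail.append(0); occ.append(set())
            trie[s][c] = len(trie) - 1
        s = trie[s][c]
    occ[s].add(i)
bfs = deque(trie[0].values())
while bfs:
    s = bfs.popleft()
    for c, t in trie[s].items():
        f = fail[s]
        while f and c not in trie[f]:
            f = fail[f]
        fail[t] = trie[f].get(c, 0)
        occ[t] |= occ[fail[t]]
        bfs.append(t)

def match(Q):
    S = close({THETA})
    s = 0
    F = [deque([0] * (len(p) - 1)) for p in L]       # one bit queue per string
    for ch in Q:
        for i in range(len(L)):                      # 1. Enqueue(S ∩ start(L))
            F[i].append(1 if START[i] in S else 0)
        while s and ch not in trie[s]:               # 2. s := ∆(s, ch)
            s = fail[s]
        s = trie[s].get(ch, 0)
        front = {END[i] for i in range(len(L)) if F[i].popleft()}
        S = close(front & {END[i] for i in occ[s]})  # 3.+4. S := Close(S')
    return PHI in S

print([match(q) for q in ["ba", "aba", "ababa", "ab", "b", ""]])
```

Between string boundaries S may well be empty (e.g. in the middle of reading "aba"): all partial progress lives in the AC-automaton state and the bit queues.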



Figure 2: (a) N(R) for R = (aba|a)∗ba. (b) N(R̃) where R̃ = (1|2)∗3. (c) AC-automaton for the strings in R.

requires a comparison-based model of computation. Hence, we have the following result.

Theorem 3.1. Given a regular expression of length m containing k strings and a string of length n, we can solve regular expression matching in a comparison-based model of computation in time O(nk + m log k) and space O(m).

4 A Faster Algorithm

We show how to speed up the simple algorithm from the previous section to do matching in O(k(log w)/w + log k) time per character. We first give a high-level description of the algorithm by Bille [4] that uses O(m(log w)/w + log m) time per character. Then, we show how to modify it to obtain our new bound. Essentially, we will be able to apply Bille's Close operation directly to the TNFA N(R̃) of the pruned regular expression. The more interesting part will be to speed up the FIFO bit queues of non-uniform length that represent the interaction between the TNFA and the multi-string matching.

4.1 Bille's Speed-up for Thompson's Algorithm We now describe the key features of the basic algorithm of Bille [4]. Later we show how to handle strings more efficiently, as we did above for Thompson's algorithm. Initially, we assume m ≥ w.

First, we decompose N(R) into a tree AS of O(⌈m/w⌉) micro TNFAs, each with at most w states. For each A ∈ AS, each child TNFA C is represented by a start and accepting state and a pseudo-transition labeled β ∉ Σ connecting these. We can always construct a decomposition of N(R) as described above since we can partition the parse tree into subtrees using standard techniques and build the decomposition from the TNFAs induced by the subtrees. Similar decompositions are used in most of the fast algorithms [4–6, 23].

Secondly, we store a state-set S as a set of local state-sets SA, A ∈ AS. We represent SA compactly by mapping each state in A to a position in the range [0, O(w)] and store SA as the bit string consisting of 1s in the positions corresponding to the states in SA. For simplicity, we will identify SA with the bit string representing it. The mapping is constructed based on a balanced separator decomposition of A of depth O(log w), which combined with some additional information allows us to compute local MoveA and CloseA operations efficiently. To implement MoveA we ensure that endpoints of transitions labeled by characters are mapped to consecutive positions. This is always possible since the startpoint of such a transition has outdegree 1 in TNFAs. We can now compute MoveA(SA, α) in constant time by shifting SA and &'ing the result with a mask containing a 1 in all positions of states with incoming transitions labeled α. This idea is a simple variant of the classical Shift-Or algorithm [3]. To implement CloseA we use properties of the separator decomposition to define a recursive algorithm on the decomposition. Using precomputed masks combined with the mapping the recursive algorithm can be implemented in parallel for each level of the recursion using standard word operations. Each level of the recursion is done in constant time leading to an O(log w) algorithm for CloseA.
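The shift-and-mask Move is easiest to see for a single string, where the micro TNFA is just a chain. The Python sketch below is the classical Shift-Or idea in its Shift-And form (our code, not the paper's): one bit per state, and Move is a shift of the state-set followed by an & with the character's mask.

```python
def shift_and(pattern, text):
    """Bit-parallel simulation of the chain NFA for `pattern`:
    bit j of D is 1 iff the state after reading pattern[:j+1] is active."""
    masks = {}
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << j)  # states with incoming c-labels
    D, out = 0, []
    for i, c in enumerate(text):
        # The shift activates successor states, |1 re-enters at the start,
        # and &-ing with masks[c] keeps only states whose incoming label is c.
        D = ((D << 1) | 1) & masks.get(c, 0)
        if D & (1 << (len(pattern) - 1)):
            out.append(i)                      # an occurrence ends at i
    return out

print(shift_and("aba", "ababa"))   # occurrences of aba end at positions 2 and 4
```

In the full algorithm the same shift-and-mask step updates all character-transition endpoints of a micro TNFA in one word operation, because those endpoints are mapped to consecutive positions.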

Finally, we use the local MoveA and CloseA algorithms to obtain an algorithm for Move and Close on N(R). To implement Move we simply perform MoveA independently for each micro TNFA A ∈ AS using O(|AS|) = O(m/w) time. To implement Close from CloseA we use the result that any cycle-free path of ε-transitions in a TNFA uses at most one of the back transitions we get from the Kleene star operators [23]. This implies that we can compute the Close in two depth-first traversals of the decomposition using constant time per micro TNFA. Hence, we use O(|AS| log w) = O(m(log w)/w) time per character. For the case of m < w we only have one micro TNFA and the depth of the separator decomposition is log m. Hence, we spend O(log m) time per character. In general, we use O(m(log w)/w + log m) time per character.

4.2 Speeding Up the Simple Algorithm We now show how to improve the simple algorithm to perform matching in O(k(log w)/w + log k) time per character. Initially, we assume that k ≥ w.

We now consider the TNFA N(R̃) for the pruned regular expression R̃ that we defined in the previous section. We decompose N(R̃) into a tree AS of O(k/w) micro TNFAs each with at most w states. We represent state-sets as in the algorithm by Bille [4] and we will use the same algorithm for the Close operation in step 4 of our simple algorithm. Hence, step 4 takes O(|AS| log w) = O(k(log w)/w) time as desired. The remaining steps only affect the endpoints of non-ε transitions and therefore we can perform them independently on each A ∈ AS. We show how to do this in O(log w) amortized time leading to the overall O(k(log w)/w) time bound per character.

Let A ∈ AS be a micro TNFA with at most w states. Let LA be the set of kA ≤ w strings corresponding to the non-ε labeled transitions in A. First, construct AC(LA). For each state s ∈ AC(LA) we represent the set end(occ(s)) as a bit string with a 1 in each position corresponding to the states end(occ(s)) in A. Each of these bit strings is stored in a single word. Since the total number of states for all AC-automata in AS is O(m) the total space used is O(m) and the total preprocessing time is O(m log max_{A∈AS} kA) = O(m log w). Traversing AC(LA) on the string Q takes O(log kA) time per character. Furthermore, we can retrieve end(occ(s)) in step 3 in constant time. We can compute the two ∩'s in steps 1 and 3 in constant time by a bitwise & operation. It remains to implement the queue operations. In the following section we show how to implement these in O(log kA) = O(log w) amortized time, giving us a solution using O(k(log w)/w) time per character.

4.2.1 Fast Bit Queues Let FA be the set of kA ≤ w queues for the strings LA in A and let ℓ1, . . . , ℓkA be the lengths of the queues. We want to support local queue operations with input and output state-sets represented compactly as a bit string. For simplicity, we will identify start(LA) with end(LA) such that the input and output of the queue operations correspond directly. This is not a problem since the states are consecutive in the mapping and we can therefore translate between them in constant time. If we use the implementation of the queue operations from Section 3 we get a solution using O(w) time per operation. Our goal in this section is to improve this to the following.

Lemma 4.1. For a set of k_A ≤ w bit queues we can support Front in O(1) time and Enqueue in O(log k_A) = O(log w) amortized time.

To prove Lemma 4.1 we divide the queues into k′_A short queues of length less than 2k_A and k″_A long queues of length at least 2k_A.

Case 1: Short Queues Let ℓ < 2k_A ≤ 2w be the maximal length of a short queue. Each short queue fits in a double word. The key idea is to first pretend that all the short queues have the same length ℓ and then selectively move some bits forward in the queues to speed up the shorter queues appropriately. We first explain the algorithm for a single queue implemented within a larger queue and then show how to implement it efficiently in parallel for O(k_A) queues. We assume w.l.o.g. that ℓ is a power of 2.

Consider a single bit queue of length ℓ_j ≤ ℓ represented by a queue of length ℓ. We want to move all bits inserted into the queue forward by ℓ − ℓ_j positions. To do so we insert log ℓ jump points J_1, …, J_{log ℓ}. If the ith least significant bit in the binary representation of ℓ − ℓ_j is 1, the ith jump point J_i is active. When we insert a bit b into the back of the queue we first move all bits forward and insert b at the back. Subsequently, for i = log ℓ, …, 2, if J_i is active we move the bit at position 2^i to position 2^{i−1}. By the choice of jump points it follows that a bit inserted into the queue is moved forward by ℓ − ℓ_j positions during the lifetime of the bit within the queue.
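The single-queue scheme can be sketched directly. The following Python simulation is an illustrative interpretation, not the paper's code: positions are 1..ℓ with the front at position 1, and (as one way of handling odd offsets ℓ − ℓ_j) jump points are applied for i = log ℓ down to 1 in one descending pass, so a bit may cascade through several jumps during a single Enqueue.

```python
def make_jump_queue(l, lj):
    """Queue of true length lj simulated inside a queue of nominal length l."""
    assert l & (l - 1) == 0 and 1 <= lj <= l   # l is a power of two
    logl = l.bit_length() - 1
    diff = l - lj
    # J_i is active iff the i-th least significant bit of l - lj is 1
    active = [False] + [bool(diff >> (i - 1) & 1) for i in range(1, logl + 1)]
    return {"cells": [0] * (l + 1), "active": active, "logl": logl, "l": l}

def enqueue(q, b):
    """Shift all bits one position towards the front, insert b at the back,
    then apply the active jump points; returns the bit leaving the front."""
    out = q["cells"][1]
    for p in range(1, q["l"]):
        q["cells"][p] = q["cells"][p + 1]
    q["cells"][q["l"]] = b
    for i in range(q["logl"], 0, -1):          # descending pass over jump points
        if q["active"][i]:
            q["cells"][2 ** (i - 1)] |= q["cells"][2 ** i]
            q["cells"][2 ** i] = 0
    return out

# A bit enqueued at step t leaves the front exactly lj enqueues later,
# i.e. the structure behaves as a FIFO of length lj.
q = make_jump_queue(8, 5)
outs = [enqueue(q, 1 if t == 0 else 0) for t in range(16)]
assert outs.index(1) == 5
```

The point of the construction is that the per-queue length ℓ_j is encoded purely in which jump points are active, so many queues of different lengths can share the same shift-and-jump schedule.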

Next we show how to implement the above algorithm in parallel. First, we represent all of the k′_A bit queues in a combined queue F of length ℓ implemented as a cyclic list. Each entry in F stores a bit
1303 Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.

string of length k′_A where each bit represents an entry in one of the k′_A bit queues. Each k′_A-bit entry is stored "vertically" in a single bit string. We compute the jump points for each queue and represent them vertically using log ℓ bit strings of length k′_A, with a 1-bit indicating an active jump point. Let F_i and F_{i−1} be the entries in F at positions 2^i and 2^{i−1}, respectively. We can move all bits in parallel from queues with an active jump point at position 2^i to position 2^{i−1} as follows.

    Z := F_i & J_i
    F_i := F_i & ¬J_i
    F_{i−1} := F_{i−1} | Z

Here, we extract the relevant bits into Z, remove them from F_i, and insert them back into the queue at F_{i−1}. Each of the O(log ℓ) moves at jump points takes constant time and hence the Enqueue operation takes O(log ℓ) = O(log k_A) time. The Front operation is trivially implemented in constant time by returning the front of F.
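In Python (a sketch, with F_i, F_{i−1}, and J_i as plain machine words), the parallel move is exactly the three word operations above:

```python
def move_active(F_i, F_im1, J_i):
    """For every queue whose jump point J_i is active, move the bit stored at
    combined-queue position 2**i (word F_i) to position 2**(i-1) (word F_im1)."""
    Z = F_i & J_i        # extract the bits of queues with an active jump point
    F_i = F_i & ~J_i     # remove them from F_i
    F_im1 = F_im1 | Z    # insert them at F_{i-1}
    return F_i, F_im1

# queues 0, 1 and 3 have a bit at position 2**i; queues 0 and 1 jump
F_i, F_im1 = move_active(0b1011, 0b0100, 0b0011)
assert (F_i, F_im1) == (0b1000, 0b0111)
```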

Case 2: Long Queues We now consider the k″_A long queues, each of length at least 2k_A ≥ 2k″_A. The key idea is to represent the queues horizontally, with additional buffers of size k″_A in the back and front of the queue represented vertically. The buffers allow us to convert between horizontal and vertical representations every k″_A Enqueue operations, which we can do efficiently using a fast algorithm for transposition by Thorup [29].

First, we take each long queue L_i and make an individual horizontal representation as a bit string over ⌈ℓ_i/w⌉ words. As usual, the representation is cyclic. Second, we add two buffers, one for the back of the queue and one for the front of the queue. These are cyclic vertical buffers like those we used for the short queues, and each has capacity for k″_A bits from each string. When we start, all bits in all buffers and queues are 0. The first k″_A Enqueue operations are very simple. Each just adds a new word to the end of the vertical back buffer. We now apply the word matrix transposition of Thorup [29] so that for each queue L_i we get a word with k″_A bits for the back of the queue. Since k″_A ≤ w, the transposition takes O(k″_A log k″_A) time. We can now in O(k″_A) total time enqueue the results to the back of each of the horizontal queues L_i.

In general, the system works in periods of k″_A Enqueue operations. The Front operation simply returns the front word in the front buffer. This front word is removed by the Enqueue operation, which also adds a new word to the end of the back buffer. When the period ends, we move the back buffer to the horizontal representation as described above. Symmetrically, we take the front k″_A bits of each L_i and transpose them to get a vertical representation for the front buffer. The total time spent on a period is O(k″_A log k″_A), which is O(log k″_A) time per Enqueue operation.
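The period structure relies only on a bit-matrix transposition between the vertical layout (one word per time step) and the horizontal layout (one word per queue). Thorup's algorithm [29] performs this transposition in O(k″_A log k″_A) word operations; the naive O(k²) stand-in below merely illustrates what is being computed.

```python
def transpose_naive(words, k):
    """words[t] holds, for each of the k queues, the bit enqueued at step t
    of the period ("vertical"); returns one word per queue holding its k bits
    in time order ("horizontal")."""
    out = [0] * k
    for t, wd in enumerate(words):
        for q in range(k):
            out[q] |= ((wd >> q) & 1) << t
    return out

# period of k = 3 steps for 3 queues
assert transpose_naive([0b101, 0b010, 0b111], 3) == [0b101, 0b110, 0b101]
```

Transposing a square bit matrix twice gives back the original, which is why the same primitive serves both the back buffer (vertical to horizontal) and the front buffer (horizontal to vertical).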

To show Lemma 4.1 we simply partition F_A into short and long queues and apply the algorithm for the above cases. Note that the space for the queues over all micro TNFAs is O(m).

4.3 Summing Up In summary, our algorithm for regular expression matching spends O(m log k) time and O(m) space on the preprocessing. Afterwards, it processes each character in O(|AS| log w) = O(k log w / w) time. So far this assumes k ≥ w. However, if k < w, we can just use k bits in each word, and with these reduced words the above bound becomes O(log k). This completes the proof of Theorem 1.1.

5 Regular Expressions with Black-Boxes

Before addressing character class intervals we first discuss how to extend the special treatment of strings to more general automata within the same time bounds.

Let B_1, …, B_k be a set of black-box automata. Each black-box is represented by a transition in the pruned TNFA, and each black-box has a single bit of input and a single bit of output. In Section 4.2 each black-box was an automaton that recognized a single string, and we gave an algorithm to handle k_A ≤ w such automata in O(log k_A) time using our bit queues and a multi-string matching algorithm. In this setting none of the black-boxes accepted the empty string, and since all black-boxes were of the same type we could handle all of them in parallel using a single algorithm.

To handle general black-box automata we need to modify our algorithm slightly. First, if B_i accepts the empty string we add a direct ε-transition from the startpoint to the endpoint of the transition for B_i in the pruned TNFA. With this modification the O(k log w / w + log k) time Close algorithm from the previous section works correctly. With different types of black-boxes we create a bit mask for the startpoints and endpoints of each type. We can then extract the input bits for a given type independently of the other types. As before, the output bit positions can be obtained by shifting the bit mask for the input bits by one position.

With these fixes we can now implement general black-boxes within our algorithm from Section 4.2 efficiently. Specifically, let A be a micro TNFA with k_A ≤ w black-box automata of a constant number of types. If we can simulate each type in amortized O(log k_A) time per operation, the total time for regular expression matching with the black-boxes will be dominated by the time for the Close operation and we get the same time complexity as in Theorem 1.1.

In the following section we show how to handle ≤ w character class intervals in amortized constant time. Using these as black-boxes as described above leads to Theorem 1.2.

6 Character Class Intervals

We now show how to support character class intervals C{x, y} in the black-box framework described in the previous section. For succinctness we will abbreviate character class intervals to intervals in the following. It is convenient to define the following two types: C{=x} = C{x, x} and C{≤x} = C{0, x}. We have that C{x, y} = C{=x}C{≤y−x}.

Recall that we identify each character class C by the set it represents. It is instructive to consider how to implement some of the simple cases of intervals. It is easy to handle O(w) character classes, i.e., intervals of the form C{=1}, in constant time [3]. For each character we store a bit string indicating which character classes it is contained in. We can then & this bit string with the input bits and copy the result to the output bits.
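As a small sketch (with hypothetical classes; the table of per-character masks is precomputed):

```python
# Hypothetical classes C1 = {a, b} and C2 = {b, c}; bit j of a mask records
# whether the character lies in class j+1.
classes = [set("ab"), set("bc")]
mask = {}
for ch in "abc":
    mask[ch] = sum(1 << j for j, C in enumerate(classes) if ch in C)

# One step: the output bits are the input bits &'ed with the mask of the
# character just read, so only the transitions whose class contains the
# character survive.
inputs = 0b11                       # both class transitions are reachable
assert inputs & mask["a"] == 0b01   # only C1 contains 'a'
assert inputs & mask["b"] == 0b11   # both classes contain 'b'
assert inputs & mask["c"] == 0b10   # only C2 contains 'c'
```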

Intervals of the form Σ{=x} are closely related to our algorithm for multi-strings. In particular, suppose that we have a string matcher for the string L_i that determines if the current suffix of Q matches L_i, and a black-box for Σ{=|L_i|}. To implement the black-box for L_i we copy the input to Σ{=|L_i|} and pass the character to the string matcher for L_i. The output for L_i is then 1 if the outputs of Σ{=|L_i|} and the string matcher are both 1, and 0 otherwise. In the previous section we implemented Σ{=|L_i|} and the string matcher using our bit queues of non-uniform length and a standard multi-string matching algorithm, respectively. Hence, to implement Σ{=x} we may simply use a bit queue of length x.

To implement C{=x} we will use a similar approach as above. The main idea is to think of C{=x} as the combination of Σ{=x} and "the last x characters are all in C". To implement the latter we introduce an algorithm for maintaining generic non-uniform counters below. These will also be useful for implementing C{≤x}.

Define a trit counter with reset (abbreviated trit counter) T with reset value ℓ as follows. The counter T stores a value in the range [0, ℓ]. We want to determine when T > 0 under a sequence of the following operations: a reset sets T := ℓ, a cancel sets T := 0, and a decrement sets T := T − 1 if T > 0.
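As a point of reference, the scalar semantics of a single trit counter are simple; the whole difficulty below is supporting O(w) counters with individual reset values in amortized constant time. A minimal Python specification:

```python
class TritCounter:
    """Scalar specification of a trit counter with reset value l."""
    def __init__(self, l):
        self.l = l
        self.t = 0
    def step(self, trit):
        """trit: 1 = reset, -1 = cancel, 0 = decrement; returns whether T > 0."""
        if trit == 1:
            self.t = self.l
        elif trit == -1:
            self.t = 0
        elif self.t > 0:
            self.t -= 1
        return self.t > 0

c = TritCounter(2)
# a reset keeps the counter positive for the reset step and l - 1 decrements;
# a cancel makes it passive immediately
assert [c.step(s) for s in [1, 0, 0, 0, 1, -1, 0]] == \
       [True, True, False, False, True, False, False]
```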

To implement multiple trit counters using word-level parallelism we code the inputs with trits −1, 0, 1, where 1 is reset, 0 is decrement, and −1 is cancel. Each of these trits is coded with 2 bits. The output is given as a bit string indicating which trit counters are positive. We will show how to implement O(w) non-uniform trit counters in constant amortized time in the next section. The main challenge here is that the counters have individual reset values. Before doing this, we first show how to implement the character class intervals using this result.

For C{≤x} we use a trit counter T with reset value x − 1. For each new character α we give T the input −1 if α ∉ C. Otherwise, we input the start state of C{≤x} to T. The output of C{≤x} is the output of T. Hence, we output a 1 iff we input a 1 no more than x characters ago and we have not had a character outside of C.

To finish the implementation of C{=x} we need to implement "the last x characters are all in C". For this we use a trit counter T with reset value x − 1. For each new character α we input a 1 if α ∉ C and otherwise we input a 0. The output for C{=x} is 1 if the output from Σ{=x} is 1 and T = 0. Otherwise, the output is 0. In other words, the output from C{=x} is 1 iff we input a 1 exactly x characters ago and all of the last x characters are in C.
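Putting the two pieces together, a scalar re-interpretation of C{=x} (the deque stands in for the Σ{=x} bit queue and the integer t for the trit counter; the reset value here is chosen so this sequential simulation comes out right, and off-by-one conventions may differ from the word-parallel version):

```python
from collections import deque

def c_eq_x_outputs(text, C, x, starts):
    """starts[i] = 1 if position i is fed a start bit; returns the C{=x}
    output bit for each position of text."""
    queue = deque([0] * x)   # Sigma{=x}: delays a bit by exactly x characters
    t = 0                    # steps until the last character outside C leaves
                             # the window of the last x characters
    out = []
    for i, ch in enumerate(text):
        t = x if ch not in C else max(t - 1, 0)
        queue.append(starts[i])
        sigma_out = queue.popleft()
        out.append(1 if sigma_out and t == 0 else 0)
    return out

# with C = {a}: output 1 exactly when a start bit was input x characters ago
# and the last x characters are all 'a'
assert c_eq_x_outputs("aaXaaa", set("a"), 2, [1] * 6) == [0, 0, 0, 0, 1, 1]
```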

Using standard word-level parallelism operations it is straightforward to implement the required interaction between O(w) bit queues and trit counters in constant time. Combined with our amortized constant time algorithm for maintaining O(w) trit counters presented in the next section, we obtain an algorithm for regular expression matching with character class intervals using O(k log w / w + log k) time per character.

6.1 Fast Trit Counters with Reset Let T_A be a set of k_A trit counters with reset values ℓ_1, …, ℓ_{k_A}. Our goal is to show the following result.

Lemma 6.1. We can maintain a set of k_A ≤ w non-uniform trit counters with reset in constant amortized time per operation.

The general idea to prove Lemma 6.1 is to split the k_A counters into a sequence of subcounters which we can efficiently manipulate in parallel. We first explain the algorithm for a single trit counter T with reset value ℓ. The algorithm will lend itself to straightforward parallelization, leading to Lemma 6.1.

The trit counter T with reset value ℓ is stored by a sequence of w subcounters t_0, t_1, …, t_{w−1}, where t_i ∈ {0, 1, 2}. Subcounter t_i represents the value t_i 2^i and the value of T is ∑_{0 ≤ i < w} t_i 2^i. Note that the subcounter representation of a specific value is not unique. We define the initial configuration of T to be the unique representation of the value ℓ such that a prefix t_0, …, t_h of subcounters are all positive and the remaining subcounters are 0. To efficiently implement cancel operations T will be either in an active or passive state. The output of T is 1 if T is active and 0 otherwise. Furthermore, at any given time there will be at most one active subcounter, which can be either starting or finishing.

In each operation we visit a prefix of the subcounters t_0, …, t_i. The length of the prefix is determined by the total number of operations. Specifically, in operation o the endpoint i of the prefix is the position of the rightmost 0 in the binary representation of o, e.g., in operation 23 = (10111)_2 we visit the prefix t_0, t_1, t_2, t_3 since the rightmost 0 is in position 3. Hence, subcounter t_i is visited every 2^i operations, and therefore the amortized number of subcounters visited per operation is constant.
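A quick sanity check of this schedule (with a hypothetical helper name):

```python
def prefix_end(o):
    """Position of the rightmost 0 in the binary representation of o."""
    i = 0
    while (o >> i) & 1:
        i += 1
    return i

assert prefix_end(23) == 3       # 23 = (10111)_2, as in the example above

# t_i is visited in operation o iff i <= prefix_end(o), i.e. every 2**i
# operations, so the total number of subcounter visits over N operations
# is at most 2N: amortized constant per operation.
N = 1 << 12
assert sum(prefix_end(o) + 1 for o in range(1, N + 1)) <= 2 * N
```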

Consider a concrete operation that visits the prefix t_0, …, t_i of subcounters. Recall that t_h is the largest nonzero subcounter in the initial configuration, and let z = min(i, h).

We implement the operations as follows. If we get a cancel we simply set T to passive. If we get a reset we reset the subcounters t_0, …, t_{z−1} according to the initial configuration. The new active subcounter will be t_z. If i < h we set t_z to starting; otherwise z = h, and we set it to finishing.

If we get a decrement we proceed as follows. If T is passive we do nothing. If T is active let t_a be the active subcounter. If i < a we also do nothing. Otherwise, we decrement t_a by 1. There are two cases to consider:

1. t_a is starting. The new active subcounter is t_z. If i < h we set t_z to starting and otherwise we set it to finishing.

2. t_a is finishing. The new active subcounter is t_b, where t_b is the largest nonzero subcounter among t_0, …, t_a. We set t_b to finishing. If there is no such subcounter we set T to passive.

We argue that the algorithm correctly implements a trit counter with reset value ℓ. We first show that the algorithm never decreases a subcounter below 0 in a sequence of decrements following an activation of the counter.

Initially, the first active subcounter is positive since it is in the initial configuration. When we choose an active subcounter in the decrement operation there are two cases to consider. If the current active subcounter t_a is finishing, the algorithm selects the next active subcounter to have a positive value. Hence, suppose that t_a is starting. We have that t_h is never starting. Furthermore, if a = z then t_a cannot be starting. Hence, a < min(h, i) = z and therefore the new active subcounter t_z is positive. Inductively, it follows that we never make a subcounter negative.

The initial configuration represents the value ℓ. Whenever we decrement a subcounter t_i there have been exactly 2^i operations since we last decremented any subcounter. If we do not get a cancel operation as input, the trit counter becomes passive when all subcounters are 0. It follows that the algorithm correctly implements a trit counter.

It remains to implement the algorithm for k_A ≤ w trit counters efficiently. To do so we represent each level of the subcounters vertically in a single bit string, i.e., all the k_A subcounters at level i are stored consecutively in a single bit string. Similarly, we represent the state (passive, active, starting, finishing) of each subcounter vertically. Finally, we store the state (passive or active) of the k_A trit counters in a single bit string. In total we use O(k_A w) bits and therefore the space is O(k_A) words. With this representation we can now read and write a single level of subcounters in constant time.

The input to the trit counters is given compactly as a bit string and the output is the bit string representing the state of all trit counters. Using a constant number of precomputed masks to compare and extract fields from the representation and the input bit string, the algorithm is now straightforward to implement using standard word-level parallelism operations in constant time per visited subcounter level. Since we visit an amortized constant number of subcounter levels in each operation, the result of Lemma 6.1 follows.

6.2 Summing Up Combining the result of Lemma 6.1 with our black-box extension of our algorithm from Section 4.2, we obtain an algorithm for regular expression matching with character class intervals that uses O(k log w / w + log k) time per character, as in the bound from Theorem 1.1. We count the extra space and preprocessing needed for the character class intervals.

Each character class interval C{x, y} requires a bit queue of length x and a trit counter with reset value y − x. The bit queue uses O(x) space and preprocessing time. The trit counter uses only constant space and we can initialize all trit counters in O(m) time.

Given a character we also need to determine which character classes it belongs to. To do this we store for each character α a list of the micro TNFAs that contain α in a character class. With each such micro TNFA we store a bit string indicating the character classes that contain α. Each bit string can be stored in a single word. We keep these lists sorted by the traversal order of the micro TNFAs such that we can retrieve each bit string during the traversal in constant time. Note that in the algorithm for C{=x} we also need the complement of C, which we can compute by negation in constant time.

We only store a bit string for a character α if the micro TNFA contains a character class containing α, and therefore the total number of bit strings is O(m). Hence, the total space for all our lists is O(m). If we store the table of lists as a complete table we use O(|Σ|) additional space. For small alphabets this gives a simple and practical solution. To get the bound of Theorem 1.2 we use that the total number of different characters in all character classes is at most m. Hence, only O(m) entries in the table will be non-empty. With deterministic dictionaries [12] we can therefore represent the table with constant-time queries in O(m) space after O(m log m) preprocessing.

In total, we use O(X + m) additional space and O(X + m log m) additional preprocessing. This completes the proof of Theorem 1.2.

References

[1] A. V. Aho and M. J. Corasick. Efficient string matching: an aid to bibliographic search. Commun. ACM, 18(6):333–340, 1975.

[2] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: principles, techniques, and tools. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1986.

[3] R. Baeza-Yates and G. H. Gonnet. A new approach to text searching. Commun. ACM, 35(10):74–82, 1992.

[4] P. Bille. New algorithms for regular expression matching. In Proc. 33rd ICALP, pages 643–654, 2006.

[5] P. Bille and M. Farach-Colton. Fast and compact regular expression matching. Theoret. Comput. Sci., 409:486–496, 2008.

[6] P. Bille and M. Thorup. Faster regular expression matching. In Proc. 36th ICALP, pages 171–182, 2009.

[7] P. Bucher and A. Bairoch. A generalized profile syntax for biomolecular sequence motifs and its function in automatic sequence interpretation. In Proc. 2nd ISMB, pages 53–61, 1994.

[8] K. Fredriksson and S. Grabowski. Efficient algorithms for pattern matching with general gaps, character classes, and transposition invariance. Inf. Retr., 11(4):335–357, 2008.

[9] K. Fredriksson and S. Grabowski. Nested counters in bit-parallel string matching. In Proc. 3rd LATA, pages 338–349, 2009.

[10] Z. Galil. Open problems in stringology. In A. Apostolico and Z. Galil, editors, Combinatorial problems on words, NATO ASI Series, Vol. F12, pages 1–8. 1985.

[11] V. M. Glushkov. The abstract theory of automata. Russian Math. Surveys, 16(5):1–53, 1961.

[12] T. Hagerup, P. B. Miltersen, and R. Pagh. Deterministic dictionaries. J. Algorithms, 41(1):69–85, 2001.

[13] K. Hofmann, P. Bucher, L. Falquet, and A. Bairoch. The PROSITE database, its status in 1999. Nucleic Acids Res., 27:215–219, 1999.

[14] T. Johnson, S. Muthukrishnan, and I. Rozenbaum. Monitoring regular expressions on out-of-order streams. In Proc. 23rd ICDE, pages 1315–1319, 2007.

[15] B. Kernighan and D. Ritchie. The C Programming Language (2nd Ed.). Prentice-Hall, 1988. First edition from 1978.

[16] S. C. Kleene. Representation of events in nerve nets and finite automata. In C. E. Shannon and J. McCarthy, editors, Automata Studies, Ann. Math. Stud. No. 34, pages 3–41. Princeton U. Press, 1956.

[17] D. E. Knuth, J. H. Morris, Jr., and V. R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323–350, 1977.

[18] Q. Li and B. Moon. Indexing and querying XML data for regular path expressions. In Proc. 27th VLDB, pages 361–370, 2001.

[19] R. McNaughton and H. Yamada. Regular expressions and state graphs for automata. IRE Trans. on Electronic Computers, 9(1):39–47, 1960.

[20] M. Morgante, A. Policriti, N. Vitacolonna, and A. Zuccolo. Structured motifs search. J. Comput. Bio., 12(8):1065–1082, 2005.

[21] M. Murata. Extended path expressions of XML. In Proc. 20th PODS, pages 126–137, 2001.

[22] E. W. Myers. Approximate matching of network expressions with spacers. J. Comput. Bio., 3(1):33–51, 1992.


[23] E. W. Myers. A four Russians algorithm for regular expression pattern matching. J. ACM, 39(2):430–448, 1992.

[24] G. Myers and G. Mehldau. A system for pattern matching applications on biosequences. CABIOS, 9(3):299–314, 1993.

[25] G. Navarro and M. Raffinot. Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching. J. Comput. Bio., 10(6):903–923, 2003.

[26] S. Sen, O. Spatscheck, and D. Wang. Accurate, scalable in-network identification of p2p traffic using application signatures. In Proc. 13th WWW, pages 512–521, 2004.

[27] B. Stroustrup. The C++ Programming Language: Special Edition (3rd Edition). Addison-Wesley, 2000. First edition from 1985.

[28] K. Thompson. Regular expression search algorithm. Commun. ACM, 11:419–422, 1968.

[29] M. Thorup. Randomized sorting in O(n log log n) time and linear space using addition, shift, and bitwise boolean operations. J. Algorithms, 42(2):205–230, 2002. Announced at SODA 1997.

[30] L. Wall. The Perl Programming Language. Prentice Hall Software Series, 1994.

[31] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz. Fast and memory-efficient regular expression matching for deep packet inspection. In Proc. ANCS, pages 93–102, 2006.
