Regular Expressions into Finite Automata Anne Bruggemann-Klein Presenting: Rutie Mesing.
Date posted: 21-Dec-2015
Outline
Building the Glushkov automaton in O((size of E)^2)
Defining the star normal form
Building the Glushkov automaton in O(size of E) for deterministic regular expressions
Strong and weak unambiguity
Quadratic-time decision algorithm for weak unambiguity
General definitions
E: a regular expression
L(E): the language specified by the regular expression E
The size of a regular expression E: the number of symbols it contains, including syntactic symbols such as brackets, +, . and *
The size of an NFA: the number of its transitions
pos(E), χ(x)
Marking an expression with positions: (a+b)*a(ab)* becomes (a1+b2)*a3(a4b5)*
pos(E): the set of subscripted symbols in an expression E
x, y, z are used to denote positions
a, b, c are used for elements of the alphabet Σ
For a position x, χ(x) is the corresponding symbol of Σ
Position sets: first(E), last(E), inductive definition
[E = ∅ or ε]
first(E) = last(E) = ∅
[E = x]
first(E) = last(E) = {x}
[E = F + G]
first(E) = first(F) ∪ first(G)
last(E) = last(F) ∪ last(G)
[E = FG]
first(E) = first(F) ∪ first(G) if ε∈L(F)
           first(F) otherwise
last(E) = last(F) ∪ last(G) if ε∈L(G)
          last(G) otherwise
[E = F*]
first(E) = first(F)
last(E) = last(F)
Position sets: follow(E,x), inductive definition
[E = ∅ or ε]
E has no positions
[E = x]
follow(E,x) = ∅
[E = F + G]
follow(E,x) = follow(F,x) if x∈pos(F)
              follow(G,x) if x∈pos(G)
[E = FG]
follow(E,x) = follow(F,x) if x∈pos(F) \ last(F)
              follow(F,x) ∪ first(G) if x∈last(F)
              follow(G,x) if x∈pos(G)
[E = F*]
follow(E,x) = follow(F,x) if x∈pos(F) \ last(F)
              follow(F,x) ∪ first(F) if x∈last(F)
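The inductive definitions of first, last and follow can be sketched as one postorder traversal. This is a minimal illustration, not the slides' own code: the tuple encoding of expressions and the names `analyze` and `pos` are assumptions made for the example.

```python
# Hypothetical tuple encoding of a marked expression (not from the slides):
#   ("empty",) for the empty set, ("eps",) for the empty word,
#   ("pos", x) for position x, ("+", F, G), (".", F, G), ("*", F).

def analyze(e, follow):
    """Postorder traversal: returns (nullable, first, last), fills follow."""
    kind = e[0]
    if kind == "empty":
        return False, set(), set()
    if kind == "eps":
        return True, set(), set()
    if kind == "pos":
        follow[e[1]] = set()
        return False, {e[1]}, {e[1]}
    if kind == "*":
        nf, ff, lf = analyze(e[1], follow)
        for x in lf:                 # follow(F,x) united with first(F)
            follow[x] |= ff
        return True, ff, lf
    nf, ff, lf = analyze(e[1], follow)
    ng, fg, lg = analyze(e[2], follow)
    if kind == "+":
        return nf or ng, ff | fg, lf | lg
    if kind == ".":
        for x in lf:                 # follow(F,x) united with first(G)
            follow[x] |= fg
        first = ff | fg if nf else ff
        last = lf | lg if ng else lg
        return nf and ng, first, last
    raise ValueError(kind)

def pos(x):
    return ("pos", x)

# (a1+b2)* a3 (a4 b5)*, the marked example expression
E = (".", (".", ("*", ("+", pos(1), pos(2))), pos(3)),
          ("*", (".", pos(4), pos(5))))
follow = {}
nullable, first, last = analyze(E, follow)
```

For the example expression this yields first(E) = {1, 2, 3} and last(E) = {3, 5}. Python set union is used for every case, so the sketch makes no attempt to reach the time bounds discussed later.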
The Glushkov Automaton (NFA)
ME = (QE ∪ {qI}, Σ, δE, qI, FE)
QE = pos(E)
For a∈Σ, let δE(qI,a) = {x | x∈first(E), χ(x)=a}
For x∈pos(E), a∈Σ, let δE(x,a) = {y | y∈follow(E,x), χ(y)=a}
FE = last(E) ∪ {qI} if ε∈L(E)
     last(E) otherwise
Proposition 2.1: L(ME) = L(E)
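The construction above can be sketched directly from the position sets. In this illustration the sets for (a1*+b2a3)* are written out by hand from the inductive definitions; the function names `glushkov` and `accepts` and the dictionary layout are assumptions made for the example.

```python
def glushkov(first, last, follow, chi, nullable):
    """Return (delta, finals); states are 'qI' and the positions."""
    delta = {}
    for y in first:                       # transitions leaving qI
        delta.setdefault(("qI", chi[y]), set()).add(y)
    for x, ys in follow.items():          # transitions leaving position x
        for y in ys:
            delta.setdefault((x, chi[y]), set()).add(y)
    finals = set(last) | ({"qI"} if nullable else set())
    return delta, finals

def accepts(delta, finals, word):
    """Simulate the NFA on word by tracking the set of current states."""
    states = {"qI"}
    for a in word:
        states = set().union(*(delta.get((q, a), set()) for q in states))
    return bool(states & finals)

# Position sets of (a1* + b2 a3)*, worked out by hand
chi = {1: "a", 2: "b", 3: "a"}
first, last = {1, 2}, {1, 3}
follow = {1: {1, 2}, 2: {3}, 3: {1, 2}}
delta, finals = glushkov(first, last, follow, chi, nullable=True)
size = sum(len(targets) for targets in delta.values())  # number of transitions
```

Here `size` counts the transitions of ME, which is the measure of NFA size used in these slides.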
Example
E = (a*+ba)*, marked as (a1*+b2a3)*
[Figure: Glushkov automaton of (a1*+b2a3)*, with states qI, 1, 2, 3 and transitions labeled a and b]
The canonical method (O(n^3)) for computing first, last & follow
Converting E into a syntax tree
Leaves are labeled with ∅, ε or positions of E
Internal nodes are labeled +, . or *
Building time: O(n) (n = size of E)
Each node v in the syntax tree corresponds to a subexpression Ev of E.
Postorder traversal of the syntax tree, computing:
nullable(v): Boolean; does L(Ev) contain ε?
first(v), last(v) ∈ 2^pos(E)
For each x∈pos(E) there is a global variable follow(x) ∈ 2^pos(E)
Total time: O(n^3)
case
  v is a node labeled ∅:
    nullable(v) := false; first(v) := ∅; last(v) := ∅;
  v is a node labeled ε:
    nullable(v) := true; first(v) := ∅; last(v) := ∅;
  v is a node labeled x:
    nullable(v) := false; follow(x) := ∅; first(v) := {x}; last(v) := {x};
  v is a node labeled +:
    nullable(v) := nullable(leftchild) or nullable(rightchild);
    first(v) := first(leftchild) ∪ first(rightchild);   (first/last union)
    last(v) := last(leftchild) ∪ last(rightchild);   (first/last union)
  v is a node labeled . :
    nullable(v) := nullable(leftchild) and nullable(rightchild);
    for each x in last(leftchild) do
      follow(x) := follow(x) ∪ first(rightchild);   (follow union, concatenation case)
    if nullable(leftchild)
      then first(v) := first(leftchild) ∪ first(rightchild)   (first/last union)
      else first(v) := first(leftchild);
    if nullable(rightchild)
      then last(v) := last(leftchild) ∪ last(rightchild)   (first/last union)
      else last(v) := last(rightchild);
  v is a node labeled *:
    nullable(v) := true;
    for each x in last(child) do
      follow(x) := follow(x) ∪ first(child);   (follow union, star case)
    first(v) := first(child); last(v) := last(child);
end case;
Lemma 2.5
The following invariant holds after node v has been visited.
1. nullable(v) is true if and only if ε∈L(Ev).
2. first(v) = first(Ev), last(v) = last(Ev).
Furthermore, if node v has been visited but the parent of v has not, then
3. follow(x) = follow(Ev, x) for x∈pos(Ev).
In particular, for the root node v0,
1. first(v0) = first(E), last(v0) = last(E).
2. follow(x) = follow(E, x), for x∈pos(E).
Observations
All first/last unions and all follow unions in the concatenation case are disjoint, because pos(F) ∩ pos(G) = ∅.
Only the follow unions in the star case are not necessarily disjoint.
Example: E = (a*b*)*, H = a*b*
Elements of first(H) are added to follow(H,x) for x∈last(H), but some elements of first(H) may already belong to follow(H,x) for some x∈last(H).
O(n^3) for computing first(E), last(E) and follow(E,x)
Computing first, last & follow in a better time bound (O(n^2))
General strategy:
We only consider expressions for which all unions, including the follow unions in the star case, are disjoint.
Such expressions are in star normal form (SNF).
Then we show that our algorithm runs in time O(size(ME)) for expressions E in star normal form.
Finally, we show why the restriction to star normal form is justified.
Star Normal Form - Definition
A regular expression E is in star normal form if, for each starred subexpression H* of E, the SNF conditions
follow(H, last(H)) ∩ first(H) = ∅
and ε∉L(H)
hold.
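The two SNF conditions can be checked during the same kind of postorder traversal that computes first, last and follow. The sketch below is an illustration only: the tuple encoding (`("pos", x)`, `("+", F, G)`, `(".", F, G)`, `("*", F)`, `("eps",)`, `("empty",)`) and the name `is_snf` are assumptions, and follow(H, last(H)) is read as the union of follow(H, x) over x∈last(H).

```python
def is_snf(e):
    """True if every starred subexpression H* satisfies both SNF conditions."""
    follow, ok = {}, [True]

    def visit(e):                        # returns (nullable, first, last)
        kind = e[0]
        if kind in ("empty", "eps"):
            return kind == "eps", set(), set()
        if kind == "pos":
            follow[e[1]] = set()
            return False, {e[1]}, {e[1]}
        if kind == "*":
            n, f, l = visit(e[1])
            # Here follow[x] still equals follow(H, x) for the operand H.
            if n or any(follow[x] & f for x in l):
                ok[0] = False
            for x in l:
                follow[x] |= f
            return True, f, l
        nf, ff, lf = visit(e[1])
        ng, fg, lg = visit(e[2])
        if kind == "+":
            return nf or ng, ff | fg, lf | lg
        for x in lf:                     # kind == "."
            follow[x] |= fg
        return nf and ng, (ff | fg if nf else ff), (lf | lg if ng else lg)

    visit(e)
    return ok[0]

a1, b2 = ("pos", 1), ("pos", 2)
not_snf = ("*", (".", ("*", a1), ("*", b2)))   # (a1* b2*)*
in_snf = ("*", ("+", a1, b2))                  # (a1 + b2)*
```

The first expression violates the condition ε∉L(H) for H = a1*b2*, while the second satisfies both conditions.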
Lemma 2.7
Let E be a regular expression in star normal form. ME can be computed from E in time O(size(E) + size(ME)).
Proof
Each first/last union takes constant time (list concatenation).
Follow unions in the concatenation and star cases:
Observation: for any subexpression F of a subexpression G of E and any x∈pos(F), follow(F,x) ⊆ follow(G,x) ⊆ follow(E,x).
The run time of a follow union in a node v and for a position x is proportional to the number of positions in follow(Ev,x) that are not already in follow(H,x) for a proper subexpression H of Ev.
Total run time spent in follow unions: ∑ over x∈pos(E) of |follow(E,x)|, because all these unions are disjoint (SNF).
This is less than or equal to the number of transitions in ME.
Why the restriction to star normal form is justified
Theorem 3.1
For each regular expression E, there is a regular expression E• such that
ME = ME• (the same Glushkov automaton)
E• is in star normal form
E• can be computed from E in linear time.
From starred expression E* into E◦*
Goal: the SNF conditions are fulfilled for E◦
Observation
After removing from ME all "feedback" transitions leading from final states (apart from qI) to states that qI is directly connected to, and changing qI to be non-final,
the resulting NFA is the Glushkov automaton of an expression E◦ with follow(E◦, last(E◦)) ∩ first(E◦) = ∅.
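Example
E = (a1*b2*)*
[Figure: Glushkov automaton of E, with states qI, 1 and 2 and transitions labeled a and b]
E◦ = (a1+b2)
[Figure: Glushkov automaton of E◦, with states qI, 1 and 2, an a-transition into 1 and a b-transition into 2]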
E◦ - inductive definition
[E = ∅ or ε]
E◦ = ∅
[E = a]
E◦ = E
[E = F + G]
E◦ = F◦ + G◦
[E = FG]
E◦ = FG if ε∉L(F), ε∉L(G)
     F◦G if ε∉L(F), ε∈L(G)
     FG◦ if ε∈L(F), ε∉L(G)
     F◦ + G◦ (!) if ε∈L(F), ε∈L(G)
[E = F*]
E◦ = F◦ (!)
Lemma 3.3
1. size(E◦) ≤ size(E).
2. ε∉L(E◦).
3. pos(E◦) = pos(E).
4. first(E◦) = first(E), last(E◦) = last(E).
5. follow(E◦, x) = follow(E, x), for all x∈pos(E) \ last(E).
6. follow(E◦, x) = follow(E, x) \ first(E), for all x∈last(E); hence follow(E◦, last(E◦)) ∩ first(E◦) = ∅.
7. follow(E◦*, x) = follow(E*, x), for all x∈pos(E).
8. ME* = ME◦*.
The proof is by induction on E. Claims 7 and 8 follow directly from 5 and 6.
From E to E•
If we substitute in E each starred subexpression H* with H◦*, proceeding bottom-up in E,
we can expect to get an expression E• in star normal form with ME = ME•.
E• - inductive definition
[E = a, ε or ∅]
E• = E
[E = F + G]
E• = F• + G•
[E = FG]
E• = F•G•
[E = F*]
E• = (F•◦)*
Example: E = (a*b*)*
E• = ((a*b*)•◦)* = ((a*b*)◦)* = (a+b)*
ME = ME•
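The two inductive definitions (E◦, then E•) translate almost line by line into recursive functions. This is a sketch under assumptions: expressions use a hypothetical tuple encoding, the names `nullable`, `circ` and `bullet` are illustrative, and nullability is recomputed on demand, so this version does not achieve the linear time bound.

```python
# Hypothetical encoding: ("empty",), ("eps",), ("pos", x),
# ("+", F, G), (".", F, G), ("*", F).

def nullable(e):
    """True if the empty word belongs to L(e)."""
    kind = e[0]
    if kind in ("eps", "*"):
        return True
    if kind in ("empty", "pos"):
        return False
    if kind == "+":
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])      # "."

def circ(e):
    """The circle operator: removes feedback from last(E) to first(E)."""
    kind = e[0]
    if kind in ("empty", "eps"):
        return ("empty",)
    if kind == "pos":
        return e
    if kind == "+":
        return ("+", circ(e[1]), circ(e[2]))
    if kind == "*":
        return circ(e[1])
    f, g = e[1], e[2]                             # kind == "."
    nf, ng = nullable(f), nullable(g)
    if nf and ng:
        return ("+", circ(f), circ(g))
    if ng:
        return (".", circ(f), g)
    if nf:
        return (".", f, circ(g))
    return e

def bullet(e):
    """The bullet operator: star normal form of e."""
    kind = e[0]
    if kind in ("+", "."):
        return (kind, bullet(e[1]), bullet(e[2]))
    if kind == "*":
        return ("*", circ(bullet(e[1])))
    return e

a1, b2 = ("pos", 1), ("pos", 2)
E = ("*", (".", ("*", a1), ("*", b2)))            # (a1* b2*)*
E_snf = bullet(E)                                 # (a1 + b2)*
```

On the running example the result is the tuple for (a1+b2)*, matching the worked example.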
Lemma 3.5
L(E•) = L(E)
size(E•) ≤ size(E)
pos(E•) = pos(E)
first(E•) = first(E)
last(E•) = last(E)
follow(E•, x) = follow(E, x), for x∈pos(E)
qI∈FE• if and only if qI∈FE
These claims imply the first part of Theorem 3.1, ME• = ME.
E• is in SNF
The proof is by induction on the size of E.
The star case [E = F*]: E• = (F•◦)*
The SNF conditions hold for H = F•◦ (Lemma 3.3):
follow(H, last(H)) ∩ first(H) = ∅
ε∉L(H)
F◦• is in SNF, by the induction hypothesis
Need to show that F•◦ = F◦•
Lemma 3.6
(1) E◦◦ = E◦
(2) E•◦ = E◦•
(3) E•• = E•
Proof: by induction on E. The star case [E = F*]:
(1) E◦◦ = F◦◦ = F◦ = E◦   (def; induction; def)
(2) E•◦ = (F•◦)*◦ = F•◦◦ = F•◦ = F◦• = E◦•   (def; def; (1); induction; def)
(3) E•• = (F•◦)*• = (F•◦•◦)* = (F••◦◦)* = (F•◦)* = E•   (def; def; (2); induction & (1); def)
Compute E• from E in linear time
For each subexpression H of E, we need H• and H•◦ for computing E•.
H• and H•◦ are computed simultaneously during the postorder traversal.
It remains to prove that only a constant amount of time is spent at each node.
Conclusions so far
Theorem 3.9
The Glushkov automaton ME can be computed from a regular expression E in time linear in size(E) + size(ME).
Proof
E• is computed from E in linear time.
E• is in star normal form.
ME = ME• can be computed from E• in time O(size(E•) + size(ME•)).
Deterministic regular expression
A regular expression E is deterministic if the corresponding NFA ME is deterministic.
Theorem 3.11
1. It can be decided in linear time whether a regular expression E is deterministic.
2. If E is deterministic, then the deterministic finite automaton ME can be computed from E in linear time.
Theorem 3.11 - Proof
E is deterministic if and only if E• is (isomorphic Glushkov automata), so we can assume that E is in star normal form.
We start to compute first(E), last(E), and follow(E,x) for x∈pos(E) incrementally, keeping track of follow(E,x) in a |pos(E)| × |Σ| matrix.

E = (a1+b2)*
pos | a  | b
1   | a1 | b2
2   | a1 | b2
E is deterministic.

E = (a1+b2)*a3
pos | a       | b
1   | a1 & a3 | b2
2   | a1 & a3 | b2
3   |         |
E is nondeterministic.
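The matrix test can be sketched as a check that no row ever receives two positions for the same symbol. The position sets below are written out by hand for the two example expressions; `is_deterministic` and the dictionary layout are assumptions made for this illustration, and the sketch works on already computed sets rather than incrementally.

```python
def is_deterministic(first, follow, chi):
    """True if no state of the Glushkov automaton has two successors
    carrying the same symbol (qI is covered by first)."""
    def conflict_free(positions):
        seen = set()
        for y in positions:
            if chi[y] in seen:       # second entry in the same matrix cell
                return False
            seen.add(chi[y])
        return True
    return conflict_free(first) and all(
        conflict_free(ys) for ys in follow.values())

# E = (a1+b2)*
det = is_deterministic({1, 2}, {1: {1, 2}, 2: {1, 2}}, {1: "a", 2: "b"})

# E = (a1+b2)* a3
nondet = is_deterministic(
    {1, 2, 3},
    {1: {1, 2, 3}, 2: {1, 2, 3}, 3: set()},
    {1: "a", 2: "b", 3: "a"})
```

For the second expression the a-cell of rows 1 and 2 would hold both a1 and a3, so the check fails.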
Ambiguity in automata and expressions
Unambiguous NFA, definition: for each word w, there is at most one path from the initial state to a final state that spells out w.
Weakly unambiguous
Intuition: each word of L(E) has a unique path through E.
Definition: a regular expression E is weakly unambiguous if and only if the NFA ME is unambiguous.
Strongly unambiguous
Intuition: each word of L(E) can be uniquely decomposed into subwords matching the subexpressions of E.
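Unambiguity of an NFA without ε-transitions, such as ME, can be tested on the product of the automaton with itself: the NFA is ambiguous exactly when some pair (p, q) with p ≠ q is reachable from (qI, qI) and can still reach a pair of final states. The sketch below uses this standard product technique, which is not necessarily the procedure behind the quadratic bound in these slides; `is_unambiguous` and the transition-dictionary layout are assumptions.

```python
from collections import deque

def is_unambiguous(delta, initial, finals, alphabet):
    """delta maps (state, symbol) to a set of states; no epsilon moves."""
    start = (initial, initial)
    reach, preds, queue = {start}, {}, deque([start])
    while queue:                          # forward search in the product
        p, q = queue.popleft()
        for a in alphabet:
            for p2 in delta.get((p, a), ()):
                for q2 in delta.get((q, a), ()):
                    preds.setdefault((p2, q2), set()).add((p, q))
                    if (p2, q2) not in reach:
                        reach.add((p2, q2))
                        queue.append((p2, q2))
    # backward search from reachable pairs of final states
    live = {pq for pq in reach if pq[0] in finals and pq[1] in finals}
    queue = deque(live)
    while queue:
        for pq in preds.get(queue.popleft(), ()):
            if pq not in live:
                live.add(pq)
                queue.append(pq)
    return all(p == q for p, q in live)

# Glushkov automaton of (a1+b2)* a3: nondeterministic but unambiguous
d1 = {("qI", "a"): {1, 3}, ("qI", "b"): {2},
      (1, "a"): {1, 3}, (1, "b"): {2},
      (2, "a"): {1, 3}, (2, "b"): {2}}
# Glushkov automaton of a1 + a2: the word a has two accepting paths
d2 = {("qI", "a"): {1, 2}}
```

The first automaton shows that weak unambiguity does not require determinism.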
Strongly unambiguous
[E = ∅, ε or a]
E is strongly unambiguous.
[E = F + G]
E is strongly unambiguous if F and G are strongly unambiguous and L(F) and L(G) are disjoint.
[E = FG]
E is strongly unambiguous if F and G are strongly unambiguous and the concatenation of L(F) and L(G) is unambiguous.
[E = F*]
E is strongly unambiguous if F is strongly unambiguous and the star of L(F) is unambiguous.
Concatenation: L.L' is unambiguous if v,w∈L, v',w'∈L', vv'=ww' ⇒ v=w and v'=w'.
Star: L* is unambiguous if v1...vm∈L, w1...wn∈L, m,n≥0, v1...vm=w1...wn ⇒ m=n and vi=wi for 1≤i≤m.
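For finite languages, the definition of an unambiguous concatenation can be checked by brute force: every word of L.L' must arise from exactly one pair (v, v'). The sketch assumes the languages are given as finite Python sets of strings, which is a restriction added here for illustration; `concat_unambiguous` is a hypothetical name.

```python
from itertools import product

def concat_unambiguous(L1, L2):
    """True if each word of L1.L2 has a unique factorisation v + w
    with v in L1 and w in L2 (L1, L2 finite sets of strings)."""
    seen = set()
    for v, w in product(L1, L2):
        if v + w in seen:            # a second pair produced the same word
            return False
        seen.add(v + w)
    return True

ambiguous = concat_unambiguous({"a", "ab"}, {"b", "bb"})
unambiguous = concat_unambiguous({"a"}, {"b", "bb"})
```

In the first call the word abb factors both as a.bb and as ab.b, so the concatenation is ambiguous.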
Strongly unambiguous, in terms of automata
Let M'E be the NFA with ε-transitions recognizing L(E) according to any of the standard constructions.
Lemma 4.5
E is strongly unambiguous if and only if M'E is unambiguous.
Lemma 4.6
If E is strongly unambiguous, then E is weakly unambiguous.
Proof
Elimination of ε-transitions transforms M'E into ME.
Different paths in ME spelling out a word w correspond to different paths in M'E doing the same.
Unambiguity of M'E (Lemma 4.5) ⇒ unambiguity of ME.
Lemma 4.7 - weakly unambiguous
[E = ∅, ε or a]
E is weakly unambiguous.
[E = F + G]
E is weakly unambiguous if and only if F and G are weakly unambiguous and at most ε is in both L(F) and L(G).
[E = FG]
E is weakly unambiguous if and only if F and G are weakly unambiguous and the concatenation of L(F) and L(G) is unambiguous.
[E = F*]
Let follow(F, last(F)) ∩ first(F) = ∅ and ε∉L(F).
Then E is weakly unambiguous if and only if F is weakly unambiguous and the star of L(F) is unambiguous.
Epsilon Normal Form
Epsilon normal form condition: no subexpression of E denotes the empty word ambiguously.
[E = ∅, ε or a]
E is in epsilon normal form.
[E = F + G]
E is in epsilon normal form if F and G are in epsilon normal form and ε∉L(F) ∩ L(G).
[E = FG]
E is in epsilon normal form if F and G are in epsilon normal form.
[E = F*]
E is in epsilon normal form if F is in epsilon normal form and ε∉L(F).
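The inductive definition of epsilon normal form can be checked in a single bottom-up pass that also tracks whether ε belongs to the language of each subexpression. This is a minimal sketch; the tuple encoding and the name `enf` are assumptions made for the example.

```python
# Hypothetical encoding: ("empty",), ("eps",), ("pos", x),
# ("+", F, G), (".", F, G), ("*", F).

def enf(e):
    """Return (in_epsilon_normal_form, nullable) for the expression e."""
    kind = e[0]
    if kind == "eps":
        return True, True
    if kind in ("empty", "pos"):
        return True, False
    if kind == "*":
        ok, n = enf(e[1])
        return ok and not n, True          # operand must not contain eps
    okf, nf = enf(e[1])
    okg, ng = enf(e[2])
    if kind == "+":
        return okf and okg and not (nf and ng), nf or ng
    return okf and okg, nf and ng          # kind == "."

a1, b2 = ("pos", 1), ("pos", 2)
r1 = enf(("*", (".", ("*", a1), ("*", b2))))   # (a1* b2*)*
r2 = enf(("*", ("+", a1, b2)))                 # (a1 + b2)*
r3 = enf(("+", ("*", a1), ("eps",)))           # a1* + eps
```

Each node is visited once and does constant work, which fits the remark in the open problems that the test takes linear time.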
Strongly unambiguous expressions are in star and in epsilon normal form
Lemma 4.10
If E* is strongly unambiguous, then follow(E, last(E)) ∩ first(E) = ∅.
Proof
Assume that there exist x∈last(E), y∈follow(E,x) ∩ first(E), z∈last(E).
x is a final state in ME (and so is z).
x1...xn x y y1...ym z is a path through ME.
But this path is also the composition of two paths through ME, namely x1...xn x and y y1...ym z.
This makes L(E)* ambiguous.
Theorem 4.9
E is strongly unambiguous if and only if
1. E is weakly unambiguous
2. E is in star normal form
3. E is in epsilon normal form
Proof
For expressions in star and epsilon normal form, weak and strong unambiguity are identical (using Lemma 4.7).
Strongly unambiguous expressions are in star and in epsilon normal form (Lemma 4.10).
Test for weak unambiguity in quadratic time
Theorem 4.11
Regular expressions in epsilon normal form can be tested for weak unambiguity in quadratic time.
Proof
Let E be in epsilon normal form.
E can be transformed into star normal form E• in linear time without changing the Glushkov automaton.
E• is also in epsilon normal form.
E is weakly unambiguous if and only if E• is weakly unambiguous, if and only if E• is strongly unambiguous.
Strong unambiguity of expressions can be decided in quadratic time.
Open problems
It is easy to see that a regular expression can be tested for epsilon normal form in linear time.
Can a given regular expression be transformed into epsilon normal form in linear time?
Our transformation into star normal form can deal with starred subexpressions.
Hence, the crucial point is how expressions E = F+G with ε∈L(F) ∩ L(G) can be handled.
A straightforward approach would eliminate the empty string either from L(F) or from L(G).
This opens up another question:
Is there a linear-time algorithm transforming a regular expression E into an expression E' with L(E') = L(E) \ {ε}?