LEXICAL ANALYSIS Phung Hua Nguyen University of Technology 2006.
Faculty of IT - HCMUT Lexical Analysis 2
Outline
• Introduction to Lexical Analysis
• Token specification
  – Language
  – Regular Expressions (REs)
• Token recognition
  – REs → NFA (Thompson’s construction, Algorithm 3.3)
  – NFA → DFA (subset construction, Algorithm 3.2)
  – DFA → minimal DFA (Algorithm 3.6)
• Programming
Introduction
• Read the input characters
• Produce as output a sequence of tokens
• Eliminate white space and comments
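[Figure: the source program is read by the lexical analyzer, which hands a token to the parser each time the parser issues "get next token"; both components use the symbol table.]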
Why a separate lexical analysis phase?
• Simplify design
• Improve compiler efficiency
• Enhance compiler portability
Tokens, Patterns, Lexemes
Token      Sample lexemes          Informal description of pattern
const      const                   const
if         if                      if
relation   <, <=, ==, !=, >, >=    < or <= or == or != or > or >=
id         pi, count, x2           letter followed by letters or digits
num        3.14, 25, 6.02E3        any numeric constant
literal    “core dumped”           any characters between “ and “ except “
Alphabet, Strings and Languages
• Alphabet ∑: any finite set of symbols
  – The Vietnamese alphabet {a, á, à, ả, ã, ạ, b, c, d, đ, …}
  – The binary alphabet {0, 1}
  – The ASCII alphabet
• String: a finite sequence of symbols drawn from ∑
  – Length |s| of a string s: the number of symbols in s
  – The empty string, denoted ε, has length |ε| = 0
• Language: any set of strings over ∑; its two special cases:
  – ∅: the empty set
  – {ε}: the set containing only the empty string
Examples of Languages
• ∑ = {a, á, à, ả, ã, ạ, b, c, d, đ, …}
  – Vietnamese language
• ∑ = {0, 1}
  – A string is an instruction
  – The set of Pentium instructions
• ∑ = the ASCII set
  – A string is a program
  – The set of C programs
Terms (Fig.3.7)
Term                  Definition
prefix of s           a string obtained by removing zero or more trailing symbols of s; e.g. ban is a prefix of banana
suffix of s           a string formed by deleting zero or more leading symbols of s; e.g. na is a suffix of banana
substring of s        a string obtained by deleting a prefix and a suffix from s; e.g. nan is a substring of banana
proper prefix, suffix or substring of s
                      any nonempty string x that is, respectively, a prefix, suffix or substring of s such that s ≠ x
String operations
• String concatenation
  – If x and y are strings, xy is the string formed by appending y to x.
    E.g.: x = hom, y = nay ⇒ xy = homnay
  – ε is the identity: εy = y; xε = x
• String exponentiation
  – s⁰ = ε
  – sⁱ = sⁱ⁻¹s
    E.g. s = 01: s⁰ = ε, s² = 0101, s³ = 010101
Language Operations (Fig 3.8)
Term                    Definition
union: L ∪ M            L ∪ M = { s | s ∈ L or s ∈ M }
concatenation: LM       LM = { st | s ∈ L and t ∈ M }
Kleene closure: L*      L* = L⁰ ∪ L ∪ LL ∪ LLL ∪ …, where L⁰ = {ε}
                        (zero or more concatenations of L)
positive closure: L+    L+ = L ∪ LL ∪ LLL ∪ …
                        (one or more concatenations of L)
Examples
• L = {A, B, …, Z, a, b, …, z}
• D = {0, 1, …, 9}
Example       Language
L ∪ D         letters and digits
LD            strings consisting of a letter followed by a digit
L⁴            all four-letter strings
L*            all strings of letters, including ε
L(L ∪ D)*     all strings of letters and digits beginning with a letter
D+            all strings of one or more digits
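The language operations above can be tried out on small finite sets. The following is only a sketch: the class name LanguageOps and the method closureUpTo are made up for illustration, the alphabets are cut down to two symbols each to keep the output short, and since L* is infinite the closure is only computed up to a chosen power.

```java
import java.util.Set;
import java.util.TreeSet;

final class LanguageOps {
    // L ∪ M = { s | s ∈ L or s ∈ M }
    static Set<String> union(Set<String> l, Set<String> m) {
        Set<String> out = new TreeSet<>(l);
        out.addAll(m);
        return out;
    }

    // LM = { st | s ∈ L and t ∈ M }
    static Set<String> concat(Set<String> l, Set<String> m) {
        Set<String> out = new TreeSet<>();
        for (String s : l) {
            for (String t : m) {
                out.add(s + t);
            }
        }
        return out;
    }

    // L^0 = {ε}, L^i = L^(i-1) L; the empty string "" plays the role of ε.
    static Set<String> power(Set<String> l, int i) {
        Set<String> out = new TreeSet<>(Set.of(""));
        for (int k = 0; k < i; k++) {
            out = concat(out, l);
        }
        return out;
    }

    // Finite approximation of L*: L^0 ∪ L^1 ∪ ... ∪ L^maxPower (hypothetical helper).
    static Set<String> closureUpTo(Set<String> l, int maxPower) {
        Set<String> out = new TreeSet<>();
        for (int i = 0; i <= maxPower; i++) {
            out.addAll(power(l, i));
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> letters = Set.of("a", "b");   // small stand-in for L
        Set<String> digits = Set.of("0", "1");    // small stand-in for D
        System.out.println(union(letters, digits));
        System.out.println(concat(letters, digits));
        System.out.println(closureUpTo(letters, 2));
        System.out.println(concat(letters, closureUpTo(union(letters, digits), 1)));
    }
}
```

The last line of main corresponds to L(L ∪ D)* truncated after one repetition, so it lists only identifiers of length one or two.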
Regular Expressions (REs) over Alphabet ∑
• Inductive base:
  1. ε is a RE, denoting the regular language (RL) {ε}
  2. a ∈ ∑ is a RE, denoting the RL {a}
• Inductive step: suppose r and s are REs, denoting the languages L(r) and L(s). Then
  3. (r)|(s) is a RE, denoting the RL L(r) ∪ L(s)
  4. (r)(s) is a RE, denoting the RL L(r)L(s)
  5. (r)* is a RE, denoting the RL (L(r))*
  6. (r) is a RE, denoting the RL L(r)
Precedence and Associativity
• Precedence:
  – “*” has the highest precedence
  – concatenation has the second highest precedence
  – “|” has the lowest precedence
• Associativity:
  – all are left-associative
E.g.: (a)|((b)*(c)) = a|b*c
Unnecessary parentheses can be removed
Example
• ∑ = {a, b}
1. a|b denotes {a,b}
2. (a|b)(a|b) denotes {aa,ab,ba,bb}
3. a* denotes {ε, a, aa, aaa, aaaa, …}
4. (a|b)* denotes ?
5. a|a*b denotes ?
Notational Shorthands
• One or more instances +: r+ = rr*
  – denotes the language (L(r))+
  – has the same precedence and associativity as *
• Zero or one instance ?: r? = r|ε
  – denotes the language L(r) ∪ {ε}
• Character classes
  – [abc] denotes a|b|c
  – [A-Z] denotes A|B|…|Z
  – [a-zA-Z_][a-zA-Z0-9_]* denotes ?
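These shorthands are also available in java.util.regex, so the last pattern can be checked by experiment. A minimal sketch: ID is the character-class pattern from the slide, while NUM is a made-up pattern that only illustrates + and ? and is not taken from the slides. The class and method names are likewise assumptions.

```java
import java.util.regex.Pattern;

final class ShorthandDemo {
    // The character-class pattern from the slide.
    static final Pattern ID = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");

    // Hypothetical pattern showing r+ and r?: one or more digits,
    // optionally followed by a dot and one or more digits.
    static final Pattern NUM = Pattern.compile("[0-9]+(\\.[0-9]+)?");

    // matches() succeeds only if the whole string belongs to the language.
    static boolean isId(String s) {
        return ID.matcher(s).matches();
    }

    static boolean isNum(String s) {
        return NUM.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"count", "x2", "_tmp", "2x", ""}) {
            System.out.println("\"" + s + "\" id?  " + isId(s));
        }
        for (String s : new String[] {"25", "3.14", "3.", ".5"}) {
            System.out.println("\"" + s + "\" num? " + isNum(s));
        }
    }
}
```

Running it suggests that the pattern accepts strings that start with a letter or underscore and continue with letters, digits or underscores, which is the usual shape of an identifier.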
Nondeterministic finite automata
• A nondeterministic finite automaton (NFA) is a mathematical model that consists of
  – a finite set of states S
  – a set of input symbols ∑
  – a transition function move: S × (∑ ∪ {ε}) → 2^S, mapping a state and an input symbol (or ε) to a set of states
  – a start state s0
  – a finite set of final or accepting states F
Transition graph
[Figure: legend for transition graphs, showing how a state, a transition (an edge from A to B labeled a), the start state and a final state are drawn.]
Transition table
State    Input symbol
         a        b
0        {0,1}    {0}
1        -        {2}
2        -        {3}
Acceptance
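• An NFA accepts an input string x iff there is some path in the transition graph from the start state to some accepting state such that the edge labels along this path spell out x.
[Figure: transition graph with states A and B; A goes to B on 0, and B goes back to A on 1.]
Input 01010: A -0-> B -1-> A -0-> B -1-> A -0-> B
Input 01011: A -0-> B -1-> A -0-> B -1-> A -1-> ? (error)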
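Acceptance can be checked mechanically by tracking the set of states the NFA could be in after each input symbol. The sketch below uses the transition table with states 0 to 3 shown earlier. The slides do not say which state is the start state or which is accepting, so state 0 as start and state 3 as the only accepting state are assumptions (they make the table the textbook NFA for (a|b)*abb). The class name is also made up.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class NfaSimulation {
    // MOVE.get(state).get(symbol) is the set of possible next states.
    static final Map<Integer, Map<Character, Set<Integer>>> MOVE = new HashMap<>();
    static {
        MOVE.put(0, Map.of('a', Set.of(0, 1), 'b', Set.of(0)));
        MOVE.put(1, Map.of('b', Set.of(2)));
        MOVE.put(2, Map.of('b', Set.of(3)));
    }
    static final int START = 0;                        // assumption, not stated on the slide
    static final Set<Integer> ACCEPTING = Set.of(3);   // assumption, not stated on the slide

    // True iff some path from START spelling out x ends in an accepting state.
    static boolean accepts(String x) {
        Set<Integer> current = new TreeSet<>();
        current.add(START);
        for (char c : x.toCharArray()) {
            Set<Integer> next = new TreeSet<>();
            for (int s : current) {
                Map<Character, Set<Integer>> row = MOVE.get(s);
                if (row != null && row.containsKey(c)) {
                    next.addAll(row.get(c));
                }
            }
            current = next;   // an empty set means every path has failed
        }
        for (int s : current) {
            if (ACCEPTING.contains(s)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (String x : new String[] {"abb", "aabb", "babb", "ab", "abba"}) {
            System.out.println(x + " -> " + accepts(x));
        }
    }
}
```

This table has no ε-transitions, so no ε-closure step is needed here.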
Deterministic finite automata
• A deterministic finite automaton (DFA) is a special case of NFA in which
1. no state has an ε-transition, and
2. for each state s and input symbol a, there is at most one edge labeled a leaving s.
Thompson’s construction of NFA from REs
• guided by the syntactic structure of the RE r
• For ε, construct the NFA: start state i -ε-> final state f
• For a in ∑, construct the NFA: start state i -a-> final state f
Thompson’s construction (cont’d)
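• Suppose N(s) and N(t) are NFAs for REs s and t
  – For s|t: add a new start state i with ε-transitions to the start states of N(s) and N(t), and ε-transitions from their final states to a new final state f
  – For st: merge the final state of N(s) with the start state of N(t); i is the start state of N(s) and f is the final state of N(t)
  – For s*: add new states i and f, with ε-transitions from i to the start state of N(s) and to f, and from the final state of N(s) back to its start state and on to f
  – For (s), use N(s) itself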
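The construction rules can be written as one small method per rule, each returning a fragment with a single start state and a single final state. This is a sketch under assumptions: all names are made up, ε-edges are encoded with the label -1, and concat links the two fragments with an ε-edge instead of merging two states, a common simplification that accepts the same language.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class Thompson {
    static final int EPS = -1;                     // label of an ε-edge (assumed encoding)

    static final class Frag {                      // N(r): start state i and final state f
        final int start, accept;
        Frag(int start, int accept) { this.start = start; this.accept = accept; }
    }

    private final List<int[]> edges = new ArrayList<>();   // each edge is {from, label, to}
    private int stateCount = 0;

    private Frag fresh() { return new Frag(stateCount++, stateCount++); }
    private void edge(int from, int label, int to) { edges.add(new int[] {from, label, to}); }

    Frag epsilon() { Frag f = fresh(); edge(f.start, EPS, f.accept); return f; }
    Frag symbol(char a) { Frag f = fresh(); edge(f.start, a, f.accept); return f; }

    Frag union(Frag s, Frag t) {                   // s|t
        Frag f = fresh();
        edge(f.start, EPS, s.start);
        edge(f.start, EPS, t.start);
        edge(s.accept, EPS, f.accept);
        edge(t.accept, EPS, f.accept);
        return f;
    }

    Frag concat(Frag s, Frag t) {                  // st, linked by an ε-edge
        edge(s.accept, EPS, t.start);
        return new Frag(s.start, t.accept);
    }

    Frag star(Frag s) {                            // s*
        Frag f = fresh();
        edge(f.start, EPS, s.start);
        edge(f.start, EPS, f.accept);
        edge(s.accept, EPS, s.start);
        edge(s.accept, EPS, f.accept);
        return f;
    }

    // All states reachable from the given states on ε-edges alone.
    Set<Integer> closure(Set<Integer> states) {
        Set<Integer> out = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int[] e : edges) {
                if (e[0] == s && e[1] == EPS && out.add(e[2])) work.push(e[2]);
            }
        }
        return out;
    }

    boolean accepts(Frag n, String x) {
        Set<Integer> cur = closure(Set.of(n.start));
        for (char c : x.toCharArray()) {
            Set<Integer> next = new TreeSet<>();
            for (int[] e : edges) {
                if (e[1] == c && cur.contains(e[0])) next.add(e[2]);
            }
            cur = closure(next);
        }
        return cur.contains(n.accept);
    }

    // Builds the NFA for (a|b)*abb and runs it on x.
    static boolean matchesExample(String x) {
        Thompson nfa = new Thompson();
        Frag r = nfa.star(nfa.union(nfa.symbol('a'), nfa.symbol('b')));
        r = nfa.concat(r, nfa.symbol('a'));
        r = nfa.concat(r, nfa.symbol('b'));
        r = nfa.concat(r, nfa.symbol('b'));
        return nfa.accepts(r, x);
    }

    public static void main(String[] args) {
        for (String x : new String[] {"abb", "babb", "ab", ""}) {
            System.out.println("\"" + x + "\" -> " + matchesExample(x));
        }
    }
}
```

Because every rule adds at most two states, the resulting NFA has at most twice as many states as the RE has symbols and operators.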
Subset construction
Operation       Description
ε-closure(s)    Set of NFA states reachable from state s on ε-transitions alone
ε-closure(T)    Set of NFA states reachable from some state s in T on ε-transitions alone
move(T, a)      Set of NFA states to which there is a transition on input a from some state s in T
• s : an NFA state
• T : a set of NFA states
Subset construction (cont’d)
Let s0 be the start state of the NFA;
Dstates initially contains only the unmarked state ε-closure(s0);
while there is an unmarked state T in Dstates do begin
    mark T;
    for each input symbol a do begin
        U := ε-closure(move(T, a));
        if U is not in Dstates then
            add U as an unmarked state to Dstates;
        DTran[T, a] := U;
    end;
end;
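The loop above translates almost line by line into Java when a DFA state is represented as a set of NFA state numbers. This is a sketch, not the textbook's code: the class and method names are invented, ε-edges use the label -1, and the worklist of unmarked states is a queue. The demo NFA is the one from the transition table with states 0 to 3, with state 0 assumed to be the start state.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SubsetConstruction {
    static final int EPS = -1;                               // label of an ε-edge (assumed encoding)
    private final List<int[]> edges = new ArrayList<>();     // each edge is {from, label, to}

    void edge(int from, int label, int to) { edges.add(new int[] {from, label, to}); }

    // ε-closure(T): states reachable from some state in T on ε-transitions alone.
    Set<Integer> epsClosure(Set<Integer> t) {
        Set<Integer> out = new TreeSet<>(t);
        Deque<Integer> work = new ArrayDeque<>(t);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int[] e : edges) {
                if (e[0] == s && e[1] == EPS && out.add(e[2])) work.push(e[2]);
            }
        }
        return out;
    }

    // move(T, a): states reachable from some state in T by one transition on a.
    Set<Integer> move(Set<Integer> t, char a) {
        Set<Integer> out = new TreeSet<>();
        for (int[] e : edges) {
            if (e[1] == a && t.contains(e[0])) out.add(e[2]);
        }
        return out;
    }

    // Returns DTran; its key set is Dstates and its first key is the DFA start state.
    Map<Set<Integer>, Map<Character, Set<Integer>>> build(int s0, char[] alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dtran = new LinkedHashMap<>();
        Deque<Set<Integer>> unmarked = new ArrayDeque<>();
        Set<Integer> start = epsClosure(Set.of(s0));
        dtran.put(start, new LinkedHashMap<>());
        unmarked.add(start);
        while (!unmarked.isEmpty()) {
            Set<Integer> t = unmarked.remove();              // mark T
            for (char a : alphabet) {
                Set<Integer> u = epsClosure(move(t, a));
                if (!dtran.containsKey(u)) {                 // U is not in Dstates
                    dtran.put(u, new LinkedHashMap<>());
                    unmarked.add(u);
                }
                dtran.get(t).put(a, u);                      // DTran[T, a] := U
            }
        }
        return dtran;
    }

    // Demo on the NFA from the transition table (start state 0 is an assumption).
    static Map<Set<Integer>, Map<Character, Set<Integer>>> demo() {
        SubsetConstruction nfa = new SubsetConstruction();
        nfa.edge(0, 'a', 0);
        nfa.edge(0, 'a', 1);
        nfa.edge(0, 'b', 0);
        nfa.edge(1, 'b', 2);
        nfa.edge(2, 'b', 3);
        return nfa.build(0, new char[] {'a', 'b'});
    }

    public static void main(String[] args) {
        demo().forEach((state, row) -> System.out.println(state + " -> " + row));
    }
}
```

As in the pseudocode, an empty U is treated as an ordinary (dead) DFA state; the demo NFA never produces one because state 0 loops on both symbols.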
DFA
• Let (∑, S, T, F, s0) be the original NFA. The DFA is:
• The alphabet: ∑
• The states: all states in Dstates
• The transitions: DTran
• The accepting states: all states in Dstates containing at least one accepting state in F of the NFA
• The start state: ε-closure(s0)
Minimise a DFA
Initially, create two states (groups of original DFA states):
1. one is the set of all final states: F
2. the other is the set of all non-final states: S - F
while (more splits are possible) {
    Let G = {s1, …, sn} be a state and c be any char in ∑
    Let t1, …, tn be the successor states of s1, …, sn under c
    if (t1, …, tn don't all belong to the same state) {
        Split G into new states so that si and sj remain in the same state iff ti and tj are in the same state
    }
}
Example
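[Figure: DFA over {a, b} with states A, B, C, D and E; E is the only final state.]
Step 1: {A,B,C,D} {E}
  For a, the successors of {A,B,C,D} are {B,B,B,B}
  For b, the successors of {A,B,C,D} are {C,D,C,E}
  Split: {A,B,C} {D} {E}
Step 2:
  For b, the successors of {A,B,C} are {C,D,C}
  Split: {A,C} {B} {D} {E}
Step 3:
  For a, the successors of {A,C} are {B,B}
  For b, the successors of {A,C} are {C,C}
  Terminate
[Figure: the minimal DFA with states A, B, D and E, where A now stands for the merged group {A,C}.]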
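The splitting procedure can be sketched in Java as repeated refinement of a group number per state. Two things differ from the slide and are assumptions of this sketch: it splits on all input symbols at once in each round (the final partition is the same), and the transitions of state E (to B on a, to C on b) are taken from the textbook DFA for (a|b)*abb because they cannot be read off the steps above. States A to E are numbered 0 to 4 and symbols a, b are columns 0, 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DfaMinimizer {
    // EXAMPLE[s][c] = successor of state s on symbol c (A..E = 0..4, a = 0, b = 1).
    static final int[][] EXAMPLE = {
        {1, 2},   // A: a -> B, b -> C
        {1, 3},   // B: a -> B, b -> D
        {1, 2},   // C: a -> B, b -> C
        {1, 4},   // D: a -> B, b -> E
        {1, 2},   // E: a -> B, b -> C (assumed, see lead-in)
    };
    static final boolean[] EXAMPLE_FINAL = {false, false, false, false, true};

    // Returns group[s] = number of the group (state of the minimal DFA) containing s.
    static int[] minimise(int[][] trans, boolean[] isFinal) {
        int n = trans.length;
        int[] group = new int[n];
        for (int s = 0; s < n; s++) {
            group[s] = isFinal[s] ? 1 : 0;               // initial split: S - F and F
        }
        int count = (int) Arrays.stream(group).distinct().count();
        while (true) {
            // Two states stay together iff they are in the same group now and
            // their successors on every symbol are in the same groups.
            Map<List<Integer>, Integer> ids = new LinkedHashMap<>();
            int[] next = new int[n];
            for (int s = 0; s < n; s++) {
                List<Integer> signature = new ArrayList<>();
                signature.add(group[s]);
                for (int t : trans[s]) {
                    signature.add(group[t]);
                }
                Integer id = ids.get(signature);
                if (id == null) {
                    id = ids.size();
                    ids.put(signature, id);
                }
                next[s] = id;
            }
            if (ids.size() == count) {
                return next;                              // no group was split: done
            }
            group = next;
            count = ids.size();
        }
    }

    public static void main(String[] args) {
        int[] group = minimise(EXAMPLE, EXAMPLE_FINAL);
        for (int s = 0; s < group.length; s++) {
            System.out.println((char) ('A' + s) + " is in group " + group[s]);
        }
    }
}
```

On the example, A and C end up in the same group and B, D and E each form their own group, which agrees with the partition {A,C} {B} {D} {E} reached in Step 2.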
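Input Buffering
[Figure: an input buffer split into two halves and holding "begin…"; the scanner reads it through the forward pointer, and eof marks the end of the input.]
if (forward at end of first half) {
    reload second half
    forward++
} else if (forward at end of second half) {
    reload first half
    forward = 0
} else
    forward++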
Input Buffering (with sentinels)
[Figure: the same two-half buffer with an eof sentinel at the end of each half, in addition to the eof that marks the end of the input.]
forward = forward + 1
if (forward↑ == eof) {
    if (forward at end of first half) {
        reload second half
        forward++
    } else if (forward at end of second half) {
        reload first half
        forward = 0
    } else
        terminate the analysis
}
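The sentinel scheme needs only one comparison per character in the common case. The following Java sketch shows one way to realise it. Several details are assumptions and not taken from the slides: the character '\0' plays the role of eof and must not occur in the source text, the buffer layout is half 1, sentinel, half 2, sentinel, and the lexeme-beginning pointer is left out. All class and method names are invented.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class SentinelBuffer {
    static final char EOF = '\0';      // assumed sentinel value
    private final int n;               // size of each half (at least 1)
    private final char[] buf;          // [half 1: n chars][EOF][half 2: n chars][EOF]
    private final Reader in;
    private int forward = -1;
    private boolean done = false;

    SentinelBuffer(Reader in, int n) throws IOException {
        this.in = in;
        this.n = n;
        this.buf = new char[2 * n + 2];
        buf[n] = EOF;
        buf[2 * n + 1] = EOF;
        reload(0);
    }

    // Fills the half beginning at index start; if the input runs out,
    // an EOF is stored right after the last character read.
    private void reload(int start) throws IOException {
        int got = 0;
        while (got < n) {
            int r = in.read(buf, start + got, n - got);
            if (r < 0) break;
            got += r;
        }
        if (got < n) buf[start + got] = EOF;
    }

    // Returns the next input character, or EOF once the input is exhausted.
    char next() throws IOException {
        if (done) return EOF;
        forward = forward + 1;
        if (buf[forward] == EOF) {
            if (forward == n) {                    // end of first half
                reload(n + 1);
                forward++;
            } else if (forward == 2 * n + 1) {     // end of second half
                reload(0);
                forward = 0;
            } else {                               // a real end of input
                done = true;
                return EOF;
            }
            if (buf[forward] == EOF) done = true;  // input ended exactly at a half boundary
        }
        return buf[forward];
    }

    // Convenience helper for the demo: reads a whole string through the buffer.
    static String readAll(String text, int halfSize) {
        try {
            SentinelBuffer b = new SentinelBuffer(new StringReader(text), halfSize);
            StringBuilder out = new StringBuilder();
            for (char c = b.next(); c != EOF; c = b.next()) out.append(c);
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readAll("begin := 1", 4));
    }
}
```

A very small half size such as 4 is used in main only to force several reloads; a real scanner would use a half size of a disk block or more.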
Transition Diagrams
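relop → <= | < | <>
[Figure: transition diagram for relop. State 0 goes to state 1 on <. State 1 goes to state 2 on = (return(relop,LE)), to state 3 on > (return(relop,NE)), and to state 4 on any other character (return(relop,LT)).]
id → letter(letter|digit)*
[Figure: transition diagram for id. State 5 goes to state 6 on a letter. State 6 loops on a letter or digit and goes to state 7 on any other character (return(id,lexeme)).]
A transition diagram is a DFA in which no edge leaves a final state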
Implementation
Token nexttoken() {
    while (true) {
        switch (state) {
        case 0:                       // start of the relop diagram
            c = nextchar();
            if (c == '<') state = 1;
            else state = fail(0);
            break;
        case 1:                       // "<" has been read
            c = nextchar();
            if (c == '=') state = 2;
            else if (c == '>') state = 3;
            else state = 4;
            break;
        case 2:                       // "<="
            retract(0);
            return new Token(relop, "<=");
        case 3:                       // "<>"
            retract(0);
            return new Token(relop, "<>");
        case 4:                       // "<" followed by some other character
            retract(1);
            return new Token(relop, "<");
        case 5:                       // start of the id diagram
            c = nextchar();
            if (Character.isLetter(c)) state = 6;
            else state = fail(5);
            break;
        case 6:                       // inside an identifier
            c = nextchar();
            if (Character.isLetter(c) || Character.isDigit(c)) continue;
            else state = 7;
            break;
        case 7:                       // the identifier ended one character ago
            retract(1);
            return new Token(id, getLexeme());
        }
    }
}
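Implementation (cont’d)
int fail(int current_state) {
    forward = beginning;          // rescan the lexeme from its first character
    switch (current_state) {
    case 0: return 5;             // relop diagram failed: try the id diagram
    case 5: error();              // no diagram matches; error() is assumed not to return
    }
}
void retract(int flag) {
    if (flag == 1)
        move forward back         // give back the lookahead character
    get lexeme from beginning to forward
    move forward onward           // forward now points at the next unread character
    beginning = forward           // the next lexeme starts here
    state = 0                     // restart from the first diagram
}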
b│e│g│i│n│:│=│ │ │…
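For experimenting with the two transition diagrams, here is a small self-contained scanner in the same style. It is a sketch under several assumptions: tokens are returned as plain strings such as "relop:<=" instead of Token objects, the input is a String instead of a two-half buffer, blanks are skipped before each token, '\0' stands for end of input, and nextToken returns null at end of input or on a character that no diagram accepts. None of these names come from the slides.

```java
import java.util.ArrayList;
import java.util.List;

final class TinyScanner {
    private final String input;
    private int beginning = 0;   // start of the current lexeme
    private int forward = 0;     // position of the next character to read
    private int state = 0;

    TinyScanner(String input) { this.input = input; }

    // Always advances forward, so retract(1) works even at end of input.
    private char nextchar() {
        char c = forward < input.length() ? input.charAt(forward) : '\0';
        forward++;
        return c;
    }

    // Restart at the id diagram after the relop diagram has failed.
    private int fail() {
        forward = beginning;
        return 5;
    }

    // Gives back 'back' lookahead characters and returns the lexeme.
    private String retract(int back) {
        forward -= back;
        String lexeme = input.substring(beginning, forward);
        beginning = forward;
        state = 0;
        return lexeme;
    }

    String nextToken() {
        while (beginning < input.length() && input.charAt(beginning) == ' ') beginning++;
        forward = beginning;
        state = 0;
        while (true) {
            char c;
            switch (state) {
                case 0: c = nextchar(); state = (c == '<') ? 1 : fail(); break;
                case 1: c = nextchar(); state = (c == '=') ? 2 : (c == '>') ? 3 : 4; break;
                case 2:
                case 3: return "relop:" + retract(0);
                case 4: return "relop:" + retract(1);
                case 5:
                    c = nextchar();
                    if (Character.isLetter(c)) state = 6;
                    else return null;              // end of input or unknown character
                    break;
                case 6:
                    c = nextchar();
                    if (!Character.isLetterOrDigit(c)) state = 7;
                    break;
                case 7: return "id:" + retract(1);
                default: throw new IllegalStateException("unknown state " + state);
            }
        }
    }

    static List<String> tokenize(String text) {
        TinyScanner scanner = new TinyScanner(text);
        List<String> out = new ArrayList<>();
        for (String t = scanner.nextToken(); t != null; t = scanner.nextToken()) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("count <= x2 <> y < z"));
    }
}
```

States 4 and 7 are the ones reached on an "other" character, which is why they give one character back before returning the token.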