Lexical Analysis
Natawut Nupairoj, Ph.D.
Department of Computer Engineering, Chulalongkorn University
Outline
Overview. Token, Lexeme, and Pattern. Lexical Analysis Specification. Lexical Analysis Engine.
Front-End Components
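[Figure: the front-end pipeline. Source program (text stream) → Scanner → Parser → Semantic Analyzer → Intermediate Representation (file or in memory), with a Symbol Table beside the pipeline.]
Scanner: groups characters into tokens. The Parser asks for each one with next-token and receives a token.
Parser: constructs the parse tree.
Semantic Analyzer: checks semantic/contextual constraints on the parse-tree.
Ex: for the input m a i n ( ) { the Scanner returns identifier main, then symbol (.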
Tasks for Scanner
Read input and group tokens for the Parser.
Strip comments and white spaces.
Count line numbers.
Create an entry in the symbol table.
Perform preprocessing functions.
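As a rough illustration of the stripping and line-counting tasks, here is a minimal Python sketch. The `//` end-of-line comment syntax and the name `significant_chunks` are assumptions made for this example; the slides do not fix a comment syntax, and the sketch does not handle `//` inside string literals.

```python
def significant_chunks(source):
    """Return (line_number, chunk) pairs, skipping white space and // comments."""
    chunks = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        code = line.split("//", 1)[0]   # drop an end-of-line comment (assumed syntax)
        for chunk in code.split():      # split() discards blanks and tabs
            chunks.append((line_number, chunk))
    return chunks


program = "main() {   // entry point\n\n  count = 0\n}"
for line_number, chunk in significant_chunks(program):
    print(line_number, chunk)
```

The line number travels with each chunk, so a later phase can report errors against the original source position even though blank lines and comments are gone.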
Benefits
Simpler design: the parser does not have to worry about comments and white spaces.
More efficient scanner: the scanning process can be optimized on its own, using specialized buffering techniques.
Portability: the scanner handles standard symbols on different platforms.
Basic Terminology
Token: a set of strings. Ex: token = identifier
Lexeme: a sequence of characters in the source program matched by the pattern for a token. Ex: lexeme = counter
Pattern: a description of the strings that can belong to a particular token set. Ex: pattern = a letter followed by letters or digits
{A,…,Z,a,…,z}{A,…,Z,a,…,z,0,…,9}*
Token      Lexeme             Pattern
const      const              const
if         if                 if
relation   <, <=, …, >=       comparison symbols
id         counter, x, y      letter (letter | digit)*
num        12.53, 1.42E-10    any numeric constant
literal    “Hello World”      characters between “ and ”
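The relationship between the three terms can be sketched in Python: each token is named by a pattern, and a lexeme belongs to the first token whose pattern matches it in full. The function `classify`, the list `TOKEN_PATTERNS`, and the use of straight double quotes for literals are assumptions of this sketch; the comparison symbols are the six listed in the relop definition later in these slides.

```python
import re

# Patterns are tried in order, so the keywords come before the general id pattern.
TOKEN_PATTERNS = [
    ("const",    r"const"),
    ("if",       r"if"),
    ("relation", r"<=|>=|<>|<|>|="),
    ("id",       r"[A-Za-z][A-Za-z0-9]*"),
    ("num",      r"[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?"),
    ("literal",  r'"[^"]*"'),
]


def classify(lexeme):
    """Return the name of the first token whose pattern matches the whole lexeme."""
    for token, pattern in TOKEN_PATTERNS:
        if re.fullmatch(pattern, lexeme):
            return token
    return None


for lexeme in ["const", "if", "<=", "counter", "12.53", "1.42E-10", '"Hello World"']:
    print(f"{lexeme!r:16} -> {classify(lexeme)}")
```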
Language and Lexical Analysis
Fixed-format input, e.g. FORTRAN: the scanner must consider the alignment of a lexeme, which makes the input difficult to scan.
No reserved words, e.g. PL/I: is a word a keyword or an id? Telling them apart needs complex rules.
if if = then then then := else; else else := then;
Regular Expression Revisited
ε is a regular expression that denotes {ε}.
If a is a symbol in the alphabet, a is a regular expression that denotes {a}.
Suppose r and s are regular expressions:
(r)|(s) denotes L(r) ∪ L(s).
(r)(s) denotes L(r)L(s).
(r)* denotes (L(r))*.
Precedence of Operator
Level of precedence (highest first):
Kleene closure (*)
concatenation
union (|)
All operators are left associative. Ex: a*b | cd* = ((a*)b) | (c(d*))
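The precedence rule can be checked mechanically. Python's `re` module uses the same ordering (star, then concatenation, then `|`), so the sketch below compares the unparenthesized expression with its fully parenthesized form over every string up to a small length. The helper `language` and the contrasting pattern `regrouped` are hypothetical names introduced for this example.

```python
import re
from itertools import product

implicit = re.compile(r"a*b|cd*")
explicit = re.compile(r"((a*)b)|(c(d*))")
regrouped = re.compile(r"a*(b|c)d*")   # a different grouping, for contrast


def language(pattern, alphabet="abcd", max_length=4):
    """Return every string up to max_length that the pattern matches in full."""
    return {
        "".join(letters)
        for length in range(max_length + 1)
        for letters in product(alphabet, repeat=length)
        if pattern.fullmatch("".join(letters))
    }


print(language(implicit) == language(explicit))   # same strings accepted
print("bd" in language(regrouped), "bd" in language(implicit))
```

The second line shows why the grouping matters: a*(b|c)d* accepts bd, while a*b | cd* does not.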
Regular Definition
A sequence of definitions:
d1 → r1
d2 → r2
...
dn → rn
Each di is a distinct name, and each ri is a regular expression over:
Σ ∪ {d1, …, di-1}
Examples
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*
digits → digit digit*
opt_fraction → . digits | ε
opt_exponent → ( E ( + | - | ε ) digits ) | ε
num → digits opt_fraction opt_exponent
Notational Shorthands
One or more instances: r+ = rr*
Zero or one instance: r? = r | ε, so (rs)? = rs | ε
Character class: [A-Za-z] = A | B | … | Z | a | b | … | z
Examples
digit → [0-9]
digits → digit+
opt_fraction → ( . digits )?
opt_exponent → ( E ( + | - )? digits )?
num → digits opt_fraction opt_exponent
id → [A-Za-z][A-Za-z0-9]*
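A regular definition translates almost line for line into code by building each name from the ones before it. The Python sketch below does that with string interpolation; the variable names mirror the definitions above (`ident` stands in for id, which shadows a Python builtin), and `(?:...)` is Python's non-capturing group.

```python
import re

digit = r"[0-9]"
digits = rf"{digit}+"
opt_fraction = rf"(?:\.{digits})?"        # ( . digits )?
opt_exponent = rf"(?:E[+-]?{digits})?"    # ( E ( + | - )? digits )?
num = rf"{digits}{opt_fraction}{opt_exponent}"
ident = r"[A-Za-z][A-Za-z0-9]*"

for text in ["31", "31.02", "31.02E-15", "31.", ".5"]:
    print(text, bool(re.fullmatch(num, text)))

for text in ["counter", "x1", "1x"]:
    print(text, bool(re.fullmatch(ident, text)))
```

Because each definition only refers to earlier ones, the final pattern is an ordinary regular expression with every name expanded.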
Recognition of Tokens
Consider the tokens from the grammar: token, pattern, attribute.
Draw NFAs with retracting options.
Example : Grammar
stmt ::= if expr then stmt
| if expr then stmt else stmt
| expr
expr ::= term relop term
| term
term ::= id | num
Example : Regular Definition
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter (letter | digit)*
num → digit+ ( . digit+ )? ( E (+ | -)? digit+ )?
delim → blank | tab | newline
ws → delim+
Example: Pattern-Token-Attribute
Regular Expression   Token   Attribute-Value
ws                   -       -
if                   if      -
then                 then    -
else                 else    -
id                   id      Index in table
num                  num     Index in table
<                    relop   LT
<=                   relop   LE
=                    relop   EQ
<>                   relop   NE
...                  ...     ..
Attributes for Tokens
if count >= 0 then ...
<if, >
<id, index for count in symbol table>
<relop, GE>
<num, integer value 0>
<then, >
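The token/attribute pairs above can be reproduced with a small regex-driven scanner. This is a sketch under several assumptions: the names `SCANNER`, `RELOP_NAMES`, and `tokenize` are hypothetical, the symbol table is a plain Python list whose index starts at 0, keywords carry `None` as their empty attribute, and a num carries its value directly as in the example above.

```python
import re

RELOP_NAMES = {"<": "LT", "<=": "LE", "=": "EQ", "<>": "NE", ">": "GT", ">=": "GE"}

SCANNER = re.compile(r"""
    (?P<ws>\s+)
  | (?P<keyword>(?:if|then|else)\b)
  | (?P<id>[A-Za-z][A-Za-z0-9]*)
  | (?P<num>[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?)
  | (?P<relop><=|<>|<|>=|>|=)
""", re.VERBOSE)


def tokenize(text, symbol_table):
    """Return a list of (token, attribute) pairs for the given text."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = SCANNER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        pos = match.end()
        kind, lexeme = match.lastgroup, match.group()
        if kind == "ws":
            continue                                  # white space yields no token
        if kind == "keyword":
            tokens.append((lexeme, None))             # no attribute
        elif kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append(("id", symbol_table.index(lexeme)))
        elif kind == "num":
            value = int(lexeme) if lexeme.isdigit() else float(lexeme)
            tokens.append(("num", value))
        else:
            tokens.append(("relop", RELOP_NAMES[lexeme]))
    return tokens


table = []
for pair in tokenize("if count >= 0 then", table):
    print(pair)
```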
NFA – Lexical Analysis Engine
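[Figure: transition diagram for relop, states 0 to 8, starting at state 0. Paths and their actions:]
< then =        return(relop, LE)
< then >        return(relop, NE)
< then other    return(relop, LT) *
=               return(relop, EQ)
> then =        return(relop, GE)
> then other    return(relop, GT) *
* The state is reached on "other", so the input is retracted by one character.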
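A transition diagram like this one can be hand-coded as nested branches, one per state. The Python sketch below is one possible rendering; the function name `scan_relop` and the convention of returning the next input position are assumptions. Retraction on "other" is expressed by advancing only one character instead of two.

```python
def scan_relop(text, start=0):
    """Return ((token, attribute), next_position), or None if no relop starts here."""
    def char_at(i):
        # The empty string plays the role of "other" at end of input.
        return text[i] if i < len(text) else ""

    first = char_at(start)
    if first == "<":
        second = char_at(start + 1)
        if second == "=":
            return ("relop", "LE"), start + 2
        if second == ">":
            return ("relop", "NE"), start + 2
        return ("relop", "LT"), start + 1      # other: retract, consume only "<"
    if first == "=":
        return ("relop", "EQ"), start + 1
    if first == ">":
        if char_at(start + 1) == "=":
            return ("relop", "GE"), start + 2
        return ("relop", "GT"), start + 1      # other: retract, consume only ">"
    return None


for text in ["<=", "<>", "<5", "=", ">=", ">x"]:
    print(text, scan_relop(text))
```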
Handle Numbers
The pattern for a number contains options:
num → digit+ ( . digit+ )? ( E (+ | -)? digit+ )?
Ex: 31, 31.02, 31.02E-15
Always get the longest possible match: try the longest pattern first and, if it does not match, try the next possible pattern.
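The "longest first, then fall back" strategy can be sketched in Python with one pattern per form of number, tried from longest to shortest. Splitting num into exactly these three forms, and the names `NUMBER_FORMS` and `scan_number`, are assumptions of this sketch.

```python
import re

# Tried in order: with exponent, with fraction only, digits only.
NUMBER_FORMS = [
    re.compile(r"[0-9]+(?:\.[0-9]+)?E[+-]?[0-9]+"),
    re.compile(r"[0-9]+\.[0-9]+"),
    re.compile(r"[0-9]+"),
]


def scan_number(text, start=0):
    """Return (lexeme, next_position) for the longest number at start, or None."""
    for form in NUMBER_FORMS:
        match = form.match(text, start)
        if match:
            return match.group(), match.end()
    return None


for text in ["31", "31.02", "31.02E-15", "31.02E", "31.x"]:
    print(text, scan_number(text))
```

For the input 31.02E the exponent form fails because no digit follows E, so the scanner falls back to the fraction form and leaves E in the input.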
Handle Numbers
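[Figure: three transition diagrams for num. States 12 to 19 recognize a number with an exponent (edges: digit, ., E, + or -). States 20 to 24 recognize digits . digits. States 25 to 27 recognize digits only. Each diagram leaves on "other" to a final state marked *, which executes return(num, getnum()).]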
Handle Keywords
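Two approaches:
Encode keywords into an NFA (if, then, etc.): complex NFA (too many states).
Use the symbol table: simple, but requires some tricks.
[Figure: transition diagram for identifiers and keywords. State 9 moves to state 10 on letter; state 10 loops on letter or digit and moves to state 11 on other; state 11 is marked * and executes return(gettoken(), install_id()).]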
Handle Keywords
Symbol table contains both lexeme and token type.
Initialize symbol table with all keywords and corresponding token types.
lexeme: if token type: if
lexeme: then token type: then
lexeme: else token type: else
Handle Keywords
[Figure: Scanner, Parser, and Symbol Table. The initial Symbol Table has five slots:]
     Lexeme   Token Type   …
1    if       if           …
2    then     then         …
3    else     else         …
4
5
Handle Keywords: gettoken()
If id is not found in the table, return token type ID. Otherwise, return token type from the table.
Handle Keywords
[Figure: the Scanner reads the lexeme if from the source program (if count <=) and calls gettoken. The lexeme is found in entry 1 of the Symbol Table with token type if, so gettoken returns if.]
Handle Keywords: install_id()
If id is not found in the table, it is a new id. Insert the new id into the table and return a pointer to the new entry.
If id is found and its type is ID, return pointer to that entry.
Otherwise, it’s a keyword. Return 0.
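The two routines can be sketched together in Python. The class name `SymbolTable`, the dictionary `index_of`, and the use of integer indices as "pointers" are assumptions of this sketch; index 0 is left unused so that 0 can stand for "keyword, no entry", as in the rules above.

```python
class SymbolTable:
    def __init__(self, keywords):
        self.entries = [None]          # slot 0 unused; 0 signals a keyword
        self.index_of = {}
        for word in keywords:
            self._insert(word, word)   # a keyword's token type is the keyword itself

    def _insert(self, lexeme, token_type):
        self.entries.append((lexeme, token_type))
        self.index_of[lexeme] = len(self.entries) - 1
        return self.index_of[lexeme]

    def gettoken(self, lexeme):
        """Token type from the table, or "id" if the lexeme is not there yet."""
        index = self.index_of.get(lexeme)
        if index is None:
            return "id"
        return self.entries[index][1]

    def install_id(self, lexeme):
        """Index of the id's entry (inserting it if new), or 0 for a keyword."""
        index = self.index_of.get(lexeme)
        if index is None:
            return self._insert(lexeme, "id")
        if self.entries[index][1] == "id":
            return index
        return 0


table = SymbolTable(["if", "then", "else"])
for lexeme in ["if", "count", "count"]:
    # gettoken() runs before install_id(), as in return(gettoken(), install_id()).
    print(lexeme, (table.gettoken(lexeme), table.install_id(lexeme)))
```

With the three keywords preloaded in slots 1 to 3, the first new identifier lands in slot 4, and a second occurrence of the same identifier returns the same index.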
Handle Keywords
[Figure: the Scanner calls install_id for the lexeme if. The entry found in the Symbol Table is a keyword, so install_id returns 0, and the Parser's next-token request receives token if with attribute 0.]
Handle Keywords
[Figure: the Scanner reads the lexeme count from the source program (if count <=) and calls gettoken. The lexeme is not found in the Symbol Table, so gettoken returns id.]
Handle Keywords
[Figure: the Scanner calls install_id for the lexeme count. A new entry (lexeme count, token type id) is inserted in slot 4 of the Symbol Table, and the Parser's next-token request receives token id with attribute 4.]