Lexical Analysis Natawut Nupairoj, Ph.D. Department of Computer Engineering Chulalongkorn University

Page 1: Lexical Analysis Natawut Nupairoj, Ph.D. Department of Computer Engineering Chulalongkorn University.

Lexical Analysis

Natawut Nupairoj, Ph.D.

Department of Computer Engineering, Chulalongkorn University

Page 2:

Outline

Overview. Token, Lexeme, and Pattern. Lexical Analysis Specification. Lexical Analysis Engine.

Page 3:

Front-End Components

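Source program (text stream) → Scanner → Parser → Semantic Analyzer → Intermediate Representation (file or in memory)

Scanner: groups characters into tokens. The parser asks for next-token and the scanner returns a token.

Parser: constructs the parse tree.

Semantic Analyzer: checks semantic/contextual constraints on the parse tree.

Symbol Table: used by the front-end components.

Ex: the input characters m a i n ( ) { are grouped into identifier main, symbol (, and so on.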

Page 4:

Tasks for Scanner

Read input and group tokens for the parser.

Strip comments and white spaces.

Count line numbers.

Create entries in the symbol table.

Perform preprocessing functions.

Page 5:

Benefits

Simpler design: the parser does not have to worry about comments and white spaces.

More efficient scanner: optimize the scanning process only, and use specialized buffering techniques.

Portability: handle standard symbols on different platforms.

Page 6:

Basic Terminology

Token: a set of strings. Ex: token = identifier

Lexeme: a sequence of characters in the source program matched by the pattern for a token. Ex: lexeme = counter

Page 7:

Basic Terminology

Pattern: a description of the strings that can belong to a particular token set. Ex: pattern = a letter followed by letters or digits

{A,…,Z,a,…,z}{A,…,Z,a,…,z,0,…,9}*
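The identifier pattern above can be checked mechanically. The following is a minimal sketch using Python's standard `re` module; the names `ID_PATTERN` and `is_identifier` are illustrative and do not come from the slides.

```python
import re

# letter (letter | digit)*, written with character classes.
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_identifier(lexeme: str) -> bool:
    """Return True when the whole lexeme matches the identifier pattern."""
    return ID_PATTERN.fullmatch(lexeme) is not None


# A lexeme belongs to the token set "identifier" only if the pattern
# describes the entire string, not just a prefix of it.
for lexeme in ["counter", "x1", "1x", ""]:
    print(repr(lexeme), is_identifier(lexeme))
```

`fullmatch` is used rather than `match` so that a string such as `x1+` is not accepted on the strength of its prefix.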

Page 8:

Token      Lexeme             Pattern
const      const              const
if         if                 if
relation   <, <=, …, >=       comparison symbols
id         counter, x, y      letter (letter | digit)*
num        12.53, 1.42E-10    any numeric constant
literal    “Hello World”      characters between “ and ”

Page 9:

Language and Lexical Analysis

Fixed-format input, e.g. FORTRAN: the scanner must consider the alignment of a lexeme, which makes the input difficult to scan.

No reserved words, e.g. PL/I: is a word a keyword or an id? Deciding requires complex rules.

if if = then then then := else; else else := then;

Page 10:

Regular Expression Revisited

ε is a regular expression that denotes {ε}.

If a is a symbol in the alphabet, a is a regular expression that denotes {a}.

Suppose r and s are regular expressions:

(r)|(s) denotes L(r) ∪ L(s).

(r)(s) denotes L(r)L(s).

(r)* denotes (L(r))*.

Page 11:

Precedence of Operator

Level of precedence (highest to lowest):

Kleene closure (*)

concatenation

union (|)

All operators are left associative. Ex: a*b | cd* = ((a*)b) | (c(d*))
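As a quick check of the precedence rule, the sketch below compares the unparenthesized expression with its fully parenthesized form using Python's `re` module, which follows the same precedence for `*`, concatenation, and `|`. The names `IMPLICIT`, `EXPLICIT`, `WRONG`, and `same_language` are illustrative, and the comparison only covers a small hand-picked sample, so it is evidence rather than a proof of equivalence.

```python
import re

IMPLICIT = re.compile(r"a*b|cd*")            # relies on operator precedence
EXPLICIT = re.compile(r"((a*)b)|(c(d*))")    # grouping spelled out
WRONG = re.compile(r"a*(b|c)d*")             # what a different precedence would mean

SAMPLES = ["", "b", "aab", "c", "cddd", "ab", "bd", "acd", "abcd"]


def same_language(p, q, samples):
    """Return True if p and q accept exactly the same strings among samples."""
    return all(
        (p.fullmatch(s) is None) == (q.fullmatch(s) is None) for s in samples
    )


print(same_language(IMPLICIT, EXPLICIT, SAMPLES))
print(same_language(IMPLICIT, WRONG, SAMPLES))
```

The string `bd` is one that separates the two readings: it is rejected by `a*b | cd*` but accepted by `a*(b|c)d*`.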

Page 12:

Regular Definition

A sequence of definitions:

d1 → r1

d2 → r2

...

dn → rn

Each di is a distinct name, and each ri is a regular expression over Σ ∪ {d1, …, di-1}.

Page 13:

Examples

letter → A | B | … | Z | a | b | … | z

digit → 0 | 1 | … | 9

id → letter ( letter | digit )*

digits → digit digit*

opt_fraction → . digits | ε

opt_exponent → ( E ( + | - | ε ) digits ) | ε

num → digits opt_fraction opt_exponent

Page 14:

Notational Shorthands

One or more instances: r+ = rr*

Zero or one instance: r? = r | ε, so (rs)? = rs | ε

Character class: [A-Za-z] = A | B | … | Z | a | b | … | z

Page 15:

Examples

digit → [0-9]

digits → digit+

opt_fraction → ( . digits )?

opt_exponent → ( E ( + | - )? digits )?

num → digits opt_fraction opt_exponent

id → [A-Za-z][A-Za-z0-9]*
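A regular definition builds larger expressions from previously named ones. The sketch below mirrors that layering with Python string constants, one per definition; the constant names are illustrative, and `(?:…)` is Python's non-capturing group, used here in place of plain parentheses.

```python
import re

# Each name is defined only in terms of names defined before it,
# just as in a regular definition.
DIGIT = r"[0-9]"
DIGITS = rf"{DIGIT}+"
OPT_FRACTION = rf"(?:\.{DIGITS})?"
OPT_EXPONENT = rf"(?:E[+-]?{DIGITS})?"
NUM = re.compile(rf"{DIGITS}{OPT_FRACTION}{OPT_EXPONENT}")
ID = re.compile(r"[A-Za-z][A-Za-z0-9]*")

for lexeme in ["31", "31.02", "31.02E-15", "31.", ".5", "count"]:
    kind = "num" if NUM.fullmatch(lexeme) else "id" if ID.fullmatch(lexeme) else "no match"
    print(lexeme, kind)
```

Note that `31.` and `.5` are not numbers under this definition, because both the integer part and the digits after the point are required once the point is present.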

Page 16:

Recognition of Tokens

Consider the tokens from the grammar: for each one, identify the token, pattern, and attribute.

Draw NFAs with retracting options.

Page 17:

Example : Grammar

stmt ::= if expr then stmt

| if expr then stmt else stmt

| expr

expr ::= term relop term

| term

term ::= id | num

Page 18:

Example : Regular Definition

if → if

then → then

else → else

relop → < | <= | = | <> | > | >=

id → letter (letter | digit)*

num → digit+ ( . digit+ )? ( E (+ | -)? digit+ )?

delim → blank | tab | newline

ws → delim+

Page 19:

Example: Pattern-Token-Attribute

Regular Expression   Token   Attribute-Value
ws                   -       -
if                   if      -
then                 then    -
else                 else    -
id                   id      Index in table
num                  num     Index in table
<                    relop   LT
<=                   relop   LE
=                    relop   EQ
<>                   relop   NE
...                  ...     ..

Page 20:

Attributes for Tokens

if count >= 0 then ...

<if, >

<id, index for count in symbol table>

<relop, GE>

<num, integer value 0>

<then, >
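The token/attribute pairs above can be produced by a small pattern-driven scanner. The following is a rough sketch, not the slides' own engine: it uses Python's `re` module instead of hand-built NFAs, keeps keywords in a separate set rather than in the symbol table, and stores the numeric value as the attribute of num. All names (`TOKEN_SPEC`, `scan`, and so on) are illustrative.

```python
import re

# (token name, pattern) pairs. Python's alternation takes the first
# alternative that matches, not the longest, so two-character operators
# are listed before their one-character prefixes.
TOKEN_SPEC = [
    ("ws", r"[ \t\n]+"),
    ("num", r"[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("relop", r"<=|<>|>=|<|=|>"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))
KEYWORDS = {"if", "then", "else"}
RELOP_NAMES = {"<": "LT", "<=": "LE", "=": "EQ", "<>": "NE", ">": "GT", ">=": "GE"}


def scan(source):
    """Return (tokens, symbol_table) for source; tokens are (token, attribute) pairs."""
    symbol_table = []  # assumption: a plain list of identifier lexemes
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue  # white space is stripped, no token is produced
        if kind == "id" and lexeme in KEYWORDS:
            tokens.append((lexeme, None))  # keywords carry no attribute
        elif kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append(("id", symbol_table.index(lexeme)))
        elif kind == "num":
            value = int(lexeme) if lexeme.isdigit() else float(lexeme)
            tokens.append(("num", value))
        else:
            tokens.append(("relop", RELOP_NAMES[lexeme]))
    return tokens, symbol_table


tokens, table = scan("if count >= 0 then")
print(tokens)
print(table)
```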

Page 21:

NFA – Lexical Analysis Engine

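[Figure: transition diagram for relop with states 0 to 8. Edges are labeled <, =, >, and other. Accepting states return (relop, LE), (relop, NE), (relop, LT), (relop, EQ), (relop, GE), and (relop, GT). States marked * are reached on other and retract the input by one character.]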
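A transition diagram of this kind can be written directly as code, with one branch per edge. The sketch below is an illustrative hand-coded version in Python; the function name `scan_relop` and the return convention are assumptions, and retraction is modeled by simply not consuming the lookahead character.

```python
def scan_relop(text, pos=0):
    """Recognize a relational operator starting at text[pos].

    Returns ((token, attribute), next_pos), or None if no relop starts there.
    """

    def peek(i):
        # Empty string stands for end of input and behaves like "other".
        return text[i] if i < len(text) else ""

    first = peek(pos)
    if first == "<":
        second = peek(pos + 1)
        if second == "=":
            return ("relop", "LE"), pos + 2
        if second == ">":
            return ("relop", "NE"), pos + 2
        # "other": the lookahead is not part of the lexeme, so retract
        # by advancing only past the "<".
        return ("relop", "LT"), pos + 1
    if first == "=":
        return ("relop", "EQ"), pos + 1
    if first == ">":
        if peek(pos + 1) == "=":
            return ("relop", "GE"), pos + 2
        return ("relop", "GT"), pos + 1  # retract on "other"
    return None


for sample in ["<=", "<>", "<a", "=", ">=", ">", "x"]:
    print(sample, scan_relop(sample))
```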

Page 22:

Handle Numbers

The pattern for a number contains optional parts:

num → digit+ ( . digit+ )? ( E (+ | -)? digit+ )?

Ex: 31, 31.02, 31.02E-15

Always get the longest possible match: try the longest pattern first and, if it does not match, try the next possible pattern.
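The "longest first, then fall back" strategy can be sketched as an ordered list of patterns, from the form with an exponent down to a plain integer. This is an illustration in Python with assumed names (`NUM_PATTERNS`, `scan_num`); it uses the `re` module rather than explicit state machines.

```python
import re

# Ordered from the longest form to the shortest.
NUM_PATTERNS = [
    re.compile(r"[0-9]+(?:\.[0-9]+)?E[+-]?[0-9]+"),  # with exponent
    re.compile(r"[0-9]+\.[0-9]+"),                   # with fraction only
    re.compile(r"[0-9]+"),                           # integer
]


def scan_num(text, pos=0):
    """Return (lexeme, next_pos) for the number at text[pos], or None."""
    for pattern in NUM_PATTERNS:
        m = pattern.match(text, pos)
        if m:
            return m.group(), m.end()
        # No match: start over from pos with the next pattern.
    return None


for sample in ["31.02E-15;", "31.02E", "31.", "31E5", "abc"]:
    print(sample, scan_num(sample))
```

For the input `31.02E` the exponent pattern fails because no digits follow the `E`, so the scanner falls back to the fraction pattern and accepts `31.02`, leaving `E` for the next token.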

Page 23:

Handle Numbers

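[Figure: three transition diagrams for num. States 12 to 19 use edges labeled digit, ., E, and + or -; states 20 to 24 use edges labeled digit and .; states 25 to 27 use edges labeled digit. Each diagram ends on other in an accepting state marked *, which returns (num, getnum()).]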

Page 24:

Handle Keywords

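Two approaches:

Encode keywords into an NFA (if, then, etc.): this gives a complex NFA (too many states).

Use the symbol table: simple, but requires some tricks.

[Figure: transition diagram for identifiers and keywords with states 9, 10, and 11. Edges are labeled letter, letter or digit, and other. The accepting state is marked * and returns (gettoken(), install_id()).]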

Page 25:

Handle Keywords

Symbol table contains both lexeme and token type.

Initialize symbol table with all keywords and corresponding token types.

lexeme: if token type: if

lexeme: then token type: then

lexeme: else token type: else

Page 26:

Handle Keywords

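[Figure: Scanner, Parser, and Symbol Table. The symbol table has columns Lexeme and Token Type and entries numbered 1 to 5. Initially it holds the keywords if, then, and else, each with its own token type.]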

Page 27:

Handle Keywords

gettoken():

If id is not found in the table, return token type ID. Otherwise, return token type from the table.

Page 28:

Handle Keywords

[Figure: the source program (text stream) is i f c o u n t < =. The scanner reads i f and calls gettoken. The lexeme if is found in the symbol table, so the token type is if.]

Page 29:

Handle Keywords install_id():

If id is not found in the table, it’s a new id. INSERT NEW ID INTO TABLE and return pointer to the new entry.

If id is found and its type is ID, return pointer to that entry.

Otherwise, it’s a keyword. Return 0.
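The two routines can be sketched together over one table that is preloaded with the keywords. This is an illustrative Python version, not the slides' code: the class name `SymbolTable`, the use of a dictionary for lookup, and the choice to leave entry 0 unused (so that 0 can mean "keyword, no entry") are assumptions.

```python
class SymbolTable:
    """Symbol table holding (lexeme, token type) entries, keywords first."""

    def __init__(self, keywords):
        self.entries = [None]  # entry 0 is unused; 0 signals a keyword
        self.index = {}        # lexeme -> entry number
        for keyword in keywords:
            self._insert(keyword, keyword)  # a keyword's token type is itself

    def _insert(self, lexeme, token_type):
        self.entries.append((lexeme, token_type))
        self.index[lexeme] = len(self.entries) - 1
        return self.index[lexeme]

    def gettoken(self, lexeme):
        """Return "id" if the lexeme is not in the table, else its stored type."""
        entry = self.index.get(lexeme)
        return "id" if entry is None else self.entries[entry][1]

    def install_id(self, lexeme):
        """Return the entry number for an id, inserting it if new; 0 for keywords."""
        entry = self.index.get(lexeme)
        if entry is None:
            return self._insert(lexeme, "id")  # new id
        if self.entries[entry][1] == "id":
            return entry                       # id seen before
        return 0                               # keyword


table = SymbolTable(["if", "then", "else"])
for lexeme in ["if", "count", "count"]:
    print(lexeme, (table.gettoken(lexeme), table.install_id(lexeme)))
```

With the keywords occupying entries 1 to 3, the first new identifier lands in entry 4, and a keyword such as `if` yields the pair (if, 0).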

Page 30:

Handle Keywords

[Figure: the scanner calls install_id for the lexeme if. The lexeme is a keyword, so install_id returns 0, and the scanner passes the token if with attribute 0 to the parser.]

Page 31:

Handle Keywords

[Figure: the scanner reads c o u n t from the source program and calls gettoken. The lexeme count is not found in the symbol table, so the token type is id.]

Page 32:

Handle Keywords

[Figure: the scanner calls install_id for the lexeme count. It is a new id, so it is inserted into the symbol table as entry 4 with token type id, and the scanner passes the token id with attribute 4 to the parser.]