Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer...

21
Lexical Analysis The Scanner Scanner 1

Transcript of Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer...

Page 1: Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.

1

Lexical AnalysisThe Scanner

Scanner

Page 2: Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.

2

Introduction

• A scanner, sometimes called a lexical analyzer• A scanner :

– gets a stream of characters (source program)– divides it into tokens

• Tokens are units that are meaningful in the source language.

• Lexemes are strings which match the patterns of tokens.

Scanner

Page 3: Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.

3

Examples of Tokens in C

Scanner

Tokens Lexemes identifier Age, grade,Temp, zone, q1

number 3.1416, -498127,987.76412097

string “A cat sat on a mat.”, “90183654”

open parentheses (

close parentheses )

Semicolon ;

reserved word if IF, if, If, iF

Page 4: Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.

4

Scanning

• When a token is found:– It is passed to the next phase of compiler. – Sometimes values associated with the token, called

attributes, need to be calculated.– Some tokens, together with their attributes, must

be stored in the symbol/literal table.• it is necessary to check if the token is already in the table

• Examples of attributes– Attributes of a variable are name, address, type, etc.– An attribute of a numeric constant is its value.

Scanner

Page 5: Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.

5

How to construct a scanner

• Define tokens in the source language.• Describe the patterns allowed for tokens.• Write regular expressions describing the patterns.• Construct an FA for each pattern.• Combine all FA’s which results in an NFA.• Convert NFA into DFA• Write a program simulating the DFA.

Scanner

Page 6: Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.

6

Regular Expression

• a character or symbol in the alphabet• an empty string• an empty set• if r and s are regular expressions

– r | s– r s– r * – (r )

Scanner

fl

Page 7: Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.

7

Extension of regular expr.

• [a-z] – any character in a range from a to z

• .– any character

• r + – one or more repetition

• r ? – optional subexpression

• ~(a | b | c), [^abc]– any single character NOT in the set

Scanner

Page 8: Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.

8

Examples of Patterns

• (a | A) = the set {a, A}• [0-9]+ = (0 |1 |...| 9) (0 |1 |...| 9)*• (0-9)? = (0 | 1 |...| 9 | )• [A-Za-z] = (A |B |...| Z |a |b |...| z)• A . = the string with A following by any one

symbol• ~[0-9] = [^0123456789] = any character which

is not 0, 1, ..., 9

Scanner

l

Page 9: Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.

9

Describing Patterns of Tokens

• reservedIF = (IF| if| If| iF) = (I|i)(F|f)• letter = [a-zA-Z]• digit =[0-9]• identifier = letter (letter|digit)*• numeric = (+|-)? digit+ (. digit+)? (E (+|-)? digit+)?• Comments

– { (~})* } // from tiny C grammar– /* ([^*]*[^/]*)* */ // C-style comments– ;(~newline)* newline // Assembly lang comments

Scanner

Page 10: Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.

10

Disambiguating Rules

• IF is an identifier or a reserved word?– A reserved word cannot be used as identifier.– A keyword can also be identifier.

• <= is < and = or <=?– Principle of longest substring

• When a string can be either a single token or a sequence of tokens, single-token interpretation is preferred.

Scanner

Page 11: Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.

11

Nondeterministic Finite AutomataA nondeterministic finite automaton (NFA) is a

mathematical model that consists of1. A set of states S2. A set of input symbols S3. A transition function that maps state/symbol pairs to a

set of states: S x {S + e} set of S

4. A special state s0 called the start state5. A set of states F (subset of S) of final statesINPUT: stringOUTPUT: yes or no

Page 12: Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.

12

STATE a b e

0 0,1 0 3

1 2

2 3

3

Transition Table:

0 1 2 3

a,b

a b b

e

S = {0,1,2,3}S0 = 0S = {a,b}F = {3}

Example NFA

Page 13: Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.

13

NFA Execution

An NFA says ‘yes’ for an input string if there is some path from the start state to some final state where all input has been processed.

NFA(int s0, int input_element) { if (all input processed and s0 is a final state) return Yes; if (all input processed and s0 is not a final state) return No;

for all states s1 where transition(s0,table[input_element]) = s1

if (NFA(s1,input_element+1) = = Yes) return Yes;

for all states s1 where transition(s0,e) = s1

if (NFA(s1,input_element) = = Yes) return Yes; return No;}

Uses backtracking to search all possible paths

Page 14: Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.

14

Deterministic Finite AutomataA deterministic finite automaton (DFA) is a mathematical

model that consists of1. A set of states S2. A set of input symbols S3. A transition function that maps state/symbol pairs to a

state: S x S S

4. A special state s0 called the start state5. A set of states F (subset of S) of final statesINPUT: stringOUTPUT: yes or no

Page 15: Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.

15

FA Recognizing Tokens• Identifier• Numeric

• Comment

Scanner

~/

/ ** /

~*

E

digit digit

digitdigit

digit. E +,-,e

digit

+,-,e

letterletter,digit

Page 16: Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.

16

Examples from textbook

• Section 2.3• identifier = letter(letter|digit)*

Scanner

Page 17: Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.

17

Combining FA’s• Identifiers• Reserved words

• Combined

Scanner

I,i F,f E,e L,l S,s E,e

other letter letter,digit

E,e L,l S,s E,e

I,i F,f

letterletter,digit

Page 18: Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.

18

Lookahead

Scanner

I,i F,f

[other]

letter,digit Return ID

Return IF

Page 19: Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.

19

Implementing DFA

• nested-if• transition table

Scanner

letter,digit

E,e L,l S,s E,e

I,i F,f [other]

[other]

[other]

Return IF

Return ID

Return ELSE

Page 20: Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.

20

Nested IFswitch (state){ case 0: { if isletter(nxt) state=1; elseif isdigit(nxt) state=2; else state=3; break; } case 1: { if isletVdig(nxt) state=1; else state=4; break; } …}

Scanner

0

1

2

4

3le

tter

digit

other

letter,digit

other

Page 21: Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.

21

Transition table

Stch

0 1 2 3 …

letter 1 1 .. ..

digit 2 1 .. ..

… 3 4

..Scanner

0

1

2

4

3le

tter

digit

other

letter,digit

other