Lexical Analysis - Information Sciences...
Transcript of Lexical Analysis - Information Sciences...
1
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
Lexical Analysis
Introduction
Copyright 2010, Pedro C. Diniz, all rights reserved.Students enrolled in the Compilers class at the University of Southern Californiahave explicit permission to make copies of these materials for their personal use.
2
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
The Front End
The purpose of the front end is to deal with the input language• Perform a membership test: code ∈ source language?• Is the program well-formed (semantically) ?• Build an IR version of the code for the rest of the compiler
The front end is not monolithic
Sourcecode
FrontEnd
Errors
Machinecode
BackEnd
IR
3
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
The Front End
Scanner• Maps stream of characters into words
– Basic unit of syntax– x = x + y ; becomes <id,x> <eq,=> <id,x> <pl,+> <id,y> <sc,; >
• Characters that form a word are its lexeme• Its part of speech (or syntactic category) is called its token type• Scanner discards white space & (often) comments
Speed is an issue inscanning⇒ use a specializedrecognizer
Sourcecode Scanner
IRParser
Errors
tokens
4
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
The Front End
Parser• Checks stream of classified words (parts of speech) for
grammatical correctness• Determines if code is syntactically well-formed• Guides checking at deeper levels than syntax• Builds an IR representation of the code
We’ll come back to parsing in a couple of lectures
Sourcecode Scanner
IRParser
Errors
tokens
5
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
The Big Picture
• Language syntax is specified with parts of speech, notwords
• Syntax checking matches parts of speech against agrammar
1. goal → expr
2. expr → expr op term
3. | term4. term → number
5. | id6. op → +
7. | –
S = goal
T = { number, id, +, - }
N = { goal, expr, term, op }
P = { 1, 2, 3, 4, 5, 6, 7}
6
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
The Big Picture
No words here!Parts of speech,not words!
• Language syntax is specified with parts of speech, notwords
• Syntax checking matches parts of speech against agrammar
1. goal → expr
2. expr → expr op term
3. | term4. term → number
5. | id6. op → +
7. | –
S = goal
T = { number, id, +, - }
N = { goal, expr, term, op }
P = { 1, 2, 3, 4, 5, 6, 7}
2
7
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
The Big Picture
Why study Lexical Analysis?• We want to avoid writing scanners by hand• We want to harness the theory classes
• Goals:– To simplify specification & implementation of scanners– To understand the underlying techniques and technologies
Scanner
ScannerGenerator
specifications
source code parts of speech &words
tablesor code
Specifications written as“regular expressions”
Representwords asindices into aglobal table
8
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
What is a Lexical Analyzer?
• Example of Tokens• Operators = + - > ( { := == <>• Keywords if while for int double• Numeric literals 43 4.565 -3.6e10 0x13F3A• Character literals ‘a’ ‘~’ ‘\’’• String literals “4.565” “Fall 10” “\”\” = empty”
• Example of non-tokens• White space space(‘ ‘) tab(‘\t’) end-of-line(‘\n’)• Comments /*this is not a token*/
Source program text Tokens
9
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
10
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
11
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
12
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
for_key
3
13
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
for_key
14
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
for_key
15
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
for_key
16
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
for_key
17
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
for_key
18
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
ID(“var1”)for_key
4
19
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
ID(“var1”)for_key
20
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
ID(“var1”)for_key
21
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
ID(“var1”) eq_opfor_key
22
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
ID(“var1”) eq_opfor_key
23
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
ID(“var1”) eq_opfor_key
24
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
ID(“var1”) eq_opfor_key
5
25
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
ID(“var1”) eq_op Num(10)for_key
26
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
ID(“var1”) eq_op Num(10)for_key
27
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
ID(“var1”) eq_op Num(10)for_key
28
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
ID(“var1”) eq_op Num(10)for_key
29
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
for_key ID(“var1”) eq_op Num(10)
30
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
for ID(“var1”) eq_op Num(10)
6
31
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
for ID(“var1”) eq_op Num(10) ID(“var1”)
32
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
for ID(“var1”) eq_op Num(10) ID(“var1”)
33
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
for ID(“var1”) eq_op Num(10) ID(“var1”)
34
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
for ID(“var1”) eq_op Num(10) ID(“var1”) le_op
35
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
f o r v a r 1 = 1 0 v a r 1 < =
Lexical Analyzer in Action
for ID(“var1”) eq_op Num(10) ID(“var1”) le_op
36
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
• Partition Input Program Text into Subsequence ofCharacters Corresponding to Tokens
• Attach the Corresponding Attributes to the Tokens
• Eliminate White Space and Comments
Lexical Analyzer needs to…
7
37
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
Lexical Analysis: Basic Issues
• How to Precisely Match Strings to Tokens
• How to Implement a Lexical Analyzer
38
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
Regular ExpressionsLexical patterns form a regular language
*** any finite language is regular ***
Regular expressions (REs) describe regular languages
Regular Expression (over alphabet Σ)
• ε is a RE denoting the set {ε}
• If a is in Σ, then a is a RE denoting {a}
• If x and y are REs denoting L(x) and L(y) then– x |y is an RE denoting L(x) ∪ L(y)
– xy is an RE denoting L(x)L(y)– x* is an RE denoting L(x)*
Precedence :closure, thenconcatenation,then alternation
39
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
These definitions should be well known
Set Operations (review)
Operation Definition
Union of L and MWritten L ! M
L ! M = {s | s " L or s " M }
Concatenation of Land M
Written LM
LM = {st | s " L and t " M }
Kleene closure of LWritten L*
L* = !0#i#$ Li
Positive Closure of LWritten L+
L+ = !1#i#$ Li
40
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
Examples of Regular Expressions
Identifiers:Letter → (a|b|c| … |z|A|B|C| … |Z)Digit → (0|1|2| … |9)Identifier → Letter ( Letter | Digit )*
Numbers:Integer → (+|-|ε) (0| (1|2|3| … |9)(Digit *) )
Decimal → Integer . Digit *
Real → ( Integer | Decimal ) E (+|-|ε) Digit *
Complex → ( Real , Real )
Numbers can get much more complicated!
41
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
Regular Expressions (the point)
Regular expressions can be used to specify the words to be translated toparts of speech by a lexical analyzer
• Using results from automata theory and theory of algorithms,we can automatically build recognizers from regularexpressions
• Some of you may have seen this construction for stringpattern matching
⇒ We study REs and associated theory to automate scannerconstruction !
42
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
Consider the problem of recognizing Register namesRegister → r (0|1|2| … | 9) (0|1|2| … | 9)*
• Allows registers of arbitrary number• Requires at least one digit
RE corresponds to a recognizer (or DFA)
Transitions on other inputs go to an error state, se
Example
S0 S2 S1
r
(0|1|2| … 9)
accepting state
(0|1|2| … 9)
Recognizer for Register
8
43
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
DFA operation• Start in state S0 & take transitions on each input character• DFA accepts a word x iff x leaves it in a final state (S2 )
So,• r17 takes it through s0, s1, s2 and accepts• r takes it through s0, s1 and fails• a takes it straight to se
S0 S2 S1
r
(0|1|2| … 9)
accepting state
(0|1|2| … 9)
Recognizer for Register
Example (continued)
44
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
To be useful, recognizer must turn into code
sesesese
ses2ses2
ses2ses1
seses1s0
Allothers
0,1,2,3,4,5,6,7,8,9rδChar ← next character
State ← s0
while (Char ≠ EOF) State ← δ(State,Char) Char ← next character
if (State is a final state ) then report success else report failure
Skeleton recognizer Table encoding RE
Example (continued)
45
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
To be useful, recognizer must turn into code
seerror
seerror
seerror
se
seerror
s2add
seerror
s2
seerror
s2add
seerror
s1
seerror
seerror
s1start
s0
Allothers
0,1,2,3,4,5,6,7,8,9rδChar ← next character
State ← s0
while (Char ≠ EOF) State ← δ(State,Char) perform specified action Char ← next character
if (State is a final state ) then report success else report failure
Skeleton recognizer
Table encoding RE
Example (continued)
46
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
r Digit Digit* allows arbitrary numbers• Accepts r00000• Accepts r99999• What if we want to limit it to r0 through r31 ?
Write a tighter regular expression– Register → r ( (0|1|2) (Digit | ε) | (4|5|6|7|8|9) | (3|30|31) )– Register → r0|r1|r2| … |r31|r00|r01|r02| … |r09
Produces a more complex DFA
• Has more states• Same cost per transition• Same basic implementation
What about a Tighter Specification?
47
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
Tighter Register Specification (cont’d)
The DFA forRegister → r ( (0|1|2) (Digit | ε) | (4|5|6|7|8|9) | (3|30|31) )
• Accepts a more constrained set of registers
• Same set of actions, more states
S0 S5 S1
r
S4
S3
S6
S2
0,1,2
3 0,1
4,5,6,7,8,9
(0|1|2| … 9)
48
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
Tighter Register Specification (cont’d)
seseseseses1s0
sesesesesesese
seseseseseses6
seseseses6ses5
seseseseseses4
seseseseseses3
ses3s3s3s3ses2
ses4s5s2s2ses1
Allothers4-9320,1rδ
Table encoding RE for the tighter register specification
Runs in thesameskeletonrecognizer
9
49
Spring 2010CSCI 565 - Compiler Design
Pedro [email protected]
Summary
• The role of the Lexical Analyzer– Partition input stream into tokens
• Regular Expressions– Used to describe structure of tokens
• DFA: Deterministic Finite Automata– Machinery to recognize Regular Languages