EDAN65: Compilers, Lecture 02
Regular expressions and scanning
Görel Hedin. Revised: 2017-08-29
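Course overview
[Figure: the phases of a compiler and the data passed between them. Source code (text) → lexical analyzer (scanner) → tokens → syntactic analyzer (parser) → AST (abstract syntax tree) → semantic analyzer → attributed AST → intermediate code generator → intermediate code → optimizer → intermediate code → target code generator → target code. The scanner is specified with regular expressions, the parser with a context-free grammar, and the semantic analyzer with an attribute grammar. The figure also shows an interpreter, a virtual machine, and the machine with its runtime system: stack (activation records), heap (objects), code and data, and garbage collection. This lecture covers the scanner.]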
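Analyzing program text
[Figure: the program text "sum = sum + k" is scanned into the tokens ID EQ ID PLUS ID, which are then parsed into a parse tree with the nodes AssignStmt, Exp, and Add (with two Exp children). This lecture covers the step from program text to tokens.]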
Recall: Generating the compiler
[Figure: each analysis phase is generated from a specification. Regular expressions → scanner generator → lexical analyzer (scanner). Context-free grammar → parser generator → syntactic analyzer (parser). Attribute grammar → attribute evaluator generator → semantic analyzer. Data flow between the phases: text → tokens → tree.]
We will use a scanner generator called JFlex.
Some typical tokens

Kind                       Token    Example lexemes        Regular expression (JFlex syntax)
Reserved words (keywords)  IF       if                     "if"
                           THEN     then                   "then"
                           FOR      for                    "for"
Identifiers                ID       B  alpha  k10          [A-Za-z][A-Za-z0-9]*
Literals                   INT      123  0  99  2016       [0-9]+
                           FLOAT    3.1416  0.2            [0-9]+ "." [0-9]+
                           STRING   "Hello"  ""  "100%"    \" [^\"]* \"
                           CHAR     'A'  'c'  '%'          \' [^\'] \'
Operators                  PLUS     +                      "+"
                           INCR     ++                     "++"
                           NE       !=                     "!="
Separators                 SEMI     ;                      ";"
                           COMMA    ,                      ","
                           LPAREN   (                      "("
Formal languages
• An alphabet, Σ, is a set of symbols (nonempty and finite).
• A string is a sequence of symbols (each string is finite).
• A formal language, L, is a set of strings (can be infinite).
• We would like to have rules or algorithms for defining a language, that is, for deciding if a certain string over the alphabet belongs to the language or not.
Example: Languages over binary numbers
Suppose we have the alphabet Σ = {0, 1}.
Example languages:
• The set of all possible combinations of zeros and ones: L0 = {0, 1, 00, 01, 10, 11, 000, ...}
• All binary numbers without unnecessary leading zeros: L1 = {0, 1, 10, 11, 100, 101, 110, 111, 1000, ...}
• All binary numbers with two digits: L2 = {00, 01, 10, 11}
• ...
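To make "deciding if a string belongs to a language" concrete, here is a minimal Java sketch of membership tests for the binary-number languages L0, L1, and L2. The class and method names are made up for this illustration; they are not part of the course material.

```java
final class BinaryLanguages {
    /** L0: any nonempty string of zeros and ones. */
    static boolean inL0(String s) {
        if (s.isEmpty()) return false;
        for (char c : s.toCharArray()) {
            if (c != '0' && c != '1') return false;
        }
        return true;
    }

    /** L1: binary numbers without unnecessary leading zeros. */
    static boolean inL1(String s) {
        return inL0(s) && (s.length() == 1 || s.charAt(0) == '1');
    }

    /** L2: binary numbers with exactly two digits. */
    static boolean inL2(String s) {
        return inL0(s) && s.length() == 2;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"0", "10", "01", "000", "12"}) {
            System.out.println(s + ": L0=" + inL0(s) + " L1=" + inL1(s) + " L2=" + inL2(s));
        }
    }
}
```

Each method is an algorithm that decides one language: it answers yes or no for any input string.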
Example: Languages over UNICODE
Here, the alphabet Σ is the set of UNICODE characters.
Example languages:
• All possible Java keywords: {"class", "import", "public", ...}
• All possible lexemes corresponding to Java tokens.
• All possible lexemes corresponding to Java whitespace.
• All binary numbers
• ...
Example: Languages over Java tokens
Here, the alphabet Σ is the set of Java tokens.
Example languages:
• All syntactically correct Java programs
• All that are syntactically incorrect
• All that are compile-time correct
• All that terminate
  (But this language cannot be computed: termination is undecidable. It is not possible to construct an algorithm that decides, for any string, if it is a terminating program or not.)
• ...
Defining languages using rules
Increasingly powerful:
• Regular expressions (for tokens)
• Context-free grammars (for syntax trees)
• Attribute grammars (context-free grammar + extra rules for further restricting the language)
Regular expressions (core notation)

RE       read                is called
a        a                   symbol
M | N    M or N              alternative
M N      M followed by N     concatenation
ε        the empty string    epsilon
M*       zero or more M      repetition (Kleene star)
(M)

where a is a symbol in the alphabet (e.g., {0, 1} or UNICODE) and M and N are regular expressions.
Each regular expression defines a language over the alphabet (a set of strings that belong to the language).
Priorities: M | N P* means M | (N (P*))
Example: a | b c* means {a, b, bc, bcc, bccc, ...}
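The priority rule can be checked experimentally. The sketch below uses java.util.regex, whose syntax for symbols, alternation, concatenation, and Kleene star coincides with the core notation for this example. The class name CoreRegexDemo is only an illustration.

```java
import java.util.regex.Pattern;

final class CoreRegexDemo {
    // a | b c*  is read as  a | (b (c*)): repetition binds tightest, alternation loosest.
    static final Pattern LANGUAGE = Pattern.compile("a|bc*");

    /** True if the whole string belongs to the language defined by a | b c*. */
    static boolean inLanguage(String s) {
        return LANGUAGE.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a", "b", "bc", "bccc", "ac", "abc", ""}) {
            System.out.println("\"" + s + "\" -> " + inLanguage(s));
        }
    }
}
```

If alternation had bound tighter than concatenation, "ac" would belong to the language; with the priorities above it does not.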
Regular expressions (extended notation)

Core RE  read                is called
a        a                   symbol
M | N    M or N              alternative
M N      M followed by N     concatenation
ε        the empty string    epsilon
M*       zero or more M      repetition (Kleene star)
(M)

Extended RE   read                               means
M+            at least one ...                   M M*
M?            optional ...                       ε | M
[aou]         one of ... (a character class)     a | o | u
[a-zA-Z]      one of ... (a character class)     a | b | ... | z | A | B | ... | Z
[^0-9]        not ...                            one character, but not any one of those listed
              (Appel notation: ~[0-9])
"a+b"         the string ...                     a \+ b
Exercise
Write a regular expression that defines the language of all decimal numbers, like
  3.14  0.75  4711  0  ...
But not numbers lacking an integer part. And not numbers with a decimal point but lacking a fractional part. So not numbers like
  17.  .23  6.
Leading and trailing zeros are allowed. So the following are ok:
  007  008.00  0.0  1.700
a) Use the extended notation.
b) Then translate the expression to the core notation.
c) Then write an expression that disallows unnecessary leading zeros (in the extended notation).
Solution
a) [0-9]+ ("." [0-9]+)?
b) (0 | ... | 9) (0 | ... | 9)* (ε | ("." ((0 | ... | 9) (0 | ... | 9)*)))
c) (0 | [1-9] [0-9]*) ("." [0-9]+)?
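Solutions a) and c) can be tried out with java.util.regex. In this sketch the JFlex string "." is written as \. in Java regex syntax; the class and method names are assumptions for the example.

```java
import java.util.regex.Pattern;

final class DecimalNumbers {
    // Solution a): digits, optionally followed by a decimal point and more digits.
    static final Pattern A = Pattern.compile("[0-9]+(\\.[0-9]+)?");
    // Solution c): as a), but the integer part is either 0 or starts with a nonzero digit.
    static final Pattern C = Pattern.compile("(0|[1-9][0-9]*)(\\.[0-9]+)?");

    static boolean matchesA(String s) { return A.matcher(s).matches(); }
    static boolean matchesC(String s) { return C.matcher(s).matches(); }

    public static void main(String[] args) {
        for (String s : new String[] {"3.14", "007", "008.00", "17.", ".23", "0.75", "10.0"}) {
            System.out.println(s + ": a=" + matchesA(s) + " c=" + matchesC(s));
        }
    }
}
```

Note that c) still allows trailing zeros in the fractional part, as in 1.700; only leading zeros of the integer part are restricted.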
Escaped characters
Use backslash to escape metacharacters and non-printing control characters.

Metacharacters
\+
\*
\(
\)
\|
\\
...

Non-printing control characters
\n   newline
\r   return
\t   tab
\f   formfeed
...
Some typical tokens

Kind                       Name     Example lexemes       Regular expression (JFlex syntax)
Reserved words (keywords)  IF       if                    "if"
                           THEN     then                  "then"
                           FOR      for                   "for"
Identifiers                ID       B  alpha  k10         [A-Za-z]([A-Za-z0-9])*
Literals                   INT      123  0  99            [0-9]+
                           FLOAT    3.1416  0.2           [0-9]+ "." [0-9]+
                           CHAR     'A'  'c'              \' [^\'] \'
                           STRING   "Hello"  ""  "j"      \" [^\"]* \"
Operators                  PLUS     +                     "+"
                           INCR     ++                    "++"
                           NE       !=                    "!="
Separators                 SEMI     ;                     ";"
                           COMMA    ,                     ","
                           LPAREN   (                     "("
Some typical non-tokens

Non-token           Example lexemes                Regular expression (JFlex syntax)
WHITESPACE          blank, tab, newline, return    " " | \t | \n | \r
ENDOFLINECOMMENT    // comment                     "//" [^\n\r]* ([\n\r])?

Non-tokens are also recognized by the scanner, just like tokens. But they are not sent on to the parser.
(The newline/return ending an end-of-line comment is optional in order to allow a file to end with an end-of-line comment, without an extra newline/return.)
JFlex: A scanner generator
Generating a scanner for a language lang
[Figure: the scanner specification lang.jflex (with regular expressions) is given to the scanner generator jflex.jar, which produces LangScanner.java. LangScanner.java reads the characters of Program.lang and delivers tokens to LangParser.java.]
A JFlex specification

    package lang; // the generated scanner will belong to the package lang
    import lang.Token; // Our own class for tokens
    ...
    // ignore whitespace
    " " | \t | \n | \r | \f { /* ignore */ }
    // tokens
    "if"      { return new Token("IF"); }
    "="       { return new Token("ASSIGN"); }
    "<"       { return new Token("LT"); }
    "<="      { return new Token("LE"); }
    [a-zA-Z]+ { return new Token("ID", yytext()); }
    ...

Rules and lexical actions
Each rule has the form:
    regular-expression { lexical action }
The lexical action consists of arbitrary Java code. It is run when a regular expression is matched. The method yytext() returns the lexeme (the token value).
What rules are used when scanning "a<b"?

Ambiguities?
Are the token definitions in the JFlex specification above ambiguous?
• Which rules match "<="?
• Which rules match "if"?
• Which rules match "ifff"?
• Which rules match "xyz"?
Extra rules for resolving ambiguities
Longest match
If one rule can be used to match a token, but there is another rule that will match a longer token, the latter rule will be chosen. This way, the scanner will match the longest token possible.
Rule priority
If two rules can be used to match the same sequence of characters, the first one takes priority.
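The two disambiguation rules can be illustrated with a small hand-written Java sketch. It is not how JFlex works internally (JFlex generates a DFA); it simply tries every rule at the current position using java.util.regex and applies longest match and rule priority. The class name and the NAME(lexeme) output format are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RuleResolutionDemo {
    // {token name, regular expression} in specification order; earlier rules have higher priority.
    static final String[][] RULES = {
        {"IF", "if"}, {"ASSIGN", "="}, {"LT", "<"}, {"LE", "<="}, {"ID", "[a-zA-Z]+"}
    };

    /** Scans the input and returns tokens written as NAME(lexeme); whitespace is skipped. */
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++;
                continue;
            }
            String bestName = "ERROR";
            int bestLength = 0;
            for (String[] rule : RULES) {
                Matcher m = Pattern.compile(rule[1]).matcher(input).region(pos, input.length());
                // Longest match: only a strictly longer lexeme replaces the current best.
                // Rule priority: on equal length the earlier rule is kept.
                if (m.lookingAt() && m.end() - pos > bestLength) {
                    bestName = rule[0];
                    bestLength = m.end() - pos;
                }
            }
            if (bestLength == 0) bestLength = 1;   // no rule matched: one-character ERROR token
            tokens.add(bestName + "(" + input.substring(pos, pos + bestLength) + ")");
            pos += bestLength;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a<b", "<=", "if", "ifff", "xyz"}) {
            System.out.println(s + " -> " + scan(s));
        }
    }
}
```

Both IF and ID match "if" with the same length, so rule priority selects IF. For "ifff" the ID rule matches a longer lexeme, so longest match selects ID.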
Implementation of scanners
Observation:
Regular expressions are equivalent to finite automata (finite-state machines). (They can recognize the same class of formal languages: the regular languages.)
Overall approach:
• Translate each token regular expression to a finite automaton. Label the final state with the token.
• Merge all the automata.
• The resulting automaton will in general be nondeterministic.
• Translate the nondeterministic automaton to a deterministic automaton.
• Implement the deterministic automaton, either using switch statements or a table.
A scanner generator automates this process.
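Construct an automaton for each token regexp
[Figure: one automaton per token regular expression. Legend: state, transition (labeled with a symbol such as a), start state, final state.
• "if": the start state has an edge i to a second state, which has an edge f to a final state labeled IF.
• [0-9]+: the start state has an edge 0-9 to a final state labeled INT, which loops on 0-9.
• " " | \n | \t: the start state has edges " ", \n, and \t to final states labeled WHITESPACE.
• [a-zA-Z]+: the start state has an edge a-zA-Z to a final state labeled ID, which loops on a-zA-Z.]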
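Merge the start states of the automata
[Figure: the four automata joined in a single start state, with edges i then f to IF, 0-9 to INT (looping on 0-9), " ", \n, and \t to WHITESPACE, and a-zA-Z to ID (looping on a-zA-Z).]
Is the new automaton deterministic?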
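Deterministic finite automata
In a deterministic finite automaton each transition is uniquely determined by the input.
[Figure: three small automata, described in the list below.]
• State 1 has one edge labeled a to state 2 and another edge labeled a to state 3. Nondeterministic, since if we read a when in state 1, we don't know if we should go to state 2 or 3.
• State 1 has an edge labeled ε to state 2. Nondeterministic, since when we are in state 1, we don't know if we should stay there, or go to state 2 without reading any input. (Epsilon denotes the empty string.)
• State 1 has an edge labeled a to state 2 and an edge labeled b to state 3. Deterministic, since from state 1, the next input determines if we go to state 2 or 3.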
DFA versus NFA
Deterministic Finite Automaton (DFA)
A finite automaton is deterministic if
– all outgoing edges from any given state have disjoint character sets
– there are no epsilon edges
Can be implemented efficiently.

Non-deterministic Finite Automaton (NFA)
An NFA may have
– two outgoing edges with overlapping character sets
– epsilon edges

Every DFA is also an NFA. Every NFA can be translated to an equivalent DFA.
Translating an NFA to a DFA
Simulate the NFA
– keep track of a set of current NFA states
– follow ε edges to extend the current set (take the closure)
Construct the corresponding DFA
– Each such set of NFA states corresponds to one DFA state
– If any of the NFA states is final, the DFA state is also final, and is marked with the corresponding token.
– If there is more than one token to choose from, select the token that is defined first (rule priority).
(Minimize the DFA for efficiency)
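The construction can be sketched in Java as follows. The NFA encoded here is an assumed one for the two rules "if" (token IF) and [a-z]+ (token ID), with states 1 (start), 2, 3 (final, IF), and 4 (final, ID). The edge encoding, the class name, and the restriction of the alphabet to a-z are choices made for this illustration, and DFA minimization is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class SubsetConstruction {
    static final int EPS = -1;
    // NFA edges {from, lo, hi, to}: a transition on any character in [lo, hi]; lo == EPS marks an ε edge.
    static final int[][] EDGES = {
        {1, 'i', 'i', 2}, {2, 'f', 'f', 3}, {1, 'a', 'z', 4}, {4, 'a', 'z', 4}
    };
    // Final NFA states and their tokens, in rule priority order (IF is defined before ID).
    static final int[] FINAL_STATES = {3, 4};
    static final String[] TOKENS = {"IF", "ID"};

    /** Extends a set of NFA states with everything reachable over ε edges. */
    static TreeSet<Integer> closure(Set<Integer> states) {
        TreeSet<Integer> result = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int[] e : EDGES) {
                if (e[0] == s && e[1] == EPS && result.add(e[3])) work.push(e[3]);
            }
        }
        return result;
    }

    /** The set of NFA states reached from the given set when reading c. */
    static TreeSet<Integer> move(Set<Integer> states, char c) {
        Set<Integer> targets = new TreeSet<>();
        for (int[] e : EDGES) {
            if (states.contains(e[0]) && e[1] != EPS && e[1] <= c && c <= e[2]) targets.add(e[3]);
        }
        return closure(targets);
    }

    /** Token of a DFA state: the first-defined token among its final NFA states, or null. */
    static String token(Set<Integer> states) {
        for (int i = 0; i < FINAL_STATES.length; i++) {
            if (states.contains(FINAL_STATES[i])) return TOKENS[i];
        }
        return null;
    }

    /** Builds the DFA: each reachable set of NFA states maps to its outgoing edges. */
    static Map<Set<Integer>, Map<Character, Set<Integer>>> build() {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        work.add(closure(Set.of(1)));
        while (!work.isEmpty()) {
            Set<Integer> current = work.poll();
            if (dfa.containsKey(current)) continue;   // this DFA state is already expanded
            Map<Character, Set<Integer>> row = new TreeMap<>();
            dfa.put(current, row);
            for (char c = 'a'; c <= 'z'; c++) {
                Set<Integer> next = move(current, c);
                if (!next.isEmpty()) {
                    row.put(c, next);
                    work.add(next);
                }
            }
        }
        return dfa;
    }

    public static void main(String[] args) {
        build().forEach((state, row) ->
            System.out.println(state + " token=" + token(state) + " edges=" + row));
    }
}
```

For this NFA the construction yields four DFA states: {1}, {2, 4}, {3, 4}, and {4}. The state {3, 4} contains final states for both IF and ID and is marked IF because of rule priority.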
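Example
[Figure, NFA: state 1 (start) has an edge i to state 2, which has an edge f to state 3 (final, IF). State 1 also has an edge a-z to state 4 (final, ID), which loops on a-z.]
[Figure, DFA: state 1 (start) has an edge i to state {2,4} (final, ID) and an edge a-hj-z to state 4 (final, ID). State {2,4} has an edge f to state {3,4} (final, IF) and an edge a-eg-z to state 4. State {3,4} has an edge a-z to state 4, and state 4 loops on a-z.]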
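Error handling
[Figure: the DFA from the example, with its states renumbered 1 (start), 2 (final, ID), 3 (final, IF), and 4 (final, ID), extended with a dead state 0 labeled ERROR. Each of the states 1 to 4 has an edge labeled ^a-z to state 0.]
• Add a "dead state" (state 0), corresponding to erroneous input.
• Add transitions to the "dead state" for all erroneous input.
• Generate an "ERROR token" when the dead state is reached.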
Implementation alternatives for DFAs
Table-driven
– Represent the automaton by a table
– Additional table to keep track of final states and token kinds
– A global variable keeps track of the current state
Switch statements
– Each state is implemented as a switch statement
– Each case implements a state transition as a jump (to another switch statement)
– The current state is represented by the program counter.
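The switch-based alternative might look as follows in Java. Since Java has no goto, this sketch deviates from the description above: it keeps the current state in a variable and uses one switch over the state inside a loop, rather than encoding the state in the program counter. The DFA is an assumed one for IF, ID, and PLUS over lowercase letters and +, with state 0 as the dead state; the class and method names are illustrative.

```java
final class SwitchDfa {
    /** Returns the token kind (IF, ID, PLUS) if the whole string is one lexeme, otherwise ERROR. */
    static String classify(String s) {
        int state = 1;                               // start state
        for (int i = 0; i < s.length() && state != 0; i++) {
            char ch = s.charAt(i);
            boolean letter = ch >= 'a' && ch <= 'z';
            switch (state) {
                case 1:
                    if (ch == '+') state = 5;
                    else if (ch == 'i') state = 2;
                    else if (letter) state = 4;
                    else state = 0;
                    break;
                case 2:                              // has read "i"
                    state = ch == 'f' ? 3 : letter ? 4 : 0;
                    break;
                case 3:                              // has read "if"
                case 4:                              // has read some other identifier
                    state = letter ? 4 : 0;
                    break;
                default:                             // state 5 (PLUS) has no outgoing edges
                    state = 0;
            }
        }
        switch (state) {
            case 2:
            case 4:
                return "ID";
            case 3:
                return "IF";
            case 5:
                return "PLUS";
            default:
                return "ERROR";                      // dead state, or input ended in the start state
        }
    }

    public static void main(String[] args) {
        for (String s : new String[] {"if", "i", "iff", "+", "++", "x1"}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```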
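Table-driven implementation

state  ...  +  ...  a  ...  e  f  g  ...  h  i  j  ...  z  ...  final  kind
0       0   0   0   0   0   0  0  0   0   0  0  0   0   0   0   true   ERROR
1       0   5   0   4   4   4  4  4   4   4  2  4   4   4   0   false
2       0   0   0   4   4   4  3  4   4   4  4  4   4   4   0   true   ID
3       0   0   0   4   4   4  4  4   4   4  4  4   4   4   0   true   IF
4       0   0   0   4   4   4  4  4   4   4  4  4   4   4   0   true   ID
5       0   0   0   0   0   0  0  0   0   0  0  0   0   0   0   true   PLUS

[Figure: the DFA described by the table. State 1 (start) has an edge i to state 2 (final, ID), an edge a-hj-z to state 4 (final, ID), and an edge + to state 5 (final, PLUS). State 2 has an edge f to state 3 (final, IF) and an edge a-eg-z to state 4. State 3 has an edge a-z to state 4, and state 4 loops on a-z.]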
Scanner implementation, design
[Figure: the Parser calls the Scanner, the Scanner calls the File, and the Scanner returns Token objects.]
• Scanner: Token nextToken()
• File: char nextChar()
• Token: int kind(), String value()
Scanner implementation, sketch
Idea: Scan the next token by
• starting in the start state
• scanning characters until we reach a final state
• returning a new token

    Token nextToken() {
        int state = 1; // start state
        while (!isFinal[state]) {
            char ch = file.readChar();
            state = edges[state][ch]; // look up the next state in the transition table
        }
        return new Token(kind[state]);
    }

Needs to be extended with handling of:
• longest match
• end of file
• non-tokens (like whitespace)
• token values (like the identifier name)
Extend to longest match, design
[Figure: the Parser calls the Scanner, which reads from a PushbackFile placed in front of the File, and returns Token objects.]
• Scanner: Token nextToken()
• PushbackFile: char readChar(), void pushback(String)
• File: char readChar()
• Token: int kind(), String value()
Idea:
• When a token is matched, don't stop scanning.
• When the error state is reached, return the last token matched.
• Push read characters that are unused back into the file, so they can be scanned again.
• Use a PushbackFile to accomplish this.
Extend to handle longest match, sketch

    Token nextToken() {
        int state = 1;
        String str = "";
        int lastFinalState = 0;
        String lastTokenValue = "";
        while (state != 0) {
            char ch = pushbackfile.readChar();
            str = str + ch;
            state = edges[state][ch];
            // The error state 0 is marked final in the table, but it must not overwrite the last real match.
            if (state != 0 && isFinal[state]) {
                lastFinalState = state;
                lastTokenValue = str;
            }
        }
        pushbackfile.pushback(str.substring(lastTokenValue.length()));
        return new Token(kind[lastFinalState], lastTokenValue);
    }
    // In Java, StringBuilder would be more efficient

• When a token is matched (a final state reached), don't stop scanning.
• Keep track of the currently scanned string, str.
• Keep track of the latest matched token (lastFinalState, lastTokenValue).
• Continue scanning until we reach the error state.
• Restore the input stream using PushbackFile.
• Return the latest matched token.
• (or return the ERROR token if there was no latest matched token)
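A runnable variant of the longest-match sketch is given below. It is only one possible realization under several assumptions: java.io.PushbackReader stands in for the PushbackFile, tokens are returned as strings of the form KIND(lexeme) instead of Token objects, the transition table covers only ASCII and the tokens IF, ID, and PLUS, and an unmatched character is consumed as a one-character ERROR token so that the scanner always makes progress.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class TableScanner {
    // Token kind per DFA state (null = not final). State 0 is the dead state, state 1 the start state.
    static final String[] KIND = {"ERROR", null, "ID", "IF", "ID", "PLUS"};
    // EDGES[state][ch] is the next state; entries left at 0 lead to the dead state.
    static final int[][] EDGES = new int[6][128];
    static {
        for (char c = 'a'; c <= 'z'; c++) {
            for (int state = 1; state <= 4; state++) EDGES[state][c] = 4;
        }
        EDGES[1]['i'] = 2;
        EDGES[2]['f'] = 3;
        EDGES[1]['+'] = 5;
    }

    private final PushbackReader in;

    TableScanner(String text) {
        in = new PushbackReader(new StringReader(text), text.length() + 1);
    }

    /** Returns the next token as KIND(lexeme), or EOF at end of input. */
    String nextToken() throws IOException {
        int state = 1;
        int lastFinalState = 0;
        int lastLength = 0;                       // length of the longest lexeme matched so far
        StringBuilder str = new StringBuilder();  // all characters read for this token
        while (state != 0) {
            int ch = in.read();
            if (ch == -1) break;                  // end of input: fall back to the last match
            str.append((char) ch);
            state = ch < 128 ? EDGES[state][ch] : 0;
            if (state != 0 && KIND[state] != null) {
                lastFinalState = state;
                lastLength = str.length();
            }
        }
        if (str.length() == 0) return "EOF";
        if (lastFinalState == 0) lastLength = 1;  // nothing matched: one-character ERROR token
        // Push back the characters read beyond the returned lexeme so they are scanned again.
        in.unread(str.toString().toCharArray(), lastLength, str.length() - lastLength);
        return KIND[lastFinalState] + "(" + str.substring(0, lastLength) + ")";
    }

    /** Scans the whole text and returns all tokens before EOF. */
    static List<String> scanAll(String text) {
        TableScanner scanner = new TableScanner(text);
        List<String> tokens = new ArrayList<>();
        try {
            for (String t = scanner.nextToken(); !t.equals("EOF"); t = scanner.nextToken()) {
                tokens.add(t);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scanAll("if+iff+x"));
    }
}
```

When scanning "if+iff+x", the scanner reads "if+" before reaching the dead state, returns IF for "if", and pushes the "+" back so that it starts the next token.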
Handling End-of-file (EOF) and non-tokens
EOF
– construct an explicit EOF token when the EOF character is read
Non-tokens (Whitespace & Comments)
– view as tokens of a special kind
– scan them as normal tokens, but don't create token objects for them
– loop in next() until a real token has been found
Errors
– construct an explicit ERROR token to be returned when no valid token can be found.
Specifying EOF and ERROR in JFlex

    package lang; // the generated scanner will belong to the package lang
    import lang.Token; // Class for tokens
    ...
    // ignore whitespace
    " " | \t | \n | \r | \f { /* ignore */ }
    // tokens
    "if"      { return new Token("IF"); }
    "="       { return new Token("ASSIGN"); }
    "<"       { return new Token("LT"); }
    "<="      { return new Token("LE"); }
    [a-zA-Z]+ { return new Token("ID", yytext()); }
    ...
    <<EOF>>   { return new Token("EOF"); }
    [^]       { return new Token("ERROR"); }

<<EOF>> is a special regular expression in JFlex, matching end of file.
[^] means any character. Due to rule priority, this will match any character not matched by previous rules.
Example scanner generators

tool                 author                  generates
lex                  Schmidt, Lesk. 1975     C code
flex ("fast lex")    Paxson. 1987            C code
jlex                                         Java code
jflex                                        Java code
...
Limitations of regular expressions for scanning
• Nested comments?
• Layout-sensitive syntax?
• Context-sensitive token definitions? For example, multi-language documents.
• Two mechanisms in scanner generators for workarounds:
  – Lexical actions: do more than create a token, e.g., count nesting levels of comments.
  – Lexical states: switch between different sets of token definitions.
Lexical states
• Some tokens are difficult or impossible to define with regular expressions.
• Lexical states (sets of token rules) give the possibility to switch token sets (DFAs) during scanning.
• Useful for multi-line comments, HTML, scanning multi-language documents, etc.
• Supported by many scanner generators (including JFlex)
[Figure: two lexical states, each with its own set of token rules: LEXSTATE1 with T1, T2, T3, T4, ... and LEXSTATE2 with T5, T6, T7, ...]
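As an illustration of both workarounds, the following hand-written Java sketch switches between two lexical states and counts comment nesting levels. It is not JFlex code; the state names, the class name, and the decision to recognize only identifiers outside comments are assumptions made to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

final class LexicalStateDemo {
    enum State { DEFAULT, COMMENT }

    /** Returns the identifiers in the input, skipping whitespace and (possibly nested) comments. */
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        State state = State.DEFAULT;
        int depth = 0;   // comment nesting level, maintained by the "lexical actions"
        int pos = 0;
        while (pos < input.length()) {
            boolean open = input.startsWith("/*", pos);
            boolean close = input.startsWith("*/", pos);
            if (state == State.COMMENT) {
                // Token set used inside a comment: "/*", "*/", and any other single character.
                if (open) {
                    depth++;
                    pos += 2;
                } else if (close) {
                    depth--;
                    pos += 2;
                    if (depth == 0) state = State.DEFAULT;   // switch back to the default token set
                } else {
                    pos++;
                }
            } else if (open) {
                state = State.COMMENT;                       // switch to the comment token set
                depth = 1;
                pos += 2;
            } else if (Character.isLetter(input.charAt(pos))) {
                int start = pos;
                while (pos < input.length() && Character.isLetter(input.charAt(pos))) pos++;
                tokens.add("ID(" + input.substring(start, pos) + ")");
            } else if (Character.isWhitespace(input.charAt(pos))) {
                pos++;
            } else {
                tokens.add("ERROR(" + input.charAt(pos) + ")");
                pos++;
            }
        }
        if (state == State.COMMENT) tokens.add("ERROR(unterminated comment)");
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("a /* x /* y */ z */ b"));
    }
}
```

The depth counter is what a regular expression cannot express: the scanner leaves the COMMENT state only when every opened comment has been closed.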
Example: multi-line comments
Would like to scan the complete comment as one token:

    /*
    int m() {
      return 15 / 3 * 4 * 2;
    }
    */

Writing an ordinary regular expression for this is difficult:

    "/*"((\*+[^/*])|([^*]))*\**"*/"

Can be solved easily with lexical states:
[Figure: the default token set (ID, "if", "then", "/*", ...) switches on "/*" to the token set used inside a comment ("*/" and [^]), which switches back on "*/".]
However, some scanner generators, like JFlex, have the special operator upto (~) that can be used instead:

    "/*" ~"*/" { /* Comment */ }
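The ordinary regular expression for multi-line comments can be tried out with java.util.regex. The pattern below is a hand translation of the JFlex expression into Java regex syntax (the quoted strings "/*" and "*/" become /\* and \*/); the class and method names are assumptions for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommentRegexDemo {
    // Java translation of the JFlex expression "/*"((\*+[^/*])|([^*]))*\**"*/"
    static final Pattern COMMENT = Pattern.compile("/\\*((\\*+[^/*])|([^*]))*\\**\\*/");

    /** Returns the comment at the very start of the input, or null if there is none. */
    static String leadingComment(String input) {
        Matcher m = COMMENT.matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        String program = "/*\nint m() {\n  return 15 / 3 * 4 * 2;\n}\n*/ int x;";
        System.out.println(leadingComment(program));
    }
}
```

The body of the pattern can never consume a "*/", because a run of stars must be followed by a character that is neither / nor *. The match therefore ends at the first "*/", which is the behavior wanted for non-nested comments.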
Course overview
[Figure: the compiler phases diagram from the start of the lecture, repeated. The scanner is marked "This lecture" and the parser "Next lecture"; the label A1 (Assignment 1) appears twice.]
Summary questions
• What is a formal language?
• What is a regular expression?
• What is meant by an ambiguous lexical definition?
• Give some typical examples of ambiguities and how they may be resolved.
• What is a lexical action?
• Give an example of how to construct an NFA for a given lexical definition.
• Give an example of how to construct a DFA for a given NFA.
• What is the difference between a DFA and an NFA?
• Give an example of how to implement a DFA in Java.
• How is rule priority handled in the implementation? Longest match? EOF? Whitespace? Errors?
• What are lexical states? When are they useful?

You can start on Assignment 1 now. But you will have to wait until the next lecture for the parts about parsing.