EDAN65:Compilers,Lecture 03
Context-free grammars,introductionto parsingGörelHedinRevised:2017-09-04
Courseoverview
Semantic analyzer
Intermediatecode generator
Optimizer
Targetcodegenerator
2
Lexical analyzer(scanner)
Syntactic analyzer(parser)
Regularexpressions
Context-freegrammar
Attributegrammar
machine
runtime system
stack
heap
codeanddata
objects
activationrecords
Interpreter
target code
tokens
Attributed AST
intermediate code
sourcecode (text)
AST(Abstractsyntaxtree)
intermediate code
garbagecollection
Virtualmachine
This lecture
Analyzing programtext
3
sum = sum + k ;
AssignStmt
Exp
Add
Exp Exp
ID EQ ID PLUS ID SEMIprogram text
tokens
parse tree
Recall:Generatingthecompiler:
Semantic analyzer
Lexical analyzer(scanner)
Syntactic analyzer(parser)
Regularexpressions
ScannergeneratorJFlex
Context-freegrammar
ParsergeneratorBeaver
Attributegrammar
Attribute evaluatorgenerator
We will use aparsergeneratorcalled Beaver
4
tokens
text
tree
5
Context-Free Grammars
Regular ExpressionsvsContext-Free Grammars
6
AnREcan have iteration
ACFGcan also have recursion(itispossible toderive asymbol,e.g.,Stmt,fromitself)
Example REs:WHILE = "while"ID = [a-z][a-z0-9]*LPAR = "("RPAR = ")"PLUS = "+"...
Example CFG:Stmt –> WhileStmtStmt –> AssignStmtWhileStmt –> WHILE LPAR Exp RPAR StmtExp –> IDExp –> Exp PLUS Exp...
ElementsofaContext-Free Grammar
7
Production rules:X –> s1 s2 … sn
where sk isasymbol(terminalornonterminal)
Nonterminal symbols
Terminalsymbols(tokens)
Startsymbol(one ofthenonterminals,usually theleft-handside of thefirst production)
Example CFG:Stmt –> WhileStmtStmt –> AssignStmtWhileStmt –> WHILE LPAR Exp RPAR StmtAssignStmt –> ID EQ Exp SEMIC…
Shorthand foralternatives
8
Stmt –> WhileStmtStmt –> AssignStmt
Stmt –> WhileStmt | AssignStmt
isequivalent to
Shorthand forrepetition
9
Stmt*
StmtList –> e | Stmt StmtList
isequivalent to
StmtList
where
ExerciseConstruct agrammar covering this programandsimilar ones:
10
Example program:while (k <= n) {sum = sum + k; k = k+1;}
CFG:Stmt –> WhileStmt | AssignStmt | CompoundStmtWhileStmt –> "while" "(" Exp ")" StmtAssignStmt –> ID "=" Exp ";"CompoundStmt –> ...Exp –> ...LessEq –> ...Add –> ...
(Often,simpletokensarewritten directly astextstrings)
SolutionConstruct agrammar covering this programandsimilar ones:
11
CFG:Stmt –> WhileStmt | AssignStmt | CompoundStmtWhileStmt –> "while" "(" Exp ")" StmtAssignStmt –> ID "=" Exp ";"CompoundStmt –> "{" Stmt* "}"Exp –> LessEq | Add | ID | INTLessEq –> Exp "<=" ExpAdd –> Exp "+" Exp
Example program:while (k <= n) {sum = sum + k; k = k+1;}
ParsingUse thegrammar toderive atree foraprogram:
12sum = sum + k ;
StmtExample program:sum = sum + k;
Startsymbol
Parse treeUse thegrammar toderive aparse tree foraprogram:
13sum = sum + k ;
Stmt
AssignStmt
Exp
Add
Exp Exp
Example program:sum = sum + k;
Nonterminalsareinnernodes
Startsymbol
Terminalsareleafs
Aparse tree includes allthetokensasleafs.
Corresponding abstractsyntaxtree(will bediscussed inlaterlecture)
14sum = sum + k ;
AssignStmt
Add
IdExp IdExp
Example program:sum = sum + k;
IdExp
Anabstractsyntaxtree issimilarto aparse tree,but simpler.
Itdoes notinclude allthetokens.
EBNFvsCanonical Form
15
EBNF:Stmt –> AssignStmt | CompoundStmtAssignStmt –> ID "=" Exp ";"CompoundStmt –> "{" Stmt* "}"Exp –> Add | IDAdd –> Exp "+" Exp
Canonical form:Stmt –> ID "=" Exp ";"Stmt –> "{" Stmts "}"Stmts –> eStmts –> Stmt StmtsExp –> Exp "+" ExpExp –> ID
(Extended)Backus-Naur Form:• Compact,easytoreadandwrite• EBNFhasalternatives,repetition,optionals,parentheses (likeREs)
• Commonnotationforpracticaluse
Canonical form:• Core formalismforCFGs• Useful forproving properties andexplaining algorithms
Realworldexample:TheJavaLanguageSpecification
16
See http://docs.oracle.com/javase/specs/jls/se8/html/index.html• See Chapter 2about theJavagrammar notation.• Lookatsome other chapters to see other syntaxexamples.
CompilationUnit:[PackageDeclaration]{ImportDeclaration}{TypeDeclaration}
PackageDeclaration:{PackageModifier}package Identifier {.Identifier};
PackageModifier:Annotation
…
17
Formaldefinitionof CFGs
FormaldefinitionofCFGs (canonical form)
18
Acontext-free grammar G=(N,T,P,S),whereN– thesetofnonterminalsymbolsT– thesetofterminalsymbolsP– thesetofproduction rules,each withtheform
X–>Y1 Y2 …Ynwhere X∈ N,n≥ 0,andYk∈ N∪ T
S– thestartsymbol(one ofthenonterminals).I.e.,S∈ N
So,theleft-hand side Xofarule isanonterminal.
Andtheright-hand side Y1 Y2 …Yn isasequence of nonterminalsandterminals.
If therhs foraproduction isempty,i.e.,n=0,we writeX–>e
AgrammarG defines alanguageL(G)
19
A context-free grammar G=(N,T,P,S),whereN– thesetofnonterminalsymbolsT– thesetofterminalsymbolsP– thesetofproduction rules,each withtheform
X–>Y1 Y2 …Ynwhere X∈ N,n≥ 0,andYk∈ N∪ T
S– thestartsymbol(one ofthenonterminals).I.e.,S∈ N
Gdefines alanguage L(G) overthealphabet T
T*isthesetofallpossible sequences ofTsymbols.
L(G)isthesubsetofT*thatcan bederived fromthestartsymbolS,byfollowing theproduction rules P.
Exercise
20
G = (N, T, P, S)
P = {Stmt –> ID "=" Exp ";",Stmt –> "{" Stmts "}" ,Stmts –> e ,Stmts –> Stmt Stmts ,Exp –> Exp "+" Exp ,Exp –> ID
}
N = { }
T = { }
S =
L(G) = {
}
Solution
21
G = (N, T, P, S)
P = {Stmt –> ID "=" Exp ";",Stmt –> "{" Stmts "}" ,Stmts –> e ,Stmts –> Stmt Stmts ,Exp –> Exp "+" Exp ,Exp –> ID
}
N = {Stmt, Exp, Stmts}
T = {ID, "=", "{", "}", ";", "+"}
S = Stmt
L(G) = {"{" "}","{" "{" "}" "}",ID "=" ID ";","{" ID "=" ID ";" "}",ID "=" ID "+" ID ";","{" "{" "}" "{" "}" "}","{" "{" "{" "}" "}" "}","{" ID "=" ID "+" ID ";" "}",ID "=" ID "+" ID "+" ID ";",...
}
Thesequences inL(G)areusually called sentences orstrings
22
Derivations
Derivationstep
23
If we have asequence ofterminalsandnonterminals,e.g.,
XaYY b
we can replace one ofthenonterminals,applying aproductionrule.Thisiscalled aderivationstep.(Swedish:Härledningssteg)
Supposethere isaproduction
Y–>Xa
andwe apply itforthefirstYinthesequence.We write thederivationstepasfollows:
XaYY b=>XaXaYb
Derivation
24
Aderivation,issimply asequence ofderivationsteps,e.g.:
g0 =>g1 =>…=>gn (n≥0)
where each gi isasequence ofterminalsandnonterminals
Ifthere isaderivationfromg0 togn,we can write thisas
g0 =>*gn
Sothismeans itispossible togetfromthesequence g0 tothesequence gn byfollowing theproduction rules.
Definitionofthelanguage L(G)
25
Recall that:
G=(N,T,P,S)
T* isthesetofallpossible sequences ofT symbols.
L(G)isthesubsetofT*thatcan bederived fromthestartsymbol S,byfollowing theproduction rules P.
Using theconcept ofderivations,we can formally define L(G) asfollows:
L(G)={w∈ T*| S=>*w}
Exercise:Prove thatasentence belongs toalanguage
26
Prove that
INT+INT*INT
Proof (byshowing allthederivationstepsfromthestartsymbolExp):
Exp=>
belongs tothelanguage ofthefollowing grammar:
p1: Exp –>Exp "+" Expp2: Exp –>Exp "*" Expp3: Exp –> INT
Solution:Prove thatasentence belongs toalanguage
27
Prove that
INT+INT*INT
Proof:(byshowing allthederivationstepsfromthestartsymbolExp)
Exp=>p1 Exp "+" Exp=>p3 INT"+"Exp=>p2 INT"+"Exp "*"Exp=>p3 INT"+"INT"*"Exp=>p3 INT"+"INT"*"INT
belongs tothelanguage ofthefollowing grammar:
p1: Exp –>Exp "+" Expp2: Exp –>Exp "*" Expp3: Exp –> INT
Leftmost andrightmost derivations
28
Inaleftmost derivation,theleftmost nonterminalisreplacedineach derivationstep,e.g.,:
Exp =>Exp "+" Exp =>INT"+"Exp =>INT"+"Exp "*"Exp =>INT"+"INT"*"Exp =>INT"+"INT"*"INT
LLparsingalgorithms use leftmost derivation.LRparsingalgorithms use rightmost derivation.Willbediscussed inlaterlectures.
Inarightmost derivation,therightmost nonterminalisreplaced ineach derivationstep,e.g.,:
Exp =>Exp "+" Exp =>Exp "+"Exp "*"Exp =>Exp "+"Exp "*"INT=>Exp "+"INT"*"INT=>INT"+"INT"*"INT
Aderivationcorresponds tobuilding aparse tree
29
Grammar:Exp –>Exp "+" ExpExp –>Exp "*" ExpExp –> INT
Example derivation:
Exp =>Exp "+" Exp =>INT"+"Exp =>INT"+"Exp "*"Exp =>INT"+"INT"*"Exp =>INT"+"INT"*"INT
Exercise:build theparse tree(also called derivationtree).
Aderivationcorresponds tobuilding aparse tree
30
Grammar:Exp –>Exp "+" ExpExp –>Exp "*" ExpExp –> INT
Example derivation:
Exp =>Exp "+" Exp =>INT"+"Exp =>INT"+"Exp "*"Exp =>INT"+"INT"*"Exp =>INT"+"INT"*"INT
Parse tree (derivationtree):
Exp
Exp Exp
Exp Exp
"+"
INT"*"
INT INT
31
Ambiguities
Exercise:Canwe do another derivationofthesamesentence,
thatgivesadifferentparse tree?
32
Grammar:Exp –>Exp "+" ExpExp –>Exp "*" ExpExp –> INT
Parse tree:
Anotherderivation:
Exp =>
Solution:Canwe do another derivationofthesamesentence,
thatgivesadifferentparse tree?
33
Grammar:Exp –>Exp "+" ExpExp –>Exp "*" ExpExp –> INT
Parse tree:
Exp
Exp"*"
INT
Exp
Exp Exp"+"
INT INT
Anotherderivation:
Exp =>Exp "*" Exp =>Exp "+"Exp "*"Exp =>INT"+"Exp "*"Exp =>INT"+"INT"*"Exp =>INT"+"INT"*"INT
Which parse tree would we prefer?
Ambiguous context-freegrammars
34
ACFGisambiguous if asentence inthelanguage can bederived bytwo (ormore)differentparse trees.
ACFGisunambiguous if each sentence inthelanguage canbederived byonly one parse tree.
(Swedish:tvetydig,otvetydig)
Note!There can bemany differentderivationsthat give thesameparse tree.
HowcanweknowifaCFGisambiguous?
35
Ifwe find anexample of anambiguity,we know thegrammar isambiguous.
There are algorithms fordeciding if aCFGbelongs to certainsubsets of CFGs,e.g.LL,LR,etc.(See laterlectures.)Thesegrammarsare unambiguous.
But inthegeneralcase,theproblemisundecidable:itisnotpossible to construct ageneralalgorithm that decidesambiguity foranarbitrary CFG.
Strategies foreliminating ambiguities,next lecture.
36
Parsing
Differentparsingalgorithms
37
Ambiguous
Unambiguous
Allcontext-freegrammars
LR
LL
LL:Left-to-rightscanLeftmost derivationBuilds tree top-downSimpleto understand
LR:Left-to-rightscanRightmost derivationBuilds tree bottom-upMore powerful
LLandLRparsers:main idea
38
...if IDthen ID=ID;ID...
LR(1):decidestobuildAssignafterseeingthefirsttokenfollowingitssubtree.Thetreeisbuiltbottomup.
Id Assign
Id Id
Thetokeniscalled lookahead.LL(k)andLR(k)use k lookahead tokens.
...if IDthen ID=ID;ID...
IfStmt
Id Assign
LL(1):decidestobuildAssignafterseeingthefirsttokenofitssubtree.Thetreeisbuilttopdown.
CompoundStmt
Recursive-descentparsingAwayofprogramminganLL(1)parserbyrecursivemethodcalls
39
Assume aBNFgrammar with exactly one production rule foreach nonterminal.(Can easily begeneralized to EBNF.)
Each production rule RHSiseither1. asequence of token/nonterminal symbols,or2. asetof nonterminal symbolalternatives
Foreach nonterminal,amethod isconstructed.Themethod1. matches tokensandcallsnonterminal methods,or2. callsone of thenonterminal methods – which one depends onthe
lookahead token.
Ifthelookahead tokendoes notmatch,aparsingerror isreported.
A–>B|C|DB–>eCfDC–>...D–>...
ExampleJavaimplementation:overview
40
statement –>assignment |compoundStmtassignment–>IDASSIGN expr SEMICOLONcompoundStmt –>LBRACE statement*RBRACE...
class Parser{privateint token; //current lookahead tokenvoid accept(int t){...} //accepttandreadinnext tokenvoid error(Stringstr){...} //generate error messagevoid statement(){...}void assignment (){...}void compoundStmt (){...}...
}
Example:recursivedescentmethods
41
statement –>assignment |compoundStmtassignment–>IDASSIGN expr SEMICOLONcompoundStmt –>LBRACE statement*RBRACE
class Parser{void statement(){switch(token){case ID:assignment();break;case LBRACE:compoundStmt();break;default:error("Expecting statement,found:"+token);}}void assignment(){accept(ID);accept(ASSIGN);expr();accept(SEMICOLON);}void compoundStmt(){accept(LBRACE);while (token!=RBRACE){statement();}accept(RBRACE);}...}
Example:Parserskeletondetails
42
statement –>assignment |compoundStmtassignment–>IDASSIGN expr SEMICOLONcompoundStmt –>LBRACE statement*RBRACEexpr –>...
class Parser{finalstatic int ID=1,WHILE=2,DO=3,ASSIGN=4,...;privateint token; //current lookahead tokenvoid accept(int t){ //accepttandreadinnext tokenif (token==t){token=nextToken();}else {error("Expected "+t+",but found "+token);}}void error(Stringstr){...} //generate error messageprivateint nextToken(){...}//readnext tokenfromscannervoid statement()......
}
ArethesegrammarsLL(1)?
43
expr –>name params |name
Whatwouldhappeninarecursive-descentparser?
Could they beLL(2)?LL(k)?
Commonprefix
expr –>expr "+" term Left recursion
Dealingwithcommonprefixoflimitedlength:Locallookahead
44
LL(2)grammar:statement –>assignment |compoundStmt |callStmtassignment–>IDASSIGN expr SEMICOLONcompoundStmt –>LBRACE statement*RBRACEcallStmt –> IDLPAR expr RPAR SEMICOLON
voidstatement()...
45
LL(2)grammar:statement –>assignment |compoundStmt |callStmtassignment–>IDASSIGN expr SEMICOLONcompoundStmt –>LBRACE statement*RBRACEcallStmt –> IDLPAR expr RPAR SEMICOLON
voidstatement(){switch(token){caseID:if(lookahead(2) ==ASSIGN){assignment();}else{callStmt();}break;caseLBRACE:compoundStmt();break;default:error("Expectingstatement,found:"+token);}}
Dealingwithcommonprefixoflimitedlength:Locallookahead
Generatingtheparser:
Syntactic analyzer(parser)
Context-freegrammar Parsergenerator
46
tokens
tree
Beaver:anLR-based parsergenerator
ParserinJava
Context-freegrammar,
with semanticactionsinJava
Beaver
47
tokens
tree
Example beaver specification
48
%class "LangParser";%package "lang";...%terminalsLET,IN,END,ASSIGN,MUL,ID,NUMERAL;
%goal program;//Thestartsymbol
//Context-free grammarprogram=exp;exp =factor |exp MULfactor;factor =let |numeral |id;let =LETidASSIGNexp INexp END;numeral =NUMERAL;id=ID;
Lateron,we will extend this specification with semantic actionsto build thesyntaxtree.
49
RE CFGTypicalAlphabet
characters terminalsymbols(tokens)
Language isasetof ...
strings(charsequences)
sentences(tokensequences)
Used for... tokens parsetreesPower iteration recursionRecognizer DFA DFAwith stack
RegularExpressionsvsContext-FreeGrammars
50
Grammar Rulepatterns Typeregular X –>aY orX –>a orX –>e 3
contextfree X–> g 2context sensitive a X b –>a g b 1
arbitrary g –>d 0
TheChomskyhierarchy of formalgrammars
a – terminalsymbola, b, g, d – sequences of (terminalornonterminal)symbols
Type(3)⊂ Type (2)⊂ Type(1)⊂ Type(0)
Regular grammarshave thesamepower asregular expressions(tail recursion =iteration).
Type 2and3are of practicaluse incompiler construction.Type 0and1are only of theoretical interest.
Courseoverview
Semantic analyzer
51
Lexical analyzer(scanner)
Syntactic analyzer(parser)
Regularexpressions
Context-freegrammar
Attributegrammar
tokens
sourcecode (text)
AST(Abstractsyntaxtree)
What we have covered:Context-free grammars,derivations,parse treesAmbiguous grammarsIntroduction to parsing,recursive-descent
You can now finishassignment 1
Summary questions
52
• Construct aCFGforasimplepartof aprogramming language.• What isanonterminal symbol?Aterminalsymbol?Aproduction?Astartsymbol?Aparse tree?• What isaleft-handside of aproduction?Aright-handside?• Givenagrammar G,what ismeant bythelanguage L(G)?• What isaderivationstep?Aderivation?Aleftmost derivation?Arighmostderivation?• How does aderivationcorrespond to aparse tree?• What does itmean foragrammar to beambiguous?Unambiguous?• Give anexample anambiguous CFG.• What isthedifference between anLLandanLRparser?• What isthedifference between LL(1)andLL(2)?Orbetween LR(1)andLR(2)?• Construct arecursive descent parserforasimplelanguage.• Give typical examples of grammarsthat cannot behandledbyarecursive-descent parser.• Explain why context-free grammarsare more powerful than regularexpressions.• Inwhat senseare context-free grammars"context-free"?
Top Related