Syntax Analysis (Cont.) Manas Thakur
CS502: Compiler Design
Syntax Analysis (Cont.)
Manas Thakur
Fall 2020
Manas Thakur CS502: Compiler Design 2
What next?
● Bottom-up parsing
● Why?
– BU parsers are more powerful than TD parsers
  ● Cover more kinds of grammars (e.g., no need to eliminate left recursion)
– More efficient as well
● Bad news: Slightly more complicated
● Good news: Well known parser generators exist
Bottom-Up Parsing
● Given a string, construct a parse tree by starting at the leaves and walking up to the root.
● The process is called reduction.
– Reduce a string w to the start symbol of the grammar.
– Recall derivation from top-down parsing?
Reduction
● At each reduction step, a specific substring matching the body of a production is replaced by the non-terminal at the head of the production.
● Basically we are constructing a rightmost derivation in reverse!
● How to decide which substring to reduce?
Grammar:
E → E + T | T
T → T * F | F
F → id
Reduction steps for id * id (each step also adds a node to the parse tree):
id * id
F * id (reduce by F → id)
T * id (reduce by T → F)
T * F (reduce by F → id)
T (reduce by T → T * F)
E (reduce by E → T)
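The reduction sequence for id * id can be replayed mechanically. The following is a minimal sketch, not a parser: the order of the reductions is supplied by hand (it is the one from the slide), and the names STEPS, reduce_leftmost and replay are hypothetical helpers introduced only for illustration.

```python
# Each step is (body, head): an occurrence of body is replaced by the non-terminal head.
# The order follows the slide: F → id, T → F, F → id, T → T * F, E → T.
STEPS = [
    (["id"], "F"),
    (["F"], "T"),
    (["id"], "F"),
    (["T", "*", "F"], "T"),
    (["T"], "E"),
]


def reduce_leftmost(form, body, head):
    """Replace the leftmost occurrence of body in form by head."""
    n = len(body)
    for i in range(len(form) - n + 1):
        if form[i:i + n] == body:
            return form[:i] + [head] + form[i + n:]
    raise ValueError(f"{body} does not occur in {form}")


def replay(tokens, steps):
    """Return every intermediate string, from the input up to the start symbol."""
    forms = [list(tokens)]
    for body, head in steps:
        forms.append(reduce_leftmost(forms[-1], body, head))
    return forms


forms = replay(["id", "*", "id"], STEPS)
for form in forms:
    print(" ".join(form))
```

Read bottom-up, the printed strings form a rightmost derivation of id * id. Picking the leftmost occurrence happens to select the right substring here only because the step order is given; deciding which substring to reduce is exactly the question the slide raises.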
Handle pruning
● A handle is a substring that matches the body of a production, and reducing this handle represents one step of reduction.
● Theorem: If G is unambiguous, then every right-sentential form has a unique handle.
● Why did we say “a handle” instead of “the handle”?
● BU parsing is essentially the problem of handle pruning.
Right Sentential Form    Handle    Reducing Production
id1 * id2                id1       F → id
F * id2                  F         T → F
T * id2                  id2       F → id
T * F                    T * F     T → T * F
T                        T         E → T
Shift-Reduce Parsing
● Uses a stack to perform bottom-up parsing
● Four actions:
– Shift: shift the next input symbol on top of stack
– Reduce: pop handle off the stack and push the corresponding non-terminal
– Accept: parsing successful
– Error: parsing failed
● The standard scheme used by LR parsers (L: left-to-right scanning of the input; R: rightmost derivation, constructed in reverse).
LR Parsing Example
● A table guides the actions, based on the top of the stack and the next input symbol.
Stack        Input          Action
$            id1 * id2 $    shift
$ id1        * id2 $        reduce by F → id
$ F          * id2 $        reduce by T → F
$ T          * id2 $        shift
$ T *        id2 $          shift
$ T * id2    $              reduce by F → id
$ T * F      $              reduce by T → T * F
$ T          $              reduce by E → T
$ E          $              accept
The job of all LR parsers is to construct the “action” table.
LR Parsing Algorithms
● Simple LR or SLR
– Smallest class of grammars
– Smallest tables
– Simple, fast construction
● Canonical LR or CLR
– Largest set of grammars
– Largest tables
– Slow construction
● LookAhead LR or LALR
– Intermediate set of grammars
– Same number of states as SLR
– Faster construction than CLR
LR(k) Items
● An LR(k) item is a pair [α, β], where
– α is a production with a • at some position in the RHS, marking how much of the RHS has been seen
– β is a lookahead string containing k symbols (terminals or $)
● Two cases of interest:
– LR(0) items for SLR table construction
– LR(1) items for CLR and LALR table construction
Example of LR(0) items
● A → XYZ generates four LR(0) items:
– [A → •XYZ]
– [A → X•YZ]
– [A → XY•Z]
– [A → XYZ•]
● [A → •XYZ] indicates that the parser is looking for a string that can be derived from XYZ
● [A → XY•Z] indicates that the parser has seen a string derived from XY and is looking for one derivable from Z
CLOSURE
● Given an item [A → α • Bβ], its closure contains the item and any other items that can generate legal substrings to follow α.
function CLOSURE(I)
  repeat
    if [A → α • Bβ] ∈ I
      add [B → •γ] to I, for every production B → γ
  until no more items can be added to I
  return I
Grammar G' with an augmented production:
E' → E
E → E + T | T
T → T * F | F
F → (E) | id
I = {[E' → •E]}
CLOSURE(I) = I0:
E' → •E
E → •E + T
E → •T
T → •T * F
T → •F
F → •(E)
F → •id
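The CLOSURE function can be sketched with a worklist instead of the repeat/until loop. This is a minimal sketch under assumed encodings that are not part of the slides: an item is a (head, body, dot) triple, and the grammar is a dictionary named GRAMMAR mapping each non-terminal to its production bodies.

```python
# Augmented expression grammar G' from the slide.
GRAMMAR = {
    "E'": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("id",)],
}


def closure(items):
    """Return the LR(0) closure of a set of (head, body, dot) items."""
    result = set(items)
    worklist = list(items)
    while worklist:
        head, body, dot = worklist.pop()
        # Only items with a non-terminal right after the dot add anything.
        if dot < len(body) and body[dot] in GRAMMAR:
            for gamma in GRAMMAR[body[dot]]:
                item = (body[dot], gamma, 0)
                if item not in result:
                    result.add(item)
                    worklist.append(item)
    return result


I0 = closure({("E'", ("E",), 0)})
for head, body, dot in sorted(I0):
    print(head, "→", " ".join(body[:dot]), "•", " ".join(body[dot:]))
```

The worklist visits each item once, which avoids rescanning the whole set on every iteration; the resulting set is the same as with the repeat/until formulation.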
GOTO
● Let I be a set of LR(0) items and X be a grammar symbol. Then, GOTO(I, X) is the closure of the set of all items
– [A → αX•β] such that [A → α•Xβ] ∈ I
Grammar:
E' → E
E → E + T | T
T → T * F | F
F → (E) | id
I0:
E' → •E
E → •E + T
E → •T
T → •T * F
T → •F
F → •(E)
F → •id
GOTO(I0, E) = I1:
E' → E•
E → E• + T
Classwork: Construct GOTO(I1, +).
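GOTO can be sketched on top of CLOSURE in a few lines. As before, this is only an illustration under assumed encodings: items are (head, body, dot) triples, and the names GRAMMAR, closure and goto are not from the slides.

```python
GRAMMAR = {
    "E'": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("id",)],
}


def closure(items):
    """LR(0) closure of a set of (head, body, dot) items."""
    result = set(items)
    worklist = list(items)
    while worklist:
        head, body, dot = worklist.pop()
        if dot < len(body) and body[dot] in GRAMMAR:
            for gamma in GRAMMAR[body[dot]]:
                item = (body[dot], gamma, 0)
                if item not in result:
                    result.add(item)
                    worklist.append(item)
    return result


def goto(items, symbol):
    """Move the dot over symbol in every item that allows it, then close."""
    moved = {
        (head, body, dot + 1)
        for (head, body, dot) in items
        if dot < len(body) and body[dot] == symbol
    }
    return closure(moved)


I0 = closure({("E'", ("E",), 0)})
I1 = goto(I0, "E")
print(sorted(I1))
```

Calling goto(I1, "+") with this sketch is one way to check an answer to the classwork exercise.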
LR(0) Automaton
Grammar:
E' → E
E → E + T | T
T → T * F | F
F → (E) | id
States:
I0: E' → •E; E → •E + T; E → •T; T → •T * F; T → •F; F → •(E); F → •id
I1: E' → E•; E → E• + T
I2: E → T•; T → T• * F
I3: T → F•
I4: F → (•E); E → •E + T; E → •T; T → •T * F; T → •F; F → •(E); F → •id
I5: F → id•
I6: E → E + •T; T → •T * F; T → •F; F → •(E); F → •id
I7: T → T * •F; F → •(E); F → •id
I8: E → E• + T; F → (E•)
I9: E → E + T•; T → T• * F
I10: T → T * F•
I11: F → (E)•
Transitions:
I0: on E go to I1; on T go to I2; on F go to I3; on ( go to I4; on id go to I5
I1: on + go to I6; on $ accept
I2: on * go to I7
I4: on E go to I8; on T go to I2; on F go to I3; on ( go to I4; on id go to I5
I6: on T go to I9; on F go to I3; on ( go to I4; on id go to I5
I7: on F go to I10; on ( go to I4; on id go to I5
I8: on + go to I6; on ) go to I11
I9: on * go to I7
Before we reach SLR
● We can build a parser simpler than SLR using LR(0) item sets for the following grammar:
E' → E
E → E + T | T
T → (E) | id
States:
I0: E' → •E; E → •E + T; E → •T; T → •(E); T → •id
I1: E' → E•; E → E• + T
I2: E → T•
I3: T → id•
I4: T → (•E); E → •E + T; E → •T; T → •(E); T → •id
I5: E → E + •T; T → •(E); T → •id
I6: E → E• + T; T → (E•)
I7: E → E + T•
I8: T → (E)•
Transitions:
I0: on E go to I1; on T go to I2; on id go to I3; on ( go to I4
I1: on + go to I5; on $ accept
I4: on E go to I6; on T go to I2; on id go to I3; on ( go to I4
I5: on T go to I7; on id go to I3; on ( go to I4
I6: on + go to I5; on ) go to I8
Constructing LR(0) parsing table
● Construct the LR(0) item sets for G'
– G' is G with an augmented start production S' → S
● State i is constructed using set Ii
– [A → α•aβ] ∈ Ii and GOTO(Ii, a) = Ij ⇒ ACTION[i, a] ← “shift j”, ∀a ≠ $
– [A → α•] ∈ Ii, A ≠ S' ⇒ ACTION[i, a] ← “reduce A → α”, ∀a
– [S' → S•] ∈ Ii ⇒ ACTION[i, $] ← “accept”
● GOTO(Ii, A) = Ij ⇒ GOTO[i, A] ← j
● Set undefined entries in ACTION and GOTO to “error”
● Initial state of parser is CLOSURE([S' → •S])
LR(0) Parsing Table
State  id        +         (         )         $         E  T
0      s3                  s4                            1  2
1                s5                            accept
2      r(E→T)    r(E→T)    r(E→T)    r(E→T)    r(E→T)
3      r(T→id)   r(T→id)   r(T→id)   r(T→id)   r(T→id)
4      s3                  s4                            6  2
5      s3                  s4                               7
6                s5                  s8
7      r(E→E+T)  r(E→E+T)  r(E→E+T)  r(E→E+T)  r(E→E+T)
8      r(T→(E))  r(T→(E))  r(T→(E))  r(T→(E))  r(T→(E))
Grammar:
E' → E
E → E + T | T
T → (E) | id
Need for more powerful LR parsers
● LR(0) is too simple to cover many grammars.
● Doesn’t cover even our expression grammar:
E' → E
E → E + T | T
T → T * F | F
F → (E) | id
● Recall the giant automaton:
– e.g.: s7 or r(E→T) on (I2, *)
– Called a shift-reduce conflict
– Similarly we can have reduce-reduce conflicts
– Multiply defined entries imply the grammar is not LR(0)
● Further reading: Section 4.5.4 (DB)
● Reason: LR(0) automata do not know on what next symbol to reduce, and end up adding too many reduce actions conservatively.
Constructing SLR parsing table
● Construct the LR(0) item sets for G'
– G' is G with an augmented start production S' → S
● State i is constructed using set Ii
– [A → α•aβ] ∈ Ii and GOTO(Ii, a) = Ij ⇒ ACTION[i, a] ← “shift j”, ∀a ≠ $
– [A → α•] ∈ Ii, A ≠ S' ⇒ ACTION[i, a] ← “reduce A → α”, ∀a ∈ FOLLOW(A)
  (The restriction to FOLLOW(A) is the only addition w.r.t. the LR(0) algorithm!)
– [S' → S•] ∈ Ii ⇒ ACTION[i, $] ← “accept”
● GOTO(Ii, A) = Ij ⇒ GOTO[i, A] ← j
● Set undefined entries in ACTION and GOTO to “error”
● Initial state of parser s0 is CLOSURE([S' → •S])
SLR Parsing Table
State  id        +         *         (         )         $         E  T  F
0      s5                            s4                            1  2  3
1                s6                                      accept
2                r(E→T)    s7                  r(E→T)    r(E→T)
3                r(T→F)    r(T→F)              r(T→F)    r(T→F)
4      s5                            s4                            8  2  3
5                r(F→id)   r(F→id)             r(F→id)   r(F→id)
6      s5                            s4                               9  3
7      s5                            s4                                  10
8                s6                            s11
9                r(E→E+T)  s7                  r(E→E+T)  r(E→E+T)
10               r(T→T*F)  r(T→T*F)            r(T→T*F)  r(T→T*F)
11               r(F→(E))  r(F→(E))            r(F→(E))  r(F→(E))
FOLLOW(E) = {+, ), $}
FOLLOW(T) = {+, *, ), $}
FOLLOW(F) = {+, *, ), $}
Grammar:
E' → E
E → E + T | T
T → T * F | F
F → (E) | id
SLR Parsing Example
Grammar:
E' → E
E → E + T | T
T → T * F | F
F → (E) | id
● Parse for id * id:
Stack       Symbols      Input        Action
0           $            id * id $    Shift to 5
0 5         $ id         * id $       Reduce by F → id
0 3         $ F          * id $       Reduce by T → F
0 2         $ T          * id $       Shift to 7
0 2 7       $ T *        id $         Shift to 5
0 2 7 5     $ T * id     $            Reduce by F → id
0 2 7 10    $ T * F      $            Reduce by T → T * F
0 2         $ T          $            Reduce by E → T
0 1         $ E          $            Accept
Shift si: Push the current symbol and state si, and move the input pointer.
Reduce A → α: Pop |α| symbols and states, then push A and the state GOTO[t, A], where t is the state now on top of the stack.
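The shift and reduce rules can be sketched as a table-driven loop. This is a minimal sketch: the ACTION and GOTO entries are transcribed from the SLR parsing table of the expression grammar, while the production numbering in PRODS, the string encoding of actions ("s5", "r6", "acc") and the function name parse are assumptions made for illustration. Only the state stack is kept; the Symbols column of the trace is not needed to drive the parse.

```python
# Assumed numbering: 1: E→E+T, 2: E→T, 3: T→T*F, 4: T→F, 5: F→(E), 6: F→id.
# Each entry is (head, length of body).
PRODS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3), 4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}

ACTION = {
    (0, "id"): "s5", (0, "("): "s4",
    (1, "+"): "s6", (1, "$"): "acc",
    (2, "*"): "s7",
    (4, "id"): "s5", (4, "("): "s4",
    (6, "id"): "s5", (6, "("): "s4",
    (7, "id"): "s5", (7, "("): "s4",
    (8, "+"): "s6", (8, ")"): "s11",
    (9, "*"): "s7",
}
# Reduce entries: state -> (production number, lookaheads taken from FOLLOW).
REDUCE = {2: (2, "+)$"), 3: (4, "+*)$"), 5: (6, "+*)$"),
          9: (1, "+)$"), 10: (3, "+*)$"), 11: (5, "+*)$")}
for state, (prod, lookaheads) in REDUCE.items():
    for a in lookaheads:
        ACTION[(state, a)] = f"r{prod}"

GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3, (4, "E"): 8, (4, "T"): 2,
        (4, "F"): 3, (6, "T"): 9, (6, "F"): 3, (7, "F"): 10}


def parse(tokens):
    """Return the list of actions taken; raise SyntaxError on an error entry."""
    stack = [0]
    tokens = list(tokens) + ["$"]
    pos = 0
    trace = []
    while True:
        act = ACTION.get((stack[-1], tokens[pos]))
        if act is None:
            raise SyntaxError(f"unexpected {tokens[pos]!r} in state {stack[-1]}")
        if act == "acc":
            trace.append("accept")
            return trace
        if act[0] == "s":                      # shift: push state, advance input
            stack.append(int(act[1:]))
            pos += 1
            trace.append(f"shift {act[1:]}")
        else:                                  # reduce: pop |body| states, then GOTO
            head, size = PRODS[int(act[1:])]
            del stack[len(stack) - size:]
            stack.append(GOTO[(stack[-1], head)])
            trace.append(f"reduce {act[1:]}")


print(parse(["id", "*", "id"]))
```

For id * id the actions follow the same order as the trace table: shift, two reductions, two shifts, three reductions, accept.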
A grammar that is not SLR
Grammar:
S' → S
S → L = R | R
L → *R | id
R → L
LR(0) item sets:
I0: S' → •S; S → •L = R; S → •R; L → •*R; L → •id; R → •L
I1: S' → S•
I2: S → L• = R; R → L•
I3: S → R•
I4: L → id•
I5: L → *•R; R → •L; L → •*R; L → •id
I6: S → L = •R; R → •L; L → •*R; L → •id
I7: L → *R•
I8: R → L•
I9: S → L = R•
● Consider I2 on ‘=’:
– Shift to I6
– Reduce using R → L (as = is in FOLLOW(R); how?)
– Conflict in the parsing table implies the grammar is not SLR(1)
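One way to check that = is in FOLLOW(R) is to compute the FOLLOW sets by fixed-point iteration. The sketch below assumes a grammar without ε-productions (true for this grammar), so FIRST of a string is FIRST of its first symbol; the names GRAMMAR, first_sets and follow_sets are hypothetical.

```python
GRAMMAR = {
    "S'": [("S",)],
    "S": [("L", "=", "R"), ("R",)],
    "L": [("*", "R"), ("id",)],
    "R": [("L",)],
}


def first_sets(grammar):
    """FIRST sets for a grammar with no ε-productions."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                x = body[0]
                new = first[x] if x in grammar else {x}
                if not new <= first[head]:
                    first[head] |= new
                    changed = True
    return first


def follow_sets(grammar, start):
    """FOLLOW sets for a grammar with no ε-productions."""
    first = first_sets(grammar)
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, x in enumerate(body):
                    if x not in grammar:
                        continue
                    if i + 1 < len(body):      # something follows x in the body
                        nxt = body[i + 1]
                        new = first[nxt] if nxt in grammar else {nxt}
                    else:                      # x ends the body: inherit FOLLOW(head)
                        new = follow[head]
                    if not new <= follow[x]:
                        follow[x] |= new
                        changed = True
    return follow


FOLLOW = follow_sets(GRAMMAR, "S'")
print(FOLLOW["L"], FOLLOW["R"])
```

The chain is: S → L = R puts = into FOLLOW(L), and L → *R makes FOLLOW(R) include FOLLOW(L), so = ends up in FOLLOW(R) even though no sentential form has R immediately followed by = when the R came from reducing the leading L.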
LR(1) Items
● Recall LR(k) items definition?
– An LR(k) item is a pair [α, β], where
  ● α is a production with a • at some position in the RHS, marking how much of the RHS has been seen
  ● β is a lookahead string containing k symbols (terminals or $)
● LR(1) items look like [A → X • YZ, a]
CLOSURE1 and GOTO1
function CLOSURE1(I)
  repeat
    if [A → α • Bβ, a] ∈ I
      add [B → •γ, b] to I, where b ∈ FIRST(βa)
  until no more items can be added to I
  return I
function GOTO1(I, X)
  Let J be the set of items [A → αX•β, a] such that [A → α•Xβ, a] ∈ I
  return CLOSURE1(J)
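CLOSURE1 and GOTO1 can be sketched in the same style as their LR(0) counterparts. The sketch assumes items encoded as (head, body, dot, lookahead) tuples and uses the grammar S' → S, S → C C, C → c C | d; since that grammar has no ε-productions, FIRST(βa) only needs to look at the first symbol of βa. The names GRAMMAR, first_of, closure1 and goto1 are hypothetical.

```python
GRAMMAR = {
    "S'": [("S",)],
    "S": [("C", "C")],
    "C": [("c", "C"), ("d",)],
}


def first_of(symbols):
    """FIRST of a non-empty symbol string, assuming no ε-productions."""
    x = symbols[0]
    if x not in GRAMMAR:
        return {x}
    result, seen, stack = set(), set(), [x]
    while stack:
        nt = stack.pop()
        if nt in seen:
            continue
        seen.add(nt)
        for body in GRAMMAR[nt]:
            if body[0] in GRAMMAR:
                stack.append(body[0])
            else:
                result.add(body[0])
    return result


def closure1(items):
    """LR(1) closure of a set of (head, body, dot, lookahead) items."""
    result = set(items)
    worklist = list(items)
    while worklist:
        head, body, dot, a = worklist.pop()
        if dot < len(body) and body[dot] in GRAMMAR:
            beta_a = body[dot + 1:] + (a,)     # the string βa from the pseudocode
            for b in first_of(beta_a):
                for gamma in GRAMMAR[body[dot]]:
                    item = (body[dot], gamma, 0, b)
                    if item not in result:
                        result.add(item)
                        worklist.append(item)
    return result


def goto1(items, x):
    """Move the dot over x, keeping each lookahead, then close."""
    moved = {(h, b, d + 1, a) for (h, b, d, a) in items if d < len(b) and b[d] == x}
    return closure1(moved)


I0 = closure1({("S'", ("S",), 0, "$")})
print(sorted(I0))
```

Each lookahead is stored as a separate item here, so an item written with lookahead c/d on the slides corresponds to two tuples in this encoding.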
LR(1) Automaton
Grammar:
S' → S
S → C C
C → c C | d
States:
I0: [S' → •S, $]; [S → •CC, $]; [C → •cC, c/d]; [C → •d, c/d]
I1: [S' → S•, $]
I2: [S → C•C, $]; [C → •cC, $]; [C → •d, $]
I3: [C → c•C, c/d]; [C → •cC, c/d]; [C → •d, c/d]
I4: [C → d•, c/d]
I5: [S → CC•, $]
I6: [C → c•C, $]; [C → •cC, $]; [C → •d, $]
I7: [C → d•, $]
I8: [C → cC•, c/d]
I9: [C → cC•, $]
Transitions:
I0: on S go to I1; on C go to I2; on c go to I3; on d go to I4
I1: on $ accept
I2: on C go to I5; on c go to I6; on d go to I7
I3: on C go to I8; on c go to I3; on d go to I4
I6: on C go to I9; on c go to I6; on d go to I7
Same LR(0) items, but different LR(1) items: compare I3 and I6, I4 and I7, I8 and I9.
LR(1) or Canonical LR (CLR) Parsing Table
State  c    d    $       S  C
0      s3   s4           1  2
1                accept
2      s6   s7              5
3      s3   s4              8
4      r3   r3
5                r1
6      s6   s7              9
7                r3
8      r2   r2
9                r2
Productions:
0: S' → S
1: S → C C
2: C → c C
3: C → d
Homework: Construct the LR(1) parser for our non-SLR grammar and verify that there is no shift-reduce conflict.
LookAhead LR (LALR) Parsing
● LR(1) parsers have too many states compared to SLR parsers.
– For C, SLR would have a few hundred states
– For C, LR(1) would have a few thousand states
● How about merging states with the same LR(0) items (aka core)?
– Result: We get LALR parsers!
● A bit of history:
– Knuth invented LR in 1965, but it was considered impractical due to memory requirements.
– Frank DeRemer invented SLR and LALR in 1969 (LALR as part of his PhD thesis).
LALR(1) Automaton
Grammar:
S' → S
S → C C
C → c C | d
Original LR(1) states:
I0: [S' → •S, $]; [S → •CC, $]; [C → •cC, c/d]; [C → •d, c/d]
I1: [S' → S•, $]
I2: [S → C•C, $]; [C → •cC, $]; [C → •d, $]
I3: [C → c•C, c/d]; [C → •cC, c/d]; [C → •d, c/d]
I4: [C → d•, c/d]
I5: [S → CC•, $]
I6: [C → c•C, $]; [C → •cC, $]; [C → •d, $]
I7: [C → d•, $]
I8: [C → cC•, c/d]
I9: [C → cC•, $]
Merged states for LALR(1):
I36: [C → c•C, c/d/$]; [C → •cC, c/d/$]; [C → •d, c/d/$]
I47: [C → d•, c/d/$]
I89: [C → cC•, c/d/$]
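Merging states with the same core can be sketched as grouping LR(1) states by their LR(0) items and taking the union of the lookaheads. The sketch below hard-codes the ten LR(1) states of the grammar S' → S, S → C C, C → c C | d; the single-character encoding of symbols and the helper names items, core and merge_by_core are assumptions for illustration. A real LALR generator would also redirect the transitions, which is omitted here.

```python
from collections import defaultdict


def items(*specs):
    """Expand (head, body, dot, 'a/b') specs into one item per lookahead."""
    return frozenset((h, b, d, a) for h, b, d, las in specs for a in las.split("/"))


STATES = {
    0: items(("S'", "S", 0, "$"), ("S", "CC", 0, "$"),
             ("C", "cC", 0, "c/d"), ("C", "d", 0, "c/d")),
    1: items(("S'", "S", 1, "$")),
    2: items(("S", "CC", 1, "$"), ("C", "cC", 0, "$"), ("C", "d", 0, "$")),
    3: items(("C", "cC", 1, "c/d"), ("C", "cC", 0, "c/d"), ("C", "d", 0, "c/d")),
    4: items(("C", "d", 1, "c/d")),
    5: items(("S", "CC", 2, "$")),
    6: items(("C", "cC", 1, "$"), ("C", "cC", 0, "$"), ("C", "d", 0, "$")),
    7: items(("C", "d", 1, "$")),
    8: items(("C", "cC", 2, "c/d")),
    9: items(("C", "cC", 2, "$")),
}


def core(state):
    """The LR(0) part of a state: its items with the lookaheads dropped."""
    return frozenset((h, b, d) for h, b, d, _ in state)


def merge_by_core(states):
    """Union all states sharing a core; name the result by the joined state numbers."""
    groups = defaultdict(list)
    for number, state in states.items():
        groups[core(state)].append(number)
    merged = {}
    for numbers in groups.values():
        name = "".join(str(n) for n in sorted(numbers))
        merged[name] = frozenset().union(*(states[n] for n in numbers))
    return merged


MERGED = merge_by_core(STATES)
print(sorted(MERGED))
```

With this input, the ten LR(1) states collapse into seven, and the three merged states carry the lookaheads c, d and $.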
LALR(1) Parsing Table
State  c    d    $       S  C
0      s36  s47          1  2
1                accept
2      s36  s47             5
36     s36  s47             89
47     r3   r3   r3
5                r1
89     r2   r2   r2
Productions:
0: S' → S
1: S → C C
2: C → c C
3: C → d
A few notes in passing
● LALR parsing tables are smaller than the corresponding LR(1) parsing tables.
● LALR parsers mimic LR parsers on correct inputs.
● On erroneous inputs, LALR may proceed with reductions while LR might have declared an error.
– However, eventually, LALR is guaranteed to report the error.
● Merging sets for LALR never generates SR conflicts, but can generate RR conflicts.
– Further reading: Section 4.7.4.
● Difference between SLR and LALR?
– Both have same LR(0) item sets!
– Difference lies in the lookahead.
  ● The lookaheads in LALR can be proved to be a subset of the FOLLOW sets in SLR.
Using ambiguous grammars
● Ambiguous grammars should be used sparingly.
● However, they can sometimes feel more natural to write; e.g.:
E → E + E | E * E | id
versus
E → E + T | T
T → T * F | F
F → id
● Sometimes easier to resolve a resulting conflict by hard-coding:
– Higher priority to shift or reduce
– Higher priority to a certain reduce
● However, it is an ad-hoc way and is better avoided.
Error handling in parsers
● Ignore till a synchronizing token (such as } or ;):
– Pop the stack
– Discard input symbols
– Resume parsing
● Attach semantic error actions to grammar rules
– Add tokens based on what is missing (e.g., closing parenthesis)
● Programmer-specified substitutions
– %change directive in some parser specifications
● Global error recovery
– Again more of theoretical interest
The Big Grammatical Picture
Clicked from “Modern Compiler Implementation in Java” by Andrew W. Appel.