Syntax Analysis (Cont.) Manas Thakur

CS502: Compiler Design

Syntax Analysis (Cont.)

Manas Thakur

Fall 2020


What next?

● Bottom-up parsing

● Why?

– BU parsers are more powerful than TD parsers

  ● Cover more kinds of grammars (e.g., no need to eliminate left recursion)

– More efficient as well

● Bad news: Slightly more complicated

● Good news: Well known parser generators exist

Seems like winter (read: parsing) would never end!


Bottom-Up Parsing

● Given a string, construct a parse tree by starting at the leaves and walking up to the root.

● The process is called reduction.

– Reduce a string w to the start symbol of the grammar.

– Recall derivation from top-down parsing?


Reduction

● At each reduction step, a specific substring matching the body of a production is replaced by the non-terminal at the head of the production.

● Basically we are constructing a rightmost derivation in reverse!

● How to decide which substring to reduce?

Grammar:

E → E + T | T
T → T * F | F
F → id

Reduction steps for id * id (the parse tree grows from the leaves up to the root):

Sentential form    Reduce by
id * id            F → id
F * id             T → F
T * id             F → id
T * F              T → T * F
T                  E → T
E
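The reduction steps for id * id can be replayed in a few lines of Python. This is only an illustrative sketch: the helper `reduce_at` and the hand-picked handle positions in `STEPS` are assumptions made for this example, not part of the lecture. Reading the printed forms from bottom to top gives the rightmost derivation E ⇒ T ⇒ T * F ⇒ T * id ⇒ F * id ⇒ id * id.

```python
def reduce_at(form, pos, head, body):
    """Replace the handle `body` found at index `pos` of `form` by `head`."""
    assert form[pos:pos + len(body)] == body, "handle does not match"
    return form[:pos] + [head] + form[pos + len(body):]

# (position, head, body) of each reduction; positions are chosen by hand here,
# a real parser would discover them (that is the handle-finding problem).
STEPS = [
    (0, "F", ["id"]),
    (0, "T", ["F"]),
    (2, "F", ["id"]),
    (0, "T", ["T", "*", "F"]),
    (0, "E", ["T"]),
]

def reduction_sequence(tokens):
    """Return every sentential form visited while reducing `tokens`."""
    forms = [list(tokens)]
    for pos, head, body in STEPS:
        forms.append(reduce_at(forms[-1], pos, head, body))
    return forms

forms = reduction_sequence(["id", "*", "id"])
for form in forms:
    print(" ".join(form))
```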


Handle pruning

● A handle is a substring that matches the body of a production, and whose reduction represents one step along the reverse of a rightmost derivation.

● Theorem: If G is unambiguous, then every right-sentential form has a unique handle.

● Why did we say “a handle” instead of “the handle”? Because with an ambiguous grammar, a right-sentential form may have more than one handle.

● BU parsing is essentially the problem of handle pruning.

Right Sentential Form   Handle   Reducing Production
id1 * id2               id1      F → id
F * id2                 F        T → F
T * id2                 id2      F → id
T * F                   T * F    T → T * F
T                       T        E → T


Shift-Reduce Parsing

● Uses a stack to perform bottom-up parsing

● Four actions:

– Shift: shift the next input symbol on top of stack

– Reduce: pop handle off the stack and push the corresponding non-terminal

– Accept: parsing successful

– Error: parsing failed

● The standard scheme used by LR parsers: L for left-to-right scanning of the input, R for a rightmost derivation (constructed in reverse).


LR Parsing Example

● A table guides the actions, based on the top of the stack and the next input symbol.

Stack        Input          Action
$            id1 * id2 $    shift
$ id1        * id2 $        reduce by F → id
$ F          * id2 $        reduce by T → F
$ T          * id2 $        shift
$ T *        id2 $          shift
$ T * id2    $              reduce by F → id
$ T * F      $              reduce by T → T * F
$ T          $              reduce by E → T
$ E          $              accept

The job of every LR table-construction algorithm is to construct this “action” table.


LR Parsing Algorithms

● Simple LR or SLR

– Smallest class of grammars

– Smallest tables

– Simple, fast construction

● Canonical LR or CLR

– Largest set of grammars

– Largest tables

– Slow construction

● LookAhead LR or LALR

– Intermediate set of grammars

– Same number of states as SLR

– Faster construction than CLR


LR(k) Items

● An LR(k) item is a pair [α, β], where

– α is a production with a • at some position in the RHS, marking how much of the RHS has been seen

– β is a lookahead string containing k symbols (terminals or $)

● Two cases of interest:

– LR(0) items for SLR table construction

– LR(1) items for CLR and LALR table construction


Example of LR(0) items

● A → XYZ generates four LR(0) items:

– [A → •XYZ]

– [A → X•YZ]

– [A → XY•Z]

– [A → XYZ•]

● [A → •XYZ] indicates that the parser is looking for a string that can be derived from XYZ

● [A → XY•Z] indicates that the parser has seen a string derived from XY and is looking for one derivable from Z
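The four items of A → XYZ can be enumerated mechanically: one item per dot position. Below is a minimal sketch; the triple representation `(head, body, dot)` and the helper names `lr0_items` and `show` are choices made for this example only, and `show` assumes single-character grammar symbols.

```python
def lr0_items(head, body):
    """Return every LR(0) item of head -> body as (head, body, dot) triples."""
    return [(head, tuple(body), dot) for dot in range(len(body) + 1)]

def show(item):
    """Render an item in the slide's notation, e.g. [A → XY•Z]."""
    head, body, dot = item
    return f"[{head} → {''.join(body[:dot])}•{''.join(body[dot:])}]"

items = lr0_items("A", ["X", "Y", "Z"])
for item in items:
    print(show(item))
```

A production with n symbols in its body always yields n + 1 items.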


CLOSURE

● Given an item [A → α • Bβ ], its closure contains the item and any other items that can generate legal substrings to follow α.

function CLOSURE(I)
    repeat
        if [A → α • Bβ] ∈ I
            add [B → •γ] to I
    until no more items can be added to I
    return I

Grammar G’ with an augmented production:

E’ → E
E → E+T | T
T → T*F | F
F → (E) | id

I = {[E’ → •E]}

I0 = CLOSURE(I):

E’ → •E
E → •E+T
E → •T
T → •T*F
T → •F
F → •(E)
F → •id
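The CLOSURE pseudocode translates almost line by line into Python. The sketch below assumes a dictionary encoding of the augmented expression grammar and represents an item as a `(head, body, dot)` triple; both are conventions chosen for this example, not something the slides prescribe.

```python
# Augmented expression grammar G' (non-terminals are the dictionary keys).
GRAMMAR = {
    "E'": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("id",)],
}

def closure(items):
    """LR(0) closure of a set of (head, body, dot) items."""
    result = set(items)
    changed = True
    while changed:                      # repeat ... until nothing is added
        changed = False
        for head, body, dot in list(result):
            # The dot must sit in front of a non-terminal B.
            if dot < len(body) and body[dot] in GRAMMAR:
                for gamma in GRAMMAR[body[dot]]:
                    item = (body[dot], gamma, 0)   # [B → •γ]
                    if item not in result:
                        result.add(item)
                        changed = True
    return frozenset(result)

I0 = closure({("E'", ("E",), 0)})
for head, body, dot in sorted(I0):
    print(head, "→", " ".join(body[:dot]), "•", " ".join(body[dot:]))
```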


GOTO

● Let I be a set of LR(0) items and X be a grammar symbol. Then, GOTO(I, X) is the closure of the set of all items

– [A → αX•β] such that [A → α•Xβ] ∈ I

Grammar:

E’ → E
E → E+T | T
T → T*F | F
F → (E) | id

I0:

E’ → •E
E → •E+T
E → •T
T → •T*F
T → •F
F → •(E)
F → •id

GOTO(I0, E) = I1:

E’ → E•
E → E•+T

Classwork: Construct GOTO(I1, +).
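GOTO is "advance the dot over X, then close". The sketch below reproduces GOTO(I0, E) = I1 under the same assumed encoding as before (grammar as a dictionary, items as `(head, body, dot)` triples); the function names are illustrative.

```python
GRAMMAR = {
    "E'": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("id",)],
}

def closure(items):
    """LR(0) closure of a set of (head, body, dot) items."""
    result, work = set(items), list(items)
    while work:
        _, body, dot = work.pop()
        if dot < len(body) and body[dot] in GRAMMAR:
            for gamma in GRAMMAR[body[dot]]:
                item = (body[dot], gamma, 0)
                if item not in result:
                    result.add(item)
                    work.append(item)
    return frozenset(result)

def goto(items, symbol):
    """Move the dot over `symbol` in every item that allows it, then close."""
    moved = {(h, b, d + 1) for h, b, d in items if d < len(b) and b[d] == symbol}
    return closure(moved)

I0 = closure({("E'", ("E",), 0)})
I1 = goto(I0, "E")
print(sorted(I1))
```

Both items of I1 have the dot either at the end or in front of a terminal, so the closure step adds nothing here.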


LR(0) Automaton

E’ → E
E → E+T | T
T → T*F | F
F → (E) | id

States (items within a state are separated by ";"):

I0:  E' → . E ; E → . E + T ; E → . T ; T → . T * F ; T → . F ; F → . (E) ; F → . id
I1:  E' → E . ; E → E . + T
I2:  E → T . ; T → T . * F
I3:  T → F .
I4:  F → ( . E ) ; E → . E + T ; E → . T ; T → . T * F ; T → . F ; F → . (E) ; F → . id
I5:  F → id .
I6:  E → E + . T ; T → . T * F ; T → . F ; F → . (E) ; F → . id
I7:  T → T * . F ; F → . (E) ; F → . id
I8:  E → E . + T ; F → ( E . )
I9:  E → E + T . ; T → T . * F
I10: T → T * F .
I11: F → ( E ) .

Transitions:

From I0: E to I1, T to I2, F to I3, ( to I4, id to I5
From I1: + to I6, $ to accept
From I2: * to I7
From I4: E to I8, T to I2, F to I3, ( to I4, id to I5
From I6: T to I9, F to I3, ( to I4, id to I5
From I7: F to I10, ( to I4, id to I5
From I8: + to I6, ) to I11
From I9: * to I7
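The whole automaton can be generated by repeatedly applying GOTO to every state until no new item set appears. The following is a sketch under the same assumed encoding (dictionary grammar, `(head, body, dot)` items). States are numbered in discovery order, so the numbers need not match I0 to I11 on the slide, and the accept transition on $ is not modelled.

```python
GRAMMAR = {
    "E'": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("id",)],
}

def closure(items):
    result, work = set(items), list(items)
    while work:
        _, body, dot = work.pop()
        if dot < len(body) and body[dot] in GRAMMAR:
            for gamma in GRAMMAR[body[dot]]:
                item = (body[dot], gamma, 0)
                if item not in result:
                    result.add(item)
                    work.append(item)
    return frozenset(result)

def goto(items, symbol):
    moved = {(h, b, d + 1) for h, b, d in items if d < len(b) and b[d] == symbol}
    return closure(moved)

def canonical_collection(start_item):
    """Return (states, transitions) of the LR(0) automaton."""
    states = [closure({start_item})]
    transitions = {}
    for state in states:                 # the list grows while we iterate
        symbols = {b[d] for _, b, d in state if d < len(b)}
        for x in sorted(symbols):
            target = goto(state, x)
            if target not in states:
                states.append(target)
            transitions[(states.index(state), x)] = states.index(target)
    return states, transitions

states, transitions = canonical_collection(("E'", ("E",), 0))
print(len(states), "states,", len(transitions), "transitions")
```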





Before we reach SLR

● We can build a parser simpler than SLR, using only the LR(0) item sets, for the following grammar:

E’ → E
E → E+T | T
T → (E) | id

States (items within a state are separated by ";"):

I0: E' → . E ; E → . E + T ; E → . T ; T → . (E) ; T → . id
I1: E' → E . ; E → E . + T
I2: E → T .
I3: T → id .
I4: T → ( . E ) ; E → . E + T ; E → . T ; T → . (E) ; T → . id
I5: E → E + . T ; T → . (E) ; T → . id
I6: E → E . + T ; T → ( E . )
I7: E → E + T .
I8: T → ( E ) .

Transitions:

From I0: E to I1, T to I2, id to I3, ( to I4
From I1: + to I5, $ to accept
From I4: E to I6, T to I2, id to I3, ( to I4
From I5: T to I7, id to I3, ( to I4
From I6: + to I5, ) to I8


Constructing LR(0) parsing table

● Construct the LR(0) item sets for G’

– G’ is G with an augmented start production S’ → S

● State i is constructed using set Ii

– [A → α•aβ] ∈ Ii and GOTO(Ii, a) = Ij ⇒ ACTION[i, a] ← “shift j”, ∀a != $

– [A → α•] ∈ Ii, A != S’ ⇒ ACTION[i, a] ← “reduce A → α”, ∀a

– [S’ → S•] ∈ Ii ⇒ ACTION[i, $] ← “accept”

● GOTO(Ii, A) = Ij ⇒ GOTO[i, A] ← j

● Set undefined entries in ACTION and GOTO to “error”

● Initial state of parser is CLOSURE([S’ → •S])


LR(0) Parsing Table

State  id         +          (          )          $          E   T
0      s3                    s4                               1   2
1                 s5                               accept
2      r(E→T)     r(E→T)     r(E→T)     r(E→T)     r(E→T)
3      r(T→id)    r(T→id)    r(T→id)    r(T→id)    r(T→id)
4      s3                    s4                               6   2
5      s3                    s4                                   7
6                 s5                    s8
7      r(E→E+T)   r(E→E+T)   r(E→E+T)   r(E→E+T)   r(E→E+T)
8      r(T→(E))   r(T→(E))   r(T→(E))   r(T→(E))   r(T→(E))

E' → E
E → E + T | T
T → (E) | id


Need for more powerful LR parsers

● LR(0) is too simple to cover many grammars.

● Doesn’t cover even our expression grammar:

E’ → E
E → E+T | T
T → T*F | F
F → (E) | id

● Recall the giant automaton:

– e.g.: s7 or r(E→T) on (I2, *)

– Called a shift-reduce conflict

– Similarly we can have reduce-reduce conflicts

– Multiply defined entries imply the grammar is not LR(0)

● Further reading: Section 4.5.4 (DB)

● Reason: LR(0) automata do not know on what next symbol to reduce, and end up adding too many reduce actions conservatively.


Constructing SLR parsing table

● Construct the LR(0) item sets for G’

– G’ is G with an augmented start production S’ → S

● State i is constructed using set Ii

– [A → α•aβ] ∈ Ii and GOTO(Ii, a) = Ij ⇒ ACTION[i, a] ← “shift j”, ∀a != $

– [A → α•] ∈ Ii, A != S’ ⇒ ACTION[i, a] ← “reduce A → α”, ∀a ∈ FOLLOW(A)

  (The restriction to FOLLOW(A) is the only addition w.r.t. the LR(0) algorithm!)

– [S’ → S•] ∈ Ii ⇒ ACTION[i, $] ← “accept”

● GOTO(Ii, A) = Ij ⇒ GOTO[i, A] ← j

● Set undefined entries in ACTION and GOTO to “error”

● Initial state of parser s0 is CLOSURE([S’ → •S])


SLR Parsing Table

State  id    +          *          (     )          $          E   T   F
0      s5                          s4                          1   2   3
1            s6                                     accept
2            r(E→T)     s7               r(E→T)     r(E→T)
3            r(T→F)     r(T→F)           r(T→F)     r(T→F)
4      s5                          s4                          8   2   3
5            r(F→id)    r(F→id)          r(F→id)    r(F→id)
6      s5                          s4                              9   3
7      s5                          s4                                  10
8            s6                          s11
9            r(E→E+T)   s7               r(E→E+T)   r(E→E+T)
10           r(T→T*F)   r(T→T*F)         r(T→T*F)   r(T→T*F)
11           r(F→(E))   r(F→(E))         r(F→(E))   r(F→(E))

FOLLOW(E) = {+, ), $}
FOLLOW(T) = {+, *, ), $}
FOLLOW(F) = {+, *, ), $}

E' → E
E → E + T | T
T → T * F | F
F → (E) | id
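The FOLLOW sets that restrict the reduce entries can be computed by a small fixed-point iteration. The sketch below assumes a grammar without ε-productions (true for the expression grammar), which keeps FIRST simple; the dictionary encoding and function names are choices made for this example.

```python
GRAMMAR = {
    "E'": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("id",)],
}

def first_sets(grammar):
    """FIRST of every non-terminal; assumes there are no ε-productions."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                x = body[0]
                new = first[x] if x in grammar else {x}
                if not new <= first[head]:
                    first[head] |= new
                    changed = True
    return first

def follow_sets(grammar, start):
    """FOLLOW of every non-terminal, with $ marking the end of input."""
    first = first_sets(grammar)
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, x in enumerate(body):
                    if x not in grammar:
                        continue                      # terminals have no FOLLOW
                    if i + 1 < len(body):             # A → α X Y β
                        nxt = body[i + 1]
                        new = first[nxt] if nxt in grammar else {nxt}
                    else:                             # A → α X
                        new = follow[head]
                    if not new <= follow[x]:
                        follow[x] |= new
                        changed = True
    return follow

follow = follow_sets(GRAMMAR, "E'")
for nt in ("E", "T", "F"):
    print(f"FOLLOW({nt}) =", sorted(follow[nt]))
```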


SLR Parsing Example

● Parse for id*id:

Stack       Symbols      Input       Action
0           $            id * id $   Shift to 5
0 5         $ id         * id $      Reduce by F → id
0 3         $ F          * id $      Reduce by T → F
0 2         $ T          * id $      Shift to 7
0 2 7       $ T *        id $        Shift to 5
0 2 7 5     $ T * id     $           Reduce by F → id
0 2 7 10    $ T * F      $           Reduce by T → T * F
0 2         $ T          $           Reduce by E → T
0 1         $ E          $           Accept

E' → E
E → E + T | T
T → T * F | F
F → (E) | id

Shift si: Push the current symbol and state si, and move the input pointer.

Reduce A → α: Pop |α| symbols and states, then push A and the state GOTO[s, A], where s is the state now on top of the stack.
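The shift and reduce rules above fit in a short table-driven loop. The sketch below hard-codes the SLR table for the expression grammar; the tuple encoding of actions, the helper `r`, and the choice to keep only states on the stack (the symbols column is implied by them) are assumptions of this example.

```python
def r(head, n, terminals="+ * ) $"):
    """Row fragment: reduce to `head`, popping `n` states, on the given terminals."""
    return {t: ("r", head, n) for t in terminals.split()}

SHIFT_ID_LP = {"id": ("s", 5), "(": ("s", 4)}
ACTION = {
    0: SHIFT_ID_LP,
    1: {"+": ("s", 6), "$": ("acc",)},
    2: {**r("E", 1, "+ ) $"), "*": ("s", 7)},
    3: r("T", 1),
    4: SHIFT_ID_LP,
    5: r("F", 1),
    6: SHIFT_ID_LP,
    7: SHIFT_ID_LP,
    8: {"+": ("s", 6), ")": ("s", 11)},
    9: {**r("E", 3, "+ ) $"), "*": ("s", 7)},
    10: r("T", 3),
    11: r("F", 3),
}
GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3, (4, "E"): 8, (4, "T"): 2,
        (4, "F"): 3, (6, "T"): 9, (6, "F"): 3, (7, "F"): 10}

def parse(tokens):
    """Return the reductions performed, as (head, body length) pairs."""
    stack, reductions = [0], []
    tokens = list(tokens) + ["$"]
    pos = 0
    while True:
        action = ACTION[stack[-1]].get(tokens[pos])
        if action is None:                       # empty table entry = error
            raise SyntaxError(f"unexpected {tokens[pos]!r} in state {stack[-1]}")
        if action[0] == "s":                     # shift: push state, advance
            stack.append(action[1])
            pos += 1
        elif action[0] == "r":                   # reduce: pop |α| states, GOTO on A
            _, head, n = action
            del stack[-n:]
            stack.append(GOTO[(stack[-1], head)])
            reductions.append((head, n))
        else:                                    # accept
            return reductions

print(parse(["id", "*", "id"]))
```

For id * id the reductions come out in the same order as in the trace: F → id, T → F, F → id, T → T * F, E → T.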


A grammar that is not SLR

S' → S
S → L = R | R
L → *R | id
R → L

States (items within a state are separated by ";"):

I0: S' → . S ; S → . L = R ; S → . R ; L → . *R ; L → . id ; R → . L
I1: S' → S .
I2: S → L . = R ; R → L .
I3: S → R .
I4: L → id .
I5: L → * . R ; R → . L ; L → . *R ; L → . id
I6: S → L = . R ; R → . L ; L → . *R ; L → . id
I7: L → *R .
I8: R → L .
I9: S → L = R .

● Consider I2 on ‘=’:

– Shift to I6

– Reduce using R → L (as = is in FOLLOW(R); how?)

– Conflict in the parsing table implies the grammar is not SLR(1)
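One way to answer the "how?" above is to apply the usual FOLLOW rules to two productions of the grammar. The derivation in the last line is just one witness, chosen for illustration, in which = stands immediately after R:

```latex
\begin{align*}
& S \to L = R \;\Longrightarrow\; {=} \in \mathrm{FOLLOW}(L) \\
& L \to {*}R \;\Longrightarrow\; \mathrm{FOLLOW}(L) \subseteq \mathrm{FOLLOW}(R) \\
& \text{hence } {=} \in \mathrm{FOLLOW}(R), \text{ witnessed by }
  S' \Rightarrow S \Rightarrow L = R \Rightarrow {*}R = R .
\end{align*}
```

Yet in state I2 the parser has seen only L at the start of the input, and no sentential form begins with R followed by =, so the reduction SLR allows there can never lead to a successful parse.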





LR(1) Items

● Recall LR(k) items definition?

– An LR(k) item is a pair [α, β], where

  ● α is a production with a • at some position in the RHS, marking how much of the RHS has been seen

  ● β is a lookahead string containing k symbols (terminals or $)

● LR(1) items look like [A → X • YZ, a]


CLOSURE1 and GOTO1

function CLOSURE1(I)
    repeat
        if [A → α • Bβ, a] ∈ I
            add [B → •γ, b] to I, where b ∈ FIRST(βa)
    until no more items can be added to I
    return I

function GOTO1(I, X)
    Let J be the set of items [A → αX•β, a] such that [A → α•Xβ, a] ∈ I
    return CLOSURE1(J)
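CLOSURE1 and GOTO1 can be sketched in Python much like their LR(0) counterparts, with one lookahead carried in each item. The example uses the grammar S' → S, S → CC, C → cC | d and assumes there are no ε-productions, so FIRST(βa) is simply FIRST of the first symbol of βa. The 4-tuple item encoding and the function names are choices made for this example.

```python
GRAMMAR = {"S'": [("S",)], "S": [("C", "C")], "C": [("c", "C"), ("d",)]}

def first(symbol, _seen=frozenset()):
    """FIRST of one grammar symbol; assumes no ε-productions."""
    if symbol not in GRAMMAR:
        return {symbol}                  # a terminal (or $) starts with itself
    if symbol in _seen:                  # guard against left recursion
        return set()
    result = set()
    for body in GRAMMAR[symbol]:
        result |= first(body[0], _seen | {symbol})
    return result

def closure1(items):
    """LR(1) closure of a set of (head, body, dot, lookahead) items."""
    result, work = set(items), list(items)
    while work:
        head, body, dot, a = work.pop()
        if dot < len(body) and body[dot] in GRAMMAR:
            beta_a = body[dot + 1:] + (a,)          # the string βa
            for b in first(beta_a[0]):
                for gamma in GRAMMAR[body[dot]]:
                    item = (body[dot], gamma, 0, b)  # [B → •γ, b]
                    if item not in result:
                        result.add(item)
                        work.append(item)
    return frozenset(result)

def goto1(items, x):
    moved = {(h, b, d + 1, a) for h, b, d, a in items if d < len(b) and b[d] == x}
    return closure1(moved)

I0 = closure1({("S'", ("S",), 0, "$")})
I2 = goto1(I0, "C")
print(len(I0), "items in I0;", len(I2), "items in GOTO1(I0, C)")
```

Items that the slides write with a combined lookahead such as c/d appear here as two separate items, one per lookahead symbol.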


LR(1) Automaton

S' → S
S → C C
C → c C | d

States (items within a state are separated by ";"):

I0: S' → . S, $ ; S → . CC, $ ; C → . c C, c/d ; C → . d, c/d
I1: S' → S ., $
I2: S → C . C, $ ; C → . c C, $ ; C → . d, $
I3: C → c . C, c/d ; C → . c C, c/d ; C → . d, c/d
I4: C → d ., c/d
I5: S → CC ., $
I6: C → c . C, $ ; C → . c C, $ ; C → . d, $
I7: C → d ., $
I8: C → c C ., c/d
I9: C → c C ., $

Transitions:

From I0: S to I1, C to I2, c to I3, d to I4
From I1: $ to accept
From I2: C to I5, c to I6, d to I7
From I3: C to I8, c to I3, d to I4
From I6: C to I9, c to I6, d to I7

Note: some states (e.g., I3 and I6) contain the same LR(0) items, but different LR(1) items.


LR(1) or Canonical LR (CLR) Parsing Table

State  c    d    $       S   C
0      s3   s4           1   2
1                accept
2      s6   s7               5
3      s3   s4               8
4      r3   r3
5                r1
6      s6   s7               9
7                r3
8      r2   r2
9                r2

0: S' → S
1: S → C C
2: C → c C
3: C → d

Homework: Construct the LR(1) parser for our non-SLR grammar and verify that there is no shift-reduce conflict.


LookAhead LR (LALR) Parsing

● LR(1) parsers have too many states compared to SLR parsers.

– For C, SLR would have a few hundred states

– For C, LR(1) would have a few thousand states

● How about merging states with the same LR(0) items (aka core)?

– Result: We get LALR parsers!

● A bit of history:

– Knuth invented LR in 1965, but it was considered impractical due to memory requirements.

– Frank DeRemer invented SLR and LALR in 1969 (LALR as part of his PhD thesis).


LALR(1) Automaton

S' → S
S → C C
C → c C | d

Original LR(1) states:

I0: S' → . S, $ ; S → . CC, $ ; C → . c C, c/d ; C → . d, c/d
I1: S' → S ., $
I2: S → C . C, $ ; C → . c C, $ ; C → . d, $
I3: C → c . C, c/d ; C → . c C, c/d ; C → . d, c/d
I4: C → d ., c/d
I5: S → CC ., $
I6: C → c . C, $ ; C → . c C, $ ; C → . d, $
I7: C → d ., $
I8: C → c C ., c/d
I9: C → c C ., $

Merged states for LALR(1):

I36: C → c . C, c/d/$ ; C → . c C, c/d/$ ; C → . d, c/d/$
I47: C → d ., c/d/$
I89: C → c C ., c/d/$
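Merging states with the same core amounts to grouping them by their LR(0) items and taking the union of the lookaheads. The sketch below hard-codes a few of the LR(1) states listed above; the string form of the items and the helper names are assumptions of this example. It illustrates only the merge step, not an efficient LALR construction.

```python
from collections import defaultdict

def items(cores, lookaheads):
    """Build a state as a set of (LR(0) item, lookahead) pairs."""
    return {(core, la) for core in cores for la in lookaheads}

SHIFTED_C = ["C → c•C", "C → •cC", "C → •d"]
LR1_STATES = {
    "I3": items(SHIFTED_C, "cd"),
    "I4": items(["C → d•"], "cd"),
    "I5": items(["S → CC•"], "$"),
    "I6": items(SHIFTED_C, "$"),
    "I7": items(["C → d•"], "$"),
    "I8": items(["C → cC•"], "cd"),
    "I9": items(["C → cC•"], "$"),
}

def merge_by_core(states):
    """Union all states whose LR(0) cores are identical."""
    groups = defaultdict(lambda: ([], set()))
    for name, state in states.items():
        core = frozenset(c for c, _ in state)   # drop the lookaheads
        names, merged = groups[core]
        names.append(name)
        merged |= state                         # in-place union of the items
    return {"+".join(names): merged for names, merged in groups.values()}

merged = merge_by_core(LR1_STATES)
for name, state in merged.items():
    print(name, sorted(state))
```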


LALR(1) Parsing Table

State  c     d     $       S   C
0      s36   s47           1   2
1                  accept
2      s36   s47               5
36     s36   s47               89
47     r3    r3    r3
5                  r1
89     r2    r2    r2

0: S' → S
1: S → C C
2: C → c C
3: C → d


A few notes in passing

● LALR parsing tables are smaller than the corresponding LR(1) parsing tables.

● LALR parsers mimic LR parsers on correct inputs.

● On erroneous inputs, LALR may proceed with reductions while LR might have declared an error.

– However, eventually, LALR is guaranteed to report the error.

● Merging sets for LALR never generates SR conflicts, but can generate RR conflicts.

– Further reading: Section 4.7.4.

● Difference between SLR and LALR?

– Both have same LR(0) item sets!

– Difference lies in the lookahead.

  ● The lookaheads in LALR can be proved to be a subset of the FOLLOW sets in SLR.


Using ambiguous grammars

● Ambiguous grammars should be used sparingly.

● However, they can sometimes feel more natural to write; e.g.:

E → E + E | E * E | id

versus

E → E+T | T
T → T*F | F
F → id

● Sometimes it is easier to resolve a resulting conflict by hard-coding:

– Higher priority to shift or reduce

– Higher priority to a certain reduce

● However, it is an ad-hoc way and is better avoided.


Error handling in parsers

● Ignore till a synchronizing token (such as } or ;):

– Pop the stack

– Discard input symbols

– Resume parsing

● Attach semantic error actions to grammar rules

– Add tokens based on what is missing (e.g., closing parenthesis)

● Programmer-specified substitutions

– %change directive in some parser specifications

● Global error recovery

– Again more of theoretical interest


The Big Grammatical Picture

[Figure: photograph taken from “Modern Compiler Implementation in Java” by Andrew W. Appel.]
