COMP2100/6442 Software Design Methodologies / …• Recursive descent parser –Top-down parser...

Sid Chi-Kin Chau

COMP2100/6442

Software Design Methodologies / Software Construction

Parsing

Admin

• IA was due last Friday

• Experimenting with online labs with local students

• Lab 4 will start next week

– Implementing tokenizer & parser

• Quiz 1 has been opened

– Via Wattle

– Read instructions carefully before you start!

– 10 multiple choice questions


Admin

• Lab test 1

– Carried out in your enrolled lab (if labs are not suspended)

– 10 possible questions will be announced

– You will solve 2 randomly chosen questions + 3 fixed questions

– Possibility to waive the lab test

In this lecture, we will look at

• How source code is analyzed, compiled, and executed

• Structured text

• Parsing

• Tokenization

• Grammars


Structured Text

• Information is often stored in unstructured text files

• We often input information to computer via text files

– Markup languages: HTML, JSON, XML, Markdown, …

– Programming languages: Java, C, C++, Haskell, Perl, Python, PHP, …

– Mathematical expressions, e.g. {(2+3)*4}/2

• Need to extract meaningful information for computers

• Need rules for writing and reading

Parsing

• Also called syntax (syntactic) analysis

• Aims to understand the exact meaning of structured text

– Results in a parse tree, a representation of the structured text

– Preceded by a tokenizer

• Needs a grammar to generate the parse tree

– Tests whether a text conforms to the grammar

Parsing Process

[Figure: Source text → Tokenizer → tokens → Parser (using grammars) → Parse trees → Interpreter/Compiler → Executable file]

Source text (example): xx + yyy = zz; (x+y)/z = a; a - b - c = x;

Tokens: xx, +, yyy, …

Grammar (example): <exp> ::= <term> | <term> + <exp> | <term> - <exp>

Executable file: 0101010100000011110011110101010111111010010…

Tokenization

From text to tokens

a.k.a. lexical analysis

Tokenization in natural language

• Process of converting a sequence of characters into a sequence of tokens

• Natural language tokenization

“I want to tokenize this sentence.”

→ I / want / to / tokenize / this / sentence / .

– A token is just a vocabulary word in this case (plus some punctuation marks)

Tokenization in Structured Text

• A token is a string with an assigned meaning

• Structured as a pair consisting of a token kind (type) and an optional token value

public void myMethod(int var1)

→ (keyword, “public”), (keyword, “void”), (identifier, “myMethod”), (punctuation, “(”), (keyword, “int”), (identifier, “var1”), …

• Sometimes with location information
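
The (kind, value) pair described above could be modelled as a small class. The sketch below is illustrative only: the names `Token`, `Kind`, `TokenDemo` and the three kinds shown are assumptions made for this example, not part of the course sample code.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical token representation: a kind plus an optional value.
final class Token {
    enum Kind { KEYWORD, IDENTIFIER, PUNCTUATION }

    final Kind kind;
    final String value; // may be null for tokens that carry no value

    Token(Kind kind, String value) {
        this.kind = kind;
        this.value = value;
    }

    @Override
    public String toString() {
        return "(" + kind.name().toLowerCase(Locale.ROOT) + ", \"" + value + "\")";
    }
}

class TokenDemo {
    // Tokens for the start of: public void myMethod(int var1)
    static List<Token> example() {
        return List.of(
            new Token(Token.Kind.KEYWORD, "public"),
            new Token(Token.Kind.KEYWORD, "void"),
            new Token(Token.Kind.IDENTIFIER, "myMethod"),
            new Token(Token.Kind.PUNCTUATION, "("),
            new Token(Token.Kind.KEYWORD, "int"),
            new Token(Token.Kind.IDENTIFIER, "var1"));
    }

    public static void main(String[] args) {
        example().forEach(System.out::println);
    }
}
```

A location field (for example a line and column) could be added to `Token` in the same way if error messages need it.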

The Output

• A series of tokens: kind, location, name (if any)

– Punctuation ( ) ; , [ ]

– Operators + - ** :=

– Keywords begin end if while try catch

– Identifiers Square_Root

– String literals “press Enter to continue”

– Character literals ‘x’

– Numeric literals

• Integer: 123

• Floating point: 5.23e+2


Punctuation: Separators

• Typically individual special characters such as ( { } : . ;

– Sometimes double characters: the tokenizer looks for the longest token

• /*, //, -- comment openers in various languages

– Returned just as the identity (kind) of the token

• And perhaps location, for error messages and debugging purposes

Operators

• Like punctuation

– No real difference for tokenizer

– Typically single or double special chars

• Operators + - == <=

• Operations :=

– Returned as kind of token

• And perhaps location


Identifier & Keywords

• Identifier: function names, variable names

– Length, allowed characters, separators

• Need to build a names table

– Single entry for all occurrences of var1, myFunction

– Typical structure: hash table

• Tokenizer returns token kind

– And key (index) to table entry

– Table entry includes location information

• Keywords: reserved identifiers

– E.g. BEGIN END in Pascal, if in C, catch in C++
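
The names table described above can be sketched with a hash table that hands out one key per distinct identifier. This is a minimal illustration; the class name `NamesTable` and its methods are hypothetical, and location information is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names table: one entry per distinct identifier.
class NamesTable {
    private final Map<String, Integer> index = new HashMap<>(); // name -> key
    private final List<String> names = new ArrayList<>();       // key -> name

    // Returns the key for the name, adding a new entry on its first occurrence.
    int intern(String name) {
        Integer key = index.get(name);
        if (key == null) {
            key = names.size();
            names.add(name);
            index.put(name, key);
        }
        return key;
    }

    String nameOf(int key) {
        return names.get(key);
    }

    int size() {
        return names.size();
    }

    public static void main(String[] args) {
        NamesTable table = new NamesTable();
        System.out.println(table.intern("var1"));       // first occurrence
        System.out.println(table.intern("myFunction")); // new entry
        System.out.println(table.intern("var1"));       // same key as before
    }
}
```

The tokenizer would then return the token kind together with the key, rather than the identifier text itself.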


Literals

• String literals

– “example”

• Character literals

– ‘c’

• Numeric literals

– 123 (Integer)

– 123.456 (Double)


Free form vs Fixed form

• Free form languages (modern ones)

– White space does not matter. Ignore these:

• Tabs, spaces, new lines, carriage returns

– Only the ordering of tokens is important

• Fixed format languages (historical)

– Layout is critical

• Fortran

• Python, indentation

• Tokenizer must know about layout to find tokens

Case Equivalence

• Some languages are case-insensitive

– Pascal, Ada

• Some are not

– C, Java

• Tokenizer ignores case if needed

– This_Routine == THIS_RouTine

– Error analysis may need exact casing


General Approach

• Define set of token kinds:

– An enumeration type (tok_int, tok_if, tok_plus, tok_left_paren, etc.)

• Tokenizer returns a pair consisting of a token kind and an optional token value

– Some tokens carry associated data

– E.g. location in the text

– Key for identifier table

Abstract class Tokenizer

public abstract class Tokenizer {

    // extract the next token from the current text and save it
    public abstract void next();

    // return the current token (without type information)
    public abstract Object current();

    // check whether there is a token remaining in the text
    public abstract boolean hasNext();
}

Download the example code at: https://gitlab.cecs.anu.edu.au/u1063268/comp2100-6442-materials/


public class MySimpleTokenizer extends Tokenizer {

    private String text;    // save text
    private int pos;        // current position
    private Object current; // save token extracted

    static final char[] symbol = { '*', '+', '(', ')', ';' };

    public void next() {
        consumewhite();
        if (pos == text.length()) {
            current = null; // end of text
        } else if (isin(text.charAt(pos), symbol)) {
            current = "" + text.charAt(pos); // extract predefined symbol
            pos++;
        }

Parsing numeral

        else if (Character.isDigit(text.charAt(pos))) {
            int start = pos;
            while (pos < text.length() && Character.isDigit(text.charAt(pos)))
                pos++;
            // check for a period in the sequence; note that a valid double
            // has only a single period, followed by at least one digit
            if (pos + 1 < text.length() && text.charAt(pos) == '.'
                    && Character.isDigit(text.charAt(pos + 1))) {
                pos++;
                while (pos < text.length() && Character.isDigit(text.charAt(pos)))
                    pos++;
                current = Double.parseDouble(text.substring(start, pos));
            } else {
                current = Integer.parseInt(text.substring(start, pos));
            }
        }
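
The fragments above leave out the constructor, the helpers `consumewhite` and `isin`, and the remaining methods. A self-contained sketch that fills these in is given below. The class name `SketchTokenizer`, the helper bodies, the `tokenize` convenience method, and the choice that `hasNext()` reports whether a current token is available are all assumptions made for this example; the course sample code may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Self-contained sketch of the simple tokenizer (no abstract base class needed here).
class SketchTokenizer {
    private static final char[] SYMBOLS = { '*', '+', '(', ')', ';' };

    private final String text; // text to tokenize
    private int pos = 0;       // current position in text
    private Object current;    // most recently extracted token, or null at end of text

    SketchTokenizer(String text) {
        this.text = text;
        next(); // load the first token
    }

    Object current() { return current; }

    // Assumption: true while a current token is available.
    boolean hasNext() { return current != null; }

    void next() {
        // skip white space
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
        if (pos == text.length()) {
            current = null; // end of text
            return;
        }
        char c = text.charAt(pos);
        if (isIn(c, SYMBOLS)) {
            current = String.valueOf(c); // predefined symbol
            pos++;
        } else if (Character.isDigit(c)) {
            int start = pos;
            while (pos < text.length() && Character.isDigit(text.charAt(pos))) pos++;
            // a double has a single period followed by at least one digit
            if (pos + 1 < text.length() && text.charAt(pos) == '.'
                    && Character.isDigit(text.charAt(pos + 1))) {
                pos++;
                while (pos < text.length() && Character.isDigit(text.charAt(pos))) pos++;
                current = Double.parseDouble(text.substring(start, pos));
            } else {
                current = Integer.parseInt(text.substring(start, pos));
            }
        } else {
            throw new IllegalArgumentException(
                "Unexpected character '" + c + "' at position " + pos);
        }
    }

    private static boolean isIn(char c, char[] set) {
        for (char s : set) {
            if (s == c) return true;
        }
        return false;
    }

    // Convenience helper: collect all tokens of a text into a list.
    static List<Object> tokenize(String text) {
        List<Object> tokens = new ArrayList<>();
        SketchTokenizer t = new SketchTokenizer(text);
        while (t.hasNext()) {
            tokens.add(t.current());
            t.next();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("(12 + 3.5) * 4;"));
    }
}
```

Symbols are returned as `String`, integers as `Integer`, and decimals as `Double`, which matches the untyped `Object` returned by `current()` in the abstract class.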

Parsing

From tokens to tree

Grammars

• Grammars express languages

• Example: English grammar (simplified)

sentence → noun_phrase predicate

noun_phrase → article noun

predicate → verb

article → a

article → the

noun → cat

noun → dog

verb → runs

verb → sleeps

Derivation of string “the dog sleeps”:

sentence ⇒ noun_phrase predicate

⇒ noun_phrase verb

⇒ article noun verb

⇒ the noun verb

⇒ the dog verb

⇒ the dog sleeps

Derivation of string “a cat runs”:

sentence ⇒ noun_phrase predicate

⇒ noun_phrase verb

⇒ article noun verb

⇒ a noun verb

⇒ a cat verb

⇒ a cat runs

Language of the grammar

• Language: the set of all strings that can be derived from the grammar

L = { “a cat runs”,

“a cat sleeps”,

“the cat runs”,

“the cat sleeps”,

“a dog runs”,

“a dog sleeps”,

“the dog runs”,

“the dog sleeps” }
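
Because this grammar has no recursion, its language is finite and can be enumerated by trying every alternative of every variable. The sketch below does exactly that; the class name `TinyEnglishGrammar` and its fields are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Enumerates the language of the simplified English grammar:
//   sentence    -> noun_phrase predicate
//   noun_phrase -> article noun
//   predicate   -> verb
class TinyEnglishGrammar {
    static final String[] ARTICLES = { "a", "the" };
    static final String[] NOUNS = { "cat", "dog" };
    static final String[] VERBS = { "runs", "sleeps" };

    // Every sentence is one article, one noun and one verb, in that order.
    static List<String> language() {
        List<String> sentences = new ArrayList<>();
        for (String article : ARTICLES) {
            for (String noun : NOUNS) {
                for (String verb : VERBS) {
                    sentences.add(article + " " + noun + " " + verb);
                }
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        language().forEach(System.out::println);
    }
}
```

With 2 articles, 2 nouns and 2 verbs there are 2 × 2 × 2 = 8 sentences, matching the set L above. A string such as “the dog walks” is not in the language because no production derives “walks”.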

Context Free Grammars

• Grammars provide a precise way of specifying a language.

• A context-free grammar is often used to define the syntax

• A context-free grammar is specified via a set of production rules (→ or ::=) with

– Variables (surrounded with <> )

– Terminals (symbols)

– Alternatives ( | )

Production / Variables / Terminals

noun → cat

– noun: variable

– cat: terminal (symbol)

sentence → noun_phrase predicate

– noun_phrase predicate: sequence of variables

Alternatives |

• Conventional notation

article → a

article → the

can be written as

article → a | the

Another Example 1

• Grammar for integer

– including numbers such as 0111

– Question: how can such numbers be excluded?

<num> → <digit><num> | <digit>

<digit> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Formal Definitions

Grammar: G = (V, T, S, P)

– V: set of variables

– T: set of terminal symbols

– S: start variable

– P: set of productions

With example

G = (V, T, S, P)

– Variables: 𝑉 = {<num>, <digit>}

– Terminals: 𝑇 = {0, 1, 2, …, 9}

– Start variable: 𝑆 = <num>

– Productions:

<num> → <digit><num> | <digit>

<digit> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Grammar for mathematical expressions

• 𝑉 = {𝐸}

• 𝑇 = {𝑎, +, ∗, (, )}

– Let’s assume ‘a’ can be any number

• 𝑆 = 𝐸

– All mathematical expressions start from 𝐸

• 𝑃: 4 production rules

𝐸 → 𝐸 + 𝐸 | 𝐸 ∗ 𝐸 | (𝐸) | 𝑎

Derivation for 𝑎 + 𝑎 ∗ 𝑎

𝐸 → 𝐸 + 𝐸 | 𝐸 ∗ 𝐸 | (𝐸) | 𝑎

𝐸 ⟹ 𝐸 + 𝐸

⟹ 𝑎 + 𝐸

⟹ 𝑎 + 𝐸 ∗ 𝐸

⟹ 𝑎 + 𝑎 ∗ 𝐸

⟹ 𝑎 + 𝑎 ∗ 𝑎

Parse Tree!

        E
      / | \
     E  +  E
     |   / | \
     a  E  ∗  E
        |     |
        a     a

Evaluate expression via Tree

• The parse tree defines an evaluation order

• Evaluation proceeds from leaves to root

• Intermediate variables are replaced by terminal values

[Figure: parse tree for 2 + 2 ∗ 2. The subtree 2 ∗ 2 evaluates to 4, then the root evaluates 2 + 4 = 6.]

Ambiguity!

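Parse tree 1 (root applies 𝐸 → 𝐸 + 𝐸):

        E
      / | \
     E  +  E
     |   / | \
     2  E  ∗  E
        |     |
        2     2

Evaluates to 2 + (2 ∗ 2) = 6

Parse tree 2 (root applies 𝐸 → 𝐸 ∗ 𝐸):

            E
          / | \
         E  ∗  E
       / | \   |
      E  +  E  2
      |     |
      2     2

Evaluates to (2 + 2) ∗ 2 = 8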
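
The effect of the two parse trees can be checked by building each tree by hand and evaluating it from the leaves to the root. The following is a minimal sketch; the names `Node`, `Num`, `Add`, `Mul` and `AmbiguityDemo` are hypothetical and chosen only for this illustration.

```java
// Minimal expression tree: each node knows how to evaluate itself.
interface Node {
    int eval();
}

final class Num implements Node {
    private final int value;
    Num(int value) { this.value = value; }
    public int eval() { return value; }
}

final class Add implements Node {
    private final Node left, right;
    Add(Node left, Node right) { this.left = left; this.right = right; }
    public int eval() { return left.eval() + right.eval(); }
}

final class Mul implements Node {
    private final Node left, right;
    Mul(Node left, Node right) { this.left = left; this.right = right; }
    public int eval() { return left.eval() * right.eval(); }
}

class AmbiguityDemo {
    // Parse tree 1: the root is +, its right child is *, i.e. 2 + (2 * 2)
    static Node tree1() {
        return new Add(new Num(2), new Mul(new Num(2), new Num(2)));
    }

    // Parse tree 2: the root is *, its left child is +, i.e. (2 + 2) * 2
    static Node tree2() {
        return new Mul(new Add(new Num(2), new Num(2)), new Num(2));
    }

    public static void main(String[] args) {
        System.out.println("Parse tree 1: " + tree1().eval());
        System.out.println("Parse tree 2: " + tree2().eval());
    }
}
```

Both trees derive the same token sequence 2 + 2 ∗ 2, yet they evaluate to different values, which is why an ambiguous grammar is a problem for an interpreter or compiler.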

Ambiguous Grammar

• A context-free grammar G is ambiguous if there is a string w in the language of G which has

– Two (or more) different derivation trees

– Example: 𝐸 → 𝐸 + 𝐸 | 𝐸 ∗ 𝐸 | (𝐸) | 𝑎

Remove ambiguity

Ambiguous Grammar

𝐸 → 𝐸 + 𝐸

𝐸 → 𝐸 ∗ 𝐸

𝐸 → (𝐸) | 𝑎

Non-ambiguous Grammar

𝐸 → 𝑇 + 𝐸 | 𝑇

𝑇 → 𝐹 ∗ 𝑇 | 𝐹

𝐹 → (𝐸) | 𝑎

Unique derivation tree for 𝑎 + 𝑎 ∗ 𝑎

𝐸 → 𝑇 + 𝐸 | 𝑇

𝑇 → 𝐹 ∗ 𝑇 | 𝐹

𝐹 → (𝐸) | 𝑎

            E
          / | \
         T  +  E
         |     |
         F     T
         |   / | \
         a  F  ∗  T
            |     |
            a     F
                  |
                  a

Implementing a parser

• Recursive descent parser

– Top-down parser

– Parse from the start variable, recursively parsing input tokens

– Create a method for each left-hand-side variable in the grammar

– These methods are responsible for generating parsed nodes

Variable (& Symbol) as a node

• To construct a parse tree,

– We can adopt ideas from the binary search tree

– Each variable (& symbol) can be represented as a node in a tree

– In a BST, there are only two types of nodes: NonEmptyBinaryTree.java, EmptyBinaryTree.java

– Define a node class for each variable

• Given a grammar that starts with Exp:

Exp → Term + Exp | Term

Term → Factor * Term | Factor

Factor → (Exp) | a

• Node classes

– public abstract class Exp

– public class Term extends Exp

– public class Factor extends Exp

– public class Int extends Exp

– public class AddExp extends Exp

– public class MulExp extends Exp

Implement Parser

• Given grammar: Exp → Term + Exp | Term

• We can define a parse method for each variable

public Exp parseExp(Tokenizer)

– Return type: Exp node (abstract class)

– Parameter: the tokenizer, used to get the next token (you may set the tokenizer as a member of the parser class)

Pseudo code for parsing

Exp → Term + Exp | Term

public Exp parseExp(tok){
    Exp term = parseTerm(tok);          // both production rules start with Term
    if("+".equals(tok.current())){      // if the next token is +, apply the first production rule
        tok.next();
        Exp exp = parseExp(tok);
        return new AddExp(term, exp);   // addition between term and exp
    }else{
        return term;                    // otherwise apply the second production rule
    }
}

Pseudo code for parsing

Term → Factor * Term | Factor

public Exp parseTerm(tok){
    Exp factor = parseFactor(tok);      // both production rules start with Factor
    if("*".equals(tok.current())){      // if the next token is *, apply the first production rule
        tok.next();
        Exp term = parseTerm(tok);
        return new MulExp(factor, term); // multiplication between factor and term
    }else{
        return factor;                  // otherwise apply the second production rule
    }
}

Pseudo code for parsing

Factor → (Exp) | a

public Exp parseFactor(tok){
    if("(".equals(tok.current())){      // if the next token is (, apply the first production rule
        tok.next();
        Exp exp = parseExp(tok);
        tok.next();                     // remove ')'
        return exp;
    }else{                              // the second production rule
        Exp value = new Int(tok.current());
        tok.next();                     // consume the number so the caller sees the following token
        return value;
    }
}
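
Putting the three parse methods together, a complete recursive descent parser for this grammar can be sketched as follows. This is an illustration under assumptions, not the course sample code: the class names `IntExp` and `SketchParser`, the `evaluate`/`show` methods, the built-in list-based tokenizer (integers and the symbols + * ( ) only), and the error handling are all choices made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Grammar:  Exp -> Term + Exp | Term
//           Term -> Factor * Term | Factor
//           Factor -> ( Exp ) | integer
abstract class Exp {
    abstract int evaluate();
    abstract String show();
}

class IntExp extends Exp {
    private final int value;
    IntExp(int value) { this.value = value; }
    int evaluate() { return value; }
    String show() { return Integer.toString(value); }
}

class AddExp extends Exp {
    private final Exp left, right;
    AddExp(Exp left, Exp right) { this.left = left; this.right = right; }
    int evaluate() { return left.evaluate() + right.evaluate(); }
    String show() { return "(" + left.show() + " + " + right.show() + ")"; }
}

class MulExp extends Exp {
    private final Exp left, right;
    MulExp(Exp left, Exp right) { this.left = left; this.right = right; }
    int evaluate() { return left.evaluate() * right.evaluate(); }
    String show() { return "(" + left.show() + " * " + right.show() + ")"; }
}

class SketchParser {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0; // index of the current token

    SketchParser(String text) {
        // Very small tokenizer: integers and the symbols + * ( )
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < text.length() && Character.isDigit(text.charAt(i))) i++;
                tokens.add(text.substring(start, i));
            } else if ("+*()".indexOf(c) >= 0) {
                tokens.add(String.valueOf(c));
                i++;
            } else {
                throw new IllegalArgumentException("Unexpected character: " + c);
            }
        }
    }

    private String current() { return pos < tokens.size() ? tokens.get(pos) : null; }

    Exp parseExp() {
        Exp term = parseTerm();                 // both rules start with Term
        if ("+".equals(current())) {
            pos++;                              // consume '+'
            return new AddExp(term, parseExp());
        }
        return term;
    }

    Exp parseTerm() {
        Exp factor = parseFactor();             // both rules start with Factor
        if ("*".equals(current())) {
            pos++;                              // consume '*'
            return new MulExp(factor, parseTerm());
        }
        return factor;
    }

    Exp parseFactor() {
        String tok = current();
        if ("(".equals(tok)) {
            pos++;                              // consume '('
            Exp exp = parseExp();
            if (!")".equals(current()))
                throw new IllegalArgumentException("Expected ) but found " + current());
            pos++;                              // consume ')'
            return exp;
        }
        if (tok == null || !Character.isDigit(tok.charAt(0)))
            throw new IllegalArgumentException("Expected a number but found " + tok);
        pos++;                                  // consume the number
        return new IntExp(Integer.parseInt(tok));
    }

    static Exp parse(String text) {
        SketchParser parser = new SketchParser(text);
        Exp exp = parser.parseExp();
        if (parser.current() != null)
            throw new IllegalArgumentException("Unexpected token: " + parser.current());
        return exp;
    }

    public static void main(String[] args) {
        Exp exp = parse("2 + 2 * 2");
        System.out.println(exp.show() + " = " + exp.evaluate());
    }
}
```

In this sketch `parseTerm` and `parseFactor` return plain `Exp` nodes rather than dedicated `Term` and `Factor` objects, which keeps the tree small; a design with one node class per variable, as listed on the earlier slide, is equally valid.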

RDP - Limitations

• The recursive descent parsing approach will not work with left recursive grammars. For example:

– <binary> → <binary><digit> | <digit>

– <digit> → 0 | 1

– could not be parsed using the recursive descent parser: the parse method for <binary> would first call itself without consuming any token, so the recursion never terminates

• However, we could transform the grammar into:

– <binary> → <digit><binary> | <digit>

– <digit> → 0 | 1

– which represents the same language, yet can be parsed by the recursive descent parser
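
A recognizer for the transformed (right recursive) grammar can be sketched with one method per variable. The class and method names below are hypothetical, and the methods work directly on characters instead of tokens to keep the example short.

```java
// Recognizer for:  <binary> -> <digit><binary> | <digit>
//                  <digit>  -> 0 | 1
class BinaryRecognizer {

    // Returns the position just after the longest <binary> starting at pos,
    // or -1 if no <binary> starts there.
    static int parseBinary(String s, int pos) {
        int afterDigit = parseDigit(s, pos);   // both alternatives start with <digit>
        if (afterDigit < 0) {
            return -1;
        }
        // The recursive call happens only AFTER a digit was consumed, so the
        // position always advances and the recursion terminates.
        // With the left recursive rule <binary> -> <binary><digit>, the first
        // step would be parseBinary(s, pos) at the same position: infinite recursion.
        int afterRest = parseBinary(s, afterDigit);
        return afterRest >= 0 ? afterRest : afterDigit;
    }

    // Returns pos + 1 if the character at pos is 0 or 1, otherwise -1.
    static int parseDigit(String s, int pos) {
        if (pos < s.length() && (s.charAt(pos) == '0' || s.charAt(pos) == '1')) {
            return pos + 1;
        }
        return -1;
    }

    // A string is in the language if <binary> consumes all of it.
    static boolean accepts(String s) {
        return parseBinary(s, 0) == s.length();
    }

    public static void main(String[] args) {
        System.out.println(accepts("1011"));
        System.out.println(accepts("102"));
    }
}
```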


Lab tasks

• Check out sample code first

• Your lab task is to improve the example

tokenizer and parser

– With Token class

– With slightly more complex grammar


Acknowledgement

The lecture slides are based on those of

previous instructors:

• Dongwoo Kim

• Eric McCreath
