Programming Languages and Paradigms, J. Fenwick, B. Kurtz, C. Norris

Chapter 2

2.1 Case Study – Wren and Wren Intermediate Code

Wren is one of two teaching languages developed in the textbook Formal Syntax and Semantics of Programming Languages by Ken Slonneger and Barry Kurtz. It is a simple language that provides two data types (integer and boolean) and three primary control flow mechanisms: sequencing commands, an if command, and a while command. The other teaching language in the textbook is called Pelican; it introduces constants, procedures, and parameter lists. Although these features are useful in studying programming language semantics, they are not needed in this textbook, so only the simpler language Wren is studied here. Both languages were inspired by the PL/0 language used in Niklaus Wirth’s seminal textbook Algorithms + Data Structures = Programs (1975).

2.1.1 An Informal Description of Wren

Wren is a small imperative language designed for teaching purposes. It has a Pascal-like appearance with a declaration section before a command section.

program <program name> is

<declaration section>

begin

<command section>

end

There are only two types of variables: integer and boolean. Here is a typical declaration section.

var m,n : integer;

var done : boolean;

There are two input/output commands: read and write, where read requires an integer variable and write outputs an integer expression.

read m;

read n;

write m + n

The assignment command is of the form

<target variable> := <expression>

where the expression type must match the target variable type.

m := m * 2;

n := n / 2;

done := m = n

If you usually program in a C-based language such as C, C++, Java, or C#, there are some differences to note. The assignment operator is := and not =. The equality operator is = and not ==. And, perhaps most important, ; is used to separate two commands, not to terminate a single command. This means the final command in a sequence of commands is not followed by a semicolon.


There is a double-alternative if command and a single-alternative if command:

if <boolean expression> then
   <command sequence>
else
   <command sequence>
end if

if <boolean expression> then
   <command sequence>
end if

The semantics of the if commands are the same as in C-based languages. Notice the following syntactic differences from C-based languages: (1) key words, such as then and else, are used to bracket a command sequence, as contrasted with { … } in C-based languages; (2) key words, if and then, are used to bracket the Boolean expression, as contrasted with ( … ) in C-based languages.

There is a while command:

while <boolean expression> do

<command sequence>

end while

The semantics of the while command are the same as in C-based languages, with exit on a false Boolean expression value. Notice the following syntactic differences from C-based languages: (1) key words, do and end while, are used to bracket the command sequence, as contrasted with { … } in C-based languages; (2) key words, while and do, are used to bracket the Boolean expression, as contrasted with ( … ) in C-based languages.

Integer expressions are the same as in C-based languages, but there is no mod operation. Boolean expressions can contain <, <=, =, >, >=, <> (not equal) as well as and, or, not. Here is a complete Wren program that reads in two integer values, stored in m and n, and writes out their greatest common divisor (gcd).

program gcd is

var m,n : integer;

begin

read m; read n;

while m <> n do

if m < n then

n := n - m

else

m := m - n

end if

end while;

write m

end

2.1.2 Some Sample Programs in Wren

Here are two more complete programs in Wren that perform simple arithmetic operations. The first program finds the product of two numbers by using a “double-halve” algorithm.


This is a high-level language version of a multiplication algorithm that would normally be written in assembly language, where *2 is a shift left and /2 is a shift right.

program product is

var a,b,p : integer;

begin

read a; read b; p := 0;

while b > 0 do

if (b - (b/2) * 2) > 0 then

p := p + a

end if;

a := a * 2;

b := b / 2

end while;

write p

end

Although Wren does not have a mod operator, the expression (b - (b/2) * 2) is arithmetically equivalent to b mod 2. Logically the if command is “if b is odd then …”.

The next program finds the quotient and remainder for two integer values, also using a “double-halve” algorithm.

program quotient is

var x,y,r,q,w : integer;

begin

read x;

read y;

r:=x;

q:=0;

w:=y;

while w <= r do

w := 2 * w

end while;

while w > y do

q := q * 2;

w := w / 2;

if w <= r then

r := r - w;

q := q + 1

end if

end while;

write q;

write r

end


The three programs presented, gcd, product, and quotient, provide enough complexity that they will be used in future case studies to test our lexical analyzer, parser, and interpreter for the Wren language.

2.1.3 An Informal Description of Wren Intermediate Code

There are two instructions to perform interactive input and output: get and put. Both have a variable name as an argument. The following three-line program inputs and displays an integer value:

get x % inputs an integer value from keyboard and

% stores in the symbol table (ST) as X

put x % fetches the current value of X from the symbol table

% and prints to the console

halt

Running this short program in an interpreter would produce this interaction:

enter x > 123 // this is user input

x = 123

program halted

Wren Intermediate Code (WIC) uses a stack-based architecture, so intermediate values can be pushed from the symbol table onto the stack, and popped off the stack and stored back in the symbol table. Consider the program:

get A

get B

push A

push B

pop A

pop B

put A

put B

halt

Running this program in an interpreter would produce this interaction:

enter a > 123

enter b > 456

a = 456

b = 123

program halted

Notice the values for a and b are swapped; this is a characteristic of a stack's last in – first out (LIFO) behavior.

There are four arithmetic instructions: add, sub, mul, div. They all act in the same fashion:

• Pop the right-hand operand off the stack
• Pop the left-hand operand off the stack
• Perform the operation
• Push the result back onto the stack

Here is a sequence of instructions that inputs values for a and b, calculates the value of a² – b², and writes that value to the console.


get A

get B

push A

push A

mul

push B

push B

mul

sub

pop Result

put Result

halt

The above program produces the following interaction:

enter a > 5

enter b > 3

Result = 16

program halted
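Before moving on, it may help to see how little machinery these instructions require. Below is a minimal interpreter sketch in Python; it is an illustration only, not the textbook's implementation (which is developed in Prolog later in the book), and the representation of a WIC program as a list of (opcode, argument) tuples and the name run_wic are assumptions made for this sketch.

    def run_wic(program):
        """Interpret a WIC program given as a list of (opcode, argument) tuples."""
        symtab = {}                              # symbol table: variable name -> value
        stack = []                               # evaluation stack
        for op, arg in program:
            if op == "get":                      # read an integer and store it in the symbol table
                symtab[arg] = int(input("enter %s > " % arg))
            elif op == "put":                    # fetch a value from the symbol table and print it
                print("%s = %d" % (arg, symtab[arg]))
            elif op == "push":                   # push a variable's value, or a literal number
                stack.append(symtab[arg] if isinstance(arg, str) else arg)
            elif op == "pop":                    # pop the top of the stack into a variable
                symtab[arg] = stack.pop()
            elif op in ("add", "sub", "mul", "div"):
                right = stack.pop()              # right-hand operand is on top
                left = stack.pop()
                if op == "add":
                    stack.append(left + right)
                elif op == "sub":
                    stack.append(left - right)
                elif op == "mul":
                    stack.append(left * right)
                else:
                    stack.append(left // right)  # integer division, as in Wren
            elif op == "halt":
                print("program halted")
                break

    # The a² – b² example above, written as (opcode, argument) tuples:
    run_wic([("get", "A"), ("get", "B"),
             ("push", "A"), ("push", "A"), ("mul", None),
             ("push", "B"), ("push", "B"), ("mul", None),
             ("sub", None), ("pop", "Result"), ("put", "Result"), ("halt", None)])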

Exercise 2.1a: Write the WIC to read the value of X, evaluate the function f(x) = x² – 2x + 3, and then print the result.

There are six test instructions based on the six possible comparisons: tsteq (=), tstne (<>), tstlt (<), tstle (<=), tstgt (>), tstge (>=). Each of these instructions is based on a comparison with zero and operates in the following manner:

• Pop the value off the top of the stack
• Perform the indicated comparison of that value with 0
• Push either a 0 or a 1 onto the top of the stack based on whether the comparison is false or true

For example, if the top of stack is -5 and the comparison is tstlt, then the -5 is replaced by a 1. When Wren code is compiled, a test instruction is followed immediately by a conditional jump instruction, so this discussion will be expanded after labels and the jump instructions are introduced.

The label instruction has the form <label #> label, such as L1 label. All label numbers must be unique throughout the program. Labels are not executable instructions; they are simply targets for a conditional jump (jf) or an unconditional jump (j) instruction. The unconditional jump has the form j <label #>, such as j L1, and, as indicated by the name, always causes the program to jump to the specified label location. The conditional jump has the form jf <label #>, such as jf L2. Think of jf as jump on false. This instruction pops the value off the top of the stack; it is normally a 0 or 1 produced by a previous test instruction. If the value is 0 (false) then the jump is taken; if the value is 1 (true) then there is no jump and program execution falls through to the next instruction. To illustrate the use of the test and jump instructions, consider a program that reads in values for a and b, assigns to max the larger of the two values, then outputs max. Here is the entire code sequence with comments:


get A % input A

get B % input B

push A % A is on top of the stack

push B % B is on top of the stack with A below it

sub % A & B are popped, A-B is pushed onto the stack

tstlt % pops A-B, tests if A-B < 0 or, in other words,

% if A < B; pushes 0 or 1 onto stack

jf L1 % jump to L1 if A >= B (false was on stack)

push B % else A < B and B is the max

pop MAX % MAX has the same value as B

j L2 % jump unconditionally to L2

L1 label % jumped to here because A >= B

push A % A is the max

pop MAX % MAX has the same value as A

L2 label % jumped to here if B was the MAX

put MAX % output the value of MAX

halt

The above program produces the following interaction:

enter a > 5

enter b > 3

Max = 5

program halted

Here is a case where b is the larger value:

enter a > 5

enter b > 8

Max = 8

program halted

The above Wren Intermediate Code is equivalent to the Wren code

read a;

read b;

if a < b then

max := b

else

max := a

end if;

write max

The label structure for the two-alternative if command is:

<code for Boolean expression and test>

jf L1 % jump to else when false

<code for the true alternative of if command>

j L2 % jump unconditionally to L2

L1 label % the else code is next

<code for the false alternative of if command>

L2 label % end if is here


Exercise 2.1b: Specify the label structure for a single-alternative if command.

There are two logical binary operations: and and or. They both perform the following sequence of operations:

• Pop the right-hand operand off the stack, assumed to be 0 or 1
• Pop the left-hand operand off the stack, assumed to be 0 or 1
• Perform the logical operation, either and or or
• Push the result back onto the stack

The not operation is much easier to implement. Assume there is a logical value, 0 or 1, on top of the stack. If the value is 0, not changes it to a 1; if the value is 1, not changes it to a 0. Consider the Boolean expression (X > 0) and (X < 5). This would be implemented by the following sequence of Wren Intermediate Code:

push X % X is on top of the stack

push 0 % 0 is on top of the stack with X below it

sub % X & 0 are popped, X-0 is pushed onto the stack

tstgt % pops X, tests if X > 0, push result onto stack

push X % X is on top of the stack

push 5 % 5 is on top of the stack with X below it

sub % X & 5 are popped, X-5 is pushed onto the stack

tstlt % tests if X-5 < 0, push result onto stack

% the top of the stack is the Boolean X < 5

% the next to top is the Boolean X > 0

and % the top two values are popped and the logical

% and is pushed; (X>0) and (X<5) is on the stack

Exercise 2.1c: Write the WIC for the Boolean expression (Y <= 5) or (Y >= 10).

One final example is presented that illustrates how to implement loops in WIC. The problem to be solved is to find the number of binary digits in a positive number. For example, the number 50 would require six binary digits (110010). The algorithm repeatedly divides the number by 2, counting how many divisions are needed until the number becomes zero.

read num;

count := 0;

while num > 0 do

num := num / 2;

count := count + 1

end while;

write count

Converting to WIC, the code is:

get num % read num

push 0 % count := 0

pop count


L1 label % top of while loop

push num % test num > 0

push 0

sub

tstgt

jf L2 % jump out of loop if num <= 0

push num % num := num / 2

push 2

div

pop num

push count % count := count + 1

push 1

add

pop count

j L1 % jump unconditionally to top of loop

L2 label % exit point from while loop

put count % write count

halt

The label structure for the while command is:

L1 label % top of the while loop

<code for Boolean expression and test>

jf L2 % jump out of loop if false

<code for the body of the while command>

j L1 % jump unconditionally to top of loop

L2 label % end while is here
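The test, jump, and label instructions fit into the same kind of interpreter sketch shown in section 2.1.3, once labels are located in a first pass and an explicit program counter is kept. The Python fragment below is again only an illustration under the same assumed tuple representation; here a label such as L1 label is encoded as the tuple ("label", "L1"), which is an assumption of this sketch.

    import operator

    TESTS = {"tsteq": operator.eq, "tstne": operator.ne,
             "tstlt": operator.lt, "tstle": operator.le,
             "tstgt": operator.gt, "tstge": operator.ge}

    def run_wic_control(program, symtab, stack):
        """Sketch of the control-flow part of a WIC interpreter."""
        # Pass 1: remember the position of every label.
        labels = {arg: i for i, (op, arg) in enumerate(program) if op == "label"}
        pc = 0                                   # program counter
        while pc < len(program):
            op, arg = program[pc]
            pc += 1
            if op in TESTS:                      # compare popped value with 0, push 1 (true) or 0 (false)
                stack.append(1 if TESTS[op](stack.pop(), 0) else 0)
            elif op == "jf":                     # jump on false: jump only if the popped value is 0
                if stack.pop() == 0:
                    pc = labels[arg]
            elif op == "j":                      # unconditional jump
                pc = labels[arg]
            elif op == "label":                  # labels are not executable; just fall through
                pass
            elif op == "halt":
                print("program halted")
                break
            # ... get, put, push, pop and the arithmetic instructions as in the earlier sketch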

Exercise 2.1d: Propose a format in Wren, introducing new keywords if needed, for a loop structure that tests the exit condition at the bottom of the loop; it will exit on false. Devise a label structure in WIC for a do…while loop.

2.1.4 Desktop Compilation

Here is the gcd program presented earlier in this section. The declarations do not generate any executable code. First translate the input/output commands and the halt at the end.

program gcd is

var m,n : integer;
begin                     Wren Intermediate Code
read m;                   get m
read n;                   get n
while m <> n do           <<code for while command>>
if m < n then
n := n - m
else
m := m - n
end if
end while;
write m                   put m
end                       halt

Next work on setting up the structure for the while loop.

begin                     Wren Intermediate Code
read m;                   get m
read n;                   get n
                          L1 label
while m <> n do           push m
                          push n
                          sub
                          tstne
                          jf L2
if m < n then             <<code for if command>>
n := n - m
else
m := m - n
end if
end while;                L2 label
write m                   put m
end                       halt

Next work on setting up the structure for the if command.

begin                     Wren Intermediate Code
read m;                   get m
read n;                   get n
                          L1 label
while m <> n do           push m
                          push n
                          sub
                          tstne
                          jf L2
if m < n then             push m
                          push n
                          sub
                          tstlt
                          jf L3
n := n - m                <<code for assignment>>
                          j L4
else                      L3 label
m := m - n                <<code for assignment>>
end if                    L4 label
end while;                L2 label
write m                   put m
end                       halt


Finish up by coding the assignments inside the if command.

begin                     Wren Intermediate Code
read m;                   get m
read n;                   get n
                          L1 label
while m <> n do           push m
                          push n
                          sub
                          tstne
                          jf L2
if m < n then             push m
                          push n
                          sub
                          tstlt
                          jf L3
n := n - m                push n
                          push m
                          sub
                          pop n
                          j L4
else                      L3 label
m := m - n                push m
                          push n
                          sub
                          pop m
end if                    L4 label
end while;                L2 label
write m                   put m
end                       halt

In Chapter 7 you will build a code generator for Wren that will carry out this translation process.

Exercise 2.1e: Hand-compile the product program into Wren Intermediate Code.

2.2 Specifying Syntax: BNF and other techniques

Designers of programming languages have ideas about how the language should look and be put together. For example, one designer may think the looping construct should begin with a “while” keyword but another may like the “repeat” keyword. Also, there are rules about structure, such as: should a loop test come before the loop body (top-tested) or after the loop body (bottom-tested)? To be precise, language designers need a clear way to specify these types of syntax decisions. This section describes the ways that syntax can be specified.


2.2.1 Backus-Naur Form (BNF) and Chomsky’s Context-Free Grammars

In the late 1950s John Backus, fresh off his difficult but successful implementation of the Fortran programming language, developed a formal notation to specify the syntax of the new programming language ALGOL 58. Building on this, Peter Naur provided some improvements in the notation to specify ALGOL 60. Ever since that time, this Backus-Naur Form – or more commonly just BNF – has been the predominant technique used by language designers to specify syntax and grammar rules. In one of those rare times when the same thing is discovered independently by different people, Noam Chomsky developed a theory of language classifications in the mid-to-late 1950s as part of his linguistic work with natural languages. One of Chomsky’s language classes was the context-free grammars, and this class is nearly identical to BNF.

A language specification using either BNF or context-free grammars consists of four components: (1) a set of terminal symbols (also called the alphabet), (2) a set of non-terminal symbols, (3) a set of productions (also called rewrite rules), and (4) one non-terminal symbol designated as the starting symbol. This is written in a mathematical way as G = (T, N, P, S).

The terminals are the most basic units of the grammar and consist of the symbols and words of the language. A terminal symbol is frequently called a token. Example tokens include keywords, such as “while”, as well as operators, such as “+” and “!=”. Actually, there is a slight difference between the keyword or symbol a programmer types into a program file and its token. The actual word in the file, “while” or “!=”, is called a lexeme. To represent this lexeme in our specification we speak of the “while token” or the “not equal to token”. We might also refer to it as WHILE_TOK. We might also simply use a boldface font, such as while, to mean the while token. It is a little confusing because there is only one allowable lexeme for such tokens. But it becomes clearer when you consider lexemes like these: normalBodyTemp, 98.6, freezingPointFahrenheit, or 32. You can see that we need a single IDENTIFIER_TOK to represent all the allowable names such as normalBodyTemp. Similarly, you might use the tokens REALNUM_TOK and INTEGERNUM_TOK to represent all the many real and integer number lexemes that could be encountered in a program.

Blast to the Past: Lazy Programmer Launches Age of Software!

In the early 1950s programmers were writing programs in assembly language. There were only a handful of instructions (the Edsac and IBM 701 only had 31), typically written as mnemonics; for example, A might mean Add and P might mean Print. John Backus worked on the IBM 701 and wanted a better way, a more naturally expressive way to program. So he proposed a “high-level language” called Fortran for the upcoming IBM 704 machine. Using Fortran, the number of programming statements necessary for a program decreased by a factor of 20, and productivity soared as the age of software was launched! Backus claims that it was laziness, not wanting to write so much code, that propelled him to desire something better than assembly programming. However, it took his IBM team a massive 18 person-years of effort to construct the Fortran I compiler [CitePadua] between 1954 and 1957. A primary reason for the difficulty was the informal language specification. So, perhaps this same laziness caused him to develop BNF to decrease the time to specify and implement a language design. It worked! There have been thousands of languages developed since Backus launched the software age.


The terminals are the base units, or the alphabet symbols, of our language, whereas the non-terminals are at a higher level. If the terminals are like words then the non-terminals are like sentences. Typically a non-terminal is given a descriptive name and enclosed in angled brackets, for example <variable list>, <declaration>, or <type>.

A production is the mechanism that defines the structure of a non-terminal. For example, if a language defines two terminal keywords for data types to be INTEGER_KEYWORD_TOK (representing the lexeme “int” perhaps) and BOOLEAN_KEYWORD_TOK (“boolean”), then we can define the <type> non-terminal with these two productions:

<type> ::= INTEGER_KEYWORD_TOK
<type> ::= BOOLEAN_KEYWORD_TOK

where the ::= symbol is a production-writing meta-symbol that means “is defined by.” From this example you can see that productions have the following structure: a single non-terminal on the left side of the definition meta-symbol and a sequence of non-terminals and/or terminals on the right side. Typically we use the meta-symbol “|” meaning “or” to join these distinct productions on separate lines into a single line (but there are still two definitions of the non-terminal):

<type> ::= INTEGER_KEYWORD_TOK | BOOLEAN_KEYWORD_TOK

Now, you can write the definition for a variable declaration for a language like C:

<declaration> ::= <type> <variable list> SEMICOLON_TOK

It is common to improve the readability of a specification by not using so many capital letters. Thus, IDENTIFIER_TOK is made more readable by making it lowercase and using a bold font instead of the suffix _TOK to indicate its role, as in identifier. For symbols such as PLUS_TOK and SEMICOLON_TOK we can use the symbol itself, giving + and ; as token indications.

Let’s look at the <variable list> non-terminal because it shows how to have a sequence of items that is of unspecified length. A variable list consists of a single variable or a comma-separated list of variables. The production

<variable list> ::= identifier

only allows for a single variable. This production

<variable list> ::= identifier , identifier

allows for a variable list of two variables. You could define a few more rules like this to allow for variable lists of 3, 4, and 5 variables. But how can one allow a programmer to have a variable list as long as they want? We use a type of inductive or recursive definition! The base case is a single identifier and subsequent cases build on this. Here is the solution:

<variable list> ::= identifier
<variable list> ::= identifier , <variable list>

or even more succinctly:

<variable list> ::= identifier | identifier , <variable list>

This means that a variable list such as “a, b, c” in a program is really seen as the identifier “a” followed by a comma followed by another variable list (e.g., “b, c”). The last component of a language specification is the one, special non-terminal symbol that represents the entirety of a program, such as <program>.


Example 2.2.A

Since a whole programming language is rather complex, let us begin with a simpler language: the language that is all U.S. telephone numbers. Then, G2.2.A = (T, N, P, S) where
T = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, (, -, ) },
N = {<PhoneNumber>, <CountryCode>, <AreaCode>, <Prefix>, <Extension>, <Digit>},
P = {
<PhoneNumber> ::= <CountryCode> <AreaCode> <Prefix> - <Extension>
<PhoneNumber> ::= <AreaCode> <Prefix> - <Extension>
<CountryCode> ::= <Digit> <Digit>
<AreaCode> ::= ( <Digit> <Digit> <Digit> )
<Prefix> ::= <Digit> <Digit> <Digit>
<Extension> ::= <Digit> <Digit> <Digit> <Digit>
<Digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
},
and S = <PhoneNumber>.

Notice that the two phone number productions act to make the country code an optional element. Also, our language grammar specifies that the area code digits must be enclosed in parentheses.

Exercise 2.2.A Is the telephone number 800-555-1212 legal according to our grammar definition G2.2.A? Justify your answer.

Exercise 2.2.B Can you give an example of a telephone number that is legal according to G2.2.A?

2.2.2 Grammars and Languages

We use BNF to specify a language, or more properly a language grammar. You might be wondering: what is a language really, and what is the relationship between language and grammar? A language is defined as a set of strings. We can define L1 = {“a”} as the language that consists of a single string, and that string consists of the single letter “a”. L2 = {“a”, “b”} consists of two strings. Of course, we are interested in more complex languages than these! What about the language that is all U.S. telephone numbers? G2.2.A defined a grammar for this language, and the notation L(G2.2.A) means “the language defined by G2.2.A”. It is important to realize that each valid telephone number, for example (800)555-1212, is a single “string” in the set L(G2.2.A). Thus, if GJava is the grammar that defines the Java programming language then L(GJava) is the language of all valid Java programs, and each complete and valid Java program is a single “string” in L(GJava).

It is interesting to note that a grammar can be used to generate or recognize strings in the associated language. Exercise 2.2.A asked you to use the grammar to recognize whether the given string was in the language. Exercise 2.2.B asked you to generate a valid telephone number using the grammar. Language design is generally more interested in the recognition aspect since a compiler or interpreter will be necessary.
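The four components of G2.2.A can also be written down directly as data, which will be convenient when we turn to recognition in section 2.3. The Python encoding below is only one possible illustration; the name G_PHONE and the list-of-lists production format are assumptions of this sketch, not part of the grammar formalism.

    DIGITS = list("0123456789")

    # G2.2.A as data: T, N, P and S bundled into one dictionary.  Each production
    # maps a non-terminal to its possible right-hand sides, written as lists of symbols.
    G_PHONE = {
        "terminals": set(DIGITS) | {"(", ")", "-"},
        "nonterminals": {"<PhoneNumber>", "<CountryCode>", "<AreaCode>",
                         "<Prefix>", "<Extension>", "<Digit>"},
        "productions": {
            "<PhoneNumber>": [["<CountryCode>", "<AreaCode>", "<Prefix>", "-", "<Extension>"],
                              ["<AreaCode>", "<Prefix>", "-", "<Extension>"]],
            "<CountryCode>": [["<Digit>", "<Digit>"]],
            "<AreaCode>":    [["(", "<Digit>", "<Digit>", "<Digit>", ")"]],
            "<Prefix>":      [["<Digit>", "<Digit>", "<Digit>"]],
            "<Extension>":   [["<Digit>", "<Digit>", "<Digit>", "<Digit>"]],
            "<Digit>":       [[d] for d in DIGITS],
        },
        "start": "<PhoneNumber>",
    }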


Because G2.2.A is a small grammar you likely answered Exercises 2.2.A and 2.2.B by just glancing at the grammar. But a modern programming language might have hundreds of productions! Also, we would like to codify the process of recognizing the validity of a string, which brings us to the concept of a derivation. A derivation is a sequence of steps that begins with the starting non-terminal symbol of the grammar, S, and ends with a string (a sequence of terminal symbols). If this string is the one you are trying to recognize then that string is in the language. If a string cannot be derived from S in any way then that string is not in the language. Each derivation step replaces a non-terminal symbol with the right-hand side of one of its productions.

Example 2.2.B

Using G2.2.A let us perform a derivation of the string (800)555-1212. The starting non-terminal, S, is <PhoneNumber>.

Always start with S:
Step 1: <PhoneNumber>
No country code in (800)555-1212, so replace with the second <PhoneNumber> production:
Step 2: <AreaCode> <Prefix> - <Extension>
Replace <AreaCode>:
Step 3: ( <Digit> <Digit> <Digit> ) <Prefix> - <Extension>
Replace area code digits:
Step 4: ( 8 <Digit> <Digit> ) <Prefix> - <Extension>
Step 5: ( 8 0 <Digit> ) <Prefix> - <Extension>
Step 6: ( 8 0 0 ) <Prefix> - <Extension>
Replace <Prefix>:
Step 7: ( 8 0 0 ) <Digit> <Digit> <Digit> - <Extension>
Replace prefix digits:
Step 8: ( 8 0 0 ) 5 <Digit> <Digit> - <Extension>
Step 9: ( 8 0 0 ) 5 5 <Digit> - <Extension>
Step 10: ( 8 0 0 ) 5 5 5 - <Extension>
Replace <Extension>:
Step 11: ( 8 0 0 ) 5 5 5 - <Digit> <Digit> <Digit> <Digit>
Replace extension digits:
Step 12: ( 8 0 0 ) 5 5 5 - 1 <Digit> <Digit> <Digit>
Step 13: ( 8 0 0 ) 5 5 5 - 1 2 <Digit> <Digit>
Step 14: ( 8 0 0 ) 5 5 5 - 1 2 1 <Digit>
Step 15: ( 8 0 0 ) 5 5 5 - 1 2 1 2

This matches our initial string! Notice that the same derivation steps would be used for many other strings such as (866)111-2222.


Example 2.2.C

Using G2.2.A let us perform a derivation of the string 01-800-555-1212.

1: <PhoneNumber>
2: <CountryCode> <AreaCode> <Prefix> - <Extension>
3: <CountryCode> ( <Digit> <Digit> <Digit> ) <Prefix> - <Extension>

Already, it can be seen that the given string is NOT in the language. All valid strings use parentheses to surround the area code, and these symbols are not present in the given string.

Example 2.2.D

Using G2.2.A let us perform a derivation of the string 1(800)555-1212.

1: <PhoneNumber>
2: <CountryCode> <AreaCode> <Prefix> - <Extension>
3: <Digit> <Digit> <AreaCode> <Prefix> - <Extension>
4: 1 <Digit> <AreaCode> <Prefix> - <Extension>

At this point in the derivation, it can be seen that the given string is NOT in the language. There is no way to resolve the “missing” digit before the area code’s opening parenthesis.

We can consider a derivation in a more graphical way called a parse tree or a syntax tree. Here is the parse tree for Example 2.2.B. It should be noted that a single string, such as (800)555-1212, can have many different derivations. The differences arise in deciding which non-terminals to replace at each step. A left-most derivation always chooses to replace the left-most non-terminal. The derivation shown in Example 2.2.B is a left-most derivation. However, there is only one parse tree for this string. Looking again at the parse tree above, one cannot tell whether <AreaCode> was replaced before or after the <Extension> non-terminal.

Exercise 2.2.C Perform a right-most derivation of (800)555-1212 using G2.2.A.


2.2.3 Specification Problems

The use of BNF simplifies language design by providing a clear specification of the grammar syntax. But some potential pitfalls remain nonetheless. The most important issue is to ensure that the grammar is not ambiguous. An ambiguous grammar is defined as one allowing the construction of more than one parse tree for a given string. An ambiguous grammar would mean that we cannot codify the derivation process in a deterministic way. For G2.2.A used above, there is no ambiguity; there is only one way to construct a parse tree.

Example 2.2.E

Let G2.2.E be a grammar that describes simple arithmetic expressions. Then, G2.2.E = (T, N, P, S) where
T = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, -, *, / },
N = {<Expression>, <Operator>, <Number>, <Digit>},
P = {
<Expression> ::= <Number>
<Expression> ::= <Expression> <Operator> <Expression>
<Operator> ::= + | - | * | /
<Number> ::= <Digit> | <Digit> <Number>
<Digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
},
and S = <Expression>.

Exercise 2.2.D Perform a left-most derivation using G2.2.E for the string “2+3*4”.

The figure below shows two distinct parse trees for the arithmetic expression 2+3*4. Thus, G2.2.E is said to be an ambiguous grammar.

The ambiguity in the case of Exercise 2.2.D stems from the decision on how to replace <Operator> the first time. If one chooses + then the left parse tree in Figure 2.2.A results. If one chooses * then the right parse tree is the result. Programmers, of course, know this is an issue of operator precedence. We need to rewrite the grammar to enforce our notion of precedence by accomplishing two objectives. First, the derivation must be forced to choose + first so that * will bind to its operands earlier and the addition operation’s second operand is the result of the multiplication. Second, once * is chosen it must not be possible to get back to the lower-precedence operators.


To achieve these results, new non-terminals are introduced that form a stair-stepping set of productions mimicking the precedence relationships.

Example 2.2.F

A new simple arithmetic expression grammar that accounts for operator precedence. G2.2.F = (T, N, P, S) where
T = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, -, *, / },
N = {<Expression>, <Term>, <Number>, <Digit>},
P = {
<Expression> ::= <Term>
<Expression> ::= <Term> + <Expression>
<Expression> ::= <Term> - <Expression>
<Term> ::= <Number>
<Term> ::= <Number> * <Term>
<Term> ::= <Number> / <Term>
<Number> ::= <Digit> | <Digit> <Number>
<Digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
},
and S = <Expression>.

Exercise 2.2.E Explain why only one parse tree exists for “2+3*4” using grammar G2.2.F.

Exercise 2.2.F Draw the parse tree for this similar expression that shuffles the operators: 9/3-1.

Operator associativity is another issue that can cause ambiguity in grammars. Notice that two parse trees are possible for the expression “1-2-3” using G2.2.F. One parse tree binds the first subtraction operator to 1-2 (yielding -1 as a result), and this result is part of the second subtraction operation: (-1) - 3, which yields -4. Another parse tree binds 2-3 together, yielding the result -1, and this result is part of the first subtraction operation: 1 - (-1), yielding 2. Programmers are accustomed to these operators being left-associative, meaning that 1-2 should be bound together initially. We must rewrite the grammar to ensure this derivation behavior by choosing to make the “recursion” happen on the left side of the operator.

Example 2.2.G

Our final simple arithmetic expression grammar that accounts for operator precedence and associativity. G2.2.G = (T, N, P, S) where
T = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, -, *, / },
N = {<Expression>, <Term>, <Number>, <Digit>},
P = {
<Expression> ::= <Term>
<Expression> ::= <Expression> + <Term>
<Expression> ::= <Expression> - <Term>
<Term> ::= <Number>
<Term> ::= <Term> * <Number>
<Term> ::= <Term> / <Number>
<Number> ::= <Digit> | <Digit> <Number>
<Digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
},
and S = <Expression>.

There are other causes of ambiguity in grammars besides precedence and associativity, but these are very common. As has been shown, sometimes the grammar can be rewritten to eliminate the cause, but sometimes it cannot.

2.2.4 Other Techniques

We prefer the classic BNF specification for its simplicity and clarity. However, there are other specification techniques which we will only briefly demonstrate. Syntax diagrams are a graphical depiction of grammar productions. These two syntax diagrams for the <Digit> and <Number> productions of G2.2.G should convey the general idea. The diagram on the left specifies the <Digit> non-terminal; notice how the multiple productions are handled as choices. The diagram on the right specifies the <Number> non-terminal; notice how the repetition is handled as a back-edge.

EBNF is an extended BNF that reduces the number of productions and non-terminals by introducing new meta-language notations. For example, to allow multiple-digit numbers G2.2.G contains the two productions

<Number> ::= <Digit> | <Digit> <Number>

The recursion in the second production allows for the multiplicity of digits, and the recursion base case is the <Number> ::= <Digit> production. EBNF uses a new meta-symbol “*” to indicate 0 or more occurrences of a symbol, and the meta-symbol “+” indicates 1 or more occurrences. Thus, the two BNF <Number> productions necessary to have single- and/or multiple-digit numbers can become this one EBNF production:

<Number> ::= <Digit>+

where the “+” is not a terminal symbol but an EBNF meta-symbol. EBNF includes several other meta-symbols.
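As a further illustration (this example is ours, not the text's), many EBNF dialects also allow a parenthesized group of symbols to be repeated, so the recursive <variable list> productions of section 2.2.1 could be collapsed into a single rule:

<variable list> ::= identifier ( , identifier )*

Here the parentheses and the “*” are EBNF meta-symbols, not terminals: an identifier followed by zero or more occurrences of a comma and another identifier.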


2.2.5 The BNF for Wren

As a reminder of what the syntax of Wren looks like, review the programs gcd, product, and quotient in sections 2.1.1 and 2.1.2. We will first use BNF notation to describe the syntax of Wren.


This grammar is expressed in the recursive format characteristic of BNF. For example, the production for a sequence of commands is

<command seq> ::= <command> | <command> ; <command seq>

We can tell from the BNF that the semicolon in Wren is used to separate commands and not to terminate commands; see the exercise below.

Exercise 2.2.G: Write a production rule for <command seq> that uses a semicolon to terminate every command (as contrasted with using it as a separator).

Exercise 2.2.H: Can a command sequence be empty according to this BNF? What provision is provided in this grammar to handle an “empty” command sequence?

Most programming languages define a list of reserved words that cannot be used as identifier names. The production rule below specifies the reserved words in Wren.

Exercise 2.2.I: Specify the leftmost derivation for the following Wren program:

program absolute is

var m : integer;

begin

read m;

if m < 0 then

write -m

else

write m

end if

end


2.3 Lexical Analysis and Parsing

Section 2.2 presented BNF as a formal mechanism for specifying a grammar and showed that a specification can be used as the basis for recognizing a given string or generating a valid string. Recognition involves performing a derivation from the start symbol to the given string, using the grammar productions to systematically remove non-terminal symbols. This section describes in more detail how this recognition process is actually encoded as an algorithm.

A programmer types a program into a file and then runs an application, such as a compiler or interpreter, to recognize whether the program is valid. This recognizer reads the program from the file character by character. Thus the first step, lexical analysis, is to bundle characters together again into tokens. For example, a C programmer may type the integer type keyword “int” into their program. The lexical analyzer reads characters and comes across these particular three sequential characters: i, n, and t. It bundles these together into a token, perhaps INT_TOK. Of course, it is not quite that simple, since the “int” occurring in names such as “interiorNode” or “sprint” is not an INT_TOK token. The parser gets tokens from the lexical analyzer and performs the derivation.

2.3.1 Lexical Analysis

The process of tokenizing can be mapped to the operation of a finite state automaton (FSA). We know that the integer keyword token in C is composed of the three letters: i, n, and t. Here is the simple FSA for this token. Notice the “final” state is indicated by a double circle and that reading characters from the input is the basis of a transition.

Exercise 2.3.A Draw the FSA for the not-equals token of Wren (“<>”).

Both of these examples are straightforward because the tokens have a fixed length. Identifiers in Wren are a sequence of alphanumeric characters that begins with a letter. Here is the FSA for the identifier token. Notice how the FSA can loop, repeating state 2 for consecutive letters, or can alternate between states 2 and 3 repeatedly.


It is not difficult to continue making an FSA like these for each token of a grammar. The difficulty arises because all of these FSAs need to be combined together. For instance, if the lexical analyzer reads the letter “i”, should it use the INT_TOK automaton or the identifier FSA? Clearly, after only seeing the letter “i” the lexical analyzer does not know which of these two automata may be the right one. Here is an FSA that combines these two together. We are being intuitive about FSA construction, but there is quite a lot of theoretical treatment of this subject [cite,cite].

This brings up the two primary difficulties that must be addressed by the lexical analyzer. First, what should be done if the lexical analyzer reaches an FSA “final” state? In the combined FSA, consider the action after reading the first three letters of the identifier “interiorNode”. Should the lexical analyzer match the INT_TOK in state 4 or keep going to match the entire identifier? Typically we attempt to match the longest possible token. The second problem is how to proceed if two FSAs both match a given sequence of characters. Notice that in state 4, “int” satisfies the INT_TOK automaton and the identifier automaton. In this case the lexical analyzer will consult a priority list. Typically we would give INT_TOK a higher priority than identifier.

As described in section 2.2, a token represents a class of lexemes. Some tokens represent only a single lexeme, such as “int” for INT_TOK. Other tokens represent an infinite number of lexemes, for example identifier. The lexical analyzer is primarily tasked with combining characters into tokens, but later phases of a compiler or interpreter will require additional knowledge. For example, when machine code needs to be generated a compiler will need to know which specific variable (represented generically by the identifier token) occurs in the program statement. Thus a token typically has a secondary piece of information that indicates the program’s lexeme. We might say identifier(i) to indicate an identifier token whose associated value is the lexeme “i”. Similarly, num(0) would indicate a num token with an associated value of 0.
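A small tokenizer sketch makes the longest-match rule and the keyword priority concrete. This is an illustrative Python fragment, not the textbook's lexical analyzer (which is written in Prolog in section 6.4); the token names reuse those introduced earlier, and the helper name tokenize and the tiny keyword table are assumptions of this sketch.

    KEYWORDS = {"int": "INT_TOK", "while": "WHILE_TOK"}   # keywords take priority over identifiers

    def tokenize(text):
        """Return a list of (token, lexeme) pairs for a tiny C-like fragment."""
        tokens, i = [], 0
        while i < len(text):
            ch = text[i]
            if ch.isspace():
                i += 1
            elif ch.isalpha():                   # identifier/keyword FSA
                j = i
                while j < len(text) and text[j].isalnum():
                    j += 1                       # keep going: match the longest possible token
                lexeme = text[i:j]
                tokens.append((KEYWORDS.get(lexeme, "IDENTIFIER_TOK"), lexeme))
                i = j
            elif ch.isdigit():                   # integer-number FSA
                j = i
                while j < len(text) and text[j].isdigit():
                    j += 1
                tokens.append(("INTEGERNUM_TOK", text[i:j]))
                i = j
            else:                                # single-character symbol tokens
                tokens.append(("SYMBOL_TOK", ch))
                i += 1
        return tokens

    # "int" is a keyword, but "interiorNode" is an identifier:
    print(tokenize("int interiorNode = 32 ;"))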


2.3.2 Lexical Analysis of Wren

In section 6.4 a lexical analyzer for Wren will be developed using Prolog. Below is the output of that lexical analyzer for the gcd program discussed in Section 2.1.1. Notice that the program simply becomes a list, or sequence, of tokens.

>>> Interpreting: Wren <<<

Enter name of source file:

gcd.wren

program gcd is

var m,n : integer;

begin

read m; read n;

while m <> n do

if m < n then n := n - m

else m := m - n

end if

end while;

write m

end

Scan successful

[program,ide(gcd),is,var,ide(m),comma,ide(n),colon,integer,

semicolon,begin,read,ide(m),semicolon,read,ide(n),semicolon,

while,ide(m),neq,ide(n),do,if,ide(m),less,ide(n),then,ide(n),

assign,ide(n),minus,ide(m),else,ide(m),assign,ide(m),minus,

ide(n),end,if,end,while,semicolon,write,ide(m),end,eop]

The final token, eop, stands for end of program and is inserted by the lexical analyzer.

Exercise 2.3.B: Using the format shown above, give the list of tokens produced by a successful scan of the Wren program product that is given in Section 2.1.2.

2.3.3 Top-down Parsing

The parser, also called the syntax analyzer, is the component of the recognizer that gets tokens from the lexical analyzer and verifies the syntactic structure specified by the BNF productions. To parse a given string, the parser essentially performs actions that follow the steps of a derivation of the string. A top-down parser begins at the “top” of the derivation, so that the first parsing step is the first derivation step. For this reason, these parsers are intuitive to understand; however, they are more limited in the grammars that they can parse.


Top-down parsers follow the steps of a left-most derivation, replacing non-terminals as allowed by grammar productions and matching terminals in the derivation with the terminals in the string being parsed. A parsing stack is used to hold the terminals and non-terminals. Each step examines the top of the stack and possibly the next input character. If the top of the stack is a non-terminal then it is replaced using a production. If the top of the stack is a terminal then it is compared with the next input symbol. If they match then the parse is going well, and the symbol is popped from the stack and removed from the input. If the stack becomes empty and the input symbols are exhausted then the parse was successful; that is, the input string is in the language defined by the grammar. An unsuccessful parse is realized if any of these situations occur:

• A non-terminal on the stack cannot be replaced
• A terminal on the stack is not matched by the next terminal in the input
• The stack becomes empty but the input has not been completely used

Example 2.3.A

Perform a predictive, top-down parse of “(800)555-1212” using G2.2.A = (T, N, P, S) where
T = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, (, -, ) },
N = {<PN>, <CC>, <AC>, <P>, <E>, <D>},
P = {
1: <PN> ::= <CC> <AC> <P> - <E>
2: <PN> ::= <AC> <P> - <E>
3: <CC> ::= <D> <D>
4: <AC> ::= ( <D> <D> <D> )
5: <P> ::= <D> <D> <D>
6: <E> ::= <D> <D> <D> <D>
7-16: <D> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
},
and S = <PN>.

Parsing Stack                  Input stream       Commentary
(top of stack on left)
<PN>                           (800)555-1212      Initial state
<AC><P>-<E>                    (800)555-1212      Replaced <PN>: two possible productions, but the next input character is ( so choose production #2.
(<D><D><D>)<P>-<E>             (800)555-1212      Replaced <AC> by production #4.
<D><D><D>)<P>-<E>              800)555-1212       Top of stack is a terminal that matches the next input character: pop stack, read character.
8<D><D>)<P>-<E>                800)555-1212       Replaced <D>: ten possibilities, but the input character suggests using <D> ::= 8.
<D><D>)<P>-<E>                 00)555-1212        Matched top of stack with input: pop and read.
0<D>)<P>-<E>                   00)555-1212        Replaced <D>
<D>)<P>-<E>                    0)555-1212         Stack matches input: pop, read.
0)<P>-<E>                      0)555-1212         Replaced <D>
)<P>-<E>                       )555-1212          Stack matches input: pop, read.
<P>-<E>                        555-1212           Stack matches input: pop, read.
<D><D><D>-<E>                  555-1212           Replaced <P>
5<D><D>-<E>                    555-1212           Replaced <D>
<D><D>-<E>                     55-1212            Stack matches input: pop, read.
5<D>-<E>                       55-1212            Replaced <D>
<D>-<E>                        5-1212             Stack matches input: pop, read.
5-<E>                          5-1212             Replaced <D>
-<E>                           -1212              Stack matches input: pop, read.
<E>                            1212               Stack matches input: pop, read.
<D><D><D><D>                   1212               Replaced <E>
1<D><D><D>                     1212               Replaced <D>
<D><D><D>                      212                Stack matches input: pop, read.
2<D><D>                        212                Replaced <D>
<D><D>                         12                 Stack matches input: pop, read.
1<D>                           12                 Replaced <D>
<D>                            2                  Stack matches input: pop, read.
2                              2                  Replaced <D>
(empty stack)                  (end of input)     Stack matches input: pop, read. Parse successful! Stack is empty and input consumed.
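The same parse can be reproduced mechanically. The sketch below is an illustrative Python rendering of the stack-driven loop just described; it reuses the G_PHONE encoding suggested in section 2.2.2, and it predicts a production by peeking at the next input character (a simplified, hand-rolled version of the FIRST sets discussed next). The function names are assumptions of this sketch.

    def parse_phone(number):
        """Predictive, top-down parse of a string using G_PHONE (see section 2.2.2)."""
        prods = G_PHONE["productions"]
        stack = [G_PHONE["start"]]               # parsing stack, top of stack at the end of the list
        remaining = list(number)                 # input stream
        while stack:
            top = stack.pop()
            if top in prods:                     # non-terminal: predict a production
                options = [rhs for rhs in prods[top]
                           if remaining and first_symbol_matches(rhs, remaining[0], prods)]
                if not options:
                    return False                 # no production can be chosen
                stack.extend(reversed(options[0]))   # push right-hand side, leftmost symbol on top
            else:                                # terminal: must match the next input character
                if not remaining or remaining[0] != top:
                    return False
                remaining.pop(0)                 # pop the stack (already done) and read the input
        return not remaining                     # success only if all input was consumed

    def first_symbol_matches(rhs, ch, prods):
        """Can this right-hand side derive a string beginning with character ch?"""
        sym = rhs[0]
        if sym not in prods:                     # the right-hand side starts with a terminal
            return sym == ch
        return any(first_symbol_matches(alt, ch, prods) for alt in prods[sym])

    print(parse_phone("(800)555-1212"))   # True
    print(parse_phone("1(800)555-1212"))  # False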

Exercise 2.3.C See how the parse of “1(800)555-1212” fails using G2.2.A.

Because some non-terminals are defined by more than one production, the parser must have a deterministic way to choose (predict) the correct production. This occurred in Example 2.3.A when <PN> was replaced. We “knew what to do” because we could see the parenthesis symbol in the input and could tell we should choose production #2 and not production #1. But how can we make our parser “know what to do”? We have two choices. First, just try all of the possibilities. We could have tried production #1 but would have discovered that we could not successfully parse it that way. We could then “backtrack” and try the other branch – production #2. Backtracking can work, but it can be slow and a little tricky when you consider that we might be faced with branches on branches on branches, all possibly requiring backtracking. The second choice is to annotate each production with information that helps us make the correct choice. For instance, production #1 in our example would be annotated with the digit terminals, since the <CC> non-terminal will end up as a digit symbol. Similarly, production #2 would be annotated with a left parenthesis terminal, since <AC> will end up with a left parenthesis. We can compute a FIRST set for each non-terminal that acts as this annotation. FIRST(X) is the FIRST set for the non-terminal X and is defined as the set of terminals that can begin the strings derivable from X. Here are the FIRST sets for the non-terminals of G2.2.A.

Non-terminal   FIRST set                            Comments
<D>            { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }     <D> can only derive 10 strings
<E>            { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }     <E> can derive 10,000 strings but they all start with a digit: 0-9
<P>            { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }     All of the 1000 strings derivable from <P> start with a digit
<AC>           { ( }                                Can derive 1000 strings, but they all start with a left parenthesis
<CC>           { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }     All 100 strings start with a digit
<PN>           { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ( }  All strings start with either a digit (using production #1) or a left parenthesis (using production #2).
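For a grammar without λ-productions, such as G2.2.A, these FIRST sets can be computed mechanically by iterating over the productions until nothing changes. The following Python sketch is illustrative only; it reuses the G_PHONE encoding from section 2.2.2 and deliberately ignores λ-productions, which require extra care.

    def first_sets(grammar):
        """Compute FIRST for every non-terminal of a grammar with no empty productions."""
        prods = grammar["productions"]
        first = {nt: set() for nt in prods}
        changed = True
        while changed:                           # iterate until no FIRST set grows
            changed = False
            for nt, alternatives in prods.items():
                for rhs in alternatives:
                    sym = rhs[0]                 # only the leading symbol matters without λ
                    new = first[sym] if sym in prods else {sym}
                    if not new <= first[nt]:
                        first[nt] |= new
                        changed = True
        return first

    # first_sets(G_PHONE)["<AreaCode>"] is {"("}; the "<PhoneNumber>" entry contains
    # all ten digits plus "(", matching the table above.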

The FIRST sets are computed in advance from the grammar and then used during the parse to predict a production. Note that for the predictive parser to work properly, no input symbol may point to more than one production for the same non-terminal. Returning to our example, when we needed to replace <PN> there were two possibilities: production #1 or #2. Production #1 begins with a <CC>, meaning we will predict this production if the current input symbol is in FIRST(<CC>). In this example, the input symbol was the left parenthesis, which is not an element of FIRST(<CC>) but is an element of FIRST(<AC>). But if a common terminal symbol had been in both FIRST(<CC>) and FIRST(<AC>) then our parser could not guarantee its prediction about which production to use to replace <PN>.

This is exactly the situation that confronts our expression grammar G2.2.G = (T, N, P, S) where
T = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, -, *, / },
N = {<Expression>, <Term>, <Number>, <Digit>},
P = {
1: <Expression> ::= <Term>
2: <Expression> ::= <Expression> + <Term>
3: <Expression> ::= <Expression> - <Term>
4: <Term> ::= <Number>
5: <Term> ::= <Term> * <Number>
6: <Term> ::= <Term> / <Number>
7-8: <Number> ::= <Digit> | <Digit> <Number>
9-18: <Digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
},
and S = <Expression>.

Exercise 2.3.D Compute the FIRST sets for the non-terminals of grammar G2.2.G.

Notice that the digit characters are elements of FIRST(<Term>) and are also elements of FIRST(<Expression>). Consider how the predictive parser will choose to replace <Expression> if the input symbol is a digit. It will not be able to definitively choose between productions #1, #2, or #3. Similarly, consider replacing <Number> with either production #7 or #8. It cannot be done with assurance. These two scenarios are common grammar problems for predictive parsers. They are not problems that make the grammar ambiguous (as with precedence in section 2.2). Rather, these problems cause difficulty in constructing a deterministic parser.

In the case of <Number>, the productions have a common left prefix, meaning they start identically. But we need them to be distinguishable. The solution is to left factor the appropriate productions, factoring out the commonality. In the case of <Expression>, the problem stems from the fact that productions #2 and #3 are left recursive. This is undesirable because we do not remove an input character during a parse when we replace a non-terminal on the stack. So we might have <Expression> + <Term> on the stack and wish to replace the <Expression> on the top of the stack with production #2, yielding <Expression> + <Term> + <Term> on the stack. Now we are back where we started: <Expression> is on the top of the stack and production #2 can be applied again!

Page 27: Programming Languages and Paradigms, J. Fenwick, B. Kurtz ...blk/cs3490/ch02/ch02.pdf · Programming Languages and Paradigms, J. Fenwick, B. Kurtz, C. Norris Chapter 2 Page 1 2.1

Programming Languages and Paradigms, J. Fenwick, B. Kurtz, C. Norris

Chapter 2 Page 27

back where we started: <Expression> is the top of the stack and production #2 will work! Thus, we need to eliminate the left recursion in these productions. Left Factoring Let α, β and δ be “sentential fragments” which just means they are like variables, representing any sequence of terminals and/or non-terminals. Let there be two productions for the non-terminal <A> that have α as a common left factor. <A> ::= α β <A> ::= α δ We left factor these <A> productions by:

(1) Create a new non-terminal <A2> and use it as the new left-hand side of all the <A> productions that had the common prefix, removing the prefix:

<A2> ::= β
<A2> ::= δ

(2) Add the new <A> production:

<A> ::= α <A2>

Notice that if there are <A> productions that do not share the common prefix, they are unaffected. Also notice that in some cases β or δ may be empty; in this case it is typical to represent this explicitly with a meta-symbol meaning "nil". We choose to use λ, although others use ε. Lastly, notice that only the <A> productions beginning with the prefix are affected; that is, a production like <B> ::= α β is not considered, since the problem lies with the need to replace <A>, not <B>, during a parse. Of course, if the <B> productions also suffer from a common left prefix then they will have to be factored, but independently of the <A> productions.

Example 2.3.B Left factor the <Number> productions in grammar G2.2.G. The productions are:

<Number> ::= <Digit>
<Number> ::= <Digit> <Number>

First we identify the prefix and other components: α = <Digit>, β = λ (nil), and δ = <Number>. Now we apply the rules to get:

<Number2> ::= λ | <Number>
<Number> ::= <Digit> <Number2>
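Left factoring is usually done by hand, but the transformation is mechanical. The sketch below is illustrative only (Python; the grammar encoding and names are ours, and it handles only a one-symbol common prefix, as in Example 2.3.B): it groups a non-terminal's productions by their first symbol and factors each conflicting group.

    def left_factor(nonterminal, productions, new_name):
        """Left factor productions (lists of symbols) that share a one-symbol prefix.

        Returns a dict mapping non-terminal names to their new productions;
        the empty list [] plays the role of the lambda (nil) production."""
        by_prefix = {}
        for production in productions:
            by_prefix.setdefault(production[0], []).append(production)

        factored = {nonterminal: [], new_name: []}
        for prefix, group in by_prefix.items():
            if len(group) == 1:                          # no conflict, keep as-is
                factored[nonterminal].append(group[0])
            else:                                        # factor out the common prefix
                factored[nonterminal].append([prefix, new_name])
                for production in group:
                    factored[new_name].append(production[1:])   # may become lambda
        return factored

    # Example 2.3.B: <Number> ::= <Digit> | <Digit> <Number>
    print(left_factor("<Number>",
                      [["<Digit>"], ["<Digit>", "<Number>"]],
                      "<Number2>"))
    # {'<Number>': [['<Digit>', '<Number2>']], '<Number2>': [[], ['<Number>']]}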

Exercise 2.3.D Left factor the following three productions:

<X> ::= a <Y> b c | a <Y> c b | <Y> d

The second problem we discovered above, with the <Expression> productions, was left recursion. Recall that the recursion is what allows repetition: for the <Expression> rule in G2.2.G we can have expressions like 1+2+3*4*5 that repeat an operator. However, as shown above, left recursion causes difficulty in performing a top-down, predictive parse. Fortunately, there is a straightforward algorithm for eliminating left recursion in a grammar.

Elimination of Left Recursion
Let α and β be "sentential fragments", which just means they are like variables, representing any sequence of terminals and/or non-terminals.


Let there be productions for the non-terminal <A> that are left recursive and productions for <A> that are not left recursive.

<A> ::= <A> α
<A> ::= β

We eliminate the left recursion by modifying these <A> productions:

(1) Create a new non-terminal <A2> and give it the production <A2> ::= λ.
(2) For all the productions that are not left recursive, append the new non-terminal:

<A> ::= β <A2>

(3) For all the productions that are left recursive:

a. Remove the left recursive symbol
b. Append the new non-terminal
c. Change the left-hand side of the production to the new non-terminal

<A2> ::= α <A2>

Intuitively, we eliminate the left recursion by adding a new non-terminal that is right recursive. Here is a recap of what happened to the productions.

<A> ::= <A> α
<A> ::= β

became

<A> ::= β <A2>
<A2> ::= α <A2>
<A2> ::= λ
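The same rewrite can be expressed as a small procedure. The sketch below is ours, not the book's (Python; the grammar encoding and names are assumptions), and it applies the transformation to one non-terminal that has at least one production that is not left recursive. It reproduces the rewrite carried out by hand for <Expression> in Example 2.3.C below.

    def eliminate_left_recursion(a, productions, a2):
        """Rewrite the productions of non-terminal `a` (lists of symbols) so that
        none is left recursive, introducing the new non-terminal `a2`.
        The empty list [] represents the lambda production."""
        recursive    = [p[1:] for p in productions if p and p[0] == a]    # <A> ::= <A> alpha
        nonrecursive = [p for p in productions if not p or p[0] != a]     # <A> ::= beta

        new_a  = [p + [a2] for p in nonrecursive]        # <A>  ::= beta <A2>
        new_a2 = [p + [a2] for p in recursive] + [[]]    # <A2> ::= alpha <A2> | lambda
        return {a: new_a, a2: new_a2}

    print(eliminate_left_recursion(
        "<Expression>",
        [["<Term>"],
         ["<Expression>", "+", "<Term>"],
         ["<Expression>", "-", "<Term>"]],
        "<Expression2>"))
    # {'<Expression>': [['<Term>', '<Expression2>']],
    #  '<Expression2>': [['+', '<Term>', '<Expression2>'],
    #                    ['-', '<Term>', '<Expression2>'], []]}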

Example 2.3.C Eliminate the left recursion in the <Expression> productions of G2.2.G. Observe that there is one <Expression> production that is not left recursive and two that are, leading to the following rewritten productions.

<Expression> ::= <Term> <Expression2>
<Expression2> ::= λ
<Expression2> ::= + <Term> <Expression2>
<Expression2> ::= - <Term> <Expression2>

Example 2.3.D To convince ourselves that these new grammar productions yield the same language, let us perform a parse of "1+2+3" using the new <Expression> productions.

Parsing Stack (top of stack on left)   Input   Commentary
<Expression>                           1+2+3   Initial state
<Term> <Expression2>                   1+2+3   Replaced <Expression> with only choice
<Number> <Expression2>                 1+2+3   Replaced <Term> with obvious choice (see next exercise)
1 <Expression2>                        1+2+3   Replaced <Number> with only choice
<Expression2>                          +2+3    Stack matches input: pop, read
+ <Term> <Expression2>                 +2+3    Replaced <Expression2> with only choice
<Term> <Expression2>                   2+3     Stack matches input: pop, read
<Number> <Expression2>                 2+3     Replaced <Term> with obvious choice
2 <Expression2>                        2+3     Replaced <Number> with only choice
<Expression2>                          +3      Stack matches input: pop, read
+ <Term> <Expression2>                 +3      Replaced <Expression2> with only choice
<Term> <Expression2>                   3       Stack matches input: pop, read
<Number> <Expression2>                 3       Replaced <Term> with obvious choice
3 <Expression2>                        3       Replaced <Number> with only choice
<Expression2>                                  Stack matches input: pop, read
                                               Replaced <Expression2> with nil (λ)


Parse successful! Stack is empty and input consumed.

Exercise 2.3.E Eliminate the left recursion in the <Term> productions of G2.2.G.

Recursive-descent parser
We have given the basic algorithm for predictive parsing and discussed how to rewrite the most common grammar problems for predictive parsing. The most common way to implement the predictive parsing algorithm is with a recursive-descent parser. In a recursive-descent parser there is a method (procedure, function) for each non-terminal in the grammar. Within each of these methods there are if statements to choose among the productions for that non-terminal. We have access to the FIRST set information, and we have access to the lexical analyzer via a nextToken() method. Typically nextToken() reads the next terminal out of the input, so we might use a currentToken variable to hold the terminal temporarily during our parsing steps. We'll provide pseudo-code for a few of the recursive-descent methods of our rewritten G2.2.G expression grammar to illustrate their construction. In Section 7.3 we will investigate building a parser for Wren.

parse() {
    currentToken = nextToken();    // read first input token
    expression();                  // start symbol of grammar
    print "Parse successful"
}

expression() {
    if (currentToken is member of FIRST(<Term>)) {
        // This is the <Expression> ::= <Term> <Expression2> production
        term();
        expression2();
    } else {
        print "Error parsing expression"
        abort;
    }
}

expression2() {
    if (currentToken == PLUS_TOK) {
        // This is the <Expression2> ::= + <Term> <Expression2> production
        currentToken = nextToken();    // matched the + so get the next token
        term();
        expression2();
    }
    else if (currentToken == MINUS_TOK) {
        // This is the <Expression2> ::= - <Term> <Expression2> production
        currentToken = nextToken();    // matched the - so get the next token
        term();
        expression2();
    }
    else {
        // <Expression2> ::= λ production, do nothing
    }
}
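For readers who want something runnable, here is one possible completion of this pseudo-code as a small Python parser for the rewritten expression grammar. It is only a sketch and not the book's code: tokens are single characters, term() and term2() are completed in the same style (using the rewrite that Exercise 2.3.E asks for), and all names are our own.

    class ParseError(Exception):
        pass

    class Parser:
        """Recursive-descent parser for the left-factored, non-left-recursive G2.2.G."""

        def __init__(self, text):
            self.tokens = list(text)      # one-character tokens: digits and + - * /
            self.pos = 0
            self.current = self.next_token()

        def next_token(self):
            tok = self.tokens[self.pos] if self.pos < len(self.tokens) else None
            self.pos += 1
            return tok

        def parse(self):
            self.expression()
            if self.current is not None:
                raise ParseError("extra input after expression")
            print("Parse successful")

        def expression(self):              # <Expression> ::= <Term> <Expression2>
            if self.current is not None and self.current.isdigit():   # FIRST(<Term>)
                self.term()
                self.expression2()
            else:
                raise ParseError("error parsing expression")

        def expression2(self):             # <Expression2> ::= + <Term> <Expression2> | - <Term> <Expression2> | lambda
            if self.current in ("+", "-"):
                self.current = self.next_token()
                self.term()
                self.expression2()
            # otherwise the lambda production: do nothing

        def term(self):                    # <Term> ::= <Number> <Term2>
            self.number()
            self.term2()

        def term2(self):                   # <Term2> ::= * <Number> <Term2> | / <Number> <Term2> | lambda
            if self.current in ("*", "/"):
                self.current = self.next_token()
                self.number()
                self.term2()

        def number(self):                  # <Number> ::= <Digit> <Number2>, <Number2> ::= <Number> | lambda
            if self.current is None or not self.current.isdigit():
                raise ParseError("error parsing number")
            self.current = self.next_token()   # match the digit
            if self.current is not None and self.current.isdigit():
                self.number()

    Parser("1+2+3*4*5").parse()            # prints "Parse successful"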

2.3.5 Bottom-up Parsing
Bottom-up parsers are more complex than top-down parsers, so we will not cover them thoroughly. Essentially, a bottom-up parser follows the steps of a right-most derivation in reverse. This is where the phrase "bottom-up" comes from: if you list out all the steps of a right-most derivation, then the bottom-up parser's first action is the last step of the derivation, its second action is the next-to-last derivation step, and so on back "up" the derivation from the bottom. This means that as input characters are removed, the parser is not predicting what to do; rather, it stores these tokens until it recognizes that it can match an entire right-hand side of a production. It then replaces all of these stored symbols (terminals and non-terminals) with the production's non-terminal. Notice this is exactly the inverse of our top-down replacement, which replaced a production's non-terminal with all the symbols of the production's right-hand side. The bottom-up parser finishes when only the start symbol is left and the input is empty.

2.3.6 A Parser for Wren
A parser takes the list of tokens from the scanner and produces an abstract syntax tree. Here is the result of parsing (in Prolog) the gcd program shown in Section 2.3.3.

Parse successful

prog([dec(integer,[m,n])],[read(m),read(n),while(bexp(neq,

ide(m),ide(n)),[if(bexp(less,ide(m),ide(n)),

[assign(n,exp(minus,ide(n),ide(m)))],

[assign(m,exp(minus,ide(m),ide(n)))])]),write(ide(m))])

Notice that the parser output contains nested lists matching the nesting found in the tree structure. The syntax of the parse is similar to the abstract production rules for Wren given in Section 2.2.3. It is possible to draw this result in the form of an abstract syntax tree. Appropriate internal node names have been introduced for the nested list structures, such as cmdSeq. The tree height has been lessened a bit by omitting the ide before identifier names. Here is the top level of this tree structure; you will be asked to complete the tree as part of an exercise.

    prog
        decSeq
            dec
                integer
                varList
                    m
                    n
        cmdSeq
            read
                m
            read
                n
            while
                <finish>
            write
                m

Exercise 2.3F: Complete the abstract syntax tree shown above by finishing the tree structure for the while command.

Exercise 2.3G: Using the format shown above, give the output produced by a successful parse of the Wren program product that is given in Section 2.1.2.

Exercise 2.3H: Draw the abstract syntax tree for the Wren program product that is given in Section 2.1.2.

The parser shown above was a program written in Prolog. That program will not be developed in this book; rather, a more ambitious parser for Wren will be developed that returns Wren Intermediate Code in place of an abstract syntax tree. This code-generating parser is the case study in Section 7.3.

2.4 Generating Intermediate Code
Section 2.3 discussed the lexical analysis and parsing phases of a recognition tool, such as a compiler or interpreter. Characters are read from a source file by the lexical analyzer and combined into tokens. The parser gets each token and verifies that the syntax structure of the grammar (the BNF productions) is not being violated. What happens if the input source file is determined to be syntactically valid? A compiler will generate some form of executable code, whereas an interpreter will execute the desired program as it parses. In either case, there is something happening beyond the parsing.

Typically, the input source program is read and parsed, and then converted into an intermediate representation. This intermediate code separates the dual concerns of the "front-end" phases of lexical analysis and parsing from the "back-end" phases of optimization and code generation. The front-end phases are intimately tied to the source language; for example, adding a keyword would cause lexical analyzer and parser changes. The back-end phases are intimately tied to the target environment; for example, using 32-bit or 64-bit registers. However, the intermediate code is only marginally affected by the source or target, perhaps not affected at all. The remainder of this section describes the two predominant forms of intermediate code and discusses how these forms are used with Wren.

2.4.1 Various Forms of Intermediate Code
There are two predominant categories of intermediate code: linear forms and graphical forms. Linear forms are sequences of statements that resemble assembly language. Graphical forms are hierarchical structures such as trees. There is a variety of each of these forms; below, two linear forms (three-address code and stack-based code) and one graphical form (abstract syntax trees) are discussed in detail.


2.4.1.1 Abstract Syntax Trees
The parse trees described in Section 2.2 contained a node in the tree for every symbol, terminal and non-terminal. Consider the parse tree for a single-alternative Wren if-then command:

if a <> 0 then x := x / a end if

This is called a concrete syntax tree because it contains nodes for every element. Notice that once the syntax is assured, many of the nodes are not actually necessary. An abstract syntax tree (AST) is an intermediate code that only maintains the essential information necessary for further “back-end” phases. An abstract syntax tree for the same Wren single alternative if-then command used above might look like:

In the concrete syntax tree, each non-terminal represents a sub-tree node in the tree and terminals are leaves of the tree. In the AST, each node type can infer certain syntax properties; thus, the IfCommand node can infer the keywords used and their position: if, then, end, and if again. The IfCommand has two parts that can vary: the boolean expression and the then command sequence. Other AST nodes, which in turn may root a


sub-tree, handle these two parts. Notice that the not-equal-to relation is folded into the comparison node and is graphically represented by the operator itself. The BNF productions of section 2.2 yield concrete syntax trees. It is sometimes useful to specify “abstract” productions that yield abstract syntax trees. Here are the abstract productions for Wren (notice the use of the * and + EBNF meta-symbols for repetition).

Graphical forms of intermediate code, such as abstract syntax trees, are useful because they are easy to generate during the steps of the parse. They also enable other kinds of analysis, including semantic analysis.

Exercise 2.4.A Using the abstract productions above, build the AST sub-tree for x := a + b * 2.

2.4.1.2 Three-address code
Three-address code is a linear intermediate representation that takes on the form

x := y op z

in which the three addresses are for the result (x) and the two operands (y, z). Three-address code is often implemented as a "quadruple" consisting of the three addresses plus a fourth component, the operator (op). Complex expressions must be broken down into pieces, and this introduces temporary variables. Thus, the source language expression x := a + b * 2 would be represented in three-address code as:

t1 := b * 2
x := a + t1

and, as quadruples, this expression might look like: (*, t1, b, 2) and (+, x, a, t1).


Other language constructs may not need to use all of the quadruple fields. So, the Wren input statement "read n" could have the three-address quadruple (read, n, nil, nil). Because three-address code is a linear form, there is a sequential ordering. After introducing a label "instruction", control flow is accomplished by jumping to labels. The Wren conditional statement from above, if a <> 0 then x := x / a end if, might have the following three-address code representation:

(<>, t1, a, 0)                  // perform the test
(jump-false, t1, Label1, nil)   // if the test failed (t1 is false) jump over the then block
(/, x, x, a)                    // here is the then block
(label, Label1, nil, nil)       // end of the if-then

Linear forms of intermediate code are useful since they very closely match most target languages such as assembly code. Moreover, there are many optimizations that are possible on linear intermediate code.

Exercise 2.4B: Give the three-address code representation for the Wren code: while n > 0 do n := n/2 end while.

2.4.1.3 Stack-based code
Stack-based code is a linear form that targets a stack-based architecture rather than a standard register-based architecture. Stack-based architectures are increasing in popularity; the Java Virtual Machine (JVM) is stack-based. In a stack-based architecture, temporary results are stored on a stack rather than in registers. Some instructions use two operands, and both of these would be found on the stack; other instructions use only one operand, and some use no operands. Some stack-based instructions contain an "address" type of field in the instruction, particularly the control flow instructions such as jumps. The jump-false instruction needs two pieces of information: the value to test (for false) and where to jump should the test succeed. An instruction in three-address code might look like (jump-false, t1, Label1, nil). However, in stack-based code the t1 value would be found on the stack, so it might look like (jump-false, Label1). Because at most one address field is used, stack-based code is sometimes called one-address code.

Example 2.4.A List the stack-based code for the example x := a + b * 2.

(push, b)
(push, 2)
(multiply, nil)
(push, a)
(add, nil)
(store, x)

Notice the stack manipulation instructions push and store that explicitly affect the stack. The two push instructions place the operands for the multiply operation on the stack.


Then, the multiply operation has no arguments at all since its operands are expected to be located at the top of the stack. The result of the multiply is implicitly placed onto the stack as well. Thus, after pushing the value of the variable a onto the stack, the top two values on the stack are the operands for the add instruction. Lastly, the store instruction pops a value off the top of the stack and stores this value into the variable identified in the instruction.

Exercise 2.4.C Assuming a label instruction such as (label, Label1), write the stack-based code for the if-then example used earlier: if a <> 0 then x := x / a end if.

2.4.2 The Stack-Based Intermediate Code for Wren
Wren Intermediate Code (WIC) was introduced in the case study in Section 2.1. The gcd program was hand compiled to produce the following WIC:

get m; get n; L1 label; push m; push n; sub; tstne; jf L2;

push m; push n; sub; tstlt; jf L3; push n; push m; sub; pop n;

j L4; L3 label; push m; push n; sub; pop m; L4 label; j L1;

L2 label; put m; halt;

A human hand-compiling code can produce more efficient and more elegant code than a program that produces compiled code. This is particularly true when expressions are evaluated to generate code. To see this, here is the WIC generated by the Wren code generator discussed in Section 7.3 for the gcd program.

[[get,m],[get,n],[L1,label],[push,m],[pop,T1],[push,n],[pop,T2],

[push,T1],[push,T2],sub,tstne,[jf,L2],[push,m],[pop,T1],

[push,n],[pop,T2],[push,T1],[push,T2],sub,tstlt,[jf,L3],

[push,n],[pop,T1],[push,m],[pop,T2],[push,T1],[push,T2],sub,

[pop,n],[j,L4],[L3,label],[push,m],[pop,T1],[push,n],[pop,T2],

[push,T1],[push,T2],sub,[pop,m],[L4,label],[j,L1],[L2,label],

[push,m],[pop,T1],[put,T1],halt]

The first thing you notice is that temporary variables have been introduced, notably T1 and T2. This is because the code generator always assumes the most complex case possible for expressions. The BNF for the assignment command is:

<command> ::= <variable> := <expr>

This allows for assignments such as n := n - m. A human examining this expression would directly write the WIC:

push n; push m; sub; pop n

The code generator assumes the more complex case of an integer expression on either side of an arithmetic operation, so it uses temporary variables to store intermediate results. The actual code generated is:

[push,n],[pop,T1],[push,m],[pop,T2],[push,T1],[push,T2],sub,

[pop,n]


Notice the following about the use of temporary variables:

Within the scope of an expression the temporary variables must be uniquely numbered

Once the code generation for an expression is completed the temporary variables can be re-used in a different expression

Translation of the while and if commands requires the use of labels. Labels must be uniquely numbered throughout the entire program. Here is the pattern of code generation for a two-alternative if command.
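One plausible rendering of this pattern, inferred from the hand-compiled gcd WIC in Section 2.4.2 (the diagram itself is not shown here), is the following, where L<n+1> and L<n+2> are the two newly generated labels:

    <code for the boolean expression>
    jf L<n+1>
    <code for the then command sequence>
    j L<n+2>
    L<n+1> label
    <code for the else command sequence>
    L<n+2> label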

The notation L<n+1> and L<n+2> assumes that n is the number of the last label used. The value of n must be updated each time a new label is generated and must be known throughout the program. This is accomplished by using an attribute grammar, as explained in Section 2.4.3.

Exercise 2.4D: Draw a similar diagram and code generation pattern for a single-alternative if command. How many labels did you need?

The code generated for a while command also requires two new labels.
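One plausible rendering of the while pattern, again inferred from the gcd WIC, is:

    L<n+1> label
    <code for the boolean expression>
    jf L<n+2>
    <code for the loop body command sequence>
    j L<n+1>
    L<n+2> label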

Again, the last value of n used must be known at the time this code is generated.


Exercise 2.4E: A do … while loop tests for continuing the loop at the bottom of the loop. This means the loop will execute at least one time. Draw a flow control diagram for the do while loop and devise a label scheme that will allow the proper semantics for the loop. How many labels did you need?

2.4.3 Using Attributes to Help Number Labels and Temporary Variables
The definition of grammars using BNF is based on a context-free grammar. This means each grammar rule is independent of the other grammar rules. But most programming languages are naturally context dependent. In particular, variable types are declared before variables are used. Type checking by a compiler requires that type information be known throughout the program blocks; the most local type declaration is used to determine the current type associated with a variable name. Donald Knuth proposed an ingenious mechanism, called attribute grammars, that allows information to be passed around an abstract syntax tree (AST) while retaining the simplicity of a context-free grammar. There are two types of attributes possible for the transfer of a particular piece of information:

An inherited attribute is passed from a parent node to a child node; it is a way to pass information down into an AST

A synthesized attribute is passed from a child node back up to a parent node; the value of a synthesized attribute is often determined by processing an inherited attribute in some way (including not changing the value at all)

A discussion of attribute grammars that solves all the context-sensitive issues associated with programming languages is beyond the scope of this book; attribute grammars as discussed below will help determine the numbers for labels and for temporary variables in WIC. Since labels must be unique throughout the intermediate code, they require the use of both an inherited attribute and a synthesized attribute. The basic strategy is to pass down the last used label number into the AST. Most nodes in the AST do not change this value and simply pass the same value back out as the synthesized attribute value. But the while command and if command in the AST require the use of labels and will return a different value in the synthesized attribute than what was inherited.

The while command and the two-alternative if command both require two labels, so if the value n is passed in as the inherited attribute then n+2 will be returned as the synthesized attribute

The single alternative if command only requires one label, so if the value n is passed in as the inherited attribute then n+1 will be returned as the synthesized attribute

Of course nested control structures may add more labels. The following diagram shows how these attributes are threaded throughout a command sequence in a Wren program.


A Wren code generator will be developed using Prolog in Section 7.3. Temporary variables only require an inherited attribute into an expression, since they can be reused in the next expression. Consider the grammar rule:

<integer expr> ::= <term> | <integer expr> <weak op> <term>

Remember that weak operations include + and -. As you will learn when you study Prolog in chapters 6 and 7, left recursion in Prolog causes difficulty, so the production rule is translated into:

<integer expr> ::= <term> <rest integer expr>

where <rest integer expr> is right recursive. The value temp is passed into <term> and the value temp+1 is passed into <rest integer expr>. This accounts for the code generated for n := n - m:

[push,n],[pop,T1],[push,m],[pop,T2],[push,T1],[push,T2],sub,

[pop,n]

Since temporary variables can be reused, it is not necessary to pass back out a modified value as a synthesized attribute. In the case study on code generation for Wren (Section 7.3) attributes will be used to pass in, as an inherited attribute, the code generated so far and to pass out, as a synthesized attribute, the new value for the intermediate code.
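To make the threading of the label attribute concrete, here is a small sketch (Python, with names of our own choosing; it is not the Prolog code generator of Section 7.3). Each generator receives the last used label number as an inherited attribute, claims the labels it needs, threads the updated number through its children, and returns it as a synthesized attribute.

    def gen_if(gen_bexpr, gen_then, gen_else, n):
        """Two-alternative if: claim L<n+1> and L<n+2>, then thread n through the children."""
        else_lab, end_lab = f"L{n+1}", f"L{n+2}"
        n += 2
        bexpr_code, n = gen_bexpr(n)
        then_code, n = gen_then(n)
        else_code, n = gen_else(n)
        code = (bexpr_code + [f"jf {else_lab}"] + then_code
                + [f"j {end_lab}", f"{else_lab} label"] + else_code + [f"{end_lab} label"])
        return code, n

    def gen_while(gen_bexpr, gen_body, n):
        """While command: also claims two labels."""
        top_lab, exit_lab = f"L{n+1}", f"L{n+2}"
        n += 2
        bexpr_code, n = gen_bexpr(n)
        body_code, n = gen_body(n)
        code = ([f"{top_lab} label"] + bexpr_code + [f"jf {exit_lab}"]
                + body_code + [f"j {top_lab}", f"{exit_lab} label"])
        return code, n

    def leaf(code):
        """Commands that need no labels pass the attribute through unchanged."""
        return lambda n: (code, n)

    # The control skeleton of gcd: while m <> n do if m < n then n := n-m else m := m-n end if end while
    body = lambda n: gen_if(leaf(["push m", "push n", "sub", "tstlt"]),
                            leaf(["push n", "push m", "sub", "pop n"]),
                            leaf(["push m", "push n", "sub", "pop m"]), n)
    code, last_label = gen_while(leaf(["push m", "push n", "sub", "tstne"]), body, 0)
    # The while uses L1/L2 and the nested if uses L3/L4, matching the hand-compiled WIC skeleton.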


2.5 Beyond Intermediate Code
The preceding sections have shown the "front-end" phases of lexical analysis and parsing and the middle phase involving generation of intermediate code and semantic analysis. At this point, the original program, now in the form of intermediate code, is considered well-formed. So the question now is "what to do next?" There are really only two choices: interpret this intermediate code directly, or compile the intermediate code into a natively executable form that targets a specific computer architecture. In either case, a new data structure is needed to manage the variables used in a program. The symbol table is this structure. Typically implemented as a map, also known as a hash table, the symbol table correlates a symbol name with other necessary information such as type.

2.5.1 Compilation to Machine Code
A standard compiler converts the intermediate code into machine-specific code, usually assembly code, targeting the machine that the compiler is currently running on. A cross-compiler converts the intermediate code into machine-specific code for a different machine architecture; for example, a cross-compiler running on an Intel-based Windows machine might generate final code that targets a Linux machine with an AMD processor. The details of the "back-end" phases of a compiler are beyond the scope of this text; see any of the excellent texts that focus on this subject [ASUL, others]. A brief overview is provided here.

Typically, the "back-end" phases include code-improving optimizations and final code generation, where this code matches the target architecture. Some optimizations are considered architecture-independent and can be performed on the intermediate code. For example, intermediate code generation introduces many temporary variables. Recall that multi-operator expressions like x := a + b * 2 become sequences of three-address code like (*, t1, b, 2) and (+, x, a, t1). Often intermediate code generation introduces temporaries for all expressions, not just multi-operator expressions. This actually makes sense because it simplifies the intermediate code generator. Thus, the expression x := a + b * 2 likely would become the sequence of three-address code (*, t1, b, 2), (+, t2, a, t1), (:=, x, t2, nil). Similarly, the shorter expression x := a + b might become (+, t1, a, b), (:=, x, t1, nil). Clearly, it is possible to do a better job, and a code-improving optimization phase can look specifically for these types of cases. Other architecture-independent optimizations include common-subexpression elimination, strength reduction, and copy propagation [cite ASUL].

When generating final code that targets a specific architecture, the compiler back-end must choose wisely from the architecture's available instruction set. For example, many machine architectures have multiple instructions for storing a constant numeric value into a variable, depending on the number of bits needed for the value. Thus "x := 1" might use a different target architecture instruction than "x := 500" because 1 can be represented with 8 bits, but 500 requires more than 8 bits.


Other optimizations are dependent on the target architecture. Allocating registers to reduce memory accesses depends on the number of registers provided by the architecture. Some architectures reserve registers for specific purposes such as passing parameters. Many modern architectures pipeline instruction fetch, decode, and execution. An optimizer can sometimes reorganize program statements to keep the pipeline full, and this code scheduling depends on architecture specifics such as the number of functional units.

2.5.2 Interpretation of Intermediate Code
A pure interpreter is a phase of the same program as the front-end phases; this interpreting phase begins to "execute" the intermediate code and process user input. A hybrid approach is a two-step process in which the first step takes the original source program, parses it, and outputs code not for a real machine architecture but rather for a virtual machine. Then an interpreter simulates the virtual machine. The UCSD Pascal system used this approach to "compile" Pascal down to p-code, which was then interpreted. More recently, Java made this hybrid approach widely appreciated. Java programs are compiled into byte-codes which are then interpreted by a Java Virtual Machine (JVM). Similarly, Microsoft's Common Intermediate Language (CIL) is the target for many of its Visual Studio .NET languages (VB, C#, F#, etc.).

Interpreter execution is typically much slower (as much as a factor of 10 slower) than executing a native instruction program created by a compiler [cite ASUL]. In an attempt to offer higher performance, some interpreters offer a "just-in-time" compilation feature. Essentially, the interpreter determines that a block of code is being executed frequently enough that it would improve overall performance to compile that block into native code and execute it directly instead of re-interpreting it each time.

2.5.3 Interpreting Wren Intermediate Code
Three of the case studies in this book involve writing an interpreter for WIC: an interpreter in the imperative language C (Section 3.4), an interpreter in the functional language F# (Section 5.3), and an interpreter in the object-oriented language C# (Section 8.4). All three interpreters can be written in Visual Studio, which implements its languages using the Common Intermediate Language, so it is possible to measure interpreter performance accurately. Most students are surprised to find out which implementation is the fastest.

Whatever language is used to write the interpreter, there are many common steps in writing the code. The first phase is a sequence of preprocessing steps that need to be completed before actually running the interpreter. The following WIC for the gcd program will be used to discuss these steps. Here is the machine-compiled intermediate code where, for reference, the instructions are numbered starting at zero.

(0)get m (1)get n (2)L1 label (3)push m (4)pop T1 (5)push n

(6)pop T2 (7)push T1 (8)push T2 (9)sub (10)tstne (11)jf L2

(12)push m (13)pop T1 (14)push n (15)pop T2 (16)push T1


(17)push T2 (18)sub (19)tstlt (20)jf L3 (21)push n (22)pop T1

(23)push m (24)pop T2 (25)push T1 (26)push T2 (27)sub (28)pop n

(29)j L4 (30)L3 label (31)push m (32)pop T1 (33)push n

(34)pop T2 (35)push T1 (36)push T2 (37)sub (38)pop m

(39)L4 label (40)j L1 (41)L2 label (42)push m (43)pop T1

(44)put T1 (45)halt

The first step in preprocessing is to allow the user to open the WIC file, which is a text file with one instruction per text line. These instructions need to be stored in a sequential data structure: a one-dimensional array for C and C# and a list for F#. All the instructions, except for label, have the form <op code> <operand>; the operand is empty for many instructions (tests, arithmetic operations, halt). So that all instructions have a uniform format, the preprocessor will switch the order of the values for the label instruction. In other words, L1 label will be converted to label L1. This now matches the <op code> <operand> format.

A symbol table (ST) will be used to store the current values for all the variables in the program, including all temporary variables. For Boolean variables the values are 0 for false and 1 for true. All variables in the ST are initialized to zero. For the gcd program given above, the preprocessor will create the following ST.

m 0

n 0

T1 0

T2 0

The program counter (PC) is initialized to start at instruction 0 and continues until the halt instruction is encountered. When the instructions are executed sequentially, the PC is incremented by one to fetch the next instruction. However, when a jump instruction is encountered and the jump is taken, the PC is changed to the location of the label instruction. Rather than search for the location of a label instruction each time a jump is taken, a jump table (JT) can be constructed that stores the instruction number for all labels. The gcd program contains four labels, and the preprocessor would construct the following jump table.

L1 2

L2 41

L3 30

L4 39
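Here is a minimal preprocessing sketch in Python (deliberately not one of the book's C, F#, or C# case-study implementations). The function and variable names are our own, and it assumes the WIC file has already been read into a list of text lines, one instruction per line.

    VARIABLE_OPS = {"get", "put", "push", "pop"}

    def preprocess(lines):
        """Build the instruction list, symbol table (ST), and jump table (JT)."""
        instructions, symbol_table, jump_table = [], {}, {}
        for line in lines:
            parts = line.strip().rstrip(";").split()
            if not parts:
                continue
            if len(parts) == 2 and parts[1] == "label":
                parts = ["label", parts[0]]              # "L1 label" -> uniform "label L1"
            op = parts[0]
            operand = parts[1] if len(parts) > 1 else None
            if op == "label":
                jump_table[operand] = len(instructions)  # instruction number of this label
            elif op in VARIABLE_OPS and not operand.isdigit():
                symbol_table.setdefault(operand, 0)      # every variable starts at zero
            instructions.append((op, operand))
        return instructions, symbol_table, jump_table

    # For the numbered gcd listing above this yields ST = {m:0, n:0, T1:0, T2:0}
    # and JT = {L1:2, L2:41, L3:30, L4:39}.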

Whenever an unconditional jump, such as j L4 in instruction 29, is encountered, the interpreter will look up the value for the label in the JT and put it into the PC. This will cause the interpreter to fetch the label instruction next and execute it. The label instruction is like a no-op, so the state of the interpreter is not changed in any way, other than to increment the program counter to the instruction following the label.

When a conditional jump is encountered, such as jf L2 in instruction 11, the following sequence of actions takes place:

The top of the runtime stack is popped; it should be a 0 or a 1.
If the value is 0 (false), then the jump specified by the jf (jump on false) instruction is taken; the PC is changed to the value obtained from the JT.
If the value is 1 (true), then the jump specified by the jf instruction is not taken; the PC is incremented to fetch the next instruction.

Exercise 2.5A: If a label instruction is encountered through normal sequential program execution it must be executed like a no-op. However, with a small change to the JT, "hitting a label" when it is encountered as a result of executing a jump can be avoided. Describe precisely what changes would have to be made.

So far there are three data structures: the list of instructions, the symbol table, and the jump table. A program counter is needed to hold the location of the next instruction to be fetched. Since WIC is a stack-based intermediate code, one additional data structure is needed: a runtime stack for integer values. This stack is created by the preprocessor and initialized to the empty stack.

Exercise 2.5B: Here is the WIC generated by the product program:

[[get,a],[get,b],[push,0],[pop,p],[L1,label],[push,b],[pop,T1],

[push,0],[pop,T2],[push,T1],[push,T2],sub,tstgt,[jf,L2],

[push,b],[pop,T1],[push,b],[pop,T2],[push,2],[pop,T3],[push,T2],

[push,T3],div,[pop,T2],[push,2],[pop,T3],[push,T2],[push,T3],

mult,[pop,T2],[push,T1],[push,T2],sub,[pop,T1],[push,0],

[pop,T2],[push,T1],[push,T2],sub,tstgt,[jf,L3],[push,p],

[pop,T1],[push,a],[pop,T2],[push,T1],[push,T2],add,[pop,p],

[j,L4],[L3,label],skip,[L4,label],[push,a],[pop,T1],[push,2],

[pop,T2],[push,T1],[push,T2],mult,[pop,a],[push,b],[pop,T1],

[push,2],[pop,T2],[push,T1],[push,T2],div,[pop,b],[j,L1],

[L2,label],[push,p],[pop,T1],[put,T1],halt]

Give the symbol table and the jump table that the preprocessor would create for this program.

The basic operation of the interpreter is:

(1) fetch the instruction specified by the PC
(2) increment the PC in anticipation of sequential instruction execution
(3) execute the current instruction; some instructions will change values in the ST or on the runtime stack, and the jump instructions may change the PC value
(4) if the instruction just executed is not halt, go to step (1)

If the original program does not contain an infinite loop, then the halt instruction should eventually be encountered. However, in Wren, as in other programming languages, it is possible to write a program that never terminates.
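Continuing the sketch started with preprocess() above, here is one possible fetch-execute loop in Python. It covers only the core instructions, omits the input-validation and error-reporting details described below, and guesses the names of the three test instructions not shown in the listings (tsteq, tstle, tstge); none of this is the book's code.

    def run(instructions, symbol_table, jump_table):
        """Fetch-execute loop for WIC (a sketch of the core instructions)."""
        stack, pc = [], 0
        while True:
            op, arg = instructions[pc]
            pc += 1                                     # assume sequential execution
            if op == "halt":
                print("program halted, the symbol table values are")
                for name, value in symbol_table.items():
                    print(f"{name} = {value}")
                return
            elif op == "get":
                symbol_table[arg] = int(input(f"enter {arg} > "))
            elif op == "put":
                print(f"{arg} = {symbol_table[arg]}")
            elif op == "push":                          # variable or integer literal
                stack.append(symbol_table[arg] if arg in symbol_table else int(arg))
            elif op == "pop":
                symbol_table[arg] = stack.pop()
            elif op in ("add", "sub", "mult", "div"):
                right, left = stack.pop(), stack.pop()  # right operand is on top
                if op == "add":
                    stack.append(left + right)
                elif op == "sub":
                    stack.append(left - right)
                elif op == "mult":
                    stack.append(left * right)
                else:                                   # div; the text asks for an error message
                    stack.append(left // right)         # plus the ST on divide-by-zero (omitted here)
            elif op.startswith("tst"):                  # compare the popped value with 0
                value = stack.pop()
                result = {"tsteq": value == 0, "tstne": value != 0,
                          "tstlt": value < 0,  "tstle": value <= 0,
                          "tstgt": value > 0,  "tstge": value >= 0}[op]
                stack.append(1 if result else 0)
            elif op == "j":
                pc = jump_table[arg]
            elif op == "jf":
                if stack.pop() == 0:
                    pc = jump_table[arg]
            elif op == "label":
                pass                                    # a no-op

Calling run(*preprocess(lines)) on the gcd listing above should prompt for m and n and then print their greatest common divisor.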


Exercise 2.5C: Propose a mechanism to possibly detect non-terminating programs. This is not an easy task. Just because a set of instructions has executed millions of times without encountering a halt does not necessarily mean the program will never halt. So don't propose an overly simplistic solution.

Exercise 2.5D: Study the Wren source code for the gcd program. What values of m and n could cause a set of instructions to be executed millions of times as part of normal program execution?

Consider the input/output instructions and the halt instruction. The get instruction has the format get <variable name>; this should result in the user being prompted to enter a value, such as get m resulting in

enter m >

If the value entered does not parse successfully to an integer, an error message should be displayed and the process repeated. The put <variable name> instruction, such as put m, should print out: m = <value currently in symbol table>

The halt instruction should not only stop program execution but should also print out the complete symbol table at the time the interpreter is halted. The push instruction, such as push m, transfers the current value of the specified variable from the symbol table to the top of the runtime stack. The pop instruction, such as pop n, removes the value from the top of the runtime stack and stores it in the specified variable in the symbol table. You are encouraged to develop your code incrementally and provide small WIC programs that you have entered by hand to test the instructions you have implemented so far. After implementing get, put, push, pop, and halt the following program will provide a good test:

get A

get B

push A

push B

pop A

pop B

put A

put B

halt

Running this program in an interpreter would produce something like this interaction:

enter A > 123

enter B > 456

A = 456

B = 123

program halted, the symbol table values are

A = 456

B = 123

The four arithmetic instructions all act in the same fashion:

Pop the right hand operand off the stack

Pop the left hand operand off the stack

Perform the operation


Push the result back onto the stack

Remember to check for a divide-by-zero error (the right operand for the div instruction is 0); in that case, stop the interpreter immediately, printing an appropriate error message and the current contents of the symbol table.

Exercise 2.5E: Write a short interactive program that inputs two numbers, say m and n, and then prints out m+n, m-n, m*n, and m/n.

The six test instructions operate in the following manner:

Pop the value off the top of the stack

Perform the indicated comparison of that value with 0

Either push a 0 or a 1 onto the top of the stack based on whether the comparison was false or true

Exercise 2.5F: Write a short interactive program that inputs two numbers, say m and n, and then prints out 0 for false or 1 for true to test the relationships m = n, m <> n, m < n, m <= n, m >= n, and m > n. The execution of the jump instructions and the label instruction is fully described earlier in this section. Here is a test program for these instructions; get A

get B

push A

push B

sub

tstlt

jf L1

push B

pop MAX

j L2

L1 label

push A

pop MAX

L2 label

put MAX

halt

The maximum of the values A and B should be output.

Exercise 2.5G: Write a non-interactive WIC program using a loop structure that prints out the values 1 through 10.

The final group of instructions is the logical instructions: and, or, and not. The first two pop the two operands off the stack (like the arithmetic instructions), perform the indicated operation, and push the result back onto the stack. The not instruction changes the value on top of the stack: a 0 is changed to a 1 and a 1 is changed to a 0. If values other than 0 or 1 are encountered by these instructions, the interpreter should be stopped and an appropriate error message, along with the current symbol table values, should be printed.