
Part III: A homemade compiler

The gap between theory and practice is one of life's recurrent disappointments. In predicting the weather, navigating by the stars, writing a fugue or fixing a carburetor, a grasp of theory is valuable, perhaps even essential, but never quite sufficient. It is the same in designing a compiler for a programming language. Grammatical principles guide the construction, but craft is needed too.

Parts I and II of this series focused on the theoretical view of language. It was shown that a language can be defined as a set of strings made up of symbols drawn from a specified alphabet. Rules of grammar determine which strings are sentences of the language. The grammar can be embodied in an abstract machine called a recognizer, which reads an input string and either accepts it as a sentence or rejects it as being ill-formed.

The recognition machines of linguistic theory are idealized devices, built out of pure thought stuff. Nevertheless, an ideal recognizer can be closely approximated in a program written for a real computer. The structure of the program follows directly from that of the underlying grammar. In some cases the transformation of a grammar into a recognizer can even be automated, a sure sign the process is well understood.

But a recognizer is not a compiler. A programmer wants more in the way of output than a pass-or-fail indication of whether a program has any syntax errors; what is wanted is object code, ready to execute on some machine. In the generation of that code a formal grammar offers only limited help. It contributes even less to the many housekeeping tasks of the compiler: recording the names of variables and procedures, allocating and reclaiming storage space in the target machine, calculating addresses, and so on.

In this third and final article I shall examine how theory is reduced to practice in the art of compiler writing. The discussion centers on a compiler for a toy language I call Dada. The full program listing for the compiler is available from the COMPUTER LANGUAGE Bulletin Board Service, CompuServe account, and Users Group (see page 4).

Dada grammar

In developing Dada my one aim was clarity. Speed, efficiency, and utility were all sacrificed to keep the compiler simple. The vocabulary of the language and many syntactic structures are borrowed from Pascal, another language whose design was guided by a concern for ease of compilation. To simplify further I left out many of Pascal's useful features, and I have consistently done things the easy way, not the best way. The result is a good-for-nothing language but one with a compiler whose functioning can be traced in detail.

Dada has only two data types: Boolean values (true or false) and integers. There are no characters, strings, arrays, records, sets or pointers, and there is no facility for defining new types. The if...then...else statement is available to control conditional execution, and a while statement is provided for loops; the case, for, repeat and goto statements are not included. The most serious deficiency of Dada is the absence of local variables and parameters passed to procedures. All variables are global.

When so much is omitted, what remains? The most conspicuous feature of a Dada program is its organization into blocks of statements delimited by the keywords begin and end. The program starts with a header, which is followed by declarations of variables and procedures and concludes with a statement block. A procedure has the same structure on a smaller scale, except that it cannot declare variables: a header is followed by further procedure declarations and a block. Procedures can be nested to any depth.

The grammar for Dada given in Table 1 defines the syntax of the language in 21 production rules. A rule such as Type ::= integer | boolean states that wherever the symbol Type is encountered, it can be replaced by either integer or boolean. Type and all the other symbols that appear on the left side of the rules are nonterminals, which represent categories of symbols; only the terminals, such as integer and boolean, appear in the actual program text. Any string of terminals that can be generated by applying the production rules is a valid Dada program.
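To make the process concrete, here is a derivation of my own devising (it does not appear in the article): the shortest possible Dada program emerges from the grammar of Table 1 by always expanding the leftmost nonterminal and choosing the empty alternative, written e, wherever the rules offer one.

Program
program Identifier ; Declarations Block .
program tiny ; Declarations Block .       { Identifier spelled out by the lexical rules }
program tiny ; Block .                    { Declarations ::= e }
program tiny ; Procedures Statements .
program tiny ; Statements .               { Procedures ::= e }
program tiny ; begin StmtList end .
program tiny ; begin end .                { StmtList ::= e }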

A program is written as a linear sequence of symbols, but under the surface it has a more elaborate structure. It is a tree: a hierarchy of nodes linked in parent-child relationships. The arrangement of the nodes and the pattern of links between them encode all the syntactic relations among the elements of the program.

A grammar is a compact way of representing an infinite family of trees. Nonterminal symbols label the interior nodes of each tree, with the special symbol Program at the root, and terminal symbols are hung on the leaf nodes. If the grammar is unambiguous, every program corresponds to exactly one tree. Parsing the program, or analyzing its syntax, is a matter of converting the one-dimensional string of symbols into a two-dimensional tree.

Grammars and parsers

The most useful grammars for programming languages are the ones called context-free grammars. They are powerful enough to describe most programming concepts and yet are simple enough to allow efficient compilation.

Dada grammar

Syntactic rules

Program      ::= program Identifier ; Declarations Block .
Declarations ::= var VarList | e
VarList      ::= Variable | Variable VarList
Variable     ::= Identifier : Type ;
Type         ::= integer | boolean
Block        ::= Procedures Statements
Procedures   ::= Procedure Procedures | e
Procedure    ::= procedure Identifier ; Block ;
Statements   ::= begin StmtList end
StmtList     ::= Statement | Statement ; StmtList | e
Statement    ::= AssignStmt | IfStmt | WhileStmt | ProcStmt | Statements
AssignStmt   ::= Identifier := Expression
IfStmt       ::= if Expression then Statement ElseClause
ElseClause   ::= else Statement | e
WhileStmt    ::= while Expression do Statement
ProcStmt     ::= Identifier
Expression   ::= SimpleExpr | SimpleExpr RelOp SimpleExpr
SimpleExpr   ::= Term | Term AddOp SimpleExpr
Term         ::= SignedFactor | SignedFactor MultOp Term
SignedFactor ::= Factor | Sign Factor
Factor       ::= Number | BooleanValue | Identifier | ( Expression )

Lexical rules

Identifier   ::= Letter (Letter | Digit)*
Letter       ::= A..Z | a..z
Digit        ::= 0..9
Number       ::= Digit Digit*
Sign         ::= + | - | not
RelOp        ::= = | > | < | <> | >=
AddOp        ::= + | - | or
MultOp       ::= * | / | mod | and
BooleanValue ::= True | False

Grammar for the language Dada is defined in 21 syntactic production rules and nine additional rules that establish lexical categories. In each rule the strings of symbols to the right of the "::=" sign represent all the possible expansions of the symbol to the left. The symbol "|" is pronounced "or" and separates alternative productions. In the lexical rules an asterisk signifies repetition zero or more times.

Table 1.

The definitive property of a context-free grammar is that every production rule has just one symbol on the left side of the "::=" sign. The grammar for Dada satisfies this condition, and so it is context-free. (The language itself is not entirely context-free, but for the time being I shall discuss it as if it were.)

The recognizing machine for a context-free language is a pushdown automaton, a computer in which the only available storage is a stack and the only accessible item on the stack is the one at the top. Linguistic theory offers the assurance that every context-free language can be recognized and parsed by some pushdown automaton. On the other hand, the theory does not promise that the task can be done efficiently enough to be completed in a reasonable period (say a human lifetime).

Efficiency is in question because a parser working by the most general method has a vast array of choices. It can attempt to match any production rule to any string of symbols anywhere in the source text; if the match fails, it can go on to any other combination of rules and strings. In the worst case the process does not end until all possible combinations have been tried.

The way to avoid this combinatorial explosion is to limit the parser's freedom of choice. One must guarantee that if the input can be parsed at all, the correct parse will be found by reading the symbols and applying the rules in a fixed sequence. In order to make that guarantee, further constraints must be put on the grammar. It must not only be context-free but must also satisfy additional conditions.

Left and right, top and bottom

All practical parsers scan their input from left to right. A limited amount of backtracking or looking ahead may be allowed, but no symbol is added to the parse tree until all the preceding symbols have been parsed. Thus when the end of the input is reached, the tree must represent a complete and valid program.

The subclasses of context-free parsers are distinguished by the sequence in which nodes are added to the tree. The main division is between top-down and bottom-up parsing methods. (The terms employ the usual inverted frame of reference for tree structures, putting the root at the top and the leaves at the bottom.) A top-down parser begins with the root node and applies production rules to expand each interior node until the leaves are reached; if the terminal symbols hung on the leaves match the complete input string, the program has been parsed. Bottom-up parsing begins with the leaves and tries to build a superstructure of higher level nodes; the parse is successful if the root node is reached.

There are still more choices to make. At any node having more than one child, the parser must decide which branch of the tree to explore first. In applying the production Term ::= Factor MultOp Term a leftmost-first parser would expand Factor and then MultOp and finally Term, whereas a rightmost-first parser would deal with the symbols in the opposite order. Current fashion favors bottom-up, rightmost-first parsers. They accept a large class of grammars and tend to be highly efficient. Furthermore, widely available programs automatically generate a parser of this type from a grammar specification. The best known of the programs is Yacc (Yet Another Compiler Compiler), written by Stephen C. Johnson of AT&T Bell Laboratories.

Although creating a bottom-up, rightmost-first parser is easy with power tools such as Yacc, it is difficult to do by hand. In writing a compiler for Dada I have therefore taken the opposite path of top-down, leftmost-first parsing. The technique I chose is the one known as recursive descent. It is by no means the most efficient method, but it offers a remarkable transparency. You can see straight through the structure of the parser to the underlying grammar.

If a grammar is to be suitable for recursive-descent parsing, the production rules must be free of left-recursion. Because the parser follows a leftmost-first convention, a rule of the form S ::= S x leads into an infinite loop. The parser creates a node labeled S and then, expanding the leftmost symbol of the production, appends a child node also labeled S. It then expands the leftmost symbol and continues adding S nodes indefinitely.

Eliminating left-recursion often entails introducing new nonterminal symbols and epsilon productions, which yield the empty string. The grammar for Dada given in Table 1 is written without left-recursion. The symbol on the left side of a rule is never the first symbol on the right side.
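As a small illustration of the rewriting (my example, not one from the article), a rule that generates a string of x's by left-recursion can be recast with a new nonterminal and an epsilon production:

S ::= S x | y          { left-recursive: a leftmost-first parser loops }

becomes

S ::= y T
T ::= x T | e          { T carries the repetition; e yields the empty string }

Both versions generate the same strings (y, y x, y x x, and so on), but only the second can be parsed by recursive descent.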

Structure of a compiler

Translation, which is the basic task of a compiler, cannot be done by merely looking up synonyms in a bilingual dictionary. The translator must take in the source text, build an internal model of its structure and meaning, then create a new text in the target language based on the model.

Most compilers break the process into at least four phases. The first phase is lexical analysis, which assembles the source text into the fundamental units called tokens. In Dada tokens include numbers, the names of variables and procedures, keywords such as begin and end, and various signs and marks of punctuation. The lexical analyzer, or scanner, reads a stream of characters and emits a stream of tokens.

Parsing is the next phase. The parser's input is a stream of tokens and its output is a tree encoding the syntactic structure of the program. The tree organizes the tokens into higher level groups, such as expressions, statements, and procedures. For example, the sequence of tokens X + 1 would yield a node labeled + with two children labeled X and 1.

Semantic analysis, the third phase, assigns meaning to the nodes of the tree. But what does "meaning" mean in a computer that lacks consciousness and the capacity for understanding? A pragmatic answer will do for present purposes: the meaning of a program is the series of actions elicited when the program is executed. The task of the semantic analyzer is to interpret the abstract symbols manipulated by the parser as a prescription for action.

In many cases the semantic analysis is nothing more than a change in point of view. When the parser reads the expression X + 1, it installs three tokens in nodes of the parse tree without attaching any significance to them. They are mere markers, empty of content. The semantic analyzer views the same tokens as instructions to add 1 to the value of the variable X. No change is made to the structure of the tree since the programming language itself can express this meaning as well as any other notation would.

One job often given to the semantic analyzer is type checking. If X is a floating-point variable and the literal value 1 represents an integer, the semantic analyzer would be expected to detect the type mismatch. In many languages it would automatically insert an additional node in the tree calling for a type conversion.

The fourth phase of compiler operation is code generation. It is here that the bilingual dictionary (a simple table of substitutions) becomes useful. The code generator traverses the parse tree, issuing target-language instructions at each node. The order in which the nodes are visited ensures that the instructions are generated in the correct sequence. The traversal of the tree proceeds left-to-right and depth-first, visiting all the children of a given node, then the node itself, then its siblings, then its parent, and so on.

In traversing a tree for the expression X + 1 the leaf node for X would be reached first, and the code generator would issue instructions to load the current value of X into a register. Next 1 would be translated, perhaps by loading the value into another register. Finally the code generator would proceed to the parent node, find the plus sign, and produce instructions to add the two registers. The effect is to convert infix notation (where the operator appears between the operands) to postfix notation (where the operator follows the operands).
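The Dada compiler, as will be seen, never builds the tree explicitly, but the order of the traversal is easy to set down in Pascal. The sketch that follows is mine, with an invented node type; it is not part of the actual compiler:

type
  NodePtr  = ^TreeNode;
  TreeNode = record
    Token : string[8];        { the symbol stored at this node }
    Left, Right : NodePtr     { children; nil at the leaves }
  end;

procedure GenCode(p : NodePtr);
begin
  if p <> nil then
  begin
    GenCode(p^.Left);         { translate the left operand first }
    GenCode(p^.Right);        { then the right operand }
    Write(p^.Token, ' ')      { the operator itself comes last: postfix order }
  end
end;

Applied to the tree for X + 1, GenCode prints X 1 +, exactly the postfix order described above.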

Although the translations made at some nodes are straightforward, there is more for the code generator to do. It must choose the registers to be used in each operation, and it must allocate space in memory for variables. With structured variables (arrays, records, and the like) the allocation algorithms can become fairly elaborate, and local variables complicate the process still further. Control-of-flow statements bring another burden: at any branch point in the program logic, the code generator must calculate the target address for a jump instruction.

One way to cope with the complexity of code generation is to subdivide the task. Many compilers generate a form of intermediate code, made up of instructions for a fictitious machine with a regular architecture.

[Figure 1: the statement X := X + 1 traced through the four phases, from characters to the tokens Ident AssignOp Ident Plus Number, to a parse tree, to nodes labeled Integer by semantic analysis, to the Forth output X @ 1 + X !]

Phases in the operation of a compiler transform source text into object code. Lexical analysis breaks the stream of text into the fundamental units called tokens. Parsing assembles the tokens into a tree structure. Semantic analysis attributes meaning to the symbols of the nodes of the tree. Finally code generation creates a program of equivalent meaning in the target language (in this case Forth).

Figure 1.

A subsequent phase translates the intermediate code into machine-language instructions, or a small interpreter executes the intermediate code on a real processor. An example of this strategy is the P-system developed at the University of California at San Diego.

The final phase of compiling is optimization. The output of a typical code generator is very inefficient, and there is no limit to the effort that can be put into improving it. When the job is taken seriously, the optimizer may well be the largest component of the compiler. The Dada compiler goes to the opposite extreme: the optimizer is omitted entirely.

Passes and modules

Perhaps the most obvious way of organizing a compiler is to make each phase an independent module or even a separate program. The scanner reads the source text (a file of characters) and writes a file of tokens. The parser reads the file of tokens and produces a third file of linked records that encode the structure of the parse tree. The subsequent phases also communicate by means of temporary files. Each trip through the program is called a pass.

For the Dada compiler I have taken a different approach, in which all the phases are condensed into a single pass. The modules are organized as a bucket brigade; they process small units of program text and pass them on to the next module in line. The scanner, for example, reads just enough characters to assemble a single token, and then hands it on to the parser.

A major advantage of the one-pass compiler is that no explicit representation of the parse tree is needed. The operation of the parser itself can be interpreted as a traversal of the tree. Each time a procedure is called, the parser effectively moves down one level in the tree, to a child node; when the procedure returns, the parser climbs back up the tree to the parent node. Semantic analysis and code generation can be done during this traversal rather than in separate passes over the tree.

The scanner, parser, semantic analyzer, code generator, and optimizer are the "mainstream" modules of a compiler. Their sequential actions transform the program text into object code. There are other modules as well. They do not act directly on the stream of text, but they are no less essential.

The importance of an error-handling routine should be readily apparent. Most of the time, for most programmers, what a compiler produces is not object code but error messages. The error handler in the Dada compiler is the simplest one I could devise. When it detects an error, it prints a message and halts the program.
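The routine itself is not reproduced in the listings; a minimal sketch of the scheme just described might read as follows, where ErrCode and the Message array are assumed names rather than the ones in the actual program:

procedure Error(Code : ErrCode);
begin
  WriteLn('Error: ', Message[Code]);   { report what was expected }
  Halt                                 { abandon the compilation at once }
end;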

A run-time library is needed to complement the facilities of most programming languages. Input and output statements, for example, require dozens or even hundreds of machine-language instructions; rather than generate all this code for each statement, the compiler calls a preassembled routine. The library procedures can be appended to the object code by the compiler itself or by a separate linker.

I have left to the end of this catalogue the data structure at the heart of many compiler operations: the symbol table. It is the deus ex machina that enables the compiler to enforce rules the grammar cannot describe.

A context-free grammar can balance pairs of symbols, such as nested parentheses, but it cannot match arbitrarily long strings of symbols arbitrarily far apart in the source text. In particular, it cannot match a reference to a variable with the corresponding declaration. Verifying declarations requires a context-sensitive grammar, with multiple symbols on the left side of production rules. Because Dada requires variables to be declared, it is not a fully context-free language.

Rather than complicate the grammar with context-sensitive productions, the almost universal practice in compiler writing is to collect context-sensitive information in a symbol table. The table records information about all the identifiers, or named elements, of a program: variables, procedures, and so on. The parser consults the table to make certain all identifiers have been properly declared. The semantic analyzer relies on it for type checking, and the code generator finds addresses there.

[Figure 2: state-transition diagram of the scanner, with a state for each character that can begin a token, such as LeftParen and RightParen]

Lexical analyzer, or scanner, is constructed as a finite-state automaton. Each character that can begin a token leads to a distinct state. States enclosed by a double line are accepting states, which the scanner enters when it has recognized a complete token.

Figure 2.


[Figure 3, first part: syntax diagrams for Program, Declarations, Procedure, and Statement]

Figure 3. (Continued on following page)

Useful as the symbol table is, from a theoretical and aesthetic point of view it rather mars the elegance of a compiler's design. Without the symbol table the compiler is a purely sequential machine, climbing up and down the parse tree one node at a time and carrying nothing with it as it moves. All the information needed is encoded in the tree itself. The symbol table introduces new linkages between arbitrarily distant nodes. Connecting declarations to later invocations obscures the spare, planar form of the parse tree behind a topological knot of extraneous links.


The Dada compiler

Writing a compiler is a trilingual project. It involves a source language, a target language, and an implementation language (the one in which the compiler itself is written). In this case the source language is Dada and the implementation language is Borland International's Turbo Pascal. The selection of a target language raises difficult questions. If assembly or machine language were chosen, the code generator would overwhelm the rest of the program. Furthermore, the output of the compiler would be intelligible only to those who happen to know the instruction set of the processor chosen.

Intermediate code is easier to produce and understand, but it can be executed only with an interpreter. Thus it would be awkward to demonstrate that the compiler actually works. The compromise I have reached is to generate a form of intermediate code recognized by a familiar and widely available interpreter. The output of the Dada compiler consists of "words" in the programming language Forth. This intermediate code should be accepted (perhaps with a little fiddling) by any Forth interpreter. Since there are now processors for which Forth is the assembly language, the Dada compiler might also be considered a native-code compiler for one of those machines.

The place to begin in describing the compiler is the scanner, or lexical analyzer. Figure 2 is a diagram of its structure. It is a finite-state automaton, a logic network without auxiliary memory, with an accepting state for each token or class of tokens. The automaton is implemented in software by means of a case statement with a clause for each character that can begin a token.

and ";" the clause is simple: the token is

recognized immediately. A few othertokens require two steps, for instance to

distinguish a colon from the assignmentoperator ": = ". Numbers and identifiersin Dada can be of unlimited length

(although they are truncated in the symboltable). A number is any sequence of digits; an identifier is any sequence of lettersand digits that begins with a letter. These

The definition of a Dada identifier also encompasses all the keywords of the language, such as if, begin, and end. The scanner must recognize these words as individual tokens and not confuse them with named variables or procedures. One way of identifying them is by creating a tree of states for each keyword. In recognizing begin the scanner would pass through states for b, e, g, and so on. There is an easier way. The scanner initially classifies any string of letters as an identifier and then checks it against a list of keywords; only if a match is not found is the code for an identifier returned.

Although lexical analysis is the simplest phase of compilation, the design of the scanner raises questions that are not answered in the grammar. Is the language to be case-sensitive, or do FOO, Foo, and foo all refer to the same object? How are whitespace characters, such as blanks, tabs, and carriage returns, to be treated? What about comments in the source text?

It is possible to be vague about these issues in designing a language but not in writing a compiler. I have generally followed the example of Pascal. The scanner converts all alphabetic characters to upper case, and so case is ignored. Whitespace is allowed between any tokens but is required only to separate identifiers, keywords, and numbers. A comment is allowed anywhere a whitespace character could appear. A left brace signals the start of a comment, and all subsequent characters are ignored until a right brace is found.

[Figure 3, second part: syntax diagrams for Expression, SimpleExpr, Term, SignedFactor, and Factor]

Syntax diagrams derived from the grammar define the flow of control through a Dada parser. A program must begin with the keyword program followed by an identifier and a semicolon. Control then passes to the Declarations diagram and the Block diagram. Any sequence of tokens corresponding to a path through the diagrams is a valid Dada program.

Figure 3. (Continued from preceding page)


procedure ParseProgram;

  procedure ParseVariables;
  begin
    if TK.Code = VarSym then
    begin
      GetTK;
      repeat
        if TK.Code <> Ident then Error(XIdent); GetTK;
        if TK.Code <> Colon then Error(XColon); GetTK;
        if not (TK.Code in TypeSet) then Error(XType); GetTK;
        if TK.Code <> Semi then Error(XSemi); GetTK;
      until (TK.Code in [ProcSym, BeginSym]);
    end;
  end;

  procedure ParseBlock;

    procedure ParseStatement;
    var IdentPtr : SymPtr;

      procedure ParseExpression;

        procedure ParseSimpleExpr;

          procedure ParseTerm;

            procedure ParseSignedFactor;

              procedure ParseFactor;
              begin { ParseFactor }
                case TK.Code of
                  TrueSym, FalseSym, Number, Ident : GetTK;
                  LeftParen : begin
                      GetTK; ParseExpression;
                      if TK.Code <> RightParen then Error(XParen);
                      GetTK;
                    end;
                else Error(XFactor);
                end;
              end;

            begin { ParseSignedFactor }
              if (TK.Code in [Plus, Minus, NotSym]) then GetTK;
              ParseFactor;
            end;

          begin { ParseTerm }
            ParseSignedFactor;
            if (TK.Code in MultOpSet) then
              begin GetTK; ParseTerm; end;
          end;

        begin { ParseSimpleExpr }
          ParseTerm;
          if (TK.Code in AddOpSet) then
            begin GetTK; ParseSimpleExpr; end;
        end;

Listing 1. (Continued on following page)


The parser

Building a parser is the best way to appreciate the worth of a formal grammar. Comparing the routines of the parser with the production rules of the grammar reveals an obvious one-to-one correspondence.

The transformation of a grammar into a parser can be done directly; that is what Yacc does. When working by hand, however, it is easier to create a series of syntax diagrams based on the grammar and then to build the parser from the diagrams. Where a production rule calls for generating a series of symbols, they are written down in sequence and connected by arrows. Alternative productions correspond to branches in the diagram, and a recursive rule (one that invokes itself) implies a diagram with a loop. A set of syntax diagrams for Dada is shown in Figure 3.

Each of the diagrams is represented by a procedure in the parser. At the top level the procedure ParseProgram expects to see the keyword program followed by an identifier and a semicolon. It then calls ParseVariables to handle any declarations of variables; when that procedure returns, a semicolon is expected. Next there is a call to ParseBlock, which analyzes the rest of the program. When ParseBlock returns, all that remains is to check for the concluding period.

Some of the other diagrams are more elaborate; ParseBlock serves as a good example. It first checks for the keyword procedure; if it is present, ParseBlock reads an identifier and a semicolon and then calls itself recursively. The new invocation of ParseBlock immediately looks for procedure again, so that nested declarations can be accommodated. Eventually, however, each invocation of ParseBlock must find not another procedure but the keyword begin. ParseStatement is then called repeatedly until the token end is encountered.

The organization of the parsing procedures follows directly from the structure of the language. A Dada program consists of variable declarations and a block; a block consists of procedure declarations and a list of statements; a procedure declaration consists of subsidiary procedures and another block. Accordingly, ParseProgram is the outermost routine, and both ParseVariables and ParseBlock are nested within it. ParseBlock in turn encloses ParseStatement, and there is a further hierarchy of nested routines from ParseExpression down to ParseFactor.

All the routines work by a common mechanism. At any moment the parser can take only two possible actions. If a terminal symbol is expected, the current token is examined. If it is one allowed by the grammar at this point in the program, the scanner is called to get the next token and the parser moves on; otherwise an error message is issued. If the symbol expected is a nonterminal, the parser merely calls the appropriate routine.
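Listing 1 spells this pattern out inline at every step; the recurring idiom could be collected into a helper such as the following (an abbreviation of mine, not a routine in the actual compiler):

procedure Expect(Sym : TokenCode; X : ErrCode);
begin
  if TK.Code <> Sym then Error(X);   { reject a token the grammar forbids here }
  GetTK                              { accept it and fetch the next one }
end;

With it, a test like if TK.Code <> Semi then Error(XSemi); GetTK; collapses to Expect(Semi, XSemi).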

If a parser for a context-free language is a pushdown automaton, where in this scheme of operations is the pushdown stack? It is present but well hidden. It is the return-address stack maintained by any executing Pascal program, in this case the Dada compiler itself. One of the advantages of the recursive-descent technique is that it relieves the programmer of responsibility for managing the stack.

Table 2 shows the parser in action, analyzing the expression 3*(4+5). According to the syntax diagrams, an expression consists either of a simple expression or of two simple expressions separated by a relational operator. Thus to identify the input as an expression, the first step is to look for a simple expression. A simple expression in turn begins with a term, which begins with a signed factor, which consists of an optional sign followed by a factor. Finally a factor, at the bottom of the hierarchy of program structures, can be a number, a Boolean value, an identifier, or a parenthesized expression.

Routines to recognize each of these elements are activated in a cascade of procedure calls. ParseExpression calls ParseSimpleExpr, which calls ParseTerm, which calls ParseSignedFactor, which calls ParseFactor. At last, after five nested calls, the first token of the input (the number 3) is identified as a factor.

        begin { ParseExpression }
          ParseSimpleExpr;
          if (TK.Code in RelOpSet) then
            begin GetTK; ParseSimpleExpr; end;
        end;

      begin { ParseStatement }
        case TK.Code of
          BeginSym : begin
              GetTK;
              while TK.Code <> EndSym do
              begin
                ParseStatement;
                if not (TK.Code in [Semi, EndSym]) then Error(XSemEnd);
                if TK.Code = Semi then GetTK;
              end;
              GetTK;
            end;
          IfSym : begin
              GetTK; ParseExpression;
              if TK.Code <> ThenSym then Error(XThen); GetTK;
              ParseStatement;
              if TK.Code = ElseSym then
                begin GetTK; ParseStatement; end;
            end;
          WhileSym : begin
              GetTK; ParseExpression;
              if TK.Code <> DoSym then Error(XDo); GetTK;
              ParseStatement;
            end;
          Ident : begin
              IdentPtr := Find(TK.Name);
              if IdentPtr^.Class = Variable then
              begin
                GetTK; if TK.Code <> AssignOp then Error(XAssgn);
                GetTK; ParseExpression;
              end
              else GetTK;
            end;
        else Error(XStmt);
        end;
      end;

    begin { ParseBlock }
      while TK.Code = ProcSym do
      begin
        GetTK; if TK.Code <> Ident then Error(XIdent);
        GetTK; if TK.Code <> Semi then Error(XSemi);
        GetTK; ParseBlock;
        if TK.Code <> Semi then Error(XSemi); GetTK;
      end;
      if TK.Code <> BeginSym then Error(XBegin); GetTK;
      while TK.Code <> EndSym do
      begin

Listing 1. (Continued on following page)


As each of these calls is made, a record of the calling routine is pushed onto the return-address stack. When ParseFactor completes its work, therefore, it returns to the point from which the call was made in ParseSignedFactor. That routine is also finished, and it returns to ParseTerm. There is now a choice between two paths of execution. If the next token is anything other than a multiplying operator, ParseTerm simply exits. In fact the next token is "*", and so the other path is taken. The token is consumed and ParseTerm is called again, recursively, to initiate another cascade back down to ParseFactor.

The gyrations of the parsing routines eventually trace out the tree structure for the expression. At the root of the tree is the "*" sign, and under it are nodes for "3" and for "(4 + 5)." The latter node has another subtree under it, with leaf nodes for the factors "4" and "5." The parser has no need to keep a diagram of these relations because they are effectively recorded in the sequence of return addresses on the stack. Each call to a routine creates a new child node on the tree, whose parent is indicated by the address at the top of the stack. No matter how deeply nested the routines become, the parser can always find its way back to the correct parent node by following the trail of stacked addresses.

Ambiguities

The Dada compiler employs a form of the recursive-descent technique known as predictive parsing, which forbids looking ahead at the next input symbol or backtracking to earlier symbols. In all circumstances the parser must be able to predict, based on the current token alone, which of the paths available to it should be followed.

If a language is to be suitable for predictive parsing, the grammar and the syntax diagrams must have a particular form. At any point where a diagram branches, the first token expected along each branch must be unique. Consider the syntax diagram for a Dada statement. There are five branches, corresponding to an assignment statement, an if statement, a while statement, a procedure call, and a compound statement bracketed by begin and end. Three of the branches cause no difficulty. If the first token in a statement is if, while, or begin, the parser knows which path to follow.

        ParseStatement;
        if not (TK.Code in [Semi, EndSym]) then Error(XSemEnd);
        if TK.Code = Semi then GetTK;
      end;
      GetTK;
    end;

begin { ParseProgram }
  if TK.Code <> PgmSym then Error(XPgm); GetTK;
  if TK.Code <> Ident then Error(XIdent); GetTK;
  if TK.Code <> Semi then Error(XSemi); GetTK;
  ParseVariables; ParseBlock;
  if TK.Code <> Dot then Error(XDot);
end;

Dada parser has a procedure for each of the syntax diagrams in Figure 3. The nesting of the procedures reflects the structure of the grammar in that a statement is subsidiary to a block, an expression is subsidiary to a statement, and so on. In the form shown here the parser acts as a recognizer, a machine that accepts any valid stream of tokens but ignores their semantic content. In the complete compiler statements for type checking and for code generation are interleaved with the parsing routines.

Listing 1. (Continued from preceding page)

The assignment statement and the procedure call, however, both start with an identifier. Thus it seems Dada does not qualify for predictive parsing.

Three solutions to this problem present themselves. One could change the language, perhaps requiring a procedure to be invoked by the keyword call, as in PL/I. One could change the grammar, introducing distinct tokens for procedure names and variable names. Or one could change the parser, allowing it to look ahead to see if the next token is an assignment operator.

I have adopted a fourth solution: cheating. At the time an identifier is declared, a symbol-table entry is created, recording (among other things) whether the identifier names a variable or a procedure. When the same identifier appears again, it is a simple matter to check its classification.

A parser can construct a unique tree for a program only if the grammar is unambiguous. What assurance can be given that the Dada grammar meets this condition? As it happens, the grammar does include an ambiguity, common to many programming languages, known as the dangling-else problem. In a nested conditional statement, such as if A then if B then C else D, the grammar does not state which then the else is to be associated with. Is D executed when A is false or when B is false? The two possibilities correspond to different tree structures.

The dangling-else problem is solved by making an arbitrary choice. By convention each else is associated with the closest preceding then that does not already have an else. The parser needs no modification to enforce this rule. The sequence of procedure calls and returns always generates the correct tree.

The grammar for Dada fails to settle a number of other issues that must be resolved in writing the compiler. The grammar implies that procedures must be declared, but it does not say they have to be declared before they are first called. The decision to distinguish between assignment and call statements by consulting the symbol table effectively requires the declaration to come first.

What about a procedure called from within its own declaration, that is, a recursive procedure? Nothing in the grammar forbids recursion, and the Dada compiler will accept recursive calls without complaint. The trouble is, a Forth interpreter will not execute the recursive definition; it causes a run-time error.

Another issue the grammar does not address is the redefinition of keywords. What if some programmer wants to declare a variable named if or a procedure named begin? Some languages allow such shenanigans and some do not. Dada is in the latter category because of the way keywords are detected.

Semantics

The scanner and the parser together make a recognizer for Dada programs. They accept only syntactically valid strings of symbols. Unfortunately, the category of syntactically valid strings includes many that are semantically deviant. The expression X + Y has impeccable syntax, but if X is an integer and Y is a Boolean variable, it has no clear meaning. What is the value of 8 + True?

The type checking done in the course of semantic analysis improves the odds that a program will be accepted only if it makes some sense. In the Dada compiler there is no separate module for type checking; it is done by statements intercalated into the parser. Most of them refer to the symbol table, which should therefore be discussed in greater detail.

The symbol table is a linked list of records with five fields. The name field stores the first 31 characters of an identifier's name (an arbitrary length limit). The class field indicates whether an identifier is a variable or a procedure, and the type field further classifies variables as integer and Boolean values. A scope field holds a number that represents the nesting depth of procedures. The final field is a pointer to the next record.

Symbol-table entries are created by a procedure named Declare, which is called with an identifier as one of its arguments. Declare checks to see if the identifier is already present in the table; if not, it is installed at the head of the list. The complementary function, Find, searches the list for an identifier and returns a pointer to it.
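From that description the table can be sketched in Pascal; the type and field names below are my guesses, since the article prints only the parser:

type
  SymName  = string[31];
  SymClass = (Variable, Proc);
  SymType  = (IntType, BoolType);
  SymPtr   = ^SymRec;
  SymRec   = record
    Name  : SymName;     { first 31 characters of the identifier }
    Class : SymClass;    { variable or procedure }
    Typ   : SymType;     { integer or Boolean, for variables }
    Scope : integer;     { nesting depth at the point of declaration }
    Next  : SymPtr       { link to the next record in the list }
  end;

var SymTab : SymPtr;     { head of the list; Declare installs new entries here }

function Find(Name : SymName) : SymPtr;
var p : SymPtr;
begin
  p := SymTab;
  while (p <> nil) and (p^.Name <> Name) do
    p := p^.Next;        { linear search down the linked list }
  Find := p              { nil if the identifier was never declared }
end;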

The one subtlety in the handling of the symbol table has to do with nested procedures, which are intended to be local to the block in which they are declared. In other words, a procedure declared inside another procedure should "disappear" when the compiler passes beyond the end of the enclosing block. This is accomplished by monitoring the scope field in each symbol-table record. A procedure named Blot, called at the end of a block, removes all entries that can never legally be accessed again.
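If, as assumed in the sketch above, Declare installs each new entry at the head of the list, the most deeply nested declarations are always in front, and a version of Blot (again with assumed names, not the printed code) need only peel them off:

procedure Blot(Depth : integer);
begin
  while (SymTab <> nil) and (SymTab^.Scope > Depth) do
    SymTab := SymTab^.Next    { entries from the closed block can never be
                                referenced again, so they are simply dropped }
end;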

Type checking seems simple in principle, but it can become surprisingly messy. It is easy enough to verify the semantics of a statement such as X := Y: just look in the symbol table to make sure X and Y are of the same type. Where difficulties arise is in assigning types to expressions. In analyzing X + Y*Z the types of the variables must be checked, then a result type must be determined according to combinatorial rules.

Such rules exist, but they do not form a coherent system. In most cases the operands and the result are all expected to be of the same type, but relational operations are an exception: they always yield a Boolean result. Whether logical operators (and, or, not, and so forth) can be applied to numerical values is a matter of opinion and taste. With languages that have a wider selection of data types the issues get stickier. Can strings be compared or added? What about records and pointers?

Active procedure     Remaining input     Action

Expression           3 * (4 + 5)
SimpleExpr           3 * (4 + 5)
Term                 3 * (4 + 5)
SignedFactor         3 * (4 + 5)
Factor               3 * (4 + 5)         Gen('3')
SignedFactor         * (4 + 5)
Term                 * (4 + 5)           HoldMultOp := '*'
Term                 (4 + 5)
SignedFactor         (4 + 5)
Factor               (4 + 5)
Expression           4 + 5)
SimpleExpr           4 + 5)
Term                 4 + 5)
SignedFactor         4 + 5)
Factor               4 + 5)              Gen('4')
SignedFactor         + 5)
Term                 + 5)
SimpleExpr           + 5)                HoldAddOp := '+'
SimpleExpr           5)
Term                 5)
SignedFactor         5)
Factor               5)                  Gen('5')
SignedFactor         )
Term                 )
SimpleExpr           )
SimpleExpr           )                   Gen(HoldAddOp)  { '+' }
Expression           )
Factor               )
SignedFactor
Term
Term                                     Gen(HoldMultOp)  { '*' }
SimpleExpr
Expression

Expression is parsed by a series of nested procedure calls. The process begins when ParseExpression calls ParseSimpleExpr, which in turn calls ParseTerm; ultimately, when ParseFactor is called, it recognizes and consumes the first token of the input. Actions taken by the compiler include calls to Gen, the code generator. The output resulting from the calls is the sequence of symbols 3 4 5 + *.

Table 2.


The type checking done in the Dada compiler is not rigorous or exhaustive. Each of the routines in the expression-parsing hierarchy is defined as a function that returns the type of the structure it has parsed. Hence types propagate upward in the tree from the leaf nodes to the main expression node.
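In outline, using the SymType enumeration sketched earlier, ParseFactor might be recast as a function like this (my reconstruction, not the printed code):

function ParseFactor : SymType;      { returns the type of the factor parsed }
begin
  case TK.Code of
    Number            : begin ParseFactor := IntType; GetTK; end;
    TrueSym, FalseSym : begin ParseFactor := BoolType; GetTK; end;
    Ident             : begin
                          ParseFactor := Find(TK.Name)^.Typ;  { consult the symbol table }
                          GetTK;
                        end;
    LeftParen         : begin
                          GetTK;
                          ParseFactor := ParseExpression;     { type of the inner expression }
                          if TK.Code <> RightParen then Error(XParen);
                          GetTK;
                        end;
  else Error(XFactor);
  end;
end;

ParseTerm, ParseSimpleExpr, and the rest would pass the type upward in the same way, checking that the operands of each operator agree.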

The code generator

The last major subunit of the Dada compiler, the code generator, is deceptively simple. It is deceptive because all of the difficult tasks, such as calculating addresses and allocating storage, are done by the Forth target system.

program Hailstone;
var
  N : integer;
  Odd : boolean;

procedure NextTerm;

  procedure CheckOdd;
  begin
    if (N mod 2 = 0) then
      Odd := False
    else
      Odd := True
  end;

  procedure DownStep;
  begin
    N := N/2
  end;

  procedure UpStep;
  begin
    N := 3*N+1
  end;

begin { NextTerm }
  CheckOdd;
  if Odd then UpStep else DownStep
end;

begin { main program }
  ReadLn N;
  while N > 1 do
  begin
    WriteLn N;
    NextTerm
  end;
end.

Sample Dada program tests the operation of the compiler. The algorithm has been broken up into several small procedures to exercise the parsing routines.

Listing 2.

In a pinch, a single Write statement would make a serviceable code generator for a Dada-to-Forth translator. I have added a few lines of code to format the output in standard Forth "screens" of 1,024 characters. A subsidiary routine adds a small run-time library to each program. The main components of the library are Read and Write procedures that provide rudimentary communication with the console.

The real work of translating Dada into Forth is done not by the code-generating routine itself but by the placement of calls to the routine throughout the parser. The calls cause the code generator to define a Forth word for each procedure in the Dada program and another word for the main block.

A Forth definition begins with a colon and the name of the word, followed by the body of the definition and a terminating semicolon. Any previously defined Forth word can be invoked by giving its name. All operations are carried out on values at the top of a stack. Statements and expressions are described in postfix notation.

The operating principles of the code generator can be illustrated by tracing the translation of a small procedure. The body of the procedure consists of a single statement whose effect is to halve the value of a previously declared variable N. The Dada statements read as follows:

procedure DownStep;
begin
  N := N/2
end;

When any procedure declaration is reached in a Dada program, the active routine in the compiler is ParseBlock. From the initial keyword it deduces that what follows is a procedure. The identifier DownStep is then stored in a local variable (HoldID), but no code is generated. The parser cannot yet know whether there are further, nested procedure declarations. If there are, object code for them will have to be created first.

When the keyword begin is parsed, ParseBlock is called recursively with the contents of HoldID passed as a parameter. The body of DownStep has now been reached, and so code generation can begin. A colon is written to the output file, followed by the passed parameter (namely, the string DownStep). Thus a new Forth definition is begun.

ParseBlock now calls ParseStatement, the next routine in the hierarchy. The first token of the remaining input is found to be an identifier, and a search of the symbol table reveals it represents a variable; the statement must therefore be an assignment. The string N is stored in another instance of the local variable HoldID, and ParseExpression is called to deal with the right side of the statement.

ParseExpression sets in motion the cascade of expression-handling routines, which ultimately identify the next token, N, as a factor. At this point the first instructions for the body of the procedure can be issued. Since the factor is a reference to a variable, the appropriate action is to copy the value of the variable from its location in memory onto the top of the stack. The code for doing this consists of the variable name N followed by the Forth word "@," pronounced "fetch." Hence the symbols N @ are appended to the output file.

The compiler now makes its way back upward through the series of expression-handling functions as far as ParseTerm, which recognizes the division sign as a member of the set of multiplying operators. Code for division is not generated, however, since expressions in Forth must be given in postfix form. The "/" token is merely stashed in another local variable, HoldMultOp.

ParseTerm now calls itself, so that two instances of the function are active, and the compiler thereupon descends again to the level of ParseFactor. Here the second operand, "2," is recognized. Because it is a literal numeric value, which requires no memory reference, the token alone, without accompanying symbols, is output.

On returning to the first invocation of ParseTerm the compiler retrieves the pending operator from the variable HoldMultOp and sends it to the code generator. The expression N/2 has now been fully parsed, and the corresponding Forth words have been issued. Execution of these words will leave the new value of N on the top of the Forth stack. It remains for the parser to return to ParseStatement, where the first occurrence of N, from the left side of the assignment statement, has been sequestered. The goal now is to store the value on the top of the stack at the address represented by N. The necessary Forth instructions are N ! (the exclamation point is pronounced "store"), and they are duly issued.

There is one further addition to the object code. When ParseBlock reads the end keyword, signifying that there are no more statements in the block, it calls for the generation of a semicolon to mark the end of the Forth definition. The complete translation is:

: DownStep N @ 2 / N ! ;

It defines a new Forth word called DownStep. Whenever it is executed, it retrieves the current value of the variable N, divides it by 2, and stores the new value at the address reserved for N.

Pieces of the puzzle

Throughout this series of articles I have been probing the curious, murky relations between syntax and semantics, form and content, recognition and understanding. A sentence or a program can be grammatically flawless and yet signify nothing. Indeed, it is easy to create entire languages that convey no meaning. When meaning does emerge, then, where does it come from?

One approach to this question views language as a jigsaw puzzle. Grammar determines the shapes of the pieces and how they are to be assembled. From the standpoint of grammar alone, however, the pieces are empty markers to be shuffled about on the table top. Meaning exists only in the picture printed on the surface of the puzzle. For the picture to make sense the pieces must be assembled correctly, and yet there is no necessary correspondence between the pattern of the picture and the pattern of the jigsaw cuts.

What is troubling about this metaphor is that it makes the relation between grammar and meaning completely arbitrary. A given picture could be cut up into many different sets of pieces, and a given cutting of the puzzle could have any picture whatever printed on it. This is the world of Humpty Dumpty, where words mean whatever we want them to.

The arbitrary mapping from syntax to semantics is all too apparent in the Dada compiler. The scanner and the parser have sound theoretical underpinnings; one can give algorithms for their construction. The semantic tasks of type checking and code generation, on the other hand, are done by ad hoc procedure calls hung on the parse tree like Christmas ornaments.

0 VARIABLE N   ( Code for program Hailstone )
0 VARIABLE ODD
: CHECKODD N @ 2 MOD 0 = IF FALSE ODD ! ELSE TRUE ODD ! THEN ;
: DOWNSTEP N @ 2 / N ! ;
: UPSTEP 3 N @ * 1 + N ! ;
: NEXTTERM CHECKODD ODD @ IF UPSTEP ELSE DOWNSTEP THEN ;
: HAILSTONE N READ BEGIN N @ 1 > WHILE N WRITE NEXTTERM REPEAT ;
;S

Output of the compiler consists of "words" to be executed by a Forth interpreter.

Listing 3.


There is no semantic specification of the language comparable to the syntactic specification given in a set of production rules.

Schemes for semantic specification do exist. In one such scheme, called an attribute grammar, a semantic rule is associated with each syntactic production rule. Attributes, which can be thought of as slots for semantic values, are attached to each node of a parse tree. Whenever a production is invoked during the parsing of a program, the associated semantic rule is executed to fill in the attribute slots of the current node. The rule calculates the attribute values based on the values stored in other nodes. The values can include data types, fragments of object code, or any other computable information.
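To make the idea concrete, here is a small Pascal sketch, with hypothetical types and names that are no part of the Dada compiler, of an attributed node and the semantic rule that might accompany the production term -> factor * factor. The rule fills in two attribute slots of the parent node, checking the data types of the children and synthesizing a code fragment from theirs:

type
  DataType = (IntType, BoolType);
  AttrNode = record
    Kind: DataType;           { attribute slot: the node's data type }
    Code: string              { attribute slot: its fragment of object code }
  end;

{ semantic rule for the production  term -> factor "*" factor }
procedure ApplyTimesRule(var Node: AttrNode; Left, Right: AttrNode);
begin
  if (Left.Kind <> IntType) or (Right.Kind <> IntType) then
    Error('operands of * must be integers');  { Error is an assumed helper }
  Node.Kind := IntType;
  Node.Code := Concat(Left.Code, ' ', Right.Code, ' *')
end;

Apply such a rule at every node and the finished parse tree carries the complete object program in the Code slot of its root.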

Attribute grammars and other inventions of formal semantics promise to improve the methodology of compiler design. An open question is whether they can also improve understanding. A grammar is not only a useful tool but also an aid to comprehension. You can look at a few dozen production rules and see in them an infinite variety of program structures. As yet no semantic rules offer a comparable grasp of the infinite range of program meanings.

References

Aho, Alfred V., and Jeffrey D. Ullman. Principles of Compiler Design. Reading, Mass.: Addison-Wesley Publishing Co., 1977. [Known as the "dragon book," from the illustration on the cover. Gives an introduction to the theory, with emphasis on bottom-up parsers and automated compiler-writing tools.]

Aho, Alfred V., Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Reading, Mass.: Addison-Wesley Publishing Co., 1986. [A new dragon book, revised and enlarged, with more material on formal semantics.]

Wirth, Niklaus. Algorithms + Data Structures = Programs. Englewood Cliffs, N.J.: Prentice-Hall Inc., 1976. [Includes a brief but lucid treatment of the practicalities of compiler design, with a fully worked example.]

Kernighan, Brian W., and P.J. Plauger. Software Tools. Reading, Mass.: Addison-Wesley Publishing Co., 1976. [Concludes with a presentation of a RATFOR-to-FORTRAN translator, based on a handcrafted compiler.]

Amsterdam, Jonathan. "Context-Free Parsing of Arithmetic Expressions." BYTE, Aug. 1985. [A recent expression-parsing program, with source code in Pascal.]

Holub, Allen. "How Compilers Work—A Simple Desk Calculator." Dr. Dobb's Journal, Sept. 1985. [Another recent expression parser, with source code in C.]

Brian Hayes is a writer who works in both natural and formal languages. Until 1984 he was an editor of Scientific American, where he launched the Computer Recreations department.