RTU Exam Paper 2019
7th
Sem CSE
Subject: Compiler construction
UNIT-I
Q.1 a. Explain the different phases of compiler with the help of suitable diagram.
Answer:
Compiler Phases
The compilation process is a sequence of phases. Each phase takes the source program in one
representation and produces output in another representation, taking its input from the previous phase.
The various phases of a compiler are:
Fig: phases of compiler
Lexical Analysis:
Lexical analyzer phase is the first phase of compilation process. It takes source code as input. It reads the source
program one character at a time and converts it into meaningful lexemes. Lexical analyzer represents these lexemes
in the form of tokens.
Syntax Analysis
Syntax analysis is the second phase of compilation process. It takes tokens as input and generates a parse tree as
output. In the syntax analysis phase, the parser checks whether the expression made by the tokens is
syntactically correct.
Semantic Analysis
Semantic analysis is the third phase of compilation process. It checks whether the parse tree follows the rules of
the language. The semantic analyzer keeps track of identifiers, their types and expressions. The output of the semantic
analysis phase is the annotated syntax tree.
Intermediate Code Generation
In intermediate code generation, the compiler translates the source code into an intermediate code.
Intermediate code lies between the high-level language and the machine language. The intermediate code should be
generated in such a way that it can easily be translated into the target machine code.
Code Optimization
Code optimization is an optional phase. It is used to improve the intermediate code so that the output of the program
could run faster and take less space. It removes the unnecessary lines of the code and arranges the sequence of
statements in order to speed up the program execution.
Code Generation
Code generation is the final stage of the compilation process. It takes the optimized intermediate code as input and
maps it to the target machine language. Code generator translates the intermediate code into the machine code of the
specified computer.
Example:
Translators
Answer: The most general term for a software code converting tool is "translator." A translator,
in software programming terms, is a generic term that could refer to a compiler, assembler, or
interpreter; anything that converts code from one high-level language (e.g., Basic,
C++, Fortran, Java) into another high-level language, or into a lower-level language (i.e., a language that the
processor can understand), such as assembly language or machine code. If you don't know what the tool actually
does other than that it accomplishes some level of code conversion to a specific target language, then you can
safely call it a translator.
Interpreters
Answer:
An interpreter is a language processor that translates a single statement of the source program into
machine code and executes it immediately before moving on to the next line. If there is an error
in a statement, the interpreter stops translating at that statement and displays an error
message; it moves on to the next line for execution only after removal of the error. An
interpreter directly executes instructions written in a programming or scripting language without
previously converting them to object code or machine code.
Example: Perl, Python and MATLAB.
Compiler
Answer:
A compiler is a translator that converts the high-level language into the machine language.
o The high-level language is written by a developer and the machine language can be understood by the processor.
o The compiler is also used to show errors to the programmer.
o The main purpose of a compiler is to translate the code written in one language without changing the meaning
of the program.
o When you execute a program written in an HLL programming language, it executes in two
parts.
o In the first part, the source program is compiled and translated into the object program (low-level language).
o In the second part, the object program is translated into the target program through the assembler.
Fig: Execution process of source program in Compiler
Bootstrapping
o Bootstrapping is widely used in the compilation development.
o Bootstrapping is used to produce a self-hosting compiler. Self-hosting compiler is a type of compiler that
can compile its own source code.
o Bootstrap compiler is used to compile the compiler and then you can use this compiled compiler to compile
everything else as well as future versions of itself.
A compiler can be characterized by three languages:
1. Source Language
2. Target Language
3. Implementation Language
The T-diagram shows a compiler SCIT (read: a compiler for Source S, producing Target T, implemented in I;
the S, T and I are superscripts in the usual T-diagram notation).
Follow these steps to produce a compiler for a new language L on machine A:
1. Create a compiler SCAA for a subset S of the desired language L, written in language A, so that this
compiler runs on machine A.
2. Create a compiler LCSA for the full language L, written in the subset S of L.
3. Compile LCSA using the compiler SCAA to obtain LCAA.
LCAA is a compiler for language L which runs on machine A and produces code for machine A.
The process described by the T-diagrams is called bootstrapping.
c. Illustrate the translation of the following statement through all compiler phases:
A := B * C + D / E
Answer:
Lexical analysis produces the token stream id1 := id2 * id3 + id4 / id5 and enters the names A, B, C, D, E in the symbol table.
Syntax analysis builds a parse tree for the assignment in which * and / are grouped before +.
Semantic analysis checks the operand types and produces the annotated syntax tree (inserting type conversions if needed).
Intermediate code generation emits three-address code:
t1 := B * C
t2 := D / E
t3 := t1 + t2
A := t3
Code optimization leaves this code essentially unchanged here, since no instruction is redundant.
Code generation maps the three-address code to target machine instructions: loads of B, C, D, E, a multiply, a divide, an add, and a store into A.
Functions of the Lexical Analyzer
Answer: The function of Lex is as follows:
o First, a specification of the lexical analyzer is written in the Lex language in a file lex.l. The Lex compiler then
runs on lex.l and produces a C program lex.yy.c.
o Finally, the C compiler compiles lex.yy.c and produces an object program a.out.
o a.out is the lexical analyzer that transforms an input stream into a sequence of tokens.
Lex file format
A Lex program is separated into three sections by %% delimiters. The format of a Lex source file is as follows:
1. { definitions }
2. %%
3. { rules }
4. %%
5. { user subroutines }
Definitions include declarations of constants, variables and regular definitions.
Rules are statements of the form p1 {action1} p2 {action2} ... pn {actionn},
where each pi is a regular expression and each actioni describes what the lexical analyzer should
do when a lexeme matches the pattern pi.
User subroutines are auxiliary procedures needed by the actions. The subroutine can be loaded with the
lexical analyzer and compiled separately.
b. Construct minimum-state DFAs for the following regular expressions:
(a/b)*a(a/b)
(a/b)*a(a/b)(a/b)
(a/b)*a(a/b)(a/b)(a/b)
UNIT II
Q.1 a. Explain top-down and bottom-up parsing techniques in detail.
Answer.
Parser
The parser is the compiler component that checks and structures the stream of tokens coming from the lexical analysis phase.
A parser takes input in the form of sequence of tokens and produces output in the form of parse tree.
Parsing is of two types: top down parsing and bottom up parsing.
Top down parsing
o Top down parsing is also known as recursive parsing or predictive parsing.
o Top down parsing is used to construct a parse tree for an input string.
o In top down parsing, the parsing starts from the start symbol and transforms it into the input string.
Parse Tree representation of input string "acdb" is as follows:
Bottom up parsing
o Bottom up parsing is also known as shift-reduce parsing.
o Bottom up parsing is used to construct a parse tree for an input string.
o In the bottom up parsing, the parsing starts with the input symbol and construct the parse tree up to the start
symbol by tracing out the rightmost derivations of string in reverse.
Example
Production
1. E → T
2. T → T * F
3. T → F
4. F → id
Parse Tree representation of input string "id * id" is as follows:
Bottom up parsing is classified into various types. These are as follows:
1. Shift-Reduce Parsing
2. Operator Precedence Parsing
3. Table Driven LR Parsing
a. LR( 0 )
b. SLR( 1 )
c. CLR ( 1 )
d. LALR( 1 )
b. Construct an LL(1) parse table for a grammar; calculate FIRST and FOLLOW sets as needed.
Answer:
Rules to compute FIRST set:
1. If x is a terminal, then FIRST(x) = { ‘x’ }
2. If x -> ε is a production rule, then add ε to FIRST(x).
3. If X -> Y1 Y2 Y3….Yn is a production,
1. FIRST(X) = FIRST(Y1)
2. If FIRST(Y1) contains ε then FIRST(X) = { FIRST(Y1) – ε } U { FIRST(Y2) }
3. If FIRST(Yi) contains ε for all i = 1 to n, then add ε to FIRST(X).
Rules to compute FOLLOW set:
1) FOLLOW(S) = { $ } // where S is the starting Non-Terminal
2) If A -> pBq is a production, where p, B and q are any grammar symbols,
then everything in FIRST(q) except ε is in FOLLOW(B).
3) If A -> pB is a production, then everything in FOLLOW(A) is in FOLLOW(B).
4) If A -> pBq is a production and FIRST(q) contains ε,
then FOLLOW(B) contains { FIRST(q) – ε } U FOLLOW(A)
Example:
Production Rules:
E -> TE’
E’ -> +T E’ | ε
T -> F T’
T’ -> *F T’ | ε
F -> (E) | id
FIRST set
FIRST(E) = FIRST(T) = { ( , id }
FIRST(E’) = { +, ε }
FIRST(T) = FIRST(F) = { ( , id }
FIRST(T’) = { *, ε }
FIRST(F) = { ( , id }
FOLLOW Set
FOLLOW(E) = { $ , ) } // Note ')' is there because of 5th rule
FOLLOW(E’) = FOLLOW(E) = { $, ) } // See 1st production rule
FOLLOW(T) = { FIRST(E’) – ε } U FOLLOW(E’) U FOLLOW(E) = { + , $ , ) }
FOLLOW(T’) = FOLLOW(T) = { + , $ , ) }
FOLLOW(F) = { FIRST(T’) – ε } U FOLLOW(T’) U FOLLOW(T) = { *, +, $, ) }
Or
a. What do you mean by context free grammar? Give the distinction between regular
and context free grammar and the limitations of context free grammar.
Answer: Context free grammar
Context free grammar is a formal grammar which is used to generate all possible strings in a given formal language.
Context free grammar G can be defined by four tuples as:
1. G= (V, T, P, S)
Where,
G describes the grammar
T describes a finite set of terminal symbols.
V describes a finite set of non-terminal symbols
P describes a set of production rules
S is the start symbol.
In CFG, the start symbol is used to derive the string. You can derive the string by repeatedly replacing a
non-terminal by the right hand side of the production, until all non-terminals have been replaced by terminal symbols.
Example:
L = {wcwR | w ∈ (a, b)*}
Production rules:
1. S → aSa
2. S → bSb
3. S → c
Now check that abbcbba string can be derived from the given CFG.
1. S ⇒ aSa
2. S ⇒ abSba
3. S ⇒ abbSbba
4. S ⇒ abbcbba
By applying the production S → aSa, S → bSb recursively and finally applying the production S → c, we get the
string abbcbba.
Capabilities of CFG
These are the various capabilities of CFG:
o Context free grammar is useful to describe most programming languages.
o If the grammar is properly designed then an efficient parser can be constructed automatically.
o Using associativity & precedence information, suitable grammars for expressions can be
constructed.
o Context free grammar is capable of describing nested structures like: balanced parentheses, matching
begin-end, corresponding if-then-else's & so on.
Distinction between regular and context free grammar
Definition
A regular expression is a concept in formal language theory which is a sequence of characters that define a search pattern.
Context Free Grammar is a type of formal grammar in formal language theory, which is a set of production rules that describe all
possible strings in a given formal language.
Usage
Regular expressions help to represent certain sets of string in an algebraic fashion. It helps to represent regular languages.
Context free grammar helps to define all the possible strings of a context free language.
Difference Between Rules
Regular and context-free grammars differ in the types of rules they allow. The rules of context-free grammars allow
possible sentences as combinations of unrelated individual words (which Chomsky calls “terminals”) and groups of
words (phrases, or what Chomsky calls “non-terminals”). Context-free grammars allow individual words and
phrases in any order and allow sentences with any number of individual words and phrases. Regular grammars, on
the other hand, allow only individual words along with a single phrase per sentence. Furthermore, phrases in regular
grammars must appear in the same position in every sentence or phrase generated by the grammar.
b. Show whether the following grammar is LL(1) or not.
E -> TE'
E' -> +TE'|ε
T -> FT'
T' -> *FT'|ε
F -> (E)|id
Answer:
This is the standard left-factored expression grammar; its FIRST and FOLLOW sets were computed earlier for this same grammar:
FIRST(E) = FIRST(T) = FIRST(F) = { ( , id }, FIRST(E') = { +, ε }, FIRST(T') = { *, ε },
FOLLOW(E') = { $, ) }, FOLLOW(T') = { +, $, ) }.
A grammar is LL(1) when, for every nonterminal, one token of lookahead selects a unique alternative:
o For F -> (E) | id, FIRST((E)) = { ( } and FIRST(id) = { id } are disjoint.
o For E' -> +TE' | ε, FIRST(+TE') = { + } does not intersect FOLLOW(E') = { $, ) }.
o For T' -> *FT' | ε, FIRST(*FT') = { * } does not intersect FOLLOW(T') = { +, $, ) }.
Hence every cell of the LL(1) parse table receives at most one production, and the grammar is LL(1).
UNIT III
Q.3 a. Write a program to translate an infix expression into postfix form; also write the syntax
directed definition for the same.
Answer:
/* C++ implementation to convert an infix expression to postfix */
// Note that here we use std::stack for stack operations
#include <bits/stdc++.h>
using namespace std;

// Function to return precedence of operators
int prec(char c)
{
    if (c == '^')
        return 3;
    else if (c == '*' || c == '/')
        return 2;
    else if (c == '+' || c == '-')
        return 1;
    else
        return -1;
}

// The main function to convert an infix expression
// to a postfix expression
void infixToPostfix(string s)
{
    stack<char> st;
    st.push('N');               // sentinel marking the bottom of the stack
    int l = s.length();
    string ns;
    for (int i = 0; i < l; i++)
    {
        // If the scanned character is an operand, add it to the output string.
        if ((s[i] >= 'a' && s[i] <= 'z') || (s[i] >= 'A' && s[i] <= 'Z'))
            ns += s[i];

        // If the scanned character is a '(', push it to the stack.
        else if (s[i] == '(')
            st.push('(');

        // If the scanned character is a ')', pop from the stack to the
        // output string until a '(' is encountered.
        else if (s[i] == ')')
        {
            while (st.top() != 'N' && st.top() != '(')
            {
                ns += st.top();
                st.pop();
            }
            if (st.top() == '(')
                st.pop();       // discard the '('
        }

        // If an operator is scanned
        else
        {
            while (st.top() != 'N' && prec(s[i]) <= prec(st.top()))
            {
                ns += st.top();
                st.pop();
            }
            st.push(s[i]);
        }
    }

    // Pop all the remaining elements from the stack
    while (st.top() != 'N')
    {
        ns += st.top();
        st.pop();
    }
    cout << ns << endl;
}

// Driver program to test the above function
int main()
{
    string exp = "a+b*(c^d-e)^(f+g*h)-i";
    infixToPostfix(exp);
    return 0;
}
// This code is contributed by Gautam Singh
Output:
abcd^e-fgh*+^*+i-
Syntax Directed Translation
Syntax-directed translation refers to a method of compiler implementation where the source language translation is
completely driven by the parser. In other words, the parsing process and parse trees are used to direct semantic
analysis and the translation of the source program.
Basically we try to integrate the generation of the parse tree and its evaluation by traversing the parse tree.
Production : A -> XY
Semantic Rule : A.a = f(X.b, Y.c)
where a is an attribute associated with A, b is an attribute associated with X and c is an attribute associated with Y.
SDT : A -> XY { action or code fragment }
In the above SDT rule the position of the action varies as per the type of SDD (L-attributed or S-attributed) and
the type of attribute (synthesized or inherited) to be evaluated.
SDT for infix to postfix conversion of expression for given grammar :
Grammar :
E -> E + T { print('+') }
E -> E - T { print('-') }
E -> T { }
T -> id { print(id) }
SDT for evaluation of expression for given grammar :
Grammar :
L -> E { print(E.val) }
E -> E1 + T { E.val = E1.val + T.val }
E -> T { E.val = T.val }
T -> T1 * F { T.val = T1.val * F.val }
T -> F { T.val = F.val }
F -> id { F.val = id.lexval }
Getting an equivalent SDT from:
1) S-attributed SDD:
For all semantic rules, generate an action to compute the synthesized attribute of the non-terminal in the head of
the production from the synthesized attributes of the non-terminals in the body of the production.
Rules:
1) Translate those semantic rules to equivalent code fragments or actions.
2) Place those actions at the end of the production.
The above SDT is an example of conversion of an S-attributed SDD to an equivalent SDT. We always append the
action to the end of the production; this type of SDT is called a postfix SDT.
To evaluate the expressions we use a value stack for evaluation along with the symbol stack which is used for parsing.
Using the above SDT we translate the input expression 3*4:
Symbol Stack   Value Stack   Input String   Syntax Action      Semantic Action
$              $             3*4            shift
$id            $3            *4             reduce F -> id     F.val = id.val
$F             $3            *4             reduce T -> F      T.val = F.val
$T             $3            *4             shift
$T*            $3*           4              shift
$T*id          $3*4          $              reduce F -> id     F.val = id.val
$T*F           $3*4          $              reduce T -> T*F    T.val = T.val * F.val
$T             $12           $              reduce E -> T      E.val = T.val
$E             $12           $              reduce L -> E      print(E.val)
2) L-attributed SDD:
Let the production stated below be a rule of an LL(1) grammar.
Rule : A -> X1 X2 X3
Semantic Rules : X2.b_inh = f(X1.a_syn, A.x_inh)
A.x_syn = f(X1.a_syn, X2.b_syn, X3.c_syn)
where each of the non-terminals may have synthesized and inherited attributes.
SDT :
A -> X1 { X2.b_inh = f(X1.a_syn, A.x_inh) } X2 X3 { A.x_syn = f(X1.a_syn, X2.b_syn, X3.c_syn) }
To evaluate a synthesized attribute we append the action at the end of the production, and to evaluate an
inherited attribute we place the action just before the non-terminal symbol whose attribute has to be calculated.
c. Write the specification of a simple type checker with an example.
Specification of a Simple Type Checker
We consider the language generated by the following grammar:
P → D ; E
D → D ; D
D → id : T
T → char
T → integer
T → [ num ] T
T → ↑ T
T → T '→' T
E → literal
E → num
E → id
E → E mod E
E → E [ E ]
E → E ↑
E → E ( E )
A sentence of this language is a Program.
A Program consists of a sequence of Declarations followed by an Expression.
char and integer are the basic types whereas literal and num stand for
elements of these types.
id is the token for identifiers.
[ num ] T is an array type construct whereas E [ E ]
refers to an element of an array.
↑ T is a pointer type construct whereas E ↑ is a pointer dereference.
T '→' T is a function type construct whereas E ( E ) is a function call.
E mod E represents a remainder computation.
We consider the following attributes (all synthesized; there are no inherited attributes):
Grammar symbol    Synthesized attribute
E                 E.type (type expression)
T                 T.type (type expression)
id                id.entry (symbol-table entry)
num               num.val (integer value)
literal           literal.val (character value)
First we give a translation scheme that saves the type of an identifier.
D → id : T       { addtype(id.entry, T.type) }                     (1)
T → char         { T.type := char }
T → integer      { T.type := integer }
T → [ num ] T1   { T.type := array(0 ... num.val - 1, T1.type) }
T → ↑ T1         { T.type := pointer(T1.type) }
T → T1 '→' T2    { T.type := (T1.type → T2.type) }
Note that because of rule (1)
the types of all identifiers are saved in the symbol table before the expression
generated by E is checked. Now we give a translation scheme with the type checking rules
for expressions. We assume that lookup retrieves the type information about an entry
of the symbol table.
E → literal      { E.type := char }
E → num          { E.type := integer }
E → id           { E.type := lookup(id.entry) }
E → E1 mod E2    { E.type := if E1.type = integer
                             and E2.type = integer
                             then integer
                             else type_error }
E → E1 [ E2 ]    { E.type := if E2.type = integer
                             and E1.type = array(n, T)
                             then T
                             else type_error }
E → E1 ↑         { E.type := if E1.type = pointer(T)
                             then T
                             else type_error }
E → E1 ( E2 )    { E.type := if E2.type = S
                             and E1.type = S → T
                             then T
                             else type_error }
Now we extend our language by stating that
A Program consists of a sequence of Declarations followed by a sequence
of Statements.
A statement is either an assignment, an alternative or a while loop.
boolean is a basic type.
An expression could be E1 = E2, which evaluates to boolean provided
that E1 and E2 have the same basic type, otherwise evaluates to type_error.
Therefore we obtain the following new grammar.
P → D ; S
D → D ; D
D → id : T
T → char
T → integer
T → boolean
S → S ; S
S → id := E
S → if E then S
S → while E do S
Extending the translation scheme for statements leads to:
S → S1 ; S2       { S.type := if S1.type = void
                              and S2.type = void
                              then void
                              else type_error }
S → id := E       { S.type := if E.type = id.type
                              then void
                              else type_error }
S → if E then S1  { S.type := if E.type = boolean
                              then S1.type
                              else type_error }
S → while E do S1 { S.type := if E.type = boolean
                              then S1.type
                              else type_error }
OR
a. Explain the syntax directed translation scheme in detail.
Answer: Syntax directed translation scheme
o The syntax directed translation scheme is a context-free grammar.
o The syntax directed translation scheme is used to evaluate the order of semantic rules.
o In a translation scheme, the semantic rules are embedded within the right side of the productions.
o The position at which an action is to be executed is shown by enclosing it between braces. It is written within the right side of the production.
Example
Production Semantic Rules
S → E $ { print(E.VAL) }
E → E + E {E.VAL := E.VAL + E.VAL }
E → E * E {E.VAL := E.VAL * E.VAL }
E → (E) {E.VAL := E.VAL }
E → I {E.VAL := I.VAL }
I → I digit {I.VAL := 10 * I.VAL + LEXVAL }
I → digit { I.VAL:= LEXVAL}
Implementation of Syntax directed translation
Syntax directed translation is implemented by constructing a parse tree and performing the actions in a left to right depth-first order.
SDT is implemented by parsing the input and producing a parse tree as a result.
Example
Production Semantic Rules
S → E $ { print(E.VAL) }
E → E + E {E.VAL := E.VAL + E.VAL }
E → E * E {E.VAL := E.VAL * E.VAL }
E → (E) {E.VAL := E.VAL }
E → I {E.VAL := I.VAL }
I → I digit {I.VAL := 10 * I.VAL + LEXVAL }
I → digit { I.VAL:= LEXVAL}
Parse tree for SDT:
b. Write the process and importance of intermediate code generation.
Answer: If a source code can directly be translated into its target machine code, why do we
need to translate the source code into an intermediate code which is then translated to its target
code? Let us see the reasons why we need an intermediate code.
If a compiler translates the source language to its target machine language without having the option for generating intermediate
code, then for each new machine, a full native compiler is required.
Intermediate code eliminates the need of a new full compiler for every unique machine by keeping the analysis portion same for all
the compilers.
The second part of compiler, synthesis, is changed according to the target machine.
It becomes easier to apply the source code modifications to improve code performance by applying code optimization techniques on
the intermediate code.
Intermediate Representation
Intermediate codes can be represented in a variety of ways and they have their own benefits.
High Level IR - High-level intermediate code representation is very close to the source language itself. They can be easily generated
from the source code and we can easily apply code modifications to enhance performance. But for target machine optimization, it is
less preferred.
Low Level IR - This one is close to the target machine, which makes it suitable for register and memory allocation, instruction set
selection, etc. It is good for machine-dependent optimizations.
Intermediate code can be either language specific (e.g., bytecode for Java) or language
independent (three-address code).
Three-Address Code
The intermediate code generator receives input from its predecessor phase, the semantic analyzer, in the
form of an annotated syntax tree. That syntax tree can then be converted into a linear
representation, e.g., postfix notation. Intermediate code tends to be machine-independent code;
therefore, the code generator assumes an unlimited number of memory locations (registers) while
generating code.
For example:
a = b + c * d;
The intermediate code generator will try to divide this expression into sub-expressions and then
generate the corresponding code.
r1 = c * d;
r2 = b + r1;
a = r2
r being used as registers in the target program.
A three-address code has at most three address locations to calculate the expression. A three-address
code can be represented in two forms: quadruples and triples.
Quadruples
Each instruction in quadruples presentation is divided into four fields: operator, arg1, arg2, and
result. The above example is represented below in quadruples format:
Op   arg1   arg2   result
*    c      d      r1
+    b      r1     r2
=    r2            a
Triples
Each instruction in triples presentation has three fields : op, arg1, and arg2.The results of
respective sub-expressions are denoted by the position of expression. Triples represent
similarity with DAG and syntax tree. They are equivalent to DAG while representing
expressions.
Op   arg1   arg2
*    c      d
+    b      (0)
=    (1)
Triples face the problem of code immovability during optimization, as the results are positional
and changing the order or position of an expression may cause problems.
Indirect Triples
This representation is an enhancement over triples representation. It uses pointers instead of
position to store results. This enables the optimizers to freely re-position the sub-expression to
produce an optimized code.
Declarations
A variable or procedure has to be declared before it can be used. Declaration involves allocation
of space in memory and entry of type and name in the symbol table. A program may be coded
and designed keeping the target machine structure in mind, but it may not always be possible to
accurately convert a source code to its target language.
Taking the whole program as a collection of procedures and sub-procedures, it becomes
possible to declare all the names local to the procedure. Memory allocation is done in a
consecutive manner and names are allocated to memory in the sequence they are declared in the
program. We use an offset variable and set it to zero {offset = 0} to denote the base address.
The source programming language and the target machine architecture may vary in the way
names are stored, so relative addressing is used. While the first name is allocated memory
starting from the memory location 0 {offset=0}, the next name declared later, should be
allocated memory next to the first one.
Example:
We take the example of C programming language where an integer variable is assigned 2 bytes
of memory and a float variable is assigned 4 bytes of memory.
int a;
float b;
Allocation process:
{offset = 0}
int a;
id.type = int
id.width = 2
offset = offset + id.width
{offset = 2}
float b;
id.type = float
id.width = 4
offset = offset + id.width
{offset = 6}
To enter this detail in a symbol table, a procedure enter can be used. This method may have the
following structure:
enter(name, type, offset)
This procedure should create an entry in the symbol table, for variable name, having its type set
to type and relative address offset in its data area.
UNIT IV
a. Write short notes on:
Symbol table
Answer: Symbol table is an important data structure created and maintained by compilers in
order to store information about the occurrence of various entities such as variable names,
function names, objects, classes, interfaces, etc. Symbol table is used by both the analysis and
the synthesis parts of a compiler.
A symbol table may serve the following purposes depending upon the language in hand:
To store the names of all entities in a structured form at one place.
To verify if a variable has been declared.
To implement type checking, by verifying assignments and expressions in the source
code are semantically correct.
To determine the scope of a name (scope resolution).
A symbol table is simply a table which can be either linear or a hash table. It maintains an entry
for each name in the following format:
<symbol name, type, attribute>
For example, if a symbol table has to store information about the following variable declaration:
static int interest;
then it should store the entry such as:
<interest, int, static>
The attribute clause contains the entries related to the name.
Implementation
If a compiler is to handle a small amount of data, then the symbol table can be implemented as
an unordered list, which is easy to code but suitable only for small tables. A symbol
table can be implemented in one of the following ways:
Linear (sorted or unsorted) list
Binary Search Tree
Hash table
Among all, symbol tables are mostly implemented as hash tables, where the source code symbol
itself is treated as a key for the hash function and the return value is the information about the
symbol.
Operations
A symbol table, either linear or hash, should provide the following operations.
insert()
This operation is more frequently used by analysis phase, i.e., the first half of the compiler
where tokens are identified and names are stored in the table. This operation is used to add
information in the symbol table about unique names occurring in the source code. The format or
structure in which the names are stored depends upon the compiler in hand.
An attribute for a symbol in the source code is the information associated with that symbol. This
information contains the value, state, scope, and type about the symbol. The insert() function
takes the symbol and its attributes as arguments and stores the information in the symbol table.
For example:
int a;
should be processed by the compiler as:
insert(a, int);
lookup()
lookup() operation is used to search a name in the symbol table to determine:
if the symbol exists in the table.
if it is declared before it is being used.
if the name is used in the scope.
if the symbol is initialized.
if the symbol is declared multiple times.
The format of lookup() function varies according to the programming language. The basic
format should match the following:
lookup(symbol)
This method returns 0 (zero) if the symbol does not exist in the symbol table. If the symbol
exists in the symbol table, it returns its attributes stored in the table.
Scope Management
A compiler maintains two types of symbol tables: a global symbol table which can be accessed
by all the procedures and scope symbol tables that are created for each scope in the program.
storage allocation strategies
Ans. The different ways to allocate memory are:
1. Static storage allocation
2. Stack storage allocation
3. Heap storage allocation
Static storage allocation
o In static allocation, names are bound to storage locations at compile time.
o If memory is created at compile time then the memory will be created in the static area and
only once.
o Static allocation does not support dynamic data structures: memory is created only
at compile time and deallocated after program completion.
o The drawback with static storage allocation is that the size and position of data objects
must be known at compile time.
o Another drawback is the restriction on recursive procedures.
Stack Storage Allocation
o In stack storage allocation, storage is organized as a stack.
o An activation record is pushed onto the stack when an activation begins and it is popped
when the activation ends.
o Activation record contains the locals so that they are bound to fresh storage in each
activation record. The value of locals is deleted when the activation ends.
o It works on the basis of last-in-first-out (LIFO) and this allocation supports the recursion
process.
Heap Storage Allocation
o Heap allocation is the most flexible allocation scheme.
o Allocation and deallocation of memory can be done at any time and at any place
depending upon the user's requirement.
o Heap allocation is used to allocate memory to variables dynamically and to reclaim it when the
variables are no longer used.
o Heap storage allocation supports the recursion process.
activation record
Answer: The control stack is a run-time stack which is used to keep track of the live procedure
activations, i.e., it is used to find out the procedures whose execution has not been
completed.
o When a procedure is called (activation begins) its name is pushed onto the stack,
and when it returns (activation ends) it is popped.
o An activation record is used to manage the information needed by a single execution of a
procedure.
o An activation record is pushed onto the stack when a procedure is called and it is popped
when control returns to the caller.
Return Value: It is used by the called procedure to return a value to the calling procedure.
Actual Parameter: It is used by calling procedures to supply parameters to the called procedures.
Control Link: It points to activation record of the caller.
Access Link: It is used to refer to non-local data held in other activation records.
Saved Machine Status: It holds the information about status of machine before the procedure is called.
Local Data: It holds the data that is local to the execution of the procedure.
Temporaries: It stores the value that arises in the evaluation of an expression.
Parameter Passing
Answer: Parameters
Formal parameter — the identifier used in a method to stand for the value that is passed into the
method by a caller. For example, amount is a formal parameter of processDeposit.
Actual parameter — the actual value that is passed into the method by a caller.
o For example, the 200 used when processDeposit is called is an actual parameter.
o Actual parameters are often called arguments.
OR
a. Explain the following in detail
Nesting Depth and Access Links
Nesting Depth
Outermost procedures have nesting depth 1. Every other procedure has nesting depth one more than the
nesting depth of the immediately enclosing procedure. For example, if f and g are both nested directly
inside main, then main has nesting depth 1 and both f and g have nesting depth 2.
Access Links
The AR for a nested procedure contains an access link that points to the AR of the most recent
activation of the immediately enclosing procedure. In the example above, the access link of every
activation of f and g would point to the AR of the (only) activation of main. Then, for a procedure
P to access a name defined in the scope whose nesting depth is 3 less than that of P, the access
links are followed three times.
Data structures used in Symbol table
o A compiler maintains two types of symbol table: a global symbol table and scope symbol tables.
o The global symbol table can be accessed by all procedures, whereas a scope symbol table is visible only within the scope (and nested scopes) for which it was created.
Scopes and their symbol tables are arranged in a hierarchical structure, as the following code illustrates:
int value = 10;

void sum_num()
{
    int num_1;
    int num_2;

    {
        int num_3;
        int num_4;
    }

    int num_5;

    {
        int num_6;
        int num_7;
    }
}

void sum_id()
{
    int id_1;
    int id_2;

    {
        int id_3;
        int id_4;
    }

    int id_5;
}
The above code can be represented in a hierarchical data structure of symbol tables:
The global symbol table contains one global variable and two procedure names. The names declared in the sum_num table are not available to sum_id and its child tables.
The data-structure hierarchy of symbol tables is maintained by the semantic analyzer. To search for a name in the symbol tables, the following algorithm is used:
o First, the name is searched in the current symbol table.
o If it is found, the search is complete; otherwise, the name is searched in the parent's symbol table, and so on, until
o the name is found or the global symbol table has been searched.
Representing Scope Information
In the source program, every name possesses a region of validity, called the scope of that name.
The rules in a block-structured language are as follows:
1. If a name is declared within block B, then it is valid only within B.
2. If block B1 is nested within B2, then any name valid in B2 is also valid in B1, unless the name is re-declared in B1.
o These scope rules require a more complicated symbol-table organization than a simple list of associations between names and attributes.
o Tables are organized into a stack, and each table contains the list of names and their associated attributes.
o Whenever a new block is entered, a new table is pushed onto the stack. The new table holds the names declared as local to this block.
o When a declaration is compiled, the table is searched for the name.
o If the name is not found in the table, the new name is inserted.
o When a reference to a name is translated, the tables are searched starting from the table on top of the stack.
For example:
int x;
void f(int m) {
    float x, y;
    {
        int i, j;
        int u, v;
    }
}
int g(int n)
{
    bool t;
}
Fig: Symbol table organization that complies with static scope information rules
Static versus dynamic storage allocation
Answer: Static Versus Dynamic Storage Allocation
Much (often most) data cannot be statically allocated. Either its size is not known at compile time or its
lifetime is only a subset of the program's execution.
Early versions of Fortran used only statically allocated data. This required that each array had a constant
size specified in the program. Another consequence of supporting only static allocation was that recursion
was forbidden (otherwise the compiler could not tell how many versions of a variable would be needed).
Modern languages, including newer versions of Fortran, support both static and dynamic allocation of
memory.
The advantage of supporting dynamic storage allocation is the increased flexibility and storage
efficiency it makes possible (instead of declaring an array with a size adequate for the largest
data set, just allocate what is needed). The advantage of static storage allocation is that it
avoids the run-time costs of allocation/deallocation and may permit faster code sequences for
referencing the data.
An (unfortunately, all too common) error is a so-called memory leak, where a long-running program
repeatedly allocates memory that it fails to free, even after the memory can no longer be
referenced. To avoid memory leaks and to ease programming, several programming language systems
employ automatic garbage collection: the runtime system itself determines when data can no longer
be referenced and automatically deallocates it.
Activation Trees
Answer: Activation Tree
A program consists of procedures; a procedure definition is a declaration that, in its simplest
form, associates an identifier (the procedure name) with a statement (the body of the procedure).
Each execution of a procedure is referred to as an activation of the procedure. The lifetime of an
activation is the sequence of steps in the execution of the procedure. If 'a' and 'b' are two
procedures, then their activations are either non-overlapping (when one is called after the other
has returned) or nested (when one is called inside the other). A procedure is recursive if a new
activation begins before an earlier activation of the same procedure has ended. An activation tree
shows the way control enters and leaves activations.
Properties of activation trees are :-
Each node represents an activation of a procedure.
The root represents the activation of the main function.
The node for procedure 'x' is the parent of the node for procedure 'y' if and only if control
flows from activation x to activation y.
Example – Consider the following program of Quicksort
main() {
    int n;
    readarray();
    quicksort(1, n);
}
quicksort(int m, int n) {
    int i = partition(m, n);
    quicksort(m, i - 1);
    quicksort(i + 1, n);
}
In the activation tree for this program, main is the root; main calls readarray and then
quicksort, and quicksort in turn calls partition and quicksort again. The flow of control in
the program corresponds to a depth-first traversal of the activation tree, starting at the root.
b. What is peephole optimization? Explain it.
Ans. Peephole Optimization
This optimization technique works locally, on a small portion of the code at hand (the
"peephole"), to transform it into optimized code. These methods can be applied to intermediate
code as well as to target code. A small window of statements is analyzed and checked for the
following possible optimizations:
Redundant instruction elimination
At the source-code level, the following successive simplifications can be done by the user:
int add_ten(int x)
{
int y, z;
y = 10;
z = x + y;
return z;
}
int add_ten(int x)
{
int y;
y = 10;
y = x + y;
return y;
}
int add_ten(int x)
{
int y = 10;
return x + y;
}
int add_ten(int x)
{
return x + 10;
}
At the compilation level, the compiler searches for instructions that are redundant in nature.
Multiple load and store instructions may carry the same meaning even if some of them are
removed. For example:
MOV x, R0
MOV R0, R1
We can delete the first instruction and rewrite the code as:
MOV x, R1
Unreachable code
Unreachable code is a part of the program code that is never accessed because of programming
constructs. Programmers may have accidentally written a piece of code that can never be reached.
Example:
int add_ten(int x)
{
return x + 10;
printf("value of x is %d", x);
}
In this code segment, the printf statement will never be executed, because program control
returns to the caller before it can execute; hence, the printf can be removed.
Flow of control optimization
There are instances in a code where the program control jumps back and forth without
performing any significant task. These jumps can be removed. Consider the following chunk of
code:
...
MOV R1, R2
GOTO L1
...
L1 : GOTO L2
L2 : INC R1
In this code, label L1 can be removed, as it merely passes control on to L2. So instead of
jumping to L1 and then to L2, control can reach L2 directly, as shown below:
...
MOV R1, R2
GOTO L2
...
L2 : INC R1
Algebraic expression simplification
There are occasions where algebraic expressions can be simplified. For example, the
statement a = a + 0 has no effect and can be eliminated, and the statement a = a + 1 can
simply be replaced by INC a.
Strength reduction
There are operations that consume more time and space. Their ‘strength’ can be reduced by
replacing them with other operations that consume less time and space, but produce the same
result.
For example, x * 2 can be replaced by x << 1, which involves only one left shift. Though
a * a and a² produce the same output, a * a (a single multiplication) is much more efficient
to implement than a call to a power routine.
Accessing machine instructions
The target machine may provide more sophisticated instructions that can perform specific
operations much more efficiently. If the target code can use those instructions directly,
it will not only improve the quality of the code but also yield more efficient results.
OR
a. Construct the tree for following expressions and apply
labeling algorithm for ordering
x*(y+z)-z/(u-v)
b. Explain the basic block and control flow graph.
Answer:
Basic Blocks
Source code generally contains runs of instructions that are always executed in sequence;
these runs are considered the basic blocks of the code. Basic blocks have no jump statements
within them, i.e., when the first instruction is executed, all the instructions in the same
basic block will be executed in their sequence of appearance without the flow of control
leaving the block.
Program constructs such as IF-THEN-ELSE and SWITCH-CASE conditional statements, and loops
such as DO-WHILE, FOR, and REPEAT-UNTIL, determine how a program is divided into basic blocks.
Basic block identification
We may use the following algorithm to find the basic blocks in a program:
Search for the header (leader) statements, at which basic blocks start:
o The first statement of the program.
o Any statement that is the target of a branch (conditional or unconditional).
o Any statement that immediately follows a branch statement.
A header statement together with the statements that follow it, up to but not including the next header, forms a basic block.
A basic block does not include the header statement of any other basic block.
Basic blocks are important from both the code-generation and optimization points of view.
They play an important role in identifying variables that are used more than once within a
single basic block. If a variable is used more than once, the register allocated to that
variable need not be freed until the block finishes execution.
Control Flow Graph
Basic blocks in a program can be represented by means of a control flow graph. A control flow
graph depicts how program control is passed among the blocks. It is a useful tool that helps
in optimization, for example by locating unwanted loops in the program.