RTU Exam Paper 2019
7th
Sem CSE
Subject: Compiler construction
UNIT-I
Q.1 a. Explain the different phases of compiler with the help of suitable diagram.
Answer:
Compiler Phases
The compilation process is a sequence of phases. Each phase takes the source program in one
representation and produces output in another representation, taking its input from the previous phase.
The various phases of a compiler are:
Fig: phases of compiler
Lexical Analysis:
Lexical analyzer phase is the first phase of compilation process. It takes source code as input. It reads the source
program one character at a time and converts it into meaningful lexemes. Lexical analyzer represents these lexemes
in the form of tokens.
Syntax Analysis
Syntax analysis is the second phase of compilation process. It takes tokens as input and generates a parse tree as
output. In the syntax analysis phase, the parser checks whether the expression made by the tokens is
syntactically correct.
Semantic Analysis
Semantic analysis is the third phase of compilation process. It checks whether the parse tree follows the rules of
the language. The semantic analyzer keeps track of identifiers, their types and expressions. The output of the semantic
analysis phase is the annotated syntax tree.
Intermediate Code Generation
In intermediate code generation, the compiler translates the source code into an intermediate code.
Intermediate code lies between the high-level language and the machine language. The intermediate code should be
generated in such a way that it can easily be translated into the target machine code.
Code Optimization
Code optimization is an optional phase. It is used to improve the intermediate code so that the output of the program
could run faster and take less space. It removes the unnecessary lines of the code and arranges the sequence of
statements in order to speed up the program execution.
Code Generation
Code generation is the final stage of the compilation process. It takes the optimized intermediate code as input and
maps it to the target machine language. Code generator translates the intermediate code into the machine code of the
specified computer.
Example:
Translators
Answer: The most general term for a software code converting tool is "translator." A translator,
in software programming terms, is a generic term that could refer to a compiler, assembler, or
interpreter; anything that converts code from one high-level language (e.g., Basic,
C++, Fortran, Java) into another high-level language, or into a lower-level language (i.e., a language that the
processor can understand), such as assembly language or machine code. If you don't know what the tool actually
does other than that it accomplishes some level of code conversion to a specific target language, then you can
safely call it a translator.
Interpreters
Answer:
An interpreter is a language processor that translates a single statement of the source program into
machine code and executes it immediately before moving on to the next line. If there is an error
in a statement, the interpreter stops translating at that statement and displays an error
message; it moves on to the next line for execution only after removal of the error. An
interpreter directly executes instructions written in a programming or scripting language without
previously converting them to object code or machine code.
Example: Perl, Python and MATLAB.
Compiler
Answer:
A compiler is a translator that converts the high-level language into the machine language.
o The high-level language is written by a developer and the machine language can be understood by the processor.
o The compiler is also used to show errors to the programmer.
o The main purpose of a compiler is to translate the code written in one language without changing the meaning
of the program.
o When you execute a program written in an HLL programming language, it executes in two
parts.
o In the first part, the source program is compiled and translated into the object program (low-level language).
o In the second part, the object program is translated into the target program through the assembler.
Fig: Execution process of source program in Compiler
Bootstrapping
o Bootstrapping is widely used in the compilation development.
o Bootstrapping is used to produce a self-hosting compiler. Self-hosting compiler is a type of compiler that
can compile its own source code.
o Bootstrap compiler is used to compile the compiler and then you can use this compiled compiler to compile
everything else as well as future versions of itself.
A compiler can be characterized by three languages:
1. Source Language
2. Target Language
3. Implementation Language
The T-diagram shows a compiler SCIT (read: a compiler for Source S, producing Target T, implemented in I;
the S, T and I are superscripts in the usual T-diagram notation).
Follow these steps to produce a compiler for a new language L on machine A:
1. Create a compiler SCAA for a subset S of the desired language L, written in language A, so that this
compiler runs on machine A.
2. Create a compiler LCSA for the full language L, written in the subset S of L.
3. Compile LCSA using the compiler SCAA to obtain LCAA.
LCAA is a compiler for language L which runs on machine A and produces code for machine A.
The process described by the T-diagrams is called bootstrapping.
c. Illustrate the translation of the following statement through all compiler phases:
A := B * C + D / E
Answer:
Lexical analysis produces the token stream id1 := id2 * id3 + id4 / id5 and enters the names A, B, C, D, E in the symbol table.
Syntax analysis builds a parse tree for the assignment in which * and / are grouped before +.
Semantic analysis checks the operand types and produces the annotated syntax tree (inserting type conversions if needed).
Intermediate code generation emits three-address code:
t1 := B * C
t2 := D / E
t3 := t1 + t2
A := t3
Code optimization leaves this code essentially unchanged here, since no instruction is redundant.
Code generation maps the three-address code to target machine instructions: loads of B, C, D, E, a multiply, a divide, an add, and a store into A.
Functions of the Lexical Analyzer
Answer: The function of Lex is as follows:
o First, a specification of the lexical analyzer is written in the Lex language in a file lex.l. The Lex compiler then
runs on lex.l and produces a C program lex.yy.c.
o Finally, the C compiler compiles lex.yy.c and produces an object program a.out.
o a.out is the lexical analyzer that transforms an input stream into a sequence of tokens.
Lex file format
A Lex program is separated into three sections by %% delimiters. The format of a Lex source file is as follows:
1. { definitions }
2. %%
3. { rules }
4. %%
5. { user subroutines }
Definitions include declarations of constants, variables and regular definitions.
Rules are statements of the form p1 {action1} p2 {action2} ... pn {actionn},
where each pi is a regular expression and each actioni describes what the lexical analyzer should
do when a lexeme matches the pattern pi.
User subroutines are auxiliary procedures needed by the actions. The subroutine can be loaded with the
lexical analyzer and compiled separately.
b. Construct minimum-state DFAs for the following regular expressions:
(a/b)*a(a/b)
(a/b)*a(a/b)(a/b)
(a/b)*a(a/b)(a/b)(a/b)
UNIT II
Q.1 a. Explain top-down and bottom-up parsing techniques in detail.
Answer.
Parser
The parser is the compiler component that checks and structures the stream of tokens coming from the lexical analysis phase.
A parser takes input in the form of sequence of tokens and produces output in the form of parse tree.
Parsing is of two types: top down parsing and bottom up parsing.
Top down parsing
o Top down parsing is also known as recursive parsing or predictive parsing.
o Top down parsing is used to construct a parse tree for an input string.
o In top down parsing, the parsing starts from the start symbol and transforms it into the input string.
Parse Tree representation of input string "acdb" is as follows:
Bottom up parsing
o Bottom up parsing is also known as shift-reduce parsing.
o Bottom up parsing is used to construct a parse tree for an input string.
o In the bottom up parsing, the parsing starts with the input symbol and construct the parse tree up to the start
symbol by tracing out the rightmost derivations of string in reverse.
Example
Production
1. E → T
2. T → T * F
3. T → F
4. F → id
Parse Tree representation of input string "id * id" is as follows:
Bottom up parsing is classified into various types. These are as follows:
1. Shift-Reduce Parsing
2. Operator Precedence Parsing
3. Table Driven LR Parsing
a. LR( 0 )
b. SLR( 1 )
c. CLR ( 1 )
d. LALR( 1 )
b. Construct an LL(1) parse table for a grammar; calculate FIRST and FOLLOW sets as needed.
Answer:
Rules to compute FIRST set:
1. If x is a terminal, then FIRST(x) = { ‘x’ }
2. If x -> ε is a production rule, then add ε to FIRST(x).
3. If X -> Y1 Y2 Y3….Yn is a production,
1. FIRST(X) = FIRST(Y1)
2. If FIRST(Y1) contains ε then FIRST(X) = { FIRST(Y1) – ε } U { FIRST(Y2) }
3. If FIRST(Yi) contains ε for all i = 1 to n, then add ε to FIRST(X).
Rules to compute FOLLOW set:
1) FOLLOW(S) = { $ } // where S is the starting Non-Terminal
2) If A -> pBq is a production, where p, B and q are any grammar symbols,
then everything in FIRST(q) except ε is in FOLLOW(B).
3) If A -> pB is a production, then everything in FOLLOW(A) is in FOLLOW(B).
4) If A -> pBq is a production and FIRST(q) contains ε,
then FOLLOW(B) contains { FIRST(q) – ε } U FOLLOW(A)
Example:
Production Rules:
E -> TE’
E’ -> +T E’ | ε
T -> F T’
T’ -> *F T’ | ε
F -> (E) | id
FIRST set
FIRST(E) = FIRST(T) = { ( , id }
FIRST(E’) = { +, ε }
FIRST(T) = FIRST(F) = { ( , id }
FIRST(T’) = { *, ε }
FIRST(F) = { ( , id }
FOLLOW Set
FOLLOW(E) = { $ , ) } // Note ')' is there because of 5th rule
FOLLOW(E’) = FOLLOW(E) = { $, ) } // See 1st production rule
FOLLOW(T) = { FIRST(E’) – ε } U FOLLOW(E’) U FOLLOW(E) = { + , $ , ) }
FOLLOW(T’) = FOLLOW(T) = { + , $ , ) }
FOLLOW(F) = { FIRST(T’) – ε } U FOLLOW(T’) U FOLLOW(T) = { *, +, $, ) }
Or
a. What do you mean by context free grammar? Give the distinction between regular
and context free grammar and the limitations of context free grammar.
Answer: Context free grammar
Context free grammar is a formal grammar which is used to generate all possible strings in a given formal language.
Context free grammar G can be defined by four tuples as:
1. G= (V, T, P, S)
Where,
G describes the grammar
T describes a finite set of terminal symbols.
V describes a finite set of non-terminal symbols
P describes a set of production rules
S is the start symbol.
In CFG, the start symbol is used to derive the string. You can derive the string by repeatedly replacing a
non-terminal by the right hand side of the production, until all non-terminals have been replaced by terminal symbols.
Example:
L = {wcwR | w ∈ (a, b)*}
Production rules:
1. S → aSa
2. S → bSb
3. S → c
Now check that abbcbba string can be derived from the given CFG.
1. S ⇒ aSa
2. S ⇒ abSba
3. S ⇒ abbSbba
4. S ⇒ abbcbba
By applying the production S → aSa, S → bSb recursively and finally applying the production S → c, we get the
string abbcbba.
Capabilities of CFG
These are the various capabilities of CFG:
o Context free grammar is useful to describe most programming languages.
o If the grammar is properly designed then an efficient parser can be constructed automatically.
o Using associativity & precedence information, suitable grammars for expressions can be
constructed.
o Context free grammar is capable of describing nested structures like: balanced parentheses, matching
begin-end, corresponding if-then-else's & so on.
Distinction between regular and context free grammar
Definition
A regular expression is a concept in formal language theory which is a sequence of characters that define a search pattern.
Context Free Grammar is a type of formal grammar in formal language theory, which is a set of production rules that describe all
possible strings in a given formal language.
Usage
Regular expressions help to represent certain sets of string in an algebraic fashion. It helps to represent regular languages.
Context free grammar helps to define all the possible strings of a context free language.
Difference Between Rules
Regular and context-free grammars differ in the types of rules they allow. The rules of context-free grammars allow
possible sentences as combinations of unrelated individual words (which Chomsky calls “terminals”) and groups of
words (phrases, or what Chomsky calls “non-terminals”). Context-free grammars allow individual words and
phrases in any order and allow sentences with any number of individual words and phrases. Regular grammars, on
the other hand, allow only individual words along with a single phrase per sentence. Furthermore, phrases in regular
grammars must appear in the same position in every sentence or phrase generated by the grammar.
b. Show whether the following grammar is LL(1) or not.
E -> TE'
E' -> +TE'|ε
T -> FT'
T' -> *FT'|ε
F -> (E)|id
Answer:
This is the standard left-factored expression grammar; its FIRST and FOLLOW sets were computed earlier for this same grammar:
FIRST(E) = FIRST(T) = FIRST(F) = { ( , id }, FIRST(E') = { +, ε }, FIRST(T') = { *, ε },
FOLLOW(E') = { $, ) }, FOLLOW(T') = { +, $, ) }.
A grammar is LL(1) when, for every nonterminal, one token of lookahead selects a unique alternative:
o For F -> (E) | id, FIRST((E)) = { ( } and FIRST(id) = { id } are disjoint.
o For E' -> +TE' | ε, FIRST(+TE') = { + } does not intersect FOLLOW(E') = { $, ) }.
o For T' -> *FT' | ε, FIRST(*FT') = { * } does not intersect FOLLOW(T') = { +, $, ) }.
Hence every cell of the LL(1) parse table receives at most one production, and the grammar is LL(1).
UNIT III
Q.3 a. Write a program to translate an infix expression into postfix form; also write the syntax
directed definition for the same.
Answer:
/* C++ implementation to convert an infix expression to postfix */
// Note that here we use std::stack for stack operations
#include <bits/stdc++.h>
using namespace std;

// Function to return precedence of operators
int prec(char c)
{
    if (c == '^')
        return 3;
    else if (c == '*' || c == '/')
        return 2;
    else if (c == '+' || c == '-')
        return 1;
    else
        return -1;
}

// The main function to convert an infix expression
// to a postfix expression
void infixToPostfix(string s)
{
    stack<char> st;
    st.push('N');               // sentinel marking the bottom of the stack
    int l = s.length();
    string ns;
    for (int i = 0; i < l; i++)
    {
        // If the scanned character is an operand, add it to the output string.
        if ((s[i] >= 'a' && s[i] <= 'z') || (s[i] >= 'A' && s[i] <= 'Z'))
            ns += s[i];

        // If the scanned character is a '(', push it to the stack.
        else if (s[i] == '(')
            st.push('(');

        // If the scanned character is a ')', pop from the stack to the
        // output string until a '(' is encountered.
        else if (s[i] == ')')
        {
            while (st.top() != 'N' && st.top() != '(')
            {
                ns += st.top();
                st.pop();
            }
            if (st.top() == '(')
                st.pop();       // discard the '('
        }

        // If an operator is scanned
        else
        {
            while (st.top() != 'N' && prec(s[i]) <= prec(st.top()))
            {
                ns += st.top();
                st.pop();
            }
            st.push(s[i]);
        }
    }

    // Pop all the remaining elements from the stack
    while (st.top() != 'N')
    {
        ns += st.top();
        st.pop();
    }
    cout << ns << endl;
}

// Driver program to test the above function
int main()
{
    string exp = "a+b*(c^d-e)^(f+g*h)-i";
    infixToPostfix(exp);
    return 0;
}
// This code is contributed by Gautam Singh
Output:
abcd^e-fgh*+^*+i-
Syntax Directed Translation
Syntax-directed translation refers to a method of compiler implementation where the source language translation is
completely driven by the parser. In other words, the parsing process and parse trees are used to direct semantic
analysis and the translation of the source program.
Basically we try to integrate the generation of the parse tree and its evaluation by traversing the parse tree.
Production : A -> XY
Semantic Rule : A.a = f(X.b, Y.c)
where a is an attribute associated with A, b is an attribute associated with X and c is an attribute associated with Y.
SDT : A -> XY { action or code fragment }
In the above SDT rule the position of the action varies as per the type of SDD (L-attributed or S-attributed) and
the type of attribute (synthesized or inherited) to be evaluated.
SDT for infix to postfix conversion of expression for given grammar :
Grammar :
E -> E + T { print('+') }
E -> E - T { print('-') }
E -> T { }
T -> id { print(id) }
SDT for evaluation of expression for given grammar :
Grammar :
L -> E { print(E.val) }
E -> E1 + T { E.val = E1.val + T.val }
E -> T { E.val = T.val }
T -> T1 * F { T.val = T1.val * F.val }
T -> F { T.val = F.val }
F -> id { F.val = id.lexval }
Getting an equivalent SDT from:
1) S-attributed SDD:
For all semantic rules, generate an action to compute the synthesized attribute of the non-terminal in the head of
the production from the synthesized attributes of the non-terminals in the body of the production.
Rules:
1) Translate those semantic rules to equivalent code fragments or actions.
2) Place those actions at the end of the production.
The above SDT is an example of conversion of an S-attributed SDD to an equivalent SDT. We always append the
action to the end of the production; this type of SDT is called a postfix SDT.
To evaluate the expressions we use a value stack for evaluation along with the symbol stack which is used for parsing.
Using the above SDT we translate the input expression 3*4:
Symbol Stack   Value Stack   Input String   Syntax Action      Semantic Action
$              $             3*4            shift
$id            $3            *4             reduce F -> id     F.val = id.val
$F             $3            *4             reduce T -> F      T.val = F.val
$T             $3            *4             shift
$T*            $3*           4              shift
$T*id          $3*4          $              reduce F -> id     F.val = id.val
$T*F           $3*4          $              reduce T -> T*F    T.val = T.val * F.val
$T             $12           $              reduce E -> T      E.val = T.val
$E             $12           $              reduce L -> E      print(E.val)
2) L-attributed SDD:
Let the production stated below be a rule of an LL(1) grammar.
Rule : A -> X1 X2 X3
Semantic Rules : X2.b_inh = f(X1.a_syn, A.x_inh)
A.x_syn = f(X1.a_syn, X2.b_syn, X3.c_syn)
where each of the non-terminals may have synthesized and inherited attributes.
SDT :
A -> X1 { X2.b_inh = f(X1.a_syn, A.x_inh) } X2 X3 { A.x_syn = f(X1.a_syn, X2.b_syn, X3.c_syn) }
To evaluate a synthesized attribute we append the action at the end of the production, and to evaluate an
inherited attribute we place the action just before the non-terminal symbol whose attribute has to be calculated.
c. Write the specification of a simple type checker with an example.
Specification of a Simple Type Checker
We consider the language generated by the following grammar:
P → D ; E
D → D ; D
D → id : T
T → char
T → integer
T → [ num ] T
T → ↑ T
T → T '→' T
E → literal
E → num
E → id
E → E mod E
E → E [ E ]
E → E ↑
E → E ( E )
A sentence of this language is a Program.
A Program consists of a sequence of Declarations followed by an Expression.
char and integer are the basic types whereas literal and num stand for
elements of these types.
id is the token for identifiers.
[ num ] T is an array type construct whereas E [ E ]
refers to an element of an array.
↑ T is a pointer type construct whereas E ↑ is a pointer dereference.
T '→' T is a function type construct whereas E ( E ) is a function call.
E mod E represents a remainder computation.
We consider the following attributes (all synthesized; there are no inherited attributes):
Grammar symbol    Synthesized attribute
E                 E.type (type expression)
T                 T.type (type expression)
id                id.entry (symbol-table entry)
num               num.val (integer value)
literal           literal.val (character value)
First we give a translation scheme that saves the type of an identifier.
D → id : T       { addtype(id.entry, T.type) }                     (1)
T → char         { T.type := char }
T → integer      { T.type := integer }
T → [ num ] T1   { T.type := array(0 ... num.val - 1, T1.type) }
T → ↑ T1         { T.type := pointer(T1.type) }
T → T1 '→' T2    { T.type := (T1.type → T2.type) }
Note that because of rule (1)
the types of all identifiers are saved in the symbol table before the expression
generated by E is checked. Now we give a translation scheme with the type checking rules
for expressions. We assume that lookup retrieves the type information about an entry
of the symbol table.
E → literal      { E.type := char }
E → num          { E.type := integer }
E → id           { E.type := lookup(id.entry) }
E → E1 mod E2    { E.type := if E1.type = integer
                             and E2.type = integer
                             then integer
                             else type_error }
E → E1 [ E2 ]    { E.type := if E2.type = integer
                             and E1.type = array(n, T)
                             then T
                             else type_error }
E → E1 ↑         { E.type := if E1.type = pointer(T)
                             then T
                             else type_error }
E → E1 ( E2 )    { E.type := if E2.type = S
                             and E1.type = S → T
                             then T
                             else type_error }
Now we extend our language by stating that
A Program consists of a sequence of Declarations followed by a sequence
of Statements.
A statement is either an assignment, an alternative or a while loop.
boolean is a basic type.
An expression could be E1 = E2, which evaluates to boolean provided
that E1 and E2 have the same basic type, otherwise evaluates to type_error.
Therefore we obtain the following new grammar.
P → D ; S
D → D ; D
D → id : T
T → char
T → integer
T → boolean
S → S ; S
S → id := E
S → if E then S
S → while E do S
Extending the translation scheme for statements leads to:
S → S1 ; S2       { S.type := if S1.type = void
                              and S2.type = void
                              then void
                              else type_error }
S → id := E       { S.type := if E.type = id.type
                              then void
                              else type_error }
S → if E then S1  { S.type := if E.type = boolean
                              then S1.type
                              else type_error }
S → while E do S1 { S.type := if E.type = boolean
                              then S1.type
                              else type_error }
OR
a. Explain the syntax directed translation scheme in detail.
Answer: Syntax directed translation scheme
o The syntax directed translation scheme is a context-free grammar.
o The syntax directed translation scheme is used to evaluate the order of semantic rules.
o In a translation scheme, the semantic rules are embedded within the right side of the productions.
o The position at which an action is to be executed is shown by enclosing it between braces. It is written within the right side of the production.
Example
Production Semantic Rules
S → E $ { print(E.VAL) }
E → E + E {E.VAL := E.VAL + E.VAL }
E → E * E {E.VAL := E.VAL * E.VAL }
E → (E) {E.VAL := E.VAL }
E → I {E.VAL := I.VAL }
I → I digit {I.VAL := 10 * I.VAL + LEXVAL }
I → digit { I.VAL:= LEXVAL}
Implementation of Syntax directed translation
Syntax directed translation is implemented by constructing a parse tree and performing the actions in a left to right depth-first order.
SDT is implemented by parsing the input and producing a parse tree as a result.
Example
Production Semantic Rules
S → E $ { print(E.VAL) }
E → E + E {E.VAL := E.VAL + E.VAL }
E → E * E {E.VAL := E.VAL * E.VAL }
E → (E) {E.VAL := E.VAL }
E → I {E.VAL := I.VAL }
I → I digit {I.VAL := 10 * I.VAL + LEXVAL }
I → digit { I.VAL:= LEXVAL}
Parse tree for SDT:
b. Write the process and importance of intermediate code generation.
Answer: If a source code can directly be translated into its target machine code, why do we
need to translate the source code into an intermediate code which is then translated to its target
code? Let us see the reasons why we need an intermediate code.
If a compiler translates the source language to its target machine language without having the option for generating intermediate
code, then for each new machine, a full native compiler is required.
Intermediate code eliminates the need of a new full compiler for every unique machine by keeping the analysis portion same for all
the compilers.
The second part of compiler, synthesis, is changed according to the target machine.
It becomes easier to apply the source code modifications to improve code performance by applying code optimization techniques on
the intermediate code.
Intermediate Representation
Intermediate codes can be represented in a variety of ways and they have their own benefits.
High Level IR - High-level intermediate code representation is very close to the source language itself. They can be easily generated
from the source code and we can easily apply code modifications to enhance performance. But for target machine optimization, it is
less preferred.
Low Level IR - This one is close to the target machine, which makes it suitable for register and memory allocation, instruction set
selection, etc. It is good for machine-dependent optimizations.
Intermediate code can be either language specific (e.g., bytecode for Java) or language
independent (three-address code).
Three-Address Code
The intermediate code generator receives input from its predecessor phase, the semantic analyzer, in the
form of an annotated syntax tree. That syntax tree can then be converted into a linear
representation, e.g., postfix notation. Intermediate code tends to be machine-independent code;
therefore, the code generator assumes an unlimited number of memory locations (registers) while
generating code.
For example:
a = b + c * d;
The intermediate code generator will try to divide this expression into sub-expressions and then
generate the corresponding code.
r1 = c * d;
r2 = b + r1;
a = r2
r being used as registers in the target program.
A three-address code has at most three address locations to calculate the expression. A three-address
code can be represented in two forms: quadruples and triples.
Quadruples
Each instruction in quadruples presentation is divided into four fields: operator, arg1, arg2, and
result. The above example is represented below in quadruples format:
Op   arg1   arg2   result
*    c      d      r1
+    b      r1     r2
=    r2            a
Triples
Each instruction in triples presentation has three fields : op, arg1, and arg2.The results of
respective sub-expressions are denoted by the position of expression. Triples represent
similarity with DAG and syntax tree. They are equivalent to DAG while representing
expressions.
Op   arg1   arg2
*    c      d
+    b      (0)
=    (1)
Triples face the problem of code immovability during optimization, as the results are positional
and changing the order or position of an expression may cause problems.
Indirect Triples
This representation is an enhancement over triples representation. It uses pointers instead of
position to store results. This enables the optimizers to freely re-position the sub-expression to
produce an optimized code.
Declarations
A variable or procedure has to be declared before it can be used. Declaration involves allocation
of space in memory and entry of type and name in the symbol table. A program may be coded
and designed keeping the target machine structure in mind, but it may not always be possible to
accurately convert a source code to its target language.
Taking the whole program as a collection of procedures and sub-procedures, it becomes
possible to declare all the names local to the procedure. Memory allocation is done in a
consecutive manner and names are allocated to memory in the sequence they are declared in the
program. We use an offset variable and set it to zero {offset = 0} to denote the base address.
The source programming language and the target machine architecture may vary in the way
names are stored, so relative addressing is used. While the first name is allocated memory
starting from the memory location 0 {offset=0}, the next name declared later, should be
allocated memory next to the first one.
Example:
We take the example of C programming language where an integer variable is assigned 2 bytes
of memory and a float variable is assigned 4 bytes of memory.
int a;
float b;
Allocation process:
{offset = 0}
int a;
id.type = int
id.width = 2
offset = offset + id.width
{offset = 2}
float b;
id.type = float
id.width = 4
offset = offset + id.width
{offset = 6}
To enter this detail in a symbol table, a procedure enter can be used. This method may have the
following structure:
enter(name, type, offset)
This procedure should create an entry in the symbol table, for variable name, having its type set
to type and relative address offset in its data area.
UNIT IV
a. Write short notes on:
Symbol table
Answer: Symbol table is an important data structure created and maintained by compilers in
order to store information about the occurrence of various entities such as variable names,
function names, objects, classes, interfaces, etc. Symbol table is used by both the analysis and
the synthesis parts of a compiler.
A symbol table may serve the following purposes depending upon the language in hand:
To store the names of all entities in a structured form at one place.
To verify if a variable has been declared.
To implement type checking, by verifying assignments and expressions in the source
code are semantically correct.
To determine the scope of a name (scope resolution).
A symbol table is simply a table which can be either linear or a hash table. It maintains an entry
for each name in the following format:
<symbol name, type, attribute>
For example, if a symbol table has to store information about the following variable declaration:
static int interest;
then it should store the entry such as:
<interest, int, static>
The attribute clause contains the entries related to the name.
Implementation
If a compiler is to handle a small amount of data, then the symbol table can be implemented as
an unordered list, which is easy to code but suitable only for small tables. A symbol
table can be implemented in one of the following ways:
Linear (sorted or unsorted) list
Binary Search Tree
Hash table
Among all, symbol tables are mostly implemented as hash tables, where the source code symbol
itself is treated as a key for the hash function and the return value is the information about the
symbol.
Operations
A symbol table, either linear or hash, should provide the following operations.
insert()
This operation is more frequently used by analysis phase, i.e., the first half of the compiler
where tokens are identified and names are stored in the table. This operation is used to add
information in the symbol table about unique names occurring in the source code. The format or
structure in which the names are stored depends upon the compiler in hand.
An attribute for a symbol in the source code is the information associated with that symbol. This
information contains the value, state, scope, and type about the symbol. The insert() function
takes the symbol and its attributes as arguments and stores the information in the symbol table.
For example:
int a;
should be processed by the compiler as:
insert(a, int);
lookup()
lookup() operation is used to search a name in the symbol table to determine:
if the symbol exists in the table.
if it is declared before it is being used.
if the name is used in the scope.
if the symbol is initialized.
if the symbol is declared multiple times.
The format of lookup() function varies according to the programming language. The basic
format should match the following:
lookup(symbol)
This method returns 0 (zero) if the symbol does not exist in the symbol table. If the symbol
exists in the symbol table, it returns its attributes stored in the table.
Scope Management
A compiler maintains two types of symbol tables: a global symbol table which can be accessed
by all the procedures and scope symbol tables that are created for each scope in the program.
storage allocation strategies
Ans. The different ways to allocate memory are:
1. Static storage allocation
2. Stack storage allocation
3. Heap storage allocation
Static storage allocation
o In static allocation, names are bound to storage locations at compile time.
o If memory is created at compile time then the memory will be created in the static area and
only once.
o Static allocation does not support dynamic data structures: memory is created only
at compile time and deallocated after program completion.
o The drawback with static storage allocation is that the size and position of data objects
must be known at compile time.
o Another drawback is the restriction on recursive procedures.
Stack Storage Allocation
o In stack storage allocation, storage is organized as a stack.
o An activation record is pushed onto the stack when an activation begins and it is popped
when the activation ends.
o Activation record contains the locals so that they are bound to fresh storage in each
activation record. The value of locals is deleted when the activation ends.
o It works on the basis of last-in-first-out (LIFO) and this allocation supports the recursion
process.
Heap Storage Allocation
o Heap allocation is the most flexible allocation scheme.
o Allocation and deallocation of memory can be done at any time and at any place
depending upon the user's requirement.
o Heap allocation is used to allocate memory to variables dynamically and to reclaim it when the
variables are no longer used.
o Heap storage allocation supports the recursion process.
activation record
Answer: The control stack is a run-time stack which is used to keep track of the live procedure
activations, i.e., it is used to find out the procedures whose execution has not been
completed.
o When a procedure is called (activation begins) its name is pushed onto the stack,
and when it returns (activation ends) it is popped.
o An activation record is used to manage the information needed by a single execution of a
procedure.
o An activation record is pushed onto the stack when a procedure is called and it is popped
when control returns to the caller.
Return Value: It is used by the called procedure to return a value to the calling procedure.
Actual Parameter: It is used by calling procedures to supply parameters to the called procedures.
Control Link: It points to activation record of the caller.
Access Link: It is used to refer to non-local data held in other activation records.
Saved Machine Status: It holds the information about status of machine before the procedure is called.
Local Data: It holds the data that is local to the execution of the procedure.
Temporaries: It stores the value that arises in the evaluation of an expression.
Parameter Passing
Answer: Parameters
Formal parameter — the identifier used in a method to stand for the value that is passed into the
method by a caller. For example, amount is a formal parameter of processDeposit.
Actual parameter — the actual value that is passed into the method by a caller.
o For example, the 200 used when processDeposit is called is an actual parameter.
o Actual parameters are often called arguments.
OR
a. Explain the following in detail
Nesting Depth and Access Links
Nesting Depth
Outermost procedures have nesting depth 1. Every other procedure has nesting depth one more than the
nesting depth of the immediately enclosing procedure. For example, if f and g are both nested directly
inside main, then main has nesting depth 1 and both f and g have nesting depth 2.
Access Links
The AR for a nested procedure contains an access link that points to the AR of the most recent
activation of the immediately enclosing procedure. In the example above, the access link of every
activation of f and g would point to the AR of the (only) activation of main. Then, for a procedure
P to access a name defined in the scope whose nesting depth is 3 less than that of P, the access
links are followed three times.
Data structures used in Symbol table
o A compiler maintains two types of symbol table: a global symbol table and scope symbol tables.
o The global symbol table can be accessed by all procedures, whereas a scope symbol table is visible only within the scope (and nested scopes) for which it was created.
Scopes and their symbol tables are arranged in a hierarchical structure, as the following code illustrates:
int value = 10;

void sum_num()
{
    int num_1;
    int num_2;

    {
        int num_3;
        int num_4;
    }

    int num_5;

    {
        int num_6;
        int num_7;
    }
}

void sum_id()
{
    int id_1;
    int id_2;

    {
        int id_3;
        int id_4;
    }

    int id_5;
}
The above code can be represented in a hierarchical data structure of symbol tables:
The global symbol table contains one global variable and two procedure names. The names declared in the sum_num table are not available to sum_id and its child tables.
The data-structure hierarchy of symbol tables is maintained by the semantic analyzer. To search for a name in the symbol tables, the following algorithm is used:
o First, the name is searched in the current symbol table.
o If it is found, the search is complete; otherwise, the name is searched in the parent's symbol table, and so on, until
o the name is found or the global symbol table has been searched.
Representing Scope Information
In the source program, every name possesses a region of validity, called the scope of that name.
The rules in a block-structured language are as follows:
1. If a name is declared within block B, then it is valid only within B.
2. If block B1 is nested within B2, then any name valid in B2 is also valid in B1, unless the name is re-declared in B1.
o These scope rules require a more complicated symbol-table organization than a simple list of associations between names and attributes.
o Tables are organized into a stack, and each table contains the list of names and their associated attributes.
o Whenever a new block is entered, a new table is pushed onto the stack. The new table holds the names declared as local to this block.
o When a declaration is compiled, the table is searched for the name.
o If the name is not found in the table, the new name is inserted.
o When a reference to a name is translated, the tables are searched starting from the table on top of the stack.
For example:
int x;
void f(int m) {
    float x, y;
    {
        int i, j;
        int u, v;
    }
}
int g(int n)
{
    bool t;
}
Fig: Symbol table organization that complies with static scope information rules
Static versus dynamic storage allocation
Answer: Static Versus Dynamic Storage Allocation
Much (often most) data cannot be statically allocated. Either its size is not known at compile time or its
lifetime is only a subset of the program's execution.
Early versions of Fortran used only statically allocated data. This required that each array had a constant
size specified in the program. Another consequence of supporting only static allocation was that recursion
was forbidden (otherwise the compiler could not tell how many versions of a variable would be needed).
Modern languages, including newer versions of Fortran, support both static and dynamic allocation of
memory.
The advantage of supporting dynamic storage allocation is the increased flexibility and storage
efficiency it makes possible (instead of declaring an array with a size adequate for the largest
data set, just allocate what is needed). The advantage of static storage allocation is that it
avoids the run-time costs of allocation/deallocation and may permit faster code sequences for
referencing the data.
An (unfortunately, all too common) error is a so-called memory leak, where a long-running program
repeatedly allocates memory that it fails to free, even after the memory can no longer be
referenced. To avoid memory leaks and to ease programming, several programming language systems
employ automatic garbage collection: the runtime system itself determines when data can no longer
be referenced and automatically deallocates it.
Activation Trees
Answer: Activation Tree
A program consists of procedures; a procedure definition is a declaration that, in its simplest
form, associates an identifier (the procedure name) with a statement (the body of the procedure).
Each execution of a procedure is referred to as an activation of the procedure. The lifetime of an
activation is the sequence of steps in the execution of the procedure. If 'a' and 'b' are two
procedures, then their activations are either non-overlapping (when one is called after the other
has returned) or nested (when one is called inside the other). A procedure is recursive if a new
activation begins before an earlier activation of the same procedure has ended. An activation tree
shows the way control enters and leaves activations.
Properties of activation trees are :-
Each node represents an activation of a procedure.
The root represents the activation of the main function.
The node for procedure 'x' is the parent of the node for procedure 'y' if and only if control
flows from activation x to activation y.
Example – Consider the following program of Quicksort
main() {
    int n;
    readarray();
    quicksort(1, n);
}
quicksort(int m, int n) {
    int i = partition(m, n);
    quicksort(m, i - 1);
    quicksort(i + 1, n);
}
In the activation tree for this program, main is the root; main calls readarray and then
quicksort, and quicksort in turn calls partition and quicksort again. The flow of control in
the program corresponds to a depth-first traversal of the activation tree, starting at the root.
b. What is peephole optimization? Explain it.
Ans. Peephole Optimization
This optimization technique works locally, on a small portion of the code at hand (the
"peephole"), to transform it into optimized code. These methods can be applied to intermediate
code as well as to target code. A small window of statements is analyzed and checked for the
following possible optimizations:
Redundant instruction elimination
At the source-code level, the following successive simplifications can be done by the user:
int add_ten(int x)
{
int y, z;
y = 10;
z = x + y;
return z;
}
int add_ten(int x)
{
int y;
y = 10;
y = x + y;
return y;
}
int add_ten(int x)
{
int y = 10;
return x + y;
}
int add_ten(int x)
{
return x + 10;
}
At the compilation level, the compiler searches for instructions that are redundant in nature.
Multiple load and store instructions may carry the same meaning even if some of them are
removed. For example:
MOV x, R0
MOV R0, R1
We can delete the first instruction and rewrite the code as:
MOV x, R1
Unreachable code
Unreachable code is a part of the program code that is never accessed because of programming
constructs. Programmers may have accidentally written a piece of code that can never be reached.
Example:
int add_ten(int x)
{
return x + 10;
printf("value of x is %d", x);
}
In this code segment, the printf statement will never be executed, because program control
returns to the caller before it can execute; hence, the printf can be removed.
Flow of control optimization
There are instances in a code where the program control jumps back and forth without
performing any significant task. These jumps can be removed. Consider the following chunk of
code:
...
MOV R1, R2
GOTO L1
...
L1 : GOTO L2
L2 : INC R1
In this code, label L1 can be removed, as it merely passes control on to L2. So instead of
jumping to L1 and then to L2, control can reach L2 directly, as shown below:
...
MOV R1, R2
GOTO L2
...
L2 : INC R1
Algebraic expression simplification
There are occasions where algebraic expressions can be simplified. For example, the
statement a = a + 0 has no effect and can be eliminated, and the statement a = a + 1 can
simply be replaced by INC a.
Strength reduction
There are operations that consume more time and space. Their ‘strength’ can be reduced by
replacing them with other operations that consume less time and space, but produce the same
result.
For example, x * 2 can be replaced by x << 1, which involves only one left shift. Though
a * a and a² produce the same output, a * a (a single multiplication) is much more efficient
to implement than a call to a power routine.
Accessing machine instructions
The target machine may provide more sophisticated instructions that can perform specific
operations much more efficiently. If the target code can use those instructions directly,
it will not only improve the quality of the code but also yield more efficient results.
OR
a. Construct the tree for following expressions and apply
labeling algorithm for ordering
x*(y+z)-z/(u-v)
b. Explain the basic block and control flow graph.
Answer:
Basic Blocks
Source code generally contains runs of instructions that are always executed in sequence;
these runs are considered the basic blocks of the code. Basic blocks have no jump statements
within them, i.e., when the first instruction is executed, all the instructions in the same
basic block will be executed in their sequence of appearance without the flow of control
leaving the block.
Program constructs such as IF-THEN-ELSE and SWITCH-CASE conditional statements, and loops
such as DO-WHILE, FOR, and REPEAT-UNTIL, determine how a program is divided into basic blocks.
Basic block identification
We may use the following algorithm to find the basic blocks in a program:
Search for the header (leader) statements, at which basic blocks start:
o The first statement of the program.
o Any statement that is the target of a branch (conditional or unconditional).
o Any statement that immediately follows a branch statement.
A header statement together with the statements that follow it, up to but not including the next header, forms a basic block.
A basic block does not include the header statement of any other basic block.
Basic blocks are important from both the code-generation and optimization points of view.
They play an important role in identifying variables that are used more than once within a
single basic block. If a variable is used more than once, the register allocated to that
variable need not be freed until the block finishes execution.
Control Flow Graph
Basic blocks in a program can be represented by means of a control flow graph. A control flow
graph depicts how program control is passed among the blocks. It is a useful tool that helps
in optimization, for example by locating unwanted loops in the program.