Compilers and Code Optimization
EDOARDO FUSELLA
Source: wpage.unina.it/edoardo.fusella/cco/downloads/lezione3.pdf
Front-end
Contents
Lexical Analysis
Syntax Analysis
Semantic Analysis
The role of the front end
The front end of the compiler performs analysis.
The analysis is usually broken up into
Lexical Analysis: breaking the input into individual words or “tokens”
Syntax Analysis (or parsing): parsing the phrase structure of the
program
Semantic Analysis: calculating the program’s meaning
Lexical Analysis
Lexical Analyzer: Goals
The lexical analyzer takes a stream of characters and produces a
stream of names, keywords, and punctuation marks
It discards white space and comments between the tokens
It would unduly complicate the parser to have to account for possible
white space and comments at every possible point
Lexical Analyzer: Phases
Lexical analyzers are divided into a cascade of two phases:
Scanning
Consists of the simple processes that do not require tokenization of
the input.
Deletion of comments.
Compaction of consecutive whitespace characters into one.
Lexical analysis
Encode constants as tokens
Recognize Keywords and Identifiers
Store identifier names in a symbol table
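The phases above can be sketched in a few lines of Python; the token names, keyword set, and (kind, value) tuple format are illustrative, not from the slides:

```python
import re

# Constants become NUM tokens, keywords and identifiers are told apart,
# and identifier names are interned in a symbol table.
KEYWORDS = {"if", "then", "else", "begin", "end", "while"}
TOKEN_SPEC = [
    ("NUM",  r"\d+"),             # encode constants as tokens
    ("ID",   r"[A-Za-z_]\w*"),    # identifiers (and keywords)
    ("OP",   r"[+\-*/=]"),
    ("SKIP", r"[ \t\n]+"),        # whitespace: compacted and discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text, symtab):
    tokens = []
    for m in MASTER.finditer(text):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "SKIP":
            continue                                      # discard whitespace
        if kind == "ID" and lexeme in KEYWORDS:
            tokens.append((lexeme.upper(), lexeme))       # keyword token kind
        elif kind == "ID":
            key = symtab.setdefault(lexeme, len(symtab))  # one entry per name
            tokens.append(("ID", key))                    # return key into table
        else:
            tokens.append((kind, lexeme))
    return tokens
```

Identifiers carry an index into the symbol table rather than their spelling, which is what later slides mean by "returns key to table entry".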
Lexical Analyzer: Input/Output
Input
A sequence of characters
Character set:
ASCII
ISO 8859-1 (Latin-1)
ISO 10646 (16-bit = Unicode)
Others (EBCDIC, JIS, etc)
Output
A series of tokens:
Punctuation ( ) ; , [ ]
Operators + - ** :=
Keywords begin end if while try catch
Identifiers Square_Root
String literals “press Enter to continue”
Character literals ‘x’
Numeric literals
Integer: 123
Floating point: 4_5.23e+2
Based representation: 16#ac#
Lexical Analyzer: Free Form vs Fixed Form
Free form languages (all modern ones)
White space does not matter. Ignore these:
Tabs, spaces, new lines, carriage returns
Only the ordering of tokens is important
Fixed format languages (historical)
Layout is critical, and the lexical analyzer must know about it to find tokens
Fortran Fixed Format:
80 columns per line
Column 1-5 for the statement number/label column
Column 6 for the continuation mark (any character other than blank or zero)
Column 7-72 for the program statements
Columns 73-80 ignored (historically used for card sequence numbers)
Letter C in Column 1 meant the current line is a comment
Lexical Analyzer: Punctuation/Operators
Punctuation
Separators
Typically individual special characters such as ( { } : …)
Sometimes double characters: the lexical scanner looks for the longest token: (*, /*, -- are comment openers in various languages
Returns token kind
And perhaps location for error messages and debugging purposes
Operators
Like punctuation
No real difference for lexical analyzer
Typically single or double special chars ( +, -, ==, <= …)
Returns token kind
And perhaps location
Lexical Analyzer: Keywords/Identifiers
Keywords
Reserved identifiers
E.g. BEGIN END in Pascal, if in C, catch in C++
Returns token kind
And perhaps location
Identifiers
Rules differ: Length, allowed characters, separators
Need to build a names table
Single entry for all occurrences of Var1
Language may be case insensitive: same entry for VAR1, vAr1, Var1
Typical structure: hash table
Returns token kind
And key (index) to table entry
Table entry includes location information
Lexical Analyzer: Organization of the Names Table
Most common structure is hash table
Chain according to hash code
Serial search on one chain
Hash code computed from characters (e.g. sum mod table size).
No hash code is perfect! Expect collisions.
Avoid any arbitrary limits on table or chain size.
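The structure above can be sketched as follows; the class name and entry layout are illustrative, not from the slides:

```python
# Hash code from the characters (sum mod table size), one chain per hash
# code, serial search along the chain, and no fixed limit on chain length.
class NamesTable:
    def __init__(self, size=211):
        self.buckets = [[] for _ in range(size)]   # one chain per hash code
        self.entries = []                          # entry: (name, location)

    def _hash(self, name):
        return sum(ord(c) for c in name) % len(self.buckets)

    def lookup_or_insert(self, name, location=None):
        chain = self.buckets[self._hash(name)]
        for key in chain:                          # serial search on one chain
            if self.entries[key][0] == name:
                return key                         # same entry for every occurrence
        key = len(self.entries)                    # new entry; key is its index
        self.entries.append((name, location))
        chain.append(key)
        return key
```

Collisions simply lengthen a chain, so the table never rejects a name, which is what "avoid any arbitrary limits" asks for.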
Lexical Analyzer: String and Character Literals
String Literals
Text must be stored
Actual characters are important
Not like identifiers: must preserve casing
Character set issues: uniform internal representation
Table needed
Lexical analyzer returns key into table
May or may not be worth hashing to avoid duplicates
Character Literals
Similar issues to string literals
Lexical Analyzer returns token kind and identity of character
Lexical Analyzer: Numeric Literals
Need a table to store numeric value
E.g. 123 = 0123 = 01_23 (Ada uses underscores to separate groups of digits)
But cannot use predefined type for values
Because may have different bounds
Floating point representations much more complex
Denormals, correct rounding
Very delicate to compute correct value
Host / target issues
Lexical Analyzer: Handling Comments
Comments have no effect on program
Can be eliminated by scanner
But may need to be retrieved by tools
Error detection issues
E.g. unclosed comments
Scanner skips over comments and returns next meaningful token
Lexical Analyzer: Case Equivalence
Some languages are case-insensitive
Pascal, Ada
Some are not
C, Java
Lexical analyzer ignores case if needed
This_Routine = THIS_RouTine
Error analysis may need exact casing
Friendly diagnostics follow user’s conventions
Lexical Analyzer: Performance Issues
Lexical analysis can become a bottleneck
Minimize processing per character
Skip blanks fast
I/O is also an issue (read large blocks)
We compile frequently
Compilation time is important, especially during development
Communicate with parser through global variables
Lexical Analyzer: Interface to the Lexical Analyzer
Either: Convert entire file to a file of tokens
Lexical analyzer is separate phase
Or: Parser calls lexical analyzer to supply next token
This approach avoids extra I/O
Parser builds tree incrementally, using successive tokens as tree
nodes
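The second interface can be sketched as a one-token-at-a-time supplier; the function name and the (kind, lexeme) pairs are illustrative, not from the slides:

```python
# The parser pulls tokens one at a time instead of reading a pre-built
# token file, avoiding the extra I/O pass for the intermediate file.
def make_lexer(tokens):
    it = iter(tokens)
    def next_token():
        return next(it, ("EOF", None))   # sentinel once input is exhausted
    return next_token
```

A parser would call next_token() each time it needs the next tree node, so lexing and tree building proceed in lockstep.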
Lexical Analyzer: Formalism: Regular Grammar
Non-terminals (arbitrary names)
Terminals (characters)
Productions limited to the following:
Non-terminal ::= terminal
Non-terminal ::= terminal Non-terminal
Treat character class (e.g. digit) as terminal
Regular grammars cannot count:
Cannot express size limits on identifiers, literals
Cannot express proper nesting (parentheses)
Lexical Analyzer: Formalism: Regular Grammar
Grammar for real literals with no exponent
digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
REAL ::= digit REAL1
REAL1 ::= digit REAL1 (arbitrary size)
REAL1 ::= . INTEGER
INTEGER ::= digit INTEGER (arbitrary size)
INTEGER ::= digit
Start symbol is REAL
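The grammar above can be run directly as a deterministic finite state machine; the state names follow the nonterminals loosely and FRAC plays the role of the terminal state (at least one digit after the point):

```python
# Recognizer for real literals with no exponent, per the grammar above.
def accepts(s):
    state = "START"
    for ch in s:
        if state == "START" and ch.isdigit():
            state = "INT"                # REAL ::= digit REAL1
        elif state == "INT" and ch.isdigit():
            state = "INT"                # REAL1 ::= digit REAL1 (arbitrary size)
        elif state == "INT" and ch == ".":
            state = "DOT"                # REAL1 ::= . INTEGER
        elif state in ("DOT", "FRAC") and ch.isdigit():
            state = "FRAC"               # INTEGER ::= digit INTEGER | digit
        else:
            return False                 # no transition: string rejected
    return state == "FRAC"               # legal iff we stop on the terminal state
```

Note that a bare integer like 123 is rejected, exactly as the grammar demands: REAL always passes through the "." production.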
Lexical Analyzer: Formalism: Regular Expressions
Regular expressions (RE) defined by an alphabet (terminal
symbols) and three operations:
Alternation (union) RE1 | RE2
Concatenation RE1 RE2
Repetition RE* (zero or more RE’s)
Language of RE’s = regular grammars
Regular expressions are more convenient for some applications
Lexical Analyzer: Finite State Machines
A language defined by a grammar is a (possibly infinite) set of
strings
An automaton is a computation that determines whether a given
string belongs to a specified language
A finite state machine (FSM) is an automaton that recognizes
regular languages (those described by regular expressions)
Lexical Analyzer: Specifying an FSM
A set of labeled states
Directed arcs between states, labeled with characters
One or more states may be terminal
A distinguished state is start
Automaton makes transition from state S1 to S2
If and only if arc from S1 to S2 is labeled with next character in input
Token is legal if automaton stops on terminal state
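The specification above maps directly onto a data structure plus a driver loop. The example machine below (labeled states, labeled arcs, a start state, a terminal-state set) recognizes identifiers, a letter followed by letters or digits; the machine itself is an illustrative choice, not from the slides:

```python
# Labeled states, arcs labeled with character classes, start state,
# and a set of terminal states, encoded as plain data.
FSM = {
    "start": "S0",
    "terminal": {"S1"},
    "arcs": {("S0", "letter"): "S1",
             ("S1", "letter"): "S1",
             ("S1", "digit"): "S1"},
}

def classify(ch):
    return "letter" if ch.isalpha() else "digit" if ch.isdigit() else ch

def run(fsm, s):
    state = fsm["start"]
    for ch in s:
        state = fsm["arcs"].get((state, classify(ch)))
        if state is None:                 # no arc labeled with the next character
            return False
    return state in fsm["terminal"]       # token legal iff we stop on a terminal state
```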
Lexical Analyzer: Building an FSM from a Grammar
One state for each non-terminal
A rule of the form
Non-terminal-1 ::= terminal
Generates transition from S1 to final state
A rule of the form
Non-terminal-1 ::= terminal Non-terminal-2
Generates transition from S1 to S2 on an arc labeled by the terminal
Lexical Analyzer: Graphic Representation
[Figure: FSM for the real-literal grammar: the start state has a digit arc to a state that loops on digit; a '.' arc leads to a second digit loop whose state is terminal.]
Lexical Analyzer: Building FSMs from REs
Every RE corresponds to a grammar
For all regular expressions
A natural translation to FSM exists
Alternation often leads to non-deterministic machines
Syntax Analysis
Syntax Analysis: Parsing
Parsing is the process of determining whether a string of tokens can be
generated by a grammar.
Most parsing methods fall into one of two classes, called the top-down
and bottom-up methods.
In top-down parsing, construction starts at the root and proceeds to the
leaves. In bottom-up parsing, construction starts at the leaves and
proceeds towards the root.
Efficient top-down parsers are easy to build by hand.
Bottom-up parsing, however, can handle a larger class of grammars.
They are not as easy to build, but tools for generating them directly
from a grammar are available.
Syntax Analysis: Context-free Grammars
Context-free grammar (CFG)
Language is a set of strings; each string is a finite sequence of symbols taken from a finite alphabet
For parsing
the strings are source programs
the symbols are lexical tokens
the alphabet is the set of token-types returned by the lexical analyzer
A grammar has a set of productions of the form
𝑠𝑦𝑚𝑏𝑜𝑙 → 𝑠𝑦𝑚𝑏𝑜𝑙 𝑠𝑦𝑚𝑏𝑜𝑙 … 𝑠𝑦𝑚𝑏𝑜𝑙
Where
there are zero or more symbols on the right-hand side
Each symbol is either terminal, meaning that it is a token from the alphabet of strings in the language, or nonterminal, meaning that it appears on the left-hand side of some production.
Syntax Analysis: Derivations
Productions are treated as rewriting rules to generate a string
We can perform a derivation to show that a certain sentence is in the
language of the grammar
Start with the start symbol, then repeatedly replace any nonterminal by one of its
right-hand sides
Rightmost and leftmost derivations
A rightmost/leftmost derivation is one in which the rightmost/leftmost nonterminal symbol
is always the one expanded
𝐸 → 𝐸 + 𝐸 | 𝐸 ∗ 𝐸 | − 𝐸 | (𝐸) | 𝑖𝑑
Derivations for – (𝑖𝑑 + 𝑖𝑑)
𝐸 => −𝐸 => −(𝐸) => −(𝐸 + 𝐸) => −(𝑖𝑑 + 𝐸) => −(𝑖𝑑 + 𝑖𝑑)
Syntax Analysis: Derivations: Example
Grammar 1
𝐸 → 𝑖𝑑
𝐸 → 𝑛𝑢𝑚
𝐸 → 𝐸 ∗ 𝐸
𝐸 → 𝐸/𝐸
𝐸 → 𝐸 + 𝐸
𝐸 → 𝐸 − 𝐸
𝐸 → (𝐸)
Derivation for 1-2-3
𝐸 => 𝐸 − 𝐸
  => 𝐸 − 3
  => 𝐸 − 𝐸 − 3
  => 𝐸 − 2 − 3
  => 1 − 2 − 3
Parse tree
Syntax Analysis: Ambiguous Grammars
A grammar is ambiguous if it can derive a sentence with different parse trees.
Grammar 1 is ambiguous
Parse trees for the sentence 1-2-3
(1 − 2) − 3 = −4 versus 1 − (2 − 3) = 2
Syntax Analysis: Ambiguous Grammars
Similarly
(1 + 2) ∗ 3 versus 1 + (2 ∗ 3)
Syntax Analysis: Elimination of Ambiguity
Ambiguous grammars are problematic for compiling
Unambiguous grammars preferred
Often ambiguous grammars can be transformed into unambiguous
grammars.
Considering the previous example
∗ has higher precedence than +
each operator associates to the left, so that we get (1 − 2) − 3 instead of 1 − (2 − 3)
Syntax Analysis: Elimination of Ambiguity: Example
Grammar 2
𝐸 → 𝐸 + 𝑇
𝐸 → 𝐸 − 𝑇
𝐸 → 𝑇
𝑇 → 𝑇 ∗ 𝐹
𝑇 → 𝑇/𝐹
𝑇 → 𝐹
𝐹 → 𝑖𝑑
𝐹 → 𝑛𝑢𝑚
𝐹 → (𝐸)
Derivation for 1-2-3
𝐸 => 𝐸 − 𝑇
  => 𝐸 − 𝐹
  => 𝐸 − 3
  => 𝐸 − 𝑇 − 3
  => 𝐸 − 𝐹 − 3
  => 𝐸 − 2 − 3
  => 𝑇 − 2 − 3
  => 𝐹 − 2 − 3
  => 1 − 2 − 3
• Same set of sentences as the
ambiguous grammar
• Each sentence has exactly one
parse tree
• The symbols 𝐸, 𝑇 , and 𝐹 stand for
expression, term, and factor
• factors are things you multiply
• terms are things you add
Top-Down Parsing
A top-down parser tries to create a parse tree from the root towards the leaves, scanning the input from left to right
It finds a leftmost derivation for the input string
Example:
S → cAd Input: cad
A → ab | a
    S                  S                            S
  / | \              / | \     Backtrack when     / | \
 c  A  d            c  A  d    we choose the     c  A  d
                      / \      wrong rule           |
                     a   b                          a
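The backtracking above can be sketched as a tiny parser for S → cAd, A → ab | a: the alternative "ab" is tried first, and on failure the parser backs up and tries "a" at the same position.

```python
# Backtracking top-down recognizer for S → c A d, A → a b | a.
def parse_S(s):
    if not s.startswith("c"):
        return False
    for alt in ("ab", "a"):                       # A → a b, then backtrack to A → a
        if s[1:1 + len(alt)] == alt and s[1 + len(alt):] == "d":
            return True
    return False
```

For the input "cad", matching "ab" fails, the parser backs up to position 1, matches "a", and then finds the closing "d".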
Recursive Descent Parsing
Top-down parser
Each production corresponds to one recursive procedure
Each procedure recognizes an instance of a non-terminal
returns tree fragment for the non-terminal
Example:
S → if E then S else S
S → begin S L
S → print E
L → end
L → S L
E → num = num
One function for each nonterminal
One clause for each production
Recursive Descent Parsing: Example
S → if E then S else S
S → begin S L
S → print E
L → end
L → S L
E → num = num
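A recursive-descent parser for the grammar above can be sketched as follows: one method per nonterminal, one branch per production, dispatching on the next token. Tokens are plain strings here, the class name is mine, and "num" stands for any numeric-literal token.

```python
# Recursive-descent recognizer: S, L, E mirror the grammar's nonterminals.
class Parser:
    def __init__(self, tokens):
        self.toks, self.i = tokens, 0

    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else "EOF"

    def eat(self, expected):
        assert self.peek() == expected, f"expected {expected}, got {self.peek()}"
        self.i += 1

    def S(self):
        if self.peek() == "if":            # S → if E then S else S
            self.eat("if"); self.E(); self.eat("then")
            self.S(); self.eat("else"); self.S()
        elif self.peek() == "begin":       # S → begin S L
            self.eat("begin"); self.S(); self.L()
        else:                              # S → print E
            self.eat("print"); self.E()

    def L(self):
        if self.peek() == "end":           # L → end
            self.eat("end")
        else:                              # L → S L
            self.S(); self.L()

    def E(self):                           # E → num = num
        self.eat("num"); self.eat("="); self.eat("num")
```

For example, Parser("begin print num = num end".split()).S() consumes the whole statement without error.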
Bottom-up Parsing
Constructs parse tree for an input string beginning
at the leaves (the bottom) and working towards
the root (the top)
Example: id*id
𝐸 → 𝐸 + 𝑇 | 𝑇
𝑇 → 𝑇 ∗ 𝐹 | 𝐹
𝐹 → (𝐸) | 𝑖𝑑
[Figure: bottom-up construction of the parse tree for 𝑖𝑑 ∗ 𝑖𝑑: a leaf F is first built over the leftmost 𝑖𝑑 and reduced to T; after ∗ and the second 𝑖𝑑 (reduced to F), T ∗ F is reduced to T, and finally T to E at the root.]
Shift-reduce parser
Bottom-up parser
The general idea is to shift some symbols of input to the stack until a reduction can be applied
At each reduction step, a specific substring matching the body of a production is replaced by the nonterminal at the head of the production
The key decisions during bottom-up parsing are about when to reduce and about what production to apply
A reduction is the reverse of a step in a derivation
The goal of a bottom-up parser is to construct a derivation in reverse:
𝐸 => 𝑇 => 𝑇 ∗ 𝐹 => 𝑇 ∗ 𝑖𝑑 => 𝐹 ∗ 𝑖𝑑 => 𝑖𝑑 ∗ 𝑖𝑑
Shift-reduce Parser: Handle Pruning
A Handle is a substring that matches the body of a
production and whose reduction represents one step
along the reverse of a rightmost derivation
Right sentential form   Handle   Reducing production
id*id                   id       F → id
F*id                    F        T → F
T*id                    id       F → id
T*F                     T*F      T → T*F
T                       T        E → T

𝐸 → 𝐸 + 𝑇 | 𝑇    𝑇 → 𝑇 ∗ 𝐹 | 𝐹    𝐹 → (𝐸) | 𝑖𝑑
Shift-reduce Parser: Handle Pruning
Basic operations:
Shift
Reduce
Accept
Error
Example: 𝑖𝑑 ∗ 𝑖𝑑
Stack    Input    Action
$        id*id$   shift
$id      *id$     reduce by F → id
$F       *id$     reduce by T → F
$T       *id$     shift
$T*      id$      shift
$T*id    $        reduce by F → id
$T*F     $        reduce by T → T*F
$T       $        reduce by E → T
$E       $        accept
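The loop traced above can be sketched for this grammar. A real shift-reduce parser drives the shift/reduce decision from LR tables; here a single hard-coded lookahead check (never reduce to E in front of ∗, since ∗ binds tighter) is enough for this tiny grammar, and the code is a sketch under that assumption.

```python
# Shift-reduce recognizer for E → E + T | T, T → T * F | F, F → ( E ) | id.
PRODUCTIONS = [
    ("E", ["E", "+", "T"]),
    ("T", ["T", "*", "F"]),
    ("F", ["(", "E", ")"]),
    ("F", ["id"]),
    ("T", ["F"]),
    ("E", ["T"]),
]

def parse(tokens):
    stack, i = [], 0
    toks = tokens + ["$"]
    while not (stack == ["E"] and toks[i] == "$"):       # accept condition
        for head, body in PRODUCTIONS:
            if stack[-len(body):] == body:
                if head == "E" and toks[i] == "*":
                    continue                             # lookahead: shift instead
                stack[-len(body):] = [head]              # reduce by head → body
                break
        else:
            if toks[i] == "$":
                raise SyntaxError("reject")              # no shift, no reduce: error
            stack.append(toks[i]); i += 1                # shift
    return True
```

Running parse("id * id".split()) reproduces the trace above step for step.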
Semantic Analysis
Role of Semantic Analysis
The principal job of the semantic analyzer is to enforce static semantic rules
constructs a syntax tree (usually first)
information gathered is needed by the code generator
Considerable variety in the extent to which parsing, semantic analysis, and intermediate code generation are interleaved
A common approach interleaves construction of a syntax tree with parsing, and then follows with separate, sequential phases for semantic analysis and code generation
Semantic Analysis: Attribute Grammars
Context-Free Grammars (CFGs) are used to specify the syntax of programming languages
E.g. arithmetic expressions
How do we tie these rules to mathematical concepts?
Attribute grammars are annotated CFGs in which annotations are used to establish meaning relationships among symbols
Provide a formal framework for decorating such a tree
Both semantic analysis and (intermediate) code generation can be described in terms of annotation, or "decoration" of a parse/syntax tree
Semantic Analysis: Attribute Grammars: an Example
Each grammar symbol has a set of attributes
E.g. the value of E1 is the attribute E1.val
Each grammar rule has a set of rules over the symbol attributes
Semantic function rules, e.g. sum, quot
Copy rules, e.g. E.val := T.val
1. 𝐸 → 𝐸 + 𝑇
2. 𝐸 → 𝐸 − 𝑇
3. 𝐸 → 𝑇
4. 𝑇 → 𝑇 ∗ 𝐹
5. 𝑇 → 𝑇/𝐹
6. 𝑇 → 𝐹
7. 𝐹 → 𝑖𝑑
8. 𝐹 → 𝑛𝑢𝑚
9. 𝐹 → (𝐸)
1. 𝐸1 → 𝐸2 + 𝑇 𝐸1. 𝑣𝑎𝑙 ≔ 𝑠𝑢𝑚(𝐸2. 𝑣𝑎𝑙, 𝑇. 𝑣𝑎𝑙)
2. 𝐸1 → 𝐸2 − 𝑇 𝐸1. 𝑣𝑎𝑙 ≔ 𝑑𝑖𝑓𝑓(𝐸2. 𝑣𝑎𝑙, 𝑇. 𝑣𝑎𝑙)
3. 𝐸 → 𝑇 𝐸. 𝑣𝑎𝑙 ≔ 𝑇. 𝑣𝑎𝑙
4. 𝑇1 → 𝑇2 ∗ 𝐹 𝑇1. 𝑣𝑎𝑙 ≔ 𝑝𝑟𝑜𝑑(𝑇2. 𝑣𝑎𝑙, 𝐹. 𝑣𝑎𝑙)
5. 𝑇1 → 𝑇2/𝐹 𝑇1. 𝑣𝑎𝑙 ≔ 𝑞𝑢𝑜𝑡(𝑇2. 𝑣𝑎𝑙, 𝐹. 𝑣𝑎𝑙)
6. 𝑇 → 𝐹 𝑇. 𝑣𝑎𝑙 ≔ 𝐹. 𝑣𝑎𝑙
7. 𝐹 → 𝑖𝑑 𝐹. 𝑣𝑎𝑙 ≔ 𝑖𝑑. 𝑣𝑎𝑙
8. 𝐹 → 𝑛𝑢𝑚 𝐹. 𝑣𝑎𝑙 ≔ 𝑛𝑢𝑚. 𝑣𝑎𝑙
9. 𝐹 → (𝐸) 𝐹. 𝑣𝑎𝑙 ≔ 𝐸. 𝑣𝑎𝑙
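The semantic-function rules above can be evaluated bottom-up over a syntax tree. Copy rules (E → T, T → F, F → (E)) just pass values along, so they need no node of their own; the tuple encoding of trees is an illustrative choice, not from the slides.

```python
# Evaluate the val attribute of a syntax-tree node, per the rules above.
def val(node):
    op, *kids = node
    if op == "num":
        return kids[0]                        # F → num : F.val := num.val
    if op == "+":
        return val(kids[0]) + val(kids[1])    # sum(E2.val, T.val)
    if op == "-":
        return val(kids[0]) - val(kids[1])    # diff
    if op == "*":
        return val(kids[0]) * val(kids[1])    # prod
    if op == "/":
        return val(kids[0]) / val(kids[1])    # quot
    raise ValueError(f"unknown operator {op!r}")

# Syntax tree for (1 + 3) * 2: parentheses leave no trace in the tree.
tree = ("*", ("+", ("num", 1), ("num", 3)), ("num", 2))
```

Decorating this tree yields val(tree) == 8, the value that a later slide shows in the val attribute of the root.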
Semantic Analysis: Attribute Grammars
The attribute grammar serves to define the semantics of
the input program
Attribute rules are best thought of as definitions, not
assignments
They are not necessarily meant to be evaluated at any
particular time, or in any particular order, though they do
define their left-hand side in terms of the right-hand side
Semantic Analysis: Evaluating Attributes
The process of evaluating attributes is called annotation,
or decoration, of the parse tree
When a parse tree under this grammar is fully decorated,
the value of the expression will be in the val attribute of the
root
The code fragments for the rules are called semantic
functions (they should be cast as functions)
e.g. 𝐸1. 𝑣𝑎𝑙 ≔ 𝑠𝑢𝑚(𝐸2. 𝑣𝑎𝑙, 𝑇. 𝑣𝑎𝑙)
Semantic Analysis: Evaluating Attributes
The figure shows the result of annotating the parse tree for (1 + 3) ∗ 2
Each symbol has at most one attribute, shown in the corresponding box
Numerical value in this example
Punctuation marks have no attributes
Operator symbols have no value
Arrows represent attribute flow
A bottom-up approach:
a) The values of the constants 1 and 3 have been placed in new syntax tree leaves
b) The pointers to these leaves become child pointers of a new internal + node
c) The pointer to this node propagates up into the attributes of 𝑇, and a new leaf is created for 2
d) The pointers for 𝑇 and 𝐹 become child pointers of a new internal ∗ node, and a pointer to this node propagates up into the attributes of 𝐸
Semantic Analysis: Construction of the Syntax Tree