Lexical and syntax analysis
LEXICAL AND SYNTAX ANALYSIS
CSci210.BA4
Chapter 4 Topics
- Introduction
- Lexical and Syntax Analysis
- The Parsing Problem
- Recursive-Descent Parsing
- Bottom-Up Parsing
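Introduction
- Syntax analyzers are almost always based on a formal description of the syntax of the source language (a grammar)
- Almost all compilers separate syntax analysis into two parts:
  - Lexical analysis: low-level (small-scale constructs such as names and numeric literals)
  - Syntax analysis: high-level (large-scale constructs such as expressions and statements)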
Reasons to Separate Syntax and Lexical Analysis
- Simplicity: lexical analysis needs less complex techniques than syntax analysis, so separating it keeps both the lexical analyzer and the parser simpler
- Efficiency: separation allows the lexical analyzer to be optimized selectively
- Portability: the lexical analyzer is somewhat platform dependent, whereas the syntax analyzer is more platform independent
Lexical Analysis
- A pattern matcher for character strings
- Performs syntax analysis at the lowest level of the program structure
- Extracts lexemes from a given input string and produces the corresponding tokens
Lexical Analysis (continued)
result = oldsum - value / 100;

Token       Lexeme
IDENT       result
ASSIGN_OP   =
IDENT       oldsum
SUB_OP      -
IDENT       value
DIV_OP      /
INT_LIT     100
SEMICOLON   ;
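The token/lexeme pairing above can be sketched as a small hand-written scanner. This is only an illustration, not the course's lexer: the token names come from the slide, while `Token`, `TokenType`, `next_token`, `UNKNOWN`, `END_OF_INPUT`, and the 32-character lexeme limit are assumptions made for the example.

```c
#include <ctype.h>
#include <stddef.h>

/* Token codes: the first six are the names used on the slide;
   UNKNOWN and END_OF_INPUT are additions for this sketch. */
typedef enum {
    IDENT, INT_LIT, ASSIGN_OP, SUB_OP, DIV_OP, SEMICOLON,
    UNKNOWN, END_OF_INPUT
} TokenType;

typedef struct {
    TokenType type;
    char lexeme[32];   /* longer lexemes are truncated in this sketch */
} Token;

/* Scan one token starting at *src and advance *src past it. */
Token next_token(const char **src) {
    Token tok;
    const char *p = *src;
    size_t n = 0;

    while (isspace((unsigned char)*p)) p++;          /* skip white space */

    if (*p == '\0') {
        tok.type = END_OF_INPUT;
    } else if (isalpha((unsigned char)*p)) {         /* letter {letter|digit} */
        tok.type = IDENT;
        while (isalnum((unsigned char)*p)) {
            if (n < sizeof tok.lexeme - 1) tok.lexeme[n++] = *p;
            p++;
        }
    } else if (isdigit((unsigned char)*p)) {         /* digit {digit} */
        tok.type = INT_LIT;
        while (isdigit((unsigned char)*p)) {
            if (n < sizeof tok.lexeme - 1) tok.lexeme[n++] = *p;
            p++;
        }
    } else {                                         /* single-character tokens */
        switch (*p) {
        case '=': tok.type = ASSIGN_OP; break;
        case '-': tok.type = SUB_OP;    break;
        case '/': tok.type = DIV_OP;    break;
        case ';': tok.type = SEMICOLON; break;
        default:  tok.type = UNKNOWN;   break;
        }
        tok.lexeme[n++] = *p++;
    }
    tok.lexeme[n] = '\0';
    *src = p;
    return tok;
}
```

Calling `next_token` repeatedly on the statement from the slide yields the eight tokens in the table in order, followed by `END_OF_INPUT`.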
Building a Lexical Analyzer
Three approaches:
- Write a formal description of the tokens and use a software tool that constructs a lexical analyzer from that description
- Design a state transition diagram that describes the tokens and write a program that implements the diagram
- Design a state transition diagram that describes the tokens and hand-construct a table-driven implementation of the diagram
State (Transition) Diagram Design
- A directed graph with nodes labeled with state names and arcs labeled with input characters
- A diagram with states and transitions for each and every token pattern would be too large and complex
- Transitions can be combined to simplify the diagram (for example, one arc for any letter instead of one arc per letter)
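A table-driven version of a small state diagram might look like the sketch below. It recognizes only identifiers and integer literals, and it combines transitions by mapping every character to one of three classes first. The state names, the `CharClass` type, and the `run_diagram` function are hypothetical names chosen for this example.

```c
#include <ctype.h>

/* Character classes: one arc per class instead of one arc per character. */
typedef enum { LETTER, DIGIT, OTHER, NUM_CLASSES } CharClass;

/* States of the diagram; REJECT is a dead state. */
typedef enum { START, IN_IDENT, IN_INT, REJECT, NUM_STATES } State;

CharClass char_class(int c) {
    if (isalpha(c)) return LETTER;
    if (isdigit(c)) return DIGIT;
    return OTHER;
}

/* transition[state][class] gives the next state. */
static const State transition[NUM_STATES][NUM_CLASSES] = {
    /*               LETTER    DIGIT     OTHER  */
    /* START    */ { IN_IDENT, IN_INT,   REJECT },
    /* IN_IDENT */ { IN_IDENT, IN_IDENT, REJECT },
    /* IN_INT   */ { REJECT,   IN_INT,   REJECT },
    /* REJECT   */ { REJECT,   REJECT,   REJECT }
};

/* Run the diagram over a whole string and return the final state:
   IN_IDENT for an identifier, IN_INT for an integer literal. */
State run_diagram(const char *s) {
    State st = START;
    for (; *s != '\0'; s++) {
        st = transition[st][char_class((unsigned char)*s)];
    }
    return st;
}
```

With this table, `run_diagram("oldsum")` ends in `IN_IDENT`, `run_diagram("100")` ends in `IN_INT`, and `run_diagram("1x")` ends in `REJECT`. Adding a token pattern means adding rows and columns to the table rather than new control flow.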
The Parsing Problem
Two goals of syntax analysis:
- Check the input program for syntax errors; when one is found, produce a diagnostic message and recover
- Produce the parse tree, or at least a trace of the parse tree, for the program
Two classes of parsers:
- Top-down
- Bottom-up
Top-Down Parsers
- Trace or build the parse tree in preorder, which corresponds to a leftmost derivation
- The most common top-down parsing algorithms:
  - Recursive descent
  - LL parsers
Bottom-Up Parsers
- Produce the parse tree by beginning at the leaves and progressing toward the root
- The most common bottom-up parsers are in the LR family
Complexity of Parsing
- Parsing algorithms that work for any unambiguous grammar are complex and inefficient: O(n^3), where n is the length of the input
- Compilers use parsers that work for only a subset of all unambiguous grammars, but do so in linear time: O(n)
Recursive-Descent Parsing
- A top-down parser
- EBNF is ideal as the basis of a recursive-descent parser
- Each nonterminal maps to a function
- For a nonterminal with more than one RHS, look at the next token to determine which RHS to choose
- If no RHS matches the next token, it is a syntax error
Recursive-Descent Parsing
Grammar for an expression:
  <expr> → <term> {+ <term>}
  <term> → <factor> {* <factor>}
  <factor> → id | int_constant | ( <expr> )
How do we parse? Expression: 1 + 2
  <expr> => <term> + <term>
         => <factor> + <term>
         => 1 + <term>
         => 1 + <factor>
         => 1 + 2
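Recursive-Descent Parsing
Grammar for an expression:
  <expr> → <term> {+ <term>}
  <term> → <factor> {* <factor>}
  <factor> → id | int_constant | ( <expr> )
What does the code look like?
  /* <expr> → <term> {+ <term>} */
  void expr() {
      term();                         /* parse the first <term> */
      while (nextToken == ADD_OP) {   /* repeat for each "+ <term>" */
          lex();                      /* consume the + */
          term();
      }
  }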
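To see how `expr()` fits with the other two rules, here is a minimal sketch of a complete recognizer for the grammar. The names `expr`, `term`, `lex`, `nextToken`, and `ADD_OP` follow the slide; `factor`, `parse`, the other token codes, the `errors` counter, and the tiny scanner inside `lex` are assumptions added so that the example stands alone.

```c
#include <ctype.h>

/* Token codes; only ADD_OP appears on the slide, the rest are assumed. */
enum { ID, INT_CONSTANT, ADD_OP, MULT_OP, LEFT_PAREN, RIGHT_PAREN, END, BAD };

static const char *input;   /* remaining source text */
static int nextToken;       /* one-token lookahead */
static int errors;          /* number of syntax errors seen */

/* Read the next token from input into nextToken. */
void lex(void) {
    while (isspace((unsigned char)*input)) input++;
    if (*input == '\0') {
        nextToken = END;
    } else if (isalpha((unsigned char)*input)) {
        while (isalnum((unsigned char)*input)) input++;
        nextToken = ID;
    } else if (isdigit((unsigned char)*input)) {
        while (isdigit((unsigned char)*input)) input++;
        nextToken = INT_CONSTANT;
    } else {
        switch (*input++) {
        case '+': nextToken = ADD_OP;      break;
        case '*': nextToken = MULT_OP;     break;
        case '(': nextToken = LEFT_PAREN;  break;
        case ')': nextToken = RIGHT_PAREN; break;
        default:  nextToken = BAD;         break;
        }
    }
}

void expr(void);

/* <factor> -> id | int_constant | ( <expr> ) */
void factor(void) {
    if (nextToken == ID || nextToken == INT_CONSTANT) {
        lex();
    } else if (nextToken == LEFT_PAREN) {
        lex();
        expr();
        if (nextToken == RIGHT_PAREN) lex();
        else errors++;                 /* missing ')' */
    } else {
        errors++;                      /* no RHS matches the next token */
    }
}

/* <term> -> <factor> {* <factor>} */
void term(void) {
    factor();
    while (nextToken == MULT_OP) {
        lex();
        factor();
    }
}

/* <expr> -> <term> {+ <term>} */
void expr(void) {
    term();
    while (nextToken == ADD_OP) {
        lex();
        term();
    }
}

/* Return 1 if text is a complete, error-free <expr>, otherwise 0. */
int parse(const char *text) {
    input = text;
    errors = 0;
    lex();
    expr();
    return errors == 0 && nextToken == END;
}
```

Each nonterminal gets one function, and each `{ ... }` in the EBNF becomes a `while` loop driven by the lookahead token. Under these assumptions `parse("1 + 2")` and `parse("(a + 3) * b")` succeed, while `parse("1 +")` and `parse("(1 + 2")` report failure.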
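Recursive-Descent Parsing
The Left Recursion Problem (the LL Grammar Class)
- A left-recursive rule such as <expr> → <expr> + <term> makes expr() call itself before consuming any input, so the expansion never ends:
  <expr> => <expr> + <term>
         => <expr> + <term> + <term>
         => <expr> + <term> + <term> + <term>
         => ...
How do we fix it? Modify the grammar to remove left recursion
  Before: <expr> → <expr> + <term> | <term>
  After:  <expr> → <term> {+ <term>}
          <term> → id | int_constant | ( <expr> )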
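For reference, the standard textbook transformation for direct left recursion can be stated in general form. It is not spelled out on the slide; the primed nonterminal and the symbols α and β are notation introduced here, and the application to <expr> assumes the full rule also has the non-recursive alternative <term>.

```latex
% Direct left recursion: \alpha and \beta are strings of grammar symbols,
% and \beta does not begin with A.
A \rightarrow A\,\alpha \mid \beta
\qquad\Longrightarrow\qquad
\begin{aligned}
A  &\rightarrow \beta\,A' \\
A' &\rightarrow \alpha\,A' \mid \varepsilon
\end{aligned}

% Applied with A = <expr>, \alpha = + <term>, \beta = <term>:
\begin{aligned}
\langle expr \rangle  &\rightarrow \langle term \rangle\,\langle expr' \rangle \\
\langle expr' \rangle &\rightarrow +\,\langle term \rangle\,\langle expr' \rangle \mid \varepsilon
\end{aligned}
```

The EBNF form <expr> → <term> {+ <term>} describes the same language; the braces play the role of the right-recursive helper rule.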
Recursive-Descent Parsing
The Pairwise Disjointness Problem
- If the grammar is not pairwise disjoint, how do you know which RHS to pick based on the next token?
  <variable> → identifier | identifier [ <expr> ]
- Both RHSs begin with identifier, so one token of lookahead cannot choose between them
How do we fix it? Left factoring
  <variable> → identifier <new>
  <new> → ε | [ <expr> ]
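After left factoring, a single token of lookahead is enough to pick the right alternative, as the following sketch tries to show. To keep it short it works on an already-tokenized array; the token codes, the function names `variable` and `new_part`, and the one-token stand-in for <expr> are all assumptions for illustration.

```c
/* Hypothetical token codes for a pre-tokenized input ending in END. */
enum { IDENTIFIER, INT_CONSTANT, LEFT_BRACKET, RIGHT_BRACKET, END };

static const int *tok;   /* points at the next unread token */

/* Stand-in for <expr>: accepts a single identifier or integer constant.
   A real parser would call its full expression routine here. */
static int expr(void) {
    if (*tok == IDENTIFIER || *tok == INT_CONSTANT) {
        tok++;
        return 1;
    }
    return 0;
}

/* <new> -> ε | [ <expr> ]
   The next token decides: '[' selects the second RHS, anything else ε. */
static int new_part(void) {
    if (*tok != LEFT_BRACKET) return 1;    /* ε alternative: consume nothing */
    tok++;                                 /* consume '[' */
    if (!expr()) return 0;
    if (*tok != RIGHT_BRACKET) return 0;   /* missing ']' */
    tok++;                                 /* consume ']' */
    return 1;
}

/* <variable> -> identifier <new>
   Returns 1 if tokens form exactly one <variable>, otherwise 0. */
int variable(const int *tokens) {
    tok = tokens;
    if (*tok != IDENTIFIER) return 0;
    tok++;
    return new_part() && *tok == END;
}
```

The common prefix `identifier` is consumed once in `variable`, and the decision between the two original alternatives is postponed to `new_part`, where the lookahead token is unambiguous.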
Bottom-Up Parsing
- Parsing is based on reduction
- It is the reverse of a rightmost derivation
- At each step, find the correct RHS that reduces to the previous step in the derivation
Example
  Grammar:            Input: ab
  <S> → <A>b          Step 1: <A>b
  <A> → a             Step 2: <S>
  <A> → b
Bottom-Up Parsing
Most bottom-up parsers are shift-reduce algorithms
- Shift: move the next token onto the stack
- Reduce: replace the RHS on top of the stack with its LHS
Bottom-Up Parsing
Handles
- Def: β is the handle of the right sentential form γ = αβw if and only if S =>*rm αAw =>rm αβw
- The handle of a right sentential form is its leftmost simple phrase
- Bottom-up parsing is essentially looking for handles and replacing them with their LHS
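As a worked instance, take the small example grammar from the earlier bottom-up slide (<S> → <A>b, <A> → a, <A> → b) and the input ab. The sketch below uses the usual statement of the definition, in which β is the handle of the right sentential form γ = αβw; the particular choices of α, β, and w are worked out here and are not from the slides.

```latex
% Rightmost derivation of the input ab
S \Rightarrow_{rm} A\,b \Rightarrow_{rm} a\,b

% Right sentential form  ab :
%   \alpha = \varepsilon,\quad \beta = a,\quad w = b   (rule A -> a)
S \Rightarrow^{*}_{rm} \alpha A w = A\,b
  \Rightarrow_{rm} \alpha \beta w = a\,b

% Right sentential form  Ab :
%   \alpha = \varepsilon,\quad \beta = A\,b,\quad w = \varepsilon   (rule S -> A b)
S \Rightarrow^{*}_{rm} \alpha S w = S
  \Rightarrow_{rm} \alpha \beta w = A\,b
```

So the handle of ab is a, and reducing it gives <A>b; the handle of <A>b is the whole string, and reducing it gives <S>. These are the two steps shown in the example, read in the reverse order of the derivation.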
Bottom-Up Parsing
Advantages of Shift-Reduce (LR) Parsers
- They can be built for all programming languages
- They can detect syntax errors as soon as possible in a left-to-right scan
- The LR class of grammars is a proper superset of the class parsable by LL parsers (for example, many left-recursive grammars are LR, but none are LL)
Bottom-Up Parsing
Shift-Reduce Algorithms
- Input sequence: the input to be parsed
- Parse stack: input is shifted onto the parse stack
- ACTION table: specifies what the parser does
- GOTO table: holds the state symbols to be pushed onto the stack when a reduction is completed
Bottom-Up Parsing
ACTION Table (or Parse Table)
- Rows = state symbols
- Columns = terminal symbols
- Values:
  - Shift: push the token onto the stack
  - Reduce: replace the handle with its LHS
  - Accept: the stack holds only the start symbol and the input is empty
  - Error: the original input is invalid
Bottom-Up Parsing
GOTO Table (or Parse Table)
- Rows = state symbols
- Columns = nonterminal symbols
- Values indicate which state symbol should be pushed onto the parse stack after a reduction has been completed
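The pieces above (input, parse stack, ACTION table, GOTO table) can be put together in a small table-driven driver. The sketch below is for the example grammar <S> → <A>b, <A> → a, <A> → b. The state numbers and table entries were built by hand for this illustration (an SLR(1)-style construction) and do not come from the slides; the stack holds state numbers only, and all type and function names are hypothetical.

```c
enum { T_A, T_B, T_END, NUM_TERMINALS };      /* terminals a, b, end of input */
enum { N_S, N_A, NUM_NONTERMINALS };          /* nonterminals <S>, <A> */
enum { ERR, SHIFT, REDUCE, ACCEPT };          /* kinds of ACTION entries */
enum { NUM_STATES = 6, STACK_MAX = 64 };

/* arg is the target state for SHIFT and the rule number for REDUCE. */
typedef struct { int kind; int arg; } Action;
typedef struct { int lhs; int rhs_len; } Rule;

static const Rule rules[] = {
    { N_S, 2 },   /* rule 0: <S> -> <A> b */
    { N_A, 1 },   /* rule 1: <A> -> a     */
    { N_A, 1 }    /* rule 2: <A> -> b     */
};

/* ACTION table: rows are states, columns are terminals a, b, end. */
static const Action action[NUM_STATES][NUM_TERMINALS] = {
    /* 0 */ { {SHIFT, 3}, {SHIFT, 4},  {ERR, 0}    },
    /* 1 */ { {ERR, 0},   {ERR, 0},    {ACCEPT, 0} },
    /* 2 */ { {ERR, 0},   {SHIFT, 5},  {ERR, 0}    },
    /* 3 */ { {ERR, 0},   {REDUCE, 1}, {ERR, 0}    },
    /* 4 */ { {ERR, 0},   {REDUCE, 2}, {ERR, 0}    },
    /* 5 */ { {ERR, 0},   {ERR, 0},    {REDUCE, 0} }
};

/* GOTO table: rows are states, columns are <S>, <A>; -1 means no entry. */
static const int go_to[NUM_STATES][NUM_NONTERMINALS] = {
    { 1, 2 }, { -1, -1 }, { -1, -1 }, { -1, -1 }, { -1, -1 }, { -1, -1 }
};

/* Return 1 if s (a string over the characters 'a' and 'b') is accepted. */
int lr_parse(const char *s) {
    int stack[STACK_MAX];
    int top = 0;
    stack[0] = 0;                                  /* start in state 0 */

    for (;;) {
        int t = (*s == 'a') ? T_A : (*s == 'b') ? T_B
              : (*s == '\0') ? T_END : -1;
        if (t < 0) return 0;                       /* not a token of this grammar */

        Action act = action[stack[top]][t];
        switch (act.kind) {
        case SHIFT:
            if (top + 1 >= STACK_MAX) return 0;
            stack[++top] = act.arg;                /* push the new state */
            s++;                                   /* consume the token */
            break;
        case REDUCE: {
            const Rule *r = &rules[act.arg];
            top -= r->rhs_len;                     /* pop the handle */
            int next = go_to[stack[top]][r->lhs];  /* consult GOTO */
            if (next < 0) return 0;
            stack[++top] = next;
            break;
        }
        case ACCEPT:
            return 1;
        default:
            return 0;                              /* ERR entry */
        }
    }
}
```

For the input ab the driver shifts a, reduces it to <A>, shifts b, reduces <A>b to <S>, and then accepts, matching the two reduction steps of the earlier example. Inputs such as a, ba, or abb hit an error entry and are rejected.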