compiler

Lexical Analysis

Outline
Role of lexical analyzer
Specification of tokens
Recognition of tokens
Lexical analyzer generator
Finite automata
Design of lexical analyzer generator

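Compiler Construction
[Figure: the lexical analyzer reads characters from the source program ("read char", "put back char") and, on each "get next" request, passes a token and its attribute value to the parser. Both use the symbol table (entries such as id). The entire program is read into memory.]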

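The role of lexical analyzer
[Figure: source program -> Lexical Analyzer -> token -> Parser -> to semantic analysis. The parser requests tokens with getNextToken; both the lexical analyzer and the parser use the symbol table.]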

Other tasks
Stripping comments and white space (blank, tab and newline characters) out of the source program.
Correlating error messages with the source program (e.g., the line number of an error).

Lexical analyzer
Scanning (simple operations)
Lexical analysis (complex)

Why separate lexical analysis and parsing?
Simplicity of design: white space and comment lines are eliminated before parsing.
Improving compiler efficiency: a large amount of time is spent reading the source program and generating tokens.
Enhancing compiler portability: the representation of special or non-standard symbols can be isolated in the lexical analyzer.

Tokens, Patterns and Lexemes
Pattern: a rule that describes a set of strings.
Token: a name for the set of strings matched by the same pattern.
Lexeme: the sequence of source characters that matches the pattern for a token.

Example

Token        Informal description                      Sample lexemes
if           Characters i, f                           if
else         Characters e, l, s, e                     else
comparison   < or > or <= or >= or == or !=
id           Letter followed by letters and digits
number       Any numeric constant
literal      Anything but ", surrounded by "s

Example: E = C1 * 10

Token   Attribute
ID      Index to symbol table entry E
=
ID      Index to symbol table entry C1
*
NUM     10
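The token/attribute pairs above can be sketched in C as follows. This is a minimal illustration, not the lecture's implementation: the names TokenKind, Token and next_token are hypothetical, and the lexeme text stands in for a real symbol-table index.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token kinds, just enough for the statement E = C1 * 10. */
typedef enum { TOK_ID, TOK_NUM, TOK_ASSIGN, TOK_MUL, TOK_EOF, TOK_ERROR } TokenKind;

typedef struct {
    TokenKind kind;
    char lexeme[32];   /* lexeme text (stands in for a symbol-table index) */
    long value;        /* numeric value, meaningful only for TOK_NUM */
} Token;

/* Scan one token starting at *src and advance *src past it. */
Token next_token(const char **src) {
    Token t = { TOK_EOF, "", 0 };
    const char *p;
    size_t n = 0;

    assert(src != NULL && *src != NULL);
    p = *src;
    while (*p == ' ' || *p == '\t' || *p == '\n') p++;   /* strip white space */

    if (*p == '\0') {
        t.kind = TOK_EOF;
    } else if (isalpha((unsigned char)*p)) {              /* letter (letter | digit)* */
        while (isalnum((unsigned char)*p) && n < sizeof t.lexeme - 1) t.lexeme[n++] = *p++;
        t.kind = TOK_ID;
    } else if (isdigit((unsigned char)*p)) {              /* digit+ */
        while (isdigit((unsigned char)*p) && n < sizeof t.lexeme - 1) t.lexeme[n++] = *p++;
        t.kind = TOK_NUM;
    } else {                                              /* single-character tokens */
        t.kind = (*p == '=') ? TOK_ASSIGN : (*p == '*') ? TOK_MUL : TOK_ERROR;
        t.lexeme[n++] = *p++;
    }
    t.lexeme[n] = '\0';
    if (t.kind == TOK_NUM) t.value = strtol(t.lexeme, NULL, 10);
    *src = p;
    return t;
}
```

Calling next_token repeatedly on a pointer to "E = C1 * 10" yields ID (E), =, ID (C1), *, NUM (10) and then the end-of-input token.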

Lexical Error and Recovery
Error detection
Error reporting
Error recovery
  Delete the current character and restart scanning at the next character
  Delete the first character read by the scanner and resume scanning at the character following it

Input buffering
Sometimes the lexical analyzer needs to look ahead some symbols to decide which token to return.
In C, we need to look at the character after -, = or < to decide what token to return.
We need a two-buffer scheme to handle large lookaheads safely.

[Figure: input buffer containing E = M * C * * 2 eof]

Specification of Tokens
Regular expressions are an important notation for specifying lexeme patterns. While they cannot express all possible patterns, they are very effective in specifying those types of patterns that we actually need for tokens.

Specification of tokens
In the theory of compilation, regular expressions are used to formalize the specification of tokens
Regular expressions are a means of specifying regular languages
Example: letter(letter | digit)*
Each regular expression is a pattern specifying the form of strings

Regular expressions
ε is a regular expression, L(ε) = {ε}
If a is a symbol in Σ then a is a regular expression, L(a) = {a}
(r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
(r)(s) is a regular expression denoting the language L(r)L(s)
(r)* is a regular expression denoting (L(r))*
  R* = R concatenated with itself 0 or more times = {ε} ∪ R ∪ RR ∪ RRR ∪ ...
(r) is a regular expression denoting L(r)

Extensions
One or more instances: (r)+
Zero or one instance: r?
Character classes: [abc]

Example:
letter_ -> [A-Za-z_]
digit -> [0-9]
id -> letter_(letter_ | digit)*

Regular definitions
d1 -> r1
d2 -> r2
...
dn -> rn

Example:
letter -> A | B | ... | Z | a | b | ... | z | _
digit -> 0 | 1 | ... | 9
id -> letter(letter | digit)*

Operations on Languages

Example: Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits {0, 1, ..., 9}. L and D are, respectively, the alphabet of uppercase and lowercase letters and the alphabet of digits. Other languages can be constructed from L and D using the operators illustrated above.

Operations on Languages
1. L U D is the set of letters and digits; strictly speaking, the language with 62 (52 + 10) strings of length one, each of which is either one letter or one digit.
2. LD is the set of 520 (52 × 10) strings of length two, each consisting of one letter followed by one digit. Ex: A1, a1, B0, etc.
3. L^4 is the set of all 4-letter strings. Ex: aaba, bcef
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L U D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
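The counts and membership rules in this list can be checked with a short C sketch. The function names are hypothetical and chosen only for this illustration; L and D are represented as character strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static const char LETTERS[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
static const char DIGITS[]  = "0123456789";

/* True if c is one of the characters of set (the terminator never counts). */
static int in_set(const char *set, char c) {
    return c != '\0' && strchr(set, c) != NULL;
}

/* |L U D|: L and D are disjoint, so the sizes add. */
size_t size_of_L_union_D(void) { return strlen(LETTERS) + strlen(DIGITS); }

/* |LD|: enumerate every letter-digit pair. */
size_t size_of_LD(void) {
    size_t count = 0, i, j;
    for (i = 0; LETTERS[i] != '\0'; i++)
        for (j = 0; DIGITS[j] != '\0'; j++)
            count++;                    /* the two-character string LETTERS[i] DIGITS[j] */
    return count;
}

/* Membership in L(L U D)*: a letter followed by letters and digits. */
int in_L_LD_star(const char *s) {
    assert(s != NULL);
    if (!in_set(LETTERS, *s)) return 0;
    for (s++; *s != '\0'; s++)
        if (!in_set(LETTERS, *s) && !in_set(DIGITS, *s)) return 0;
    return 1;
}

/* Membership in D+: one or more digits. */
int in_D_plus(const char *s) {
    assert(s != NULL);
    if (*s == '\0') return 0;
    for (; *s != '\0'; s++)
        if (!in_set(DIGITS, *s)) return 0;
    return 1;
}
```

Here size_of_L_union_D() gives 62 and size_of_LD() gives 520, matching items 1 and 2; in_L_LD_star and in_D_plus correspond to items 5 and 6.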

Terms for Parts of Strings

Specification of Tokens
Algebraic laws of regular expressions
1) r | s = s | r
2) r | (s | t) = (r | s) | t    r(st) = (rs)t
3) r(s | t) = rs | rt    (s | t)r = sr | tr
4) εr = rε = r
5) (r*)* = r*
6) r* = r+ | ε    r r* = r* r
7) (r | s)* = (r* | s*)* = (r* s*)*

Recognition of Tokens
Task of recognition of tokens in a lexical analyzer
Isolate the lexeme for the next token in the input buffer
Produce as output a pair consisting of the appropriate token and attribute value, such as <id, pointer to table entry>, using the translation table that follows

Regular expression   Token   Attribute-value
if                   if      -
id                   id      Pointer to table entry

stmt -> if expr then stmt | if expr then stmt else stmt | ε
expr -> term relop term | term
term -> id | number

Recognition of tokens (cont.)
The next step is to formalize the patterns:
digit -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id -> letter (letter | digit)*
if -> if
then -> then
else -> else
relop -> < | > | <= | >= | = | <>
We also need to handle whitespace:
ws -> (blank | tab | newline)+
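As an illustration of how one of these patterns turns into code, the following is a minimal hand-written matcher for the number pattern. The function names match_number and scan_digits are hypothetical; a generated lexer would use a DFA instead of this direct translation.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Length of the run of digits at the start of s (0 if there is none). */
static size_t scan_digits(const char *s) {
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

/* Length of the longest prefix of s that matches
   number -> digits (. digits)? (E [+-]? digits)?
   or 0 if no prefix matches. */
size_t match_number(const char *s) {
    size_t len, n, k;
    assert(s != NULL);
    len = scan_digits(s);
    if (len == 0) return 0;                 /* the leading digits are mandatory */
    if (s[len] == '.') {                    /* optional fraction */
        n = scan_digits(s + len + 1);
        if (n > 0) len += 1 + n;
    }
    if (s[len] == 'E') {                    /* optional exponent */
        k = len + 1;
        if (s[k] == '+' || s[k] == '-') k++;
        n = scan_digits(s + k);
        if (n > 0) len = k + n;             /* otherwise the E is not part of the number */
    }
    return len;
}
```

An optional part is only consumed when it is complete: for the input 12. the function matches just 12, which is the same effect a transition diagram achieves by retracting the input.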

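Example: RELOP = < | <= | = | <> | > | >=
[Transition diagram for relop: states 0 to 8 from start, with "other" edges into the retracting states. Actions: return(relop, LE), return(relop, NE), return(relop, LT), return(relop, EQ), return(relop, GE), return(relop, GT). # indicates input retraction.]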

Example 2: ID = letter(letter | digit)*
Transition diagram:
[Figure: start -> state 9; on a letter go to state 10; state 10 loops on letter or digit; on any other character go to state 11, which performs return(id). # indicates input retraction.]

Transition diagrams (cont.)
Transition diagram for whitespace

Code for the id transition diagram (states 9, 10, 11):
switch (state) {
case 9:
    c = nextchar();
    if (isletter(c)) state = 10; else state = failure();
    break;
case 10:
    c = nextchar();
    if (isletter(c) || isdigit(c)) state = 10; else state = 11;
    break;
case 11:
    retract(1); insert(id); return;
}

Architecture of a transition-diagram-based lexical analyzer
TOKEN getRelop()
{
    TOKEN retToken = new(RELOP);
    while (1) {   /* repeat character processing until a return or failure occurs */
        switch (state) {
        case 0:
            c = nextchar();
            if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else fail();   /* lexeme is not a relop */
            break;
        case 1: ...
        ...
        case 8:
            retract();
            retToken.attribute = GT;
            return(retToken);
        }
    }
}
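For reference, here is one way the whole relop diagram might be written out as self-contained C. It is a sketch under assumptions: the input is a plain string rather than the lecture's buffer with nextchar() and retract(), the state numbers follow the diagram (0 start, 1 after <, 6 after >, 4 and 8 retracting), and the names RelopAttr and get_relop are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { LT, LE, EQ, NE, GT, GE, NOT_RELOP } RelopAttr;

/* Recognize a relop at the start of s by walking the transition diagram.
   On return *len holds the lexeme length (0 if s does not start with a relop). */
RelopAttr get_relop(const char *s, size_t *len) {
    size_t pos = 0;          /* index of the next unread character */
    int state = 0;
    assert(s != NULL && len != NULL);
    for (;;) {
        char c;
        switch (state) {
        case 0:
            c = s[pos++];
            if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else { *len = 0; return NOT_RELOP; }   /* lexeme is not a relop */
            break;
        case 1:
            c = s[pos++];
            if (c == '=') state = 2;
            else if (c == '>') state = 3;
            else state = 4;                        /* "other" edge */
            break;
        case 2: *len = pos; return LE;
        case 3: *len = pos; return NE;
        case 4: pos--; *len = pos; return LT;      /* retract one character */
        case 5: *len = pos; return EQ;
        case 6:
            c = s[pos++];
            if (c == '=') state = 7;
            else state = 8;                        /* "other" edge */
            break;
        case 7: *len = pos; return GE;
        case 8: pos--; *len = pos; return GT;      /* retract one character */
        }
    }
}
```

States 4 and 8 are the ones marked with # in the diagram: they have read one character too many, so the position is moved back before the token is returned.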

Lexical Analyzer Generator - Lex
Lex source program (lex.l) -> Lex compiler -> lex.yy.c
lex.yy.c -> C compiler -> a.out
Input stream -> a.out -> Sequence of tokens

Structure of Lex programs
declarations
%%
translation rules
%%
auxiliary functions

Each translation rule has the form: Pattern {Action}

Example
%{
/* definitions of manifest constants
   LT, LE, EQ, NE, GT, GE,
   IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim     [ \t\n]
ws        {delim}+
letter    [A-Za-z]
digit     [0-9]
id        {letter}({letter}|{digit})*
number    {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%
{ws}      { /* no action and no return */ }
if        { return(IF); }
then      { return(THEN); }
else      { return(ELSE); }
{id}      { yylval = (int) installID(); return(ID); }
{number}  { yylval = (int) installNum(); return(NUMBER); }
%%
int installID() { /* function to install the lexeme, whose first character is pointed to by yytext, and whose length is yyleng, into the symbol table and return a pointer thereto */ }

int installNum() { /* similar to installID, but puts numerical constants into a separate table */ }

Finite Automata
Regular expressions = specification
Finite automata = implementation

A finite automaton consists of
An input alphabet Σ
A set of states S
A start state n
A set of accepting states F ⊆ S
A set of transitions state --input--> state

Finite Automata
Transition s1 --a--> s2
is read: in state s1 on input a, go to state s2

At end of input: if in an accepting state => accept, otherwise => reject
If no transition is possible => reject

Finite Automata State Graphs
[Figure legend: a state; the start state; an accepting state; a transition]

A Simple Example
A finite automaton that accepts only 1

A finite automaton accepts a string if we can follow transitions labeled with the characters in the string from the start to some accepting state

Another Simple Example
A finite automaton accepting any number of 1s followed by a single 0
Alphabet: {0,1}

Check that 1110 is accepted, but a string that continues after the 0, such as 1101, is not
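The acceptance rules can be made concrete for this automaton with a small C sketch. The two-state encoding (state 0 loops on 1, state 1 is the accepting state reached on 0) is an assumption reconstructed from the description, since the state graph itself is not part of the text; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* States: 0 = start (still reading 1s), 1 = accepting (just read the single 0).
   Returns 1 if s is accepted, 0 otherwise. */
int accepts_ones_then_zero(const char *s) {
    int state = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (state == 0 && *s == '1') state = 0;        /* loop on 1 */
        else if (state == 0 && *s == '0') state = 1;   /* the single 0 */
        else return 0;                                 /* no transition possible => reject */
    }
    return state == 1;   /* end of input: accept iff in the accepting state */
}
```

With this encoding 1110 and 0 are accepted, while 1101, 111 and the empty string are rejected.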

And Another Example
Alphabet {0,1}
What language does this recognize?

And Another Example
Alphabet still {0, 1}

The operation of the automaton is not completely defined by the input
On input 11 the automaton could be in either state

Epsilon Moves
Another kind of transition: ε-moves
The machine can move from state A to state B without reading input
A --ε--> B

Deterministic and Nondeterministic Automata
Deterministic Finite Automata (DFA)
  One transition per input per state
  No ε-moves
Nondeterministic Finite Automata (NFA)
  Can have multiple transitions for one input in a given state
  Can have ε-moves
Finite automata have finite memory
  Need only to encode the current state

Execution of Finite Automata
A DFA can take only one path through the state graph
  Completely determined by input

NFAs can choose
  Whether to make ε-moves
  Which of multiple transitions for a single input to take

Acceptance of NFAs
An NFA can get into multiple states
Input: 1 0 1
Rule: an NFA accepts if it can get into a final state
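One way to simulate "being in multiple states" is to track the set of active states as a bitmask. The NFA below is a hypothetical two-state machine for strings over {0,1} that end in 1, chosen for illustration (the slide's own state graph is not part of the text); it has no ε-moves.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical NFA: state 0 is the start state, state 1 is accepting.
   A set of NFA states is a bitmask: bit q set means "state q is active". */
enum { NUM_NFA_STATES = 2 };
static const unsigned START_SET  = 1u << 0;
static const unsigned ACCEPT_SET = 1u << 1;

/* delta[q][a] = bitmask of states reachable from q on input symbol a */
static const unsigned delta[NUM_NFA_STATES][2] = {
    { 1u << 0, (1u << 0) | (1u << 1) },   /* state 0: on 0 stay; on 1 stay or go to 1 */
    { 0u, 0u }                            /* state 1: no outgoing transitions */
};

/* Returns 1 if some sequence of choices ends in an accepting state. */
int nfa_accepts(const char *s) {
    unsigned current = START_SET;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        unsigned next = 0;
        int sym, q;
        if (*s != '0' && *s != '1') return 0;   /* symbol outside the alphabet */
        sym = *s - '0';
        for (q = 0; q < NUM_NFA_STATES; q++)
            if (current & (1u << q)) next |= delta[q][sym];
        current = next;
    }
    return (current & ACCEPT_SET) != 0;
}
```

On input 101 the active set goes {0} -> {0,1} -> {0} -> {0,1}; the final set contains the accepting state, so the input is accepted.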

NFA vs. DFA (1)
NFAs and DFAs recognize the same set of languages (regular languages)

DFAs are easier to implement
  There are no choices to consider

NFA vs. DFA (2)
For a given language the NFA can be simpler than the DFA
[Figure: an NFA and the equivalent DFA]
The DFA can be exponentially larger than the NFA

Regular Expressions to Finite Automata
High-level sketch:
Lexical Specification -> Regular expressions -> NFA -> DFA -> Table-driven Implementation of DFA

Regular Expressions to NFA (1)
For each kind of rexp, define an NFA
Notation: NFA for rexp A
For ε
For input a

Regular Expressions to NFA (2)
For AB
For A | B

Regular Expressions to NFA (3)
For A*

Example of RegExp -> NFA conversion
Consider the regular expression (1 | 0)*1
The NFA is [figure]

Next
Lexical Specification -> Regular expressions -> NFA -> DFA -> Table-driven Implementation of DFA

NFA to DFA. The Trick
Simulate the NFA
Each state of the resulting DFA = a non-empty subset of states of the NFA
Start state = the set of NFA states reachable through ε-moves from the NFA start state
Add a transition S --a--> S' to the DFA iff S' is the set of NFA states reachable from the states in S after seeing the input a, considering ε-moves as well
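The two ingredients of this construction, the ε-closure and the DFA transition on one input symbol, can be sketched in C with bitmask sets. The five-state NFA for 1*0 encoded here is hypothetical and exists only to exercise the functions; eps_closure and dfa_move are names introduced for this sketch.

```c
#include <assert.h>

enum { N = 5 };   /* NFA states 0..4; state 0 is the start state, state 4 accepts */

/* Hypothetical NFA for 1*0:
   0 --eps--> 1, 0 --eps--> 3, 1 --1--> 2, 2 --eps--> 0, 3 --0--> 4.
   eps[q] and on[q][a] are bitmasks of successor states. */
static const unsigned eps[N]   = { (1u << 1) | (1u << 3), 0, 1u << 0, 0, 0 };
static const unsigned on[N][2] = { {0, 0}, {0, 1u << 2}, {0, 0}, {1u << 4, 0}, {0, 0} };

/* All states reachable from `set` through zero or more epsilon-moves. */
unsigned eps_closure(unsigned set) {
    unsigned closure = set, prev;
    int q;
    do {
        prev = closure;
        for (q = 0; q < N; q++)
            if (closure & (1u << q)) closure |= eps[q];
    } while (closure != prev);      /* repeat until no new state is added */
    return closure;
}

/* DFA transition: the NFA states reachable from `set` on symbol a (0 or 1),
   closed under epsilon-moves. An empty result means there is no transition. */
unsigned dfa_move(unsigned set, int a) {
    unsigned next = 0;
    int q;
    assert(a == 0 || a == 1);
    for (q = 0; q < N; q++)
        if (set & (1u << q)) next |= on[q][a];
    return eps_closure(next);
}
```

The DFA start state is eps_closure(1u << 0), the set {0, 1, 3}; applying dfa_move to each newly discovered set for every input symbol until no new sets appear yields the full DFA.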

NFA -> DFA Example
[Figure: an NFA with states A through J over inputs 0 and 1, and the resulting DFA with states ABCDHI, FGABCDHI and EJGABCDHI]

NFA to DFA. Remark
An NFA may be in many states at any time

How many different states?

If there are N states, the NFA must be in some subset of those N states

How many non-empty subsets are there?
2^N - 1 = finitely many, but exponentially many
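As a worked instance of this count, assume the NFA of the NFA -> DFA example has exactly the ten states A through J that appear in its figure labels:

```latex
\[
  2^{N} - 1 \;=\; 2^{10} - 1 \;=\; 1023 \qquad \text{for } N = 10 .
\]
```

The DFA states listed for that example (ABCDHI, FGABCDHI, EJGABCDHI) are three of these 1023 candidate subsets; only subsets reachable from the start state have to be constructed.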

Implementation
A DFA can be implemented by a 2D table T
  One dimension is states
  The other dimension is input symbols
  For every transition Si --a--> Sk define T[i,a] = k
DFA execution
  If in state Si on input a, read T[i,a] = k and go to state Sk
  Very efficient

Table Implementation of a DFA
[Figure: DFA with states S, T, U and transitions on 0 and 1]

     0   1
S    T   U
T    T   U
U    T   U
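The table above translates almost directly into C. In this sketch the function name run_dfa is hypothetical, and because the text does not say which states are accepting, the function only reports the state reached.

```c
#include <assert.h>
#include <stddef.h>

enum { S, T, U, NUM_STATES };

/* Transition table copied from the slide: rows are states, columns are inputs 0 and 1. */
static const int table[NUM_STATES][2] = {
    /* S */ { T, U },
    /* T */ { T, U },
    /* U */ { T, U }
};

/* Run the DFA from `start` over a string of '0'/'1' characters and return the
   state reached, or -1 if a character is outside the alphabet. */
int run_dfa(int start, const char *input) {
    int state = start;
    assert(input != NULL && start >= 0 && start < NUM_STATES);
    for (; *input != '\0'; input++) {
        if (*input != '0' && *input != '1') return -1;
        state = table[state][*input - '0'];   /* one table lookup per character */
    }
    return state;
}
```

Each input character costs a single array lookup, which is why the table-driven form is described as very efficient; the price is the size of the table, one entry per state and input symbol.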

Implementation (Cont.)
NFA -> DFA conversion is at the heart of tools such as flex or jflex

But, DFAs can be huge

In practice, flex-like tools trade off speed for space in the choice of NFA and DFA representations

Readings
Chapter 3 of the book
