Lexical Analysis and Scanning Honors Compilers Feb 5 th 2001 Robert Dewar.
Lexical Analysis and Scanning
  Honors Compilers
  Feb 5th 2001
  Robert Dewar
The Input
  Read string input
    Might be sequence of characters (Unix)
    Might be sequence of lines (VMS)
  Character set
    ASCII
    ISO Latin-1
    ISO 10646 (16-bit = unicode)
    Others (EBCDIC, JIS, etc)
The Output
  A series of tokens
    Punctuation ( ) ; , [ ]
    Operators + - ** :=
    Keywords begin end if
    Identifiers Square_Root
    String literals "hello this is a string"
    Character literals 'x'
    Numeric literals 123 4_5.23e+2 16#ac#
Free form vs Fixed form
  Free form languages
    White space does not matter
      Tabs, spaces, new lines, carriage returns
    Only the ordering of tokens is important
  Fixed format languages
    Layout is critical
      Fortran, label in cols 1-5 (col 6 marks continuation)
      COBOL, area A B
    Lexical analyzer must worry about layout
Punctuation
  Typically individual special characters
    Such as + -
    Lexical analyzer does not know one use of : from another
  Sometimes double characters
    E.g. (* treated as a kind of bracket
  Returned just as identity of token
    And perhaps location
      For error message and debugging purposes
Operators
  Like punctuation
    No real difference for lexical analyzer
    Typically single or double special chars
      Operators + -
      Operations :=
  Returned just as identity of token
    And perhaps location
Keywords
  Reserved identifiers
    E.g. BEGIN END in Pascal, if in C
  Maybe distinguished from identifiers
    E.g. mode (identifier) vs mode in bold type (keyword) in Algol-68
  Returned just as token identity
    With possible location information
  Unreserved keywords (e.g. PL/1)
    Handled as identifiers (parser distinguishes)
Identifiers
  Rules differ
    Length, allowed characters, separators
  Need to build table
    So that junk1 is recognized as junk1
    Typical structure: hash table
  Lexical analyzer returns token type
    And key to table entry
    Table entry includes location information
More on Identifier Tables
  Most common structure is hash table
    With fixed number of headers
    Chain according to hash code
    Serial search on one chain
    Hash code computed from characters
    No hash code is perfect!
    Avoid any arbitrary limits
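The table described on this slide might look roughly like the following Python sketch. The class name, the default of 211 headers, and the multiplier 31 are all illustrative assumptions, not taken from any particular compiler.

```python
class IdentifierTable:
    """Chained hash table with a fixed number of headers (illustrative sketch)."""

    def __init__(self, num_headers=211):
        self.headers = [[] for _ in range(num_headers)]  # one chain per header
        self.names = []                                  # key -> spelling

    def _hash(self, name):
        # Hash code computed from the characters; 31 is an arbitrary choice.
        h = 0
        for ch in name:
            h = (h * 31 + ord(ch)) % len(self.headers)
        return h

    def lookup(self, name):
        """Return the key for name, entering it if it is not yet present."""
        chain = self.headers[self._hash(name)]
        for key in chain:                 # serial search on one chain
            if self.names[key] == name:
                return key
        key = len(self.names)             # chains and names grow without limit
        self.names.append(name)
        chain.append(key)
        return key


table = IdentifierTable(num_headers=4)    # tiny table, so chains must collide
k1 = table.lookup("junk1")
k2 = table.lookup("Square_Root")
k3 = table.lookup("junk1")
```

Here the second lookup of junk1 returns the same key as the first, which is the property the scanner relies on when it hands the parser a key instead of the spelling.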
String Literals
  Text must be stored
  Actual characters are important
    Not like identifiers
    Character set issues
  Table needed
    Lexical analyzer returns key to table
    May or may not be worth hashing
Character Literals
  Similar issues to string literals
  Lexical Analyzer returns
    Token type
    Identity of character
  Note, cannot assume character set of host machine, may be different
Numeric Literals
  Also need a table
  Typically record value
    E.g. 123 = 0123 = 01_23 (Ada)
    But cannot use int for values
      Because may have different characteristics
  Float stuff much more complex
    Denormals, correct rounding
    Very delicate stuff
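One way to record literal values without depending on the host's int or float is to keep them exact. The sketch below is an illustration only: the function name is hypothetical, and it handles decimal literals but not based literals such as 16#ac#. It relies on Python's arbitrary-precision integers and the standard fractions module, so no rounding happens at scan time.

```python
from fractions import Fraction


def ada_decimal_value(text):
    """Exact value of a decimal literal such as 01_23 or 4_5.23e+2."""
    text = text.replace("_", "").lower()          # underlines carry no value
    mantissa, _, exponent = text.partition("e")
    whole, _, frac = mantissa.partition(".")
    value = Fraction(int(whole + frac), 10 ** len(frac))
    return value * Fraction(10) ** int(exponent or "0")
```

With this representation 123, 0123 and 01_23 all map to the same stored value, and the choice of target floating-point format can be deferred to a later phase.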
Handling Comments
  Comments have no effect on program
  Can therefore be eliminated by scanner
    But may need to be retrieved by tools
  Error detection issues
    E.g. unclosed comments
  Scanner does not return comments
Case Equivalence
  Some languages have case equivalence
    Pascal, Ada
  Some do not
    C, Java
  Lexical analyzer ignores case if needed
    This_Routine = THIS_RouTine
    Error analysis may need exact casing
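A small sketch of how a scanner could ignore case for lookup while still keeping the exact casing for error messages. The class and method names are hypothetical, and folding with lower() is an assumption that suits ASCII identifiers.

```python
class CaseInsensitiveNames:
    """Identifier table that folds case for lookup but remembers the spelling."""

    def __init__(self):
        self.keys = {}        # folded spelling -> key
        self.spellings = []   # key -> spelling as first written

    def enter(self, name):
        folded = name.lower()             # This_Routine and THIS_RouTine agree
        if folded not in self.keys:
            self.keys[folded] = len(self.spellings)
            self.spellings.append(name)   # exact casing kept for diagnostics
        return self.keys[folded]


names = CaseInsensitiveNames()
first = names.enter("This_Routine")
second = names.enter("THIS_RouTine")
```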
Issues to Address
  Speed
    Lexical analysis can take a lot of time
    Minimize processing per character
      I/O is also an issue (read large blocks)
  We compile frequently
    Compilation time is important
      Especially during development
General Approach
  Define set of token codes
    An enumeration type
    A series of integer definitions
    These are just codes (no semantics)
  Some codes associated with data
    E.g. key for identifier table
  May be useful to build tree node
    For identifiers, literals etc
Interface to Lexical Analyzer
  Convert entire file to a file of tokens
    Lexical analyzer is separate phase
  Parser calls lexical analyzer
    Get next token
    This approach avoids extra I/O
    Parser builds tree as we go along
Implementation of Scanner
  Given the input text
  Generate the required tokens
  Or provide token by token on demand
  Before we describe implementations
    We take this short break
    To describe relevant formalisms
Relevant Formalisms
  Type 3 (Regular) Grammars
  Regular Expressions
  Finite State Machines
Regular Grammars
  Regular grammars
    Non-terminals (arbitrary names)
    Terminals (characters)
    Two forms of rules
      Non-terminal ::= terminal
      Non-terminal ::= terminal Non-terminal
    One non-terminal is the start symbol
  Regular (type 3) grammars cannot count
    No concept of matching nested parens
Regular Grammars
  Regular grammars
    E.g. grammar of reals with no exponent
      REAL ::= 0 REAL1 (repeat for 1 .. 9)
      REAL1 ::= 0 REAL1 (repeat for 1 .. 9)
      REAL1 ::= . INTEGER
      INTEGER ::= 0 INTEGER (repeat for 1 .. 9)
      INTEGER ::= 0 (repeat for 1 .. 9)
    Start symbol is REAL
Regular Expressions
  Regular expressions (RE) defined by
    Any terminal character is an RE
    Alternation RE | RE
    Concatenation RE1 RE2
    Repetition RE* (zero or more RE’s)
  Language of RE’s = type 3 grammars
    Regular expressions are more convenient
Specifying RE’s in Unix Tools
  Single characters a b c d \x
  Alternation [bcd] [b-z] ab|cd
  Match any character .
  Match sequence of characters x* y+
  Concatenation abc[d-q]
  Optional [0-9]+(\.[0-9]*)?
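The same notation can be tried out with Python's standard re module, whose syntax for these constructs follows the Unix tools. Note that an unescaped . matches any character, so the sketch writes the decimal point as \. to mean a literal dot. The names UNSIGNED_REAL and is_number are just illustrative.

```python
import re

# Digits, optionally followed by a literal point and more digits.
UNSIGNED_REAL = re.compile(r"[0-9]+(\.[0-9]*)?")


def is_number(text):
    """True if the whole of text matches the pattern."""
    return UNSIGNED_REAL.fullmatch(text) is not None


# With an unescaped dot the "point" position accepts any character at all.
loose = re.fullmatch(r"[0-9]+(.[0-9]*)?", "12x5")
```

So is_number("12x5") is false, while the loose pattern accepts the same string.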
Finite State Machines
  Languages and Automata
    A language is a set of strings
    An automaton is a machine
      That determines if a given string is in the language or not.
  FSM’s are automata that recognize regular languages (regular expressions)
Definitions of FSM
  A set of labeled states
  Directed arcs labeled with character
  A state may be marked as terminal
  Transition from state S1 to S2
    If and only if arc from S1 to S2
    Labeled with next character (which is eaten)
  Recognized if ends up in terminal state
  One state is distinguished start state
Building FSM from Grammar
  One state for each non-terminal
  A rule of the form
    Nont1 ::= terminal
    Generates transition from S1 to final state
  A rule of the form
    Nont1 ::= terminal Nont2
    Generates transition from S1 to S2
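As an illustration, the two rule forms can be turned into arcs mechanically. The sketch below applies them to the grammar of reals with no exponent given earlier. The tuple encoding of rules and the names build_fsm, accepts and FINAL are assumptions made for this example. Because INTEGER has two rules for each digit, the resulting machine has two arcs with the same label, so the sketch follows a set of possible states instead of backtracking.

```python
DIGITS = "0123456789"
FINAL = "<final>"   # the extra final state, not a non-terminal

# (non-terminal, terminal, following non-terminal or None)
RULES = (
    [("REAL", d, "REAL1") for d in DIGITS]
    + [("REAL1", d, "REAL1") for d in DIGITS]
    + [("REAL1", ".", "INTEGER")]
    + [("INTEGER", d, "INTEGER") for d in DIGITS]
    + [("INTEGER", d, None) for d in DIGITS]
)


def build_fsm(rules):
    """Map (state, character) to the set of states reachable on that arc."""
    arcs = {}
    for nonterminal, terminal, successor in rules:
        target = FINAL if successor is None else successor
        arcs.setdefault((nonterminal, terminal), set()).add(target)
    return arcs


def accepts(arcs, start, text):
    """Run the machine, tracking every state it could be in."""
    current = {start}
    for ch in text:
        current = {t for s in current for t in arcs.get((s, ch), ())}
    return FINAL in current


ARCS = build_fsm(RULES)
```

Under this grammar accepts(ARCS, "REAL", "3.14") holds, while "314" and "3." are rejected because the grammar requires a point followed by at least one digit.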
Building FSM’s from RE’s
  Every RE corresponds to a grammar
  For all regular expressions
    A natural translation to FSM exists
  We will not give details of algorithm here
Non-Deterministic FSM
  A non-deterministic FSM
    Has at least one state
      With two arcs to two separate states
      Labeled with the same character
  Which way to go?
    Implementation requires backtracking
    Nasty
Deterministic FSM
  For all states S
    For all characters C
      There is either ONE or NO arcs
        From state S
        Labeled with character C
  Much easier to implement
    No backtracking
Dealing with ND FSM
  Construction naturally leads to ND FSM
  For example, consider FSM for
    [0-9]+ | [0-9]+\.[0-9]+
    (integer or real)
  We will naturally get a start state
    With two sets of 0-9 branches
    And thus non-deterministic
Converting to Deterministic
  There is an algorithm for converting
    From any ND FSM
    To an equivalent deterministic FSM
  Algorithm is in the text book
  Example (given in terms of RE’s)
    [0-9]+ | [0-9]+\.[0-9]+
    [0-9]+(\.[0-9]+)?
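The usual textbook conversion is the subset construction, in which each deterministic state stands for a set of non-deterministic states. The following is a minimal sketch of that idea for the integer-or-real example, assuming an ND FSM without empty transitions. The state names S, I, R1, R2, R3 and all function names are invented for the illustration.

```python
DIGITS = "0123456789"


def nfa_for_integer_or_real():
    """ND FSM for [0-9]+ | [0-9]+\\.[0-9]+ with two sets of digit arcs from S."""
    arcs = {}

    def arc(src, chars, dst):
        for ch in chars:
            arcs.setdefault((src, ch), set()).add(dst)

    arc("S", DIGITS, "I")      # integer branch
    arc("I", DIGITS, "I")
    arc("S", DIGITS, "R1")     # real branch
    arc("R1", DIGITS, "R1")
    arc("R1", ".", "R2")
    arc("R2", DIGITS, "R3")
    arc("R3", DIGITS, "R3")
    return arcs, "S", {"I", "R3"}


def to_deterministic(arcs, start, finals):
    """Subset construction: each new state is a frozenset of old states."""
    alphabet = {ch for (_, ch) in arcs}
    start_set = frozenset([start])
    table, seen, todo = {}, {start_set}, [start_set]
    while todo:
        state = todo.pop()
        for ch in alphabet:
            target = frozenset(t for s in state for t in arcs.get((s, ch), ()))
            if not target:
                continue                   # no arc: input is rejected here
            table[(state, ch)] = target    # exactly one arc per (state, ch)
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return table, start_set, {s for s in seen if s & finals}


def run(table, start, finals, text):
    state = start
    for ch in text:
        state = table.get((state, ch))
        if state is None:
            return False
    return state in finals


arcs, start, finals = nfa_for_integer_or_real()
table, dstart, dfinals = to_deterministic(arcs, start, finals)
```

For this input the construction yields four deterministic states, which corresponds to the machine one would draw directly for [0-9]+(\.[0-9]+)?.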
Implementing the Scanner
  Three methods
    Completely informal, just write code
    Define tokens using regular expressions
      Convert RE’s to ND finite state machine
      Convert ND FSM to deterministic FSM
      Program the FSM
    Use an automated program
      To achieve above three steps
Ad Hoc Code (forget FSM’s)
  Write normal hand code
    A procedure called Scan
    Normal coding techniques
  Basically scan over white space and comments till non-blank character found.
  Base subsequent processing on character
    E.g. colon may be : or :=
    / may be operator or start of comment
  Return token found
  Write aggressive efficient code
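A hand-coded scanner along these lines could be sketched as follows. This is a toy under stated assumptions: the token names are invented, comments use the Ada -- form (so here it is - rather than / that may start a comment), and only identifiers, integers, : and := get special treatment.

```python
def scan(text, pos=0):
    """Return (token, lexeme, next_pos) for the next token at or after pos."""
    n = len(text)
    while pos < n:                              # skip white space and comments
        if text[pos] in " \t\r\n":
            pos += 1
        elif text.startswith("--", pos):        # comment runs to end of line
            while pos < n and text[pos] != "\n":
                pos += 1
        else:
            break
    if pos == n:
        return ("EOF", "", pos)
    ch = text[pos]                              # process based on first character
    if ch == ":":                               # colon may be : or :=
        if text.startswith(":=", pos):
            return ("ASSIGN", ":=", pos + 2)
        return ("COLON", ":", pos + 1)
    if ch.isalpha():
        end = pos + 1
        while end < n and (text[end].isalnum() or text[end] == "_"):
            end += 1
        return ("IDENT", text[pos:end], end)
    if ch.isdigit():
        end = pos + 1
        while end < n and text[end].isdigit():
            end += 1
        return ("INTEGER", text[pos:end], end)
    return ("PUNCT", ch, pos + 1)               # any other single character


def tokens(text):
    """Call scan repeatedly, as a parser would, until end of input."""
    pos, out = 0, []
    while True:
        kind, lexeme, pos = scan(text, pos)
        if kind == "EOF":
            return out
        out.append((kind, lexeme))
```

The tokens helper stands in for the parser asking for the next token on demand.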
Using FSM Formalisms
  Start with regular grammar or RE
    Typically found in the language standard
  For example, for Ada:
    Chapter 2. Lexical Elements
      digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
      decimal-literal ::= integer [.integer] [exponent]
      integer ::= digit {[underline] digit}
      exponent ::= E [+] integer | E - integer
Using FSM formalisms, cont
  Given RE’s or grammar
    Convert to finite state machine
    Convert ND FSM to deterministic FSM
  Write a program to recognize
    Using the deterministic FSM
Implementing FSM (Method 1)
  Each state is code of the form:
    <<state1>>
      case Next_Character is
        when 'a' => goto state3;
        when 'b' => goto state1;
        when others => End_of_token_processing;
      end case;
    <<state2>>
      …
Implementing FSM (Method 2)
  There is a variable called State
    loop
      case State is
        when state1 =>
          case Next_Character is
            when 'a' => State := state3;
            when 'b' => State := state1;
            when others => End_of_token_processing;
          end case;
        when state2 …
          …
      end case;
    end loop;
Implementing FSM (Method 3)
  T : array (State, Character) of State;
  while More_Input loop
    Curstate := T (Curstate, Next_Char);
    if Curstate = Error_State then …
  end loop;
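A runnable sketch of the table-driven method, here for [0-9]+(\.[0-9]+)?. The state names and the use of nested dictionaries in place of a two-dimensional array are choices made for this illustration. Missing table entries are treated as transitions to the error state.

```python
DIGITS = "0123456789"
START, INT, DOT, FRAC, ERROR = range(5)
FINAL_STATES = {INT, FRAC}

# T[state][character] gives the next state; absent entries mean ERROR.
T = {
    START: {d: INT for d in DIGITS},
    INT: {**{d: INT for d in DIGITS}, ".": DOT},
    DOT: {d: FRAC for d in DIGITS},
    FRAC: {d: FRAC for d in DIGITS},
    ERROR: {},
}


def recognize(text):
    """Drive the table over text; accept if we stop in a final state."""
    state = START
    for ch in text:                        # while More_Input loop
        state = T[state].get(ch, ERROR)    # Curstate := T (Curstate, Next_Char)
        if state == ERROR:
            return False
    return state in FINAL_STATES
```

The loop body is the same for every token language. Only the table changes, which is what makes this form attractive for generated scanners.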
Automatic FSM Generation
  Our example, FLEX
    See home page for manual in HTML
  FLEX is given
    A set of regular expressions
    Actions associated with each RE
  It builds a scanner
    Which matches RE’s and executes actions
Flex General Format
  Input to Flex is a set of rules:
    Regexp actions (C statements)
    Regexp actions (C statements)
    …
  Flex scans the longest matching Regexp
    And executes the corresponding actions
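The longest-match rule can be imitated in a few lines of Python to see its effect. This is only a model of the matching policy, not of how flex is implemented: flex compiles all rules into one deterministic machine, whereas this sketch tries each rule in turn. When two rules match the same length, flex chooses the rule listed first, and the sketch does the same. The rule names and next_token are invented for the example.

```python
import re

# Rules in priority order, as they would be listed in the flex input.
# (The alternatives inside one pattern are tried left to right by re, so
# this model assumes no alternative is a prefix of another.)
RULES = [
    ("keyword", re.compile(r"if|then|begin|end|procedure|function")),
    ("identifier", re.compile(r"[a-z][a-z0-9]*")),
    ("float", re.compile(r"[0-9]+\.[0-9]*")),
    ("integer", re.compile(r"[0-9]+")),
    ("space", re.compile(r"[ \t\n]+")),
]


def next_token(text, pos):
    """Pick the rule with the longest match at pos; ties go to the earlier rule."""
    best_name, best_end = None, pos
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        if m and m.end() > best_end:       # strictly longer match wins
            best_name, best_end = name, m.end()
    return best_name, text[pos:best_end]
```

This is why a keyword rule can safely be listed before the identifier rule: end is reported as a keyword because of the tie-break, while ending is an identifier because the identifier match is longer.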
An Example of a Flex scanner

DIGIT    [0-9]
ID       [a-z][a-z0-9]*
%%
{DIGIT}+ {
    printf ("an integer %s (%d)\n",
            yytext, atoi (yytext));
  }

{DIGIT}+"."{DIGIT}* {
    printf ("a float %s (%g)\n",
            yytext, atof (yytext));
  }

if|then|begin|end|procedure|function {
    printf ("a keyword: %s\n", yytext);
  }
Flex Example (continued)

{ID}    printf ("an identifier %s\n", yytext);

"+"|"-"|"*"|"/" {
    printf ("an operator %s\n", yytext); }

"--".*\n    /* eat Ada style comment */

[ \t\n]+    /* eat white space */

.    printf ("unrecognized character");
%%
Assembling the flex program

%{
#include <stdlib.h> /* for atoi and atof */
%}

<<flex text we gave goes here, ending with %%>>

int main (int argc, char **argv)
{
    yyin = fopen (argv[1], "r");  /* scan the file named on the command line */
    yylex();
    return 0;
}
Running flex
  flex is a program that is executed
    The input is as we have given
    The output is a C source file (lex.yy.c), compiled to get a running program
  For Ada fans
    Look at aflex (www.adapower.com)
  For C++ fans
    flex can run in C++ mode
      Generates appropriate classes
Choice Between Methods?
  Hand written scanners
    Typically much faster execution
    And pretty easy to write
    And easier for good error recovery
  Flex approach
    Simple to Use
    Easy to modify token language
The GNAT Scanner
  Hand written (scn.adb/scn.ads)
  Basically a call does
    Super quick scan past blanks/comments etc
    Big case statement
      Process based on first character
      Call special routines
        Namet.Get_Name for identifier (hashing)
        Keywords recognized by special hash
        Strings (stringt.ads)
        Integers (uintp.ads)
        Reals (ureal.ads)
More on the GNAT Scanner
  Entire source read into memory
    Single contiguous block
    Source location is index into this block
    Different index range for each source file
  See sinput.adb/ads for source mgmt
  See scans.ads for definitions of tokens
More on GNAT Scanner
  Read scn.adb code
    Very easy reading, e.g.
ASSIGNMENT TWO
  Write a flex or aflex program
    Recognize tokens of Algol-68s program
    Print out tokens in style of flex example
  Extra credit
    Build hash table for identifiers
    Output hash table key
Preprocessors
  Some languages allow preprocessing
  This is a separate step
    Input is source
    Output is expanded source
  Can either be done as separate phase
    Or embedded into the lexical analyzer
    Often done as separate phase
      Need to keep track of source locations
Nasty Glitches
  Separation of tokens
    Not all languages have clear rules
    FORTRAN has optional spaces
      DO10I=1.6
        identifier operator literal
        DO10I = 1.6
      DO10I=1,6
        Keyword stmt loopvar operator literal punc literal
        DO 10 I = 1 , 6
  Modern languages avoid this kind of thing!
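To make the FORTRAN glitch concrete, here is a deliberately simplified sketch that tells the two statements apart by looking ahead for a comma after the =. The function name and the regular expression are assumptions for this illustration. A real scanner also has to ignore commas nested in parentheses, which this toy does not attempt.

```python
import re

# DO <label> <variable> = <first> , <last>, all with blanks already removed.
DO_STMT = re.compile(r"DO(\d+)([A-Z][A-Z0-9]*)=([^,]+),(.+)")


def classify(statement):
    """Split a statement into tokens once it is known whether it is a DO loop."""
    text = statement.replace(" ", "").upper()    # blanks are not significant
    m = DO_STMT.fullmatch(text)
    if m:                                        # a comma follows the =
        label, var, first, last = m.groups()
        return ["DO", label, var, "=", first, ",", last]
    name, _, value = text.partition("=")         # otherwise plain assignment
    return [name, "=", value]
```

The decision between the identifier DO10I and the keyword DO cannot be made until the scanner has seen past the = sign, which is the kind of unbounded lookahead a cleanly tokenized language avoids.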