Lexical Analysis and Scanning Honors Compilers Feb 5 th 2001 Robert Dewar.

Lexical Analysis and Scanning

Honors Compilers

Feb 5th 2001

Robert Dewar

The Input

Read string input
  Might be sequence of characters (Unix)
  Might be sequence of lines (VMS)
Character set
  ASCII
  ISO Latin-1
  ISO 10646 (Unicode; 16 bits cover only the Basic Multilingual Plane)
  Others (EBCDIC, JIS, etc)

The Output

A series of tokens
  Punctuation ( ) ; , [ ]
  Operators + - ** :=
  Keywords begin end if
  Identifiers Square_Root
  String literals "hello this is a string"
  Character literals 'x'
  Numeric literals 123 4_5.23e+2 16#ac#

Free form vs Fixed form

Free form languages
  White space does not matter
    Tabs, spaces, new lines, carriage returns
  Only the ordering of tokens is important
Fixed format languages
  Layout is critical
    Fortran, label in cols 1-5 (col 6 marks a continuation line)
    COBOL, area A B
  Lexical analyzer must worry about layout

Punctuation

Typically individual special characters
  Such as + -
  Lexical analyzer does not know one use of : from another
Sometimes double characters
  E.g. (* treated as a kind of bracket
Returned just as identity of token
  And perhaps location
    For error message and debugging purposes

Operators

Like punctuation
  No real difference for lexical analyzer
  Typically single or double special chars
    Operators + -
    Operations :=
Returned just as identity of token
  And perhaps location

Keywords

Reserved identifiers
  E.g. BEGIN END in Pascal, if in C
Maybe distinguished from identifiers
  E.g. mode (identifier) vs mode (bold, i.e. stropped keyword) in Algol-68
Returned just as token identity
  With possible location information
Unreserved keywords (e.g. PL/1)
  Handled as identifiers (parser distinguishes)

Identifiers

Rules differ
  Length, allowed characters, separators
Need to build table
  So that junk1 is recognized as junk1
  Typical structure: hash table
Lexical analyzer returns token type
  And key to table entry
  Table entry includes location information

More on Identifier Tables

Most common structure is hash table
  With fixed number of headers
  Chain according to hash code
  Serial search on one chain
Hash code computed from characters
No hash code is perfect!
Avoid any arbitrary limits
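
The chained hash table described above could be sketched in C roughly as follows. This is only an illustration: the names intern, Entry and NUM_HEADERS, the header count of 1024 and the multiply-by-31 hash are assumptions made for the example, not taken from any particular compiler.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NUM_HEADERS 1024           /* fixed number of chain headers (assumed size) */

typedef struct Entry {
    struct Entry *next;            /* next entry on the same chain */
    int key;                       /* key handed back to the parser */
    char *name;                    /* heap copy, so no limit on identifier length */
} Entry;

static Entry *headers[NUM_HEADERS];
static int next_key = 0;

/* Hash code computed from the characters (a simple multiplicative hash). */
static unsigned hash(const char *s, size_t len)
{
    unsigned h = 0;
    for (size_t i = 0; i < len; i++)
        h = h * 31u + (unsigned char)s[i];
    return h % NUM_HEADERS;
}

/* Return the key for identifier s[0..len-1], entering it if it is new.
   Returns -1 if memory runs out. */
int intern(const char *s, size_t len)
{
    assert(s != NULL);
    unsigned h = hash(s, len);

    /* Serial search on one chain. */
    for (Entry *e = headers[h]; e != NULL; e = e->next)
        if (strlen(e->name) == len && memcmp(e->name, s, len) == 0)
            return e->key;

    Entry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->name = malloc(len + 1);
    if (e->name == NULL) {
        free(e);
        return -1;
    }
    memcpy(e->name, s, len);
    e->name[len] = '\0';
    e->key = next_key++;
    e->next = headers[h];          /* push on the front of the chain */
    headers[h] = e;
    return e->key;
}
```

Calling intern("junk1", 5) twice yields the same key both times, which is the property the scanner needs. Because names are copied to the heap and chains grow on demand, neither identifier length nor identifier count is limited by the table itself.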

String Literals

Text must be stored
Actual characters are important
  Not like identifiers
  Character set issues
Table needed
  Lexical analyzer returns key to table
  May or may not be worth hashing

Character Literals

Similar issues to string literals
Lexical Analyzer returns
  Token type
  Identity of character
Note, cannot assume character set of host machine, may be different

Numeric Literals

Also need a table
Typically record value
  E.g. 123 = 0123 = 01_23 (Ada)
But cannot use int for values
  Because host and target may have different characteristics
Float stuff much more complex
  Denormals, correct rounding
  Very delicate stuff

Handling Comments

Comments have no effect on program
Can therefore be eliminated by scanner
  But may need to be retrieved by tools
Error detection issues
  E.g. unclosed comments
Scanner does not return comments

Case Equivalence

Some languages have case equivalence
  Pascal, Ada
Some do not
  C, Java
Lexical analyzer ignores case if needed
  This_Routine = THIS_RouTine
  Error analysis may need exact casing
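
One way to get case equivalence while keeping the exact casing for error messages is to fold a copy of the spelling before the table lookup. The sketch below assumes plain ASCII identifiers in the default "C" locale; the function names same_identifier and fold_case are made up for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if a and b (both of length len) are equal ignoring letter case. */
int same_identifier(const char *a, const char *b, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    return 1;
}

/* Write a lower-case, NUL-terminated copy of src[0..len-1] into dst.
   dst must have room for len + 1 characters. The original spelling in
   src is left untouched so messages can quote it exactly. */
void fold_case(char *dst, const char *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = (char)tolower((unsigned char)src[i]);
    dst[len] = '\0';
}
```

With this, This_Routine and THIS_RouTine both fold to this_routine and so land on the same table entry.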

Issues to Address

Speed
  Lexical analysis can take a lot of time
  Minimize processing per character
    I/O is also an issue (read large blocks)
We compile frequently
  Compilation time is important
    Especially during development

General Approach

Define set of token codes
  An enumeration type
  A series of integer definitions
  These are just codes (no semantics)
Some codes associated with data
  E.g. key for identifier table
May be useful to build tree node
  For identifiers, literals etc
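
As a rough illustration of token codes with optional associated data, the declarations might look like this in C. The particular codes, the Token layout and the helper names are assumptions for the example only.

```c
#include <assert.h>

/* Token codes: plain codes with no semantics attached. */
typedef enum {
    TOK_EOF,
    TOK_LPAREN, TOK_RPAREN, TOK_SEMICOLON,   /* punctuation */
    TOK_PLUS, TOK_MINUS, TOK_ASSIGN,         /* operators */
    TOK_BEGIN, TOK_END, TOK_IF,              /* keywords */
    TOK_IDENTIFIER,
    TOK_STRING_LITERAL,
    TOK_NUMERIC_LITERAL
} TokenCode;

typedef struct {
    TokenCode code;
    long location;     /* offset in the source, for error messages */
    int table_key;     /* key into identifier or literal table, -1 if unused */
} Token;

/* Only some codes carry associated data. */
int token_has_data(TokenCode code)
{
    return code == TOK_IDENTIFIER
        || code == TOK_STRING_LITERAL
        || code == TOK_NUMERIC_LITERAL;
}

/* Build a token, discarding the key for codes that have no data. */
Token make_token(TokenCode code, long location, int table_key)
{
    Token t;
    t.code = code;
    t.location = location;
    t.table_key = token_has_data(code) ? table_key : -1;
    return t;
}
```

Punctuation, operators and keywords are then fully described by their code and location, while identifiers and literals also carry a table key.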

Interface to Lexical Analyzer

Convert entire file to a file of tokens
  Lexical analyzer is separate phase
Parser calls lexical analyzer
  Get next token
  This approach avoids extra I/O
  Parser builds tree as we go along

Implementation of Scanner

Given the input text
  Generate the required tokens
  Or provide token by token on demand
Before we describe implementations
  We take this short break
  To describe relevant formalisms

Relevant Formalisms

Type 3 (Regular) Grammars
Regular Expressions
Finite State Machines

Regular Grammars

Regular grammars
  Non-terminals (arbitrary names)
  Terminals (characters)
  Two forms of rules
    Non-terminal ::= terminal
    Non-terminal ::= terminal Non-terminal
  One non-terminal is the start symbol
Regular (type 3) grammars cannot count
  No concept of matching nested parens

Regular Grammars

Regular grammars
E.g. grammar of reals with no exponent
  REAL ::= 0 REAL1 (repeat for 1 .. 9)
  REAL1 ::= 0 REAL1 (repeat for 1 .. 9)
  REAL1 ::= . INTEGER
  INTEGER ::= 0 INTEGER (repeat for 1 .. 9)
  INTEGER ::= 0 (repeat for 1 .. 9)
Start symbol is REAL

Regular Expressions

Regular expressions (RE) defined by
  Any terminal character is an RE
  Alternation RE | RE
  Concatenation RE1 RE2
  Repetition RE* (zero or more RE's)
Languages of RE's = languages of type 3 grammars
Regular expressions are more convenient

Specifying RE's in Unix Tools

Single characters a b c d \x
Alternation [bcd] [b-z] ab|cd
Match any character .
Match sequence of characters x* y+
Concatenation abc[d-q]
Optional [0-9]+(\.[0-9]*)?

Finite State Machines

Languages and Automata
  A language is a set of strings
  An automaton is a machine
    That determines if a given string is in the language or not.
FSM's are automata that recognize regular languages (regular expressions)

Definitions of FSM

A set of labeled states
Directed arcs labeled with character
A state may be marked as terminal
Transition from state S1 to S2
  If and only if arc from S1 to S2
  Labeled with next character (which is eaten)
Recognized if ends up in terminal state
One state is distinguished start state

Building FSM from Grammar

One state for each non-terminal
A rule of the form
  Nont1 ::= terminal
  Generates transition from S1 to final state
A rule of the form
  Nont1 ::= terminal Nont2
  Generates transition from S1 to S2
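
To make the construction concrete, here is a hand translation of the REAL grammar from the earlier Regular Grammars slide, with one labeled piece of code per non-terminal. It is a sketch only: the function name is_real is invented, and the choice between INTEGER ::= digit INTEGER and INTEGER ::= digit is resolved by peeking at the end of the input, which is an assumption of this example rather than part of the construction.

```c
#include <assert.h>
#include <ctype.h>

/* Return 1 if the whole string s can be derived from REAL. */
int is_real(const char *s)
{
    assert(s != 0);

    /* State REAL:  REAL ::= digit REAL1 */
    if (isdigit((unsigned char)*s)) { s++; goto real1; }
    return 0;

real1:
    /* State REAL1:  REAL1 ::= digit REAL1  |  . INTEGER */
    if (isdigit((unsigned char)*s)) { s++; goto real1; }
    if (*s == '.') { s++; goto integer; }
    return 0;

integer:
    /* State INTEGER:  INTEGER ::= digit INTEGER  |  digit */
    if (isdigit((unsigned char)*s)) {
        s++;
        if (*s == '\0')
            return 1;          /* rule INTEGER ::= digit reaches the final state */
        goto integer;
    }
    return 0;
}
```

Each rule of the form Nont1 ::= terminal Nont2 became a goto from one label to another, and the rule without a trailing non-terminal became the accepting return.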

Building FSM's from RE's

Every RE corresponds to a grammar
For all regular expressions
  A natural translation to FSM exists
We will not give details of algorithm here

Non-Deterministic FSM

A non-deterministic FSM
  Has at least one state
    With two arcs to two separate states
    Labeled with the same character
Which way to go?
  Implementation requires backtracking
  Nasty

Deterministic FSM

For all states S
  For all characters C
    There is either ONE arc or NO arc
      From state S
      Labeled with character C
Much easier to implement
  No backtracking

Dealing with ND FSM

Construction naturally leads to ND FSM
For example, consider FSM for
  [0-9]+ | [0-9]+\.[0-9]+
  (integer or real)
We will naturally get a start state
  With two sets of 0-9 branches
  And thus non-deterministic
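
A common alternative to backtracking is to follow every possible arc at once, keeping track of the set of states the machine could be in. The sketch below does this for the integer-or-real example using a bit mask as the set. The state names and the function nfa_accepts are invented for illustration. The same idea underlies the usual conversion to a deterministic machine (the subset construction), where each reachable set becomes one state.

```c
#include <assert.h>
#include <ctype.h>

/* One bit per NFA state. */
enum {
    S_START = 1 << 0,   /* start state */
    S_INT   = 1 << 1,   /* inside [0-9]+ of the integer branch (terminal) */
    S_WHOLE = 1 << 2,   /* inside the first [0-9]+ of the real branch */
    S_DOT   = 1 << 3,   /* just after the '.' */
    S_FRAC  = 1 << 4    /* inside the fraction digits (terminal) */
};

/* All states reachable from any state in set on character c. */
static unsigned step(unsigned set, char c)
{
    unsigned next = 0;
    int digit = isdigit((unsigned char)c);

    if ((set & S_START) && digit)    next |= S_INT | S_WHOLE;  /* both branches */
    if ((set & S_INT)   && digit)    next |= S_INT;
    if ((set & S_WHOLE) && digit)    next |= S_WHOLE;
    if ((set & S_WHOLE) && c == '.') next |= S_DOT;
    if ((set & S_DOT)   && digit)    next |= S_FRAC;
    if ((set & S_FRAC)  && digit)    next |= S_FRAC;
    return next;
}

/* Return 1 if the whole string s is an integer or a real. */
int nfa_accepts(const char *s)
{
    unsigned set = S_START;

    assert(s != 0);
    for (; *s != '\0' && set != 0; s++)
        set = step(set, *s);
    return (set & (S_INT | S_FRAC)) != 0;
}
```

After the first digit the set is {S_INT, S_WHOLE}: the machine does not have to guess which branch it is on, so no input is ever rescanned.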

Converting to Deterministic

There is an algorithm for converting
  From any ND FSM
  To an equivalent deterministic FSM
Algorithm is in the text book
Example (given in terms of RE's)
  [0-9]+ | [0-9]+\.[0-9]+
  [0-9]+(\.[0-9]+)?

Implementing the Scanner

Three methods
  Completely informal, just write code
  Define tokens using regular expressions
    Convert RE's to ND finite state machine
    Convert ND FSM to deterministic FSM
    Program the FSM
  Use an automated program
    To achieve above three steps

Ad Hoc Code (forget FSM's)

Write normal hand code
  A procedure called Scan
  Normal coding techniques
Basically scan over white space and comments till non-blank character found.
Base subsequent processing on character
  E.g. colon may be : or :=
  / may be operator or start of comment
Return token found
Write aggressive efficient code
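
A hand-coded Scan along these lines might look like the following C sketch. It handles only a handful of tokens, and it assumes Ada-style comments, so that it is "-" rather than "/" that may be either an operator or the start of a comment. The token names and the calling convention (a pointer to the current position) are choices made for the example.

```c
#include <assert.h>
#include <ctype.h>

typedef enum { T_EOF, T_COLON, T_ASSIGN, T_MINUS, T_IDENT, T_INTEGER, T_ERROR } Tok;

/* Scan one token starting at *pp and advance *pp past it. */
Tok scan(const char **pp)
{
    const char *p = *pp;
    Tok t;

    assert(p != 0);

    /* Skip white space and "--" comments until something else is found. */
    for (;;) {
        while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r')
            p++;
        if (p[0] == '-' && p[1] == '-') {
            while (*p != '\0' && *p != '\n')
                p++;
        } else {
            break;
        }
    }

    /* Base all further processing on the first character. */
    if (*p == '\0') {
        t = T_EOF;
    } else if (*p == ':') {                 /* : or := (one character of lookahead) */
        p++;
        if (*p == '=') { p++; t = T_ASSIGN; } else { t = T_COLON; }
    } else if (*p == '-') {                 /* not a comment, so an operator */
        p++;
        t = T_MINUS;
    } else if (isalpha((unsigned char)*p)) {
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
        t = T_IDENT;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p))
            p++;
        t = T_INTEGER;
    } else {
        p++;                                /* skip the offending character */
        t = T_ERROR;
    }

    *pp = p;
    return t;
}
```

A parser would call scan repeatedly with the same position pointer until it returns T_EOF. A production scanner would also record the token's location and text.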

Using FSM Formalisms

Start with regular grammar or RE
  Typically found in the language standard
For example, for Ada:
  Chapter 2. Lexical Elements
    digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    decimal-literal ::= integer [.integer] [exponent]
    integer ::= digit {[underline] digit}
    exponent ::= E [+] integer | E - integer

Using FSM formalisms, cont

Given RE's or grammar
  Convert to finite state machine
  Convert ND FSM to deterministic FSM
Write a program to recognize
  Using the deterministic FSM

Implementing FSM (Method 1)

Each state is code of the form:

   <<state1>>
      case Next_Character is
         when 'a' => goto state3;
         when 'b' => goto state1;
         when others => End_of_token_processing;
      end case;
   <<state2>>
      …


Implementing FSM (Method 2)

There is a variable called State

   loop
      case State is
         when state1 =>
            case Next_Character is
               when 'a' => State := state3;
               when 'b' => State := state1;
               when others => End_of_token_processing;
            end case;
         when state2 => …
         …
      end case;
   end loop;


Implementing FSM (Method 3)

   T : array (State, Character) of State;

   while More_Input loop
      Curstate := T (Curstate, Next_Char);
      if Curstate = Error_State then …
   end loop;
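
The table-driven loop can be written out in C as a sketch like the one below, here for the deterministic machine of [0-9]+(\.[0-9]+)?. The state names, the 256-column table and the function dfa_accepts are assumptions made for the example. A real scanner would return the longest token found rather than accept or reject a whole string.

```c
#include <assert.h>
#include <string.h>

enum { ST_START, ST_INT, ST_DOT, ST_FRAC, ST_ERROR, NUM_STATES };

static unsigned char T[NUM_STATES][256];   /* T[state][character] gives next state */
static int table_built = 0;

static void build_table(void)
{
    /* Every arc that is not set below leads to the error state. */
    memset(T, ST_ERROR, sizeof T);
    for (int c = '0'; c <= '9'; c++) {
        T[ST_START][c] = ST_INT;
        T[ST_INT][c]   = ST_INT;
        T[ST_DOT][c]   = ST_FRAC;
        T[ST_FRAC][c]  = ST_FRAC;
    }
    T[ST_INT]['.'] = ST_DOT;
    table_built = 1;
}

/* Return 1 if the whole string s matches [0-9]+(\.[0-9]+)? */
int dfa_accepts(const char *s)
{
    int state = ST_START;

    assert(s != 0);
    if (!table_built)
        build_table();
    for (; *s != '\0'; s++) {
        state = T[state][(unsigned char)*s];   /* one table lookup per character */
        if (state == ST_ERROR)
            return 0;
    }
    return state == ST_INT || state == ST_FRAC;
}
```

The inner loop does a single indexed load per character, and changing the token language only means changing the table contents, which is why generated scanners favor this form.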

Automatic FSM Generation

Our example, FLEX
  See home page for manual in HTML
FLEX is given
  A set of regular expressions
  Actions associated with each RE
It builds a scanner
  Which matches RE's and executes actions

Flex General Format

Input to Flex is a set of rules:
  Regexp actions (C statements)
  Regexp actions (C statements)
  …
Flex scans the longest matching Regexp
  And executes the corresponding actions

An Example of a Flex scanner

DIGIT    [0-9]
ID       [a-z][a-z0-9]*
%%
{DIGIT}+ {
    printf ("an integer %s (%d)\n", yytext, atoi (yytext));
}

{DIGIT}+"."{DIGIT}* {
    printf ("a float %s (%g)\n", yytext, atof (yytext));
}

if|then|begin|end|procedure|function {
    printf ("a keyword: %s\n", yytext);
}

Flex Example (continued)

{ID}    printf ("an identifier %s\n", yytext);

"+"|"-"|"*"|"/" {
    printf ("an operator %s\n", yytext);
}

"--".*\n    /* eat Ada style comment */

[ \t\n]+    /* eat white space */

.    printf ("unrecognized character");
%%

Assembling the flex program

%{
#include <stdlib.h> /* for atoi and atof */
%}

<<flex text we gave goes here, ending with its closing %%>>

int main (int argc, char **argv)
{
    if (argc > 1)
        yyin = fopen (argv[1], "r");   /* otherwise yylex reads standard input */
    yylex ();
    return 0;
}

Running flex

flex is a program that is executed
  The input is as we have given
  The output is a C program (the scanner), ready to compile and run
For Ada fans
  Look at aflex (www.adapower.com)
For C++ fans
  flex can run in C++ mode
    Generates appropriate classes

Choice Between Methods?

Hand written scanners
  Typically much faster execution
  And pretty easy to write
  And easier for good error recovery
Flex approach
  Simple to use
  Easy to modify token language

The GNAT Scanner

Hand written (scn.adb/scn.ads)
Basically a call does
  Super quick scan past blanks/comments etc
  Big case statement
    Process based on first character
    Call special routines
      Namet.Get_Name for identifier (hashing)
      Keywords recognized by special hash
      Strings (stringt.ads)
      Integers (uintp.ads)
      Reals (urealp.ads)

More on the GNAT Scanner

Entire source read into memory
  Single contiguous block
  Source location is index into this block
  Different index range for each source file
See sinput.adb/ads for source mgmt
See scans.ads for definitions of tokens

More on GNAT Scanner

Read scn.adb code
  Very easy reading, e.g.

ASSIGNMENT TWO

Write a flex or aflex program
  Recognize tokens of Algol-68s program
  Print out tokens in style of flex example
Extra credit
  Build hash table for identifiers
  Output hash table key

Preprocessors

Some languages allow preprocessing
This is a separate step
  Input is source
  Output is expanded source
Can either be done as separate phase
  Or embedded into the lexical analyzer
  Often done as separate phase
Need to keep track of source locations

Nasty Glitches

Separation of tokens
  Not all languages have clear rules
  FORTRAN has optional spaces
    DO10I=1.6
      identifier operator literal
      DO10I = 1.6
    DO10I=1,6
      Keyword stmt loopvar operator literal punc literal
      DO 10 I = 1 , 6
Modern languages avoid this kind of thing!
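
One plausible way for a scanner to tell these two statements apart is to look ahead for a comma outside parentheses after the "=". The C sketch below shows that idea only. The function is_do_loop is invented for the example, it assumes upper-case fixed-form source, and it ignores complications such as character constants, so it should not be read as how any real Fortran compiler does it.

```c
#include <assert.h>
#include <ctype.h>

/* Blanks are not significant in fixed-form FORTRAN, so skip them freely. */
static const char *skip_blanks(const char *p)
{
    while (*p == ' ')
        p++;
    return p;
}

/* Return 1 if stmt has the shape  DO label var = expr , expr  (a DO loop),
   and 0 otherwise, e.g. for an assignment to a variable named DO10I. */
int is_do_loop(const char *stmt)
{
    const char *p;
    int depth = 0;

    assert(stmt != 0);
    p = skip_blanks(stmt);
    if (*p != 'D') return 0;
    p = skip_blanks(p + 1);
    if (*p != 'O') return 0;
    p = skip_blanks(p + 1);

    if (!isdigit((unsigned char)*p)) return 0;          /* statement label */
    while (isdigit((unsigned char)*p) || *p == ' ')
        p++;

    if (!isalpha((unsigned char)*p)) return 0;          /* loop variable */
    while (isalnum((unsigned char)*p) || *p == ' ')
        p++;

    if (*p != '=') return 0;

    /* Look ahead: a comma outside parentheses means a DO loop. */
    for (p++; *p != '\0'; p++) {
        if (*p == '(')
            depth++;
        else if (*p == ')')
            depth--;
        else if (*p == ',' && depth == 0)
            return 1;
    }
    return 0;
}
```

The point of the sketch is that the scanner cannot classify the first characters of the statement until it has examined the rest of the line, which is exactly the kind of unbounded lookahead that cleaner token rules avoid.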
