Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler...

36
Lexical Analysis Mooly Sagiv html://www.cs.tau.ac.il/~msagiv/ courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2
  • date post

    21-Dec-2015
  • Category

    Documents

  • view

    213
  • download

    0

Transcript of Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler...

Page 1: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Lexical Analysis

Mooly Sagiv

html://www.cs.tau.ac.il/~msagiv/courses/wcc02.htmlTextbook:Modern Compiler Implementation in C

Chapter 2

Page 2: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

A motivating example• Create a program that counts the number of lines in

a given input text file

Page 3: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Solution

int num_lines = 0;%%\n ++num_lines;. ;%% main() { yylex(); printf( "# of lines = %d\n", num_lines); }

Page 4: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Solution

int num_lines = 0;%%\n ++num_lines;. ;%% main() { yylex(); printf( "# of lines = %d\n", num_lines); }

initial

;

newline

\n

other

Page 5: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Subjects• Roles of lexical analysis

• The straightforward solution a manual scanner for C

• Regular Expressions

• Finite automata

• From regular languages into finite automata

• Flex

Page 6: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Basic Compiler Phases

Source program (string)

Fin. Assembly

lexical analysis

syntax analysis

semantic analysis

Tokens

Abstract syntax tree

Front-End

Back-End

Page 7: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Example Tokens

Type Examples

ID foo n_14 last

NUM 73 00 517 082

REAL 66.1 .5 10. 1e67 5.5e-10

IF if

COMMA ,

NOTEQ !=

LPAREN (

RPAREN )

Page 8: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Example NonTokens

Type Examples

comment /* ignored */

preprocessor directive #include <foo.h>

#define NUMS 5, 6

macro NUMS

whitespace \t \n \b

Page 9: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Example

void match0(char *s) /* find a zero */

{

if (!strncmp(s, “0.0”, 3))

return 0. ;

}

VOID ID(match0) LPAREN CHAR DEREF ID(s)

RPAREN LBRACE IF LPAREN NOT ID(strncmp) LPAREN ID(s) COMMA STRING(0.0) COMMA NUM(3)

RPAREN RPAREN RETURN REAL(0.0) SEMI RBRACE EOF

Page 10: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

• input

– program text (file)

• output

– sequence of tokens

• Read input file

• Identify language keywords and standard identifiers

• Handle include files and macros

• Count line numbers

• Remove whitespaces

• Report illegal symbols

• Produce symbol table

Lexical Analysis (Scanning)

Page 11: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

• Simplifies the syntax analysis– And language definition

• Modularity

• Reusability

• Efficiency

Why Lexical Analysis

Page 12: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Writing Scanners in C…

Page 13: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

A simplified scanner for CToken nextToken(){char c ;loop: c = getchar();switch (c){

case ` `:goto loop ;case `;`: return SemiColumn;case `+`: c = getchar() ;

switch (c) { case `+': return PlusPlus ; case '=’ return PlusEqual; default: putchar(c);

return Plus; } case `<`:case `w`:

}

Page 14: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Automatic Generation of Lexical Analysis

• Every Token can be defined using a regular expression

• Examples:– A regular expression for C identifier– A regular expression for strings

• A minimal finite automaton for Token recognition can be automatically generated

Page 15: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Flex• Input

– regular expressions and actions (C code)

• Output– A scanner program that reads the input and

applies actions when input regular expression is matched

flex

regular expressions

input program tokensscanner

Page 16: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Regular Expression Notations

a An ordinary character stands for itselfM|N M or NMN M followed by NM* Zero or more times of MM+ One or more times of MM? Zero or one occurrence of M[a-zA-Z] Character set alternation (single character). Any (single) character but newline“a.+” Quotation\ Convert an operator into text

Page 17: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Ambiguity Resolving

• Find the longest matching token

• Between two tokens with the same length use the one declared first

Page 18: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

A Flex specification of C Scanner

Letter [a-zA-Z_]Digit [0-9]%%[ \t] {;} [\n] {line_count++;}“;” { return SemiColumn;}“++” { return PlusPlus ;}“+=“ { return PlusEqual ;}“+” { return Plus}“while” { return While ; }{Letter}({Letter}|{Digit})* { return Id ;}“<=” { return LessOrEqual;}“<” { return LessThen ;}

Page 19: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

How to Implement Ambiguity Resolving

• Between two tokens with the same length use the one declared first

• Find the longest matching token

Page 20: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Running Exampleif { return IF; }[a-z][a-z0-9]* { return ID; }[0-9]+ { return NUM; }[0-9]”.”[0-9]*|[0-9]*”.”[0-9]+ { return REAL; }(\-\-[a-z]*\n)|(“ “|\n|\t) { ; }. { error(); }

Page 21: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

int edges[][256] ={ /* …, 0, 1, 2, 3, ..., -, e, f, g, h, i, j, ... *//* state 0 */ {0, ..., 0, 0, …, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0}/* state 1 */ {13, ..., 7, 7, 7, 7, …, 9, 4, 4, 4, 4, 2, 4, ..., 13, 13}/* state 2 */ {0, …, 4, 4, 4, 4, ..., 0, 4, 3, 4, 4, 4, 4, ..., 0, 0}/* state 3 */ {0, …, 4, 4, 4, 4, …, 0, 4, 4, 4, 4, 4, 4, , 0, 0}/* state 4 */ {0, …, 4, 4, 4, 4, ..., 0, 4, 4, 4, 4, 4, 4, ..., 0, 0} /* state 5 */ {0, …, 6, 6, 6, 6, …, 0, 0, 0, 0, 0, 0, 0, …, 0, 0}/* state 6 */ {0, …, 6, 6, 6, 6, …, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0}/* state 7 */

.../* state 13 */ {0, …, 0, 0, 0, 0, …, 0, 0, 0, 0, 0, 0, 0, …, 0, 0}

Page 22: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Pseudo Code for ScannerToken nextToken(){lastFinal = 0; currentState = 1 ;inputPositionAtLastFinal = input; currentPosition = input; while (not(isDead(currentState))) {

nextState = edges[currentState][*currentPosition]; if (isFinal(nextState)) { lastFinal = nextState ; inputPositionAtLastFinal = currentPosition; } currentState = nextState; advance currentPosition;

}input = inputPositionAtLastFinal ;return action[lastFinal]; }

Page 23: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Example

Input: “if --not-a-com”

Page 24: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

final state input

0 1 if --not-a-com

2 2 if --not-a-com

3 3 if --not-a-com

3 0 if --not-a-comreturn IF

Page 25: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

final state input

0 1 --not-a-com

12 12 --not-a-com

12 0 --not-a-com

found whitespace

Page 26: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

final state input

0 1 --not-a-com

9 9 --not-a-com

9 10 --not-a-com

9 10 --not-a-com

9 10 --not-a-com

9 0 --not-a-com

error

Page 27: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

final state input

0 1 -not-a-com

9 9 -not-a-com

9 0 -not-a-com

error

Page 28: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Efficient Scanners

• Efficient state representation

• Input buffering

• Using switch and gotos instead of tables

Page 29: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Constructing Automaton from Specification

• Create a non-deterministic automaton (NDFA) from every regular expression

• Merge all the automata using epsilon moves(like the | construction)

• Construct a deterministic finite automaton (DFA)– State priority

• Minimize the automaton starting with separate accepting states

Page 30: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

NDFA Constructionif { return IF; }[a-z][a-z0-9]* { return ID; }[0-9]+ { return NUM; }[0-9]”.”[0-9]*|[0-9]*”.”[0-9]+ { return REAL; }(\-\-[a-z]*\n)|(“ “|\n|\t) { ; }. { error(); }

Page 31: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

DFA Construction

Page 32: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Minimization

Page 33: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

%{/* C declarations */#include “tokens.h” /* Mapping of tokens into integers */#include “errormsg.h” /* Shared by all the phases */union {int ival; string sval; double fval;} yylval;int charPos=1 ; #define ADJ (EM_tokPos=charPos, charPos+=yyleng)%}/* Lex Definitions */digits [0-9]+%%if { ADJ; return IF;}[a-z][a-z0-9] { ADJ; yylval.sval=String(yytext); return ID; }{digits} {ADJ; yylval.ival=atoi(yytext); return NUM; }({digits}\.{digits}?)|({digits}?\.{digits}) {

ADJ; yylval.fval=atof(yytext); return REAL; }(\-\-[a-z]*\n)|([\n\t]|" ")* { ADJ; }. { ADJ; EM_error(“illegal character''); }

Page 34: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Start States• Regular expressions may be more complicated

than automata– C comments

• Solutions– Conversion of automata into regular expressions– Start States

% start s1 s2%%< INITIAL>r1 { action0 ; BEGIN s1; }<s1>r1 { action1 ; BEGIN s2; }<s2>r2 { action2 ; BEGIN INITIAL};

Page 35: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Realistic Example% start Comment%%<INITIAL>”/*'' { BEGIN Comment; }<INITIAL>r1 { Usual actions; }<INITIAL>r2 { Usual actions; }

...<INITIAL>rk { Usual actions; }<Comment>”*/”’ { BEGIN Initial; }<Comment>.|\n ;

Page 36: Lexical Analysis Mooly Sagiv html://msagiv/courses/wcc02.html Textbook:Modern Compiler Implementation in C Chapter 2.

Summary

• For most programming languages lexical analyzers can be easily constructed

• Exceptions:– Fortran– PL/1

• Flex is a useful tool beyond compilers