Introduction to Lexical Analysis and the Flex Tool. © Allan C. Milne Abertay University v14.6.18.


Introduction to Lexical Analysis

and the Flex Tool.

© Allan C. Milne

Abertay University

v14.6.18

Agenda.

• Lexical analysis & tokens.
• What is Flex?
• Lex program structure.
• Regular expression patterns.
• Examples.

The context.

• BNF defines the syntax of a language.
• A source program is written in that language.
• The compiler processes the source program according to the rules of the BNF.
  – We do not want to do this processing in terms of the individual characters of the program.
• The first step is to identify the lexical elements (tokens) of the source program.
  – Identify the groups of characters that form the tokens: keywords, punctuation, microsyntax.

The lexical analyser.

• Also known as the scanner.

• Its role:
  – to transform an input stream of characters
  – into tokens,
  – and expose these tokens to the rest of the compiler.

Character stream:  if (x==10) …

Output tokens:  "if" "(" "<identifier>" "==" "<integer>" ")" …

What are tokens?

• Tokens are the internal compiler representations of the terminals of a source program as defined by the language BNF.

• Simple terminals:
  – keywords; e.g. begin, for, if, …
  – single-character punctuation; e.g. {, =, …
  – multi-character punctuation; e.g. ==, <=, ->, …

• Microsyntax terminals:
  – defined in the microsyntax of the BNF;
  – e.g. identifiers, literal constants.

Creating a scanner.

• Write a bespoke scanner, usually based around a finite state machine (FSM).

• Use a utility to process a language specification and automatically generate a scanner.

What is Flex?

• Lex was developed in the mid-1970s as a Unix utility that generates a lexical analyser in C.

• Flex ("fast lex") is a free, open-source reimplementation of Lex, commonly distributed alongside the GNU toolchain.

• It generates a scanner in C/C++ from a Lex program that specifies token patterns and their associated actions.

Processing with Flex.

• flex -obase.yy.c base.l
  – base.l : the Lex program defining patterns/actions.
  – base.yy.c : the generated C scanner.

• cl /Febase.exe base.yy.c
  – base.exe : the executable scanner.

• base
  – Executes the scanner against standard input (the keyboard).

• base <file
  – Executes the scanner with input redirected from the named file.

Lex program structure.

%{
 … C declarations …
%}
 … Lex definitions …
%%
 … Lex rules of the form …
pattern   { … C actions … }
%%
 … C functions …

%{
  int allanCount;
%}
%%
[aA]ll?an   { allanCount++; }
.           ;
%%
int yywrap () { return 1; }

int main () {
  yylex ();
  printf ("Input contains %d occurrences of Allan.\n", allanCount);
  return 0;
}

Generated scanner operation.

• Input is matched character by character against the patterns in the rules section.

• The longest pattern match then causes the associated actions to be executed.
  – If no pattern matches, the character is copied to the output.

• The defaults for input and output are stdin and stdout.
  – These can be changed by assigning to the predefined variables
    • FILE *yyin, *yyout

Simplest Lex program.%%

%%

int yywrap () { return 1; }

int main () { yylex(); return 0; }

• Copies the input to the output.• Note that yywrap() and main() are

generated automatically by some Lex implementations.

• yywrap() indicates whether or not wrap-up is complete.– Almost always return 1 here.– Called by Lex when input is exhausted.

• As usual main() is the program entry point.– Calls yylex() to initiate the lexer.

Rules section.

• Each rule is a pattern/action pair:
  – the pattern must start in column 1;
  – followed by whitespace;
  – then an optional C statement or {…} C block.

• Any text not starting in column 1 is copied verbatim to the generated C program;
  – e.g. comments.

Patterns.

• A pattern is a regular expression composed of constant characters and meta-characters.

• Review the Lex meta-character document for a list of the meta-characters and some pattern examples.

• Creating patterns from combinations of meta-characters is the core of writing a Lex program.

Character, word and line counter.

%{
  int lineCount, wordCount, charCount;
%}
%%
\n          lineCount++;
[^ \t\n]+   { wordCount++; charCount += yyleng; }
.           charCount++;
%%
int yywrap () { return 1; }

int main () {
  yylex ();
  printf ("Input contains\n");
  printf ("%d lines\n%d words\n%d chars\n", lineCount, wordCount, charCount);
  return 0;
}

Pattern matching.

• Input is matched character by character against the patterns in the rules section.

• A pattern match causes the associated actions to be executed.
  – If two patterns match then
    • the longest match is used;
    • if the matches are of equal length then the first pattern is used.

• If no pattern matches, the input character is copied to the output.

Pattern definitions.

• Patterns may be given a name in the definitions section; the name can then be used in a rule pattern by enclosing it in {…}.

LETTER   [a-zA-Z]
%{
  int wc;
%}
%%
{LETTER}+   wc++;
.|\n        ;
%%
int yywrap () { return 1; }

int main () {
  yylex ();
  printf ("%d words found.\n", wc);
  return 0;
}