Dr. N.K Shrinath - Lex and Yacc

59
N.K. Srinath [email protected] 1 RVCE LEX (LEXical Analyzer Generator) Features: Lex is a program generator designed for lexical processing of character input streams. It accepts a high-level, problem oriented specification for character string matching, and produces a program in a general purpose language which recognizes regular expressions.

Transcript of Dr. N.K Shrinath - Lex and Yacc

Page 1: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 1 RVCE

LEX (LEXical Analyzer Generator)

Features: Lex is a program generator designed for lexical processing of character input streams. It accepts a high-level, problem oriented

specification for character string matching, andproduces a program in a general purposelanguage which recognizes regular

expressions. The regular expressions are specified by the user in the source specifications given to Lex.

Page 2: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 2 RVCE

The Lex written code recognizes these expressions in an input stream and partitions the input stream into strings matching the expressions. Lex is not a complete language, but rather a generator representing a new language feature which can be added to different programming languages, called ``host languages.'' Lex is widely used tool to specify lexical analyzers for a variety of languages. We refer to the tool as the Lex compiler and to its input specification as the Lex language.

Page 3: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 3 RVCE

Creating a Lexical Analyzer with Lex:

1.First a specification of a lexical analyzer is prepared by creating a program lex.l in the lex language.

2. lex.l is run through the lex compiler to produce a C program lex.yy.c.

3. Finally, lex.yy.c is run through the C compiler to produce an object program a.out.

Page 4: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 4 RVCE

LexCompiler

Lex Source Program lex . l

lex.yy.c

C Compiler

a. outlex.yy.c

Page 5: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 5 RVCE

The lexical analyses phase reads the characters in the source program and groups them into a stream of tokensin which each token represents a logically cohesive sequence of characters, such as

an identifier,

a keyword (if, while, etc.)

a punctuation character or

a multi-character operator like : =

Page 6: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 6 RVCE

Lex SourceThe general format of Lex source is:

1 {definitions}2 %%3 {rules}4 %%5 {user subroutines}

Where The line 1 - definitions and the user subroutines areoften omitted.

Page 7: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 7 RVCE

Definition section is bracketed by %{ and %}

This section can include the literal block, definitions, internal table declarations, start conditions and translations.

The information in this section is C code which is copied verbatim into the lexer.

The data definitions contained here can be referenced by code within the rules section.

Definitions

Page 8: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 8 RVCE

The line 2 - %% is absolutely required.The absolute minimum Lex program is thus

%%

The line 3 – rules may be typed.Note: (no definitions, no rules) which translates into a program which copies the input to the output unchanged.

Rules Section

The rules represent the user’s control decisions;

Page 9: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 9 RVCE

The rules section contains pattern lines and C code.

A line that start with whitespace, or material enclosed in “%{“ and “%}” is C code.

A line that starts with anything else is a pattern line.

C code lines are copied verbatim to the generated C file.

If the code is more than one statement or span multiple lines, it must be enclosed in braces { }.

Page 10: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 10 RVCE

The line 4 - The second %% is optional.The line 5 – the user subroutines.

User Subroutines The contents of the user subroutines section is copied verbatim by lex to the C file.

It can contain any valid C code.

It contains support routines.

Page 11: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 11 RVCE

%%

Main()

{

yylex();

printf(“%d, %d, %d\n”, a, b, c);

}

It first calls the lexer’s entry point yylex() and then calls printf() to print the results of this run.

Page 12: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 12 RVCE

Example 1: Write a program to delete from the input all blanks or tabs at the ends of lines.

%%[ \t]+$ ;

%% delimiter to mark the beginning of the rules.

[ \t]+$ indicate a regular expression which matches one or more instances of the characters blank or tab at the end of the line.

Page 13: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 13 RVCE

The brackets indicate the character class made of blank and tab;

the + indicates ‘‘one or more ...’’; and the $ indicates ‘‘end of line’’.

Example 2: Write a program to delete from the input all blanks or tabs at the ends of lines and insert a blant.

%%[ \t]+$ ;[ \t]+ printf(" ");

There are two rules provided. The finite automaton generated

for this source will scan for both rules at once.

Page 14: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 14 RVCE

observing at the termination of the string of blanks or tabs whether or not there is a newline character, and executing the desired rule action. 1. The first rule matches all strings of blanks or tabs at the end of lines, and2. The second rule all remaining strings of blanks

or tabs.Lex programs recognize only regular expressions.Lex generates a deterministic finite automatonfrom the regular expressions in the source .The automaton is interpreted, rather than compiled,in order to save space.

Page 15: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 15 RVCE

LEX Regular ExpressionsA regular expression specifies a set of strings to be matched.

It contains text characters (which match the corresponding characters in the strings being compared ) and operator characters (which specify repetitions, choices, and other features).The letters of the alphabet and the digits are always text characters; thus the regular expression

integer matches the string integer wherever it appears and the expression

a57Dlooks for the string a57D.

Page 16: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 16 RVCE

Operators The operator characters are

" \ [ ] ˆ − ? . + | ( ) $ / { }%< >∗

and if they are to be used as text characters, anescape should be used. The quotation mark operator (") indicates that whatever is contained between a pair of quotes is to be taken as text characters. Thus

xyz"++"matches the string xyz++ when it appears.

Page 17: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 17 RVCE

Note: the expression"xyz++“ == xyz “++“ == xyz\+\+

An operator character may also be turnedinto a text character by preceding it with \ as in

xyz\+\+Another use of the quoting mechanism is to get a blank into an expression;normally, as explained above, blanks or tabs enda rule. Any blank character not contained within[ ] (see below) must be quoted.

Page 18: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 18 RVCE

Several normal C escapes with \ are recognized: \n is newline, \t is tab, and \b is backspace. \ itself, use \\.

Character classes. Classes of characters can be specified using

the operator pair [ ]. [abc] matches a single character, which may be a , b, or c . Within square brackets, most operator meanings are ignored.

Page 19: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 19 RVCE

Special characters:

The − character indicates ranges. For example, [a−z0−9<>_ ]; indicates the character class containing all the lower case letters, the digits, the angle brackets, and underline.

Ranges may be given in either order.

Using − between any pair of characters which are not both upper case letters, both lower case letters, or both

1. “ −”

Page 20: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 20 RVCE

digits is implementation dependent and will get a warning message. (E.g.,[0−z] in ASCII is many more characters than it isin EBCDIC). If it is desired to include the character − in a character class, it should be first or last; thus

[−+0−9]matches all the digits and the two signs.

In character classes, the ˆ operator mustappear as the first character after the left bracket;

2. ^

Page 21: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 21 RVCE

it indicates that the resulting string is to be complemented with respect to the computer character set. Thus

[ˆabc]matches all characters except a, b, or c, includingall special or control characters; or

[ˆa−zA−Z]is any character which is not a letter.

provides the usual escapes within characterclass brackets.

3. \

Page 22: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 22 RVCE

Arbitrary character. To match almost any character, the operator character .

is the class of all characters except newline.

Optional expressions. The operator ? indicates an optional element of an expression. Thus

ab?cmatches either ac or abc .

Page 23: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 23 RVCE

Repeated expressions. Repetitions of classes are indicated by the operators ∗ and +.

a∗is any number of consecutive “a” characters, including no character; while

a+is one or more instances of a. For example,

[a−z]+is all strings of lower case letters and

[A−Za−z][A−Za−z0−9]∗indicates all alphanumeric strings with a leading alphabetic character.

Page 24: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 24 RVCE

Alternation and Grouping. The operator | indicates alternation:

(ab | cd)matches either ab or cd. Note that parentheses are used for grouping, although they are not necessary on the outside level;

ab | cdwould have sufficed. What is the output of the following expression?

(ab | cd+)?(ef)∗Output: abefef , efefef , cdef, or cddd ;

Page 25: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 25 RVCE

Context Sensitivity

If the first character of an expression is ^, the expression will only be matched at the beginning of a line (after a newline character, or at the beginning of the input stream).

This can never conflict with the other meaning of ^, complementation of character classes, since that only applies within the [ ] operators.

Page 26: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 26 RVCE

The latter operator is a special case of the / operator character, which indicates trailing context. The expression

ab/cd

matches the string ab, but only if followed by cd. Start Condition:start conditions. If a rule is only to be executed when the Lex automaton interpreter is in start condition x, the rule should be prefixed by <x>using the angle bracket operator characters.

Page 27: Dr. N.K Shrinath - Lex and Yacc

Repetitions and DefinitionsThe operators { } specify either repetitions (if they enclose numbers) or

definition expansion (if they enclose a name). example1. {digit}looks for a predefined string named digit and inserts it at that point in the expression. 2. a{1,5}looks for 1 to 5 occurrences of a.

N.K. Srinath [email protected] 27 RVCE

Page 28: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 28 RVCE

To Print the Text matched:Lex leaves this text in an external character array named yytext.

Write the Lex statement that prints the text matched as identifier. [a-z]+ printf("%s", yytext);

The C function printf accepts a format argument and data to be printed;

The format “print string'' (% indicating data conversion, and %s indicating string type), and the data are the characters in yytext.

Page 29: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 29 RVCE

The other option is to use ECHO Example: [a-z]+ ECHO;

To find the number of character matched:Lex provides a count for the characters matched by using yyleng. Write a lex statement to count both the number of words and the number of characters in words in the input. [a-zA-Z]+ {words++; chars += yyleng;}

Page 30: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 30 RVCE

Functions:

yymore () : This function can be called to indicate that the next input expression recognized is to be tacked on to the end of this input. Normally, the next input string would overwrite the current entry in yytext.

yyless (n): This function may be called to indicate that not all the characters matched by the currently successful expression are wanted right now. The argument “n” indicates the number of characters in yytext to be retained.

Page 31: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 31 RVCE

yywrap() :

yywrap() is called whenever Lex reaches an end-of-file

If yywrap returns a 1, Lex continues with the normal wrapup on end of input.

It is convenient to arrange for more input to arrive from a new source. In this case, the user should provide a yywrap which arranges for new input and returns 0. This instructs Lex to continue processing. The default yywrap always returns 1.

Page 32: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 32 RVCE

input() : Returns the next input character;output(c): Writes the character c on the output;unput(c): Pushes the character c back onto the input stream to be read later by input().

Lex does not look ahead at all if it does not have to, but every rule ending in + * ? or $ or containing / implies lookahead. Lookahead is also necessary to match an expression that is a prefix of another expression.

yylex(): It is a function in C-program for lexer produced by lex.

Page 33: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 33 RVCE

Write a Lex statement to find the last string matched.

yytext[yyleng-1]

What does the following statement mean? \"[^"]* { if (yytext[yyleng-1] == '\\') yymore(); else ... normal user processing }

Page 34: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 34 RVCE

Let us examine with an example:

Input string: "abc\" “def"

first match the five characters "abc\”; then the call to yymore() will cause the next part of the string, "def, to be tacked on the end.

Note that the final quote terminating the string should be picked up in the code labeled ``normal processing''.

Page 35: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 35 RVCE

What does this function do? =-[a-zA-Z] { printf("Op (=-) ambiguous\n"); yyless(yyleng-1); ... action for =- ... }

Prints a message, returns the letter after the operator to the input stream, and treats the operator as ``=-''.

Page 36: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 36 RVCE

that Lex is turning the rules into a program. Any source not intercepted by Lex is copied into the generated program. There are three classes of such things.

1. Any line which is not part of a Lex rule or action which begins with a blank or tab is copied into the Lex generated program. Such source input prior to the first %% delimiter will be external to any function in the code; if it appears immediately after the first %%, it appears in an appropriate place for declarations in the function written by Lex which contains the actions.

Page 37: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 37 RVCE

2) Anything included between lines containing only %{ and %} is copied out as above. The delimiters are discarded. This format permits entering text like preprocessor statements that must begin in column 1, or copying lines that do not look like programs.

3) Anything after the third %% delimiter, regardless of formats, etc., is copied out after the Lex output.

Page 38: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 38 RVCE

Any line in this section not contained between %{ and %}, and beginning in column 1, is assumed to define Lex substitution strings.

The format of such lines is name translation and it causes the string given as a translation to be associated with the name.

The name and translation must be separated by at least one blank or tab, and the name must begin with a letter.

The translation can then be called out by the {name} syntax in a rule.

Page 39: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 39 RVCE

Summary of Source Format The general form of a Lex source file is: {definitions} %%

{rules} %% {user subroutines} The definitions section contains a combination of 1) Definitions, in the form ``name space translation''. 2) Included code, in the form ``space code''. 3) Included code, in the form %{ code %}

Page 40: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 40 RVCE

4) Start conditions, given in the form %S name1 name2 ... 5) Character set tables, in the form %T

number space character-string ...

%T 6) Changes to internal array sizes, in the form %x nnn where nnn is a decimal integer representing an array size and x selects the parameter as follows:

Letter Parameter

p positions

n states

Page 41: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 41 RVCE

e tree nodes

a transitions

k packed character classes o output array size

Lines in the rules section have the form ``expression action'' where the action may be continued on succeeding lines by using braces to delimit it.

Page 42: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 42 RVCE

Regular expressions in Lex use the following operators:

x the character "x"

"x" an "x", even if x is an operator.

\x an "x", even if x is an operator.

[xy] the character x or y.

[x-z] the characters x, y or z.

[^x] any character but x.

. any character but newline.

^x an x at the beginning of a line.

<y>x an x when Lex is in start condition y.

Page 43: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 43 RVCE

x$ an x at the end of a line.

x? an optional x.

x* 0,1,2, ... instances of x.

x+ 1,2,3, ... instances of x.

x|y an x or a y.

(x) an x.

x/y an x but only if followed by y.

{xx} the translation of xx from the definitions section.

x{m,n} m through n occurrences of x

Page 44: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 44 RVCE

Examples of Regular Expressions:

[0-9] a digit

[0-9]+ An integer

[0-9]* no digit or integer

-?[0-9]+ Optional negative sign with an integer.

[0-9]*\.[0-9]+ pattern such as 0.0, 4.5, or .3154 matches. The “\” before the period is to make it a literal period rather than a wild card character. This does not match an integer.

Page 45: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 45 RVCE

([0-9]+) | ([0-9]*\.[0-9]+)

This is using grouping symbols “()”to specify what the regular expressions are for the “|” operation.

-?(([0-9]+) | ([0-9]*\.[0-9]+))

This indicates the above number with unary minus.

[eE] [-+]?[0-9]+

Regular expression for an exponent.

Page 46: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 46 RVCE

Lex specification for decimal number.

%%

[\n\t ] ;

-?(([0-9]+) | ([0-9]*\.[0-9]+) ([eE] [-+]?[0-9]+)?)

{ printf(“number\n”); }

. ECHO;

%%

{ yylex(); }

Page 47: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 47 RVCE

Problems

1.Write a lex program to find the number of positive integer and negative integer.

Solution:%{

int posnum = 0;int negnum = 0;

%}

Page 48: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 48 RVCE

%% [\n\t ]; ([0-9]+) {posnum++;} -?([0-9]+) {negnum++; }. ECHO; %% main() { yylex(); printf("Number of positive no. = %d\n", posnum); printf("number of negative no. = %d\n", negnum);}

Line 1 \n and \t are ignored.

Line 2 With at least one digit available which is positive, posnum is incremented.

Line 3 - is optional, since all the positive numbers are already accepted by line 2, only –ve numbers are accepted by this line.

Line 4 All the other characters are echoed on to the screen.

YYlex() call the lex program. The rest is the C-program.

Page 49: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 49 RVCE

2. Write a lex program to find the number of valid identifiers.%{ int count=0;%}%%(" int ")|(" float ")|(" double ")|(" char ") {

int ch; ch = input();for(;;)

{ if (ch==',') {count++;} else

Count is declared as integer and initialized as zero.

int, float, double or char are the different identifiers which are consider valid identifiers.

The data is read from the keyboard and loaded into variable ch.

The infinite for loop will check for a comma and increments a counter.

Page 50: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 50 RVCE

if( ch == ';‘ ) {break;}ch = input();

} count++;}%%main(int argc,char *argv[]){yyin=fopen(argv[1],"r");yylex();printf("the no of identifiers used is %d\n",count);}

If the character is a “;” then it breaks else it reads the next input.

The counter is incremented to the next value for the last identifier.

The input parameter is a file name. yyin will read the file in the read mode and passed to the lex.

Page 51: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 51 RVCE

3.Write a lex program to find the given sentence is simple or compound.%{ int flag=0;%}%%(" "[aA][nN][dD]" ")|(" "[oO][rR]" ")|(" "[bB][uU][tT]" ") flag=1;. ;%%main(){ yylex(); if (flag==1)

printf("COMPOUND SENTENCE \n");

Page 52: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 52 RVCE

elseprintf("SIMPLE SENTENCE \n");

}

4. Write a lex program to find the number of valid identifiers.

%{int count=0;%}%% (" int ")|(" float ")|(" double ")|(" char ")

Page 53: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 53 RVCE

{ int ch; ch = input(); for(;;) { if (ch==',') {count++;} else if(ch==';') {break;}

ch = input(); } count++;}%%

Page 54: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 54 RVCE

main(int argc,char *argv[ ]){ yyin=fopen(argv[1],"r"); yylex(); printf("the no of identifiers used is %d\n",count);}

5. Write a lex program to find the number of vowels and consonants.%{ int vowels = 0; int consonents = 0;%}

Page 55: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 55 RVCE

%% [ \t\n]+ [aeiouAEIOU] vowels++; [bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ] consonents++; . %%main(){ yylex(); printf(" The number of vowels = %d\n", vowels); printf(" number of consonents = %d \n", consonents); return(0);}

Page 56: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 56 RVCE

6. Write a Lex program to count the number of words, characters, blanks and lines in a given text.%{ int charcount=0; int wordcount=0; int linecount=0; int blankcount =0;%}word[^ \t\n]+eol \n%%[ ] blankcount++;{word} { wordcount++; charcount+=yyleng;}

Page 57: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 57 RVCE

{eol} {charcount++; linecount++;}. { ECHO; charcount++;}%%main(argc, argv)int argc;char **argv;{ if(argc > 1) {

FILE *file;file = fopen(argv[1],"r");if(!file)

Page 58: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 58 RVCE

{ fprintf(stderr, "could not open %s\n", argv[1]); exit(1); } yyin = file; yylex(); printf("\nThe number of characters = %u\n", charcount); printf("The number of wordcount = %u\n", wordcount); printf("The number of linecount = %u\n", linecount); printf("The number of blankcount = %u\n", blankcount); return(0); } else printf(" Enter the file name along with the program \n");}

Page 59: Dr. N.K Shrinath - Lex and Yacc

N.K. Srinath [email protected] 59 RVCE

Questions

?