Page 1:

Scanning & Parsing with Lex and YACC

Hans-Arno Jacobsen

ECE 297

Can we generate code to support mundane coding tasks and save time?

Powerful, but not easy

Give you an example for Milestone 1.

• Submissions: 99
• Average for A2: 71%
• Early submission bonus: 1
• Full marks: 5
• 16 teams attempted nonce bonus
• 7 got full marks
• 7 teams attempted ACC bonus
• 7 got full marks

Page 2:

CoursePeer – try it out!

• Developed by a former ECE297 student
  – Many of the videos under tips & tricks are from him too
• Short video about CoursePeer
• To sign up and auto-enrol under ECE297, use this link
  – http://www.crspr.com/?rid=339

• Will have a quick demo and use it on Wednesday for our Q&A session

Page 3:

Know your tools!

• Can we generate code based on a specification of what we want?

• Is the specification simpler than writing a program for doing the same task?

• Fully automated program generation has been a dream since the early days of computing.

Page 4:

Where do we need parsing in the storage server?

Page 5:

Where do we need parsing in the storage server?

• Configuration file (file)
• Bulk loading of data files (file)
• Protocol messages (network)
• Command line arguments (string)

Page 6:

Parsing

• default.conf – the way the disk may see it

server_host localhost \n server_port 1111 \n table marks \n # This data directory may be an absolute or relative path. \n data_directory ./data \n\n\n \EOF

The same file, line by line:

server_host localhost
server_port 1111
table marks
data_directory ./data

Tokens:

PROPERTY VALUE
PROPERTY VALUE
(TABLE TABLE-NAME)+
PROPERTY VALUE

Page 7:

Scenarios
Where would we like to save time by writing a quick language processor?

Conceptually speaking
• Languages
  – Data description language
  – Script language
  – Markup language
• System configurations
• Workload generation

In our storage servers
• Languages
  – Data schema & data
  – Query language
  – Output formatting (Web, LaTeX, PDF, Word, Excel)
• Storage server configuration
• Benchmarking

Page 8:

Parser generation from 30K feet

[Diagram: Specification (written by developer) → Generator → Generated code. Generated code plus other code (written by developer) → Compiler / Linker → Executable.]

Page 9:

Scanning & parsing I

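Input characters:
server_host localhost \n server_port 1111 \n table marks \n # This data

Scanning turns the characters into a token stream:
PROPERTY VALUE PROPERTY VALUE …

Parsing checks the token stream against the expected structure:
PROPERTY VALUE
PROPERTY VALUE
(TABLE TABLE-NAME)+
PROPERTY VALUE

Processing: verify content, add to data structures, …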
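
To make the scanning step concrete, here is a minimal hand-written sketch in C of the work a generated scanner does for this configuration format. It is not flex output. The `enum token` type and the function `next_token` are hypothetical names introduced for illustration, and treating every non-keyword word as a VALUE is a simplifying assumption.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical token kinds, named after the token stream on the slide. */
enum token { TOK_EOF, TOK_PROPERTY, TOK_TABLE, TOK_VALUE };

/* Read the next token from *input, copy its text into buf (bufsize must be
 * at least 1), and advance *input past it. Whitespace and '#' comments are
 * skipped, just as the scanner discards them. */
enum token next_token(const char **input, char *buf, size_t bufsize)
{
    const char *p = *input;
    size_t n = 0;

    for (;;) {
        while (isspace((unsigned char)*p))
            p++;
        if (*p != '#')
            break;
        while (*p != '\0' && *p != '\n')   /* skip comment to end of line */
            p++;
    }
    if (*p == '\0') {
        *input = p;
        return TOK_EOF;
    }

    /* Collect one whitespace-delimited word, truncating if buf is small. */
    while (*p != '\0' && !isspace((unsigned char)*p)) {
        if (n + 1 < bufsize)
            buf[n++] = *p;
        p++;
    }
    buf[n] = '\0';
    *input = p;

    if (strcmp(buf, "table") == 0)
        return TOK_TABLE;
    if (strcmp(buf, "server_host") == 0 || strcmp(buf, "server_port") == 0 ||
        strcmp(buf, "data_directory") == 0)
        return TOK_PROPERTY;
    return TOK_VALUE;
}
```

Calling `next_token` repeatedly on the text of default.conf yields the PROPERTY VALUE … sequence that the parsing step consumes.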

Page 10:

Regular expressions

• (TABLE TABLE-NAME)+
  – TABLE TABLE-NAME
  – TABLE TABLE-NAME TABLE TABLE-NAME
  – …

• Regular expressions (formal languages)

• Extended regular expressions (UNIX)

Patterns
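
The "one or more" operator can be tried out directly with the POSIX regular expression functions from `<regex.h>`. The sketch below is an illustration only: the slide's expression ranges over tokens, whereas this example works on characters, and the helper name `matches` and the concrete pattern in the usage note are assumptions.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if text matches the extended regular expression pattern,
 * 0 if it does not, and -1 if the pattern fails to compile.
 * Anchor the pattern with ^ and $ to require a match of the whole text. */
int matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, `matches("^(table [A-Za-z]+ ?)+$", "table MyCourses table MyMarks")` accepts two table declarations in a row, while the same pattern rejects the lone word `table`.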

Page 11:

Scanning & parsing II

• Parsing is really two steps
  – Scanning (a.k.a. tokenizing or lexical analysis)
  – Parsing, i.e., analysis of structure and syntax according to a grammar (i.e., a set of rules)
• flex is the scanner generator (open source)
  – Fast Lex for lexical analysis
• YACC is the parser generator
  – Yet Another Compiler Compiler for structural and syntax analysis
• Lex and YACC work together
  – The generated parser drives the generated scanner: yyparse() calls yylex() whenever it needs the next token
• We use flex (fast Lex) and Bison (GNU YACC)
• There are many other tools for Java, C++, …, some of which combine Lex/Yacc into one tool (e.g., javacc)

Page 12:

Objectives for today

• Cover the basics of Lex & Yacc

• Everybody should have an appreciation of the potential of these tools

• There is a lot more detail that remains unsaid

• To challenge you

Page 13:

Lex & YACC overview

[Diagram: input stream → Lexical Analyzer → token stream → Structural Analyzer → output]

Output is defined by the actions in the parser specification (often an in-memory representation of the input).

Input stream:
server_host localhost \n server_port 1111 \n table marks \n # This data directory may be an absolute or relative path. \n data_directory ./data \n\n\n \EOF

Token stream:
PROPERTY VALUE PROPERTY VALUE

Page 14:

LEXICAL ANALYSIS WITH LEX

Page 15:

Lex introduction

[Diagram: Input specification (*.l) → flex → lex.yy.c → C compiler → Lexical Analyzer; input stream → Lexical Analyzer → token stream]

• You generate the lexical analyzer by using flex
• flex is fast Lex
• You can control the name of the generated file
• Synonyms: lexical analyzer, scanner, lexer, tokenizer

Page 16:

Lex

• Input specification for lex – the “program”
  – Three parts: Definitions, Rules, User code
  – Use “%%” as a delimiter for each part
• First part: Definitions
  – Options used by flex inside the scanner
  – Defines variables & macros
  – Code within “%{” and “%}” is copied directly into the scanner (e.g., global variables, header files)
• Second part: Rules
  – Patterns and corresponding actions
    • An action is executed when its corresponding pattern matches
  – Patterns are defined by regular expressions

Page 17:

Parsing the configuration file of Milestone 1

config_parser.l

%{
#include "config_parser.tab.h"
...
%}
a2Z   [a-zA-Z]
host  server_host
port  server_port
dir   data_directory

%%

{host}    { return HOST_PROPERTY; }
{port}    { return PORT_PROPERTY; }
table     { return TABLE; }
{dir}     { return DDIR_PROPERTY; }
[\t\n ]+  { }
#.*\n     { }
{a2Z}*    { yylval.sval = strdup(yytext); return STRING; }
[0-9]+    { yylval.pval = (int) atoi(yytext); return PORT_NUMBER; }
.         { return yytext[0]; }
…

• The names defined before “%%” are shorthands for use below
• In each rule, the left column is the pattern and the braces hold the action

Page 18:

flex pattern matching principles

• Actions are executed when patterns match
  – Tokens are returned to caller; next pattern …
• Patterns match a given input character or string only once
  – Input stream is consumed
• flex executes the action for the longest possible matching input
  – Order of patterns in the spec. is important: when two patterns match input of the same length, the pattern listed first wins
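
The longest-match rule and its tie-break can be sketched in plain C. This is a toy model with two hand-coded "rules", not the table-driven automaton that flex actually generates. The function names and the choice of rules (the keyword `table` listed first, a run of letters listed second) are assumptions made for illustration.

```c
#include <ctype.h>
#include <string.h>

/* Each hypothetical rule reports how many leading characters of s it matches. */
static size_t match_keyword_table(const char *s)
{
    return strncmp(s, "table", 5) == 0 ? 5 : 0;
}

static size_t match_letters(const char *s)
{
    size_t n = 0;
    while (isalpha((unsigned char)s[n]))
        n++;
    return n;
}

/* Rules in specification order: index 0 is listed first. */
enum { RULE_COUNT = 2 };
static size_t (*const rules[RULE_COUNT])(const char *) = {
    match_keyword_table, match_letters
};

/* Return the index of the rule that fires at s, or -1 if none matches.
 * The longest match wins; on a tie the earlier rule wins. */
int select_rule(const char *s, size_t *length)
{
    int best = -1;
    size_t best_len = 0;

    for (int i = 0; i < RULE_COUNT; i++) {
        size_t len = rules[i](s);
        if (len > best_len) {   /* strictly longer, so earlier rules keep ties */
            best = i;
            best_len = len;
        }
    }
    *length = best_len;
    return best;
}
```

On the input `table marks` both rules match five characters, so the keyword rule (listed first) fires. On `tables` the letters rule matches six characters and wins despite being listed second.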

Page 19:

flex regular expressions by example I
(Really: extended regular expressions)

`x'         match the character 'x'
`.'         any character (byte) except newline
`[xyz]'     match either an 'x', a 'y', or a 'z'
`[abj-oZ]'  match an 'a', a 'b', any letter from 'j' through 'o', or a 'Z'
`[^A-Z]'    a "negated character class", i.e., any character EXCEPT those in the class
`[^A-Z\n]'  any character EXCEPT an uppercase letter or a newline

Page 20:

flex regular expression by example II

In the following, r is any regular expression.

`r*'       zero or more r's
`r+'       one or more r's
`r?'       zero or one r (that is, “an optional r”)
`r{2,5}'   anywhere from two to five r's
`r{2,}'    two or more r's
`r{4}'     exactly 4 r's
`<<EOF>>'  an end-of-file

Page 21:

flex regular expressions

• There are many more expressions, see manual

• Form complex expressions
  – E.g.: IP address, names, …

• The expression syntax is used in other tools as well (well worth learning)

Page 22:

Parsing the configuration file of Milestone 1

config_parser.l

%{
#include "config_parser.tab.h"
...
%}
a2Z   [a-zA-Z]
host  server_host
port  server_port
dir   data_directory

%%

{host}    { return HOST_PROPERTY; }
{port}    { return PORT_PROPERTY; }
table     { return TABLE; }
{dir}     { return DDIR_PROPERTY; }
[\t\n ]+  { }
#.*\n     { }
{a2Z}*    { yylval.sval = strdup(yytext); return STRING; }
[0-9]+    { yylval.pval = (int) atoi(yytext); return PORT_NUMBER; }
.         { return yytext[0]; }
<<EOF>>   { return 0; }

Callout on yylval: user-defined variable in YACC (conveys token value to YACC)

Example input:

server_host localhost
server_port 1111
table marks
data_directory ./data

Page 23:

PARSING WITH YACC

Page 24:

YACC introduction

[Diagram: Input specification (*.y) → YACC → y.tab.c → C compiler → Syntax analyzer / parser; token stream (e.g., via flex) → Syntax analyzer / parser → output defined by actions in the parser specification]

• From the specified grammar, YACC generates a parser which recognizes “sentences” according to the grammar
• You can control the name of the generated file

Page 25:

YACC

• Input specification for YACC (similar to flex)
  – Three parts: Definitions, Rules, User code
  – Use “%%” as a delimiter for each part
• First part: Definitions
  – Definition of tokens for the second part and for use by flex
  – Definition of variables for use by the parser code
• Second part: Rules
  – Grammar for the parser
• Third part: User code
  – The code in this part is copied into the parser generated by YACC

Page 26:

Configuration file parser Milestone 1

Definition section (config_parser.y)

%{
#include <stdio.h>
#include <stdlib.h>   /* malloc, used in the rule actions */
#include <string.h>

struct table *tl, *t;
struct configuration *c;

/* define a linked list of table names */
struct table {
    char *table_name;
    struct table *next;
};

/* define a structure for the configuration information */
struct configuration {
    char *host;
    int port;
    struct table *tlist;
    char *data_dir;
};

Page 27:

Configuration file parser Milestone 1

Definition section cont’d. (config_parser.y)

%}
%union {
    char *sval;   // String value (user defined)
    int pval;     // Port number value (user defined)
}
%token <sval> STRING
%token <pval> PORT_NUMBER
%token HOST_PROPERTY PORT_PROPERTY DDIR_PROPERTY TABLE

%%
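
The `%union` declaration becomes an ordinary C union that both the scanner and the parser see, which is how `yylval.sval` and `yylval.pval` carry token values across. The following is a rough sketch of that idea outside of Bison. The real type and token codes live in the generated config_parser.tab.h; the numeric codes and the helper `fill_token_value` below are assumptions for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of the C type derived from the %union declaration. */
typedef union {
    char *sval;   /* value of a STRING token */
    int pval;     /* value of a PORT_NUMBER token */
} YYSTYPE;

/* Token codes; the numeric values are placeholders, Bison assigns its own. */
enum { STRING = 258, PORT_NUMBER = 259 };

/* Hypothetical helper that mimics the two value-carrying scanner actions:
 * all-digit text becomes a PORT_NUMBER, anything else a STRING.
 * Returns the token code, or 0 if memory allocation fails. */
int fill_token_value(const char *text, YYSTYPE *lval)
{
    size_t len = strlen(text);

    if (len > 0 && strspn(text, "0123456789") == len) {
        lval->pval = atoi(text);        /* like yylval.pval = atoi(yytext) */
        return PORT_NUMBER;
    }
    lval->sval = malloc(len + 1);       /* like yylval.sval = strdup(yytext) */
    if (lval->sval == NULL)
        return 0;
    memcpy(lval->sval, text, len + 1);
    return STRING;
}
```

Only one member of the union is valid at a time. The token code returned alongside it tells the parser which member to read, which is what the `<sval>` and `<pval>` tags on the `%token` lines express.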

Page 28:

Configuration file parser Milestone 1

(Grammar) Rules section (simplified), config_parser.y

property_list:
    HOST_PROPERTY STRING
    PORT_PROPERTY PORT_NUMBER
    table_list
    data_directory
    ;

table_list:
    table_list TABLE STRING
    | TABLE STRING
    ;

data_directory:
    DDIR_PROPERTY STRING
    ;

%%

Page 29:

(Grammar) Rules section (details), config_parser.y

data_directory:
    DDIR_PROPERTY STRING      /* $1 is DDIR_PROPERTY, $2 is STRING */
    {
        c = (struct configuration *) malloc(sizeof(struct configuration));
        // Check c for NULL
        c->data_dir = strdup( $2 );
    }
    ;

struct configuration { char *host; int port; struct table *tlist; char *data_dir; };
struct configuration *c;

Page 30:

(Grammar) Rules section (details), config_parser.y

property_list:
    HOST_PROPERTY STRING PORT_PROPERTY PORT_NUMBER table_list data_directory
    {
        c->host = strdup( $2 );
        c->port = $4;
        c->tlist = tl;
    }
    ;

struct configuration { char *host; int port; struct table *tlist; char *data_dir; };
struct configuration *c;

Page 31:

Configuration file parser Milestone 1

(Grammar) Rules section (simplified), config_parser.y

property_list:
    HOST_PROPERTY STRING
    PORT_PROPERTY PORT_NUMBER
    table_list
    data_directory
    ;

table_list:
    table_list TABLE STRING
    | TABLE STRING
    ;

data_directory:
    DDIR_PROPERTY STRING
    ;

%%

Callout on table_list: … TABLE STRING TABLE STRING

Page 32:

table_list is a recursive rule

• Example table specification in configuration file
    table MyCourses
    table MyMarks
    table MyFriends
• table_list: table_list TABLE STRING | TABLE STRING ;
• Terminology
  – table_list is called a non-terminal
  – TABLE & STRING are terminals

Page 33:

Recursive rule execution

table_list: table_list TABLE STRING | TABLE STRING ;

Input:
table MyCourses
table MyMarks
table MyFriends

Expanding table_list step by step:
table_list TABLE STRING
table_list TABLE STRING TABLE STRING
TABLE STRING TABLE STRING TABLE STRING

[Diagram: boxes showing the tables recognized after each step: first "table MyCourses", then "table MyMarks" and "table MyCourses", then all three tables.]

Page 34:

config_parser.y

table_list:
    table_list TABLE STRING      /* $1 $2 $3 */
    {
        t = (struct table *) malloc(sizeof(struct table));
        t->table_name = strdup( $3 );
        t->next = tl;
        tl = t;
    }
    | TABLE STRING               /* $1 $2 */
    {
        tl = (struct table *) malloc(sizeof(struct table));
        tl->table_name = strdup( $2 );
        tl->next = NULL;
    }
    ;

struct table { char *table_name; struct table *next; };
struct table *tl, *t;

[Diagram: the first TABLE STRING creates the list head tl with tl->next = NULL; each further table is allocated as t, linked in front of the list with t->next = tl, and then becomes the new head tl.]
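
The two actions above amount to prepending to a singly linked list. A stand-alone sketch of that list handling, outside of any grammar, might look as follows. The helper names `prepend_table` and `free_tables` are hypothetical, and the NULL checks are an addition that the slide code leaves out.

```c
#include <stdlib.h>
#include <string.h>

struct table {
    char *table_name;
    struct table *next;
};

/* Prepend a copy of name to the list, as the table_list actions do.
 * Returns the new head, or NULL if allocation fails (the old list is
 * left untouched in that case). */
struct table *prepend_table(struct table *head, const char *name)
{
    struct table *t = malloc(sizeof *t);

    if (t == NULL)
        return NULL;
    t->table_name = malloc(strlen(name) + 1);
    if (t->table_name == NULL) {
        free(t);
        return NULL;
    }
    strcpy(t->table_name, name);
    t->next = head;              /* same as t->next = tl; tl = t; */
    return t;
}

/* Release every node and its name. */
void free_tables(struct table *head)
{
    while (head != NULL) {
        struct table *next = head->next;
        free(head->table_name);
        free(head);
        head = next;
    }
}
```

Because each new table is linked in front of the head, a file that lists MyCourses, MyMarks, MyFriends ends up as the list MyFriends, MyMarks, MyCourses. Code that depends on declaration order has to walk the list accordingly.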

Page 35:

How to invoke the parser

int main (int argc, char **argv)
{
    FILE *f;
    extern FILE *yyin;
    if (argc == 2) {
        f = fopen(argv[1], "r");
        if (!f) {
            … // error handling …
        }
        yyin = f;
        while (!feof(yyin)) {
            if (yyparse() != 0) {
                …
                yyerror("");
                exit(1);   /* non-zero exit status signals the parse error */
            }
        }
        fclose(f);
    }
    …

• yylex() for calling generated scanner
  – by default called within yyparse()

Page 36:

In the Makefile

lexer: config_parser.l
	${LEX} config_parser.l
	${CC} ${CFLAGS} ${INCLUDE} -c lex.yy.c

yaccer: config_parser.y
	${YACC} -d config_parser.y
	${CC} ${CFLAGS} ${INCLUDE} -c config_parser.tab.c

parser: config_parser.tab.o lex.yy.o
	${CC} ${CFLAGS} ${INCLUDE} -c parser.c
	${CC} -o p ${CFLAGS} ${INCLUDE} lex.yy.o \
		config_parser.tab.o \
		parser.o

Page 37:

Benefits

• Faster development
  – Compared to manual implementation
• Easier to change the specification and generate a new parser
  – Than to modify 1000s of lines of code to add, change, or delete a feature
• Less error-prone, as code is generated
• Cost: Learning curve
  – Invest once, amortized over a 40+ year career

Page 38:

If you want to know more

• Lecture, examples and some recommended reading are enough to tackle all of the parsing for Milestone 3 & 4

• 3rd and 4th year lectures on Compilers may show you the algorithms behind & inside Lex & YACC

• Lectures on Computability and Theory of Computation may also show you these algorithms

Page 39:

Page 40:

A flex specification

The header:

%{
#include <stdio.h>
#include "y.tab.h"
int c;
extern int yylval;
%}

The “guts” (regular expressions annotated with actions):

%%
" "          ;
[a-z]        { c = yytext[0]; yylval = c - 'a'; return(LETTER); }
[0-9]        { c = yytext[0]; yylval = c - '0'; return(DIGIT); }
[^a-z0-9\b]  { c = yytext[0]; return(c); }

Page 41:

The header

%{
#include <stdio.h>
#include "y.tab.h"
int c;
extern int yylval;
%}
%%

• int c: temporary variable(s)
• yylval: special variable
  – declared extern and set in the scanner
  – used in the parser
  – for transferring values associated with tokens to the parser
• %%: dividing line between header and rules section

Page 42:

The rules

%%
" "          ;
[a-z]        { c = yytext[0]; yylval = c - 'a'; return (LETTER); }
[0-9]        { c = yytext[0]; yylval = c - '0'; return (DIGIT); }
[^a-z0-9\b]  { c = yytext[0]; return(c); }

• yytext: the string associated with the token

Page 43:

The rules

%%
" "          ;
[a-z]        { c = yytext[0]; yylval = c - 'a'; return(LETTER); }
[0-9]        { c = yytext[0]; yylval = c - '0'; return(DIGIT); }
[^a-z0-9\n]  { c = yytext[0]; return(c); }

• [a-z] rule: sets yylval to the character’s alphabetical position, counting from zero ('a' gives 0)
• [0-9] rule: sets yylval to the digit’s numerical value
• Last rule: otherwise simply returns that character; presumably it’s an operator: +, *, -, etc.
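
The arithmetic in these actions is easy to check in isolation. The sketch below restates the three rules as one ordinary C function. The name `classify_char` and the numeric token codes are assumptions; in the real scanner the codes come from y.tab.h and the value goes into yylval.

```c
/* Token codes; the numeric values are placeholders chosen for this sketch. */
enum { LETTER = 257, DIGIT = 258 };

/* Hypothetical stand-in for the three rule actions: classify one input
 * character and store the associated value in *value (the role yylval
 * plays in the generated scanner). Returns the token code. For other
 * characters *value is left unchanged. */
int classify_char(int ch, int *value)
{
    if (ch >= 'a' && ch <= 'z') {
        *value = ch - 'a';      /* 'a' gives 0, 'b' gives 1, ... */
        return LETTER;
    }
    if (ch >= '0' && ch <= '9') {
        *value = ch - '0';      /* '7' gives 7 */
        return DIGIT;
    }
    return ch;                  /* operators such as '+' are their own token */
}
```

Returning the character itself as the token code is what lets a grammar refer to operators as literals such as '+' and '-'.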

Page 44:

Simple example

• Implement a calculator which can recognize adding or subtracting of numbers

[linux33]% ./y_calc
1+101
= 102
[linux33]% ./y_calc
1000-300+200+100
= 1000
[linux33]%

Page 45:

Example – the Lex part

Definitions:

%{
#include <math.h>
#include <stdlib.h>   /* atoi */
#include "y.tab.h"
extern int yylval;
%}

Rules (pattern on the left, action on the right):

%%
[0-9]+   { yylval = atoi(yytext); return NUMBER; }
[\t ]+   ; /* Do nothing for white space */
\n       return 0; /* End of the logic */
.        return yytext[0];
%%

Page 46:

Example – the Yacc part

Definitions:

%token NAME NUMBER

Rules:

%%

statement: NAME '=' expression
    | expression                    { printf("= %d\n", $1); }
    ;

expression: expression '+' NUMBER   { $$ = $1 + $3; }
    | expression '-' NUMBER         { $$ = $1 - $3; }
    | NUMBER                        { $$ = $1; }
    ;

Link with the Yacc library (-ly)
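
Because the expression rule is left-recursive, the generated parser combines operands from left to right. A hand-written sketch of the same evaluation, without Lex or Yacc, is shown below for comparison. The function names are hypothetical and the error handling (returning -1) is an assumption, not what the generated parser does.

```c
#include <ctype.h>
#include <stddef.h>

/* Skip blanks and tabs, as the scanner's white-space rule does. */
static const char *skip_blanks(const char *s)
{
    while (*s == ' ' || *s == '\t')
        s++;
    return s;
}

/* Read an unsigned decimal NUMBER; return the position after it,
 * or NULL if s does not start with a digit. */
static const char *read_number(const char *s, int *value)
{
    if (!isdigit((unsigned char)*s))
        return NULL;
    *value = 0;
    while (isdigit((unsigned char)*s))
        *value = *value * 10 + (*s++ - '0');
    return s;
}

/* Evaluate NUMBER (('+' | '-') NUMBER)* from left to right, mirroring the
 * left-recursive expression rule. Returns 0 and stores the result in
 * *result on success; returns -1 on a syntax error. */
int eval_expression(const char *s, int *result)
{
    int acc, rhs;

    s = read_number(skip_blanks(s), &acc);       /* expression: NUMBER */
    if (s == NULL)
        return -1;
    for (;;) {
        char op;

        s = skip_blanks(s);
        if (*s == '\0' || *s == '\n')            /* scanner returns 0 at newline */
            break;
        op = *s++;
        if (op != '+' && op != '-')
            return -1;
        s = read_number(skip_blanks(s), &rhs);
        if (s == NULL)
            return -1;
        acc = (op == '+') ? acc + rhs : acc - rhs;   /* $$ = $1 + $3 or $1 - $3 */
    }
    *result = acc;
    return 0;
}
```

Even for this tiny language the hand-written version is longer than the grammar, and adding a new operator means editing control flow rather than adding one rule.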