Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of...

Lecture 2: Lexical Analysis

2

Lexical Analysis

INPUT: sequence of characters

OUTPUT: sequence of tokens

A lexical analyzer is generally a subroutine of parser: Simpler design Efficient Portable

Input Scanner Parser

SymbolTable

Next_char()

character token

Next_token()

3

Definitions

token – set of strings defining an atomic element with a defined meaning

pattern – a rule describing a set of string lexeme – a sequence of characters that

match some pattern

4

Examples

Token Pattern Sample Lexeme

while while while

relation_op = | != | < | > <

integer (0-9)* 42

string Characters between “ “

“hello”

5

Input string: size := r * 32 + c

<token,lexeme> pairs: <id, size> <assign, :=> <id, r> <arith_symbol, *> <integer, 32> <arith_symbol, +> <id, c>

6

Implementing a Lexical Analyzer

Practical Issues: Input buffering Translating RE into executable form Must be able to capture a large number

of tokens with single machine Interface to parser Tools

7

Capturing Multiple Tokens

Capturing keyword “begin”

Capturing variable names

What if both need to happen at the same time?

b e g i n WS

WS – white spaceA – alphabeticAN – alphanumericA

AN

WS

8

Capturing Multiple Tokens

b e g i n WS

WS – white spaceA – alphabeticAN – alphanumeric

A-b

AN WS

AN

Machine is much more complicated – just for these two tokens!

9

Lex – Lexical Analyzer Generator

Flex/Lex

C/C++ compiler

a.out

Lexspecification

lex.yy.c

input tokens

10

Lex Specification

%{ int charCount=0, wordCount=0, lineCount=0;%}word [^ \t\n]*%%{word} {wordCount++; charCount += yyleng; }[\n] {charCount++; lineCount++;}. {charCount++;}%%main() { yylex(); printf(“Characters %d, Words: %d, Lines: %d\n”,charCount, wordCount, lineCount);}

Definitions – Code, RE

Rules – RE/Action pairs

User Routines

11

Lex definitions section

C/C++ code: Surrounded by %{… %} delimiters Declare any variables used in actions

RE definitions: Define shorthand for patterns: digit [0-9] letter [a-z] ident {letter}({letter}|{digit})* Use shorthand in RE section: {ident}

%{ int charCount=0, wordCount=0, lineCount=0;%}word [^ \t\n]*

12

Lex Regular Expressions

Match explicit character sequences integer, “+++”, \<\>

Character classes [abcd] [a-zA-Z] [^0-9] – matches non-numeric

{word} {wordCount++; charCount += yyleng; }[\n] {charCount++; lineCount++;}. {charCount++;}

13

Alternation twelve | 12

Closure * - zero or more + - one or more ? – zero or one {number}, {number,number}

Lex Regular Expressions(cont.)

14

Other operators . – matches any character except newline ^ - matches beginning of line $ - matches end of line / - trailing context () – grouping {} – RE definitions

Lex Regular Expressions(cont.)

15

Lex Matching Rules

Lex always attempts to match the longest possible string.

If two rules are matched (and match strings are same length), the first rule in the specification is used.

16

Lex Operators

Highest: closure concatenation alternation

Special lex characters: - \ / * + > “ { } . $ ( ) | % [ ] ^Special lex characters inside [ ]: - \ [ ] ^

17

Examples

a.*z (ab)+ [0—9]{1,5} (ab|cd)?ef = abef,cdef,ef -?[0-9]\.[0-9]

18

Lex Actions

Lex actions are C (C++) code to implement some required functionality

Default action is to echo to output Can ignore input (empty action) ECHO – macro that prints out matched

string yytext – matched string yyleng – length of matched string

19

User Subroutines

C/C++ code Copied directly into the lexer code User can supply ‘main’ or use default

main() { yylex(); printf(“Characters %d, Words: %d, Lines: %d\n”,charCount, wordCount, lineCount);}

20

Uses for Lex

Transforming Input – convert input from one form to another (example 1). yylex() is called once; return is not used in specification

Extracting Information – scan the text and return some information (example 2). yylex() is called once; return is not used in specification.

Extracting Tokens – standard use with compiler (example 3). Uses return to give the next token to the caller.

21

•A regular expression is a kind of pattern that can be applied to text (Strings, in Java)•A regular expression either matches the text (or part of the text), or it fails to match.• Regular expressions are an extremely useful tool for manipulating text

– Regular expressions are heavily used in the automatic generation of Web pages

Regular expression

22

• Scan for virus signatures• Process natural languages• Search for information using Google• Search and replace in word processors• Filter text( spam, malware )• Validate data-entry field (dates, email, url)

Pattern matching applications:

23

Basic Operation

• Notation to specify a set of strings

24

Regular expression : examples

• Notation is surprisingly expressive.

Regular Expression Yes No

a* | (a*ba*ba*ba*)*multiple of 3 b's

εbbbaaa

abbbaaabbbaababba

a

bbb

abbaaaabaabbba

a

a | a(a|b)*abegins and ends with a

aabaaa

abbaabba

εabba

(a|b)*abba(a|b)*contains the substring

abba

abbabbabbabbabbaabba

εabb

bbaaba

25

Using Regular expression

• Built in to Java, Perl , PHP, Unix, .NET,…. • Additional operations typically aded for

convenience.• Ex. [a-e]+ is shorthand for (a|b|c|d|e) (a|b|c|d|e)*

Operation Regular Expression Yes No

Concatenation hello helloOthello

say helloHello

Any single character ..oo..oo.bloodrootspoonfood

cookbookchoochoo

26

Using Regular expression

Operation Regular Expression Yes No

Replication a(bc)*deade

abcdeabcbcde

abc

One or more a(bc)+deabcde

abcbcdeadeabc

Once or not at all a(bc)?deade

abcdeabc

abcbcde

Character classes [a-m]*blackmailimbecile

abovebelow

Negation of character classes [^aeiou]bc

ae

Exactly N times [^aeiou]{6}rhythmsyzygy

rhythmsallowed

Between M and N times [a-z]{4,6}spidertiger

jellyfishcow

Whitespace characters [a-z\s]*hellohello

say helloOthello2hello

Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of...

Documents

Transcript of Lecture 2: Lexical Analysis. 2 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of...