Lecture 3. Introduction to JLex: a lexical analyzer generator for Java

Date posted: 19-Dec-2015

The role of JLex

[Figure: Lscanner.lex -> JLex -> Lscanner.java -> javac -> Lscanner.class. The generated Lscanner.class reads input.L as a char stream and produces L tokens … as a token stream.]

JLex Specifications

user code
%%    // must be at the beginning of a line
JLex directives
%%    // must be at the beginning of a line
lexical rules

Each spec file consists of 3 sections, separated by %%:
» user code, copied to the output file;
» directives, which include macro and state definitions, among others;
» the 3rd section contains the rules of lexical analysis, each of which consists of three parts: an optional state list, a regular expression, and an action.
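The three-section layout can be illustrated with a small Java sketch that cuts a spec text at each line beginning with %%. This is only an illustration of the file structure, not how JLex itself parses a spec; the class SpecSections and its split method are hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of JLex): splits a spec into its sections.
class SpecSections {
    // A line that starts with "%%" closes the current section.
    static List<String> split(String spec) {
        List<String> sections = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : spec.split("\n")) {
            if (line.startsWith("%%")) {
                sections.add(current.toString());
                current.setLength(0);
            } else {
                current.append(line).append('\n');
            }
        }
        sections.add(current.toString());
        return sections;
    }

    public static void main(String[] args) {
        String spec = "import java.io.*;\n%%\n%line\nDIGIT=[0-9]\n%%\n{DIGIT}+ { return 1; }\n";
        List<String> parts = split(spec);
        System.out.println("user code:\n" + parts.get(0));
        System.out.println("directives:\n" + parts.get(1));
        System.out.println("lexical rules:\n" + parts.get(2));
    }
}
```

A well-formed spec yields exactly three sections; the first one may be empty when there is no user code.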

The layout of the generated file

%userCode    // from 1st section: package, import decls + utility classes
%public class %class [implements %implements] {
    %internalCode    // from %{ … %} directive
    // 2 constructors
    %public %class(InputStream is) [throws %initthrow] {
        %initCode    // from %init{ … %init} directive
        …
    }
    // and
    %public %class(Reader is) throws … { … }
    // main method for requesting the next token
    %public %type %function() [throws %yylexthrow] {
        …    // if eof => return (%eofValue)
        …
    }
    // method to be called after eof is encountered
    private void yy_do_eof() [throws %eofthrow] { ... %eofCode ... }
    …
}

JLex Directives

1 Internal Code to Lexical Analyzer Class
2 Initialization Code for Lexical Analyzer Class
3 End-of-File Code for Lexical Analyzer Class
4 Macro Definitions
5 State Declarations
6 Character Counting
7 Line Counting
8 Java CUP Compatibility
9 Lexical Analyzer Component Titles
10 Default Token Type: int
11 Default Token Type II: Wrapped Integer
12 YYEOF on End-of-File
13 Newlines and Operating System Compatibility
14 Character Sets
15 Character Format To and From File
16 Exceptions Generated by Lexical Actions
17 Specifying the Return Value on End-of-File
18 Specifying an interface to implement
19 Making the Generated Class Public

Directives for determining the names of various components of the lexer

The name of the generated class (as well as the file name):
%class className    // default is Yylex

The interface the lexer class would implement:
%implements interfaceName

The name and return type of the method to get the next token:
%function methodName    // default is yylex
%type typeName    // default is Yytoken

Make the lexer class public:
%public

Directives for position information

Enabling the counting of character position:
%char    // private int yychar declared

Enabling the counting of line information:
%line    // private int yyline declared

Notes:

1. yychar and yyline are zero-based.

2. yychar is used to record the position of the beginning of the current token in the input stream.

3. yylength (always enabled) is used to record the length of the text the current token consumes.
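As a rough illustration of these notes, the following Java sketch keeps zero-based counters the way the notes describe them. It is a hand-written model, not code produced by JLex; the class PositionTracker and its consume method are hypothetical, and the sketch assumes that yyline, like yychar, refers to the position where the current token begins.

```java
// Hypothetical model of the position counters; not generated by JLex.
class PositionTracker {
    private int nextChar = 0; // offset of the first character not yet consumed
    private int nextLine = 0; // line of the first character not yet consumed

    int yychar = 0;   // zero-based offset where the current token begins
    int yyline = 0;   // zero-based line where the current token begins (assumption)
    int yylength = 0; // length of the text the current token consumes

    // Records the position of a token and advances past its text.
    void consume(String tokenText) {
        yychar = nextChar;
        yyline = nextLine;
        yylength = tokenText.length();
        nextChar += yylength;
        for (int i = 0; i < tokenText.length(); i++) {
            if (tokenText.charAt(i) == '\n') {
                nextLine++;
            }
        }
    }

    public static void main(String[] args) {
        PositionTracker p = new PositionTracker();
        for (String tok : new String[] {"ab", "\n", "cd"}) {
            p.consume(tok);
            System.out.println("yychar=" + p.yychar + " yyline=" + p.yyline + " yylength=" + p.yylength);
        }
    }
}
```

With the tokens "ab", a newline, and "cd", the last token starts at offset 3 on line 1, since both counters start from zero.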

Java code to be put in various parts of the generated file

» user code to be put outside the lexer class: [all text from 1st section] // before first %%
» user code to be put inside the lexer class
» user code to be put inside the constructors of the lexer class
» user code to be put inside the body of the yy_do_eof() method
» value to be returned when eof is encountered


User code to be put inside the lexer class

format:

%{ // at the beginning of line

<internal code>

%} // at the beginning of line

Permits the declaration of variables and methods inside the generated lexer class.

Corresponds to the %internalCode region.

User code to be put inside all constructors of the lexer class

format:
%init{    // at the beginning of line
<initCode>
%init}    // at the beginning of line

Corresponds to the %initCode region.

Exceptions thrown should be declared by the directive:
%initthrow{
Exception0, …, ExceptionN
%initthrow}    // corresponds to the %initthrow region

Directives for specifying the input alphabet

%full
%unicode
» default alphabet is ASCII (0~127)
» %full => 0~255; %unicode => 0~65535.

%ignorecase
» upper case and lower case letters are regarded as the same.

Directives related to eof processing

Specifying the return value on end-of-file:
%eofval{
eofValue
%eofval}

YYEOF on end-of-file:
%yyeof

notes:
» Enables the declaration: public final int YYEOF = -1; in the lexer
» implied by the directive: %integer

User code to be executed when end-of-file is encountered

format:
%eof{    // at the beginning of line
<eofCode>
%eof}    // at the beginning of line

Corresponds to the %eofCode region.

Exceptions thrown should be declared by the directive:
%eofthrow{
Exception0, …, ExceptionN
%eofthrow}    // corresponds to the %eofthrow region

Specifying the type of the returned token

%type typeName
%integer    // equivalent to %type int
%intwrap    // equivalent to %type java.lang.Integer

Notes:
1. The default type is Yytoken (it needs to be declared elsewhere, say, in user code).
2. null will be returned for the eof token if the returned type is not primitive.
3. YYEOF (-1) will be returned for %integer.

Java CUP Compatibility

%cup
» this directive makes the generated scanner conform to the java_cup.runtime.Scanner interface.
» it has the same effect as the following three directives:
%implements java_cup.runtime.Scanner
%function next_token
%type java_cup.runtime.Symbol

Newlines and Operating System Compatibility

Newlines are represented differently in UNIX-based and DOS-based OSs:
unix => \n
dos => \r\n

The directive %notunix causes the lexer to recognize either \r or \n as a newline.

Exceptions Generated by Lexical Actions

Format:
%yylexthrow{
Exception0, …, ExceptionN
%yylexthrow}

Notes:
1. Mapped to the %yylexthrow region.
2. These are the exceptions that may be thrown from within the action code of lexical rules.

State Declarations

Format:
%state state0, …, stateN

Notes:
1. state0, …, stateN must be on the same line.
2. There can be more than one %state declaration.
3. State names should be valid identifiers.
4. Each stateK will be declared as an int constant in the lexer class.
5. A special state YYINITIAL is implicitly declared, and the lexer begins its analysis in this state.

Macro Definitions

Used to name and define sets of strings for later use in lexical rules.

format:
MacroName = MacroDefinition

Notes:
1. Each macro definition is contained on a single line.
2. MacroName should be a valid id: (letter|_)(letter|digit|_)*
3. MacroDefinition should be a valid regular expression (to be defined later).
4. MacroDefinition may contain other macro expansions in the form {otherMacroName}, but recursion is not permitted.
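The notes on macros can be illustrated with a minimal Java sketch of textual expansion. It is not JLex's implementation: the class MacroExpander is hypothetical, wrapping each expansion in parentheses is a choice made here to keep operator precedence intact, and the sketch ignores the case where braces appear inside quotes or character classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of non-recursive macro expansion; not JLex code.
class MacroExpander {
    // A macro use looks like {name}, where name is (letter|_)(letter|digit|_)*.
    private static final Pattern USE = Pattern.compile("\\{([A-Za-z_][A-Za-z0-9_]*)\\}");
    private final Map<String, String> macros = new LinkedHashMap<>();

    // The definition is expanded before it is stored, so it can only refer
    // to macros defined earlier; a self-reference is reported as undefined.
    void define(String name, String definition) {
        macros.put(name, expand(definition));
    }

    // Replaces every {name} in the expression by the parenthesized macro body.
    String expand(String expression) {
        Matcher m = USE.matcher(expression);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String body = macros.get(m.group(1));
            if (body == null) {
                throw new IllegalArgumentException("undefined macro: " + m.group(1));
            }
            out.append(expression, last, m.start()).append('(').append(body).append(')');
            last = m.end();
        }
        out.append(expression, last, expression.length());
        return out.toString();
    }

    public static void main(String[] args) {
        MacroExpander mx = new MacroExpander();
        mx.define("ALPHA", "[A-Za-z]");
        mx.define("DIGIT", "[0-9]");
        System.out.println(mx.expand("{ALPHA}({ALPHA}|{DIGIT}|_)*"));
    }
}
```

Because define expands eagerly, a definition such as A = {A}x fails immediately, which mirrors the rule that recursion is not permitted.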

Lexical Rules

Format:
[<state1, …, stateN>] expression { actionCode }

Notes:
1. All stateKs must have been declared by %state.
2. The rule will be activated only when the lexer is in one of the states listed in the state list.
» if the state list is omitted, the rule is always activated.
3. The intuitive meaning of the rule is as follows:
» if the lexer is in one of the states in the list and the substring from the current position matches the expression, then execute the actionCode.

Conflict resolution

What happens if more than one rule matches strings from the input?

1. Choose the rule that matches the longest string.
2. If more than one rule matches strings of the same length, choose the rule that is given first in the JLex specification.

Therefore, rules appearing earlier in the specification are given a higher priority by the generated lexer.
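The two resolution rules can be sketched in Java with java.util.regex standing in for the generated automaton. This is an approximation for illustration only: the class LongestMatch and method pickRule are hypothetical, and java.util.regex does not guarantee the longest match for every pattern (alternations are tried left to right), so the example rules are chosen to avoid that case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of "longest match, then first rule"; not JLex code.
class LongestMatch {
    // Returns the index of the winning rule at position pos, or -1 if no rule
    // matches a non-empty string there.
    static int pickRule(Pattern[] rules, String input, int pos) {
        int best = -1;
        int bestLen = 0;
        for (int i = 0; i < rules.length; i++) {
            Matcher m = rules[i].matcher(input);
            m.region(pos, input.length());
            if (m.lookingAt()) {
                int len = m.end() - pos;
                // Strictly greater: on a tie the earlier rule keeps priority.
                if (len > bestLen) {
                    best = i;
                    bestLen = len;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Pattern[] rules = { Pattern.compile("if"), Pattern.compile("[a-z]+") };
        System.out.println(pickRule(rules, "if", 0));   // both rules match 2 chars
        System.out.println(pickRule(rules, "iffy", 0)); // second rule matches 4 chars
    }
}
```

For the input "if" both rules match two characters, so the keyword rule listed first wins; for "iffy" the identifier rule matches more text and takes over.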

Regular Expressions

The alphabet for JLex is the ASCII character set, meaning character codes between 0 and 127 inclusive.

Non-newline white space is not allowed in expressions unless it is within double quotes " … " or immediately after \.

Metacharacters are chars with special meanings in JLex regular expressions:
? * + | ( ) ^ $ . [ ] { } " \

Other chars represent themselves.

Escape sequences for characters

\ddd      The character with octal number ddd
\xdd      The character with hexadecimal number dd
\udddd    The Unicode character with hexadecimal number dddd
\b        Backspace
\n        Newline
\t        Tab
\f        Formfeed
\r        Carriage return
\^C       Control character (0~31: \^@, \^A, …, Z, [, \, ], ^, _)
\c        A backslash followed by any other character c matches itself. Ex: \\, \a, \B, \", \', etc.
$         denotes the end of a line.
.         matches any character except the newline; equivalent to [^\n].
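To make the escape forms concrete, here is a small Java sketch that decodes one escape sequence into the character it denotes. It is a simplified reading of the escape list, not JLex's own decoder: the class EscapeDecoder is hypothetical, and it does not check that the expected number of digits is present.

```java
// Hypothetical decoder for a single escape sequence; not JLex code.
class EscapeDecoder {
    // esc must start with a backslash, e.g. "\\012", "\\x41", "\\^A".
    static char decode(String esc) {
        if (esc.length() < 2 || esc.charAt(0) != '\\') {
            throw new IllegalArgumentException("not an escape: " + esc);
        }
        char c = esc.charAt(1);
        String rest = esc.substring(2);
        switch (c) {
            case 'b': return '\b';
            case 'n': return '\n';
            case 't': return '\t';
            case 'f': return '\f';
            case 'r': return '\r';
            case 'x': // \xdd: two hexadecimal digits
            case 'u': // \udddd: four hexadecimal digits
                return (char) Integer.parseInt(rest, 16);
            case '^': // \^@ is 0, \^A is 1, ..., \^_ is 31
                return (char) (rest.charAt(0) - '@');
            default:
                if (c >= '0' && c <= '7') { // \ddd: octal digits
                    return (char) Integer.parseInt(esc.substring(1), 8);
                }
                return c; // any other character stands for itself
        }
    }

    public static void main(String[] args) {
        System.out.println((int) decode("\\012")); // octal 12
        System.out.println(decode("\\x41"));       // hexadecimal 41
        System.out.println((int) decode("\\^A"));  // control character
    }
}
```

For instance, \012 is octal 12, which is character code 10 (the newline), and \x41 is the letter A.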

More on regular expressions

"…aString…" denotes aString.
» Metacharacters in aString lose their meaning and represent themselves.
» The sequence \", which represents ", is the only exception.
» Ex: "ab d\\\"" stands for ab d\\"

{name} denotes a macro expansion.
E1E2 : concatenation
E1|E2 : choice
E+ or (E)+ : one or more repetitions of E
E* or (E)* : zero or more repetitions of E
E? or (E)? : zero or one repetition of E
(E) : (..) is used for grouping.

More on regular expressions

[...]
» Square brackets denote a class of characters and match any one character enclosed in the brackets.

Substrings inside the brackets with special meaning:
» {name} : macro expansion
» a-b : range of characters from a to b.
» "String" means String, with metachars losing their special meaning.
» \c means c, where c is any character.
» [^Rest] means the alphabet minus [Rest], i.e. any character not in [Rest].

More on regular expressions

Ex:
» [a-z] matches a, b, …, z.
» [^0-9] matches any char but 0, 1, …, 9.
» [\"\\] matches " or \.
» ["a-z"] matches a, - and z.
» [-0-9] matches -, 0, …, 9.
» how about [\b\f"\r\t"] ?

Lexical Actions

format:
{ action }

notes:
» All curly braces contained in action that are not part of strings or comments should be balanced.

Actions and recursion:
» If no value is returned in an action, the lexical analyzer will search for the next match from the input stream and return the value associated with that match.
» The lexical analyzer can be made to recur explicitly with a call to yylex(), as in the following code fragment:
{ ... return yylex(); ... }

More on lexical actions

State transitions are made by the function call:
yybegin(state);

Available lexical methods / vars:
String yytext()    matched portion of the character input stream
int yylength()     length of yytext()
int yychar;
int yyline;
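How yybegin drives a lexer between states can be sketched with a hand-written Java loop that skips nested /* ... */ comments. This is not JLex output: the class CommentStates and its strip method are hypothetical, the state names YYINITIAL and COMMENT follow the state declarations described in this lecture, and yybegin is modeled as a plain assignment.

```java
// Hypothetical hand-written model of state switching; not generated by JLex.
class CommentStates {
    static final int YYINITIAL = 0; // the implicit start state
    static final int COMMENT = 1;   // a state as declared by: %state COMMENT

    private int state = YYINITIAL;
    private int commentCount = 0;   // current nesting depth of comments

    // Models yybegin(state): later matching depends on the current state.
    private void yybegin(int newState) {
        state = newState;
    }

    // Returns the input with (possibly nested) comments removed.
    String strip(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith("/*", i)) {
                commentCount++;
                yybegin(COMMENT);
                i += 2;
            } else if (state == COMMENT && input.startsWith("*/", i)) {
                commentCount--;
                if (commentCount == 0) {
                    yybegin(YYINITIAL);
                }
                i += 2;
            } else {
                if (state == YYINITIAL) {
                    out.append(input.charAt(i)); // only text outside comments is kept
                }
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(new CommentStates().strip("a/*b/*c*/d*/e"));
    }
}
```

The closing */ only returns the lexer to YYINITIAL once the nesting depth drops to zero, so text between an inner and an outer comment end is still skipped.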

Performance

Size of source file    JLex-generated lexer: execution time    Hand-written lexer: execution time
177 lines              0.42 seconds                             0.53 seconds
897 lines              0.98 seconds                             1.28 seconds

The JLex lexical analyzer soundly outperformed the hand-written lexer!

Example

import java.lang.System;

class Sample {
    public static void main(String argv[]) throws java.io.IOException {
        Yylex yy = new Yylex(System.in);
        Yytoken t;
        while ((t = yy.yylex()) != null)
            System.out.println(t);
    }
}

class Utility {
    // Note: assert has been a reserved word since Java 1.4, so a modern
    // compiler needs this method (and its callers) renamed.
    public static void assert(boolean expr) {
        if (false == expr) {
            throw (new Error("Error: Assertion failed."));
        }
    }

    private static final String errorMsg[] = {
        "Error: Unmatched end-of-comment punctuation.",
        "Error: Unmatched start-of-comment punctuation.",
        "Error: Unclosed string.",
        "Error: Illegal character."
    };

    public static final int E_ENDCOMMENT = 0;
    public static final int E_STARTCOMMENT = 1;
    public static final int E_UNCLOSEDSTR = 2;
    public static final int E_UNMATCHED = 3;

    public static void error(int code) {
        System.out.println(errorMsg[code]);
    }
}

class Yytoken {
    Yytoken(int index, String text, int line, int charBegin, int charEnd) {
        m_index = index;
        m_text = new String(text);
        m_line = line;
        m_charBegin = charBegin;
        m_charEnd = charEnd;
    }

    public int m_index;
    public String m_text;
    public int m_line;
    public int m_charBegin;
    public int m_charEnd;

    public String toString() {
        return "Token #" + m_index + ": " + m_text + " (line " + m_line + ")";
    }
}

%%
%{
  private int comment_count = 0;
%}
%line
%char
%state COMMENT
ALPHA=[A-Za-z]
DIGIT=[0-9]
NONNEWLINE_WHITE_SPACE_CHAR=[\ \t\b\012]
WHITE_SPACE_CHAR=[\n\ \t\b\012]
STRING_TEXT=(\\\"|[^\n\"]|\\{WHITE_SPACE_CHAR}+\\)*
COMMENT_TEXT=([^/*\n]|[^*\n]"/"[^*\n]|[^/\n]"*"[^/\n]|"*"[^/\n]|"/"[^*\n])*
%%

<YYINITIAL> "," { return (new Yytoken(0,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> ":" { return (new Yytoken(1,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> ";" { return (new Yytoken(2,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> "(" { return (new Yytoken(3,yytext(),yyline,yychar,yychar+1)); }
…
<YYINITIAL> "<>" { return (new Yytoken(15,yytext(),yyline,yychar,yychar+2)); }
…
<YYINITIAL> "<" { return (new Yytoken(16,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> "<=" { return (new Yytoken(17,yytext(),yyline,yychar,yychar+2)); }
…
<YYINITIAL> "|" { return (new Yytoken(21,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> ":=" { return (new Yytoken(22,yytext(),yyline,yychar,yychar+2)); }


<YYINITIAL> {NONNEWLINE_WHITE_SPACE_CHAR}+ { }

<YYINITIAL,COMMENT> \n { }

<YYINITIAL> "/*" { yybegin(COMMENT);

comment_count = comment_count + 1; }

<COMMENT> "/*" { comment_count = comment_count + 1; }

<COMMENT> "*/" {

comment_count = comment_count - 1;

Utility.assert(comment_count >= 0);

if (comment_count == 0) {yybegin(YYINITIAL);}}

<COMMENT> {COMMENT_TEXT} { }

<YYINITIAL> \"{STRING_TEXT}\" {

String str = yytext().substring(1,yytext().length() - 1);

Utility.assert(str.length() == yytext().length() - 2);

return (new Yytoken(40,str,yyline,yychar,yychar + str.length())); }


<YYINITIAL> \"{STRING_TEXT} {

String str = yytext().substring(1,yytext().length());

Utility.error(Utility.E_UNCLOSEDSTR);

Utility.assert(str.length() == yytext().length() - 1);

return (new Yytoken(41,str,yyline,yychar,yychar + str.length()));}

<YYINITIAL> {DIGIT}+ {

return (new Yytoken(42,yytext(),yyline,yychar,yychar + yytext().length()));}

<YYINITIAL> {ALPHA}({ALPHA}|{DIGIT}|_)* {

return (new Yytoken(43,yytext(),yyline,yychar,yychar + yytext().length())); }

<YYINITIAL,COMMENT> . {

System.out.println("Illegal character: <" + yytext() + ">");

Utility.error(Utility.E_UNMATCHED);}