Lecture 2: Lexical Analysis
2
Lexical Analysis
INPUT: sequence of characters
OUTPUT: sequence of tokens
A lexical analyzer is generally a subroutine of the parser:
• Simpler design
• Efficient
• Portable
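[Figure: Input → Scanner → Parser. The scanner calls Next_char() and receives a character from the input; the parser calls Next_token() and receives a token from the scanner. A Symbol Table is also shown.]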
3
Definitions
token - a set of strings defining an atomic element with a defined meaning
pattern - a rule describing a set of strings
lexeme - a sequence of characters that matches some pattern
4
Examples
Token         Pattern                    Sample Lexeme
while         while                      while
relation_op   = | != | < | >             <
integer       [0-9]+                     42
string        characters between " "     "hello"
5
Input string: size := r * 32 + c
<token,lexeme> pairs: <id, size> <assign, :=> <id, r> <arith_symbol, *> <integer, 32> <arith_symbol, +> <id, c>
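The <token, lexeme> pairs above can be produced by a small hand-written scanner. The following is only a sketch, assuming identifiers are a letter followed by letters or digits and that := , * and + are the only operators; the names next_token, struct token and the T_* constants are hypothetical and not from the lecture.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Token kinds from the slide's example; the enum names are hypothetical. */
enum kind { T_EOF, T_ID, T_ASSIGN, T_ARITH, T_INTEGER, T_ERROR };

struct token {
    enum kind kind;
    char lexeme[32];            /* copy of the matched characters */
};

/* Scan one token starting at *src and advance *src past it. */
struct token next_token(const char **src)
{
    struct token t = { T_EOF, "" };
    const char *p, *start;
    size_t len;

    assert(src != NULL && *src != NULL);
    p = *src;
    while (isspace((unsigned char)*p))      /* skip white space */
        p++;
    start = p;

    if (*p == '\0') {
        t.kind = T_EOF;                     /* end of input */
    } else if (isalpha((unsigned char)*p)) {
        t.kind = T_ID;                      /* letter (letter|digit)* */
        while (isalnum((unsigned char)*p))
            p++;
    } else if (isdigit((unsigned char)*p)) {
        t.kind = T_INTEGER;                 /* one or more digits */
        while (isdigit((unsigned char)*p))
            p++;
    } else if (p[0] == ':' && p[1] == '=') {
        t.kind = T_ASSIGN;
        p += 2;
    } else if (*p == '*' || *p == '+') {
        t.kind = T_ARITH;
        p++;
    } else {
        t.kind = T_ERROR;                   /* unknown character */
        p++;
    }

    len = (size_t)(p - start);
    if (len >= sizeof t.lexeme)             /* truncate very long lexemes */
        len = sizeof t.lexeme - 1;
    memcpy(t.lexeme, start, len);
    t.lexeme[len] = '\0';
    *src = p;
    return t;
}
```

A caller (the parser, in the lecture's picture) would keep a pointer into the input and call next_token repeatedly until it returns T_EOF.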
6
Implementing a Lexical Analyzer
Practical Issues:
• Input buffering
• Translating RE into executable form
• Must be able to capture a large number of tokens with a single machine
• Interface to parser
• Tools
7
Capturing Multiple Tokens
Capturing keyword “begin”
Capturing variable names
What if both need to happen at the same time?
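[Figure: two separate state machines. Keyword machine: edges b, e, g, i, n, then WS. Variable-name machine: an A edge, a loop on AN, then WS.]
WS - white space, A - alphabetic, AN - alphanumeric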
8
Capturing Multiple Tokens
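[Figure: combined state machine. Edges b, e, g, i, n, then WS recognize the keyword; edges labeled A-b (presumably any alphabetic character except b) and AN lead into a variable-name state that loops on AN and leaves on WS.]
WS - white space, A - alphabetic, AN - alphanumeric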
Machine is much more complicated – just for these two tokens!
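One way to see why the combined machine grows is to write it out. The sketch below is an assumption about how such a machine could be coded, not the lecture's own construction: states 0 to 5 count how many characters of "begin" have been seen, and one extra state means "this is a variable name". The names classify_word and W_* are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum word_class { W_NONE, W_KEYWORD_BEGIN, W_IDENT };

/* Classify the word at the start of s; the word ends at white space or at
 * the end of the string. */
enum word_class classify_word(const char *s)
{
    static const char kw[] = "begin";
    enum { ID_STATE = 6 };      /* states 0..5 = matched prefix of "begin" */
    int state = 0;

    assert(s != NULL);
    for (; *s != '\0' && !isspace((unsigned char)*s); s++) {
        unsigned char c = (unsigned char)*s;

        if (state < 5 && c == (unsigned char)kw[state])
            state++;                        /* still on the keyword path */
        else if (state == 0 ? isalpha(c) : isalnum(c))
            state = ID_STATE;               /* fell off the path: a name */
        else
            return W_NONE;                  /* not a legal word */
    }

    if (state == 0)
        return W_NONE;                      /* empty word */
    return state == 5 ? W_KEYWORD_BEGIN : W_IDENT;
}
```

With this sketch, "begin" is reported as the keyword, while "beg", "beginning" and "b2" are variable names. Every additional keyword would add its own path of states, which is the growth the slide points out.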
9
Lex – Lexical Analyzer Generator
[Figure: Lex specification → Flex/Lex → lex.yy.c → C/C++ compiler → a.out. At run time: input → a.out → tokens.]
10
Lex Specification
%{
int charCount=0, wordCount=0, lineCount=0;
%}
word [^ \t\n]*
%%
{word}  {wordCount++; charCount += yyleng; }
[\n]    {charCount++; lineCount++;}
.       {charCount++;}
%%
int main() {
  yylex();
  printf("Characters %d, Words: %d, Lines: %d\n", charCount, wordCount, lineCount);
  return 0;
}
The three sections, separated by %%:
Definitions - code, RE
Rules - RE/action pairs
User routines
11
Lex definitions section
C/C++ code:
• Surrounded by %{ … %} delimiters
• Declare any variables used in actions
RE definitions:
• Define shorthand for patterns:
digit [0-9]
letter [a-z]
ident {letter}({letter}|{digit})*
• Use shorthand in RE section: {ident}
Example:
%{
int charCount=0, wordCount=0, lineCount=0;
%}
word [^ \t\n]*
12
Lex Regular Expressions
Match explicit character sequences: integer, "+++", \<\>
Character classes:
• [abcd]
• [a-zA-Z]
• [^0-9] - matches non-numeric characters
Example:
{word}  {wordCount++; charCount += yyleng; }
[\n]    {charCount++; lineCount++;}
.       {charCount++;}
13
Lex Regular Expressions (cont.)
Alternation: twelve | 12
Closure:
• * - zero or more
• + - one or more
• ? - zero or one
• {number}, {number,number} - an exact count, or a range of counts
14
Lex Regular Expressions (cont.)
Other operators:
• . - matches any character except newline
• ^ - matches beginning of line
• $ - matches end of line
• / - trailing context
• () - grouping
• {} - RE definitions
15
Lex Matching Rules
Lex always attempts to match the longest possible string.
If two rules match strings of the same length, the first rule in the specification is used.
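The two matching rules can be sketched in plain C. This is not how Lex is implemented (Lex compiles all patterns into one machine); it only illustrates the selection policy, assuming three hand-written matchers that each report how many characters they accept at the start of the input. The names rules, pick_rule and match_* are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A matcher returns the length it matches at the start of s (0 = no match). */
typedef size_t (*matcher)(const char *s);

size_t match_while(const char *s)
{
    return strncmp(s, "while", 5) == 0 ? 5 : 0;
}

size_t match_ident(const char *s)
{
    size_t n = 0;
    if (!isalpha((unsigned char)s[0]))
        return 0;
    while (isalnum((unsigned char)s[n]))
        n++;
    return n;
}

size_t match_integer(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

struct rule { const char *name; matcher match; };

/* Order matters: earlier rules win ties, as in a Lex specification. */
const struct rule rules[] = {
    { "while",   match_while   },
    { "ident",   match_ident   },
    { "integer", match_integer },
};

/* Return the index of the winning rule (or -1) and store its match length. */
int pick_rule(const char *s, size_t *len)
{
    int best = -1;
    size_t best_len = 0, i;

    assert(s != NULL && len != NULL);
    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i].match(s);
        if (n > best_len) {     /* strictly longer: ties keep the earlier rule */
            best = (int)i;
            best_len = n;
        }
    }
    *len = best_len;
    return best;
}
```

On the input "while" both the keyword and identifier matchers accept five characters, so the keyword rule wins because it is listed first; on "whilex" the identifier matcher accepts six characters and wins on length.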
16
Lex Operators
Precedence, highest to lowest: closure, concatenation, alternation
Special lex characters: - \ / * + > " { } . $ ( ) | % [ ] ^
Special lex characters inside [ ]: - \ [ ] ^
17
Examples
a.*z
(ab)+
[0-9]{1,5}
(ab|cd)?ef - matches abef, cdef, ef
-?[0-9]\.[0-9]
18
Lex Actions
Lex actions are C (C++) code to implement some required functionality
• Default action is to echo to output
• Can ignore input (empty action)
• ECHO - macro that prints out matched string
• yytext - matched string
• yyleng - length of matched string
19
User Subroutines
• C/C++ code
• Copied directly into the lexer code
• User can supply 'main' or use default
int main() {
  yylex();
  printf("Characters %d, Words: %d, Lines: %d\n", charCount, wordCount, lineCount);
  return 0;
}
20
Uses for Lex
Transforming Input – convert input from one form to another (example 1). yylex() is called once; return is not used in specification
Extracting Information – scan the text and return some information (example 2). yylex() is called once; return is not used in specification.
Extracting Tokens – standard use with compiler (example 3). Uses return to give the next token to the caller.
21
Regular expression
• A regular expression is a kind of pattern that can be applied to text (Strings, in Java)
• A regular expression either matches the text (or part of the text), or it fails to match.
• Regular expressions are an extremely useful tool for manipulating text
– Regular expressions are heavily used in the automatic generation of Web pages
22
Pattern matching applications:
• Scan for virus signatures
• Process natural languages
• Search for information using Google
• Search and replace in word processors
• Filter text (spam, malware)
• Validate data-entry fields (dates, email, URL)
23
Basic Operation
• Notation to specify a set of strings
24
Regular expression : examples
• Notation is surprisingly expressive.
Regular Expression               Yes                          No
a* | (a*ba*ba*ba*)*              ε, bbb, aaa,                 b, bb,
(multiple of 3 b's)              abbbaaa, bbbaababbaa         abbaaaa, baabbbaa
a | a(a|b)*a                     a, aba, aa,                  ε, ab, ba
(begins and ends with a)         abbaabba
(a|b)*abba(a|b)*                 abba, bbabbabb,              ε, abb,
(contains the substring abba)    abbaabba                     bbaaba
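The Yes/No columns can be checked mechanically. The sketch below assumes a POSIX system and uses regcomp and regexec from <regex.h> with extended syntax; the helper name full_match and the 256-byte buffer are assumptions made for illustration. The pattern is wrapped in ^( )$ so that the whole string must match, which is the sense used in this table.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* Return 1 if the whole of text matches pattern, 0 if it does not,
 * and -1 if the pattern is too long or fails to compile. */
int full_match(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    assert(pattern != NULL && text != NULL);
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);    /* 0 means a match was found */
    regfree(&re);
    return rc == 0;
}
```

For example, full_match("a|a(a|b)*a", "abbaabba") reports a match, while the same pattern applied to "ab" does not. The empty string ε is written as "" in C.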
25
Using Regular expression
• Built in to Java, Perl, PHP, Unix, .NET, …
• Additional operations typically added for convenience.
• Ex. [a-e]+ is shorthand for (a|b|c|d|e)(a|b|c|d|e)*
Operation               Regular Expression   Yes                   No
Concatenation           hello                hello, Othello,       Hello
                                             say hello
Any single character    ..oo..oo.            bloodroot,            cookbook,
                                             spoonfood             choochoo
26
Using Regular expression
Operation                       Regular Expression   Yes                    No
Replication                     a(bc)*de             ade, abcde, abcbcde    abc
One or more                     a(bc)+de             abcde, abcbcde         ade, abc
Once or not at all              a(bc)?de             ade, abcde             abc, abcbcde
Character classes               [a-m]*               blackmail, imbecile    above, below
Negation of character classes   [^aeiou]             b, c                   a, e
Exactly N times                 [^aeiou]{6}          rhythm, syzygy         rhythms, allowed
Between M and N times           [a-z]{4,6}           spider, tiger          jellyfish, cow
Whitespace characters           [a-z\s]*hello        hello, say hello       Othello, 2hello
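A few rows of this table can be replayed as a table-driven check. This is a sketch under assumptions: it uses POSIX regcomp and regexec with extended syntax, matches the whole string, and writes the whitespace class as [[:space:]] because \s is not part of POSIX extended syntax. The names rows, whole_match and count_mismatches are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

struct row { const char *pattern; const char *text; int expected; };

/* Selected rows of the table: 1 = "Yes" column, 0 = "No" column. */
const struct row rows[] = {
    { "a(bc)*de",             "abcbcde",   1 }, { "a(bc)*de",    "abc",       0 },
    { "a(bc)+de",             "abcde",     1 }, { "a(bc)+de",    "ade",       0 },
    { "a(bc)?de",             "ade",       1 }, { "a(bc)?de",    "abcbcde",   0 },
    { "[a-m]*",               "blackmail", 1 }, { "[a-m]*",      "above",     0 },
    { "[^aeiou]{6}",          "rhythm",    1 }, { "[^aeiou]{6}", "rhythms",   0 },
    { "[a-z]{4,6}",           "tiger",     1 }, { "[a-z]{4,6}",  "jellyfish", 0 },
    { "[a-z[:space:]]*hello", "say hello", 1 },
    { "[a-z[:space:]]*hello", "2hello",    0 },
};

/* Return 1 if the whole of text matches pattern, 0 if not, -1 on error. */
int whole_match(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    assert(pattern != NULL && text != NULL);
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Count the rows whose result disagrees with the table. */
int count_mismatches(void)
{
    int bad = 0;
    size_t i;

    for (i = 0; i < sizeof rows / sizeof rows[0]; i++)
        if (whole_match(rows[i].pattern, rows[i].text) != rows[i].expected)
            bad++;
    return bad;
}
```

The concatenation row (hello matching Othello) is left out on purpose: it relies on matching part of the text, which would need an unanchored search instead of the whole-string match used here.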