Lexical Analysis (4.2) Programming Languages Hiram College Ellen Walker.

Lexical Analysis (4.2)

Programming Languages, Hiram College, Ellen Walker

Lexical Analysis is Pattern Matching

• From a sequence of characters to a sequence of lexemes, e.g.
  – “public static void main(char[] args)” ->
  – <id> <id> <id> <id> <lparen> <id> <lsquare> <rsquare> <id> <rparen>

• Patterns are simpler (easy grammars), e.g.
  <id> -> <letter> <id> | <letter>
  <letter> -> a | b | c | … | z
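The two <id> rules translate almost directly into a recursive check. The following is a minimal sketch, assuming lowercase ASCII letters only; the names IdGrammar, isLetter and isId are illustrative and do not come from the slides.

```java
// Hypothetical helper class: a direct transcription of the <id> grammar.
final class IdGrammar {
    // <letter> -> a | b | c | … | z
    static boolean isLetter(char c) {
        return c >= 'a' && c <= 'z';
    }

    // <id> -> <letter> <id> | <letter>
    static boolean isId(String s) {
        if (s.isEmpty() || !isLetter(s.charAt(0))) {
            return false;                 // every <id> starts with a <letter>
        }
        if (s.length() == 1) {
            return true;                  // rule: <id> -> <letter>
        }
        return isId(s.substring(1));      // rule: <id> -> <letter> <id>
    }

    public static void main(String[] args) {
        System.out.println(isId("main")); // true
        System.out.println(isId("x1"));   // false: '1' is not a <letter>
    }
}
```

The recursion mirrors the grammar rule by rule, which is easy to read but does more work than needed; the state-machine approach on the later slides gets the same answer in a single left-to-right pass.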

Regular Grammars

• Subset of Context Free Grammars
• Every rule contains at most one non-terminal symbol on its right-hand side, as the last symbol (or can be rewritten so it does…)

Rewritten Grammar for ID

• Original:
  <id> -> <letter> <id> | <letter>
  <letter> -> a | b | c | … | z

• Rewrite:
  <id> -> (a | b | c | … | z) <id> | (a | b | c | … | z)

• Fully expanded (52 rules):
  <id> -> a <id> | b <id> | c <id> | … | a | b | c | … | z
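The count of 52 comes from 26 rules of the form <id> -> x <id> plus 26 of the form <id> -> x. A small sketch that enumerates them, assuming the alphabet is exactly a through z (the class name ExpandIdRules is made up for this illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists every rule of the fully expanded <id> grammar.
final class ExpandIdRules {
    static List<String> expand() {
        List<String> rules = new ArrayList<>();
        for (char c = 'a'; c <= 'z'; c++) {
            rules.add("<id> -> " + c + " <id>");   // recursive rules
        }
        for (char c = 'a'; c <= 'z'; c++) {
            rules.add("<id> -> " + c);             // terminating rules
        }
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(expand().size());       // 52
    }
}
```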

Parsing using a Regular Grammar

1. Transform the grammar into a state machine
2. Implement the state machine in a computer program
   – By hand
   – Automatically, using table-lookup
3. Run this program on input strings

What is a State Machine?

• State machine abstraction
  – At any time, the process is in a “state”
  – Each time an “event” happens, the process takes an “action” and goes to the next state
  – We can describe the entire algorithm as a diagram where each state has an arrow for each event/action pair to the next appropriate state

State Machine for a Kitten

[Diagram: states Happy, Hungry, Sleeping; arrows labeled event / action: Food available / Eat, Toys available / Play, X hrs passed / Awaken]
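The state/event/action idea can be sketched in code. The transcript does not preserve which arrow connects which states, so the wiring below (Hungry to Happy on food, Happy to Sleeping on toys, Sleeping to Hungry after X hours) and the starting state are assumptions made only for illustration; the class name KittenMachine is also hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an event/action state machine for the kitten.
final class KittenMachine {
    enum State { HUNGRY, HAPPY, SLEEPING }
    enum Event { FOOD_AVAILABLE, TOYS_AVAILABLE, HOURS_PASSED }

    State state = State.HUNGRY;                    // assumed starting state
    final List<String> actions = new ArrayList<>();

    // Take the action for (state, event) and move to the next state.
    // The arrow layout is assumed; events with no arrow are ignored.
    void handle(Event e) {
        if (state == State.HUNGRY && e == Event.FOOD_AVAILABLE) {
            actions.add("Eat");
            state = State.HAPPY;
        } else if (state == State.HAPPY && e == Event.TOYS_AVAILABLE) {
            actions.add("Play");
            state = State.SLEEPING;
        } else if (state == State.SLEEPING && e == Event.HOURS_PASSED) {
            actions.add("Awaken");
            state = State.HUNGRY;
        }
    }

    public static void main(String[] args) {
        KittenMachine k = new KittenMachine();
        k.handle(Event.FOOD_AVAILABLE);
        k.handle(Event.TOYS_AVAILABLE);
        k.handle(Event.HOURS_PASSED);
        System.out.println(k.actions + " -> " + k.state);
    }
}
```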

State Machine for a Language

• Each “event” processes an input symbol
• Two important special states
  – Initial state: state the machine is in before the first symbol
  – Final state: state the machine is in whenever the sequence of symbols up to now is in the language

Transforming a Regular Grammar to a State Machine

• Put the grammar into a form so every rule is
  <nonterm1> -> symbol <nonterm2>
  <nonterm1> -> symbol
• Make a state for each nonterminal
• Make a transition (arrow) for each rule. The transition goes from <nonterm1> to <nonterm2> based on the symbol.
• The start symbol of the grammar is initial.
• There is one final state that every rule that doesn’t have a nonterminal on the right goes to.

State Machine Example

• <id> -> a <id> | b <id> | a | b

• Two states: id (initial) and f (final)
• Example: aabba
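Because both a and b lead from id to either id or f, this machine is nondeterministic. One way to run it is to track the set of states it could be in after each symbol. A sketch under that approach (the class name IdNfa and its methods are illustrative, not from the slides):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical simulation of the two-state machine built from
// <id> -> a <id> | b <id> | a | b
final class IdNfa {
    static final int ID = 0;   // state for nonterminal <id> (initial)
    static final int F = 1;    // the single final state

    // Rules "a <id>" and "b <id>" give arrows id -> id;
    // rules "a" and "b" give arrows id -> f. State f has no outgoing arrows.
    static Set<Integer> step(int state, char symbol) {
        Set<Integer> next = new HashSet<>();
        if (state == ID && (symbol == 'a' || symbol == 'b')) {
            next.add(ID);
            next.add(F);
        }
        return next;
    }

    // Accept if f is among the possible states after the last symbol.
    static boolean accepts(String input) {
        Set<Integer> current = new HashSet<>();
        current.add(ID);
        for (char c : input.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int s : current) {
                next.addAll(step(s, c));
            }
            current = next;
        }
        return current.contains(F);
    }

    public static void main(String[] args) {
        System.out.println(accepts("aabba"));  // true
        System.out.println(accepts("abc"));    // false
    }
}
```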

Simpler State Machine

• This is a cleaner version of the previous machine. Each (state, character) combination has only one next state.

• It is called a DFA (deterministic finite automaton)
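The slide's diagram is not preserved in this transcript. One DFA that accepts the same language (non-empty strings of a and b) has a start state and a final state, with every a or b leading to the final state. The sketch below assumes that layout and adds an explicit dead state for any other character; all names are illustrative.

```java
// Hypothetical DFA for non-empty strings over {a, b}.
final class IdDfa {
    static final int START = 0;
    static final int FINAL = 1;
    static final int DEAD = -1;    // assumed trap state for invalid input

    // Exactly one next state for each (state, character) pair.
    static int next(int state, char symbol) {
        if (state == DEAD) {
            return DEAD;
        }
        return (symbol == 'a' || symbol == 'b') ? FINAL : DEAD;
    }

    static boolean accepts(String input) {
        int state = START;
        for (char c : input.toCharArray()) {
            state = next(state, c);
        }
        return state == FINAL;
    }

    public static void main(String[] args) {
        System.out.println(accepts("aabba"));  // true
        System.out.println(accepts(""));       // false
    }
}
```

Compared with tracking a set of states, the deterministic version needs only a single integer of state per step, which is what makes a table-lookup implementation practical.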

Lexical Analysis for Integer Expressions

From DFA to Program

Method doScan() reads tokens from an input stream (assume System.in for now) and creates a list of them in order.

Method lex(s) scans and returns a single Token from a stream.

A Token consists of a type (e.g. INT) and a string (e.g. “1234”)
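A Token of this kind could look like the sketch below. Only getType appears on the slides; the field names, getLexeme and toString are assumptions added for illustration.

```java
// Hypothetical Token class: a type code paired with the matched text.
final class Token {
    private final int type;       // e.g. INT
    private final String lexeme;  // e.g. "1234"

    Token(int type, String lexeme) {
        this.type = type;
        this.lexeme = lexeme;
    }

    int getType() {
        return type;
    }

    String getLexeme() {          // assumed accessor, not shown on the slides
        return lexeme;
    }

    @Override
    public String toString() {
        return "<" + type + ", \"" + lexeme + "\">";
    }
}
```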

09/15/10

Defining Constants

// Number all the states
public static final int NUMSTATES = 4;  // rows in the transition table; ERR has no outgoing transitions
public static final int START = 0;
public static final int INT = 1;
public static final int ID = 2;
public static final int UNK = 3;
public static final int ERR = 4;

Constructing Transition Table (in constructor)

String chars = "01234abcdef+-()";
// Rows are states, columns are positions in chars
int[][] tt = new int[NUMSTATES][chars.length()];
tt[ID][5] = ID;      // 'a'
tt[ID][6] = ID;      // 'b'
tt[START][5] = ID;   // 'a'
tt[START][1] = INT;  // '1'
// … etc …
tt[ID][0] = ERR;     // '0'
// … etc …

Recognizing Final States

// For this grammar, all states but ERR are final
// Usually, this method is a bit more complex
// (named isFinal because final is a reserved word in Java)
boolean isFinal(int state) {
    return (state != ERR);
}

Lex Method

// Read one token from the input (any Scanner)
// Assumes a Scanner class that provides nextChar(); java.util.Scanner does not
public static Token lex(Scanner s) {
    // initialize variables
    StringBuilder lexeme = new StringBuilder();
    int state = START;
    char ch = s.nextChar();
    …

Lex Method (cont’d)

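    // loop through characters, updating state
    while (state != ERR) {
        oldstate = state;
        int col = chars.indexOf(ch);              // -1 if ch is not in the alphabet
        state = (col < 0) ? ERR : tt[oldstate][col];
        if (state != ERR) {                       // ch extends the current token
            lexeme.append(ch);
            ch = s.nextChar();
        }
    }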

Lex Method (cont’d)

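    // return the token
    // (isFinal is the final-state test; final itself is a reserved word in Java)
    if (isFinal(oldstate))   // valid token
        return new Token(oldstate, lexeme.toString());
    else                     // not a valid token: return the chars
        return new Token(ERR, lexeme.toString());
} // end of lex()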

From DFA to Program (cont’d)

public static boolean doScan() {
    // peek() is assumed to report whether input remains; java.util.Scanner has no such method
    Scanner s = new Scanner(System.in);
    while (s.peek()) {           // not EOF
        eatWhitespace(s);        // removes whitespace
        Token token = lex(s);
        tokens.add(token);
        if (token.getType() == ERR)
            return false;
    }
    return true;
}
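The pieces from the preceding slides can be assembled into one runnable sketch. Several things here are assumptions rather than the slides' design: the input is a String instead of System.in (since the character-level Scanner methods used on the slides are not part of java.util.Scanner), the full transition table is filled in by a guess at the elided entries (digits form INT, letters form ID, and each of + - ( ) is a one-character UNK token), and doScan returns the token list instead of a boolean.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TableLexer {
    // State numbers from the slides; ERR has no table row of its own.
    static final int START = 0, INT = 1, ID = 2, UNK = 3, ERR = 4;
    static final int NUMSTATES = 4;
    static final String CHARS = "01234abcdef+-()";
    static final int[][] TT = new int[NUMSTATES][CHARS.length()];

    static {
        for (int[] row : TT) {
            Arrays.fill(row, ERR);              // default: no transition
        }
        for (int col = 0; col < CHARS.length(); col++) {
            char c = CHARS.charAt(col);
            if (Character.isDigit(c)) {
                TT[START][col] = INT;
                TT[INT][col] = INT;
            } else if (Character.isLetter(c)) {
                TT[START][col] = ID;
                TT[ID][col] = ID;
            } else {
                TT[START][col] = UNK;           // assumption: one-character token
            }
        }
    }

    static final class Token {
        final int type;
        final String lexeme;
        Token(int type, String lexeme) { this.type = type; this.lexeme = lexeme; }
    }

    private final String input;
    private int pos = 0;

    TableLexer(String input) { this.input = input; }

    // Scan one token. Precondition: pos is inside the input, not on whitespace.
    Token lex() {
        StringBuilder lexeme = new StringBuilder();
        int state = START;
        while (pos < input.length()) {
            int col = CHARS.indexOf(input.charAt(pos));
            int next = (col < 0) ? ERR : TT[state][col];
            if (next == ERR) {
                break;                          // lookahead does not extend the token
            }
            lexeme.append(input.charAt(pos++));
            state = next;
        }
        if (state == START) {                   // first character matched nothing
            return new Token(ERR, String.valueOf(input.charAt(pos++)));
        }
        return new Token(state, lexeme.toString());
    }

    // Collect tokens in order; stops after the first ERR token.
    static List<Token> doScan(String text) {
        TableLexer lx = new TableLexer(text);
        List<Token> tokens = new ArrayList<>();
        while (true) {
            while (lx.pos < text.length() && Character.isWhitespace(text.charAt(lx.pos))) {
                lx.pos++;                       // eat whitespace
            }
            if (lx.pos >= text.length()) {
                return tokens;
            }
            Token t = lx.lex();
            tokens.add(t);
            if (t.type == ERR) {
                return tokens;
            }
        }
    }

    public static void main(String[] args) {
        for (Token t : doScan("ab+12")) {
            System.out.println(t.type + " " + t.lexeme);
        }
    }
}
```

In this sketch the character that ends a token is left in the input for the next call to lex(), so no lookahead is lost between tokens. With the assumed table, scanning "ab+12" yields an ID token "ab", a UNK token "+", and an INT token "12".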

Another Program (pp. 176-181)

• Programmed in C (no classes)
• Global variables instead of class variables (used in many functions, e.g. charClass)
• Token (int) and lexeme (string) unconnected
• States and transitions are implicit
• Lex() is a big case statement
• Many special purpose functions, e.g. getChar(), addChar(), lookup() executing portions of DFA
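For comparison with the table-driven version, the style described in these bullets can be imitated in Java using static fields in place of C globals. This is a rough rendering of the style, not the textbook's code: the token codes, the start() helper, and the choice to let identifiers contain digits are all assumptions made for the sketch.

```java
// Hypothetical Java imitation of a C-style lexer with implicit states.
final class SwitchLexer {
    // Character classes and token codes (values chosen arbitrarily here)
    static final int LETTER = 0, DIGIT = 1, UNKNOWN = 99, EOF = -1;
    static final int INT_LIT = 10, IDENT = 11, ADD_OP = 21, SUB_OP = 22,
                     LEFT_PAREN = 25, RIGHT_PAREN = 26;

    // "Global" variables shared by every function
    static String input;
    static int pos;
    static int charClass;
    static char nextChar;
    static int nextToken;                       // token code only...
    static final StringBuilder lexeme = new StringBuilder();  // ...text kept separately

    static void start(String text) { input = text; pos = 0; getChar(); }

    static void addChar() { lexeme.append(nextChar); }

    // Read the next character and classify it.
    static void getChar() {
        if (pos < input.length()) {
            nextChar = input.charAt(pos++);
            if (Character.isLetter(nextChar)) charClass = LETTER;
            else if (Character.isDigit(nextChar)) charClass = DIGIT;
            else charClass = UNKNOWN;
        } else {
            charClass = EOF;
        }
    }

    // Token codes for single-character operators and parentheses.
    static int lookup(char ch) {
        switch (ch) {
            case '(': return LEFT_PAREN;
            case ')': return RIGHT_PAREN;
            case '+': return ADD_OP;
            case '-': return SUB_OP;
            default:  return EOF;
        }
    }

    // One big case statement; each branch plays the role of a DFA state.
    static int lex() {
        lexeme.setLength(0);
        while (charClass != EOF && Character.isWhitespace(nextChar)) getChar();
        switch (charClass) {
            case LETTER:
                do { addChar(); getChar(); } while (charClass == LETTER || charClass == DIGIT);
                nextToken = IDENT;
                break;
            case DIGIT:
                do { addChar(); getChar(); } while (charClass == DIGIT);
                nextToken = INT_LIT;
                break;
            case UNKNOWN:
                nextToken = lookup(nextChar);
                addChar();
                getChar();
                break;
            default:                            // end of input
                nextToken = EOF;
                lexeme.append("EOF");
        }
        return nextToken;
    }

    public static void main(String[] args) {
        start("(sum + 47)");
        while (lex() != EOF) {
            System.out.println(nextToken + " " + lexeme);
        }
    }
}
```

Here the loops inside each case stand in for the self-arrows of the DFA, which is why the states and transitions never appear as data.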
