COS 320 Compilers


Page 1: COS 320 Compilers

COS 320 Compilers

David Walker

Page 2: COS 320 Compilers

Outline

• Last Week
  – Introduction to ML

• Today:
  – Lexical Analysis
  – Reading: Chapter 2 of Appel

Page 3: COS 320 Compilers

The Front End

• Lexical Analysis: Create sequence of tokens from characters

• Syntax Analysis: Create abstract syntax tree from sequence of tokens

• Type Checking: Check program for well-formedness constraints

[Diagram: stream of characters → Lexer → stream of tokens → Parser → abstract syntax → Type Checker]
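The pipeline can also be read as a composition of typed phases. A minimal ML sketch for orientation only (the type names and signature are illustrative, not Appel's actual interfaces):

  (* hedged sketch: the front end as an ML signature *)
  signature FRONT_END =
  sig
    type token                              (* produced by the lexer *)
    type ast                                (* abstract syntax tree *)
    val lex       : string -> token list    (* characters -> tokens *)
    val parse     : token list -> ast       (* tokens -> abstract syntax *)
    val typecheck : ast -> unit             (* checks well-formedness constraints *)
  end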

Page 4: COS 320 Compilers

Lexical Analysis

• Lexical Analysis: Breaks stream of ASCII characters (source) into tokens

• Token: An atomic unit of program syntax
  – i.e., a word as opposed to a sentence

• Tokens and their types:

  Type     Characters Recognized      Token
  ID       foo, x, listcount          ID(foo), ID(x), ...
  REAL     10.45, 3.14, -2.1          REAL(10.45), REAL(3.14), ...
  SEMI     ;                          SEMI
  LPAREN   (                          LPAREN
  NUM      50, 100                    NUM(50), NUM(100)
  IF       if                         IF
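In ML, token types like these are naturally represented with a datatype, much like the one used later on the "A Simple Lexer" slide; the sketch below is illustrative, not the course's official token type:

  datatype token =
      ID of string       (* identifiers carry their text *)
    | REAL of real       (* real literals carry their value *)
    | NUM of int         (* integer literals *)
    | SEMI
    | LPAREN
    | IF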

Page 5: COS 320 Compilers

Lexical Analysis Example

x = ( y + 4.0 ) ;

Page 6: COS 320 Compilers

Lexical Analysis Example

x = ( y + 4.0 ) ;

ID(x)

Lexical Analysis

Page 7: COS 320 Compilers

Lexical Analysis Example

x = ( y + 4.0 ) ;

ID(x) ASSIGN

Lexical Analysis

Page 8: COS 320 Compilers

Lexical Analysis Example

x = ( y + 4.0 ) ;

ID(x) ASSIGN LPAREN ID(y) PLUS REAL(4.0) RPAREN SEMI

Lexical Analysis
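Assuming a token datatype with constructors such as ID, ASSIGN, LPAREN, PLUS, REAL, RPAREN, and SEMI (illustrative names), the result of lexing this line could be written as the ML list below; this only depicts the output, it is not generated code:

  val tokens = [ID "x", ASSIGN, LPAREN, ID "y", PLUS, REAL 4.0, RPAREN, SEMI]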

Page 9: COS 320 Compilers

Lexer Implementation

• Implementation Options:

1. Write a Lexer from scratch
   – Boring, error-prone and too much work

Page 10: COS 320 Compilers

Lexer Implementation

• Implementation Options:

1. Write a Lexer from scratch
   – Boring, error-prone and too much work

2. Use a Lexer Generator
   – Quick and easy. Good for lazy compiler writers.

Lexer Specification

Page 11: COS 320 Compilers

Lexer Implementation

• Implementation Options:

1. Write a Lexer from scratch
   – Boring, error-prone and too much work

2. Use a Lexer Generator
   – Quick and easy. Good for lazy compiler writers.

[Diagram: Lexer Specification → lexer generator → Lexer]

Page 12: COS 320 Compilers

Lexer Implementation

• Implementation Options:

1. Write a Lexer from scratch
   – Boring, error-prone and too much work

2. Use a Lexer Generator
   – Quick and easy. Good for lazy compiler writers.

[Diagram: Lexer Specification → lexer generator → Lexer; stream of characters → Lexer → stream of tokens]

Page 13: COS 320 Compilers

• How do we specify the lexer?
  – Develop another language
  – We'll use a language involving regular expressions to specify tokens

• What is a lexer generator?
  – Another compiler ...

Page 14: COS 320 Compilers

Some Definitions

• We will want to define the language of legal tokens our lexer can recognize
  – Alphabet – a collection of symbols (ASCII is an alphabet)
  – String – a finite sequence of symbols taken from our alphabet
  – Language of legal tokens – a set of strings
    • Language of ML keywords – the set of all strings which are ML keywords (FINITE)
    • Language of ML tokens – the set of all strings which map to ML tokens (INFINITE)

• A language can also be a more general set of strings:
  – e.g.: ML Language – the set of all strings representing correct ML programs (INFINITE)

Page 15: COS 320 Compilers

Regular Expressions: Construction

• Base Cases:
  – For each symbol a in the alphabet, a is an RE denoting the set {a}
  – Epsilon (ε) denotes {""}, the set containing only the empty string

• Inductive Cases (M and N are REs):
  – Alternation (M | N) denotes strings in M or N
    • (a | b) == {a, b}
  – Concatenation (M N) denotes strings in M concatenated with strings in N
    • (a | b) (a | c) == {aa, ac, ba, bc}
  – Kleene closure (M*) denotes strings formed by any number of repetitions of strings in M
    • (a | b)* == {ε, a, b, aa, ab, ba, bb, ...}

A small ML datatype capturing this grammar is sketched below.
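These constructions map directly onto an ML datatype; a hedged sketch for exposition only (this is not the representation ml-lex uses internally):

  (* regular expressions over characters, following the grammar above *)
  datatype re =
      Eps                  (* epsilon: the empty string *)
    | Sym of char          (* a single symbol of the alphabet *)
    | Alt of re * re       (* alternation    M | N *)
    | Cat of re * re       (* concatenation  M N   *)
    | Star of re           (* Kleene closure M*    *)

  (* example: (a | b) (a | c) *)
  val example = Cat (Alt (Sym #"a", Sym #"b"), Alt (Sym #"a", Sym #"c"))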

Page 16: COS 320 Compilers

Regular Expressions

• Integers begin with an optional minus sign, continue with a sequence of digits

• Regular Expression: (- | ε) (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)*

Page 17: COS 320 Compilers

Regular Expressions

• Integers begin with an optional minus sign, continue with a sequence of digits

• Regular Expression: (- | ε) (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)*

• So writing (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9) and even worse (a | b | c | ...) gets tedious...

Page 18: COS 320 Compilers

Regular Expressions (REs)

• Common abbreviations:
  – [a-c] == (a | b | c)
  – . == any character except \n
  – \n == the newline character
  – a+ == one or more repetitions of a
  – a? == zero or one occurrence of a

• All abbreviations can be defined in terms of the "standard" REs; for instance, the integer RE from the previous slide shortens as shown below.
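These equivalences, written in the same RE notation as the slides (ε is the empty string), are standard definitions of the abbreviations rather than anything specific to ml-lex:

  (- | ε) (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)*   ==   -? [0-9]*
  a+      ==   a a*
  a?      ==   (a | ε)
  [a-c]   ==   (a | b | c)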

Page 19: COS 320 Compilers

Ambiguous Token Rule Sets

• A single RE is a completely unambiguous specification of a token.
  – call the association of an RE with a token a "rule"

• To lex an entire programming language, we need many rules
  – but ambiguities arise:
    • multiple REs or sequences of REs match the same string
    • hence many token sequences possible

Page 20: COS 320 Compilers

Ambiguous Token Rule Sets

• Example:
  – Identifier tokens: [a-z] [a-z0-9]*
  – Sample keyword tokens: if, then, ...

• How do we tokenize:
  – foobar ==> ID(foobar) or ID(foo) ID(bar)
  – if ==> ID(if) or IF

Page 21: COS 320 Compilers

Ambiguous Token Rule Sets

• We resolve ambiguities using two conventions:
  – Longest match: the regular expression that matches the longest string takes precedence.
  – Rule priority: the regular expressions identifying tokens are written down in sequence. If two regular expressions match the same (longest) string, the first regular expression in the sequence takes precedence.

Page 22: COS 320 Compilers

Ambiguous Token Rule Sets

• Example:
  – Identifier tokens: [a-z] [a-z0-9]*
  – Sample keyword tokens: if, then, ...

• How do we tokenize:
  – foobar ==> ID(foobar) or ID(foo) ID(bar)
    • use longest match to disambiguate
  – if ==> ID(if) or IF
    • keyword rules have higher priority than the identifier rule

A sketch of this disambiguation logic appears below.
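To make the two conventions concrete, here is a hedged ML sketch of how a lexer could pick which rule fires at a given position; the rule representation is invented for illustration and is not how ml-lex is implemented:

  (* a rule = a matcher returning the length of the longest prefix it matches
     at the given position (NONE if it matches nothing), plus an action to run
     on the matched text *)
  type 'tok rule = (string * int -> int option) * (string -> 'tok)

  (* longest match wins; ties go to the earliest rule in the list (rule priority) *)
  fun choose (rules : 'tok rule list) (input : string, pos : int) =
      let
        fun best ((matches, act), acc) =
            case (matches (input, pos), acc) of
                (NONE,   _)           => acc
              | (SOME n, NONE)        => SOME (n, act)
              | (SOME n, SOME (m, _)) =>
                  (* strictly longer only, so an earlier rule keeps ties *)
                  if n > m then SOME (n, act) else acc
      in
        List.foldl best NONE rules
      end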

Page 23: COS 320 Compilers

Lexer Implementation

Implementation Options:

1. Write Lexer from scratch
   – Boring and error-prone

2. Use Lexical Analyzer Generator
   – Quick and easy

ml-lex is a lexical analyzer generator for ML.

lex and flex are lexical analyzer generators for C.

Page 24: COS 320 Compilers

ML-Lex Specification

• Lexical specification consists of 3 parts:

User Declarations (plain ML types, values, functions)

%%

ML-LEX Definitions (RE abbreviations, special stuff)

%%

Rules (association of REs with tokens) (each token will be represented in plain ML)
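Concretely, a minimal do-nothing specification with all three sections might look like this (a sketch for shape only; the "A Simple Lexer" slide later gives a real example):

  type lexresult = unit     (* user declarations: ordinary ML; lexresult is required *)
  fun eof () = ()           (* required: called when the end of input is reached *)
  %%
  DIGITS = [0-9]+;
  %%
  {DIGITS} => (());
  .        => (());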

Page 25: COS 320 Compilers

User Declarations

• User Declarations:
  – User can define various values that are available to the action fragments.
  – Two values must be defined in this section:
    • type lexresult
      – type of the value returned by each rule action.
    • fun eof ()
      – called by the lexer when the end of the input stream is reached.

Page 26: COS 320 Compilers

ML-LEX Definitions

• ML-LEX Definitions:
  – User can define regular expression abbreviations:

      DIGITS = [0-9]+;
      LETTER = [a-zA-Z];

  – Define multiple lexers to work together. Each is given a unique name:

      %s LEX1 LEX2 LEX3;

Page 27: COS 320 Compilers

Rules

• Rules:

• A rule consists of a pattern and an action:
  – Pattern is a regular expression.
  – Action is a fragment of ordinary ML code.
  – Longest match & rule priority are used for disambiguation.

• Rules may be prefixed with the list of lexers that are allowed to use this rule.

<lexer_list> regular_expression => (action.code) ;

Page 28: COS 320 Compilers

Rules

• Rule actions can use any value defined in the User Declarations section, including
  – type lexresult
    • type of value returned by each rule action
  – val eof : unit -> lexresult
    • called by the lexer when the end of the input stream is reached

• Special variables:
  – yytext: the input substring matched by the regular expression
  – yypos: the file position of the beginning of the matched string
  – continue (): doesn't return a token; recursively calls the lexer

A small example using yytext, yypos, and continue () is sketched below.
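For instance, a catch-all rule for unexpected characters might combine all three; this is a hedged sketch of one possible rule, not part of the course's lexer:

  . => (print ("Illegal character '" ^ yytext ^ "' at position "
               ^ Int.toString yypos ^ "\n");
        continue ());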

Page 29: COS 320 Compilers

A Simple Lexer

datatype token = Num of int | Id of string | IF | THEN | ELSE | EOF
type lexresult = token      (* mandatory *)
fun eof () = EOF            (* mandatory *)
fun itos s = case Int.fromString s of SOME x => x | NONE => raise Fail "bad int"

%%

NUM = [1-9][0-9]*;
ID = [a-zA-Z]([a-zA-Z]|{NUM})*;

%%

if => (IF);
then => (THEN);
else => (ELSE);
{NUM} => (Num (itos yytext));
{ID} => (Id yytext);
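Once this file is run through ml-lex and the output is compiled, the lexer is available as an ML structure (named Mlex by default); the code below is a hedged sketch of typical usage, not part of the slides:

  (* feed the generated lexer characters from an in-memory string;
     makeLexer expects a reader that returns at most n characters,
     and "" to signal end of input *)
  fun lexString (s : string) =
    let
      val pos = ref 0
      fun read n =
        let val k = Int.min (n, String.size s - !pos)
            val chunk = String.substring (s, !pos, k)
        in  pos := !pos + k; chunk  end
    in
      Mlex.makeLexer read   (* a function of type unit -> token; call it repeatedly *)
    end

  (* val nextToken = lexString "if"
     val tok = nextToken ()    returns IF; a later call returns EOF  *)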

Page 30: COS 320 Compilers

Using Multiple Lexers

• Rules prefixed with a lexer name are matched only when that lexer is executing

• The initial lexer is called INITIAL
• Enter a new lexer using:
  – YYBEGIN LEXERNAME;

• Aside: sometimes it is useful to process characters but not return any token from the lexer. Use:
  – continue ();

Page 31: COS 320 Compilers

Using Multiple Lexers

type lexresult = unit    (* mandatory *)
fun eof () = ()          (* mandatory *)

%%

%s COMMENT

%%

<INITIAL> if => ();
<INITIAL> [a-z]+ => ();
<INITIAL> "(*" => (YYBEGIN COMMENT; continue ());
<COMMENT> "*)" => (YYBEGIN INITIAL; continue ());
<COMMENT> "\n" | . => (continue ());

Page 32: COS 320 Compilers

A (Marginally) More Exciting Lexer

type lexresult = string                          (* mandatory *)
fun eof () = (print "End of file\n"; "EOF")      (* mandatory *)

%%

%s COMMENT

INT = [1-9] [0-9]*;

%%

<INITIAL> if => ("IF");
<INITIAL> then => ("THEN");
<INITIAL> {INT} => ("INT(" ^ yytext ^ ")");
<INITIAL> "(*" => (YYBEGIN COMMENT; continue ());
<COMMENT> "*)" => (YYBEGIN INITIAL; continue ());
<COMMENT> "\n" | . => (continue ());

Page 33: COS 320 Compilers

Implementing ML-Lex

• By compiling, of course:
  – convert REs into non-deterministic finite automata
  – convert non-deterministic finite automata into deterministic finite automata
  – convert deterministic finite automata into a blazingly fast table-driven algorithm

• You did most of this, except possibly the last step, in your favorite algorithms class
  – need to deal with disambiguation & rule priority
  – need to deal with multiple lexers

Page 34: COS 320 Compilers

Refreshing your memory: RE ==> NDFA ==> DFA

Lex rules:

if => (Tok.IF);
[a-z][a-z0-9]* => (Tok.Id);

Page 35: COS 320 Compilers

Refreshing your memory: RE ==> NDFA ==> DFA

Lex rules:

if => (Tok.IF);
[a-z][a-z0-9]* => (Tok.Id);

NDFA:

[NFA diagram: state 1 --i--> state 2 --f--> state 3 (accepting, Tok.IF); state 1 --a-z--> state 4; state 4 --a-z0-9--> state 4 (accepting, Tok.Id)]

Page 36: COS 320 Compilers

Refreshing your memory: RE ==> NDFA ==> DFA

Lex rules:

if => (Tok.IF);
[a-z][a-z0-9]* => (Tok.Id);

NDFA: [diagram as on the previous slide]

DFA:

[DFA diagram: state 1 --i--> state {2,4}; state 1 --a-hj-z--> state 4; state {2,4} --f--> state {3,4}; state {2,4} --a-eg-z0-9--> state 4; state {3,4} --a-z0-9--> state 4; state 4 --a-z0-9--> state 4. Accepting states: {2,4} Tok.Id; {3,4} Tok.IF (could be Tok.Id; decision made by rule priority); 4 Tok.Id]

Page 37: COS 320 Compilers

Table-driven algorithm

• DFA:

[DFA diagram: state 1 --i--> state {2,4}; state 1 --a-hj-z--> state 4; state {2,4} --f--> state {3,4}; state {2,4} --a-eg-z0-9--> state 4; state {3,4} --a-z0-9--> state 4; state 4 --a-z0-9--> state 4. Accepting states: {2,4} Tok.Id; {3,4} Tok.IF; 4 Tok.Id]

Page 38: COS 320 Compilers

Table-driven algorithm

• DFA (states conveniently renamed):

[DFA diagram: S1 --i--> S2; S1 --a-hj-z--> S4; S2 --f--> S3; S2 --a-eg-z0-9--> S4; S3 --a-z0-9--> S4; S4 --a-z0-9--> S4. Accepting states: S2 Tok.Id; S3 Tok.IF; S4 Tok.Id]

Page 39: COS 320 Compilers

Table-driven algorithm

• DFA: [diagram as on the previous slide]

Transition Table:

          a    b    ...   i    ...
    S1    S4   S4         S2
    S2    S4   S4         S4
    S3    S4   S4         S4
    S4    S4   S4         S4

Page 40: COS 320 Compilers

Table-driven algorithm

• DFA: [diagram as on the previous slide]

Transition Table:

          a    b    ...   i    ...
    S1    S4   S4         S2
    S2    S4   S4         S4
    S3    S4   S4         S4
    S4    S4   S4         S4

Final State Table:

          S1    S2        S3        S4
          -     Tok.Id    Tok.IF    Tok.Id

Page 41: COS 320 Compilers

Table-driven algorithm

• DFA: [diagram as on the previous slide]

Transition Table:

          a    b    ...   i    ...
    S1    S4   S4         S2
    S2    S4   S4         S4
    S3    S4   S4         S4
    S4    S4   S4         S4

Final State Table:

          S1    S2        S3        S4
          -     Tok.Id    Tok.IF    Tok.Id

• Algorithm:
  • Start in the start state
  • Transition from one state to the next using the transition table
  • Every time you reach a potential final state, remember it + the position in the stream
  • When no more transitions apply, revert to the last final state seen + its position
  • Execute the associated rule code

An ML sketch of this loop follows below.
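A hedged ML sketch of that loop, specialized to the four-state DFA above; the trans and final functions stand in for the tables and are written out by hand here (they are illustrative, not ml-lex's generated code):

  datatype tok = IF | Id of string

  (* transition function for the DFA above; 0 is a dead state *)
  fun trans (1, c) = if c = #"i" then 2
                     else if Char.isAlpha c then 4 else 0
    | trans (2, c) = if c = #"f" then 3
                     else if Char.isAlpha c orelse Char.isDigit c then 4 else 0
    | trans (s, c) = if (s = 3 orelse s = 4) andalso
                        (Char.isAlpha c orelse Char.isDigit c) then 4 else 0

  (* final-state table: which rule fires in each state, if any *)
  fun final 2 = SOME (fn s => Id s)
    | final 3 = SOME (fn _ => IF)    (* IF chosen over Id by rule priority *)
    | final 4 = SOME (fn s => Id s)
    | final _ = NONE

  (* maximal munch: scan forward remembering the last accepting state seen;
     when stuck, back up to it and run its rule on the matched text *)
  fun nextToken (input, start) =
    let
      fun scan (state, i, last) =
        let val last = case final state of
                           SOME act => SOME (i, act)
                         | NONE     => last
        in
          if i < String.size input andalso trans (state, String.sub (input, i)) <> 0
          then scan (trans (state, String.sub (input, i)), i + 1, last)
          else last
        end
    in
      case scan (1, start, NONE) of
          NONE => NONE    (* no rule matches at this position *)
        | SOME (stop, act) =>
            SOME (act (String.substring (input, start, stop - start)), stop)
    end

  (* example: nextToken ("if x", 0) returns SOME (IF, 2) *)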

Page 42: COS 320 Compilers

Dealing with Multiple Lexers

Lex rules:

<INITIAL> if => (Tok.IF);
<INITIAL> [a-z][a-z0-9]* => (Tok.Id);
<INITIAL> "(*" => (YYBEGIN COMMENT; continue ());
<COMMENT> "*)" => (YYBEGIN INITIAL; continue ());
<COMMENT> . => (continue ());

Page 43: COS 320 Compilers

Dealing with Multiple Lexers

Lex rules:

<INITIAL> if => (Tok.IF);
<INITIAL> [a-z][a-z0-9]* => (Tok.Id);
<INITIAL> "(*" => (YYBEGIN COMMENT; continue ());
<COMMENT> "*)" => (YYBEGIN INITIAL; continue ());
<COMMENT> . => (continue ());

[Diagram: two lexers as states; INITIAL --"(*"--> COMMENT; COMMENT --"*)"--> INITIAL; INITIAL loops on if and [a-z][a-z0-9]*; COMMENT loops on .]

Page 44: COS 320 Compilers

Summary

• A Lexer:
  – input: stream of characters
  – output: stream of tokens

• Writing lexers by hand is boring, so we use a lexer generator: ml-lex
  – lexer generators work by converting REs, through automata theory, to efficient table-driven algorithms

• Moral: don't underestimate your theory classes!
  – great application of cool theory developed in the 70s
  – we'll see more cool apps as the course progresses