CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… ·...

63
CSC488/2107 Winter 2019 — Compilers & Interpreters https://www.cs.toronto.edu/~csc488h/ Peter McCormick [email protected]

Transcript of CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… ·...

Page 1: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

CSC488/2107 Winter 2019 — Compilers & Interpreters

https://www.cs.toronto.edu/~csc488h/

Peter [email protected]

Page 2: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Agenda• Recognize, Analyze, Transform

• Lexical analysis

• Building lexical analyzers

Page 3: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Recognize Analyze Transform

Frontend Backend

Page 4: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Recognize• Lexical structure

• Syntactic structure

• Highly language/syntax specific

• Data flow:➥ Stream of Characters➥ Stream of Tokens➥ Parse Tree (Concrete Syntax)

Page 5: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Analyze• Semantic meaning

• Less language specific

• Data flow:➥ Parse Tree➥ Abstract Syntax Tree (possibly with annotations and/or associated symbol tables)

Page 6: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Transform (Lower)•Memory layout

•Optimization (optional)

• Code generation

• Very target specific

• Data flow:➥ Abstract Syntax Tree➥ Intermediate Languages/Representations (optional)➥ Target Machine Code

Page 7: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Lexical Analysis

Syntax Analysis

Semantic Analysis

Code Generation

Source Code

Object Code

CharactersTokens

Parse tree

Intermediate Language

Machine code / Bytecode

Page 8: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

C pre-processor

Page 9: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

#include <stdio.h>

Pre-processed:

/* complete contents of stdio.h */

Post-processed:

Page 10: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

#define PI 3.1415 float pi = PI;

Pre-processed:

float pi = 3.1415;

Post-processed:

Page 11: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Lexical Analysis

Syntax Analysis

Semantic Analysis

Code Generation

Source Code

Object Code

CharactersTokens

Parse tree

Intermediate Language

Machine code / Bytecode

Page 12: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Pre-processor Lexical Analysis

Pre-processor Syntax Analysis

Pre-processor Engine

Source Code

CharactersTokens

Parse tree

Characters

C Lexical Analysis

C Syntax Analysis

Tokens

Parse tree

Page 13: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Lexical Analysis

Page 14: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Recognizing the textual building blocks

of source code

Page 15: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

A scanner or lexer converts a stream of characters

into a stream of lexical tokens

Page 16: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Characters• Visual representation (human):

• ASCII characters or Unicode code points

• Physical byte representation:

• Fixed length: 7 bit ASCII, UCS-4

• Variable length: UTF-8/16/32

• Integers to the compiler

Page 17: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Lexical Token•One of a fixed set of distinguishing categories:

• Identifiers

• Reserved identifiers / keywords

• Literal constants: numeric, string

• Special punctuaction (braces, symbols, etc.)

• Comments

• Language specific

Page 18: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Scanner/Lexer•Consumes character input

•Identifies lexical boundaries

•Emits a stream of tokens

•Identifies malformed input and emits errors

•Chooses what to ignore (comments, whitespace)

•Manages additional bookkeeping like source coordinates (input filenames, line and column numbers)

Page 19: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

if x < y { v = 1 }

Page 20: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

if x < y { v = 1 }

Page 21: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

if x<y{v=1}

Page 22: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

if x < y { v = 1 }IF

IDENT xLT

IDENT yLBRACEIDENT v

EQINTEGER 1RBRACE

Page 23: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Careful language design choices can

enable fast scanners

Page 24: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Building lexical analyzers

Page 25: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

struct Token { enum { IF, LT, IDENT, LITERAL, … } type; union { char *ident; int literal; }; // more bookkeeping };

Page 26: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

data Token = If | Lt | Ident String | Literal Integer | …

Page 27: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Idea: Use finite automata (state machines) to recognize tokens out of a stream of characters

Page 28: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Example: Addition expressions

Page 29: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Example expressions

1 123+456

1+2+3+456

Page 30: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Lexical structure• 2 token types

•Plus

•Positive integer literal

•No whitespace handling

Page 31: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Σ — Vocabulary

Σ = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, + }

Page 32: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Represent finite automata (state machines) using a state transition diagrams

Page 33: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

State transition diagram: Plus

Start Emit Plus

+

Page 34: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

State transition diagram: Positive integer literals

Start XEmit

Literal

1…9

0…9

…else…

Page 35: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Start X Emit Literal

1…9

0…9

…else…

Start Emit Plus

+

Page 36: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

X

Emit Literal

1…9

0…9

Start

Emit Plus

+

λ

λ

Non-deterministic finite automata (NFA)

Page 37: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

XEmit

Literal1…9

0…9

Start

Emit Plus

+

Deterministic finite automata (DFA)

+

Page 38: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Table driven DFA

Input \

State+ 0 1 2 3 4 5 6 7 8 9 Action

S T error U U U U U U U U U

T S+ Emit Plus

U T U+ U+ U+ U+ U+ U+ U+ U+ U+ U+ Emit Literal

Notation: V to change state, V+ to change while consuming 1 input character

Page 39: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

while True: c = curInput() if state is START: if c == '+': emitPlus() nextInput() elif c in digits19: save(c) state = LITERAL nextInput() else: error() elif state is LITERAL: if c in digits09: save(c) nextInput() else: emitLiteral(getSaved()) resetSaved() state = START if c is EOF: break

Page 40: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Regular Expressions• A regular expression is a rigorous mathematic statement

defining the members of a regular set

• Very compact means of specifying the structure of lexical tokens

Page 41: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Notation & Definitions

Page 42: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Let Ø be the empty set

Page 43: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Let Σ be a finite set of distinguished characters

(the vocabulary)

May use quote marks to avoid confusion: Σ = { ‘{‘, ‘}’, ‘,’ }

Page 44: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

A string is defined inductively by cases:

1. The empty or null string, denoted λ

•Ø ≠ λ

2. A character from Σ is itself a string

3. The concatenation of two strings is a string

• For any strings S and T, both S T and T S are strings

• For any string S, λ S = S λ = S

Page 45: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Ø is also a regular expression denoting the empty set

Page 46: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Any string S is a regular expression, denoting the set containing that

string

Page 47: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Forming regular expressionsFor any two regular expressions A and B, the following are also regular expressions:

1. Alternation: A | B

• Set union

2. Concatenation: A B

• Set of all strings formed by the concatenation of any string from A and any string from B

3. Kleene Closure: A*

• Zero or more concatenations of A

4. Parenthesis: ( A )

• Disambiguation

Page 48: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Useful shorthands1. Positive Closure: A+

• A A* (one or more concatenations)

2. Optional: A?

• A | λ (zero or one A)

3. Complement: Not(A)

• Match anything from Σ that does not match A

4. Character ranges: [ “A” … “Z” ]

• “A” | “B” | … | “Y” | “Z”

• When it’s clear what the “…” ranges over

Page 49: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Examples

Page 50: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Addition expressions tokens as regular expressions

Σ = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, + }

Digits19 = ( 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )Digits = ( 0 | Digits19 )Plus = ( + )Literal = ( Digits19 Digits* )

Token = Plus | Literal

Page 51: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

More examples (1)

Digit = “0” … “9” Letter = “a” … “z” | “A” … “Z” Identifier = ( Letter | “_” ) ( Letter | Digit | “_”)*

Page 52: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

More examples (2)

Digit1 = “1” … “9” Digit = “0” | Digit1 HexDigit = Digit | “a” … “f” | “A” … “F” DecLiteral = Digit1 Digit* HexLiteral = “0” ( “x” | “X” ) HexDigit+ Literal = (“-“)? DecLiteral | HexLiteral

Page 53: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

More examples (3)

EOL = ‘\r’ | ‘\n’ | ‘\r’ ‘\n’ PythonComment = ‘#’ Not(EOL)* EOL CComment = ‘/‘ ‘*’ Not(‘*’ ‘/‘)* ‘*’ ‘/‘

Page 54: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Great… but can we use them in a lexer?

Page 55: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

By Thompson’s construction, a regular expression can always

be converted into an NFA

Page 56: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

NFA’s are equivalent to DFA’s

It’s always possible to convert an NFA into a DFA

Page 57: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

DFA scanners can be implemented extremely

efficiently

See re2c for an even faster, code generation approach

Page 58: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Scanner development options

• Write by hand

• Use scanner generator tools

• Provide a specially formatted definition file containing regular expressions and code fragments

• Generates source code implementing the scanner

• GNU Flex, ANTLR, PLY (Python Lex-Yacc), Ragel, re2c

• Use built-in language support for regular expressions

Page 59: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

scanner.py

Page 60: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

if x < y { v = 1 }IF

IDENT xLT

IDENT yLBRACEIDENT v

EQINTEGER 1RBRACE

Page 61: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

import re

SPEC = r''' (?P<IDENT> [_a-zA-Z] [_a-zA-Z0-9]* ) | (?P<NUMBER> [-]? [1-9] [0-9]* ) | (?P<LT> < ) | (?P<EQ> = ) | (?P<LBRACE> { ) | (?P<RBRACE> } ) | (?P<WS> \s+ ) '''

lex = re.compile(SPEC, re.VERBOSE).match

Page 62: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

Next Week• Syntax analysis & parsing

Page 63: CSC488/2107 Winter 2019 — Compilers & Interpreterspdm/csc488/winter2019/lectures/week… · Scanner development options • Write by hand • Use scanner generator tools • Provide

No tutorial on Tuesday Jan 22