Tokenizing - Computer Science and...

12
Tokenizing 19 March 2013 OSU CSE 1

Transcript of Tokenizing - Computer Science and...

Page 1: Tokenizing - Computer Science and Engineeringweb.cse.ohio-state.edu/.../web-sw2/extras/slides/26.Tokenizing.pdf · Aside: Characters vs. Tokens • In the examples of CFGs, we dealt

Tokenizing

19 March 2013 OSU CSE 1

Page 2: Tokenizing - Computer Science and Engineeringweb.cse.ohio-state.edu/.../web-sw2/extras/slides/26.Tokenizing.pdf · Aside: Characters vs. Tokens • In the examples of CFGs, we dealt

BL Compiler Structure

19 March 2013 OSU CSE 2

Code Generator Parser Tokenizer

string of characters

(source code)

string of tokens

(“words”)

abstract program

string of integers

(object code)

The tokenizer is relatively easy.

Page 3: Tokenizing - Computer Science and Engineeringweb.cse.ohio-state.edu/.../web-sw2/extras/slides/26.Tokenizing.pdf · Aside: Characters vs. Tokens • In the examples of CFGs, we dealt

Aside: Characters vs. Tokens

•  In the examples of CFGs, we dealt with languages over the alphabet of individual characters (e.g., Java’s char values) Σ = character

•  Now, we deal with languages over an alphabet of tokens, each of which is a unit that you want to consider as a single entity in the language – Choice of tokens is a design decision

19 March 2013 OSU CSE 3

Page 4: Tokenizing - Computer Science and Engineeringweb.cse.ohio-state.edu/.../web-sw2/extras/slides/26.Tokenizing.pdf · Aside: Characters vs. Tokens • In the examples of CFGs, we dealt

Example: Expression CFG

expr → expr add-op term | term term → term mult-op factor | factor factor → ( expr ) | digit-seq add-op → + | - mult-op → * | DIV | REM digit-seq → digit digit-seq | digit digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

19 March 2013 OSU CSE 4

Page 5: Tokenizing - Computer Science and Engineeringweb.cse.ohio-state.edu/.../web-sw2/extras/slides/26.Tokenizing.pdf · Aside: Characters vs. Tokens • In the examples of CFGs, we dealt

Example: Expression CFG

expr → expr add-op term | term term → term mult-op factor | factor factor → ( expr ) | digit-seq add-op → + | - mult-op → * | DIV | REM digit-seq → digit digit-seq | digit digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

19 March 2013 OSU CSE 5

Appropriate tokens for this CFG are “words”

consisting of strings of consecutive terminal

symbols (characters) that “belong together”, e.g., "+", "DIV", "5".

Page 6: Tokenizing - Computer Science and Engineeringweb.cse.ohio-state.edu/.../web-sw2/extras/slides/26.Tokenizing.pdf · Aside: Characters vs. Tokens • In the examples of CFGs, we dealt

Tokenizer

•  The job of the tokenizer is to transform a string of characters into a string of tokens

•  Example: –  Input: "4 + (7 DIV 3) REM 5"

19 March 2013 OSU CSE 6

Page 7: Tokenizing - Computer Science and Engineeringweb.cse.ohio-state.edu/.../web-sw2/extras/slides/26.Tokenizing.pdf · Aside: Characters vs. Tokens • In the examples of CFGs, we dealt

Tokenizer

•  The job of the tokenizer is to transform a string of characters into a string of tokens

•  Example: –  Input: "4 + (7 DIV 3) REM 5"

19 March 2013 OSU CSE 7

characters used as terminal symbols of the language

Page 8: Tokenizing - Computer Science and Engineeringweb.cse.ohio-state.edu/.../web-sw2/extras/slides/26.Tokenizing.pdf · Aside: Characters vs. Tokens • In the examples of CFGs, we dealt

Tokenizer

•  The job of the tokenizer is to transform a string of characters into a string of tokens

•  Example: –  Input: "4 + (7 DIV 3) REM 5"

19 March 2013 OSU CSE 8

whitespace characters

Page 9: Tokenizing - Computer Science and Engineeringweb.cse.ohio-state.edu/.../web-sw2/extras/slides/26.Tokenizing.pdf · Aside: Characters vs. Tokens • In the examples of CFGs, we dealt

Tokenizer

•  The job of the tokenizer is to transform a string of characters into a string of tokens

•  Example: –  Input: "4 + (7 DIV 3) REM 5"

19 March 2013 OSU CSE 9

Mathematically, input is a string of character

Page 10: Tokenizing - Computer Science and Engineeringweb.cse.ohio-state.edu/.../web-sw2/extras/slides/26.Tokenizing.pdf · Aside: Characters vs. Tokens • In the examples of CFGs, we dealt

Tokenizer

•  The job of the tokenizer is to transform a string of characters into a string of tokens

•  Example: –  Input: "4 + (7 DIV 3) REM 5" – Output: <"4", "+", "(", "7", "DIV", "3", ")", "REM", "5">

19 March 2013 OSU CSE 10

Mathematically, output is a string of string of character

Page 11: Tokenizing - Computer Science and Engineeringweb.cse.ohio-state.edu/.../web-sw2/extras/slides/26.Tokenizing.pdf · Aside: Characters vs. Tokens • In the examples of CFGs, we dealt

Another Example: BL

•  In BL, tokens can be the “words” such as "IF", "next-is-empty", etc.

•  A BL tokenizer is then easy: it can simply treat strings of consecutive whitespace characters as separators between tokens – This makes it easy for the language to allow

line separators, extra spaces and tabs used for indentation, etc., to have no impact on the legality of a program

19 March 2013 OSU CSE 11

Page 12: Tokenizing - Computer Science and Engineeringweb.cse.ohio-state.edu/.../web-sw2/extras/slides/26.Tokenizing.pdf · Aside: Characters vs. Tokens • In the examples of CFGs, we dealt

Resources •  Wikipedia: Lexical Analysis

–  http://en.wikipedia.org/wiki/Lexical_analysis

19 March 2013 OSU CSE 12