Code Generator Translator Architecture Parser Tokenizer string of characters (source code) string of...
-
Upload
darryl-hinchliffe -
Category
Documents
-
view
220 -
download
0
Transcript of Code Generator Translator Architecture Parser Tokenizer string of characters (source code) string of...
CodeGenerator
Translator Architecture
ParserTokenizer
string ofcharacters
(source code)
string oftokens
abstractprogram
string ofintegers
(object code)
A Tokenizing Machine
Another greatmachine!
Ready to Dispense?
In(Chars)
Out(Tokens)
Tokenizing Machine Continued…
Ready toDispense?
In Out
3: Tokenscome outhere
2: Light comes on
1: Charactersgo in here
Tokenizing Machine Continued… Type
BL_Tokenizing_Machine_Kernel is modeled by (
buffer : string of character ready_to_dispense : boolean
) constraint …
Initial Value (empty_string, false)
Tokenizing Machine Continued…
Operations m.Insert (ch) m.Dispense (token_text, token_kind) m.Is_Ready_To_Dispense () m.Flush_A_Token (token_text, token_kind) m.Size ()
A State-Transition View
not readyto dispense
readyto dispense
Insert
Flush_A_Token
Insert
Size
Size
Is_Ready_To_DispenseIs_Ready_To_Dispense
Dispense
Tokenizing BL Programs
Token Types KEYWORD IDENTIFIER CONDITION WHITE_SPACE COMMENT ERROR
A Very Useful Extensionprocedure_body Get_Next_Token ( alters Character_IStream& str, produces Text& token_text, produces Integer& token_kind ){
}
while (not self.Is_Ready_To_Dispense () and not str.At_EOS ()) { object Character ch; str.Read (ch); self.Insert (ch); } if (self.Is_Ready_To_Dispense ()) { self.Dispense (token_text, token_kind); } else { self.Flush_A_Token (token_text, token_kind); }
Another Useful Extensionprocedure_body Get_Next_Non_Separator_Token ( alters Character_IStream& str, produces Text& token_text, produces Integer& token_kind ){
}
self.Get_Next_Token (str, token_text, token_kind); while ((token_kind == WHITE_SPACE) or (token_kind == COMMENT)) { self.Get_Next_Token (str, token_text, token_kind); }
How Does Insert Work?
buffer: “PROGRAM”ready_to_dispense: false
buffer: “PROGRAM ”ready_to_dispense: true
#m:
m:
Here’s anothercharacter.
‘ ’
The Specification of Insert
procedure Insert ( preserves Character ch ) is_abstract; /*! requires self.ready_to_dispense = false ensures self.buffer = #self.buffer * <ch> and self.ready_to_dispense = IS_COMPLETE_TOKEN_TEXT (#self.buffer, ch) !*/
An Important Math Operation
math definition IS_COMPLETE_TOKEN_TEXT (
s: string of character
c: character
): boolean is
( s is in OK_STRINGS and
s * <c> is not in OK_STRINGS ) or
( <c> is in PREFIX (OK_STRINGS) and
s * <c> is not in PREFIX (OK_STRINGS) )
s is a complete“valid” token
c can start a “valid” token, buts*<c> doesn’t start a “valid” token
Other Math Definitions
OK_STRINGS ={s: string of character (IS_KEYWORD (s))} union{s: string of character (IS_IDENTIFIER (s))} union{s: string of character (IS_CONDITION_NAME (s))} union{s: string of character (IS_WHITE_SPACE (s))} union{s: string of character (IS_COMMENT (s))}
PREFIX (s_set) ={x: string of character (there exists y: string of character (x * y is in s_set))}
PREFIX Examples
s_set = {“abc”} PREFIX (s_set) =
{“”, “a”, “ab”, “abc”}
s_set = {“abc”, “de”} PREFIX (s_set) =
{“”, “a”, “ab”, “abc”, “d”, “de”}
Tokenizing Machine: Implementation
Obvious Representation Text buffer_rep Boolean token_ready
Insert (ch)? check if IS_COMPLETE_TOKEN_TEXT
(self[buffer_rep], ch), andset self[token_ready] accordingly
append ch at end of self[buffer_rep]
Tokenizing Machine: Implementation Continued…
Dispense (token_text, token_kind)? set token_text to all but the last
character of self[buffer_rep] set token_kind to the value of
WHICH_KIND (token_text) set self[token_ready] to false
Tokenizing Machine: Implementation Continued…
How do we “check if IS_COMPLETE_TOKEN_TEXT (self[buffer_rep], ch)”?
How do we determine“WHICH_KIND (token_text)”?
How do we do these things quickly?
Making Decisions Quickly
Keep track of the “state” of the buffer by adding one field to the representation: Text buffer_rep Boolean token_ready Integer buffer_state
Possible Buffer States
How many interestingly different buffer states do you think there may be?
Let’s start enumerating them…
Buffer States Continued… Initial state (empty buffer) How many states after inserting
the first character? ‘B’, ‘D’, ‘E’, ‘I’, ‘P’, ‘T’, ‘W’, ‘n’, ‘r’, ‘t’,
identifier (any other letter) white_space (‘ ’, ‘\n’, ‘\t’) comment (‘#’) error (any other character)
Buffer States Continued…
How many states after inserting the second character? “BE”, “DO”, “EL”, “EN”, “IF”, “IN”, “IS”,
“PR”, “TH”, “WH”, “ne”, “ra”, “tr”, identifier (any other id character)
white_space (‘ ’, ‘\n’, ‘\t’) comment (any other character but ‘\n’) error (any character that cannot start a
new “good” token)
A State Transition Diagram:Transitions Out of ‘empty’ Only
B
DE I
PT
W
n
r
t
identifierwhite_space
comment
error
empty
‘B’
‘D’
‘E’ ‘I’‘P’
‘T’
‘W’
‘n’
‘r’
‘t’
any otherletter
‘ ’,’\n’,’\t’‘#’
any othercharacter
Structure of Body of Insertcase_select (self[buffer_state]){ case empty: // case for buffer = empty_string case B: // case for buffer = “B” case D: // case for buffer = “D” case E: // case for buffer = “E” . . . case error: // case for buffer holding an error token}
A Simplified View
Buffer States EMPTY_BS ID_OR_KEYWORD_OR_CONDITION_BS WHITE_SPACE_BS COMMENT_BS ERROR_BS
The State Transition Diagram
EMPTY_BS
ID_OR_KEYWORD_OR_CONDITION_BS
COMMENT_BS
WHITE_SPACE_BS
ERROR_BS
‘ ’, ‘\n’, ‘\t’
‘#’
‘a’..’z’,‘A’..’Z’
any othercharacter
any characterexcept ‘\n’
‘ ’, ‘\n’, ‘\t’
‘a’..’z’,‘A’..’Z’,‘0’..’9’,
‘-’
any character except‘a’..’z’, ‘A’..’Z’, ‘ ’, ‘\n’, ‘\t’, ‘#’
Useful Private Functions Is_White_Space_Character (ch) Is_Digit_Character (ch) Is_Alphabetic_Character (ch) Is_Identifier_Character (ch) Can_Start_Token (ch) Id_Or_Keyword_Or_Condition (t) Buffer_Type (ch) Token_Kind (bs, str)