Compiler


In and Out of Compiler


Compiler Definition

The name "compiler" is primarily used for programs that translate source code from a high level language to a lower level language or compiler is a computer program (or set of programs) that translates text written in a computer language (the source language) into another computer language (the target language). The original sequence is usually called the source code and the output called object code . Commonly the output has a form suitable for processing by other programs (e.g., a linker), but it may be a human readable text file.

Other related terminology

Decompiler: a program that converts from a low-level language to a higher-level one.

Language translator: a program that translates between high-level languages, usually called a source-to-source translator or language converter.

Language rewriter: usually a program that translates the form of expressions without a change of language.

Compiler operations

A compiler is likely to perform many or all of the following operations.

Lexing

Lexical analysis is the processing of an input sequence of characters (such as the source code of a computer program) to produce, as output, a sequence of symbols called lexical tokens, or simply tokens. A lexical analyzer, or lexer for short, can be thought of as having two stages, namely a scanner and an evaluator.

The scanner scans the characters and produces the lexemes from which tokens are built; white space is generally removed at this stage. The evaluator then converts the scanned lexemes into values. These values are passed on to the parser.


Lexical analysis

For many languages, lexical analysis can be performed in a single pass (i.e., no backtracking) by reading a character at a time from the input. This means it is relatively straightforward to automate the generation of programs that perform it, and a number of such generators have been written (e.g., flex). However, most commercial compilers use hand-written lexers because it is possible to integrate much better error handling into them.

A lexical analyzer, or lexer for short, can be thought of as having two stages, namely a scanner and an evaluator. (These are often integrated, for efficiency reasons, so that they operate in parallel.) The first stage, the scanner, is usually based on a finite state machine. It has encoded within it information on the possible sequences of characters that can be contained within any of the tokens it handles (individual instances of these character sequences are known as lexemes). For instance, an integer token may contain any sequence of numerical digit characters. In many cases the first non-whitespace character can be used to deduce the kind of token that follows; the input characters are then processed one at a time until a character is reached that is not in the set of characters acceptable for that token (this is known as the maximal munch rule). In some languages the lexeme creation rules are more complicated and may involve backtracking over previously read characters. A lexeme, however, is only a string of characters known to be of a certain kind (e.g., a string literal, a sequence of letters).
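
To make the maximal munch rule concrete, here is a small, hedged C sketch that scans a single integer lexeme: it keeps consuming digit characters until it meets one that cannot belong to the token, then pushes that character back for the next call. The function and variable names are invented for this illustration and are not from the text above.

#include <ctype.h>
#include <stdio.h>

/* Scan one integer lexeme from `in` using maximal munch:
   consume digits as long as possible, then push back the first
   character that cannot be part of the token. */
static int scan_integer(FILE *in, char *lexeme, size_t cap)
{
    size_t len = 0;
    int ch = fgetc(in);

    if (!isdigit(ch)) {          /* first character decides the token kind */
        ungetc(ch, in);
        return 0;                /* not an integer token */
    }
    while (isdigit(ch) && len + 1 < cap) {
        lexeme[len++] = (char)ch;
        ch = fgetc(in);
    }
    ungetc(ch, in);              /* the non-digit belongs to the next token */
    lexeme[len] = '\0';
    return 1;
}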

In order to construct a token, the lexical analyzer needs a second stage, the evaluator, which goes over the characters of the lexeme to produce a value. The lexeme's type combined with its value is what properly constitutes a token, which can be given to a parser. (Some tokens such as parentheses do not really have values, and so the evaluator function for these can return nothing. The evaluators for integers, identifiers, and strings can be considerably more complex. Sometimes evaluators can suppress a lexeme entirely, concealing it from the parser, which is useful for whitespace and comments.)

For example, in the source code of a computer program the string

net_worth_future = (assets - liabilities);

might be converted (with whitespace suppressed) into the lexical token stream:

NAME "net_worth_future" EQUALS OPEN_PARENTHESIS NAME "assets" MINUS NAME "liabilities" CLOSE_PARENTHESIS SEMICOLON
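
A token, then, pairs a kind with an optional value. The following hedged C sketch shows one possible representation of exactly the stream above; the type and field names are invented for illustration, and tokens such as parentheses carry no value.

/* One possible in-memory form of the token stream shown above. */
enum token_kind { NAME, EQUALS, OPEN_PARENTHESIS, MINUS,
                  CLOSE_PARENTHESIS, SEMICOLON };

struct token {
    enum token_kind kind;
    const char *value;   /* lexeme text for NAME; NULL for tokens with no value */
};

/* net_worth_future = (assets - liabilities); */
static const struct token example[] = {
    { NAME, "net_worth_future" },
    { EQUALS, NULL },
    { OPEN_PARENTHESIS, NULL },
    { NAME, "assets" },
    { MINUS, NULL },
    { NAME, "liabilities" },
    { CLOSE_PARENTHESIS, NULL },
    { SEMICOLON, NULL },
};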


Though it is possible and sometimes necessary to write a lexer by hand, lexers are often generated by automated tools. These tools generally accept regular expressions that describe the tokens allowed in the input stream. Each regular expression is associated with a production in the lexical grammar of the programming language that evaluates the lexemes matching the regular expression. These tools may generate source code that can be compiled and executed, or construct a state table for a finite state machine (which is plugged into template code for compilation and execution).

Regular expressions compactly represent patterns that the characters in lexemes might follow. For example, for an English-based language, a NAME token might be any English alphabetical character or an underscore, followed by any number of instances of any ASCII alphanumeric character or an underscore. This could be represented compactly by the string [a-zA-Z_][a-zA-Z_0-9]*. This means "any character a-z, A-Z or _, followed by 0 or more of a-z, A-Z, _ or 0-9".

Regular expressions and the finite state machines they generate are not powerful enough to handle recursive patterns, such as "n opening parentheses, followed by a statement, followed by n closing parentheses." They are not capable of keeping count and verifying that n is the same on both sides, unless there is a finite set of permissible values for n. It takes a full-fledged parser to recognize such patterns in their full generality. A parser can push parentheses on a stack and then try to pop them off and see if the stack is empty at the end.

The Lex programming tool and its compiler are designed to generate code for fast lexical analysers based on a formal description of the lexical syntax. Lex is not generally considered sufficient for applications with a complicated set of lexical rules and severe performance requirements; for instance, the GNU Compiler Collection uses hand-written lexers.
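
To make the "state table" idea concrete, here is a hedged sketch of a tiny table-driven recognizer in C for the NAME pattern [a-zA-Z_][a-zA-Z_0-9]* discussed above. The state numbering and function names are invented for the example; real generated lexers encode many token classes in one larger table.

#include <ctype.h>

/* States for the NAME pattern [a-zA-Z_][a-zA-Z_0-9]* :
   0 = start, 1 = inside a NAME (accepting), 2 = reject. */
static int next_state(int state, int ch)
{
    int letter = isalpha(ch) || ch == '_';
    int digit  = isdigit(ch);

    switch (state) {
    case 0:  return letter ? 1 : 2;            /* first char must be a letter or '_' */
    case 1:  return (letter || digit) ? 1 : 2; /* then letters, digits or '_' */
    default: return 2;
    }
}

/* Returns the length of the longest NAME prefix of s (maximal munch), or 0. */
static int match_name(const char *s)
{
    int state = 0, i = 0, last_accept = 0;

    while (s[i] != '\0') {
        state = next_state(state, (unsigned char)s[i]);
        if (state == 2)
            break;
        i++;
        if (state == 1)
            last_accept = i;
    }
    return last_accept;
}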

Example lexical analyzer

This is an example of a scanner (written in the C programming language) for the instructional programming language PL/0. The symbols recognized are:

'+', '-', '*', '/', '=', '(', ')', ',', ';', '.', ':=', '<', '<=', '<>', '>', '>='
numbers: 0-9 {0-9}
identifiers: a-zA-Z {a-zA-Z0-9}
keywords: "begin", "call", "const", "do", "end", "if", "odd", "procedure", "then", "var", "while"

External variables used:

FILE *source -- the source file
int cur_line, cur_col, err_line, err_col -- for error reporting
int num -- last number read, stored here for the parser
char id[] -- last identifier read, stored here for the parser
Hashtab *keywords -- list of keywords

External routines called:

error(const char msg[]) -- report an error
Hashtab *create_htab(int estimate) -- create a lookup table
int enter_htab(Hashtab *ht, char name[], void *data) -- add an entry to a lookup table
Entry *find_htab(Hashtab *ht, char *s) -- find an entry in a lookup table
void *get_htab_data(Entry *entry) -- returns data from a lookup table
FILE *fopen(char fn[], char mode[]) -- opens a file for reading
fgetc(FILE *stream) -- read the next character from a stream
ungetc(int ch, FILE *stream) -- put back a character onto a stream
isdigit(int ch), isalpha(int ch), isalnum(int ch) -- character classification

External types:

Symbol -- an enumerated type of all the symbols in the PL/0 language
Hashtab -- represents a lookup table
Entry -- represents an entry in the lookup table

Scanning is started by calling init_scan, passing the name of the source file. If the source file is successfully opened, the parser calls getsym repeatedly to return successive symbols from the source file. The heart of the scanner, getsym, should be straightforward. First, whitespace is skipped. Then the retrieved character is classified. If the character represents a multiple-character symbol, additional processing must be done. Numbers are converted to internal form, and identifiers are checked to see if they represent a keyword.
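
A rough, hedged sketch of what such a getsym routine might look like follows. Only the names of the external variables, routines and types listed above come from the text; the individual Symbol constants (ident, number, plus, becomes, eofsym, nul, and so on), the assumption that the keyword table stores pointers to Symbol values, and the omission of line/column tracking and of init_scan are all simplifications made for this illustration.

/* Sketch of getsym; assumes the external declarations listed above
   are in scope (e.g. via a header), and that Symbol constants such
   as ident, number, plus, minus, times, slash, eql, lparen, rparen,
   comma, semicolon, period, becomes, lss, leq, neq, gtr, geq,
   eofsym and nul exist. Line/column bookkeeping is omitted. */
#include <ctype.h>
#include <stdio.h>

Symbol getsym(void)
{
    int ch;

    /* First, whitespace is skipped. */
    do {
        ch = fgetc(source);
    } while (ch != EOF && isspace(ch));

    if (ch == EOF)
        return eofsym;

    /* Identifiers and keywords: a letter followed by letters or digits. */
    if (isalpha(ch)) {
        int len = 0;
        do {                                 /* no overflow check in this sketch */
            id[len++] = (char)ch;
            ch = fgetc(source);
        } while (isalnum(ch));
        id[len] = '\0';
        ungetc(ch, source);                  /* put back the lookahead character */

        Entry *e = find_htab(keywords, id);  /* keyword or plain identifier? */
        return e ? *(Symbol *)get_htab_data(e) : ident;
    }

    /* Numbers are converted to internal form. */
    if (isdigit(ch)) {
        num = 0;
        do {
            num = num * 10 + (ch - '0');
            ch = fgetc(source);
        } while (isdigit(ch));
        ungetc(ch, source);
        return number;
    }

    /* Single- and multiple-character symbols. */
    switch (ch) {
    case '+': return plus;
    case '-': return minus;
    case '*': return times;
    case '/': return slash;
    case '=': return eql;
    case '(': return lparen;
    case ')': return rparen;
    case ',': return comma;
    case ';': return semicolon;
    case '.': return period;
    case ':':                                /* ':=' needs one more character */
        ch = fgetc(source);
        if (ch == '=') return becomes;
        error("':' must be followed by '='");
        return nul;
    case '<':                                /* '<', '<=' or '<>' */
        ch = fgetc(source);
        if (ch == '=') return leq;
        if (ch == '>') return neq;
        ungetc(ch, source);
        return lss;
    case '>':                                /* '>' or '>=' */
        ch = fgetc(source);
        if (ch == '=') return geq;
        ungetc(ch, source);
        return gtr;
    default:
        error("unexpected character");
        return nul;
    }
}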

Preprocessing

In computer science, a preprocessor is a program that processes its input data to produce output that is used as input to another program. The output is said to be a preprocessed form of the input data, which is often used by some subsequent programs like compilers. The amount and kind of processing done depends on the nature of the preprocessor; some preprocessors are only capable of performing relatively simple textual substitutions and macro expansions, while others have the power of fully fledged programming languages.

A common example from computer programming is the processing performed on source code before the next step of compilation. In some computer languages (e.g., C) there is a phase of translation known as preprocessing.


Lexical preprocessors

Lexical preprocessors are the lowest level of preprocessors, insofar as they only require lexical analysis; that is, they operate on the source text, prior to any parsing, by performing simple substitution of tokenized character sequences for other tokenized character sequences, according to user-defined rules.

They typically perform macro substitution, textual inclusion of other files, and conditional compilation or inclusion.
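
As a small, hedged illustration of those three tasks in C, the snippet below shows macro substitution, textual inclusion and conditional compilation; the macro names and the DEBUG symbol are made up for the example.

#include <stdio.h>      /* textual inclusion: the contents of stdio.h are pasted in here */

#define PI 3.14159      /* macro substitution: every later PI becomes 3.14159 */
#define AREA(r) (PI * (r) * (r))

#ifdef DEBUG            /* conditional compilation: kept only if DEBUG is defined */
#define LOG(msg) fprintf(stderr, "%s\n", msg)
#else
#define LOG(msg) ((void)0)
#endif

int main(void)
{
    LOG("computing area");
    printf("%f\n", AREA(2.0));   /* expands to (3.14159 * (2.0) * (2.0)) before parsing */
    return 0;
}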

Parsing

In computer science, parsing is the process of analyzing a sequence of tokens in order to determine its grammatical structure with respect to a given formal grammar. It is formally named syntax analysis.

A parser is a computer program that carries out this task. The name is analogous with the usage in grammar and linguistics.

The term parseable is generally applied to text or data which can be parsed. Parsing transforms input text into a data structure, usually a tree, which is suitable for later processing and which captures the implied hierarchy of the input. Generally, parsers operate in two stages, first identifying the meaningful tokens in the input, and then building a parse tree from those tokens.

Types of parsers

The task of the parser is essentially to determine if and how the input can be derived from the start symbol within the rules of the formal grammar. This can be done in essentially two ways:

Top-down parsing - A parser can start with the start symbol and try to transform it to the input. Intuitively, the parser starts from the largest elements and breaks them down into incrementally smaller parts. LL parsers are examples of top-down parsers.

Bottom-up parsing - A parser can start with the input and attempt to rewrite it to the start symbol. Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on. LR parsers are examples of bottom-up parsers. Another term for this type of parsing is shift-reduce parsing.

Another important distinction is whether the parser generates a leftmost derivation or a rightmost derivation (see context-free grammar). LL parsers will generate a leftmost derivation and LR parsers will generate a rightmost derivation (although usually in reverse).
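
As a hedged illustration of the top-down approach, here is a small hand-written recursive descent recognizer (an LL-style parser) for a toy expression grammar; all names and the grammar itself are invented for the example. Each grammar rule becomes one C function, and the order in which the functions expand rules corresponds to a leftmost derivation.

#include <ctype.h>
#include <stdio.h>

/* Toy grammar:
     expr   -> term   { ('+' | '-') term }
     term   -> factor { ('*' | '/') factor }
     factor -> DIGIT | '(' expr ')'                              */
static const char *p;                 /* current position in the input */
static int ok = 1;

static void expr(void);

static void factor(void)
{
    if (isdigit((unsigned char)*p)) {
        p++;                          /* match a single-digit number */
    } else if (*p == '(') {
        p++;                          /* match '(' */
        expr();
        if (*p == ')') p++; else ok = 0;
    } else {
        ok = 0;                       /* neither alternative applies */
    }
}

static void term(void)
{
    factor();
    while (*p == '*' || *p == '/') { p++; factor(); }
}

static void expr(void)
{
    term();
    while (*p == '+' || *p == '-') { p++; term(); }
}

int main(void)
{
    p = "(1+2)*3";
    expr();
    printf("%s\n", (ok && *p == '\0') ? "accepted" : "rejected");
    return 0;
}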


Examples of parsers

Top-down parsers

Some of the parsers that use top-down parsing include:

Recursive descent parser
LL parser
Packrat parser
Unger parser
Tail recursive parser

Bottom-up parsers

Some of the parsers that use bottom-up parsing include:

Precedence parsing
BC (bounded context) parsing
LR parser
 o SLR parser
 o LALR parser
 o Canonical LR parser
 o GLR parser
Earley parser
CYK parser

Parsing concepts

Chart parser
Compiler-compiler
Deterministic parsing
Lexing
Shallow parsing

Parser development software

ANTLR
Bison
Coco/R
DMS Software Reengineering Toolkit
GOLD
JavaCC
Lemon Parser
Lex
LRgen
Rebol
SableCC
Spirit Parser Framework
Yacc

Semantic analysis

In linguistics, semantic analysis is the process of relating syntactic structures, from the levels of phrases, clauses, sentences, and paragraphs to the level of the text as a whole, to their language-independent meanings, removing features specific to particular linguistic and cultural contexts, to the extent that such a project is possible. The elements of idiom and figurative speech, being cultural, must also be converted into relatively invariant meanings.

Semantic analysis is a pass by a compiler that adds semantic information to the parse tree and performs checks based on this information. It logically follows the parsing phase, in which the parse tree is generated, and logically precedes the code generation phase, in which executable code is generated. (In a compiler implementation, it may be possible to fold different phases into one pass.) Typical examples of semantic information that is added and checked are typing information (type checking) and the binding of variable and function names to their definitions (object binding). Filling out entries of the symbol table is an important activity in this phase.
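
A hedged sketch of the kind of check such a pass performs: it walks a parse tree bottom-up, annotates each node with a type, and rejects an addition whose operands do not match. The node structure and type names are invented for the example; leaf types would come from the symbol table in a real compiler.

#include <stdio.h>

enum type { TYPE_INT, TYPE_STRING, TYPE_ERROR };

struct node {
    char op;                    /* '+' for addition, 0 for a leaf */
    enum type type;             /* filled in by semantic analysis */
    struct node *left, *right;
};

/* Walk the parse tree bottom-up and add typing information.
   Leaves are assumed to have their type set already; operators
   are checked against the types of their operands. */
static enum type check(struct node *n)
{
    if (n->op == 0)
        return n->type;                       /* leaf: literal or variable */

    enum type lt = check(n->left);
    enum type rt = check(n->right);

    if (n->op == '+' && lt == TYPE_INT && rt == TYPE_INT)
        n->type = TYPE_INT;                   /* int + int is well typed */
    else {
        fprintf(stderr, "type error at '+'\n");
        n->type = TYPE_ERROR;
    }
    return n->type;
}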

Code optimizations

Compiler optimization is the process of tuning the output of a compiler to minimize some attribute (or maximize the efficiency) of an executable program. The most common requirement is to minimize the time taken to execute a program; a less common one is to minimize the amount of memory occupied, and the growth of portable computers has created a market for minimizing the power consumed by a program. It has been shown that some code optimization problems are NP-complete. In practice, factors such as the programmer's willingness to wait for the compiler to complete its task place upper limits on the optimizations that a compiler implementor might provide (optimization is a very CPU- and memory-intensive process). In the past, computer memory limitations were also a major factor in limiting which optimizations could be performed.

Compiler vendors often advertise their products as optimizing compilers, and the ability of a compiler to optimize code can affect its sales and its reputation among programmers.
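
One of the simplest optimizations, constant folding, evaluates at compile time any expression whose operands are already known. The hedged C sketch below applies it to a small expression tree; the structure and function names are invented for the example, and freeing of the discarded child nodes is omitted.

#include <stddef.h>

/* Constant folding: if both operands of '+' are constants,
   replace the addition node by a single constant node. */
struct expr {
    char op;            /* '+' for addition, 'k' for a constant leaf */
    int value;          /* meaningful only for constants */
    struct expr *left, *right;
};

static void fold(struct expr *e)
{
    if (e == NULL || e->op == 'k')
        return;
    fold(e->left);
    fold(e->right);
    if (e->op == '+' && e->left->op == 'k' && e->right->op == 'k') {
        e->value = e->left->value + e->right->value;   /* computed at compile time */
        e->op = 'k';                                   /* node is now a constant */
        e->left = e->right = NULL;                     /* children no longer needed */
    }
}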

Code generation

In code generation, the compiler translates the analysed program into the target language, for example machine code or assembly for the target processor.

Programming languages

The most common use of parsers is to parse computer programming languages. These have simple grammars with few exceptions. Parsers for programming languages tend to be based on context-free grammars because fast and efficient parsers can be written for them. However, context-free grammars are limited in their expressiveness because they can describe only a limited set of languages. Informally, the reason is that the memory of such a language is limited. The grammar cannot remember the presence of a construct over an arbitrarily long input; this is necessary for a language in which, for example, a name must be declared before it may be referenced. More powerful grammars, however, cannot be parsed efficiently. Thus, it is a common strategy to create a relaxed parser for a context-free grammar which accepts a superset of the desired language constructs (that is, it accepts some invalid constructs); later, the unwanted constructs can be filtered out. It is usually easy to define a context-free grammar which includes all desired language constructs; on the other hand, it is often impossible to create a context-free grammar which admits only the desirable constructs. Parsers are usually not written by hand but are generated by parser generators.

Overview of process

The following example demonstrates the common case of parsing a computer language with two levels of grammar: lexical and syntactic.

The first stage is token generation, or lexical analysis, by which the input character stream is split into meaningful symbols defined by a grammar of regular expressions. For example, a calculator program would look at an input such as "12*(3+4)^2" and split it into the tokens 12, *, (, 3, +, 4, ), ^ and 2, each of which is a meaningful symbol in the context of an arithmetic expression. The lexer would contain rules to tell it that the characters *, +, ^, ( and ) mark the start of a new token, so meaningless tokens like "12*" or "(3" will not be generated.

The next stage is syntactic parsing or syntax analysis, which checks that the tokens form an allowable expression. This is usually done with reference to a context-free grammar which recursively defines components that can make up an expression and the order in which they must appear. However, not all rules defining programming languages can be expressed by context-free grammars alone, for example type validity and proper declaration of identifiers. These rules can be formally expressed with attribute grammars.

The final phase is semantic parsing or analysis, which is working out the implications of the expression just validated and taking the appropriate action. In the case of a calculator, the action is to evaluate the expression; a compiler, on the other hand, would generate code. Attribute grammars can also be used to define these actions.
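
A hedged sketch combining the three stages on the calculator example above: grouping digits into numbers stands in for the lexical level, the recursive functions mirror the grammar, and the "appropriate action" is to evaluate as the parse proceeds. All names are invented for the example, and ^ is treated as right-associative exponentiation binding tighter than *.

#include <ctype.h>
#include <stdio.h>

/* Grammar (lowest precedence first):
     expr   -> term  { '+' term }
     term   -> power { '*' power }
     power  -> factor [ '^' power ]        right-associative
     factor -> NUMBER | '(' expr ')'                            */
static const char *p;

static long expr(void);

static long factor(void)
{
    if (isdigit((unsigned char)*p)) {
        long n = 0;
        while (isdigit((unsigned char)*p))      /* token generation: "12" -> 12 */
            n = n * 10 + (*p++ - '0');
        return n;
    }
    if (*p == '(') { p++; long v = expr(); if (*p == ')') p++; return v; }
    return 0;                                   /* error handling omitted */
}

static long power(void)
{
    long base = factor();
    if (*p == '^') {                            /* right-associative */
        p++;
        long exp = power(), r = 1;
        while (exp-- > 0) r *= base;
        return r;
    }
    return base;
}

static long term(void)
{
    long v = power();
    while (*p == '*') { p++; v *= power(); }
    return v;
}

static long expr(void)
{
    long v = term();
    while (*p == '+') { p++; v += term(); }
    return v;
}

int main(void)
{
    p = "12*(3+4)^2";
    printf("%ld\n", expr());                    /* prints 588 */
    return 0;
}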

Types of Compilers

1) Native vs. cross environment
2) One-pass vs. multi-pass

Native compilers: The compilers available on a host machine are intended for that machine's particular processor; e.g., Windows NT is intended for Pentium processors, so programs compiled on it are understandable to the Pentium. Such a compiler is called a native compiler.

Cross compilers: If the target processor changes, say to an 8051, Motorola 68000, MIPS or ARM, then programs compiled on the host machine are not understandable to those processors. A compiler that runs on the host machine but produces binary code understandable to another processor is called a cross compiler.

Cross assembler: As the name suggests, it works on the host machine but generates machine or binary code for the target. The input is assembly language instead of C or any other high-level language.

Some facts about cross compiling

A program that runs perfectly on the host need not work correctly on the target; this may be due to one of the following reasons:

 o The int declaration on the host may differ from that on the target (a sketch appears below)
 o Structures may be packed differently on the two machines
 o Memory accessing methods may differ between the target and the host

These differences may also give you a different set of warnings for the same program on different machines.

The ability to compile in a single pass is often seen as a benefit because it simplifies the job of writing a compiler, and one-pass compilers are generally faster than multi-pass compilers.
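
Returning to the first point in the list above: a common defensive measure, shown here as a hedged C example, is to use fixed-width types from <stdint.h> instead of relying on the host's idea of int, so that host and target agree on sizes. The structure here is hypothetical and only for illustration; note that structure packing can still differ between compilers.

#include <stdint.h>
#include <stdio.h>

/* On one machine int may be 16 bits, on another 32 bits, so a program
   that compiles cleanly on the host can overflow on the target.
   Fixed-width types pin the size down explicitly. */
struct sensor_reading {        /* hypothetical structure for illustration */
    uint8_t  channel;
    int32_t  value;            /* always 32 bits, on host and target alike */
    uint16_t flags;
};

int main(void)
{
    printf("int is %zu bits here, int32_t is always 32\n",
           8 * sizeof(int));
    printf("struct sensor_reading occupies %zu bytes on this machine\n",
           sizeof(struct sensor_reading));   /* packing may still differ! */
    return 0;
}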