Prof. Dr. Torsten Grust Database Systems Research Group U ......1.1.3 lex– A Scanner Generator I...

44
Database Languages and their Compilers Prof. Dr. Torsten Grust Database Systems Research Group U Tübingen Winter 2010 © 2010 T. Grust · Database Languages and their Compilers

Transcript of Prof. Dr. Torsten Grust Database Systems Research Group U ......1.1.3 lex– A Scanner Generator I...

  • Database Languages and their Compilers

    Prof. Dr. Torsten Grust

    Database Systems Research GroupU Tübingen

    Winter 2010

    © 2010 T. Grust · Database Languages and their Compilers

  • 1 Query Parsing

    1.1 The Scanner (Lexer)

    I On the most basic level, any query input is nothing but a stream of characters,including whitespace (e.g., ~; �~; �p~);I a scanner (or lexer) drops whitespace, and maps sequences of characters to tokens( returns, calls):

    S~ E~ L~ E~ C~ T~~ : : : ScannerS~::: ParserhSELECTi:::next tokennext harSymbol Table

    Bsymbol?

    I The parser drives the whole scanning and parsing process.© 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 28

  • I Example tokenization:SELECT *

    FROM R

    WHERE A = 42

    � S~ E~ L~ E~ C~ T~~ *~ �~F~ R~ O~ M~~ R~ �~W~ H~ E~ R~ E~~ A~~ =~~ 4~ 2~ �~ scan9 9 K hSELECTi hSTARihFROMi hID; "R"ihWHEREi hID; "A"i hEQi hNUM; 42i

    I The SQL scanner will generate a number of different tokens which can beclassified as follows (with b :: B;n :: N; s :: S)3:Reserved words Symbols Constants IdentifiershSELECTi; hSUMi; hANDi hSTARi; hDOTi; hEQi hBOOL; bi; hNUM; ni; hSTRING; si hID; siI Reserved words (and symbols) make the parsing of a language a lot simpler:� SELECT FROM, SELECT FROM WHERE, SELECT, FROM WHERE SELECT = FROM� The scanner uses a symbol table to recognize reserved words in the input:symbol? ("SELECT") 9 9 Ktrue symbol? ("Mail") 9 9 Kfalse :

    3b, n, and s are known as semantic token values.© 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 29

  • AH

    and-

    Cra

    fted

    Sca

    nner

    whil

    e

    truedo nexthar();

    1if

    2f~;�~;

    p

    ~

    gthen;

    2if

    2f0~:::

    9~

    gthen

    n val();

    nexthar();

    whil

    e

    2f0~:::

    9~

    gdo

    n 10�n+val();

    nexthar();

    pushbak

    (c);

    return

    hNUM;ni;3

    if

    2f.~;,~;=~

    ;~;+~;-~

    ;*~;/~gthen

    s ; if2f~

    gthen nexthar();

    if

    2f=~;>~gthen

    s s�;else

    pushbak();

    return

    lookup(s)

    ;

    4if

    2fa~:::

    z~

    ;A~:::Z~;~

    gthen

    s ; nexthar;

    whil

    e

    2fa~:::

    z~

    ;A~:::Z~;~

    ;0~:::9~g

    do

    s s�; nexthar();

    pushbak

    (c);

    if

    symbol?(s)then

    return

    lookup(s)

    ;else

    return

    hID;si;©

    2010

    T.G

    rust

    ·D

    atab

    ase

    Lan

    guag

    esan

    dth

    eir

    Com

    piler

    s:1.

    Quer

    yPar

    sing

    30

  • ✎Som

    ere

    mar

    kson

    nexthar();pu

    shbak();loo

    kup(s)©

    2010

    T.G

    rust

    ·D

    atab

    ase

    Lan

    guag

    esan

    dth

    eir

    Com

    piler

    s:1.

    Quer

    yPar

    sing

    31

  • I This scanner implementation makes use of recurring patterns of code:� Check for presence of a specific character c~ (or a set of characters):if 2 f~; �~; �p~g then

    . . . ;� Check for a sequence of specific characters:if 2 f ~g then nexthar();

    if 2 f =~; >~g then. . . ;� Check for repetitions of the same character (or set of characters):

    while 2 f a~ : : : z~; A~ : : : Z~;~; 0~ : : : 9~g do. . . ; nexthar();

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 32

  • 1.1.1 Regular Expressions

    ! A Regular Expression Pattern (Reg.Ex.). . .matches a well-defined set of character strings. Any reg.ex. is built byusing the building blocks listed below.

    I Let � be an ordered set of characters (e.g., the ASCII code set). A reg.ex. over �

    takes one of the following forms:

    1○ c~ c~ 2 �, matches the character c~ only;2○ [ c~� c0~℄ c~; c0~ 2 �, matches any character in f c~ : : : c0~g;3○ [^ c~� c0~℄ c~; c0~ 2 �, matches any character not in f c~ : : : c0~g;4○ re re 0 re ; re 0 reg.ex., matches any sequence of strings matched by re and re 0;5○ re� re reg.ex., matches 0 or more repetitions of strings matched by re ;6○ re j re 0 re ; re 0 reg.ex., matches any string matched by re or re 0;7○ " matches the empty string "" only;8○ (re) re reg.ex., matches exactly the strings matched by re .

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 33

  • I Some handy abbreviations are in common use, e.g.:re+ � re re�re ? � re j ": � [ c~� c0~℄ with f c~ : : : c0~g = �[ c~� c0~ d~� d0~℄ � [ c~� c0~℄ j [ d~� d0~℄Example 1.1I A crude reg.ex. to match e-mail addresses like "[email protected]"

    (with alpha � [ a~� z~ A~� Z~~℄, alphanum � alpha j [ 0~� 9~℄):alphanum+ @~ (alphanum+ .~)+ alpha+

    I Matching a floating point literal like "123.45E-10" (with digit � [ 0~� 9~℄):( +~ j -~)? ((digit� .~digit+) j (digit+ ( .~ digit�)?)) ( E~ ( +~ j -~)? digit+)?

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 34

  • ✎ Matching a C-style commentOur SQL processor will accept (and ignore) C-style /*. . . */ comments.How could a suitable reg.ex. look like?

    1.1.2 Regular Expressions for SQLI The set of reg.ex. needed to identify SQL tokens is actually quite simple:Reserved words (and symbols) S~ E~ L~ E~ C~ T~ G~ R~ O~ U~ P~ws+ B~ Y~

  • Rob Pike’s Regular Expression Matcher

    1 /* match: search for regexp anywhere in the text */2 int match (char *regexp, char *text)3 {4 if (regexp[0] == ’^’)5 return matchhere(regexp + 1, text);6 do { /* must look even if string is empty */7 if (matchhere(regexp, text))8 return 1;9 } while (*text++ != ’\0’);10 return 0;11 }12

    13 /* matchhere: search for regexp at the beginning of text */14 int matchhere(char *regexp, char *text)15 {16 if (regexp[0] == ’\0’)17 return 1;18 if (regexp[1] == ’*’)19 return matchstar(regexp[0], regexp + 2, text);20 if (regexp[0] == ’$’ && regexp[1] == ’\0’)21 return *text == ’\0’;22 if (*text != ’\0’ && (regexp[0] == ’.’ || regexp[0] == *text))23 return matchhere(regexp + 1, text + 1);24 return 0;25 }

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 36

  • 1

    2 /* matchstar: leftmost longest search for c*regexp */3 int matchstar(int c, char *regexp, char *text)4 {5 char *t;6

    7 for (t = text; *t != ’\0’ && (*t == c || c == ’.’); t++)8 ;9 do { /* * matches zero or more */10 if (matchhere(regexp, t))11 return 1;12 } while (t-- > text);13 return 0;14 }

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 37

  • I Obviously, reg.ex. provide a flexible tool to concisely specify scanners. Apossible nexttoken() routine:Build a table with entries re i 7! hti; nii; i = 1 : : : ;while nexthar() 6= eof dopushbak();

    Call nexthar() repeatedly and match the input against all re i; if input s matches re j then

    return htj; nji;else

    Call some default action;

    � N.B. nexttoken() repeatedly scans the input for the longest match it can find(consider the SQL reg.exs. O~ R~ and O~ R~ D~ E~ R~ws+ B~ Y~).� In general, the value of nj will depend on the actual text matched against re j

    (consider entry [ 0~� 9~℄+ 7! hNUM; ni ):if input s matches rej then

    return htj; nj(s)ielse

    . . . ;

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 38

  • 1.1.3 lex – A Scanner GeneratorI The scanner generator lex implements this approach:� input: a table of entries re i 7! hti; nii;� output: a C routine yylex() that implements a scanner for the specifiedlanguage.I A lex input file consists of a series of definitions and the actual table entries

    (separated by %%).I In an action, the actual matched input s is available in the variablechar *yytext, it’s length in int yyleng.I A semantic token value has to be assigned to the global variable int yylval.The entry re j 7! htj; nj(s)i, is thus implemented by:re j { yylval = nj(yytext); return tj; }

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 39

  • 1 %{

    2 #include

    3 %}

    4

    5 digit [0-9]

    6 ws [ \t\n]

    7 identifier [a-zA-Z_][a-zA-Z_0-9]*

    8 other .

    9

    10 %%

    11

    12 "’"([^’\\\n]|\\.|\\\n)*"’" { *(yytext+yyleng-1) = ’\0’;

    13 yylval = (int)yytext+1; return STRING; }

    14 {digit}+ yylval = atoi(yytext); return NUM;

    15 "TRUE" | "FALSE" yylval = *yytext == ’T’; return BOOL;

    16 "SELECT" return SELECT;

    17 "GROUP"{ws}+"BY" return GROUPBY;

    18 {identifier} yylval = (int)yytext; return ID;

    19 "/*"([^/]|[^*]"/")*"*"+"/" | {ws}+ ;

    20 {other} return *yytext;

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 40

  • ✎ lex Input FileA few details about the lex input snippet printed on the previous slideand a live demo of the same.

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 41

  • 1.2 Syntax Analysis (Parsing)

    I We will use context-free grammars, a set of rules precisely describing the validqueries, to implement a parser for SQL.I The parser will� implement the syntax check;� generate a parse tree representing the input query structure:

    S~ E~ L~ E~ C~ T~~ : : : ScannerS~::: ParserhSELECTi::: Parse TreeDecoration

    ÆÆ ÆÆ Æparsenext tokennext har

    Symbol TableBreserved?

    I Such parsers are easy to specify, extensible, and may be derived from theassociated grammar in a fully automated fashion.

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 42

  • I Although the scanner emits a linear token sequence, the parser has to reveal theimplicit tree-shaped structure:4� 10 + 2 * 42 scan9 9 K hNUM; 10i h+i hNUM; 2i h*i hNUM; 42i

    The well-known language of mathematics imposes the following structure on

    the above token sequence:h+ihNUM; 10i h*ihNUM; 2i hNUM; 42i 6=h*ih+ihNUM; 10i hNUM; 2ihNUM; 42i

    ✎ Dangling TokensWill there be a problem in determining the tree-shape for the followingquery?

    SELECT *

    FROM R; SELECT *FROM S

    WHERE A = 42

    4The input language’s semantics determines the tree shape for a token sequence.

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 43

  • 1.2.1 Context-Free GrammarsI A list of production rules facilitates query syntax checking as well as parse treeconstruction.� Production rule i: Si0 ! Si1 Si2 : : : Sin� Symbols Sij (j � 0) are either terminal (i.e. tokens) or nonterminal;� the left-hand symbol Si0 is always nonterminal;� Si0 may appear on the right-hand side of its production rule (recursion).I A list of production rules forms a context-free grammar iff� there is at least one rule of the form Si ! : : : for all nonterminal symbols Si

    occurring on the right-hand side of a rule;� one nonterminal symbol is marked as the start symbol(conventionally, S00 is the start symbol).

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 44

  • I The following context-free grammar describes the syntax of (simplified) SQLSELECT–FROM blocks: sf ! hSELECTi proj list hFROMi from listproj list ! exprproj list ! proj list h; i proj listfrom list ! hID; ifrom list ! from list h; i from listexpr ! hID; iexpr ! hNUM; iexpr ! expr h+i exprexpr ! expr h*i exprexpr ! h(i expr h)iI sf ; proj list ; from list ; expr are the nonterminal symbols of this grammar; sf isthe start symbol.

    N.B. The grammar is indifferent to the semantic token values (hNUM; i vs. hNUM; ni).© 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 45

  • 1.2.2 Derivations and Parse TreesI Using this context-free grammar, we can perform a syntax check for a given query(token sequence) q as follows:d S00;while 9 nonterminal symbol Si0 in d do

    Choose any rule Si0 ! Si1 Si2 : : : Sin;Replace Si0 by Si1 Si2 : : : Sin in d;

    return d = q;� If this procedure returns true , q is a syntactically valid SELECT–FROM block;� if we list d’s values before each replacement step we obtain a derivation for q. Choice of RuleFor one grammar and one query q there may be many (or none) possiblederivations.

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 46

  • I One possible derivation for the query below is shown here (::::::::::::::::::::::

    nonterminal::::::::::::::

    symbol

    chosen for next replacement):

    SELECT A; A + 1FROM R; S

    &

    sf

    :::

    &

    hSELECTi proj list hFROMi from list

    ::::::::::::::

    &

    hSELECTi proj list hFROMi from list h; i from list::::::::::::::

    &

    hSELECTi proj list hFROMi from list::::::::::::::

    h; i hID; "S"i&

    hSELECTi proj list

    :::::::::::::

    hFROMi hID; "R"i h; i hID; "S"i&

    hSELECTi proj list h; i proj list:::::::::::::

    hFROMi hID; "R"i h; i hID; "S"i

    &

    hSELECTi proj list h; i expr:::::::

    hFROMi hID; "R"i h; i hID; "S"i

    &

    hSELECTi proj list h; i expr h+i expr:::::::

    hFROMi hID; "R"i h; i hID; "S"i

    &

    hSELECTi proj list h; i expr:::::::

    h+i hNUM; 1i hFROMi hID; "R"i h; i hID; "S"i

    &

    hSELECTi proj list:::::::::::::

    h; i hID; "A"i h+i hNUM; 1i hFROMi hID; "R"i h; i hID; "S"i

    &

    hSELECTi expr:::::::

    h; i hID; "A"i h+i hNUM; 1i hFROMi hID; "R"i h; i hID; "S"ihSELECTi hID; "A"i h; i hID; "A"i h+i hNUM; 1i hFROMi hID; "R"i h; i hID; "S"i

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 47

  • I We can modify our syntax check procedure for a query q to automaticallyproduce a parse tree as a side-effect during derivation:d S00;t S00;while 9 nonterminal symbol Si0 in d do

    Choose any rule Si0 ! Si1 Si2 : : : Sin;Replace Si0 by Si1 Si2 : : : Sin in d;Replace Si0 by Si0Si1 Si2 : : : : : : Sin in t;

    return (d = q; t);

    I These parse trees provide exactly the tree-shaped query structure we wereaiming for in this section.

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 48

  • I The derivation for q shown earlier corresponds to the following parse tree:sfhSELECTi proj listproj listexprhID; "A"ih; i proj listexprexprhID; "A"i h+i exprhNUM; 1i

    hFROMi from listfrom listhID; "R"i h; i from listhID; "S"i

    I For a successfull syntax check, a pre-order traversal of the parse tree’s leaves5equals q.

    5Also known as the tree’s front.

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 49

  • 1.2.3 Ambiguous GrammarsI There are grammars for which we can derive different parse trees for one input;I obviously, this is a bad thing: parsers use parse trees to derive meaning (see thebeginning of our discussion of parsers).� Given the simplified SELECT–FROM block grammar (using expr as the start

    symbol), we can derive two parse trees for the input 10 + 2 * 42:exprexprhNUM; 10i h+i exprexprhNUM; 2i h�i exprhNUM; 42i

    exprexprexprhNUM; 10i h+i exprhNUM; 2ih�i exprhNUM; 42i

    � SQL’s notion of the usual operator precedence is not reflected in the grammar.© 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 50

  • ✎ Comma-Separated ListsHave a closer look at the (sub-)parse tree (start symbol from list) for therelation list R, S, T in the SQL query

    SELECT A

    FROM R; S; T :Any ambiguity here? If there is, do such list-like constructs deserve specialattention?

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 51

  • I We can encode operator precedence in a context-free grammar.1○ Identify the precedence levels and the operators in each level (operators on

    higher levels bind tighter):

    level operator(s)... ...6 * / : : :5 + - : : :... ...

    2○ Introduce new separate production rules for each precedence level:expr ! expr h+i term level 5expr ! expr h-i termexpr ! termterm ! term h�i fator level 6term ! term h=i fatorterm ! fatorfator ! hID; ifator ! hNUM; ifator ! h(i expr h)i

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 52

  • I Leftmost derivation and parse tree for the expression 10 + 2 * 42 (start symbolexpr):&

    expr

    &

    expr h+i term

    &

    fator h+i term

    &

    hNUM; 10i h+i term

    &

    hNUM; 10i h+i term h*i fator

    &

    hNUM; 10i h+i fator h*i fator&

    hNUM; 10i h+i hNUM; 2i h*i fatorhNUM; 10i h+i hNUM; 2i h*i hNUM; 42i

    exprexprfatorhNUM; 10ih+i termtermfatorhNUM; 2i

    h�i fatorhNUM; 42i

    ✎ Precedence captured?How does this grammar transformation capture operator precedence?How about associativity of operators (remember 10 - 20 - 30)?

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 53

  • 1.2.4 Predictive Parsing and Recursive DescentI Once we have constructed a context-free grammar for a language, we can derive aparser program from it using a simple strategy.

    In essence, the production rules for nonterminal symbol Si0 are turned into a singleprocedure Si0(). Si0() is recursive whenever the production rules for S are:

    Si0 ! Si1 Si2 : : : Sin1○ Sij nonterminal: call procedure Sij();2○ Sij terminal hti: call eat(hti).eat(hti):if lookahead = hti thenlookahead nexttoken();elseerror();3○ Parsing starts by executing: lookahead nexttoken(); S00();

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 54

  • I Translate the following excerpt of the SQL grammar into a parser program:...quanti�er ! hFORALLi var hINi oll h:i onditionquanti�er ! hEXISTSi var hINi oll h:i onditionvar ! hID; i

    ...

    Procedure quanti�er():switch lookahead do

    case hFORALLieat(hFORALLi); var (); eat(hINi); oll(); eat(h:i); ondition();case hEXISTSieat(hEXISTSi); var (); eat(hINi); oll(); eat(h:i); ondition();otherwiseerror ();

    Procedure var():eat(hID; i);© 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 55

  • ✎ Excerpt: lex-based predictive recursive descent SQL parserA recursive descent parser (using a lex-built scanner) for a subset of SQLclauses: quanti�er ! hFORALLi var hINi oll h:i onditionquanti�er ! hEXISTSi var hINi oll h:i onditionoll ! table refondition ! var rel varrel ! h=irel ! hirel ! h=irel ! hivar ! hID; itable ref ! hID; i

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 56

  • I Constructing a recursive descent parser for a given grammar seems simpleenough. Let us add the VALUES SQL clause:oll ! table refoll ! oll onstoll onst ! hVALUESi oll elemsoll elems ! onstoll elems ! oll elems h; i onstonst ! hNUMiProcedure oll elems():switch lookahead do

    case hNUMi 1○onst(); case hNUMi 2○oll elems(); eat(h; i); onst();

    otherwiseerror ();© 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 57

  • I Procedure oll elems() faces a dilemma: given only one lookahead token (::::::::::::::

    marked

    below) how can it determine the proper case?

    Input Case: : : VALUES 42::::

    1○: : : VALUES 2::, 10, 42 2○

    ! A single lookahead token must sufficePredictive parsing will only work if knowledge of the next terminal symbol(the lookahead) is sufficient to determine which production rule to usenext.

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 58

  • I Given (the right-hand side of) a production rule Si0 ! Si1 : : : Sin,FIRST (Si1 : : : Sin) is the set of all tokens that can occur as the first token in anytoken sequence derivable from Si1 : : : Sin.� Given this grammar term ! term h�i fatorterm ! term h=i fatorterm ! fatorfator ! hID; ifator ! hNUM; ifator ! h(i term h)i

    we have FIRST (term h�i fator) = fhID; i; hNUM; i; h(ig :

    Computing FIRSTDetermining the FIRST set looks very simple. While computingFIRST (Si1 Si2 Si3) it seems as if Si2 and Si3 can be ignored andFIRST (Si1) is all that matters. This may not be true. (Why?)

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 59

  • I Clearly, if we are given production rulesSi0 ! �1Si0 ! �2

    with hti 2 FIRST (�1) \ FIRST (�2), a predictive parser will not know what to do ifthe lookahead is hti.� Production rules for oll elems :oll elems ! onstoll elems ! oll elems h; i onstonst ! hNUMiFIRST (onst) \ FIRST (oll elems h; i onst) = fhNUMig.

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 60

  • I Observe that the core of our problem of common elements in FIRST sets is theleft-recursion in the production rule for oll elems .� A production rule is said to be left-recursive if it takes the following general

    form: S ! S �1S ! �2where �2 does not start with S.

    ✎ Eliminating left-recursionFortunately, we can get rid of left-recursion in rules all together by asimple rule transformation:S ! S �1S ! �2 9 9 K

    � The modified rules pose no problem for a predictive parser.© 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 61

  • I A similar problem arises when two or more production rules for the samenonterminal S start with the same symbols:sfw ! SELECT proj list FROM from list| {z }=sfw ! z }| {SELECT proj list FROM from list WHERE ondition� Only after the parser has seen the last token of the expansion of from list it

    can decide which rule to choose (lookahead = hWHEREi?).This does not go together with predictive one symbol lookahead parsing.

    ✎ Left-factorizationOnce more we can transform the production rules and make predictiveparsing possible again:S ! �1 �2S ! �1 �3 9 9 K

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 62

  • 1.2.5 yacc – A Parser GeneratorI Once we have come up with a grammar for a language, it is simple enouh toautomatically derive a parser from it. This is exactly what yacc (Yet Another

    Compiler Compiler) provides.I yacc-generated parsers are not predictive but use a parsing technique calledLALR(1)6.I yacc-generated parsers� can deal with left-recursion in rules;� do not need their rules left-factorized;� are able to use operator precedence and associativity tables to resolve conflicts

    during rule selection.

    61-symbol Lookahead, Left-to-Right Parse, Rightmost Derivation, see [1, 2].

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 63

  • I A yacc input file comes in the general formparser declarations (tokens, operator precedence/associativity)%%

    grammar rules

    %%

    C program code

    I Parser declarations:1 %token OR AND EQ ...

    2 %token BOOL ID SELECT FROM WHERE ...

    3

    4 %left OR

    5 %left AND

    6 %nonassoc EQ NEQ LEQ GEQ LT GT

    7 %left ’+’

    8 %left ’*’

    9 %nonassoc NOT

    � Implements this operatorprecedence table:

    level operator(s)1 OR2 AND3 = = < >4 +5 *6 NOT© 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 64

  • I Grammar rules:1 boolean : BOOL

    2 | boolean AND boolean

    3 | boolean OR boolean

    4 | NOT boolean

    5 | comparison

    6 | ’(’ boolean ’)’ ;

    7

    8 comparison : expr rel expr ;

    ✎ Shift or reduce?The parser uses the precedence declarations to resolve conflicts like this:hBOOL; truei hANDi hBOOL; falsei

    �!hORi hBOOL; truei

    (� marks how many tokens the parser has seen so far.)

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 65

  • ✎ yacc Input FileTake a closer look at the yacc input file for our SQL dialect and doublecheck the corresponding parse trees.

    I Parse tree for SQL querySELECT 2 + 3 * A

    FROM R

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 66

  • 1.3 Recommended Reading

    [1] A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques,

    and Tools, Addison Wesley, 1986. ISBN 0-201-10088-6. The “Dragon” Book.

    [2] A. W. Appel, Modern Compiler Implementation in C: Basic Techniques,

    Cambridge University Press, 1997. ISBN 0-521-58653-4.

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 67

  • Contents

    0 Introduction to Query Compilation 1

    0.1 Welcome! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

    0.2 Administrativa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

    0.3 Some Remarks on these Slides . . . . . . . . . . . . . . . . . . . . . . 4

    0.4 Relational Databases and SQL . . . . . . . . . . . . . . . . . . . . . . 6

    0.5 A Guided Tour through an SQL Query Processor . . . . . . . . . . . . 14

    0.5.1 Character Streams and Tokens . . . . . . . . . . . . . . . . . . 15

    0.5.2 Identify Valid SQL Syntax . . . . . . . . . . . . . . . . . . . . . 16

    0.5.3 Resolve the Meaning of Variables and Identifiers . . . . . . . . 17

    0.5.4 Check Types and Schemas . . . . . . . . . . . . . . . . . . . . . 18© 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 68

  • 0.5.5 How Could Query Evaluation Look Like? . . . . . . . . . . . . 19

    0.5.6 One Query, Many Programs . . . . . . . . . . . . . . . . . . . . 20

    0.5.7 Query Operators . . . . . . . . . . . . . . . . . . . . . . . . . . 21

    0.5.8 Query Operator Trees and Rewriting . . . . . . . . . . . . . . . 22

    0.5.9 The Cheaper, the Better: Query Cost Models . . . . . . . . . . 23

    0.5.10 Executing Operator Trees on a Machine . . . . . . . . . . . . . 24

    0.6 Recommended Reading . . . . . . . . . . . . . . . . . . . . . . . . . . 27

    1 Query Parsing 28

    1.1 The Scanner (Lexer) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

    1.1.1 Regular Expressions . . . . . . . . . . . . . . . . . . . . . . . . 33

    1.1.2 Regular Expressions for SQL . . . . . . . . . . . . . . . . . . . 35

    1.1.3 lex – A Scanner Generator . . . . . . . . . . . . . . . . . . . . 39© 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 69

  • 1.2 Syntax Analysis (Parsing) . . . . . . . . . . . . . . . . . . . . . . . . . 42

    1.2.1 Context-Free Grammars . . . . . . . . . . . . . . . . . . . . . . 44

    1.2.2 Derivations and Parse Trees . . . . . . . . . . . . . . . . . . . . 46

    1.2.3 Ambiguous Grammars . . . . . . . . . . . . . . . . . . . . . . . 50

    1.2.4 Predictive Parsing and Recursive Descent . . . . . . . . . . . . 54

    1.2.5 yacc – A Parser Generator . . . . . . . . . . . . . . . . . . . . 63

    1.3 Recommended Reading . . . . . . . . . . . . . . . . . . . . . . . . . . 67

    © 2010 T. Grust · Database Languages and their Compilers: 1. Query Parsing 70