Ply py con 2014 - online version

PLY compilers in Python Robert Szefler PyCon PL 2014


PLY: compilers in Python

Robert Szefler, PyCon PL 2014

Warm-up question

What is a lathe?

PLY: compilers in Python

A lathe is a drill working horizontally.


Actually, it’s much, much more complex, delicate and precise than a simple drill.

Learning to use a lathe will take you weeks; in contrast, you can just plug in a drill and start drilling right away.

So, why would you want to use a lathe?


Obviously: to create some awesome stuff.

Which means making decent money.


Stuff you can make with a lathe:

Actually, almost every physical object that is produced today is either at some point processed on a lathe, or the tools to make that object are made on a lathe.


Stuff you can make with a drill:

Yeah, you can impress your partner with that. But good luck basing your career on that.

What does that have to do with Python and compilers?


Most of us know regular expressions.


Simple to use and quite functional.

If you hack hard enough, you can solve quite a few problems with them.


Just like with a drill.


Sometimes you have to hack them very hard indeed. And it starts to feel like it’s not exactly the best solution any more.


When you think about fixing your drill in a horizontal position…


or parsing XML or JSON with regular expressions…

or parsing mathematical expressions with regular expressions…


then you are in dire need of a better tool…

a proper lathe, or maybe some context-free grammars.

A context-free what?


Regular expressions are really definitions of languages - regular languages.

Languages are just sets of words. And words are sequences of symbols (say, characters).

An example:

[ab]+

defines the following language

{a, b, aa, ab, ba, bb, aaa, aab, aba, abb, …}

Note the ellipsis. Most (all?) interesting languages are infinite (they have an unlimited number of words). There is a neat theoretical result concerning this infiniteness, check it out in your spare time: the pumping lemma.
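To make the "language = set of words" idea concrete, here is a minimal sketch that lists the words of the language defined by [ab]+, shortest first. The helper name words_up_to and the length cap are assumptions made for illustration; the full language is infinite, so the sketch only prints a finite prefix of it.

```python
import itertools
import re

PATTERN = re.compile(r'[ab]+')

def words_up_to(max_len, alphabet='ab'):
    """Return every word over `alphabet` of length 1..max_len that PATTERN fully matches."""
    found = []
    for length in range(1, max_len + 1):
        for letters in itertools.product(alphabet, repeat=length):
            word = ''.join(letters)
            if PATTERN.fullmatch(word):
                found.append(word)
    return found

words = words_up_to(3)
print(words[:10])
# ['a', 'b', 'aa', 'ab', 'ba', 'bb', 'aaa', 'aab', 'aba', 'abb']
```

Raising the cap only makes the list longer, which is what the ellipsis in the set above stands for.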


Context-free languages are regular languages on steroids.

They are richer. Every regular language is a context-free language, but not the other way round.

The notation used to describe context-free languages is context-free grammars, just like the notation to describe regular languages is regular expressions.
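A classic example of a language that is context-free but not regular is "some a's followed by the same number of b's". It has the two-rule grammar S → a S b and S → a b (this grammar and the function name below are illustrative additions, not from the talk). A rough sketch of a recognizer that simply follows those two rules:

```python
def is_anbn(word):
    """Recognise { a^n b^n : n >= 1 } by following S -> a S b | a b."""
    if word == 'ab':                                   # S -> a b
        return True
    if len(word) >= 4 and word[0] == 'a' and word[-1] == 'b':
        return is_anbn(word[1:-1])                     # S -> a S b
    return False

for candidate in ['ab', 'aabb', 'aaabbb', 'aab', 'abab', '']:
    print(repr(candidate), is_anbn(candidate))
```

The recursion mirrors the nesting in the grammar. That nesting is exactly what a regular expression (in the formal sense) cannot count.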


Context-free grammars look like this:

E → number
E → E + E
E → E * E
E → ( E )

E is a nonterminal symbol (black on the slide) - one that has further structure. number, +, *, ( and ) are terminal symbols (khaki on the slide) - ones that have no further structure, at least no structure on the level of the CFG. The arrow denotes a production (just a fancy name for a rule) - there are 4 productions in the grammar above.

It’s intuitively clear what’s going on here. CFGs are quite readable (but their theory is rich with technicalities).

The set of valid expressions defined by this grammar contains things like

3, 1+5, 2*(1+4), (5*15)+(3)


That’s a simple language of mathematical expressions. You can easily come up with examples similar to the above, for XMLish or JSONish grammars.
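One way to get a feel for what a grammar "contains" is to run its productions forwards and see which sentences fall out. The sketch below does that for the four-production expression grammar, with a depth limit so the enumeration stays finite. The dictionary layout and the expand helper are assumptions made for this illustration; PLY does not work this way.

```python
import itertools

# The four productions of the expression grammar; 'E' is the only nonterminal.
GRAMMAR = {
    'E': [
        ['number'],
        ['E', '+', 'E'],
        ['E', '*', 'E'],
        ['(', 'E', ')'],
    ],
}

def expand(symbol, depth):
    """Yield every terminal sequence derivable from `symbol`
    using at most `depth` nested productions."""
    if symbol not in GRAMMAR:          # terminal: nothing left to expand
        yield [symbol]
        return
    if depth == 0:                     # no budget left for this nonterminal
        return
    for production in GRAMMAR[symbol]:
        # Expand each right-hand-side symbol, then combine all the choices.
        parts = [list(expand(s, depth - 1)) for s in production]
        for combo in itertools.product(*parts):
            yield [tok for part in combo for tok in part]

sentences = sorted({' '.join(s) for s in expand('E', 2)})
for sentence in sentences:
    print(sentence)
# ( number )
# number
# number * number
# number + number
```

Increasing the depth quickly produces things like ( number + number ), i.e. the shapes of 2*(1+4) and friends.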


Parsing

So you know what context-free grammars are and why they are so great. You can even whip out some simple, but useful languages in a minute or two. Now obviously we would want to actually use these to process some input.

And you need a parser to do that, and a lexer too.


A parser, for our purposes, is the part that processes the CFG definition. A lexer is a layer that is concerned with actual characters/bytes of input and turns them into tokens (terminal symbols in the CFG) that are fed to the parser.
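To show what "turning characters into tokens" means in practice, here is a minimal hand-rolled lexer built on the standard re module. It is only a sketch: the token names, the Token tuple and the tokenize function are made up for this illustration and are not part of PLY.

```python
import re
from collections import namedtuple

Token = namedtuple('Token', ['type', 'value', 'pos'])

# Illustrative token names and patterns for the expression grammar.
TOKEN_SPEC = [
    ('NUMBER', r'\d+'),
    ('PLUS',   r'\+'),
    ('MUL',    r'\*'),
    ('LPAREN', r'\('),
    ('RPAREN', r'\)'),
    ('SKIP',   r'[ \t]+'),    # whitespace is matched but never emitted
]
MASTER = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC))

def tokenize(text):
    """Yield Token tuples for `text`; raise ValueError on an illegal character."""
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError('illegal character %r at position %d' % (text[pos], pos))
        if match.lastgroup != 'SKIP':
            yield Token(match.lastgroup, match.group(), pos)
        pos = match.end()

for token in tokenize('2*(1+4)'):
    print(token)
```

The parser never sees the characters, only the stream of token types (NUMBER, MUL, LPAREN, ...), which are the terminal symbols of the grammar.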


What we are interested in, while parsing, is not only the question whether a given input parses correctly, i.e. whether it belongs to the defined grammar. We need to know the exact derivation (decomposition) using the rules of this grammar.


E.g. we need to be sure whether

x * y + z

should be understood as

(x * y) + z or x * (y + z)
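Why the derivation matters can be seen with plain arithmetic: the two readings give different answers. A tiny sketch with arbitrarily chosen values:

```python
x, y, z = 2, 3, 4

left_grouping = (x * y) + z     # multiplication binds tighter
right_grouping = x * (y + z)    # addition binds tighter

print(left_grouping, right_grouping)   # 10 14
print(x * y + z == left_grouping)      # True: Python itself picks the first reading
```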

The most important slide of this talk ;)

Input:

x * y + y * z

The parser turns this input into an abstract syntax tree:

        +
      /   \
     *     *
    / \   / \
   x   y y   z

Parse tree (concrete objects):

Plus(Mul(Var('x'), Var('y')), Mul(Var('y'), Var('z')))

What is PLY and how can it help

Enter yacc and lex, or rather, bison and flex - the ubiquitous parsing tools.

bison is a parser generator. It takes a definition of a grammar, executes some (heavy!) magic, and generates C source code that parses the language specified by that grammar.


flex is a lexer generator. It takes a series of regular expressions and generates C code that produces, for a stream of input characters, a sequence of matches to these REs. Quite like re.compile in Python.


yacc/bison and lex/flex are traditional, very mature and very popular tools. Some examples of production grammars that are/were implemented using these tools are:

GCC (up to 3.4/4.1), Bash, PostgreSQL, Ruby, Go, PHP

Many yacc/lex-remake tools in different technologies do essentially the same stuff and are used in countless products that need to understand non-trivial input.

Fine print: Some projects, including newer releases of GCC and CPython, don’t use stock parser generators. The reasons are generally twofold: either because of project history (CPython) or because the language itself is actually too complex and ambiguous to parse cleanly using standard tools (GCC’s C and C++ parsers), even with tools that are more theoretically and practically powerful than bison. It should be strongly emphasized, though, that writing a nontrivial parser without using a standard parsing toolkit is an extremely time-consuming and error-prone endeavor and not something that 99.9% of organizations would ever want to consider.


One of these follow-on projects is PLY, the Python Lex-Yacc by David Beazley.

http://www.dabeaz.com/ply/


● a mature, stable tool (developed since 2001)
● supports Python at least up to 3.2
● bison+flex in one package
● somewhat less sophisticated than original bison+flex
● very straightforward to use


Show me some code!

Parse tree nodes


class Var(object):
    def __init__(self, var_name):
        self.var_name = var_name

    def eval(self, context):
        return context[self.var_name]

    def __str__(self):
        return '@%s' % self.var_name

class BinOpNode(object):
    def __init__(self, ln, rn):
        self.ln = ln
        self.rn = rn

    def eval(self, context):
        # self.apply defined in derived classes
        return self.apply(self.ln.eval(context), self.rn.eval(context))

    def __str__(self):
        # self.op_sym defined in derived classes
        return '%s(%s,%s)' % (self.op_sym, self.ln, self.rn)

class Plus(BinOpNode):
    op_sym = '+'

    def apply(self, lv, rv):
        return lv + rv

class Mul(BinOpNode):
    op_sym = '*'

    def apply(self, lv, rv):
        return lv * rv


Usage example:

expr = Plus(Mul(Var('x'), Var('y')), Var('x'))  # x * y + x
context = {'x': 38, 'y': 52}
print(expr.eval(context))

>>> 2014


Lexer code

Essentially, a bunch of regular expressions. PLY, keeping with the (bad) tradition of lex/flex/yacc/bison, by default creates a global (as in global scope) lexer. This can be overridden, though, to make the code cleaner.


import ply.lex

tokens = ['VAR', 'PLUS', 'MUL']

t_ignore = ' \t'  # allow whitespace
t_PLUS = r'\+'
t_MUL = r'\*'
t_VAR = r'\w+'

ply.lex.lex()

Testing the lexer

ply.lex.input('tmp1 + y*z_3')

while True:
    tok = ply.lex.token()
    if not tok:
        break  # EOF
    print(tok.type, tok.value, tok.lineno, tok.lexpos)

>>> VAR tmp1 1 0
>>> PLUS + 1 5
>>> VAR y 1 7
>>> MUL * 1 8
>>> VAR z_3 1 9

You don’t normally use a lexer directly, unless you are building some kind of a simplistic stream-type parser (think html.parser in Python 3, expat etc.)

As mentioned previously, the reason we use a lexer is to feed tokens to the parser, hiding the low-level details of character/byte processing. E.g. we don’t want to encode the \w+ pattern (used for recognizing variable names in the last example) as a set of reductions in the grammar; it’s much more natural and simple to handle it with a single regex on the lexer level and return a token that spans the appropriate characters.


The parser


import ply.yacc

import lexer
from nodes import Plus, Mul, Var

tokens = lexer.tokens

def p_plus_expr(p):
    ' expr : expr PLUS expr '
    p[0] = Plus(p[1], p[3])

def p_mul_expr(p):
    ' expr : expr MUL expr '
    p[0] = Mul(p[1], p[3])

def p_var(p):
    ' expr : VAR '
    p[0] = Var(p[1])

class ParseError(Exception):
    pass

def p_error(p):
    # TODO
    raise ParseError

precedence = (
    ('left', 'PLUS'),
    ('left', 'MUL'),
)

parser = ply.yacc.yacc()

Does it even work?

The parser returns the topmost parse tree object (Var, Plus or Mul) with its contents built up recursively according to specified CFG productions. We get a nice string representation from the defined __str__’s.

Note the last two examples - operator precedence and associativity work as specified.

from parser import parser, ParseError

try:
    print(parser.parse('x +'))
except ParseError:
    print("Parse error!")
print(parser.parse('var'))
print(parser.parse('x + y'))
print(parser.parse('x + a * b'))
print(parser.parse('x * a + b'))
print(parser.parse('v * v * v'))

>>> Parse error!
>>> @var
>>> +(@x,@y)
>>> +(@x,*(@a,@b))
>>> +(*(@x,@a),@b)
>>> *(*(@v,@v),@v)
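To see what a precedence table with 'left' entries buys you, it can help to look at the same rules in a tiny hand-written parser. The sketch below uses precedence climbing, a different technique from the LALR tables PLY generates; it is swapped in here only because it fits in a few lines. All names (BINDING_POWER, parse, parse_expr) are made up for this illustration, and it builds the string form directly instead of node objects.

```python
import re

# Higher number = binds tighter. Both operators are left-associative,
# mirroring the `precedence` table in the PLY parser.
BINDING_POWER = {'+': 1, '*': 2}

def tokenize(text):
    """Split the input into variable names and operators.
    (Simplification: any other character is silently skipped.)"""
    return re.findall(r'\w+|[+*]', text)

def parse(text):
    tree, rest = parse_expr(tokenize(text), 0)
    if rest:
        raise SyntaxError('unexpected token %r' % rest[0])
    return tree

def parse_expr(tokens, min_power):
    """Parse a variable followed by operators of binding power >= min_power."""
    if not tokens or tokens[0] in BINDING_POWER:
        raise SyntaxError('expected a variable')
    left, tokens = '@' + tokens[0], tokens[1:]
    while (tokens and tokens[0] in BINDING_POWER
           and BINDING_POWER[tokens[0]] >= min_power):
        op = tokens[0]
        # Left associativity: the right operand may only contain
        # operators that bind strictly tighter than `op`.
        right, tokens = parse_expr(tokens[1:], BINDING_POWER[op] + 1)
        left = '%s(%s,%s)' % (op, left, right)
    return left, tokens

for source in ['var', 'x + y', 'x + a * b', 'x * a + b', 'v * v * v']:
    print(parse(source))
# @var
# +(@x,@y)
# +(@x,*(@a,@b))
# +(*(@x,@a),@b)
# *(*(@v,@v),@v)
```

With PLY you get the same grouping just by declaring the table, which is rather the point of using a parser generator.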

Another example - a bc (almost).

We have just built a simple domain-specific language. In fact, the processing implemented with .eval() above is probably too simplistic for anything except a simple calculator. Realistically, we would want to implement some sophisticated tree-walking strategy or maybe even tree rewriting.

This is the moment where the scope of my presentation ends. Go get a compiler writing book for more ;)

from parser import parser

expr = parser.parse('x*x + y*z')
result = expr.eval({'x': 6, 'y': 3, 'z': 2})
print(result)

>>> 42
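As a small taste of the tree walking and tree rewriting mentioned above, here is a sketch with one walk (collect the variable names an expression uses) and one rewrite (distribute a multiplication over a sum on its right). The node classes are stripped-down stand-ins for the ones in the talk, and the functions variables, distribute and show are assumptions made for this illustration.

```python
class Var(object):
    def __init__(self, var_name):
        self.var_name = var_name

class BinOpNode(object):
    def __init__(self, ln, rn):
        self.ln = ln
        self.rn = rn

class Plus(BinOpNode):
    pass

class Mul(BinOpNode):
    pass

def variables(node):
    """Tree walk: return the set of variable names mentioned in the tree."""
    if isinstance(node, Var):
        return {node.var_name}
    return variables(node.ln) | variables(node.rn)

def distribute(node):
    """Tree rewrite: turn a * (b + c) into a * b + a * c, bottom-up.
    Only a sum on the right-hand side is handled in this sketch."""
    if isinstance(node, Var):
        return node
    ln, rn = distribute(node.ln), distribute(node.rn)
    if isinstance(node, Mul) and isinstance(rn, Plus):
        return Plus(distribute(Mul(ln, rn.ln)), distribute(Mul(ln, rn.rn)))
    return type(node)(ln, rn)

def show(node):
    """Render the tree fully parenthesised, for inspection."""
    if isinstance(node, Var):
        return node.var_name
    sym = '+' if isinstance(node, Plus) else '*'
    return '(%s %s %s)' % (show(node.ln), sym, show(node.rn))

expr = Mul(Var('x'), Plus(Var('y'), Var('z')))
print(sorted(variables(expr)))    # ['x', 'y', 'z']
print(show(expr))                 # (x * (y + z))
print(show(distribute(expr)))     # ((x * y) + (x * z))
```

A real compiler would have many such passes; the parser's only job is to hand them a well-formed tree.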

How can I make a buck off this stuff?


Where could a parser in Python come in handy?

● support user-defined expressions in reporting-type software (my personal experience - excellent result, quick to develop, very powerful for end users)

● parse custom configuration files

● prototype novel programming languages


All these use cases boil down essentially to empowering users of the software we develop by letting them actually program it, in the most* convenient and general way possible. Frequently this means competitive advantage and overall coolness :)

* developing complex grammars and parsers for them in general takes a significant amount of time, even with a tool as convenient as PLY, so there will always be practical limits to this sophistication.


Gory details

All of this, of course, is actually not as simple as presented.


● You don’t normally parse generic context-free grammars with tools like yacc, bison, and PLY. They are usually applied to subsets of the CFG universe, such as LR(0), SLR and LALR(1). For generic CFGs we’d need more comprehensive, slower algorithms. GLR, Tomita… some of this stuff is actually implemented in newer versions of bison, but not PLY. Other Pythonic tools exist for this, though (I’m not sure they’re as functional as PLY - never tried them)


● Conflicts in applying rules arise frequently (famous: dangling else in C); handling them can be frustrating


● Handling parse errors and presenting them to users sensibly in bison-type tools is notoriously hard (this was the main reason GCC dropped bison for a home-baked parser)


● Other than that, CFGs and PLY rock :)

That’s all folks

Happy parsing

… and remember we are hiring heavily at WebInterpret

Thank you!

www.webinterpret.com