Remoting Django (and other python) methods to be run inside the
PostgreSQL database
PgWest 2010
Hannu Krosing
About me
Hannu Krosing
skype: hkrosing
* Using python since v. 1.1
* Using linux since v. 0.9
* PostgreSQL since it was Postgres 4.2 (using postQuel as its query language instead of SQL)
Senior PostgreSQL Consultantwww.2ndQuadrant.com
Technical AdvisorAmbient Sound Investments
www.asi.ee
Python & PostgreSQL Background
Large portion of python code I have written has had to do with databases, be itbinary protocol or file format hacking, or using python to automate data loading,backups or just plain data manipulation.
I was the first (and for some years the only) DBA for Skype, where we usedpython for most data backend manipulation tasks and where I invented a Proxying and database partitioning language pl/proxy which lets you build unlimited scalability into your database backends with relative ease , both from developer and administrator viewpoint.
For last few years I have been helping companies to manage and scale theirPostgreSQL databases as a Database Consultant at 2ndQuadrant(www.2ndQuadrant.com) , as well as evaluating technical merits of startups atASI , the investment company of Skype founding engineers (www.asi.ee).
The 1st problem to solve
● Many people use postgreSQL from python, but don't want to dive deeply into SQL
● Even when you are good at using SQL from another language, it is still cumbersome, due to different ways of using and different expectations of users (an “impedance mismatch”)
Solution to the 1st problem
● Use an Object Relational Mapper (ORM)● ORMs replace SQL with something
syntactically more close to hosting language
● Django has a good and flexible ORM
ORM vs DB API
● Sample DB API update to table rowcur.execute('update users set password=%s where name=%s', ('frog', 'bob')))
● Same using Django ORMusr=User.get(pk='bob')
usr.password = 'frog'
usr.save()
Other available ORMs● SQLAlchemy
keeps Object definition and Mapper separate, joins when running
● STORM (GPL)● SQLObject (mostly superseeded by more
flexible SQLAlchemy)
Main problem with ORMs● Most ORMs tend to generate lots of distinct database
requests and do not make use of most database features, at least when used naively.
● Not using good database features tends to prefer databases with less features.
● Most ORMs have ways to circumvent this and even write your own SQL, but then why use the ORM in the first place.
More problem with ORMs
Next ones are due to not using functions to access database:
● There is no easy way for a DBA to go and fix inefficient functionality without changing the source code of client application.
● You can't use pl/proxy so it is hard to use partitioning for scaling.
Is this really a problem ?
● How bad is the “many db requests” problem
● As each db request has about 0.4ms overhead even when doing nothing you can easily get a large total runtime for simple things.
● This does usually not show up at development time, when you are the only user.
Sample schema-- 100000 items create table items( id serial primary key, item_name text not null, price nume(10,4), sale_price numeric(10,4) );
-- 100000 items_availablecreate table items_available(items_availableid serial primary key references items(id),quantity int);
-- 100000 items create table items( id serial primary key, item_name text not null, price nume(10,4), sale_price numeric(10,4) );
-- 1000000 order_linescreate table order_lines(order_linesid serial primary key,order_id int not null references orders(id),item_id int not null references items(id),quantity_ordered int,quantity_shipped int);
Shipping an order using Django ORMorder=Order.get(id=1000)for line in order__order_lines: item_av = ItemAvailable.get(pk=line.id) if line.quantity_ordered > item_av.quantity: line.quantity_shipped = item_av.quantity item_av.quantity = 0 else: line.quantity_shipped = line.quantity_ordered item_av.quantity = line.quantity_shipped line.save() item_av.save()
● Does 1 query to get order and another one to get ~100 order lines● Then does 100 queries to get corresponding availability data● And then 200 to save order lines and availability back● That is 300+ queries in total, each doing RPC call to database● Likely to take at least 500 ms or more● some of the queries can be avoided by using get_related but most can't
Shipping an order in pl/pythoncreate or replace function process_order(i_order_id int, out json_result text) as$$from cjson import encode, decodeif 'order_plan' not in SD: q = 'select * from orders where id = $1' SD['order_plan'] = plpy.prepare(q, [ 'int' ])<..cache other plans here..>order = plpy.execute(SD['order_plan'], [i_order_id])order_lines = plpy.execute(SD['order_line_plan'], [i_order_id])availability_query = SD['availability_plan']update_quatities_query = SD['update_quatities']update_order_shipping_query = SD['update_order_shipping']for order_line in order_lines: line_id = order_line['id'] item_id = order_line['item_id'] availability = plpy.execute(availability_query, [item_id])[0] if availability['quantity'] < order_line['quantity_ordered']: plpy.execute(update_quatities_query, [item_id, 0]) plpy.execute(update_order_shipping_query, [line_id,availability['quantity']]) else: plpy.execute(update_quatities_query, [item_id, availability['quantity'] order_line['quantity_ordered']]) plpy.execute(update_order_shipping_query, [line_id, order_line['quantity_ordered']])$$ language plpythonu security definer;
● ● One RPC call to databaseOne RPC call to database● ● Total 30-120 ms to update 2x100 rowsTotal 30-120 ms to update 2x100 rows
Problems with pl/python
● Maintaining code in 2 places
● Hard to write
● Writing this as 1-2 SQL statements would have performed even faster, so why bother
Automatic remoting
● You can automate moving code into postgreSQL to be executed by pl/python
How ?
● Write a decorator, which wraps up code and arguments, and passes them for execution to a special pl/python function
Writing a remoting decoratorAll the following samples need these modules and connections
import psycopg2import json, marshal, base64
# user who can create functionsadmcon = psycopg2.connect('dbname=pyrpcdb user=admin')admcur = admcon.cursor()
# connection for using the functionusrcon = psycopg2.connect('dbname=pyrpcdb user=bob')usrcur = admcon.cursor()
Writing a remoting decoratorDecorator as class
class run_in_database(object): def __init__(self, f): self.f = f self.code = f.func_code self.mcode = marshal.dumps(self.code) self.code64 = base64.encodestring(self.mcode) self.db_name = "%s_%d" % (f.__name__, hash(self.mcode)) def __call__(self, *args, **kargs): json_args = json.dumps((args, kargs)) usrcur.execute('select * from run_code(%s,%s)', (self.code64, json_args)) result = cjson.loads(usrcur.fetchone()[0]) return result
● serialises actual code of function● serializes arguments, passed argunments and code to pl/python● deserialises return value
Remote executor in postgreSQLAll the following samples need these modules and connections
create or replace function run_code(in code64 text, in json_args text, out json_result text) as$$import json, marshal, base64args, kargs = json.loads(json_args)
code = marshal.loads(base64.decodestring(code64))def f():passf.func_code = code
res = f(*args, **kargs)
return json.dumps(res)$$ language plpythonu security definer;
● Deserializes arguments and code● Creates a dummy function, then attaches deserialised code to it● Calls the function, serialises result and returns it
Defining and calling the remoted function
All the following samples need these modules and connections
@run_in_databasedef get_order(i_order_id): if 'order_plan' not in SD: q = 'select * from orders where id = $1' SD['order_plan'] = plpy.prepare(q, [ 'int' ]) if 'order_line_plan' not in SD: q = 'select * from order_lines where order_id = $1' SD['order_line_plan'] = plpy.prepare(q, [ 'int' ]) order = plpy.execute(SD['order_plan'], [i_order_id])[0] order_lines = plpy.execute(SD['order_line_plan'], [i_order_id]) return (order, [order_line for order_line in order_lines])
#call the decorated function, it is executed in databaseres = get_order(1000)
● Main overhead returning data, not shipping the code● Code shipping can be changed to happen only when code is changed
Some speed numbers
● Simple ”select * from t where id = n” runs in 0.1-0.7 ms
● The get_order(n) takes 4.5ms in database
● Calling it with automatic remoting from python, 6.5ms
This means that the overhead of shipping the function code is negligible.
Problems with simple remoting
● While running everything via code shipping works, it is not the best option
● No caching of functions● No caching of query plans● No further optimisation or scaling possible
● Better to create a corresponding db function for each remoted function
Advanced remotingA function that runs (and creates if necessary) a corresponding function in database
class run_as_db_function(object): def __init__(self, f): self.f = f self.py_fname = f.__name__ self.code = f.func_code self.mcode = marshal.dumps(self.code) self.code64 = base64.encodestring(self.mcode) self.db_fname = "%s_%d" % (f.__name__, abs(hash(self.mcode)) ) def __call__(self, *args, **kargs): json_args = json.dumps((args, kargs)) print self.db_fname, (json_args,) try: # try calling the function in db usrcur.execute("savepoint sp1; select * from %s(%%s)" % self.db_fname, (json_args,)) except: # define func using admin connection usrcur.execute("rollback to savepoint sp1") admcur.execute(remote_function_template % (self.db_fname, self.code64) ) admcon.commit() # run the function again usrcur.execute("select * from %s(%%s)" % self.db_fname, (json_args,)) result = json.loads(usrcur.fetchone()[0]) return result
Advanced remoting (cont)A database function template to be used when creating the function
remote_function_template = """\CREATE OR REPLACE FUNCTION %s (in json_args text, out json_result text) AS$$import json, marshal, base64
code = marshal.loads(base64.decodestring(\"\"\"%s\"\"\"))args, kargs = json.loads(json_args)
def f():passf.func_code = code
res = f(*args, **kargs)return json.dumps(res)$$ language plpythonu security definer;"""
Create and call a remoted database function
@run_as_db_functiondef double(strarg): return strarg *2
double('cat')
Advanced remoting (cont)The function that actually was created in database
CREATE FUNCTION double_1980513421(json_args text, OUT json_result text) RETURNS text LANGUAGE plpythonu SECURITY DEFINER AS $$import json, marshal, base64
code = marshal.loads(base64.decodestring("""YwEAAAABAAAAAgAAAEMAAABzCAAAAHwAAGQBABRTKAIAAABOaQIAAAAoAAAAACgBAAAAdAYAAABzdHJhcmcoAAAAACgAAAAAcwcAAAA8c3RkaW4+dAYAAABkb3VibGUBAAAAcwIAAAAAAg=="""))args, kargs = json.loads(json_args)
def f():passf.func_code = code
res = f(*args, **kargs)
return json.dumps(res)$$;
Making it scalable● The easiest way to achieve scalability with a function-based interface to postgreSQL is pl/proxy
● But pl/proxy needs a partitioning key in function signature while current approach hides all arguments inside a JSON container
● Time to add possibility to pass some arguments separately
● While we are at it, lets also cache the internal function
Scalable remoting for pl/proxyA decorator exposing an explicit partitioning key def run_as_partitioned_db_function(part_key, part_key_type="text"): class _run_as_partitioned_db_function(object): def __init__(self, f): self.f = f self.py_fname = f.__name__ self.code = f.func_code self.mcode = marshal.dumps(self.code) self.code64 = base64.encodestring(self.mcode) self.db_fname = "%s_%d" % (f.__name__, abs(hash(self.mcode)) ) self.part_key = part_key self.part_key_type = part_key_type self.part_key_pos = self.code.co_varnames.index(self.part_key) def __call__(self, *args, **kargs): json_args = json.dumps((args, kargs)) print 'args:',args print 'kargs:', kargs print self.db_fname, (args[self.part_key_pos], json_args) try: # try calling the function in db usrcur.execute("savepoint sp1; select * from %s(%%s, %%s)" % self.db_fname, (args[self.part_key_pos], json_args)) except: # define func using admin connection usrcur.execute("rollback to savepoint sp1") admcur.execute(partitioned_function_template % (self.db_fname, self.part_key_type, self.code64) ) admcon.commit() # run the function again usrcur.execute("select * from %s(%%s, %%s)" % self.db_fname, (args[self.part_key_pos], json_args)) result = json.loads(usrcur.fetchone()[0]) return result return _run_as_partitioned_db_function
Scalable remoting for pl/proxy (cont.)A function template for scalable remoting exposing key field partitioned_function_template = """\CREATE OR REPLACE FUNCTION %s (in part_key %s, in json_args text, out json_result text) AS$$import json
if not SD.has_key('_func_'): import marshal, base64 code = marshal.loads(base64.decodestring(\"\"\"%s\"\"\"))
def f():pass f.func_code = code SD['_func_'] = f
args, kargs = json.loads(json_args)res = SD['_func_'](*args, **kargs)
return json.dumps(res)$$ language plpythonu security definer;"""
This also caches the function in pl/python's local dictionary SD so it is not re-created at each call
Scalable remoting function in databaseDefining and running this in python
@run_as_partitioned_db_function('username')def hello(username, age): return 'Hello %s, you are nor %d years old' % (username, age)
hello('bob',32)
Yields the following function in the database
CREATE OR REPLACE FUNCTION hello_678701658(IN part_key text, json_args text, OUT json_result text) LANGUAGE plpythonu SECURITY DEFINER AS $$import jsonif not SD.has_key('_func_'): import marshal, base64 code = marshal.loads(base64.decodestring("""YwIAAAACAAAAAwAAAEMAAABzDgAAAGQBAHwAAHwBAGYCABZTKAIAAABOcyIAAABIZWxsbyAlcywgeW91IGFyZSBub3IgJWQgeWVhcnMgb2xkKAAAAAAoAgAAAHQIAAAAdXNlcm5hbWV0AwAAAGFnZSgAAAAAKAAAAABzBwAAADxzdGRpbj50BQAAAGhlbGxvAQAAAHMCAAAAAAI=""")) def f():pass f.func_code = code SD['_func_'] = fargs, kargs = json.loads(json_args)res = SD['_func_'](*args, **kargs)return json.dumps(res)$$;
So how would one scale thisOnce you have a function with exposed partition key, it is easy to use pl/proxy to do the actual partitioning. Just define a proxy function
CREATE OR REPLACE FUNCTION hello_678701658(part_key text, json_args text, OUT json_result text)LANGUAGE plproxy AS $$ CLUSTER 'usercluster'; RUN ON hashtext(i_username) ;$$;
And this function decides the partition to run the function in based on part_key and then calls the real function in that partition with exact same arguments as it was called and returns the result to original caller
How far is Django support for this ?While the aim of this project is support for remoting ORM-based methods and Django as the first one, we are not there yet
I have an ugly hacked version of postgresql database backend which can be run inside the database to execute Django QuerySet objects, but it is far from complete nor is it easy to use or install.
You can write straight pl/python code inside your Django app and have it remoted using the code from this tutorial
ToDo for Django support
● Complete django ORM support on pl/pythons SPI interface
– Above needs a SPI-specific database adapter
– And a DB-API compliant SPI interface, a long-time ToDo item in pl7python
● Automatic generation of Django Models from database Schema
Other nice-to-have features
● Automatic generation of pl/proxy function stubs
● Bypassing SQL when generating the query plans
● Support for other python ORMs● Support for ORMs from other languages
(Ruby on Rails anybody ?)
Conclusion
● It is possible to automatically move functions into the database for execution as demonstrated in this talk.
● It is also possible to move Django methods into database and make them scalable using pl/proxy
● Making the Django method shipping easy still needs work (donations welcome ;)
Where to get more info● http://www.2ndQuadrant.com/ - if you want professional
paid help
● http://www.postSQL.org/ - where this code will be developed and maintained
● http://plproxy.projects.postgresql.org/doc/tutorial.html Tutorial for using pl/proxy
● How to reach me:
– email [email protected]
– Skype hkrosing
Top Related