Small Team, Big Demands? - Use The Batteries Included

My EuroPython 2008 talk: a tale of a software project - simulation and optimization for forest management - rescued by Python!

Jussi Rasinmäki, Simosol Oy

EuroPython 2008

Outline

• This is a tale of a software project - and how Python came to the rescue

• “Batteries included” in the title? Means both the standard library and the Python Package Index

The project as it is now: SIMO

• SIMO (SIMulation and Optimization) is a framework for natural resource management planning

• originally developed in Finland at the University of Helsinki for forest management planning

• Consists of two main parts

• The simulator is used to generate several alternative scenarios – how things in nature could evolve due to both the natural and human processes

• The optimizer is then used to select the best scenario given the goals and restrictions for the planning problem

• End result: an optimal set of management actions for the planning period

Key benefits

• Adaptable: flexible enough in its structure to

• use a wide range of input data as the starting point for the planning computations

• be suitable for different natural resource management planning problems

• be used around the world: SIMO can easily be localized to different natural conditions

• Cross-platform & Open source

The beginning

• It’s 2004: Finland is still 70 % forest and, despite Nokia & mobile phones, the forest industry is still one of the pillars of the Finnish economy.

• Forest management planning == when & how much to harvest; i.e., cut & sell trees (to put it very roughly)

• The problem is the long time frame: trees grow slowly

• we need a simulator to predict the possible future states of the forests

The problem

• There is a simulator for forest management planning in Finland, but:

• it’s a proprietary black box

• it works very well for the purpose it was designed for, but definitely not for all the use cases

• it’s hard to get new requirements implemented into the program

The solution?

• A joint project between all the forestry companies, forestry organisations and the University of Helsinki was started:

• an open source planning software; no more black boxes

• make it adaptable, as few fixed parts as possible

The team

• A three year project with three researchers (myself being one of them)

• All of whom are foresters, not professional software developers, let alone computer scientists

• The team was backed up by a professional software developer with decades of programming experience in C & C++

The task

“The system should be flexible with regards to the data and models used. It should be adaptable to different planning problems and extendable to cover the future planning needs.”

What???

• The view of the forest changes depending on where you are looking at it from

• Similarly, the data about the forest changes depending on how you collect it

• Remote sensing technology advances => we are moving closer and closer to the forest, towards a more detailed view of it

Dealing with the data

• A hierarchical, generic data model

• consists of objects that have attributes and sub objects

• Each attribute is a value - value interpretation pair

• Two kinds of attributes:

• numerical have a unit of measurement

• categorical have a complete enumeration of the possible values
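The data model described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the class names, the `stand`/`tree` levels and the attribute names are hypothetical, not SIMO's actual classes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NumericalAttribute:
    """Definition of a numerical attribute: a name plus its unit."""
    name: str
    unit: str


@dataclass(frozen=True)
class CategoricalAttribute:
    """Definition of a categorical attribute: a name plus all allowed values."""
    name: str
    categories: tuple

    def validate(self, value):
        if value not in self.categories:
            raise ValueError(f"{value!r} is not a valid {self.name}")


@dataclass
class DataObject:
    """A node in the hierarchy: attribute values plus sub objects."""
    level: str
    values: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def add_child(self, child):
        self.children.append(child)
        return child

    def walk(self):
        """Yield this object and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


# Hypothetical definitions and data, for illustration only
HEIGHT = NumericalAttribute("height", unit="m")
SPECIES = CategoricalAttribute("species", categories=("pine", "spruce", "birch"))

stand = DataObject("stand", values={"area": 2.5})
tree = stand.add_child(DataObject("tree", values={"height": 17.2, "species": "pine"}))
SPECIES.validate(tree.values["species"])
levels = [obj.level for obj in stand.walk()]
```

Keeping the interpretation (unit or enumeration) separate from the values is what lets the same code handle data sets with different attributes.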

Implementing the data model

• “Hierarchical structure, values associated with their meaning… hey, this sounds like XML!”

Flashback from the past: this is what the first draft of the data model looked like

And this is what it looks like now: the data doesn’t really come XML-encoded, but the data model instance used is defined in XML

“Hey, it works! Let’s take this further”

• Implementing the XML based data model in C: three clueless foresters with a lot of guidance from a C-professional were able to implement a working prototype

• What about the other requirement of modifiable computations to accommodate different planning tasks?

• Separation of the simulation logic from the simulator logic

• We already started with XML, so we extended that to the simulation logic as well

The computation logic as data

• Again a hierarchical structure: a simulation is an ordered collection of tasks

• the execution of the task may depend on a condition

• a task may be divided into sub tasks

• but ultimately a task is carried out by a model
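A rough sketch of how such a task hierarchy could be read and executed with ElementTree. The XML element and attribute names (`simulation`, `task`, `model`, `condition`) are invented for this example, and a condition is simplified to the name of a boolean flag; the real SIMO documents are not shown in the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical simulation description; element and attribute names are invented
SIMULATION_XML = """
<simulation>
  <task name="grow" model="growth_model"/>
  <task name="harvest" condition="mature">
    <task name="thin" model="thinning_model"/>
    <task name="clearcut" model="clearcut_model" condition="old"/>
  </task>
</simulation>
"""


def run_task(task, state, executed):
    """Run one task element: check its condition, then recurse or call its model."""
    condition = task.get("condition")
    if condition is not None and not state.get(condition, False):
        return  # condition not met: skip the task and its sub tasks
    subtasks = task.findall("task")
    if subtasks:
        for sub in subtasks:
            run_task(sub, state, executed)
    else:
        # Leaf task: a real simulator would call the model here
        executed.append(task.get("model"))


def run_simulation(xml_text, state):
    """Execute the tasks in document order and return the models that ran."""
    root = ET.fromstring(xml_text)
    executed = []
    for task in root.findall("task"):
        run_task(task, state, executed)
    return executed


young = run_simulation(SIMULATION_XML, {"mature": False})
mature = run_simulation(SIMULATION_XML, {"mature": True, "old": False})
```

Changing the simulation then means editing the XML, not the simulator code.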

Implementing the first draft, take two

• Three months down; first draft of the ideas implemented in C, things begin to look scary

• needed to lean heavily on the C guy

• will we be able to pull this off in the time allocated? Not likely.

• “I read an article about this language called Python, shall we try it?”

• Python & XML in 2004: “ElementTree seems nice, let’s try that”

• Two weeks and we had a working prototype on par with the functionality of the C prototype

Python and the first draft

• 10x smaller, twice as fast as the C version

• When you write really bad C code, it can be quite slow…

• And conversely, if you don’t have to struggle when converting your ideas into code, the code can be reasonably efficient

• "Clearly, this is going to work"

• the project management didn’t disagree with this conclusion and thus it was Python from there on

"Where the use of XML definitely went too far"

This is sort of manageable:

But it gets pretty bad pretty soon:

XML: Taking a step back

• Instead of using XML for everything, let’s use an alternative syntax for those parts where it’s more suitable:

    <condition>comp_unit:CULT == 1 and comp_unit:tending_of_seedling_stand times_eq 0 and comp_unit:thinning times_eq 0</condition>

• But, the XML parser doesn’t provide any help in converting the condition string into a condition object that could be evaluated

• currently evaluating pyparsing for this, looks promising
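To show what "converting the condition string into a condition object" involves, here is a small hand-written parser using only the standard library. It is not pyparsing and not SIMO's implementation. It handles only `and`-joined clauses, and the meaning given to `times_eq` (the named operation has been carried out exactly that many times) is an assumption.

```python
import operator
import re

# One clause looks like "level:name operator number"
CLAUSE = re.compile(r"^(\w+):(\w+)\s+(==|!=|<=|>=|<|>|times_eq)\s+(-?\d+(?:\.\d+)?)$")
COMPARISONS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def parse_condition(text):
    """Turn an 'and'-joined condition string into a list of clause tuples."""
    clauses = []
    for part in re.split(r"\s+and\s+", text.strip()):
        match = CLAUSE.match(part)
        if match is None:
            raise ValueError(f"cannot parse clause: {part!r}")
        level, name, op, number = match.groups()
        clauses.append((level, name, op, float(number)))
    return clauses


def evaluate(clauses, attributes, operation_counts):
    """Evaluate parsed clauses against attribute values and operation counts."""
    for level, name, op, number in clauses:
        if op == "times_eq":
            # Assumed meaning: the operation has been carried out `number` times
            actual = operation_counts.get((level, name), 0)
            if actual != number:
                return False
        elif not COMPARISONS[op](attributes[(level, name)], number):
            return False
    return True


condition = parse_condition(
    "comp_unit:CULT == 1 and comp_unit:tending_of_seedling_stand times_eq 0 "
    "and comp_unit:thinning times_eq 0"
)
attributes = {("comp_unit", "CULT"): 1}
fresh = evaluate(condition, attributes, {})
thinned = evaluate(condition, attributes, {("comp_unit", "thinning"): 1})
```

A grammar-based parser becomes worthwhile once conditions need `or`, parentheses or nested expressions, which this sketch does not cover.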

XML: The problem with ElementTree

• Only one: no schema support

• We’ve written the XML documents using an XML editor. A good XML editor provides XML Schema (or other schema) support to validate the contents of the document. This is a tremendous help in writing XML

• Should the users of the system write raw XML?

• No way!

• GUI development about to start: the XML documents are abstracted into UI components => need for XML validation inside the system

XML: moving forwards

• lxml has schema validation, and it provides the same API as ElementTree

• Painless migration, except

• lxml is not part of the standard library in Python

• an extra headache to take care of when deploying

• Closing the circle, lxml is based on libxml2, which we used in the initial C version

"Do we really want to reparse everything every time?"

• XML documents describe the system => those need to be parsed into the corresponding Python objects when the simulation is constructed

• However, the XML content is not that dynamic. Once a system is described, most of the time it stays pretty static, only minor tweaks here and there

• Decided to use Berkeley DB (bsddb) to store pickled objects: like using dictionary, except you get a true file system persisted database

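    import pickle

    def get_data(self, key):
        # Return the unpickled object stored under key, or None if the key is missing
        data = self.db.get(key, default=None)
        if data is None:
            return None
        return pickle.loads(data)

    def add_data(self, key, data):
        # Pickle the object with protocol 2 and store it under key
        value = pickle.dumps(data, protocol=2)
        self.db.put(key, value)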
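The bsddb module is no longer shipped with current Python versions, so the following self-contained sketch uses the standard library's `dbm` module as a stand-in for Berkeley DB. The `ObjectStore` class name and the keys are hypothetical; the idea (pickled objects behind a dictionary-like interface) is the same.

```python
import dbm
import os
import pickle
import tempfile


class ObjectStore:
    """Dictionary-like store that keeps pickled objects in a dbm file."""

    def __init__(self, path):
        # "c" opens the database for reading and writing, creating it if needed
        self.db = dbm.open(path, "c")

    def add_data(self, key, data):
        self.db[key] = pickle.dumps(data, protocol=2)

    def get_data(self, key):
        try:
            raw = self.db[key]
        except KeyError:
            return None  # nothing stored under this key
        return pickle.loads(raw)

    def close(self):
        self.db.close()


with tempfile.TemporaryDirectory() as tmp:
    store = ObjectStore(os.path.join(tmp, "simo_cache"))
    store.add_data("simulation", {"tasks": ["grow", "harvest"]})
    cached = store.get_data("simulation")
    missing = store.get_data("no_such_key")
    store.close()
```

With a store like this, the XML only needs to be reparsed when a document has actually changed.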

“What about those models?”

• Now we have a system that has

• hierarchical data with dynamic content

• user-modifiable simulation logic description

• I mentioned earlier that ultimately each task in the simulation logic description is executed by a model

• We need a way to introduce the model implementations to the system

Model libraries

• The models are collected into model libraries

• The bulk of them are shared libraries (dll, so, dylib) written in C; some are Python modules

• Each model also has an XML description detailing its input and output. This information is used at runtime to process the data passed to and from the model.

• There is a standard interface each model must implement => connecting the C models to the Python simulator is easy, thanks to ctypes

ctypes & models

• ctypes is used to load the shared library

libc = ctypes.CDLL(library)

• A Python wrapper function is generated semi-automatically for each model based on the header file content. The wrapper calls the function in the shared library (libc in the example):

    int Age_pine_hemib_h_KalliovirtaTokola(double h, int *nres,
                                           double *modelresult, char *errors,
                                           int errorCheckMode,
                                           double allowedRiskLevel,
                                           double rectFactor);

=>

    from ctypes import byref

    def Age_pine_hemib_h_KalliovirtaTokola(arg, libc, i):
        # Pass the i-th object's input and output buffers to the C model
        return libc.Age_pine_hemib_h_KalliovirtaTokola(
            arg.variables[i][0],
            byref(arg.nres[i]),
            arg.mem[i],
            byref(arg.errors[i]),
            arg.error_check_mode,
            arg.allowed_risk_level,
            arg.rect_factor)

ctypes & model parameters

• The model XML description is used to generate the input and output structures for the model in Python, those are then converted into C structures using ctypes, and vice versa:

    import ctypes
    import numpy

    def python2ctypes(self):
        c_int = ctypes.c_int
        c_double = ctypes.c_double
        c_char = ctypes.c_char
        nobj, nvar = self.variables.shape
        vals = self.variables
        # One list of c_double values per target object
        self.variables = [[c_double(y) for y in vals[x, :]] for x in range(nobj)]
        self.error_check_mode = c_int(self.error_check_mode)
        self.rect_factor = c_double(self.rect_factor)
        self.allowed_risk_level = c_double(self.allowed_risk_level)
        for i in range(len(self.parameters)):
            self.parameters[i] = c_double(self.parameters[i])
        rng = range(len(self.target_objects))
        nvalues = self.mem.shape[1]
        # Output buffers the C model writes into, one set per target object
        self.nres = [c_int() for _ in rng]
        self.mem = [(nvalues * c_double)() for _ in rng]
        self.errors = [(c_char * 200)() for _ in rng]

    def ctypes2python(self):
        rng = range(self.target_objects.size)
        # Read the output buffers back into numpy arrays
        self.errors = numpy.array([e.value.decode() for e in self.errors])
        self.mem = numpy.array([[float(y) for y in self.mem[x]] for x in rng])
        self.nres = numpy.array([int(self.nres[x].value) for x in rng])
        self.allowed_risk_level = float(self.allowed_risk_level.value)
        self.error_check_mode = int(self.error_check_mode.value)
        self.rect_factor = float(self.rect_factor.value)
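The round trip (Python values to ctypes, call the model, ctypes back to Python) can be tried without compiling a shared library. In this sketch a Python callback wrapped with `ctypes.CFUNCTYPE` stands in for a C model; `fake_model`, its shortened three-argument signature and the "0 means no error" convention are assumptions made for the example.

```python
import ctypes

# Prototype mirroring the first three arguments of the C model shown earlier:
# int model(double h, int *nres, double *modelresult)
MODEL_FUNC = ctypes.CFUNCTYPE(
    ctypes.c_int,
    ctypes.c_double,
    ctypes.POINTER(ctypes.c_int),
    ctypes.POINTER(ctypes.c_double),
)


@MODEL_FUNC
def fake_model(h, nres, modelresult):
    """Stand-in for a C model: writes two results through the pointers."""
    modelresult[0] = h * 2.0
    modelresult[1] = h + 1.0
    nres[0] = 2
    return 0  # 0 is assumed to mean "no error"


def call_model(func, heights, max_results=4):
    """Convert inputs to ctypes, call the model per object, convert back."""
    results = []
    for h in heights:
        nres = ctypes.c_int()
        mem = (ctypes.c_double * max_results)()  # output buffer
        status = func(ctypes.c_double(h), ctypes.byref(nres), mem)
        if status != 0:
            raise RuntimeError(f"model failed with status {status}")
        # Back to plain Python floats, keeping only the filled slots
        results.append([float(mem[j]) for j in range(nres.value)])
    return results


values = call_model(fake_model, [10.0, 12.5])
```

A function loaded with `ctypes.CDLL` is called the same way; only the source of `func` changes.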

The problem with models

• The function objects loaded from the shared libraries can’t be pickled

• That part of the system has to be generated every time a simulation is run

• Maybe ZODB could handle this; we will explore it as part of a larger refactoring of the code

“Man, is this ever so slow!”

• The adaptability of the system doesn’t come without its price

• The first iteration of the Python version was based on walking the DOM-trees, then moved to Python objects based on nested list and dictionary structures

• The structure had evolved slowly over a couple of years as we explored what the system could and should be like

• Slow, very slow, which was only natural as performance was not the key design criterion

• However, some of the potential users have millions of simulation units to simulate and the simulations cover decades

Fixing the performance

• Partial rewrite of the data handling to utilize multidimensional arrays using numpy

• Order of magnitude rise in performance; although this does include two C extensions of our own

• Now the functionality set of the simulator is somewhat complete and refactoring and partial rewrite is underway

• the new version is based completely on numpy arrays: the content of the XML documents is mapped to a 4 dimensional data array and its processing

Fixing the performance, part 2

• Looking into IPython to provide parallel processing capabilities for the simulator

• the simulation is “embarrassingly parallel”, each simulation unit is simulated independently of the others

• this is also part of the numpy rewrite: we started with processing just one simulation unit at a time; the next version will have simulation units as one of the dimensions of the data array

• as many units as memory allows are processed together
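Because the units are independent, the simulation can be mapped over a pool of workers. The sketch below uses a thread pool from the standard library purely to stay self-contained; it is not IPython's parallel machinery. The `simulate_unit` function and its growth rate are made up. Threads will not speed up CPU-bound pure Python code, so the point here is only the independence of the units.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate_unit(unit, years=3, growth=1.05):
    """Toy simulation: grow the volume of one unit for a number of years."""
    volume = unit["volume"]
    for _ in range(years):
        volume *= growth
    return unit["id"], volume


def simulate_all(units, workers=4):
    """Simulate all units independently and collect the results by unit id."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(simulate_unit, units))


units = [{"id": "a", "volume": 100.0}, {"id": "b", "volume": 200.0}]
results = simulate_all(units)
```

No unit reads another unit's state, so the same `map` call could be handed to a process pool or a cluster without changing `simulate_unit`.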

Testing

• The old code base was at one point covered by unit tests quite well

• however, the tests were bolted on as an afterthought

• so, they deteriorated

• not used to test driven development, too much testing overhead to really acquire a taste for it

• probably mainly because the tests were written after the code already was quite complex => complex setup

Fixing testing

• In the refactoring we are using doctests in ReST files together with nose. Suits at least me much better & provides documentation at the same time

• previously documentation was generated with Natural Docs from the docstrings

• Nose brings automated testing into doctests:

    nosetests --with-doctest --doctest-extension=rst

• As I learned on Monday, py.test might provide even better support for doctests
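For readers who have not seen doctests in ReST, this sketch runs the examples embedded in a ReST string with the standard library's `doctest` module. It does not use nose. The document text and the `run_rest_doctest` helper are hypothetical.

```python
import doctest

# Hypothetical ReST documentation containing an executable example
REST_TEXT = """
Stand volume
============

The total volume is the sum of the tree volumes:

    >>> volumes = [0.5, 1.5]
    >>> sum(volumes)
    2.0
"""


def run_rest_doctest(text, name="example.rst"):
    """Run the doctest examples found in a ReST string."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(text, globs={}, name=name, filename=name, lineno=0)
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)


results = run_rest_doctest(REST_TEXT)
```

The same text serves as documentation and as a test: if `sum(volumes)` stopped returning `2.0`, the run would report a failure.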

“That's a whole lot of numbers, could do with a graph”

• A user interface wasn’t part of the project, however, we needed some way of analysing the simulator output

• matplotlib provided a nice and quick way to turn a whole bunch of numbers into an image without really building a GUI

• Now that the GUI is on the cards, matplotlib may have ended its service inside the simulator

• So far, deploying has consisted of a single exe file compiled using py2exe

• Some difficulties in getting matplotlib to play nicely with this

A lesson in "think before you act"

• Reporting was done on the “we need this kind of report now” basis

• Different kinds of text files for simulator output

• simulator output is stored in a Berkeley DB instance as pickled Python objects

• can potentially lead to huge output files

• With hindsight, this kind of ad hoc reporting was the wrong way to go. Things culminated in aggregation reports for which we ended up writing a “layman’s SQL engine”. Not realising it at the time of course.

Fixing reporting

• Although for the data model as such the relational model is a no go, for output and reporting it should have been the obvious choice

• Reporting is due for a complete overhaul

• sqlite3 is a strong contender to provide the simulator output storage in a relational database

• reporting then based on that

• something not found in the “battery assortment” so far; a generic reporting framework
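A sketch of what relational output storage buys: once the results sit in a table, an aggregation report is a single SQL query instead of a home-grown engine. The table layout (`unit_id`, `year`, `variable`, `value`) and the numbers are invented for the example.

```python
import sqlite3

# Hypothetical simulator output rows: (unit id, year, variable name, value)
ROWS = [
    (1, 2010, "volume", 120.0),
    (1, 2020, "volume", 150.0),
    (2, 2010, "volume", 80.0),
    (2, 2020, "volume", 95.0),
]


def total_by_year(rows, variable):
    """Store the rows in an in-memory database and sum one variable per year."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE result (unit_id INTEGER, year INTEGER, "
            "variable TEXT, value REAL)"
        )
        conn.executemany("INSERT INTO result VALUES (?, ?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT year, SUM(value) FROM result "
            "WHERE variable = ? GROUP BY year ORDER BY year",
            (variable,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


totals = total_by_year(ROWS, "volume")
```

Replacing `":memory:"` with a file path would keep the output on disk between runs.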

What about the optimization?

• Bulk of the work went into developing a simulator that would fulfil the requirements

• Once the users are able to use simulators that are set up the way they want, the optimization is “just” a matter of picking the right combination out of the pool of all possible alternatives

• This is done by connecting existing optimization libraries to the software

• for now there is a linear programming package and a couple of heuristic optimization algorithms

• openopt hasn’t been connected yet, but is definitely on the TODO list
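To make "picking the right combination" concrete, here is a toy brute-force selection over simulated alternatives. It is neither linear programming nor one of the heuristics mentioned above, and it only works for a handful of units because the number of combinations grows exponentially. The units, schedules and figures are invented.

```python
from itertools import product

# Hypothetical alternatives per unit: (schedule name, income, harvested volume)
ALTERNATIVES = {
    "unit_1": [("no_action", 0, 0), ("thinning", 500, 40), ("clearcut", 2000, 200)],
    "unit_2": [("no_action", 0, 0), ("thinning", 300, 30), ("clearcut", 1500, 160)],
}


def best_plan(alternatives, max_volume):
    """Pick one schedule per unit, maximizing income under a volume limit."""
    units = sorted(alternatives)
    best_income, best_choice = None, None
    for combo in product(*(alternatives[u] for u in units)):
        volume = sum(alt[2] for alt in combo)
        if volume > max_volume:
            continue  # violates the harvest restriction
        income = sum(alt[1] for alt in combo)
        if best_income is None or income > best_income:
            best_income = income
            best_choice = {u: alt[0] for u, alt in zip(units, combo)}
    return best_income, best_choice


income, plan = best_plan(ALTERNATIVES, max_volume=240)
```

The simulator's job is to produce the alternatives; the goal (income) and the restriction (maximum harvested volume) belong to the planning problem.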

A few words about project management

• Used subversion from the start

• saved our backs on several occasions, when something had gone wrong but we weren’t quite sure what it was

• When the project ended and the results were published in November ’07, we set up a trac site to provide bug tracking, a wiki and a nice view of the svn repository

• trac hasn’t seen too much use so far as the main effort has been in rewriting the core, but undoubtedly its role will increase for the project

• A PyPI egg of SIMO is waiting for the new version with its new dependencies

Want to know more about SIMO?

• www.simo-project.org

• The developers of SIMO: www.simosol.fi / [email protected]

• You can reach me directly at [email protected]
