Structure versioning for PyTables

17
Python, PyTables and meta-classes Author: Marcin Swiatek , Visimatik Inc., 2007 Introduction My task was to organize genotyping data for analysis. At the time I was working on a whole-genome association study . For a programmer, the sheer size of input datasets is one of defining characteristics of such projects. In this case I had over 70 GB of compressed text to deal with. Data for a WGA study is best pictured as a very large matrix, with columns corresponding to single nucleotide polymorphisms (SNPs). Rows correspond to samples or – in human terms, people who donated their genetic material for the study. Values in cells of this huge 1 matrix indicate nucleotides found at these positions. Unfortunately, genotyping data does not necessarily arrive in a format convenient for analysis. More often it will have structure suitable for recording measurements and will need to be converted. This document and the associated source code can be downloaded from www.visimatik.com . The toolkit Python and numpy are my tools of choice for this type of work. For data persistence, I would normally use pickling or XML, but given the volume of data, neither was appropriate. Due to the required form of data, a relational database would not fit the purpose, either. Enter HDF5 and PyTables 2 . HDF5 is a data format and a library for managing large data collections, with a niche in scientific data. PyTables is a Python interface to HDF5 files. The structure of an HDF5 file comprises of three types of entities, groups, tables and arrays (arrays will not be present in the discussion). These entities 3 are arranged into a tree, not unlike a file system. The problem Ultimately, my goal was to explore data through various programs and scripts I was yet to write. At that point I could not possibly know what exactly these programs would do. What I did know was how they might be accessing their data. For instance, retrieval of parts of the main matrix, along with indexes describing loaded fragments, was expected to be a frequent request. This perspective defined my first design. My grasp of the domain was still weak at the time. Knowing it, I consciously tried to ignore my inner prophet's predictions of how data structures would evolve. In principle, it is better to refrain from making generalizations if you do not have enough detailed knowledge from which to draw these generalizations. Still, some rudimentary abstractions were necessary. To insulate computations from technicalities of managing the data file itself, I put together a simple encapsulation a 'database connection'. My wrapper consisted of two classes. TableDef defined the database structure, while Notebook was responsible for 'housekeeping chores': most importantly, creating, opening and closing the file. Implementation of __getattr__ (listing 1) shows that no attempt was made to hide the structure of data 1 Whole genome studies frequently profile more then a thousand people using about several thousand SNPs. 2 The best place to start looking for tools of this kind is Scipy, a comprehensive resource for using Python in science. 3 In the sequel, I will often refer to these entities by a common moniker 'part'. 1

description

This article gives an example of use of reflection and Python’s metaclass idiom. The presented code is a minor extension to PyTables, a package for managing hierarchical datasets from Python.

Transcript of Structure versioning for PyTables

Page 1: Structure versioning for PyTables

Python, PyTables and meta-classes

Author: Marcin Swiatek, Visimatik Inc., 2007

Introduction

My task was to organize genotyping data for analysis. At the time I was working on a whole-genome association study. For a programmer, the sheer size of input datasets is one of defining characteristics of such projects. In this case I had over 70 GB of compressed text to deal with. Data for a WGA study is best pictured as a very large matrix, with columns corresponding to single nucleotide polymorphisms (SNPs). Rows correspond to samples or – in human terms, people who donated their genetic material for the study. Values in cells of this huge1 matrix indicate nucleotides found at these positions. Unfortunately, genotyping data does not necessarily arrive in a format convenient for analysis. More often it will have structure suitable for recording measurements and will need to be converted.

This document and the associated source code can be downloaded from www.visimatik.com.

The toolkit

Python and numpy are my tools of choice for this type of work. For data persistence, I would normally use pickling or XML, but given the volume of data, neither was appropriate. Due to the required form of data, a relational database would not fit the purpose, either. Enter HDF5 and PyTables 2 . HDF5 is a data format and a library for managing large data collections, with a niche in scientific data. PyTables is a Python interface to HDF5 files. The structure of an HDF5 file comprises of three types of entities, groups, tables and arrays (arrays will not be present in the discussion). These entities3 are arranged into a tree, not unlike a file system.

The problem

Ultimately, my goal was to explore data through various programs and scripts I was yet to write. At that point I could not possibly know what exactly these programs would do. What I did know was how they might be accessing their data. For instance, retrieval of parts of the main matrix, along with indexes describing loaded fragments, was expected to be a frequent request. This perspective defined my first design. My grasp of the domain was still weak at the time. Knowing it, I consciously tried to ignore my inner prophet's predictions of how data structures would evolve. In principle, it is better to refrain from making generalizations if you do not have enough detailed knowledge from which to draw these generalizations.

Still, some rudimentary abstractions were necessary. To insulate computations from technicalities of managing the data file itself, I put together a simple encapsulation a 'database connection'. My wrapper consisted of two classes. TableDef defined the database structure, while Notebook was responsible for 'housekeeping chores': most importantly, creating, opening and closing the file.

Implementation of __getattr__ (listing 1) shows that no attempt was made to hide the structure of data

1 Whole genome studies frequently profile more then a thousand people using about several thousand SNPs. 2 The best place to start looking for tools of this kind is Scipy, a comprehensive resource for using Python in science.3 In the sequel, I will often refer to these entities by a common moniker 'part'.

1

Page 2: Structure versioning for PyTables

storage: clients are allowed to directly access the catalog structure of the dataset. Likely, this would not be allowed to stand in the long run. However, this exercise is not about the mythical long run. It is chiefly about createDatabase and verifyDatabase routines. The former, invoked only when a new file is created, puts in place basic structures required by the application. The latter asserts that the opened HDF5 file appears to have these basics. The code in listing 1 is intended to give the idea of that first, sketchy solution. It is provided for illustration purposes only.

This system worked for me for several days, until I realized that additional data, describing relations between SNPs on the same chromosome, was needed. The new information came in the form of several large matrices, again with associated indexes. I needed to adjust my design and, since re-running the computation-heavy conversion was not practical, to upgrade my data files accordingly.

As one would expect, HDF allows modifications of data structures in existing files. There are several functions in Pytables, which manipulate data definition; I had already used some of them to create the original structures. What could be simpler than adding similarly shaped code injecting additions into the model? Well, the route of small, incremental changes has inherent problems. For instance, the new code would have to be very particular about verifying its preconditions, to check, for example, if the entity about to be created already is there. Managing data definitions through writing more and more imperative definitions is bound to result in something rather ugly as the software evolves. Consider the following argument:

Suppose, I would like to add a new table, defined by: class DistanceInChromsomes(tables.IsDescription): chromosome = tables.UInt8Col() source_name = tables.StringCol(itemsize = 256) array_set = tables.StringCol(itemsize = 256)

A simple conditional instruction would do splendidly:if '/source/hapmap/files' not in dbfile:

dbfile.createTable(/source/hapmap, 'files', TableDef.DistanceInChromosomes)

It is easy enough and it apparently works. So where is the problem?

Firstly, the definitions of tables' structures are separated from declaration of their position within the hierarchy.

Secondly, adding a new structure requires changing code in three (or maybe even four) places: creation, verification and 'upgrade' routines, on top of the definition itself. It would be better to have just one place to modify. Inevitably, if there are four places to update with every change, one will occasionally get missed.

Thirdly, while generic 'upgrade and verify' code may be more difficult to write, you need to write and test it just once. A sequence of nested if statements is simple enough to edit, but these additions will grow more and more difficult to review and test.

On top of all these concerns: writing code of that sort is incredibly boring and makes any good programmer suffer. In the spirit of increasing world happiness, I will propose an elegant solution to the problem at hand.

Pytables declares table structures in an interesting way. The function createTable will accept a class

2

Page 3: Structure versioning for PyTables

declaration4 as the parameter describing layout of the new table. I liked the idea for its apparent elegance and decided to extend it into a reusable mechanism tying these classes-definitions with their incarnations in the physical file. Not just tables, but also their place in the hierarchy ought to be deducted directly from declarations. This definition will be allowed to evolve along with the application. The necessary updates of the file structure will be taken care of as a matter of course.

The first stab

The Python idiom applied in Pytables' createTable, where a class definition is used to guide a computation5, requires a facility to examine that definition in runtime6. This facility is called 'reflection'. The next section gives a general outline of the concept.

Reflection

According to wiktionary, reflection is:

1. the act of reflecting or the state of being reflected

2. the property of a propagated wave being thrown back from a surface (such as a mirror)

3. something, such as an image, that is reflected

The dog barked at his own reflection in the mirror.

4. careful thought or consideration

After careful reflection, I have decided not to vote for that proposition.

5. an implied criticism

It is a reflection on his character that he never wavered in his resolve.

I like to think of the computer science term 'reflection' as relating more to 4 than to 1-3. More introspection7 and careful consideration, less mirror-gazing. Whatever the etymology, the crux of the concept is programs' ability to examine and possibly modify their own structure. It is a very interesting paradigm: programs that rewrite themselves; computer science does not get much more philosophical than that. Dynamic languages with dynamic type systems8, like Smalltalk, Lisp and. of course, Python, particularly encourage styles of programming relying on reflection.

The key to practical realization of reflection is an execution mechanism (a virtual machine) which preserves the organization of the program as it was seen by the programmer. This request is naturally meet in interpreted languages, but the concept is not confined to that realm anymore. Java is presently the most popular programming environment employing reflection. Both Java and its spiritual cousin C# are compiled and statically typed languages. But, unlike languages like C or C++, Java and C# target execution mechanisms which do have understanding of objects and classes. Consequently, the platform itself can, and does, provide the necessary means.

There is a notable difference between the view of reflection in Java and in Python. In statically-typed languages, where types have to be known in compile-time, one cannot evolve a type or change objects'

4 The class must be derived from tables.IsDescription. Consult Pytables manual for details. 5 In our case the computation results in writing something to a disk file. 6 I can already hear my inner C++ programmer (have I already confessed that?) arguing the merits of meta-programing C+

+ style. One could imagine generating code for creating the table from a class using macros and/or templates. 7 Be aware that the term 'introspection' has a specific meaning in computer science. See type introspection. 8 I do realize it sounds awful. But neither dynamic languages have to be dynamically typed, nor dynamic typing implies

dynamic execution model. Consult your favorite on-line source or your local computer science guru for more information.

3

Page 4: Structure versioning for PyTables

layout in runtime. Consequently, reflection APIs in .NET and in JRE9 do not allow alterations to definitions of class or functions in the runtime. Owing to its dynamic type system, Python permits programs to modify themselves during its execution.

The introspective code

Listing 2 presents the code I wrote at first. Be very suspicious of it! It plays the role of a negative example, as it well illustrates several pitfalls of manipulating class objects in run-time. But before the self-flagellation starts, a word about my design objectives.

First, the entire HDF5 file structure (excluding arrays10) should by defined by a class declaration. This premise begot two entities we will see in listing 2: TableDef and GroupDescriptor classes. The former is to group definitions of all entities in a HDF5 file, while the latter is the base class for declaring groups. In truth, croups could be readily deduced from tables' declarations, which from now on have to specify their place in hierarchy (objName). But I still thought it was better to have the option of defining them explicitly.

Next, I wanted to handle versions of the structure. Each declared part of the file structure bears a version number. Note that the Version attribute of TableDef and objVersion of classes defining tables have different roles to play. The number associated with the top level class is simply the version number of the entire database; each addition of a subordinate structure should trigger increase of Version attribute. The attribute objVersion of a particular table or groups shall equal to the version of TableDef from when that table was added to the data model. This will ease the processing of database upgrades, even if only partly11.

Finally, the mechanism will be designed for reuse.

The most interesting things happen in the __init__ function of TableDef. A newly created instance walks through its own class dictionary, reviewing all its members defined in the class' body. By design, tables and groups are represented by nested classes derived from designated bases; inheritance hierarchy of a class can be accessed in runtime through its '__bases__' variable. Once a table or a group declaration is found, it is put into the list of tables or groups, along with some auxiliary information. Having these lists, we can easily find a differential between two versions of the database or find the correct processing order for any manipulations. This order is defined by nesting hierarchy implied by values of objName attributes.

This simple code has just one shortcoming: it will not work. Your own “personal programmer's warning bell” should be ringing by now. Have a look at what is done when a declaration of a class inheriting from tables.IsDescription is encountered. I extract information from objName and objVersion fields of a class declaration and then I delete these fields! Evidently, this could work only the first time an object of TableDef class is created. On the next attempt, classes declaring tables' structures would be already

9 Admittedly, one can think up several ways of doing these things within the framework of .NET CLR (or in JRE), but outside the framework of statically typed Java and C#. This is certainly a real need, since the software of this kind is already emerging.

10 Arrays are different, because they do not have a structure to be defined up-front and they do not have to be declared before they are inserted. Since they confuse the picture, I have decided to conveniently exclude them from consideration for now.

11 Neither dropping of tables nor changes to tables' structure are supported by the code. It is quite apparent, though, how additions of fields and removal of tables and groups could easily be handled in this framework (as soon, as we get it to work). Data transforms are clearly out of scope of this simple example.

4

Page 5: Structure versioning for PyTables

stripped of this information. An obvious mistake, easy to catch in the first code review! However, as it turns out, this monstrosity would not work even once.

Unexpected complications

For now, let us ignore the impropriety of fiddling with the definition of a class in a method of its instance. This will be dealt with in time. Why is the program not working at all? As it turns out, attempts to access field objVersion in chld variable (we are still in __init__ method of TableDef) will often result in AttributeError exception. This may be surprising, since all members of the class that may pass type filters in if statements should have this attribute. Yet closer inspection in the debugger will reveal that there is no such field. It is as if this piece of information has been lost. It is easy to find, though: all variables assigned in class scope were turned into members of the columns collection (see Illustration 1, below). Overall, the object representing a class derived from IsDescription looks quite different from what you might expect. This is because the creators of Pytables have elected to customize the class creation process through the use of __metaclass__.

Earlier in the article we have discussed the notion of reflection. There is one important conclusion to draw from that section, which should be stated clearly: classes are objects, too. In Python classes are first-class objects, which means they can be used as any other objects in Python. In particular, they are at some point constructed.

Normally, classes are created by a call to a built-in function type(name, bases, dict) . This default behavior can be easily changed by substituting the factory of the class object: the class's metaclass. A metaclass (itself a class, too) has to have a special method called __new__, with signature matching that of type function above. All that remains is a single assignment to class's __metaclass__ variable in the body of its declaration: class metaMetaBase(type): def __new__(cls, classname, bases, classdict):

pass

class MetaBase: __metaclass__ = metaMetaBase

The substituted method can modify the created class12 or return some other representation altogether.

12 It is a dictionary; in Python all objects are dictionaries in a way.

5

Illustration 1: 'columns' collection (debugger screenshot)

Page 6: Structure versioning for PyTables

Stepping though the process of constructing any class derived from tables.IsDescription13, we get quickly to the implementation of the class metaIsDescription in description.py. There, the class dictionary is customized. All declared fields that 'look like' field names are grouped in a single item in the class dictionary: columns. In effect, instead of chld.objVersion, we end up with chld.columns['objVersion']. This explains why the original code does not work and hints at what needs to be done to fix it.

Firstly, I will either need to override the existing mechanism of creating IsDescription, or hack my way through nested declarations to extract objVersion and objName. Secondly, all processing of class declarations will need to be moved from instance code of TableDef (or whatever will replace it) to its class code.

After some thinking I decided to work around the nested dictionaries, rather than to insert another layer of inheritance. I wanted to keep my little overlay consistent with Pytables' documentation and allow deriving table definitions directly from IsDescription. Source code related to these changes is in listing 3 (note the lack of the usual 'suspicious code' disclaimer).

Form follows function

Now let's put it all together and examine, how the system has been laid out and how it works. The structure of an HDF5 file will be defined as a single class, derived from MetaBase (see Listing 4). Class TestTableDef (already presented in listing 3) is an example of such definition and contains several nested classes, declaring groups and tables. Although MetaBase is the main extension point of the mechanism and thus its most visible part, the bulk of the 'meta' functionality has been relegated to the class object's creator: metaMetaBase (listing 3). Upon loading the module with the definition of a class derived from MetaBase (for instance TestTableDef), the Python interpreter will attempt to create an object representing that class. TestTableDef does not define its own metaclass, but its base class does. Consequently, the __new__ function of metaMetaBase will be invoked and its return value will be used to represent TestTableDef class.

What metaMetaBase.__new__ does to the original representation is easy to interpret. First, notice is that it calls upon the base implementation by invoking: the_default = super(metaMetaBase, cls).__new__(cls, classname, bases, classdict)

The variable the_default is then modified, but in a rather non-abusive way, at least compared to the high standard set by tables.IsDescription. Information regarding groups and tables is gathered in collections, which will be made available to the class and instance routines of the class being created. It is worth pointing out that all class objects derived from IsDescription are stripped of objName and objVersion attributes; these fields are not real columns and that they need to be removed before being passed to routines creating tables.

There is still some very pretty Python syntactic sugar to be appreciated here. I apply it with the intent to make inheriting from MetaBase look as natural, as possible. Arguably, the natural way to use a class is through its instances; any recourse to the 'meta' level should remain an exception. The code of Listing 5 illustrates intended usage patterns: all access to the described system is channeled through plain invocations of instance methods of the __metadata member variable of the Notebook class. Results of these method calls are used by routines create, upgrade and verify. It would probably be acceptable to have the routines of Notebook contend with table information as created by metaMetaBase. But this would make these two classes logically coupled: whoever works with the Notebook class would have to know how to interpret values returned by calls to __metadata and, at times, how they were generated. In

13 Put a breakpoint on any assignments creating column specifications.

6

Page 7: Structure versioning for PyTables

a reusable mechanism, this is to be avoided. I found it best to complete the encapsulation of the mechanism by creating 'callable' objects, representing actions of creating tables and groups (listing 6; see use in 4).

'__metaclass__' domesticated

Python's __metaclass__ idiom allows overriding Python's default constructor of objects representing classes. This is an interesting semantic device and, although it certainly invites abuse, it can be very helpful at times. While there are many ways to define database structures in code, the proposed approach is elegant and, very importantly, offers strong encapsulation. Hence, classes which rearrange their own layout in unexpected ways may still appear 'normal' to the outside world and the use of reflection remains contained.

A reader interested in other uses of __metaclass__ should pick up a copy of the excellent Python Cookbook. Standard Python documentation outlines issues related to class creation. Several suggestions regarding potential application of metaclasses are given there, too.

Source code examples

Listing 1: ad-hoc wrapper classes

class TableDef: class SNPs(tables.IsDescription): name = tables.StringCol(itemsize = 32) chromosome = tables.UInt8Col() aminor = tables.UInt8Col() amajor = tables.UInt8Col()

class Individuals(tables.IsDescription): symbol = tables.StringCol(itemsize = 64) cohort = tables.UInt32Col() class Cohorts(tables.IsDescription): symbol = tables.StringCol(itemsize = 32) fIndivIdx = tables.UInt32Col() countIndiv = tables.UInt32Col() class Chromosomes(tables.IsDescription): symbol = tables.UInt8Col() fSNP_Idx = tables.UInt32Col() countSNP = tables.UInt32Col() class Chunk(tables.IsDescription): chromosome = tables.UInt8Col() cohort = tables.StringCol(itemsize = 8) chunk_name = tables.StringCol(itemsize = 256)

class Notebook: Version = 1.0 def __init__(self, name, path, write_access = True, feedback = Reporting.DummyFeedback()): """

7

Page 8: Structure versioning for PyTables

If the directory exists, attempt to open the file and the indice files (if these do not exist, we have an error). If there is no directory, attempt to create it (and all tables that should be there). """ # name will be a pickle with some basic info (just to mark the teritory) self.__paths = {} self.__paths["main"] = path self.__paths["pickle"] = os.path.join(path, name.strip() + ".pck") self.__paths["HD5"] = os.path.join(path, "data.HD5") self.__HD5 = None self.Stamp = None self.__GUI = feedback self.__ReadOnly = not write_access try: if os.path.isdir(path): self.Stamp = cPickle.load(open(self.__paths["pickle"], "r")) self.__HD5 = tables.openFile(self.__paths["HD5"], ["r", "r+"][self.__ReadOnly]) self.verifyDatabase(self.__HD5) elif not self.__ReadOnly: os.mkdir(path) self.Stamp = (name, self.version, time()) cPickle.dump(self.Stamp, open(self.__paths["pickle"], "w")) self.__HD5 = self.createDatabase(self.__paths["HD5"]) if self.__HD5 is None: raise NotebookDoesNotExist(path, name) except IOError, eio: raise NotebookInvalidSource(eio) except tables.NoSuchNodeError, ensn: raise NotebookInvalidStructure(ensn) def close(self): if not self.__ReadOnly: self.Stamp = (self.Stamp[0], self.Version, time()) cPickle.dump(self.Stamp, open(self.__paths["pickle"], "w")) self.__HD5.close() self.__HD5 = None self.Stamp = None def __del__(self): if not self.__HD5 is None: self.__HD5.close() self.__HD5 = None if not self.Stamp is None: self.Stamp = (self.Stamp[0], self.version, time()) cPickle.dump(self.Stamp, open(self.__paths["pickle"], "w")) self.Stamp = None def __getattr__(self, aname): return getattr(self.__HD5, aname) def verifyDatabase(self, dbfile): """ check if this is a proper database. If it is not, feel free to thow an exception """ dbfile.isVisibleNode("/analyses")

8

Page 9: Structure versioning for PyTables

dbfile.isVisibleNode("/metadata") dbfile.isVisibleNode("/definition") dbfile.isVisibleNode("/source") dbfile.isVisibleNode("/definition/chromosomes") dbfile.isVisibleNode("/definition/cohorts") dbfile.isVisibleNode("/definition/SNPs") dbfile.isVisibleNode("/definition/individuals") dbfile.isVisibleNode("/source/imports") def createDatabase(self, fpath): """ create a new HD5 file and its content """ dbfile = tables.openFile(fpath, "a") gAgg = dbfile.createGroup(dbfile.root, 'analyses', 'Information derived in analysis') gMeta = dbfile.createGroup(dbfile.root, 'metadata', 'Metainformation about performed analyses') gSubj = dbfile.createGroup(dbfile.root, 'definition', 'Subject data defiition') gRaw = dbfile.createGroup(dbfile.root, 'source', 'Raw source data uploaded with nbcp') dbfile.createTable(gSubj, 'chromosomes', TableDef.Chromosomes) dbfile.createTable(gSubj, 'cohorts', TableDef.Cohorts) dbfile.createTable(gSubj, 'individuals', TableDef.Individuals) dbfile.createTable(gSubj, 'SNPs', TableDef.SNPs) dbfile.createTable(gRaw, "imports", TableDef.Chunk) return dbfile def kill(self): self.close() assert not self.__ReadOnly for key in self.__paths.keys(): if "main" != key: os.remove(self.__paths[key]) os.rmdir(self.__paths["main"])

Listing 2 – the first version with reflection""" Some work on meta - structure (handling db creation and updates)"""import tables

class GroupDescriptor: pass

def getHierarchyFromPath(all_path): pieces = all_path.split("/") if len(pieces) > 1: return [ "/".join(pieces[:idx + 2]) for idx in xrange(len(pieces) - 1)] else: return []

class TableDef: Version = 1.1 class SNPs(tables.IsDescription): """SNPs table""" objName = "/definition/SNPs"

9

Page 10: Structure versioning for PyTables

objVersion = 1.0 name = tables.StringCol(itemsize = 32) chromosome = tables.UInt8Col() aminor = tables.UInt8Col() amajor = tables.UInt8Col()

class Individuals(tables.IsDescription): objName = "/definition/individuals" objVersion = 1.0 symbol = tables.StringCol(itemsize = 64) cohort = tables.UInt32Col() class Cohorts(tables.IsDescription): objName = "/definition/cohorts" objVersion = 1.0 symbol = tables.StringCol(itemsize = 32) fIndivIdx = tables.UInt32Col() countIndiv = tables.UInt32Col() class Chromosomes(tables.IsDescription): objName = "/definition/chromosomes" objVersion = 1.0 symbol = tables.UInt8Col() fSNP_Idx = tables.UInt32Col() countSNP = tables.UInt32Col() class Chunk(tables.IsDescription): objName = "/source/imports" objVersion = 1.0 chromosome = tables.UInt8Col() cohort = tables.StringCol(itemsize = 8) chunk_name = tables.StringCol(itemsize = 256) class DistanceChromsomes(tables.IsDescription): objName = "/source/hapmap/files" objVersion = 1.1 chromosome = tables.UInt8Col() source_name = tables.StringCol(itemsize = 256) array_set = tables.StringCol(itemsize = 256) class SourceGroup(GroupDescriptor): """Raw source data uploaded with nbcp""" objName = "/source" objVersion = 1.0 class AnalysesGroup(GroupDescriptor): """Information derived in analysis""" objName = "/analyses" objVersion = 1.0 class MetadataGroup(GroupDescriptor): """Metainformation about performed analyses""" objName = '/metadata' objVersion = 1.0 class DefinitionGroup(GroupDescriptor): """Subject data definition"""

10

Page 11: Structure versioning for PyTables

objName = '/definition' objVersion = 1.0 def __init__(self): classOf = self.__class__ # now list all dependant classes groups = [] tbls = [] for chld in classOf.__dict__.values(): try: bases = chld.__bases__ if GroupDescriptor in bases: description = chld.__doc__ name_parts = chld.objName.split("/") count_parts = len(name_parts) - 1 assert name_parts[0] == "" and name_parts[count_parts] != "" short_name = name_parts[count_parts] path = "/".join(name_parts[:count_parts]) groups.append((short_name, path, count_parts, description, chld.objVersion)) elif tables.IsDescription in bases: description = chld.__doc__ # this code does not work. It is here for illustration purposes only # DO NOT ATTEMPT TO USE version = chld.objVersion del chld.objVersion objName = chld.objName del chld.objName name_parts = objName.split("/") count_parts = len(name_parts) - 1 assert name_parts[0] == "" and name_parts[count_parts] != "" short_name = name_parts[count_parts] path = "/".join(name_parts[:count_parts]) tbls.append((short_name, path, count_parts, description, version, chld)) except AttributeError: pass # now find all groups that may be required (based on 'path' concept), but are not explicitly listed grdct = {} for gr in groups: created_grp = "/".join([gr[1], gr[0]]) grdct[created_grp] = gr[4] extra_groups = [] for gr in groups: extra_groups += [ (spath, gr[4]) for spath in getHierarchyFromPath(gr[1])] for tb in tbls: extra_groups += [ (spath, tb[4]) for spath in getHierarchyFromPath(tb[1])] # walk through all extra (potential) groups. If they are not there yet, add them to the list extra_groups.sort() prev_item = "" for gr in extra_groups: if gr[0] != prev_item and gr[0] not in grdct:

11

Page 12: Structure versioning for PyTables

prev_item = gr[0] name_parts = prev_item.split("/") count_parts = len(name_parts) - 1 assert name_parts[0] == "" and name_parts[count_parts] != "" short_name = name_parts[count_parts] path = "/".join(name_parts[:count_parts]) groups.append((short_name, path, count_parts, None, gr[1])) self.Groups = groups self.Tables = tbls def __str__(self): outs = "Tables:\r\n" for table in self.Tables: outs += "%s/%s ver %.1f (%s)\r\n" % (table[1], table[0], table[4], type(table[5])) outs += "Groups:\r\n" for group in self.Groups: outs += "%s/%s ver %.1f (%s)\r\n" % (group[1], group[0], group[4], group[3]) return outs if __name__ == "__main__": metad = TableDef() print metad

Listing 3: essential parts of the implementation using __metaclass__

class metaMetaBase(type): def __new__(cls, classname, bases, classdict): the_default = super(metaMetaBase, cls).__new__(cls, classname, bases, classdict) if len(bases) == 0: return the_default # do not process the base class... groups = [] tbls = [] for chld in the_default.__dict__.values(): try: bases = chld.__bases__ if GroupDescriptor in bases: description = chld.__doc__ name_parts = chld.objName.split("/") count_parts = len(name_parts) - 1 assert name_parts[0] == "" and name_parts[count_parts] != "" short_name = name_parts[count_parts] path = "/".join(name_parts[:count_parts]) groups.append((short_name, path, count_parts, description, chld.objVersion)) elif tables.IsDescription in bases: description = chld.__doc__ # this class has been transformed in pyTables through use of __metaclass__ idiom # need to get to the dictionary... version = chld.columns["objVersion"] del chld.columns["objVersion"] objName = chld.columns["objName"] del chld.columns["objName"]

12

Page 13: Structure versioning for PyTables

name_parts = objName.split("/") count_parts = len(name_parts) - 1 assert name_parts[0] == "" and name_parts[count_parts] != "" short_name = name_parts[count_parts] path = "/".join(name_parts[:count_parts]) tbls.append((short_name, path, count_parts, description, version, chld)) except AttributeError: pass # now find all groups that may be required (based on 'path' concept), but are not explicitly listed grdct = {} for gr in groups: created_grp = "/".join([gr[1], gr[0]]) grdct[created_grp] = gr[4] extra_groups = [] for gr in groups: extra_groups += [ (spath, gr[4]) for spath in getHierarchyFromPath(gr[1])] for tb in tbls: extra_groups += [ (spath, tb[4]) for spath in getHierarchyFromPath(tb[1])] # walk through all extra (potential) groups. If they are not there yet, add them to the list extra_groups.sort() prev_item = "" for gr in extra_groups: if gr[0] != prev_item and gr[0] not in grdct: prev_item = gr[0] name_parts = prev_item.split("/") count_parts = len(name_parts) - 1 assert name_parts[0] == "" and name_parts[count_parts] != "" short_name = name_parts[count_parts] path = "/".join(name_parts[:count_parts]) groups.append((short_name, path, count_parts, None, gr[1])) #TODO - check if that works for more levels of inheritance (than just something derived from MetaBase) the_default.Groups = groups the_default.Tables = tbls return the_default

class MetaBase: __metaclass__ = metaMetaBase Version = 0.0 def __str__(self): outs = "Tables:\r\n" for table in self.Tables: outs += "%s/%s ver %.1f (%s)\r\n" % (table[1], table[0], table[4], type(table[5])) outs += "Groups:\r\n" for group in self.Groups: outs += "%s/%s ver %.1f (%s)\r\n" % (group[1], group[0], group[4], group[3])

13

Page 14: Structure versioning for PyTables

return outs

class TestTableDef(MetaBase): Version = 1.1 class SNPs(tables.IsDescription): """SNPs table""" objName = "/definition/SNPs" objVersion = 1.0 name = tables.StringCol(itemsize = 32) chromosome = tables.UInt8Col() aminor = tables.UInt8Col() amajor = tables.UInt8Col()

class Individuals(tables.IsDescription): objName = "/definition/individuals" objVersion = 1.0 symbol = tables.StringCol(itemsize = 64) cohort = tables.UInt32Col() class Cohorts(tables.IsDescription): objName = "/definition/cohorts" objVersion = 1.0 symbol = tables.StringCol(itemsize = 32) fIndivIdx = tables.UInt32Col() countIndiv = tables.UInt32Col() class Chromosomes(tables.IsDescription): objName = "/definition/chromosomes" objVersion = 1.0 symbol = tables.UInt8Col() fSNP_Idx = tables.UInt32Col() countSNP = tables.UInt32Col() class Chunk(tables.IsDescription): objName = "/source/imports" objVersion = 1.0 chromosome = tables.UInt8Col() cohort = tables.StringCol(itemsize = 8) chunk_name = tables.StringCol(itemsize = 256) class DistanceChromsomes(tables.IsDescription): objName = "/source/hapmap/files" objVersion = 1.1 chromosome = tables.UInt8Col() source_name = tables.StringCol(itemsize = 256) array_set = tables.StringCol(itemsize = 256) class SourceGroup(GroupDescriptor): """Raw source data uploaded with nbcp""" objName = "/source" objVersion = 1.0 class AnalysesGroup(GroupDescriptor): """Information derived in analysis""" objName = "/analyses" objVersion = 1.0

14

Page 15: Structure versioning for PyTables

class MetadataGroup(GroupDescriptor): """Metainformation about performed analyses""" objName = '/metadata' objVersion = 1.0 class DefinitionGroup(GroupDescriptor): """Subject data definition""" objName = '/definition' objVersion = 1.0

Listing 4: MetaBase, the base class for structure definition

class MetaBase: __metaclass__ = metaMetaBase Version = 0.0 def __str__(self): outs = "Tables:\r\n" for table in self.Tables: outs += "%s/%s ver %.1f (%s)\r\n" % (table[1], table[0], table[4], type(table[5])) outs += "Groups:\r\n" for group in self.Groups: outs += "%s/%s ver %.1f (%s)\r\n" % (group[1], group[0], group[4], group[3]) return outs def getVersionDifferential(self, base_version): """get all items that are newer than the base version. This is a list of callables that needs to be executed against the database""" lobj = [] lobj += [ tpl for tpl in self.Groups if tpl[4] > base_version] lobj += [ tpl for tpl in self.Tables if tpl[4] > base_version] lobj.sort(lambda a, b : cmp(a[2], b[2])) return [self.callableFactoryFunction(tpl) for tpl in lobj] def getCreateSequence(self): return self.getVersionDifferential(0.0) def getVersionPaths(self, version): lobj = [] lobj += [ tpl for tpl in self.Groups if tpl[4] <= version] lobj += [ tpl for tpl in self.Tables if tpl[4] <= version] lobj.sort(lambda a, b : cmp(a[2], b[2])) return [("%s/%s" % (tpl[1], tpl[0])) for tpl in lobj] def callableFactoryFunction(self, tpl): if len(tpl) == 5: return CreateGroupWrapper(tpl[1], tpl[0], tpl[3]) elif len(tpl) == 6: return CreateTableWrapper(tpl[1], tpl[0], tpl[5], tpl[3]) else: return None

15

Page 16: Structure versioning for PyTables

Listing 5: The main wrapper class (Notebook)

class Notebook: """ THIS IS AN ILLUSTRATION ONLY. CODE NOT RELEVANT TO THE EXAMPLE HAS BEEN LEFT OUT """ def version(self): ver = 0.0 if not self.__ReadOnly: ver = self.metadata().Version elif not self.Stamp is None: ver = self.Stamp[1] # kludge - past sins if ver == "0.1.0": ver = 1.0 return ver def upgradeDatabase(self, db, oldstamp): ver = oldstamp[1] assert not self.__ReadOnly cmdlst = self.metadata().getVersionDifferential(ver) if len(cmdlst): for cmd in cmdlst: cmd(db) return oldstamp[0], self.version(), time() else: return oldstamp def verifyDatabase(self, dbfile): """ check if this is a proper database. If it is not, feel free to thow an exception """ lst = self.metadata().getVersionPaths(self.version()) assert len(lst) > 0 for path in lst: assert path in dbfile def metadata(self): if self.__metadata is None: self.__metadata = TableDef() return self.__metadata def createDatabase(self, fpath): """ create a new HD5 file and its content """ dbfile = tables.openFile(fpath, "a") createl = self.metadata().getCreateSequence() for cmd in createl: cmd(dbfile) return dbfile

Listing 6: callable wrapper classes for part metadata

class CreateTableWrapper:

16

Page 17: Structure versioning for PyTables

def __init__(self, path, name, defobj, dsc): self.Path = ["/", path][path is not None and path != ""] self.Name = name self.DefClss = defobj self.Title = ["", str(dsc)][dsc is not None] def __call__(self, fobj): fobj.createTable(self.Path, self.Name, self.DefClss, self.Title) def __str__(self): return "table %s=>%s (%s)" % (self.Path, self.Name, self.Title)

class CreateGroupWrapper: def __init__(self, path, name, dscr): self.Path = ["/", path][path is not None and path != ""] self.Name = name self.Title = dscr def __call__(self, fobj): fobj.createGroup(self.Path, self.Name, self.Title) def __str__(self): return "group %s=>%s (%s)" % (self.Path, self.Name, self.Title)

17