Post on 25-May-2015
Pairtrees for object storage
John Kunze and Stephen Abrams, California Digital Library (CDL)
A pairtree maps ids to paths, two characters at a time
A pairtree is a filesystem hierarchy that uses an identifier string to derive an object directory (or folder) location
• The derivation takes successive pairs of characters and creates a succession of directories, called a pairpath
  ab2def3 ⇒ ab/2d/ef/3/
• A pairpath ends at a directory containing an object’s files; most systems do a variation of this (is variation needed?)
• Reverse the mapping to find all ids/objects in a pairtree; pairpath termination rules permit variable-length ids
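The basic mapping can be illustrated with a short Python sketch. It assumes the id is already filesystem-safe and does no character conversion; the function name `simple_pairpath` is hypothetical and does not come from the Pairtree specification or any existing library.

```python
def simple_pairpath(identifier: str) -> str:
    """Split an id into two-character directory names.

    A trailing odd character gets a one-character directory of its own.
    """
    pairs = [identifier[i:i + 2] for i in range(0, len(identifier), 2)]
    return "/".join(pairs) + "/"


print(simple_pairpath("ab2def3"))  # ab/2d/ef/3/
```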
Pre-converting problematic characters
Some identifier characters are inconvenient or illegal in filenames and must be hex-encoded (e.g., *→^2a)
  id: what-the-*@?#! → what-the-^2a@^3f#! ⇒ wh/at/-t/he/-^/2a/@^/3f/#!
But to keep paths short, 3 common chars are converted to 3 rare chars (at the cost of complexity): /→= :→+ .→,
  id: ark:/13030/xt12t3 → ark+=13030=xt12t3 ⇒ ar/k+/=1/30/30/=x/t1/2t/3/
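The two conversion steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the reference implementation: the names `clean_id` and `id_to_pairpath` are hypothetical, and the exact set of hex-encoded characters (`HEX_ENCODE` below, plus every byte outside visible ASCII) is an assumption based on a reading of the Pairtree specification. Only `*` and `?` are confirmed by the examples above.

```python
# Visible ASCII characters that are hex-encoded as ^hh.
# Assumption: this set follows the Pairtree spec; the poster only shows * and ?.
HEX_ENCODE = set('"*+,<=>?\\^|')

# The three common characters swapped for rare ones (from the text above).
SWAP = {"/": "=", ":": "+", ".": ","}


def clean_id(identifier: str) -> str:
    """Return the filesystem-safe form of an id."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in HEX_ENCODE:
            out.append(f"^{byte:02x}")    # step 1: hex-encode
        else:
            out.append(SWAP.get(ch, ch))  # step 2: single-character swap
    return "".join(out)


def id_to_pairpath(identifier: str) -> str:
    """Clean the id, then split it into two-character directories."""
    cleaned = clean_id(identifier)
    return "/".join(cleaned[i:i + 2] for i in range(0, len(cleaned), 2)) + "/"


print(clean_id("what-the-*@?#!"))           # what-the-^2a@^3f#!
print(id_to_pairpath("ark:/13030/xt12t3"))  # ar/k+/=1/30/30/=x/t1/2t/3/
```

Hex-encoding runs before the swap so that a literal `=`, `+`, or `,` in an id cannot be confused with a swapped `/`, `:`, or `.` when the mapping is reversed.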
The deadly embrace
• Digital repositories tend to require a surrender of storage transparency that creates unhealthy system dependency
• Internally, objects are often broken up so that they can be difficult to piece together in case of trouble
Fig. 1. Object storage should not need a fearful entanglement with software. Since objects have to be parked in a filesystem before a repository software upgrade, what if we left them in there and built our repositories around them?
Pairtree credits and details
Pairtree specification:
www.ietf.org/internet-drafts/draft-kunze-pairtree-01.txt
www.cdlib.org/inside/diglib/pairtree/pairtreespec.html
Authors from CDL and University of Michigan (UM): Martin Haye, Erik Hetzner, John Kunze, Mark Reyes, and Cory Snavely; many thanks to Stephen Abrams, Sebastien Korner, Brian Tingle, et al.
Summary
Pairtree is the thinnest smear we can add to our very well-understood filesystems and their universal tools (the universal “API”) to create a very well-understood, platform-independent object storage substrate
Pairtree is not a complete repository system, but it is complete for object storage and makes it easier to build systems and to share objects between institutions
Why pairs of characters?
Taking two chars at a time balances path depth and fanout (number of possible entries in any directory)
• Example: ab2def3 ⇒ ab/2d/ef/3/
• Each pair, letters+digits, has 36×36 possibilities
Compared to taking one char at a time
• Only 36 possibilities, but path depth grows rapidly
• Example: ab2def3 ⇒ a/b/2/d/e/f/3/
At another extreme, taking seven characters at a time
• Short paths, but 78 billion (36^7) possible items
• Example: ab2def3 ⇒ ab2def3/
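The trade-off can be checked with a few lines of arithmetic. The sketch below assumes, as in the text, a 36-character alphabet (letters + digits) and a 7-character id; the helper name `fanout_and_depth` is hypothetical.

```python
import math

ALPHABET_SIZE = 36  # letters + digits, as assumed in the text


def fanout_and_depth(id_length: int, chunk: int) -> tuple[int, int]:
    """Return (max entries per directory, number of path levels)."""
    fanout = ALPHABET_SIZE ** chunk
    depth = math.ceil(id_length / chunk)
    return fanout, depth


for chunk in (1, 2, 7):
    fanout, depth = fanout_and_depth(7, chunk)
    print(f"{chunk} char(s) per directory: fanout up to {fanout:,}, depth {depth}")
```

For a 7-character id this gives a fanout of 36 at depth 7, 1,296 at depth 4, and 78,364,164,096 at depth 1.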
For further information
Please contact jak@ucop.edu or stephen.abrams@ucop.edu
For information on CDL’s Preservation Program, see http://www.cdlib.org/programs/digital_preservation.html
Pairtree origins include
• Prototype: UCSF tobacco control documents and CDL digitized books
• Early production: digitized books for UM and Hathi Trust
Objects in a pairtree
A pairtree is especially useful if, for each contained object, all of the object’s parts, and nothing but its parts, are enclosed in the object’s directory
Import such a pairtree and, knowing nothing about the objects’ structure and semantics, you can reliably
• Enumerate all objects and their identifiers
• Produce any object by requested id
• Maintain and back it up with ordinary OS tools
• Rebuild the collection in case of database corruption
simply by walking the pairtree
To walk a pairtree requires knowing path termination rules
• A pairpath terminates when you reach a file or reach a directory name with 1 char or more than 2 chars

  ab/
  \--- cd/
       |--- foo/
       |    |   README.txt
       |    |   thumbnail.gif
       |    |
       |    |--- master_images/
       |    |    |   ...
       |    |
       |    \--- gh/
       |
       \--- e/
            \--- bar/
                 |   metadata
                 |   54321.wav
                 |   index.html
Fig. 2. Example pairtree containing two objects: abcd and abcde. The first object is enclosed in directory foo/, the second in bar/. While foo/ does not subsume e/ at the same level, by enclosure, it does subsume the gh/ underneath it.
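A walk that applies the termination rule can be sketched in Python. This is a minimal illustration, not the specification's algorithm: the name `walk_pairtree` is hypothetical, the ids it yields are still in their cleaned (filesystem-safe) form, and it assumes that a one-character directory counts as the last component of its pairpath, as Fig. 2 suggests for object abcde.

```python
import os
from typing import Iterator, Tuple


def walk_pairtree(root: str) -> Iterator[Tuple[str, str]]:
    """Yield (cleaned id, directory where its pairpath ends) per object."""

    def visit(directory: str, prefix: str) -> Iterator[Tuple[str, str]]:
        ends_here = False
        for entry in sorted(os.scandir(directory), key=lambda e: e.name):
            is_dir = entry.is_dir(follow_symlinks=False)
            if is_dir and len(entry.name) == 2:
                # A two-character directory continues the pairpath.
                yield from visit(entry.path, prefix + entry.name)
            elif is_dir and len(entry.name) == 1:
                # A one-character directory is the final pairpath component.
                yield prefix + entry.name, entry.path
            else:
                # A file or a longer directory name: the pairpath stops
                # at `directory`; we do not descend into object contents.
                ends_here = True
        if ends_here and prefix:
            yield prefix, directory

    yield from visit(root, "")
```

Run against the tree in Fig. 2, this sketch reports ids abcd and abcde and never descends into foo/, so the gh/ inside it is not mistaken for part of a pairpath.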
Sample software implementation
http://search.cpan.org/~jak/Pairtree-0.2/lib/File/Pairtree.pm
A Perl module that implements two mappings: id2ppath() maps an id to a pairpath and ppath2id() performs the inverse mapping.
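For readers who do not use Perl, the inverse mapping can be sketched in Python. This is not a port of the module above and does not reproduce its API: the name `pairpath_to_id` is hypothetical, and the sketch assumes the input is only the pairpath itself, with no object directory appended.

```python
import re

# Reverse of the three single-character swaps: = → /, + → :, , → .
UNSWAP = {"=": "/", "+": ":", ",": "."}


def pairpath_to_id(ppath: str) -> str:
    """Recover the original id from a pairpath such as 'ar/k+/=1/'."""
    cleaned = ppath.replace("/", "")
    # Undo the swaps first; hex escapes are still encoded at this point,
    # so an escaped '=', '+' or ',' cannot be mistaken for a swapped one.
    swapped = "".join(UNSWAP.get(ch, ch) for ch in cleaned)
    # Then decode each ^hh escape back to its byte value.
    raw = re.sub(r"\^([0-9a-fA-F]{2})",
                 lambda m: chr(int(m.group(1), 16)), swapped)
    # Escapes stand for UTF-8 octets, so reassemble and decode them.
    return raw.encode("latin-1").decode("utf-8")


print(pairpath_to_id("ar/k+/=1/30/30/=x/t1/2t/3/"))  # ark:/13030/xt12t3
print(pairpath_to_id("wh/at/-t/he/-^/2a/@^/3f/#!"))  # what-the-*@?#!
```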