Platypus: Design and Implementation of a Flexible High Performance Object...

27
Platypus: Design and Implementation of a Flexible High Performance Object Store. Zhen He 1 , Stephen M Blackburn 2 , Luke Kirby 1 , and John Zigman 1 1 Department of Computer Science Australian National University Canberra ACT 0200, Australia Zhen.He, Luke.Kirby, John.Zigman @cs.anu.edu.au 2 Department of Computer Science University of Massachusetts Amherst, MA, 01003, USA [email protected] Abstract. This paper reports the design and implementation of Platypus, a trans- actional object store. The twin goals of flexibility and performance dominate the design of Platypus. The design includes: support for SMP concurrency; stand- alone, client-server and client-peer distribution configurations; configurable log- ging and recovery; and object management which can accommodate garbage col- lection and clustering mechanisms. The first implementation of Platypus incorpo- rates a number of innovations. (1) A new recovery algorithm derived from ARIES that removes the need for log sequence numbers to be present in store pages. (2) A zero-copy memory-mapped buffer manager with controlled write-back behavior. (3) A data structure for highly concurrent map querying. We present performance results comparing Platypus with SSM, the storage layer of the SHORE object store. For both medium and small OO7 workloads Platypus outperforms SHORE across a wide range of benchmark operations in both ‘hot’ and ‘cold’ settings. 1 Introduction This paper describes the design and implementation of an object store with a flexible architecture, good performance characteristics, and a number of interesting implemen- tation features such as a new recovery algorithm, an efficient zero-copy buffer manager, and scalable data structures. Furthermore, this work represents a step towards our goal of both meeting the demands of database systems and efficiently supporting orthogo- nally persistent run-time systems. The key to orthogonal persistence lies in the power of abstraction over persistence, the value of which is being recognized by a wider and wider audience. The potential for orthogonal persistence in mainstream data management appears to be massive and yet largely unrealized. As appealing as orthogonal persistence is, it must be applicable to the hard-edged data management environment in which the mainstream operates before it will see uptake in that world. Traditionally, object stores for orthogonally persistent languages have not strongly targeted this domain, so strong transactional semantics and good transaction throughput have not normally been goals or features of such stores.

Transcript of Platypus: Design and Implementation of a Flexible High Performance Object...

Page 1: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

Platypus: Design and Implementation of a Flexible HighPerformance Object Store.

ZhenHe1, StephenM Blackburn2, LukeKirby1, andJohnZigman1

1 Departmentof ComputerScienceAustralianNationalUniversityCanberraACT 0200,Australia�

Zhen.He, Luke.Kirby, John.Zigman � @cs.anu.edu.au2 Departmentof ComputerScience

Universityof MassachusettsAmherst,MA, 01003,[email protected]

Abstract. Thispaperreportsthedesignandimplementationof Platypus,atrans-actionalobjectstore.Thetwin goalsof flexibility andperformancedominatethedesignof Platypus.The designincludes:supportfor SMP concurrency; stand-alone,client-server andclient-peerdistribution configurations;configurablelog-gingandrecovery;andobjectmanagementwhichcanaccommodategarbagecol-lectionandclusteringmechanisms.Thefirst implementationof Platypusincorpo-ratesanumberof innovations.(1) A new recoveryalgorithmderivedfrom ARIESthatremovestheneedfor logsequencenumbersto bepresentin storepages.(2) Azero-copy memory-mappedbuffer managerwith controlledwrite-backbehavior.(3) A datastructurefor highly concurrentmapquerying.WepresentperformanceresultscomparingPlatypuswith SSM, the storagelayer of the SHOREobjectstore.For bothmediumandsmallOO7workloadsPlatypusoutperformsSHOREacrossa widerangeof benchmarkoperationsin both‘hot’ and‘cold’ settings.

1 Introduction

This paperdescribesthe designandimplementationof an objectstorewith a flexiblearchitecture,goodperformancecharacteristics,anda numberof interestingimplemen-tationfeaturessuchasanew recoveryalgorithm,anefficientzero-copy buffer manager,andscalabledatastructures.Furthermore,this work representsa steptowardsour goalof both meetingthe demandsof databasesystemsandefficiently supportingorthogo-nally persistentrun-timesystems.

Thekey to orthogonalpersistencelies in thepowerof abstractionoverpersistence,thevalueof whichis beingrecognizedby awiderandwideraudience.Thepotentialfororthogonalpersistencein mainstreamdatamanagementappearsto bemassive andyetlargely unrealized.As appealingasorthogonalpersistenceis, it mustbe applicabletothehard-edgeddatamanagementenvironmentin which themainstreamoperatesbeforeit will seeuptake in thatworld. Traditionally, objectstoresfor orthogonallypersistentlanguageshavenotstronglytargetedthisdomain,sostrongtransactionalsemanticsandgoodtransactionthroughputhave not normally beengoalsor featuresof suchstores.

Page 2: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

On theotherhand,relational,objectrelational,andobjectorienteddatabaseshave notfocusedstronglyonthedemandsof atightly coupledlanguageclientandsotendto pro-videaparticularlyinefficientfoundationfor orthogonallypersistentsystems.Oneresultis a lackof systemsthatcaneffectively bridgethatgap.We havedevelopedPlatypusinanattemptto addressthatgap.

Thebackdropto thedesignof Platypusis the transactionalobjectcachearchitec-turewhichstronglyseparatesstorageandlanguageconcerns[Blackburn1998].Centralto thearchitectureis theconceptof bothstoreandruntimesharingdirectaccessto anobjectcache,whereaccessis moderatedby a protocolmanifestin a transactionalin-terface.This architectureprovidesa layer of abstractionover storagewhile removingimpedancebetweenthestoreandthelanguageruntimeimplementations.

TransactionalInterfaceCache

Object Store

ApplicationProgram

LanguageRun-time

(a) The transactionalobject cachear-chitecture.Thecacheis directly acces-siblesharedmemory.

server� store client�cache�app.� RTS

store client�cache�app.� RTS

store client�cache�

app.� RTS

store client�cache�

app.� RTS

(b) Underlying store architecturesaretransparentto the client runtime(RTS).

Fig. 1. The transactionalobjectcachetightly couplesthestoreandruntime/applicationthroughdirectaccessto a cache,while abstractingover thestorearchitecture.

A focusonabstractionis reflectedin thedesignof Platypus,whichhastheflexibilityto supporta rangeof storagelayer topologies(suchasstand-alone,client-server, andclient-peer).The designalso includesreplaceablemodularcomponents,suchas theobjectmanager, which is responsiblefor providing a mappingbetweentheunderlyingbytestore1 andtheobjectstore projectedto theclient. This flexibility is critical to theutility of Platypusasit providesasinglesolid framework for prototypingandanalyzingmostaspectsof storedesign.

The focuson minimizing store/runtimeimpedanceimpactsstronglyon the imple-mentationof Platypusandled directly to thedevelopmentof a zero-copy buffer cacheimplementation.The zero-copy buffer cachepresentsa numberof technicalhurdles,notablywith respectto therecoveryalgorithmandcontrollingpagewrite-back.1 Weusethetermbytestore to describeuntyped,unstructureddata.

Page 3: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

Theconstructionof aflexible, robust,highperformancetransactionalstoreis ama-jor undertaking,andin our experienceis onefilled with problemsandpuzzles.In thispaperwetouchon justa few aspectsof theconstructionof Platypus.Weaddressthear-chitectureof thesystembecausewe believe thatit embodiesdecisionsthatwerekey todeliveringbothflexibility andperformance.We alsoaddresssomemajorperformanceissuesthat arosein the courseof implementation,aswe believe that the solutionswehave developedarenovel andarelikely to find applicationin othersystems.For otherdetailsof theconstructionof Platypuswhich arenot novel, suchastheintricatework-ingsof theARIES[Mohan1999;Mohanetal. 1992]recoveryalgorithmandthechoiceof buffer cacheeviction policies,the readeris referredto the substantialliteratureondatabaseandstoredesignandimplementation.

In section2 Platypusis relatedto existingwork. Section3 describesthearchitectureof Platypus.Sections4 and5 focuson implementationissues,first with respectto therealizationof an efficient zero-copy buffer manager, andthentechniquesthat greatlyenhancethescalabilityof Platypus.Section6 presentsandanalyzesperformanceresultsfor PlatypusandSHOREusing both the OO7 benchmark[Carey et al. 1993] and asimplesyntheticworkload.Section7 concludes.

2 Related Work

This paperis written in thecontext of a largenumberof objectstoredesignandimple-mentationpapersin thePOSworkshopseriesandavastnumberof databaseandOODBdesignandimplementationpapersin thebroaderdatabaseliterature.

Architecture A numberof papershavereportedon thedesignof objectstorestargetedspecificallyat orthogonallypersistentprogramminglanguages.Brown andMorrison[1992] describea layeredarchitecturethat wasusedasthe basisof a numberof per-sistentprogramminglanguages,includingNapierandGalileo.Centralto thearchitec-tureis anabstractstableheap(alsoa featureof theTycoonarchitecture[Mattheset al.1996]).Thismodelofferedacleanabstractionoverpersistencein thestorearchitecture,and the architecturesweresufficiently generalto admit a wide rangeof concurrencycontrolpolicies.However thestableheapinterfaceis fine grainedbothtemporallyandspatially(readword, write word, etc.)andsoincurssubstantiallymoretraversalsof theinterfacethana transactionalobjectcache,which is coarsegrainedspatially(perobjectc.f. perword)andtemporally(pertransactionc.f. peraccess).

Of themany storearchitecturesin theliteraturethatdo not directly targetorthogo-nal persistence,mostaddressspecificoptimizationsandimprovementswith respecttoOODBdesign.TheTHORarchitecturefrom MIT [Liskov etal.1999]hasbeenthecon-text for OODBresearchin theareasof buffer management,recoveryalgorithms,cachecoherency andgarbagecollection.Otherprojectshave supporteda similar breadthofwork in the context of an OODB architecture.The systemthat comesclosestto thearchitecturalgoalsof Platypusis SHORE[Carey et al. 1994],a successorto EXODUSand E [Carey et al. 1988], which set out to be flexible (with respectto distribution,for example),focussedon performance,andsupporteda persistentprogramminglan-

Page 4: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

guage.SHOREis publicly availableandhasbeenusedasthe point of comparisoninourperformanceanalysisof Platypus(section6).

By contrastto the systemsbuilt on the stableheap[Brown and Morrison 1992;Mattheset al. 1996], Platypusis built aroundthe transactionalobject cacheabstrac-tion [Blackburn andStanton1998], which allows direct accessto a cachesharedbystorageandlanguageruntimelayers(althoughit limits thelanguage-sideclientto trans-actional concurrency control). Platypusis thuscloserto orthodoxOODBsthansuchsystems.Onenotabledistinctionis thatwhile many OODBMs focuson scalabilityofclient-server (andin thecaseof SHORE,client-peer)architecture,Platypus’sapproachto scalability extendstowardsSMP nodes—eitherstand-alone,or in client-server orclient-peerconfigurations.

Buffer Cache Implementation Most recentwork on buffer managementhasfocusedon efficientuseof buffersin a client-servercontext. KemperandKossmann[1994] ex-ploredadaptive useof both objectandpagegrain buffering—objectbuffering incursa copy cost,but may be substantiallymorespace-efficient whenobjectsarenot wellclustered.TheTHOR project[Liskov et al. 1999]hasincludeda lot of work on buffermanagementin client-serverdatabases,mostlyfocusingonefficiently managingclient,server, andpeermemoryin the faceof objectupdatesat the clients.Platypus’s buffercacheis in most respectsorthodox (implementinga STEAL/NO-FORCE[Franklin1997] buffer managementpolicy), and is amenableto suchtechniquesfor managingdatacachedvia anetwork. Thenovelty in our buffer managerlies in its useof memorymappingto addressefficiency with respectto accessof dataresidenton disk. We ad-dresstheproblemof controllingwrite-backof mappedpageswhich havebecomedirty(any suchwrite-backmustbestrictly controlledto ensurerecoverability).We alsoout-line anextensionto theARIES[Mohanet al. 1992]recoveryalgorithmwhichobviatestheneedfor log sequencenumbers(LSNs)to beincludedon everypage.

Scalable Data Structures Thereexistsanenormousliteratureon datastructuresandscalability. Knuth [1997]providesanexcellentdiscussionof sortingandsearchingthatprovedto bevery helpful in our implementationefforts.Kleiman,Smalders,andShah[1995] give an excellenttreatmentof issuesrelatingto threadsandconcurrency in anSMPcontext, describingnumeroustechniquesfor maximizingscalabilitywhich wererelevant to our storeimplementation.Oneof thescalabilityproblemswe facedwasintheefficiency of mapstructures.After exploring splaytrees[SleatorandTarjan1985],andhashes(includingdynamichashes[Larson1988])andtheir variouslimitations,wedevelopedanew datastructurewhichis asimplehybrid,thehash-splay. Thehash-splaydatastructureis explainedin full in section5.1.

3 Architecture

At thecoreof thePlatypusarchitecturearethreemajorseparableabstractions:visibil-ity, whichgovernstheinter-transactionalvisibility of intra-transactionalchange,stabil-ity, which is concernedwith coherentstability anddurability of data,andavailability

Page 5: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

which controlstheavailability of data(i.e., its presenceor absencein a user-accessiblecache)[Blackburn 1998].Eachof thesecorrespondsto oneor moremajor functionalunits (called‘managers’)in the Platypusarchitecture.On top of thesethreeconcernsthereexistsanobjectmanager, whichprojectsanobjectstorefrom anunderlying‘bytestore’,anda transactionmanager, whichorchestratesvisibility, stabilityandavailabilityconcernsonaccountof userrequestsacrossthetransactionalinterface.Thisarchitectureis illustratedschematicallyin figure2.

stability visibility

Transaction Manager

Object Manager

Availability Manageravailability

transactional interface

Global

Vis. MLocal

Vis. MLow LStab.M

Stab.MHigh L

storelog

high level log

network

(byte store)

(object store)

memory

runtime/application

direct memory access

Fig. 2. Platypusarchitectureschematic.Managers(right) overseethestructureandmovementofdatabetweenmemory, a store,a log, andremotestores(left). Throughthetransactionalinterfacetheclient runtimegainsdirectaccessto thememoryit shareswith thestore.

3.1 Visibility

Isolation is a fundamentalpropertyof transactions[HarderandReuter1983]. Whilechangesmadeby anuncommittedtransactionmustbeisolatedfrom othertransactionsto ensureserializability, serializabilityalsodemandsthatthosechangesbefully exposedoncethetransactionis successfullycommitted.Visibility is concernedwith controllingthatexposurein thefaceof particulartransactionalsemantics.Visibility becomessome-whatmorecomplex in thefaceof distribution,wherea rangeof differentstrategiescanbeemployedin aneffort to minimizetheimpactof latency[Franklin1996].Sincethesestrategies areconcernedsolely with efficiently maintainingcoherentvisibility in thefaceof distribution-inducedlatency, they arenot relevant to purely local visibility is-sues.We thuscarefully separatevisibility into two distinct architecturalunits: a localvisibility manager (LVM), andaglobal visibility manager (GVM).

The LVM is concernedwith the mediationof transactionalvisibility locally (i.e.,within asingleuniformly addressablememoryspace,or ‘node’). In theabsenceof net-work latency, optimismserves little purpose,so the LVM canbe implementedusing

Page 6: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

anorthodoxlock manager(asfoundin mostcentralizeddatabasesystems).2 TheLVMoperateswith respectto transactionsandabstractvisibility entities, which thetransac-tion managerwould typically mapdirectly to objectsbut whichcouldmapto any otherentity over which visibility control were required.Throughthe transactionmanager,transactionsacquirelocksfrom theLVM, which ensuresthatappropriatetransactionalvisibility semanticsarepreservedanddetectsandreactsto deadlockswhennecessary.

TheGVM managestransactionalvisibility betweentheLVMs at distributednodes(i.e., nodeswhich do not sharea uniformly addressablememory).TheGVM may im-plementany of the considerablerangeof transactionalcachecoherency algorithms[Franklin 1996]. The GVM operatesbetweendistributed LVMs and works with re-spectto transactions, visibility entities, andsetsof visibility entities. By introducingsetsof visibility entities, interactionscanoccurat a coarsergrain.Justasvisibility en-tities may be mappedto objects,setsof visibility entitiesmay be mappedto pages,allowing global visibility to operateat eitherpageor objectgrain (or both [Vorugantietal. 1999]).TheGVM allowsaccessrightsovervisibility entities(objects)andsetsofvisibility entities(pages)to becachedatnodes.Oncerightsarecachedatanode,intra-node,inter-transactionvisibility is managedby theLVM at thatnode.TheGVM existsonly to supporttransactionalvisibility betweendistributedLVMs, sois only necessaryin thecasewheresupportfor distribution is required.

3.2 Stability

Durability is anotherfundamentalpropertyof transactions.HarderandReuter[1983]definedurability in termsof irrevocablestability, that is, the effectsof a transactioncannotbe undoneoncemadedurable.An ACID transactioncanonly beundonelogi-cally, throughtheissueof anothertransaction(or transactions)which countersthefirsttransaction.Thereare other importantaspectsto managingstability, including crashrecovery, faulting of datainto memoryand the writing of both committed(durable)anduncommitteddatabackto disk.Giventhepervasivenessandgeneralityof thewriteaheadlogging(WAL) approachto recovery[Mohan1999],wemodelstability in termsof a storeanda stablelog of events(changesto thestore)3. This modelallowsuncom-mittedstoredatato besafelystabilized,asthelog alwaysmaintainssufficient informa-tion to bringthestoredatato astatethatcoherentlyreflectsall committedchangesto thestore.Stability functionalityis dividedinto two abstractlayersmanagedrespectively bythehigh level stabilitymanager (HSM) andthe low level stabilitymanager (LSM).

The HSM is responsiblefor maintainingandensuringthe recoverability of ‘highlevel’ logs which stably reflect changesto the storeinducedby useractions(as op-posedto actionsinternal to Platypus’s management).For examplea user’s updateto2 Optimismis a computationaldevice for hiding latenciesassociatedwith thein-availability of

informationon which a decisionmustbe made(typically dueto physicaldislocationor de-pendencieson incompleteconcurrentcomputations).In a context wherethe information isavailable—suchis thecasewith sharedmemoryconcurrency control—optimismis unneces-saryandin factdeleterious(thereis no point in computing‘optimistically’ wheninformationindicatingthefutility of thatcomputationis immediatelyavailable).

3 Therewereotherimportantreasonsfor choosingWAL, not leastof theseis thatunlike shad-owing, WAL admitsthepossibilityof a zero-copy buffer cache(seesection4.1).

Page 7: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

an object,O, is recordedby the HSM in relatively abstractterms:‘transactionT mu-tatedobjectO thus:O��� O � ∆’. The HSM’s log of changesis robust to lower leveleventssuchasthemovementof anobjectby a garbagecollectoroperatingwithin thestorebecausetheHSM log doesnot recordthephysicallocationof byteschanged,butratherrecordsdeltasto abstractlyaddressedobjects.Anotherrole of the HSM is thatit manages‘before-images’of objectswhich the userhasstatedan intention to up-date.Beforeimagesserve two importantroles.They arenecessaryfor the calculationof changedeltas(∆ � O� O), andthey allow updatesto the storeto be undone.Thelatter is madenecessaryby our desirefor a zerocopy objectcacheandthepossibilityof implementinga STEAL policy in the buffer cache(i.e., that uncommittedmutatedpagesmight bewritten to thestore[Franklin 1997]).TheHSM ensuresthatdeltasaremadestableat transactioncommit time, andthatbefore-imagesaremadestablepriorto any pagebeing‘stolen’ from thebuffer cache.

TheLSM hastwo roles.It maintainsandensuresthe recoverability of ‘low level’logs which stably reflect changesto storemeta-dataandstorestructure(suchas themovementof an object from one disk pageto anotheror the updateto someuser-transparentindex structure).The LSM alsoencapsulatesHSM logs. Recovery is pri-marily the responsibilityof the LSM. Whenthe LSM finds a log recordcontainingaHSM log record,it passesresponsibilityfor that updateto the HSM. This separationhastwo majorbenefits.First, it allowsany low-level activitiessuchaschangesto objectallocationmeta-datato be completelyseparatedfrom the high level object storebe-havior, which allows modular, independentimplementation.Second,it opensthewayfor a two-level transactionsystemwhich canreduceor remove concurrency conflictsbetweenlow-level activities and high level activities (this is essentialto minimizingconflictsbetweenconcurrentusertransactionsandgarbagecollection).For exampleahigh-level transactionmaybeupdatinganobjectwhile it is beingmovedby a low-leveltransaction.Thetwo-level systemallowsbothtransactionsto proceedconcurrently.

3.3 Availability

Theavailability manager (AM) hasa numberof roles,all of which relateto thetaskofmakingdataavailable(to theuserandthestoreinfrastructure).Thecenterpieceof theAM is thusthe buffer cache—the(bounded)setof memorypagesin which the storeoperates.For eachnodethereexists oneAM.4 The AM managesthreecategoriesofpages:store pageswhich aremappedfrom secondarystorage(throughthe LSM) orfrom a remotenode(throughtheGVM), log pageswhich aremappedfrom secondarystorage(throughthe LSM) andshared meta-datawhich is commonto all processesexecutingat thatnode.Whenever thereis demandfor a new page,a frameis allocatedin the buffer cache(possiblycausingthe eviction of a residentpage).In the caseof astorepage,theLSM or GVM thenfaultsthepageinto thenew frame.Evictedpagesareeitherswappedto their correspondingstorepage(apageSTEAL), or to a specialswapareaon disk (for non-storepages).5

4 TheAM is managedcooperatively by all storeprocessesexecutingconcurrentlyon thenode.5 This is simplyachievedby ‘mmap’ing non-storepagesto a swapfile.

Page 8: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

The AM operatesat two grains:available regions, which are typically mappedto objects,andpages, which is the operatingsystem’s unit of memorymanagement.Throughthetransactionmanager, theAM keepstrackof which memoryregionsareinuse(pinned)by theuser, andwhich regionsaredirty, informationwhich is usedwhenmakingdecisionsaboutwhich pagesto evict.

3.4 Objects

An objectmanager (OM) is responsiblefor establishing(andmaintaining)a mappingbetweenanunderlying‘byte store’(unstructureddata)andtheobjectstoreprojectedtotheuser. Thenatureof thismappingandthesophisticationof theOM is implementationdependentandmayvary enormously, from a systemthattreatsobjectsasuntypedbytearraysandsimply mapsobject identifiersto physicaladdresses,to a systemthat hasa strongnotion of object type andincludesgarbagecollectionandperhapsa level ofindirectionin its objectaddressing.Thestrengthof this designis thatwhile theobjectmanageris arelatively smallcomponentof thestorein implementationterms,it largelydefinesthestorefrom theuserperspective.Thusmakingthisamodular, separablecom-ponentgreatlyextendstheutility of theremainderof thestore.

3.5 Transactions

The transactionmanager is a relatively simplearchitecturalcomponentthat sits be-tweenthetransactionalinterfaceandtheremainderof thearchitecture.It executesusertransactionsby takinguserrequests(suchasreadIntention, writeIntention, new, com-mit, andabort), andorchestratingtheappropriatetransactionalsemanticsthroughtheOM (creatingnew objects,lookingupOIDs),AM (ensuringthattheobjectis in cache),LVM (ensuringthat transactionalvisibility semanticsareobserved),andHSM (main-tainingthepersistentstateof theobject).

Having describedthePlatypusarchitecture,we now discussaspectsof its implementa-tion.

4 Implementing a Zero-Copy Buffer Cache

At the heartof the transactionalobjectcachearchitectureis the conceptof both storeandruntimesharingdirectaccessto an objectcache,whereaccessis moderatedby aprotocolmanifestin a transactionalinterface(seefigure 1). This contrastswith inter-facesthatrequirethatobjectsbecopiedfrom thestoreto theruntimeandbackagain.Inourrecentexperiencewith acommercialdatabaseproductandtheSHORE[Carey etal.1994]objectdatabase,neitherwouldallow direct(‘in-place’) updatesto objects.6 Sucharequirementis understandablein thefaceof normaluseraccessto anobjectdatabase.If direct,unprotected,accessis givento acacheddatabasepagethegravepossibilityex-istsof theuserbreakingtheprotocolwith anupdatethat(inadvertently)occurrsoutside6 We wereableto circumvent this limitation in SHORE—at theexpenseof recoverability (this

is describedin section6).

Page 9: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

anobject’s boundsandoverwritescritical metadataor anobjectfor which thetransac-tion hadnot declareda write intention.Furthermore,theefficient implementationof adirectaccessinterfacepresentssomechallenges.

In pursuitof performance,Platypusis implementedwith a zero-copy buffer cache.By zero-copy we meanthat disk pagesare memorymapped(avoiding copying thatoccursin stdio routines),andthat thestoreclient (languageruntime)is givendirectaccessto the mappedmemory, client updatesbeingwritten directly into the mappedfile. Of coursea particularlanguageruntimemay well chooseto copy objectsbeforeupdatingthem(thishappensin ourorthogonallypersistentJava implementation,whereeachobjectmustbecopiedinto theJavaheapbeforeit canbeused).However, wehavenonethelessremovedonelevel of copying from thestore’scritical path.

Unfortunatelythe implementationof a zero-copy buffer cachein the context of astandardUnix operatingsystempresentstwo significanthurdles:controllingthemove-mentof databetweenmemoryanddisk,anddealingwith log sequencenumbers(LSNs)in multi-pageobjects.

4.1 Controlled Write-Back in a Memory Mapped Buffer under POSIX/UNIX

While themmap systemcall providesameansof accessingthefilesystemwithoutcopy-ing, mmap doesnot allow re-mapping(i.e. oncedatais mappedfrom disk addressd tovirtual addressv it canonly bewritten to someotherdisk addressd � via a copy). Fur-thermore,while msync canbeusedto forcea mappedpageto bewritten backto disk,the only meansof stoppingdatafrom beingwritten backin a conventionaluser-levelUnix processis throughmlock, which is a privilegedcommandthat bindsphysicalmemoryto aparticularmapping.Thispresentsaserioushurdle.Withoutusingmlock,any direct updatesto an mmappedfile could at any time be written backto disk, de-stroying thecoherency of thediskcopy of thepage.

We seePlatypus’s primary role as the back-endto a persistentlanguageruntimesystem,andthereforedo not think that it would be satisfactoryfor storeprocessestorequireroot privileges.A standardsolutionin Unix to problemsof this kind is to usea privilegeddaemonprocessto performprivilegedfunctionsasnecessary. Using thisapproach,assoonasa non-residentpageis faultedin, anmlock daemoncouldbetrig-geredwhichwould lock thepagedown, preventingany write-backof thepageuntil thelock wasreleased.7 Controlwould thenrevert to thestoreprocess.

The useof a daemonprocesscomesat the costof two context switches(contextmustmovefrom thestoreprocessto thedaemonandback).Givenour focuson perfor-mance,this couldpresenta significantproblem.Fortunately, thecontext switchcanbecompletelyoverlappedwith thelatency associatedwith faultingin thepagewhich is tobelocked.Thusmlock semanticsareattainedwithout requiringstoreprocessesto beprivilegedandwithoutanappreciableperformancepenalty. Theapproachis sketchedinfigure3. Whenastoreprocessrequestsapageto befaultedinto thestore,it first signalsa conditionvariablethatthemlock daemonis waitingon, thenit triggerstheI/O. TheI/O causesthestoreprocessto blockandthusgiving themlock deamon(which is nownolongerblockedonaconditionvariable)achanceto beswitchedin. A lock-protected

7 Throughtheuseof mlock andmprotect, residency is underthecontrolof Platypus.

Page 10: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

placerequestin mlock circularbuffer blockedonconditionvariablecall mmap blockedonconditionvariable

signalmlockdaemon conditionvariableunblockedtriggerI/O – control � inspectcircular buffer

blockedon I/O for eachnon-emptyentryin circularbufferblockedon I/O call mlock & setdoneflagI/O completes � control– finished [time sliceconsumed]

checkmlockdoneflag blockedonconditionvariable [sleep]

[block onconditionvariable – control � continueprocessingcircular buffer]store process mlock daemon

Fig. 3. OverlappingI/O with context switchesto a daemonto efficiently usemlock in a non-privileged storeprocess.Accessto the mlock circular buffer is donein a mutually exclusivemanner. Eventsin brackets([] ) depictscenariowheredaemondoesnotcompleterequest.

circular buffer is usedto dealwith the possibility of multiple storeprocessessimul-taneouslymakingrequeststo the mlock daemon,anda flag is usedto allow the storeprocessto ensurethatthemlock did in facttakeplacewhile it wasblockedontheI/O.If thedaemonhasnot run andthemlock hasthusnot occurred,thestoreclient blocksonaconditionvariablewhichis signalledby thedaemonwhenit completesthemlock.Themlock daemonis centralto thesuccessfulimplementationof thezero-copy buffercachein Platypus.

4.2 Avoiding LSNs in a Write-Ahead Logging Recovery Algorithm

Write-aheadlogging (WAL) algorithms(suchasARIES [Mohan 1999;Mohanet al.1992]) revolve aroundtwo key persistententities,a store anda log. Thelog is usedtorecordchanges(deltas)to the store.This allows I/O to be minimizedby only writingstorepagesbackto disk opportunisticallyor whenno otherchoiceexists.8 Abstractly,theWAL protocoldependson somemeansof providing linkagebetweenthestoreandthe log at recovery time. Suchlinkage provides the meansof establishingwhich ofthechangesrecordedin the log have beenpropagatedto thestoreandsoareessentialto the recovery process.Concretely, this linkageis providedby log sequencenumbers(LSNs), which areessentiallypointersinto the log [Mohan et al. 1992]. The naturalway to implementLSNs is to includethemin storepages—astorepagecanthenbereaduponrecoveryandits statewith respectto thelog quickly establishedby checkingtheLSN field of thepage.

LSNshave thelessdesirableside-effectof punctuatingthepersistentstoragespacewith low-level recovery-relatedinformation.Thisperturbationis particularlynoticeablein the context of objectsthatarelarger thana pagebecausetheobjectwill have to bebrokenup to makeway for anLSN oneachof thepagesthatit spans(acoherentobject8 Notethat it is becauseWAL maintainsonly onestorepagecopy (unlike shadow-paging)that

a zero-copy buffer cacheis possible.Shadow-pagingessentiallyrequiresa remappingof thememory storerelationshipeachtimeapageis committed,whichprecludeszero-copy useofmmap asdescribedabove.

Page 11: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

imagewill thusonly beconstructibleby copying eachof thepartsinto anew contiguousspace).More generally, this useof LSNsprecludeszero-copy accessto objectsof anysizethatspanpages.Motivatedby our goalof a zero-copy objectstoreandthe possi-bility of anobjectspacenot interruptedby low-level recoverydata,wenow describeanalternative approachthatavoids theneedfor explicit LSNs to bestoredin storepageswithout loosingthesemanticsthatLSNsbring to widely usedWAL algorithmssuchasARIES.

Our approachfollows an interestingremotefile synchronizationalgorithm,rsync[Tridgell 1999;Tridgell andMackerras1998],whichmakescleveruseof rolling check-sumsto minimizetheamountof datarequiredto synchronizetwo files.Checksumsareusedby rsyncto identify which partsof thefiles differ (so thatminimal deltascanbetransmitted),andtherolling propertyof rsync’schecksumalgorithmallowschecksumsto becheaplyandincrementallycalculatedfor a largenumberof offsetswithin oneofthetwo filesbeingcompared.

At recovery time in WAL systemsthe log may containmany updatesto a givenstorepage.Thusthereexistsa many-to-onerelationshipbetweenlog recordsandstorepages.A critical partof therecoveryprocessis reducingthisto aone-to-onerelationshiplinking eachstorepagewith thelog recordcorrespondingto thelastupdateto thatpagebeforeit wasflushed.An LSN storedin thefirst wordof thestorepageprovidesatrivialsolutionto theproblem—theLSN is a pointerto the relevantrecord.By associatingachecksumwith eachlog record,theproblemis resolvedby selectingthelog recordcon-taininga checksumwhich matchesthestateof thestorepage.This amountsto a minorchangeto therecoveryprocessanddependsononly asmallchangeto thegenerationoflog recordsduringnormalprocessing.Eachtime a log recordis produced,the check-sumsof all updatedpagesarecalculatedandincludedin thelog record.Thustheuseofchecksumsin placeof LSNsis in principleverysimple,yet it avoidsthefragmentationeffect that LSNs have on large objects(onesthat spanmultiple pages).However thissimplechangeto theWAL algorithmis mademorecomplex by two issues:thecostofchecksumcalculation,andtheproblemthatchecksumsmaynot uniquelyidentify pagestate.We now addresstheseconcernsin turn.

Cheap Checksum Calculation The 32-bit Adler rolling checksum[Deutsch1996]efficiently maintainsthe checksum(or ‘signature’) for a window of k bytesas it isslid acrossa region of memoryonebyteat a time. Thealgorithmcalculateseachnewchecksumasa simplefunctionof thechecksumfor the lastwindow position,thebytethatjustenteredthewindow, andthebytethatjust left thewindow. It is bothcheapandincremental.

In an appendixto this paper, we generalizethe Adler checksumto partial check-sums(which indicatethecontribution to thetotal checksummadeby somesub-partofthechecksummedregion).Wethenshow thatin additionto its propertyof incremental-ity, theAdler checksumhasamoregeneralassociativepropertythatallowschecksumsto be calculatedusingdelta checksums. A deltachecksumis simply a partial check-sumcalculatedwith respectto the byte-wisedifferencebetweenold andnew values.The new checksumfor a modifiedregion is equalto its old checksumplus the deltachecksumof thebytesthatweremodified.

Page 12: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

Thusratherthanrecalculatethechecksumfor a wholepage,it is only necessarytoestablishthe deltachecksumsfor any updatedobjectsandaddthemto the page’s oldchecksum.Furthermore,thecalculationof thechecksumis very cheap(involving onlyanaddanda subtractfor eachbyte),andcanbefoldedinto thecalculationof thedeltawhich mustbegeneratedfrom thebeforeandafter imagesof thechangedobjectpriorto beingwritten to the log at commit time. A derivationof the incrementalchecksumandapseudo-codedescriptionareincludedanappendixto thispaper.

Dealing With Checksum Collisions It is essentialthat at recovery time the stateofa storepagecanbe unambiguouslyrelatedto exactly oneandonly one(of possiblymany) log recordsdescribingchangesto thatpage.Checksumcollisionscanarisefromtwo sources.First,althoughany two consecutivechangesmust,by definition,leavethepagein differentstates,a subsequentchangecould returnthe pageto a previousstate(i.e.A � B � A). Secondly, two storepagestatescanproducethesamechecksum.The32-bit Adler checksumusedin Platypushasan effectivebit strengthof 27.1 [Tayloretal. 1998],whichmeansthattheprobabilityof two differingregionssharingthesamechecksumis approximately1 � 227. Soa checksumaloneis inadequate.

To thisendweintroducepageflushlog records(PFLR)to thebasicWAL algorithm.Positive identificationof a page’s log statecanbesimplified if for eachtime a pageisflushedto disk a PFLRis written to the log—eachflushof a storepageis followedbya log recordrecordingthatfact.Theproblemthenreducesto oneof determiningwhichof two statesthestorepageis in. EitherboththeflushedpageandtheassociatedPFLRwhichfollows it areondiskor only theflushedpagemadeit to diskbeforea failureoc-curred.If thePFLRcontainsthepage’schecksum,determiningwhich statethesystemis in is trivial unlessthereis a checksumcollision.However, a checksumcollision canbetrivially identifiedat thetime thata PFLRis createdby rememberingthechecksumusedin the lastPFLRfor thepagein questionandidentifying any occurrenceof con-secutivePFLRswhichcontainthesamepagechecksum.As longasconsecutivePFLRsfor agivenpageholduniquechecksums,thestateof astorepagecanbeunambiguouslyresolved.

If the pagestateis the same(i.e. the stateof the pagein consecutive flushesisunchanged),then the pageneednot actually be flushedand a flag can be set in thesecondPFLRindicatingthat thepagehasbeenlogically flushedalthoughits statehasnot changedsincethe last flush. When collisions do occur (i.e. samechecksumbutdifferentstorestate),they canbedisambiguatedby appendingthesecond(conflicting)PFLR with the full stateof all regionsof the pagethat have beenmodifiedsincethelastpageflush.A conflict betweenthetwo PFLRsis thensimply resolvedby checkingwhetherthestorepagestatematcheswith thestateappendedto thesecondPFLR.Whilein the worst casethis could involve including the entirepagestatein the log record(if the entirepagewerechanged),checksumcollisionsoccurso infrequentlythatanyimpacton overall I/O performancewill be extremelysmall (on averageonly 1 in 227

pageflushescouldbeexpectedto generateany suchoverhead,andtheWAL protocolalreadyminimizestheoccuranceof any pageflushesby usingthelog ratherthanpageflushesto recordeachcommit).

Page 13: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

Thecommon-caseI/O effect of usingchecksumsinsteadof in-pageLSNsis negli-gableasthetotalnumberof byteswritten to disk in bothcasesis similar. In thecaseofthechecksum,storepageutilization improvesslightly becausethereis no spacelost toLSNs,aneffect thatis compensatedfor by includingthechecksumin eachlog record.9

ThestorepageandPFLRflushescanbeasynchronoussolong asthePFLRis flushedfrom thelog beforethenext flushof thesamestorepage.This is guaranteed,however,becauseasubsequentpageflushcanonly ariseasaresultof anupdateto thatpage,andundertheWAL protocolany suchupdatemustbeloggedbeforethepageis flushed,andtheloggingof suchanupdatewill forceall outstandingPFLRsin thelog to bewrittenout.However, on therareoccasionwhencollisionsoccur, thePFLRmustbewritten tothelog andthelog flushedprior to flushingthestorepage.

Althoughthechecksumapproachrequiresthatchecksumsberecalculateduponre-covery, thetimetakento performanAdler checksumof apageis dominatedby memorybandwidth,andsowill besmall in comparisonto theI/O costassociatedwith bringingthe pageinto memory. The checksumtechniquewill thereforenot significantlyeffectperformanceon therareoccasionthatrecoverydoeshave to beperformed.

PagechecksumsandPFLRscanthusbeusedto replaceintrusiveLSNsfor thepur-posesof synchronizingstoreandlog stateatrecoverytime. However, LSNsmaybeusedfor otherpurposesduringnormalexecutionsuchasreducinglockingandlatching[Mo-han1990].The useof conventionalLSNs during normal executionis not effectedbytheuseof PFLRsbecauseLSNsmaystill beassociatedwith eachpagewhile thepageresidesin memory(in datastructureswhichholdothervolatilemetadatamaintainedforeachpage).ThusalgorithmslikeCommit LSN [Mohan1990]remaintrivially applica-blewhenchecksumsandPFLRsareusedinsteadof in-pageLSNs.

4.3 Checksum-Based Integrity Checks

A sideeffect of maintainingchecksumsfor eachpageis thatpageintegrity checkscanbe performedif desired.Suchcheckscanbe usedfor debuggingpurposesor simplyto increaseconfidencein thestoreclient’s adherenceto the transactionalobjectcacheprotocol.The first of thesegoalsis met by a function that checkspageson demandandensuresthatonly areasfor which theclienthadstatedawriteIntention weremodi-fied.Someonedebuggingastoreclient couldthenusethis featureto identify any pageswhichhadinadvertentlybeencorrupted.Thesecondgoalis metby checkingeachpageprior to flushingit to disk,checkingthatonly thoseregionsof thepagewith a declaredwriteIntention hadbeenupdated.Theassociativepropertyof theAdler checksum(seeaboveandtheappendixto this paper),makestheefficient implementationof suchpar-tial checksumsefficient andstraightforward.Becauseof the checksum’s probabilisticbehavior, suchcheckscanonly beusedto checkintegrity to a veryhigh level of confi-dence, not asabsolutechecksof integrity.

9 LSNsarenot normallyexplicitly includedin log recordsbecausethey areencodedasthe logoffsetof thelog record[Mohanet al. 1992].

Page 14: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

5 Maximizing Scalability

Performanceis a key motivator for Platypus,so it follows that scalability is a majorimplementationconcern.After briefly outliningtheapproachwehavetakento scalabil-ity, we describein somedetail two techniquesthatweresignificantin deliveringgoodperformanceresultsfor Platypus.

At the highestlevel, the storearchitecturemustbe able to exploit scalablehard-ware.For this reasonthePlatypusarchitecturesupportsdifferentdistributedhardwaretopologiesincludingclientserverandclientpeer(figure2 andsection3.3).Thesecondof thesehasbeendemonstratedto scaleextremelywell on large scalemulticomput-ers[Blackburn 1998].In additionto distributedmemorymachines,thePlatypusarchi-tectureis alsotargetedat SMP platforms,wherememorycoherency is not a problembut resourcecontentioncanbea majorissue.Throughtheuseof daemonprocessesallstoreprocesseson a nodecansharethe commondataandmeta-datathrougha buffercachethatismmapedto thesameaddressin eachprocessandcooperativelymaintained.

At a lower level, we foundtwo primarysourcesof in-scalabilitywhenbuilding andtestingPlatypus,datastructuresthatwouldnotscalewell, andresourcecontentionbot-tlenecks. Wesuccessfullyaddressedall majorresourcecontentionbottlenecksby usingtechniquessuchaspoolsof locks,which provide a way of balancingtheconsiderablespaceoverheadof a POSIX lock with the major contentionproblemsthat canariseiflocksareappliedto datastructuresat too coarsea grain[Kleimanet al. 1995].We nowdescribethehash-splay, aspecificsolutionto akey datastructurescalabilityproblem.

5.1 Hash-Splay: A Scalable Map Data Structure

Map datastructuresare very importantto databasesand object stores.Platypusim-plementsa numberof largemapstructures,mostnotablytheLVM hasa maprelatingobject identifiers(OIDs) to per-objectmeta-datasuchaslocks, updatestatus,andlo-cation.Prior to the developmentof Platypus,we implementeda PSI-compliantobjectstorebasedon the SHORE[Carey et al. 1994] object database,and found that maplookupswereabottleneckfor SHORE.Consequentlywespenttimeanalyzingthescal-ability characteristicsof commonmapstructures,mostnotablyhashtablesandsplaytrees[SleatorandTarjan1985],eachof which have someexcellentproperties.Knuth[1997]givesadetailedreview of bothhashesandbinarysearchtrees.

Primaryfeaturesof the memoryresidentmapsusedby Platypusandotherobjectstoresarethataccesstime is atapremiumand they havekey setsof unknown cardinal-ity. Thesamemapstructuremayneedto efficiently mapno morethana dozenuniquekeys in thecaseof aquerywith a runningtime in theorderof a few hundredmicrosec-onds,andyet may needto mapmillions of keys for a large transactionthat will takemany seconds.The challengethereforeis to identify a datastructurethat cansatisfyboththevariability in cardinalityandthedemandfor fastaccess.Anothercharacteristicof thesemapsis thatthey will oftenexhibit adegreeof locality in their accesspatterns,with a ‘workingset’of w keyswhich areaccessedmorefrequently(suchaworking setcorrespondsto thelocality propertiesinherentin theobjectstoreclient’s objectaccesspatterns).

Page 15: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

While binary searchtreesare ideal for mappingkey setsof unknown cardinality,theiraccesstimeis O � logn� for n keys.Ontheotherhand,hashtablescandeliveraccesstime performanceof O � 1� , but arenot well suitedto problemswherethecardinalityofthekey setis unknown.

Ourinitial experimentswerewith splaytrees,whichareself-adjustingbinarysearchtrees.Splay treesre-adjustat every access,bringing the most recentlyaccesseditemto the root. While this providesexcellentperformancefor accesspatternsof the type�aaaaabbbbbba.. . � , splay treesperform particularly poorly with an accesspattern

like�abababababab.. . � . We quickly found that by using ‘sampling’ (the tree is re-

adjustedwith someprobability p), there-adjustingoverheaddroppedappreciablywithno appreciabledegradationto accessperformance,providing a substantialoverall per-formanceimprovement.This ideahasbeenindependentlydevelopedandanalyzedbyothers[Furer1999].

Othershave experimentedwith linear hashing10 which adaptsto the cardinalityof the key set.This was originally developedfor out-of-memorydata[Litwin 1980;Larson1980]andsubsequentlyadaptedto in-memorydata[Larson1988].Asidefromthequestionof key setcardinality, therealityof non-uniformhashfunctionsmeansthatasignificanttradeoff mustbemadebetweenspaceandtimefor any hashwhichtakestheconventionalapproachof linear-probingon its buckets.Accesstime canbeminimizedby reducingcollisionsthrougha broad(andwasteful)hashspace,or spaceefficiencycanbe maximizedat the costof relatively expensive linear searchesthroughbucketswhencollisionsdo occur. Significantly, theimpactof a accesslocality on thebehaviorof splaysandhashesappearsnot to have beenwell examinedin the literature,rather,performanceanalysestendto addressrandomaccesspatternsonly.

Theseobservationsleadus to the hash-splay, a simpledata-structurewhich com-binestheO � 1� accesstime of a hashwith theexcellentcachingpropertiesandrobust-nessto key setsof unknown cardinalitythatsplaytreesexhibit. Thehash-splayconsistsof ahashwherethehashtargetsaresplaytreesratherthanothermoretraditionalbucketstructures.11

The hash-splaydatastructurehasexcellentcachingproperties.The combinationof the re-adjustingpropertyof the splay treesand the O � 1� accesstime for the hashmeansthat for a working set of size w, averageaccesstimes for items in the set isapproximatelyO � log � w� k��� wherek representsthe hashbreadth.By contrast,for asimplesplay it is approximatelyO � logw� , and for a simple linear probinghashit isO � n� k� . If k is chosento be greaterthan w, anda suitablehashfunction is chosen,working setelementsshouldbe at the rootsof their respective buckets,yielding near-optimal accessfor theseitems.Significantly, k is selectedasa function of w, not n,allowing k to besubstantiallysmaller.

Platypussuccessfullyimplementsthe hash-splayto greateffect. We find that ourdatastructuresperformvery well for both hugetransactions(wheremany locks andobjectsareaccessed),aswell asfor smalltransactions(seesection6.2).

10 Not to beconfusedwith thestandardtechniqueof linear probing in hashtables.11 c.f. hash-trees[Agrawal andSrikant1994],whicharetreeswith hashesat eachnode.

Page 16: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

6 Performance Analysis of Platypus

We conducteda numberof experimentsto assessthe performanceof Platypus.Foreachof theexperimentswe useSHOREasa point of comparison.After describingtheexperimentalsetup,we comparethe overall performanceof the two stores.We thencomparethe performanceof eachstore’s logging system.The comparative readfaultperformanceof thetwo storeswasthenext experimentconducted.Finally we comparethescalabilityof thedatastructuresusedby eachstorefor readingandlocking objects.

6.1 Experimental Setup

All experimentswereconductedusingSolaris7 onanIntel machinewith dualCeleron433Mhz processors,512MB of memoryanda4GB harddisk.

Platypus The systemusedin this evaluationwasa first implementationof Platypus.The featuresit implementsinclude:a simpleobjectmanagerwhich featuresa directmappingbetweenlogical andphysicalIDs; a local visibility manager;both low levelandhighlevel logging;andanavailability manager. Thepagebuffer implementedin theavailability managersupportsswappingof storeandmeta-datapages,andimplementstheNO-FORCE/STEALrecoverypolicy [Franklin1997].Thefeaturesdescribedin sec-tions4 and5 haveall beenimplementedwith theexceptionof checksum-basedintegritychecks.Thesystemis SMPre-entrantandsupportsmultiple concurrentclient threadsper processaswell asmultiple concurrentclient processes.Limitations of the imple-mentationat the time of writing include: the failure-recovery processis untested(al-thoughlogging is fully implemented);theavailability managerdoesnot recyclevirtualaddressspace;thestorelacksa sophisticatedobjectmanagerwith supportfor garbagecollection;noGVM hasbeenimplemented,precludingdistributedstoreconfigurations.

SHORE SHOREis a persistentobjectsystemdesignedto serve the needsof a widevarietyof targetapplications,includingpersistentprogramminglanguages[Carey etal.1994].SHOREhasalayeredarchitecturethatallowsusersto choosethelevelof supportappropriatefor a particularapplication.The lowest layer of SHOREis the SHOREStorageManager(SSM),which providesbasicobjectreadingandwriting primitives.UsingtheSSMwe constructedPSI-SSM,a thin layerproviding PSI[Blackburn 1998]compliancefor SSM.By usingthePSIinterface,thesameOO7codecouldbeusedforboth PlatypusandSHORE.SSM comesasa library that is linked into an applicationeitherasa client stub(for client-serveroperation)or asa completesystem—welinkedit in to PSI-SSMasa completesystemto avoid the rpc overheadassociatedwith theclient-server configuration.We alsobypasstheSSM’s logical recordidentifiers,usingphysicalidentifiersinsteadin orderto avoid expensive logical to physicallookupsoneachobjectaccess.Thissimplemodeof objectidentificationcloselymatchesthesimpleobjectmanagercurrently implementedin Platypus.We perform in-placeupdatesandreadsof objects12 which further improvestheSHOREresultsreportedin this section.

12 SSMdoesnot supportin-placeupdates,sowhena write intentionis declaredwith respecttoanobject,PSI-SSMtakesabefore-imageof theobjectto enabletheundoof any updatesif the

Page 17: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

ThePSI-SSMlayeralsohastheeffect of cachinghandlesreturnedby SSMon a readrequest,reducingthenumberof heavy-weight readrequeststo SSMto oneper-objectper-transaction,assubsequentreadrequestsfor the sameobjectare trappedby PSI-SSM.PSI-SSMusesthe hash-splaydatastructureto implementits light-weight map.We usedSHOREversion2.0 for all of theexperimentsreportedin this paper.

OO7 TheOO7benchmarksuite[Carey et al. 1993]is a comprehensivetestof OODBperformance.OO7 operationsfall into threecategories,traversalswhich navigatetheobject graph,querieslike thosetypically found in a declarative query languageandstructuralmodificationswhich mutatethe objectgraphby insertinganddeletingob-jects.The OO7 schemais modeledaroundthe componentsof a CAD application.Ahierarchyof assemblyobjects(referredto asa module)sits at the top level of the hi-erarchy, while compositepartobjectslie at thebottomof theschema.Eachcompositepart objectin turn is linked to the root of a randomgraphof atomicpartobjects.TheOO7 databaseis configurable,with threesizesof databasecommonlyused—small,medium, and large. In our studywe usesmall (11,000objects)andmedium(101,000objects)databases.Theresultsof OO7benchmarkrunsareconsideredin termsof hotandcold results,correspondingto transactionsrun on warm andcold cachesrespec-tively. Operationsarerun in batchesandmayberun with manytransactionsperbatch(i.e., eachiterationincludesa transactioncommit),or asonetransaction(i.e., only thelastiterationcommits).All of theresultsreportedherewererunusingthemanyconfig-uration.For a detaileddiscussionof thebenchmarkschemaandoperations,thereaderis referredto the paperandsubsequenttechnicalreportdescribingOO7 [Carey et al.1993].We usethesameimplementationof theOO7benchmarkfor bothPlatypusandSHORE.It is written in C++ andhasexplicit readandwrite barriersembeddedin thecode.This implementationdoesnot dependon storesupportfor indexesandmaps,butinsteadimplementsall necessarymapsexplicitly.

While weacknowledgethelimitationsof usingany benchmarkasthebasisfor per-formanceanalysis,theoperationsincludedin theOO7suiteemphasizemany differentaspectsof storeperformance,sotheperformanceof a systemacrosstheentiresuiteofoperationscangive a reasonableindicationof the generalperformancecharacteristicsof thatsystem.

6.2 Results

Overall Performance Figure4 shows the relative performanceof Platypuswith re-spectto SHOREfor thetwenty-fouroperationsof theOO7benchmarkfor bothhotandcold transactionswith smallandmediumdatabases.Of the9213 results,Platypusper-formssignificantlyworsein only five5 (5.4%),theresultsaremarginalin 9 (9.8%),and

transactionis aborted.While this is safewith respectto aborts,a STEAL of a pageupdatedthis way will lead to storecorruption.This optimizationthereforecomesat the expenseofrecoverability in a context whereswapping(STEALing) is occurring.

13 Wewereunabletousethelargedatabaseconfiguration(andunabletogetresultsfor t3bandt3con themediumconfiguration)becauseof failuresin PSI-SSMand/orSSMoncethedatabasesizeexceededtheavailablephysicalmemory. This behavior maywell berelatedto our useofin-placeupdateswhich is unsafein thefaceof pageSTEAL.

Page 18: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

Platypusperformssignificantlybetterin 78 (84.8%).Thegeometricmeansof thetimeratiosfor thefour setsof resultsrangefrom 0.580to 0.436,correspondingto PlatypusofferingaverageperformanceimprovementsoverSHOREfrom 72%to 129%.Thehotresultsspendproportionatelymoretime in theapplicationandperformingsimpleOIDto virtual addresstranslations(correspondingto applicationreads).SincebothsystemsexecutethesameapplicationcodeandthePSI-SSMlayerperformsOID to virtual ad-dresstranslationsusinga hash-splay(saving SHOREfrom performingsuchlookups),thereis appreciablylessopportunityfor theadvantagesof PlatypusoverSHOREto beseenin thehot results.

Logging Performance Table1 outlinestheresultsof a comparisonof loggingperfor-mancebetweenPlatypusandSHORE.The resultsweregeneratedby re-runningtheOO7 benchmarksuiteusingboth systemswith logging turnedoff (i.e., without writeI/O at commit time),andthensubtractingtheseresultsfrom thosewith loggingturnedon.Thedifferencebetweentheresultsyieldsthelogging(write I/O) costfor eachsys-tem.ThetablerevealsthatPlatypusperformsloggingfarmoreefficiently thanSHORE,which maybeattributedto thezerocopy natureof mmap (usedto allocatememoryforlog recordsandusedto flush log recordsto disk in Platypus).The betterresultsforthesmalldatabasesuggeststhatloggingin Platypusis particularlylow latency (any la-tency effect will bediminishedin themediumdatabaseresults,wherebandwidthmaydominate).This resultalsosupportstheclaim thatthenew recoveryalgorithm(includ-ing incrementalchecksumcalculations)presentedin section4.2 is efficient in termsofloggingcosts.

Time RatioPlatypus/SHOREStoreSize Hot Cold

Small 0.290(0.0141) 0.396(0.0169)Medium 0.362(0.0225) 0.401(0.0181)

Table 1. Performancecomparisonof logging betweenPlatypusandSHORE,showing thegeo-metricmeansof ratiosof loggingtimesfor selectedoperationsfor eachsystem.Only timesfor thebenchmarkoperationswhereat least1% of executiontime wasspenton loggingwereincludedin the results(8 operationsfor mediumresultsand15 operationsfor small results).The resultsshown areaverageresultsfrom 3 repetitions.Thevaluesin brackets() depictthecoefficient ofvariationof thecorrespondingresult.

Read Fault Performance To comparehow well the two systemsperformreadI/O,we subtractedthehot time from thecold time for eachbenchmarkoperation.This dif-ferencerepresentsthecostof faultingpagesinto a cold cache,andany incidentalper-transactioncosts.Thegeometricmeansof theratiosbetweentheresultsfor eachstorearepresentedin table2. The useof mmap both reducesthe numberof pagescopiedduring pagefault to zeroandalso reducesmemoryconsumption.Reducingmemoryconsumptionhasthesecondaryeffectof deferringtheneedfor swappingandsofurtherreducingbothreadandwrite I/O.

Page 19: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

0.25

0.5

1

2

4

q1 q2 q3 q4 q5 q6 q7 q8 t1 t1w t2a t2b t2c t3a t3b t3c t4 t5d t5u t6 t7 t8 t9 wuTim

e ra

tio A

NU

/SH

OR

E (

log)

Query

(a) Execution time ratios for cold benchmarkruns over the OO7 smalldatabase(geometricmean= 0.495,with 0.0156coefficient of variation).

0.25

0.5

1

2

4

q1 q2 q3 q4 q5 q6 q7 q8 t1 t1w t2a t2b t2c t3a t3b t3c t4 t5d t5u t6 t7 t8 t9 wuTim

e ra

tio A

NU

/SH

OR

E (

log)

Query

(b) Executiontimeratiosfor hotbenchmarkrunsovertheOO7smalldatabase(geometricmean= 0.517,with 0.0141coefficient of variation).

0.25

0.5

1

2

4

q1 q2 q3 q4 q5 q6 q7 q8 t1 t1w t2a t2b t2c t3a t4 t5d t5u t6 t7 t8 t9 wuTim

e ra

tio A

NU

/SH

OR

E (

log)

Query

(c) Execution time ratios for cold benchmarkruns over the OO7 mediumdatabase(geometricmean= 0.436,with 0.0142coefficient of variation).

0.25

0.5

1

2

4

q1 q2 q3 q4 q5 q6 q7 q8 t1 t1w t2a t2b t2c t3a t4 t5d t5u t6 t7 t8 t9 wuTim

e ra

tio A

NU

/SH

OR

E (

log)

Query

(d) Execution time ratios for hot benchmarkruns over the OO7 mediumdatabase(geometricmean= 0.580,with 0.0176coefficient of variation).

Fig. 4. Executiontime comparisonof PlatypusandSHOREfor the OO7 benchmark.Eachim-pulserepresentstheratio of averageexecutiontime for Platypusandtheaverageexecutiontimefor SHOREfor a particularoperationin the OO7 suite.Averageresultsaregeneratedfrom 3repetitions.A resultlessthan1 correspondsto PlatypusexecutingfasterthanSHORE.

Page 20: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

StoreSizeTime RatioPlatypus/SHORE

Small 0.775(0.0156)Medium 0.338(0.0176)

Table 2. Performancecomparisonof readfault timebetweenPlatypusandSHORE,showing thegeometricmeansof ratiosof readfault times for eachsystem.The resultsshown areaverageresultsfrom 3 repetitions.The valuesin brackets () depict the coefficient of variation of thecorrespondingresult.

Data Structure Scalability This experimentexploresthescalabilityof mapandlock-ing datastructuresusedin PlatypusandSHORE.We did not usetheOO7benchmark,but constructeda trivial syntheticbenchmarkthat involvedno writesanddid not readany objecttwice.Theseconstraintsallowedusto bypassthecachingandbefore-imagecreationfunctionalityof thePSI-SSMlayer, andtherebygetadirectcomparisonof thePlatypusandSSMdatastructureperformance.Thesyntheticbenchmarkoperatesovera singly-linked list of 100,000small objects.14 Beforeeachtest,a single transactionthatreadsall of theobjectsis run.This is doneto bringall objectsinto memoryandsoremoveany I/O effectsfrom thetiming of thesubsequenttransactions.After theinitialtransaction,many transactionsof varioussizes(traversingdifferentnumbersof objects)arerun.To measurethe‘per-read’cost,we subtractfrom thetime for eachtransactionthe time for anempty(zeroreads)transaction,anddivide the resultby thenumberofobjectsread.Figure5 depictsacomparisonof thedatastructurescalabilityfor PlatypusandSHORE.As thenumberof objectsreadincreases,theamountby which PlatypusoutperformSHOREalsoincreases,indicatingthe datastructuresusedin Platypusaresubstantiallymorescalablethanthoseusedin SHORE.

7 Conclusions

Thepowerof abstractionoverpersistenceis slowly beingrecognizedby thedataman-agementcommunity. Platypusis anobjectstorewhich is specificallydesignedto meetthe needsof tightly coupledpersistentprogramminglanguageclients and to deliverthetransactionalsupportexpectedby conventionaldatabaseapplications.ThedesignofPlatypusembracesthe themeof abstraction,abstractingover storetopologythroughaflexible distribution architecture,andencapsulatingthe mechanismsby which the ob-ject storeabstractionis constructedfrom an untypedbyte store.In additionto a highlevel of flexibility , Platypusis extremelyfunctional.It is multi-threadedandcansupporta high degreeof concurrency, bothintraandinterprocessandintra andinter processoron an SMParchitecture.The implementationof Platypusresultedin the developmentof asimpleextensionto theARIESrecoveryalgorithmandasimplecombinationof ex-isting datastructures,bothof which performwell andshouldhave applicationin other

14 Theobjectscontainonly apointerto thenext object.Thesmallestpossibleobjectsizewasusedto ensurethatall objectswould fit eachstore’s cacheandsoavoid thepossibilityof swappingeffectsimpactingon thecomparison.

Page 21: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

0

10

20

30

40

50

60

0 20000 40000 60000 80000 100000

Cos

t Per

Obj

ect R

ead

(mic

ro s

ec)

Number of Objects Read

ANU-StoreSHORE

Fig. 5. DatastructurescalabilitycomparisonbetweenPlatypusandSHORE.The resultsshownareaverageresultsfrom 5 repetitions.The coefficient of variation for the resultsshown aboverangefrom 0.0667to 0.112.

objectstoreimplementations.Performanceanalysisshows thatPlatypusperformsex-tremelywell, suggestingthatthecombinationof orthodoxyandinnovationin its designandimplementationhasbeenasuccessfulstrategy.

Themostobviousareafor furtherwork is to completetheconstructionof Platypus.Thecompletedstorewould includea moresophisticatedobjectmanagerthatallows agarbagecollectorto beimplemented,afully testedfailure-recoveryprocessandaGVM(allowing for a distributedstoreconfiguration).Furtherexperimentsthatmay be con-ductedincludeassessingtheperformanceof the storewhenthenumberof concurrentclient processesareincreasedin a stand-aloneconfiguration,measuringthescalabilityof thestorein a distributedconfigurationanda comparativeperformanceevaluationofusingchecksumsfor recoveryversusLSN.

Acknowledgements

Theauthorswish to acknowledgethatthiswork wascarriedoutwithin theCooperativeResearchCenterfor AdvancedComputationalSystemsestablishedundertheAustralianGovernment’sCooperativeResearchCentersProgram.

Wewishto thankAndrew Tridgell andPaulMackerrasfor many helpfulsuggestionsandcommentsonthiswork, andwould like to particularlyacknowledgetheinspirationprovidedby their package‘rsync’ for thechecksum-basedrecovery algorithmusedbyPlatypus.

Page 22: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

Bibliography

[Agrawal andSrikant1994] AGRAWAL , R. AND SRIKANT, R. 1994. Fastalgorithmsfor mining associationrulesin large databases.In J. B. BOCCA , M. JARKE, AND

C. ZANIOLO Eds.,VLDB’94,Proceedingsof 20thInternationalConferenceonVeryLarge Data Bases,September12-15, 1994, Santiago de Chile, Chile (1994), pp.487–499.MorganKaufmann.

[Blackburn1998] BLACKBURN, S. M. 1998. PersistentStoreInterface:A foundationfor scalablepersistentsystemdesign. PhD thesis,AustralianNational University,Canberra,Australia.http://cs.anu.edu.au/˜Steve.Blackburn.

[BlackburnandStanton1998] BLACKBURN, S. M. AND STANTON, R. B. 1998. Thetransactionalobjectcache:A foundationfor highperformancepersistentsystemcon-struction.In R. MORRISON, M. JORDAN, AND M. ATKINSON Eds.,AdvancesinPersistentObjectSystems:Proceedingsof theEighthInternationalWorkshoponPer-sistentObjectSystems,Aug. 30–Sept.1, 1998,Tiburon, CA, U.S.A.(SanFrancisco,1998),pp.37–50.MorganKaufmann.

[Brown andMorrison1992] BROWN, A. AND MORRISON, R. 1992. A genericper-sistentobjectstore.SoftwareEngineeringJournal 7, 2, 161–168.

[Carey et al. 1994] CAREY, M. J., DEWITT, D. J., FRANKLIN, M. J., HALL , N. E.,MCAULIFFE, M. L., NAUGHTON, J. F., SCHUH, D. T., SOLOMON, M. H., TAN,C. K., TSATALOS, O. G., WHITE, S. J., AND ZWILLING, M. J. 1994. Shoringuppersistentapplications.In R. T. SNODGRASS AND M. WINSLETT Eds.,Proceed-ingsof the1994ACM SIGMODInternationalConferenceon Managementof Data,Minneapolis,Minnesota,May 24-27,1994, Volume23 of SIGMODRecord (June1994),pp.383–394.ACM Press.

[Carey et al. 1993] CAREY, M. J., DEWITT, D. J., AND NAUGHTON, J. F. 1993. The007 benchmark.In P. BUNEMAN AND S. JAJODIA Eds.,Proceedingsof the 1993ACM SIGMOD International Conferenceon Managementof Data, Washington,D.C.,May26-28,1993(1993),pp.12–21.ACM Press.

[Carey et al. 1988] CAREY, M. J., DEWITT, D. J., AND VANDENBERG, S. L. 1988.A datamodelandquerylanguagefor exodus.In H. BORAL AND P.-A . LARSON

Eds.,Proceedingsof the1988ACM SIGMODInternationalConferenceonManage-mentof Data,Chicago, Illinois, June1-3,1988(1988),pp.413–423.ACM Press.

[Deutsch1996] DEUTSCH, P. 1996. RCF 1950ZLIB compresseddataformat spec-ification version 3.3. Network Working Group Requestfor Comments:1950.http://www.faqs.org/rfcs/rfc1950.html.

[Fletcher1982] FLETCHER, J. 1982. An arithmeticchecksumfor serialtransmissions.IEEETransactionson Communications30, 1 (Jan.),247–253.

[Franklin1996] FRANKLIN, M. J. 1996. ClientDataCaching: A Foundationfor HighPerformanceObject DatabaseSystems, Volume 354 of The Kluwer InternationalSeriesin EngineeringandComputerScience. Kluwer AcademicPublishers,Boston,MA, U.S.A.Thisbookis anupdatedandextendedversionof Franklin’sPhDthesis.

Page 23: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

[Franklin1997] FRANKLIN, M. J. 1997. Concurrency controlandrecovery. In A. B.TUCKER Ed., TheComputerScienceand EngineeringHandbook, pp. 1058–1077.CRCPress.

[Furer1999] FURER, M. 1999. Randomizedsplaytrees.In Proceedingsof the tenthannualACM-SIAMSymposiumon DiscreteAlgorithms,17-19January1999,Balti-more, Maryland.(1999),pp.903–904.ACM/SIAM.

[HarderandReuter1983] HARDER, T. AND REUTER, A. 1983. Principles oftransaction-orienteddatabaserecovery. ACM ComputingSurveys15, 4 (Dec.),287–317.

[KemperandKossmann1994] KEMPER, A. AND KOSSMANN, D. 1994. Dual-bufferingstrategiesin objectbases.In J. B. BOCCA , M. JARKE, AND C. ZANIOLO

Eds.,VLDB’94, Proceedingsof 20th InternationalConferenceon Very Large DataBases,September12-15,1994,Santiago deChile, Chile (1994),pp.427–438.Mor-ganKaufmann.

[Kleimanet al. 1995] KLEIMAN, S., SMALDERS, B., AND SHAH, D. 1995. Pro-grammingwith Threads. PrenticeHall.

[Knuth 1997] KNUTH, D. E. 1997. TheArt of ComputerProgramming(seconded.),Volume3. AddisonWesley.

[Larson1980] LARSON, P.-A . 1980. Linearhashingwith partialexpansions.In SixthInternationalConferenceon Very Large Data Bases,October1-3, 1980,Montreal,Quebec,Canada,Proceedings(1980),pp.224–232.IEEEComputerSocietyPress.

[Larson1988] LARSON, P.-A . 1988. Dynamic hashtables.Communicationsof theACM 31, 4 (April), 446– 457.

[Liskov et al. 1999] L ISKOV, B., CASTRO, M., SHRIRA , L ., AND ADYA , A. 1999.Providing persistentobjectsin distributed systems.In R. GUERRAOUI Ed., EC-COP’99- Object-OrientedProgramming, 13thEuropeanConference, Lisbon,Por-tugal, June14-18,1999,Proceedings, Volume1628of Lecture Notesin ComputerScience(1999),pp.230–257.

[Litwin 1980] L ITWIN, W. 1980. Linear hashing:A new tool for file and table ad-dressing.In SixthInternationalConferenceonVeryLargeData Bases,October1-3,1980,Montreal, Quebec,Canada,Proceedings(1980),pp. 212–223.IEEE Com-puterSocietyPress.

[Mattheset al. 1996] MATTHES, F., M ULLER, R., AND SCHMIDT, J. W. 1996. To-wardsa unifiedmodelof untypedobjectstores:Experienceswith theTycoonStoreProtocol.In Advancesin DatabasesandInformationSystems(ADBIS’96),Proceed-ings of the Third InternationalWorkshopof the MoscowACM SIGMODChapter(1996).

[Mohan1990] MOHAN, C. 1990. Commit LSN: A novel andsimplemethodfor re-ducing locking and latching in transactionprocessingsystems.In D. MCLEOD,R. SACKS-DAVIS, AND H.-J. SCHEK Eds.,16thInternationalConferenceon VeryLarge Data Bases,August13-16,1990,Brisbane, Queensland,Australia, Proceed-ings(1990),pp.406–418.MorganKaufmann.

[Mohan1999] MOHAN, C. 1999. RepeatinghistorybeyondARIES.In M. P. ATKIN-SON, M. E. ORLOWSKA , P. VALDURIEZ, S. B. ZDONIK , AND M. L. BRODIE

Eds.,VLDB’99, Proceedingsof 25th InternationalConferenceon Very Large Data

Page 24: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

Bases,September7-10,1999,Edinburgh, Scotland,UK (1999),pp. 1–17.MorganKaufmann.

[Mohanet al. 1992] MOHAN, C., HADERLE, D. J., L INDSAY, B. G., PIRAHESH, H.,AND SCHWARZ, P. 1992. ARIES: A transactionrecovery methodsupportingfine-granularitylockingandpartialrollbacksusingwrite-aheadlogging.TODS17, 1,94–162.

[SleatorandTarjan1985] SLEATOR, D. D. AND TARJAN, R. E. 1985. Self-adjustingbinarysearchtrees.Journalof theACM 32, 3, 652–686.

[Tayloretal. 1998] TAYLOR, R., JANA , R., AND GRIGG, M. 1998. Check-sum testing of remote synchronsationtool. Technical Report DSTO-TR-0627(March), Defence Science and Technology Organisation,Canberra,Australia.http://www.dsto.defence.gov.au.

[Tridgell 1999] TRIDGELL , A. 1999. Efficient Algorithmsfor Sorting and Synchro-nization. PhDthesis,AustralianNationalUniversity.

[Tridgell andMackerras1998] TRIDGELL , A. AND MACKERRAS, P. 1998.The rsync algorithm. Technical report, Australian National University.http://rsync.samba.org.

[Vorugantiet al. 1999] VORUGANTI , K., OZSU, M. T., AND UNRAU , R. C. 1999.An adaptivehybridserverarchitecturefor clientcachingODBMSs.In M. P. ATKIN-SON, M. E. ORLOWSKA , P. VALDURIEZ, S. B. ZDONIK , AND M. L. BRODIE Eds.,VLDB’99,Proceedingsof 25thInternationalConferenceonVeryLargeDataBases,September7-10,1999,Edinburgh,Scotland,UK (1999),pp.150–161.MorganKauf-mann.

Page 25: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

Appendix: Calculation of Page Checksums Using Differences

Wenotethefollowing definitionof an‘Adler’ signaturer � x � k� correspondingto arangeof bytesx0 � x1 ��������� xk � 1, whereM is anarbitrarymodulus(usually216 for performancereasons)[Deutsch1996]:15

r1 � x � k ����� k � 1

∑i � 0

xi � modM

r2 � x � k ����� k � 1

∑i � 0

� k � i � xi � modM

r � x � k ��� r1 � x � k � � Mr2 � x � k�

Calculating Partial Signatures The partial checksum, p � x � k � s� e� , correspondingtothecontributionto r � x � k � madeby somesub-rangeof bytesxs � xs! 1 ��������� xe (0 " s # e "k � 1) is then:

p1 � x � s� e�$��� e

∑i � s

xi � modM

p2 � x � k � s� e�$��� e

∑i � s

� k � i � xi � modM

p � x � k � s� e�$� p1 � x � s� e�%� Mp2 � x � k � s� e�

Calculating Delta Signatures ConsidertheAdler signaturer � x � k� for arangeof bytesx0 � x1 ��������� xk � 1. If thebytesin therangexs ����� xe arechangedx�s ����� x�e (0 " s " e "&� k �1� ), thecomponentsr1 � x�'� k � andr2 � x�(� k� of thenew checksumr � x�(� k� arethen:

r1 � x� � k�)�*� r1 � x � k�+� p1 � x � s� e�%� p1 � x� � s� e��� modM

�*� r1 � x � k�+�,� e

∑i � s

xi � modM �-� e

∑i � s

x�i � modM � modM

�*� r1 � x � k�%�-� e

∑i � s

� x�i � xi ��� modM � modM

r2 � x� � k�)�*� r2 � x � k�+� p2 � x � k � s� e�%� p2 � x� � k � s� e��� modM

�*� r2 � x � k�+�,� e

∑i � s

� k � i � xi � modM �.� e

∑i � s

� k � i � x�i � modM � modM

�*� r2 � x � k�+�,� e

∑i � s

� k � i �/� x�i � xi ��� modM � modM

15 Adler-32 is a 32-bit extensionof theFletcherchecksum[Fletcher1982].

Page 26: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

If we further considera delta image, x suchthat xi � x�i � xi for the rangeof bytescorrespondingto xs ����� xe, andapartialchecksump � x � k � s� e� with respectto xs ����� xe, wenotethattheabovecanbefurthersimplified:

r1 � x� � k���*� r1 � x � k�%� p1 � x � s� e��� modM

r2 � x� � k���*� r2 � x � k�%� p2 � x � k � s� e��� modM

r � x� � k���*� r � x � k�%� p � x � k � s� e��� modM

Thus the checksum(signature)r � x�0� k� for somemodified pageof size k can becalculatedfrom its previous checksumr � x � k� , and the partial checksumof the deltaimagefor the region of the pagethatwaschanged,wherethe deltaimageis a simplebyte-wisesubtractionof old from new values.

Simplifying Signature Calculation A naive implementationof the signaturecalcu-lation would seea multiply in the inner loop. The following analysisshows how the(partial) signatureimplementationcanbedonewith just a load,anaddanda subtractin theinnerloop.

p2 � x � k � s� e�$�1� e

∑i � s

� k � i � xi � modM

�1� e� s

∑i � 0

� k � i � s� xs! i � modM

�1��� k � s� e� s

∑i � 0

xs! i � e� s

∑i � 0

ixs! i � modM

�1��� k � s� e� s

∑i � 0

xs! i � e� s

∑i � 0

i � 1

∑j � 0

xs! i � modM

Fromtheabove,we notethefollowing recurrencerelations:

fp � x � e� i ���23 4 fp � x � e� i � 1�%� xe� i if i 5 0

xe if i � 0undefined otherwise

f � x � e� i �6�23 4 f � x � e� i � 1�%� fp � x � e� i � 1� if i 5 0

0 if i � 0undefined otherwise

We thereforenotethefollowing expressionsfor p1, p2 andp:

p2 � x � k � s� e�6�*��� k � s� fp � x � e� e � s�+� f � x � e� e � s��� modM

p1 � x � s� e�6� fp � x � e� e � s� modM

p � x � k � s� e�6� fp � x � e� e � s� modM � M ����� k � s� fp � x � e� e � s�7� f � x � e� e � s��� modM �

Page 27: Platypus: Design and Implementation of a Flexible High Performance Object Store.users.cecs.anu.edu.au/~steveb/pubs/papers/platypus-pos9.pdf · 2020-02-09 · Platypus: Design and

which leadsusto thefollowing pseudo-code:

int signature(char *x, /* start of block */int L, /* block length */int s, /* start index of signature region */int e /* end index of signature region */)

{for (int i = e, p_1 = 0, p_2 = 0; i > s; i++) {

p_1 += x[i];p_2 -= p_1;

}p_1 += x[s];p_2 += (L-s) * p_1;p_1 &= (1<<16)-1; /* mod 2ˆ16 */p_2 &= (1<<16)-1; /* mod 2ˆ16 */return p_1 | (p_2 << 16);

}

Fig. 6. Pseudocodefor partialAdler-32signaturecalculation.