Consensus Services for Loosely-coupled Distributed Systems
ZooKeeper and Chubby
Motivation
• Distributed systems need a service that provides
  – Synchronization (who is the leader?)
  – Membership (who are the active nodes in my service?)
  – Configuration metadata (what is the environment info?)
• Distributed systems need this service to offer
  – reliability
  – availability
  – easy-to-understand semantics
  – performance, throughput, latency only secondary
Before Chubby Came About…
• Lots of distributed systems with clients in the 10,000s
• How to do primary election?
  – Ad hoc (no harm from duplicated work)
  – Operator intervention (correctness essential)
• Unprincipled
• Disorganized
• Costly
• Low availability
What is this paper about?
“Building Chubby was an engineering effort … it was not research. We claim no new algorithms or techniques. The purpose of this paper is to describe what we did and why, rather than to advocate it.”
• Design based on well-known ideas
  – distributed consensus, caching, notifications, file-system interface
Life Before Chubby
• Distributed systems developers…
  – Implement Raft (well, actually Paxos)
    • Application must be written as a state machine
    • Potential performance problems
      – A quorum of 5 is easier than a quorum of 10K nodes
  – Shared critical regions (exclusive locks)
    • Hard to code/understand
      – People think they can… but they can't!
Chubby Design
Design Decisions: Motivating Locks?
• Lock service vs. consensus (Raft/Paxos) library
• Advantages:
  – No need to rewrite code
    • Maintain program structure, communication patterns
    • Can support notification mechanism
  – Smaller # of nodes (servers) needed to make progress
• Advisory instead of mandatory locks (why?):
  – Holding a lock called F is neither necessary to access the file F, nor does it prevent other clients from doing so
Design Decisions: Lock Types
• Coarse vs. fine-grained locks
  – Fine-grained: grab lock before every event
  – Coarse-grained: grab lock for large group of events
• Advantages of coarse-grained locks
  – Less load on lock server
  – Less delay when lock server fails
  – Fewer lock servers and less availability required
• Fine-grained locks
  – More lock server load
  – If needed, could be implemented on the client side
System Structure
• Chubby cell: a small number of replicas (e.g., 5)
• Master is selected using a consensus protocol (Paxos in Chubby; Raft is a comparable protocol)
System Structure
• Clients
  – Send reads/writes only to the master
  – Communicate with the master via a Chubby library
• Every replica server
  – Is listed in DNS
  – Directs clients to the master
  – Maintains a copy of a simple database
Reads and Writes
• Write
  – Master propagates the write to the replicas
  – Replies after the write reaches a majority (i.e., a quorum)
• Read
  – Master replies directly, as it has the most up-to-date state
  – Reads must still go to the master
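The write path on this slide can be sketched in a few lines. This is a minimal in-memory illustration, not Chubby's implementation: the `Replica` and `Master` classes and their method names are hypothetical, and a real cell would only commit a write on the replicas through the consensus protocol.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical replica holding a copy of the simple database."""
    up: bool = True
    db: dict = field(default_factory=dict)

    def apply(self, key, value):
        # A replica that is down never acknowledges the write.
        if not self.up:
            return False
        self.db[key] = value
        return True


class Master:
    """Hypothetical master; all client reads and writes arrive here."""

    def __init__(self, replicas):
        self.db = {}
        self.replicas = replicas  # the other members of the cell

    def write(self, key, value):
        # Count the master itself, then the replicas that acknowledged.
        # Simplification: replicas apply the value before the outcome is known.
        acks = 1 + sum(r.apply(key, value) for r in self.replicas)
        cell_size = 1 + len(self.replicas)
        if acks * 2 <= cell_size:
            return False  # no majority, so the client gets no success reply
        self.db[key] = value
        return True

    def read(self, key):
        # Reads are answered from the master's own state.
        return self.db.get(key)
```

With a cell of five, a write succeeds when the master and any two replicas acknowledge it, and fails when only one replica is reachable.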
Chubby API and Locks
Simple UNIX-like File System Interface
• Bare-bones file & directory structure
• /ls/foo/wombat/pouch
  – ls: lock service; common to all names
  – foo: cell name
  – /wombat/pouch: name within cell
• Does not support, maintain, or reveal
  – Moving files
  – Path-dependent permission semantics
  – Directory modified times, file last-access times
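To make the three-part name concrete, here is a small sketch that splits a name such as /ls/foo/wombat/pouch into its cell and its name within the cell. The `ChubbyName` type and `parse_name` function are made up for this illustration; they are not part of any Chubby library.

```python
from typing import NamedTuple


class ChubbyName(NamedTuple):
    cell: str  # e.g. "foo"
    path: str  # name within the cell, e.g. "/wombat/pouch"


def parse_name(name: str) -> ChubbyName:
    """Split /ls/<cell>/<name within cell> into its parts."""
    parts = name.split("/")
    # parts[0] is empty because the name starts with "/";
    # parts[1] must be the common "ls" prefix; parts[2] is the cell.
    if len(parts) < 3 or parts[0] != "" or parts[1] != "ls" or not parts[2]:
        raise ValueError(f"not a Chubby-style name: {name!r}")
    return ChubbyName(cell=parts[2], path="/" + "/".join(parts[3:]))
```

For the example on the slide, `parse_name("/ls/foo/wombat/pouch")` yields the cell `foo` and the path `/wombat/pouch`.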
Nodes
• Node: a file or directory
  – Any node can act as an advisory reader/writer lock
• A node may be either permanent or ephemeral
  – Ephemeral nodes are used as temporary files, e.g., to indicate that a client is alive
• Metadata
  – Three names of ACLs (R/W/change ACL name)
    • Authentication built into RPC
  – 64-bit file content checksum
Locks
• Any node can act as a lock (shared or exclusive)
• Advisory (vs. mandatory)
  – Protect resources at remote services
  – No value in extra guards by mandatory locks
• Write permission needed to acquire
  – Prevents an unprivileged reader from blocking progress
Locks and Sequences
• Potential lock problems in distributed systems
  – A holds a lock L, issues request W, then fails
  – B acquires L (because A fails), performs actions
  – W arrives (out-of-order) after B's actions
• Solution 1: backward compatible
  – Lock server will prevent other clients from getting the lock if a lock becomes inaccessible or the holder has failed
  – Lock-delay period can be specified by clients
• Solution 2: sequencer
  – A lock holder can obtain a sequencer from Chubby
  – It attaches the sequencer to any requests that it sends to other servers
  – The other servers can verify the sequencer information
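The A/B/W scenario and the sequencer fix can be replayed with a toy model. This sketch assumes the sequencer is just a generation number that grows on every acquisition; the `LockService` and `FileServer` classes are hypothetical stand-ins, not Chubby's interfaces.

```python
class LockService:
    """Hypothetical lock service handing out one lock with a generation number."""

    def __init__(self):
        self.holder = None
        self.generation = 0

    def acquire(self, client):
        if self.holder is not None:
            return None
        self.holder = client
        self.generation += 1
        return self.generation  # acts as the sequencer for this acquisition

    def holder_failed(self):
        self.holder = None

    def check(self, sequencer):
        # A sequencer is valid only for the current acquisition of the lock.
        return self.holder is not None and sequencer == self.generation


class FileServer:
    """Hypothetical downstream server that verifies sequencers."""

    def __init__(self, locks):
        self.locks = locks
        self.log = []

    def handle(self, request, sequencer):
        if not self.locks.check(sequencer):
            return False  # stale request from a previous lock holder
        self.log.append(request)
        return True
```

If A acquires the lock, fails, and B acquires it next, a late request W carrying A's sequencer is rejected while B's requests are accepted.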
Design: Events
• Client subscribes when creating a handle
• Delivered async via up-call from the client library
• Event types
  – File contents modified
  – Child node added/removed/modified
  – Chubby master failed over
  – Handle/lock have become invalid
  – Lock acquired / conflicting lock request (rarely used)
Design: API
• Open() (only call using named node)
  – how handle will be used (access checks here)
  – events to subscribe to
  – lock-delay
  – whether new file/dir should be created
• Close() vs. Poison()
• Other ops:
  – GetContentsAndStat(), SetContents(), Delete(), Acquire(), TryAcquire(), Release(), GetSequencer(), SetSequencer(), CheckSequencer()
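As an illustration of how these calls combine, the sketch below runs a primary election against an in-memory stand-in. `FakeCell` is invented for this example and only mimics the spirit of Open(), TryAcquire(), SetContents() and GetContentsAndStat(); the real calls take more parameters and return richer results.

```python
class FakeCell:
    """In-memory stand-in for a cell; NOT the real Chubby client library."""

    def __init__(self):
        self.contents = {}
        self.lock_holders = {}

    def open(self, name):
        # Create the node if needed; the "handle" is just the name here.
        self.contents.setdefault(name, b"")
        return name

    def try_acquire(self, handle, client):
        # Non-blocking: succeeds only if the lock is free or already ours.
        return self.lock_holders.setdefault(handle, client) == client

    def set_contents(self, handle, data):
        self.contents[handle] = data

    def get_contents(self, handle):
        return self.contents[handle]


def elect(cell, name, candidates):
    """Every candidate tries the lock; the winner advertises itself in the file."""
    handle = cell.open(name)
    for candidate in candidates:
        if cell.try_acquire(handle, candidate):
            cell.set_contents(handle, candidate.encode())
    # Everyone, winner or not, learns the primary by reading the file.
    return cell.get_contents(handle).decode()
```

Calling `elect(FakeCell(), "/ls/foo/primary", ["a", "b", "c"])` returns `"a"`, since the first candidate takes the lock and the others fail to acquire it.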
Chubby: Providing Performance
Why is scaling important?
• Clients connect to a single instance of the master in a cell
  – Many more client processes than machines
  – Note: master and client machines have the same CPU/memory
• Existing mechanisms:
  – Partition: more Chubby cells (consistent hashing)
  – Increase lease time: from 12s to 60s to reduce KeepAlive messages (dominant requests in experiments)
  – Client caches: reduce reads/KeepAlives, not writes
  – Add a new type of server: proxy servers
Caching
• Client caches file data, node meta-data
  – Write-through, held in memory
• Master keeps a list of what clients may have cached
• Strict consistency: easy to understand
  – Lease based
  – Master will invalidate cached copies upon a write request
  – Do not want to alter preexisting comm. protocols
• Handles and locks cached as well
  – Event informs client of conflicting lock request
Caching and Invalidation
  – Writes block, master sends invalidations
  – Clients flush changed data, ack. with KeepAlive
  – Data uncachable until invalidation acked
    • Allows reads to happen without delay
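The invalidation protocol above can be sketched with a master that remembers which clients may hold a cached copy. This is a single-process simplification under assumed names (`Client`, `CachingMaster`): acknowledgements are modeled as return values rather than KeepAlive replies, and leases are left out.

```python
class Client:
    """Hypothetical client with an in-memory cache."""

    def __init__(self, name):
        self.name = name
        self.cache = {}

    def invalidate(self, key):
        # Flush the entry; the return value models the acknowledgement.
        self.cache.pop(key, None)
        return True


class CachingMaster:
    """Hypothetical master that tracks possible cachers per key."""

    def __init__(self):
        self.data = {}
        self.cachers = {}  # key -> set of clients that may have it cached

    def read(self, client, key):
        value = self.data.get(key)
        client.cache[key] = value
        self.cachers.setdefault(key, set()).add(client)
        return value

    def write(self, key, value):
        # The write is held back until every possible cacher has acknowledged.
        for client in self.cachers.pop(key, set()):
            if not client.invalidate(key):
                raise RuntimeError(f"{client.name} did not acknowledge")
        self.data[key] = value
```

After a write, a client that had cached the old value no longer holds it and fetches the new value on its next read.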
New Mechanisms
• Proxies:
  – Handle KeepAlive and read requests, pass write requests to the master
  – Reduce traffic but reduce availability
• Partitioning: partition the namespace
  – A master handles nodes with hash(name) mod N == id
  – Limited cross-partition messages
Scaling: Proxies
• Proxies pass requests from clients to the cell
  – A layer of middle management between cell and clients
• Can handle KeepAlives and reads, NOT WRITES
  – Writes are much less than 1% of the workload
• KeepAlive traffic is by far the most dominant
• Disadvantages:
  – additional RPC for writes / first-time reads
  – increased unavailability probability
  – fail-over strategy not ideal (will come back to this)
Scaling: Partitioning
• Namespace partitioned between servers
  – N partitions, each with master and replicas
• Node D/C stored on P(D/C) = hash(D) mod N
  – meta-data for D may be on a different partition
• Little cross-partition comm. desirable
  – permission checks
  – directory deletion
  – caching helps mitigate this
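The placement rule P(D/C) = hash(D) mod N is easy to try out. The sketch below assumes SHA-256 as the hash, chosen only because Python's built-in `hash()` is randomized per process; the slides do not say which hash function is used, and `partition_of` is a name made up here.

```python
import hashlib
import posixpath


def partition_of(node: str, n: int) -> int:
    """Return the partition for node D/C, computed from its directory D."""
    directory = posixpath.dirname(node)  # D in D/C
    digest = hashlib.sha256(directory.encode()).digest()
    # Interpret the first 8 bytes as an integer, then reduce mod N.
    return int.from_bytes(digest[:8], "big") % n
```

All children of one directory land on the same partition, while the directory's own meta-data is placed by its parent's hash and may live elsewhere, which is why permission checks and directory deletion can need cross-partition communication.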
Chubby: Providing Availability
Sessions and Keep-Alives
• A client sends keep-alive requests to a master
• The master responds with a keep-alive response
• Immediately after getting the keep-alive response, the client sends another request for extension
• The master will block keep-alives until close to the expiration of a session
• The default extension is 12s
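A sketch of the keep-alive timing, using simulated timestamps instead of real clocks. The 12s extension comes from the slide; the `Session` class and the one-second `margin` before expiry at which the master replies are assumptions made for this example.

```python
DEFAULT_EXTENSION = 12.0  # seconds, the default from the slide


class Session:
    """Hypothetical master-side view of one client session."""

    def __init__(self, now):
        self.expires_at = now + DEFAULT_EXTENSION

    def keep_alive(self, now, margin=1.0):
        """Return (reply_time, new_expiry) for a KeepAlive received at `now`.

        `margin` is an assumed value: how close to expiry the master replies.
        """
        if now >= self.expires_at:
            raise TimeoutError("session already expired")
        # The master holds the request until shortly before expiry...
        reply_time = max(now, self.expires_at - margin)
        # ...then extends the lease starting from the moment it replies.
        self.expires_at = reply_time + DEFAULT_EXTENSION
        return reply_time, self.expires_at
```

Because the client re-sends immediately after each reply, one request is almost always parked at the master, and each session costs roughly one round trip per extension period.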
When things fail…
• Failure == missing KeepAlive msgs
• Server
  – Delete ephemeral files w/o open handles after an interval
  – Delete associated cache
• Client: discard in-memory state
  – Sessions, handles, locks, cached data…
Locks and Asynch Failures
• Asynch failure: delayed, reordered, or lost messages
• Two solutions
  – Sequence #s: get Lock and a Sequence-ID (seq-#)
    • Submit requests with Lock-Seq-ID
    • Leader checks Lock-Seq-ID to make sure the request is current
    • Lock-Seq-ID is incremented when a new lock is acquired
  – Lock Delay: delay giving out new locks after the lock holder dies
    • Outstanding requests should arrive within that time period
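The lock-delay option can be sketched with explicit timestamps. `DelayedLock` is a hypothetical class for illustration; the delay value passed in is whatever the client specified, and a normal release is assumed to skip the delay.

```python
class DelayedLock:
    """Hypothetical lock that stays unavailable for a while after its holder dies."""

    def __init__(self, lock_delay):
        self.lock_delay = lock_delay
        self.holder = None
        self.free_at = 0.0  # earliest time a new holder may acquire

    def try_acquire(self, client, now):
        if self.holder is not None or now < self.free_at:
            return False
        self.holder = client
        return True

    def release(self):
        # A normal release makes the lock available immediately.
        self.holder = None

    def holder_died(self, now):
        # Keep the lock closed so in-flight requests from the dead
        # holder can drain before anyone else acts under the lock.
        self.holder = None
        self.free_at = now + self.lock_delay
```

With a 60-second delay and a holder that dies at t=10, an acquisition attempt at t=30 fails and one at t=70 succeeds.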
Chubby: Practical, war stories
Use and Observations
• Many files for naming
• Config, ACL, meta-data common
• 10 clients use each cached file, on avg.
• Few locks held, no shared locks
• KeepAlives dominate RPC traffic
Use: Outages
• Sample of cells
  – 61 outages over a few weeks (700 cell-days)
  – due to network congestion, maintenance, overload, errors in software, hardware, operators
• 52 outages under 30s
  – applications not significantly affected
• Few dozen cell-years of operation
  – data lost on 6 occasions (bugs & operator error)
Today
• Google's Chubby
  – Motivation
  – Design choices
  – Scaling/Performance
  – Availability
• Yahoo's ZooKeeper (now Apache ZooKeeper)
  – Differences with Chubby
• Summary
ZooKeeper = Chubby without locks
• Filesystem-based API
  – Similar types: ephemeral, persistent, sequential
  – Different API calls
• Performance
  – Caching and watches (ZK's invalidation)
    • Similarly one-shot!
• Availability
  – KeepAlive, leader election
File Types

                              Chubby   ZooKeeper
File types   Ephemeral        Yes      Yes
             Permanent        Yes      Yes
             Sequential       No       Yes
File API     Hierarchical     Yes      Yes
             Predefined path  Yes      No

[Figure: example ZooKeeper namespace tree rooted at /, with nodes services, users, apps, locks, servers, YaView, s-1, morestupidity, stupidname. Callouts: "Ephemerals created by Session X"; "Sequence appended on create".]
• Ephemeral: the znode will be deleted when the session that created it times out or it is explicitly deleted
• Permanent: explicitly deleted by client
• Sequence: the path name will have a monotonically increasing counter, relative to the parent, appended
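The three node types can be imitated with a small in-memory tree. This is a sketch, not the ZooKeeper client API: the `Tree` class and its methods are hypothetical. The one detail borrowed from ZooKeeper is that the appended counter is zero-padded to ten digits.

```python
class Tree:
    """Hypothetical in-memory namespace with ephemeral and sequential nodes."""

    def __init__(self):
        self.nodes = {}     # path -> owning session (None if permanent)
        self.counters = {}  # parent path -> next sequence number

    def create(self, path, session=None, sequential=False):
        if sequential:
            # The counter is kept per parent and appended to the given name.
            parent = path.rsplit("/", 1)[0] or "/"
            n = self.counters.get(parent, 0)
            self.counters[parent] = n + 1
            path = f"{path}{n:010d}"
        self.nodes[path] = session
        return path

    def session_expired(self, session):
        # Ephemeral nodes disappear with the session that created them.
        if session is None:
            return
        for path in [p for p, s in self.nodes.items() if s == session]:
            del self.nodes[path]
```

Creating `/locks/s-` twice as a sequential node yields `/locks/s-0000000000` and `/locks/s-0000000001`; when the owning session expires, both vanish while permanent nodes stay.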
Filesystem API
• /ls/foo/wombat/pouch
  – ls: lock service; common to all names
  – foo: cell name
  – /wombat/pouch: name within cell
[Figure: example namespace tree rooted at /, with nodes services, users, apps, locks, servers, YaView, read-1, morestupidity, stupidname.]
Read/Write Interactions
• Writes
  – All go through the leader
  – Require a quorum
• Reads
  – ZK: go to any node
    • Higher performance
  – Chubby: go to the leader
    • Performance limitation
  – Both: clients cache and the server invalidates the cache
    • Chubby: invalidation is non-optional
    • ZK: must explicitly register for invalidation requests
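The "explicitly register, one-shot" behaviour on the ZooKeeper side can be sketched as follows. `WatchedStore` is a made-up in-memory class, not the ZooKeeper client; it only shows that a watch set during a read fires once on the next change and must be set again to hear about later ones.

```python
class WatchedStore:
    """Hypothetical store with one-shot watches registered on reads."""

    def __init__(self):
        self.data = {}
        self.watches = {}  # key -> list of callbacks waiting for a change

    def get(self, key, watch=None):
        # Registration is optional and explicit: pass a callback to opt in.
        if watch is not None:
            self.watches.setdefault(key, []).append(watch)
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value
        # Fire each registered watch once, then forget it.
        for callback in self.watches.pop(key, []):
            callback(key)
```

A client that reads with a watch and then sees two updates is notified only about the first, so it typically re-reads with a new watch inside the callback.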