Distributed Systems - Synergy Labs · 2019. 10. 6.
-
Distributed Systems
15-440/15-640 – Fall 2019
12 – Distributed Replication
-
Fault Tolerance Techniques So Far?
• Redundancy: information / time / physical redundancy
  • E.g., used in airplanes
• Recovery: checkpointing and logging (ARIES)
  • E.g., used in commercial databases
• Previous (concurrency) protocols rely on recovery techniques
  • E.g., Two-Phase Commit is not fault tolerant by itself
• Why not always use these techniques? → Long wait in case of failure
-
Our Goal Today: Stay Up During Failures
• Provide a service
• Replicate the machines that serve clients
• Survive the failure of up to f replicas
• Provide identical service to a non-replicated version
  • (except more reliable, and perhaps different performance)
-
Outline for Today
Consistency when content is replicated
Primary-backup replication model
Consensus replication model
-
Simple Examples of Replication
• Replicated websites
  • e.g., Yahoo! or Amazon:
  • DNS-based load balancing (DNS returns multiple IP addresses for each name)
  • Hardware load balancers put multiple machines behind each IP address
• When is replication easy? When hard?
  • Workload assumptions
-
Read-only content
• Easy to replicate - just make multiple copies of it.
• Performance boost: Get to use multiple servers to handle the load
• Perf boost 2: Locality. We'll see this later when we discuss CDNs; can often direct a client to a replica near it
• Availability boost: Can fail over (done at both DNS level -- slower, because clients cache DNS answers -- and at front-end hardware level)
-
But Read-write Data...
• Requires write replication, and some degree of consistency
• Strict Consistency
  • Read always returns value from latest write
• Sequential Consistency
  • All nodes see operations in some sequential order
  • Operations of each process appear in-order in this sequence
-
Sequential Consistency (1)
• Behavior of two processes operating on the same data item. The horizontal axis is time.
• P1: Writes value `a' to variable "x"
• P2: Reads `NIL' from "x" first and then `a'
Adapted from: Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved. 0-13-239227-5
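The example above can be checked mechanically: an execution is sequentially consistent if some interleaving that preserves each process's program order explains every read. A brute-force sketch (the helper name `explains` is my own; this is only feasible for tiny histories):

```python
from itertools import permutations

def explains(history):
    """True if some total order respecting each process's program order
    makes every read return the latest write (None = initial NIL)."""
    ops = [(p, i, op) for p, seq in history.items() for i, op in enumerate(seq)]
    for perm in permutations(ops):
        # program order: op i of process p must follow op i-1
        seen = {}
        ok = True
        for p, i, _ in perm:
            if seen.get(p, -1) != i - 1:
                ok = False
                break
            seen[p] = i
        if not ok:
            continue
        # replay: every read must see the latest write in this order
        store = {}
        for _, _, (kind, var, val) in perm:
            if kind == "W":
                store[var] = val
            elif store.get(var) != val:
                break
        else:
            return True
    return False

# P1 writes a to x; P2 reads NIL then a — sequentially consistent
h = {"P1": [("W", "x", "a")],
     "P2": [("R", "x", None), ("R", "x", "a")]}
assert explains(h)

# reading a and then NIL has no explaining order
h2 = {"P1": [("W", "x", "a")],
      "P2": [("R", "x", "a"), ("R", "x", None)]}
assert not explains(h2)
```

Both outcomes shown on the slide (P2 reading NIL then `a', or `a' then `a') have an explaining interleaving, which is why both are sequentially consistent.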
-
Sequential Consistency (2)
(a) A sequentially consistent data store.
(b) A data store that is not sequentially consistent.
-
But Read-write Data...
• Requires write replication, and some degree of consistency
• Strict Consistency
  • Read always returns value from latest write
• Sequential Consistency
  • All nodes see operations in some sequential order
  • Operations of each process appear in-order in this sequence
• Causal Consistency
  • All nodes see potentially causally related writes in same order
  • But concurrent writes may be seen in different order on different machines
-
Causal Consistency (1)
This sequence is allowed with a causally-consistent store, but not with a sequentially consistent store.
-
Causal Consistency (2)
A violation of a causally-consistent store.
(W(x)a is causally related to R(x)a, W(x)b.)
-
But Read-write Data...
• Requires write replication, and some degree of consistency
• Strict Consistency
  • Read always returns value from latest write
• Sequential Consistency
  • All nodes see operations in some sequential order
  • Operations of each process appear in-order in this sequence
• Causal Consistency
  • All nodes see causally related writes in same order
  • But concurrent writes may be seen in different order on different machines
• Eventual Consistency
  • All nodes will eventually learn about all writes, in the absence of updates
-
Example of Consistency Guarantees
• In practice we often have a choice
• Google Mail
  • Sending mail is replicated to ~2 physically separated datacenters (users hate it when they think they sent mail and it got lost); mail will pause while doing this replication.
  • Q: How long would this take with 2-phase commit? In the wide area?
  • Marking mail read is only replicated in the background - you can mark it read, the replication can fail, and you'll have no clue (re-reading a read email once in a while is no big deal)
• Weaker consistency is cheaper if you can get away with it.
-
Replication Strategies
What to replicate: State versus Operations
• Propagate only a notification of an update
  • Sort of an "invalidation" protocol
• Transfer data from one copy to another
  • Read-to-write ratio high, can propagate logs (saves bandwidth)
• Propagate the update operation to other copies
  • Don't transfer data modifications, only operations – "Active replication"
When to replicate: Push vs Pull
• Pull-based
  • Replicas/clients poll for updates (caches)
• Push-based
  • Server pushes updates (stateful)
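The pull-based option above can be sketched as a replica that asks the server for anything newer than its own version (all class names here are made up for illustration; a real system would poll over the network on a timer):

```python
class Server:
    """Holds the authoritative copy, tagged with a version number."""
    def __init__(self):
        self.version = 0
        self.value = None

    def write(self, value):
        self.value = value
        self.version += 1

    def poll(self, since):
        # pull-based: only return data if it is newer than the caller's copy
        if self.version > since:
            return self.version, self.value
        return since, None

class Replica:
    """A cache that refreshes itself by polling the server."""
    def __init__(self, server):
        self.server = server
        self.version = 0
        self.value = None

    def refresh(self):
        v, val = self.server.poll(self.version)
        if val is not None:
            self.version, self.value = v, val

server = Server()
replica = Replica(server)
server.write("a")
replica.refresh()
assert replica.value == "a"
```

Between polls the replica can serve stale data, which is exactly the consistency trade-off the slide's push-vs-pull choice is about.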
-
Outline for Today
Consistency when content is replicated
Primary-backup replication model
Consensus replication model
-
Assumptions Today
• Group membership manager
  • Allows replica nodes to join/leave
• Fail-stop (not Byzantine) failure model
  • Servers might crash, might come up again
  • Delayed/lost messages
• Failure detector
  • E.g., process-pair monitoring, etc.
-
Primary-Backup: Remote-Write Protocol
• Writes always go to primary, read from any backup
• Implementation
  • Stream the log
• Common in practice
  • Simple
• Are updates blocking?
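The remote-write scheme can be sketched in a few lines: the primary applies a write, streams the log entry to every backup, and replies to the client only after all backups ack. This is an in-memory sketch with made-up class names; a real implementation streams over the network and handles backup failures:

```python
class Backup:
    def __init__(self):
        self.store = {}
        self.log = []

    def apply(self, entry):
        # replay the streamed log entry, then acknowledge
        key, value = entry
        self.log.append(entry)
        self.store[key] = value
        return "ACK"

class Primary:
    def __init__(self, backups):
        self.store = {}
        self.backups = backups

    def write(self, key, value):
        # blocking update: reply only after every backup acks the log entry
        self.store[key] = value
        for b in self.backups:
            assert b.apply((key, value)) == "ACK"
        return "OK"

backups = [Backup(), Backup()]
primary = Primary(backups)
primary.write("x", "a")
# reads can go to any backup
assert backups[0].store["x"] == "a"
```

The `for` loop answers the slide's question: with this blocking variant, an update stalls until every backup has acknowledged, which is what makes a failed backup so painful.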
-
Local-Write P-B Protocol
• Primary migrates to the process wanting to process the update
• For performance, use non-blocking op.
• What does this scheme remind you of?
-
Primary-Backup Properties
• This looks cool. How many failures can we deal with? What are some problems?
• What do we do if a replica has failed?
  • We wait... how long? Until it's marked dead.
• Advantage: With N servers, can tolerate loss of N-1 copies
• Not a great solution if you want very tight response time even when something has failed: Must wait for failure detector
• Note: If you don't care about strong consistency (e.g., the "mail read" flag), you can reply to the client before reaching agreement with backups (sometimes called "asynchronous replication").
-
Outline for Today
Consistency when content is replicated
Primary-backup replication model
Consensus replication model
-
Quorum-Based Consensus
• Designed to have fast response time even under failures
• Operates as long as a majority of machines is still alive
• No master, per se
• To handle f failures, must have 2f+1 replicas
• Also, replicated-write ⇒ write to all replicas, not just one
• Usually boils down to Paxos [Lamport]
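The 2f+1 sizing can be sanity-checked directly: a majority quorum has f+1 members, so it survives f failures, and any two quorums share at least one replica, which is what carries a decision forward (a quick arithmetic check, not part of the original slides):

```python
def quorum(n):
    # a majority of n replicas
    return n // 2 + 1

for f in range(1, 6):
    n = 2 * f + 1               # replicas needed to handle f failures
    q = quorum(n)
    assert q == f + 1           # a quorum survives f failures
    assert 2 * q > n            # any two quorums must intersect
```

The intersection property (2q > n) is the reason a later round cannot quietly choose a different value: it must contact at least one replica that saw the earlier decision.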
-
The Paxos Approach
Decompose the problem:
• Basic Paxos ("single decree"):
  • One or more servers propose values
  • System must agree on a single value as chosen
  • Only one value is ever chosen
• Multi-Paxos:
  • Combine several instances of Basic Paxos to agree on a series of values forming the log
Some slides adapted from: John Ousterhout & Diego Ongaro, Stanford University. Implementing Replicated Logs with Paxos. 2013.
-
Requirements for Basic Paxos
• Correctness (safety):
  • Only a single value may be chosen
  • A machine never learns that a value has been chosen unless it really has been
  • The agreed value X has been proposed by some node
• Liveness (termination):
  • Some proposed value is eventually chosen
  • If a value is chosen, servers eventually learn about it
• Fault-tolerance:
  • If fewer than N/2 nodes fail, the rest should eventually reach agreement w.h.p.
  • Liveness is not guaranteed
-
[FLP'85] Impossibility Result
• Synchronous DS: bounded amount of time a node can take to process and respond to a request
• Asynchronous DS: timeout is not perfect
Fischer-Lynch-Paterson result:
It is impossible for a set of processors in an asynchronous system to agree on a binary value, even if only a single processor is subject to an unannounced failure.
-
Paxos Components
• Proposers:
  • Active: put forth particular values to be chosen
  • Handle client requests
• Acceptors:
  • Passive: respond to messages from proposers
  • Responses represent votes that form consensus
  • Store chosen value, state of the decision process
• For this presentation:
  • Each Paxos server contains both components
  • Ignore the third role, aka Learner
• "Round": (proposal, messages/voting, decision)
  • We may need several rounds
[Figure: three servers, each containing a Proposer and an Acceptor]
-
Strawman: Basic Two-Phase
• Coordinator tells replicas: "Value V"
• Replicas ACK
• Coordinator broadcasts "Commit!"
• This isn't enough
  • What if there's more than 1 coordinator at the same time?
  • What if a new coordinator chooses a different value?
  • What if some of the nodes or the coordinator fails during the communication?
  • What if there is a network partition?
-
Let's Discuss Some Problems & Solutions
• Problem: can't trust a single node
  • Solution: everyone can potentially propose
• Problem: several concurrent proposers
  • Solution: Quorum (require a majority of acceptors)
• Problem: split votes, no proposer reaches a majority
  • Solution: acceptors need to allow updating of their value
• Problem: conflicting choices (due to updating)
  • Solution a): prioritize the proposal with the highest unique timestamp (Lamport clocks)
  • Solution b): once a majority has agreed on a value, future proposals are forced to propose/choose the same value
-
Single Decree Paxos: Informal Description
• Phase 1: Prepare message
  • Find out about any chosen values
  • Block older proposals that have not yet completed
• Phase 2: Accept message
  • Ask acceptors to accept a specific value
• (Phase 3): Proposer decides
  • If majority again: chosen value, commit.
  • If no majority: delay and restart Paxos
[Figure: message flow between Proposers and Acceptors – Prepare; acceptors check and return; proposer waits for a majority; Accept; acceptors check again and return; proposer waits for a majority; Decision]
-
Single Decree Paxos: Protocol

Proposers:
1) Choose new proposal number n, value v
2) Broadcast Prepare(n) to all servers
4) When responses received from majority:
  • If any acceptedValues returned, v = acceptedValue of highest acceptedProposal
5) Broadcast Accept(n, v) to all servers
7) When Accept-OK from majority: value is chosen (commit). Else restart: go to 1, with larger number n

Acceptors:
3) Respond to Prepare(n):
  • If n > minProposal then minProposal = n; reply Prepare-OK(acceptedProposal, acceptedValue)
  • Else reply Prepare-REJECT()
6) Respond to Accept(n, value):
  • If n ≥ minProposal then acceptedProposal = minProposal = n; acceptedValue = value; reply Accept-OK()
  • Else reply Accept-REJECT()

Acceptors must record minProposal, acceptedProposal, and acceptedValue on stable storage (disk)
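The numbered steps can be exercised as a single-process simulation: direct method calls stand in for messages, and persistence and failures are ignored. The class and function names below are my own, not from the slides:

```python
class Acceptor:
    def __init__(self):
        self.min_proposal = 0
        self.accepted_proposal = 0
        self.accepted_value = None   # real acceptors keep all three on disk

    def prepare(self, n):
        # respond to Prepare(n): promise to ignore proposals below n
        if n > self.min_proposal:
            self.min_proposal = n
            return ("OK", self.accepted_proposal, self.accepted_value)
        return ("REJECT", 0, None)

    def accept(self, n, value):
        # respond to Accept(n, value): accept unless a newer Prepare was seen
        if n >= self.min_proposal:
            self.min_proposal = self.accepted_proposal = n
            self.accepted_value = value
            return "OK"
        return "REJECT"


def propose(acceptors, n, v):
    """One round for proposal number n with initial value v; returns the
    chosen value, or None if no majority (caller retries with larger n)."""
    majority = len(acceptors) // 2 + 1
    # broadcast Prepare(n)
    oks = [(p, val) for st, p, val in (a.prepare(n) for a in acceptors)
           if st == "OK"]
    if len(oks) < majority:
        return None
    # if any value was already accepted, adopt the one with the
    # highest accepted proposal number
    already = [(p, val) for p, val in oks if val is not None]
    if already:
        v = max(already)[1]
    # broadcast Accept(n, v); chosen on a majority of Accept-OK
    acks = sum(1 for a in acceptors if a.accept(n, v) == "OK")
    return v if acks >= majority else None


acceptors = [Acceptor() for _ in range(5)]
assert propose(acceptors, 1, "A") == "A"
# a later proposer with a different value is forced to the chosen one
assert propose(acceptors, 2, "B") == "A"
```

The second assertion is the key invariant in action: once a majority has accepted "A", any higher-numbered proposal learns about it in phase 1 and must propose "A" itself.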
-
Paxos Examples
a) Successful Round with a Single Proposer
b) Dueling Proposers
-
Some Remarks
• Only the proposer knows the chosen value (majority accepted)
• Only a single value is chosen → Multi-Paxos
• No guarantee that the proposer's original value v is chosen by itself
• Number n is basically a Lamport clock → always unique n
• Key invariant:
  • If a proposal with value `v' is chosen, all higher proposals must have value `v'
• Dueling proposers
  • Resolved using number n in prepare
• There are challenging corner cases
-
Paxos is widespread!
• Industry and academia
  • Google: Chubby (distributed lock service)
  • Yahoo: ZooKeeper (distributed lock service)
  • MSR: Frangipani (distributed lock service)
• Open source implementations
  • Libpaxos (Paxos-based atomic broadcast)
  • ZooKeeper is open source, integrated w/ Hadoop
Paxos slides adapted from Jinyang Li, NYU
-
Paxos History
It took 25 years to come up with a safe protocol
• 2PC proposed in 1979 (Gray)
• In 1981, Stonebraker proposed a basic, unsafe 3PC
• In 1988, Brian Oki and Barbara Liskov created Viewstamped Replication, which has the core protocol.
• In 1998, Lamport rediscovered it and explained the protocol formally, naming it Paxos
• In 2001, "Paxos Made Simple"
• In ~2014, Raft appears, presenting the Viewstamped Replication approach to Paxos as a cleanly isolated protocol.
-
More Remarks
• Paxos is painful to get right, particularly the corner cases. Start from a good implementation if you can. See Yahoo's "ZooKeeper" as a starting point.
• There are lots of optimizations to make the common (no or few failures) case go faster; if you find yourself implementing, research these.
• Paxos is expensive. It is usually used for critical, smaller bits of data and to coordinate cheaper replication techniques such as primary-backup for big bulk data.
-
Beyond PAXOS
• Many followups and variants
• RAFT consensus algorithm
  • https://raft.github.io/
• Great visualization of how it works
  • http://thesecretlivesofdata.com/raft/
-
Summary
• Primary-backup
  • Writes handled by primary, stream log to backup(s)
  • Replicas are "passive", follow the primary
  • Good: Simple protocol. With N machines, can handle N-1 failures
  • Bad: Slow response times in case of failures.
• Quorum consensus
  • Designed to have fast response time even under failures
  • Replicas are "active" - participate in the protocol; there is no master, per se.
  • Good: Clients don't even see the failures
  • Bad: More complex (corner cases). To handle f failures, must have 2f+1 replicas.