Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... ·...

30
Transparent Checkpoint of Closed Distributed Systems in Emulab Anton Burtsev , Prashanth Radhakrishnan, Mike Hibler, and Jay Lepreau University of Utah, School of CompuEng

Transcript of Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... ·...

Page 1: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

TransparentCheckpointofClosedDistributedSystemsin

EmulabAntonBurtsev,PrashanthRadhakrishnan,

MikeHibler,andJayLepreau

UniversityofUtah,SchoolofCompuEng

Page 2: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

Emulab

•  PublictestbedfornetworkexperimentaEon

2

•  Complexnetworkingexperimentswithinminutes

Page 3: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

Emulab—preciseresearchtool

•  Realism:–  Realdedicatedhardware•  Machinesandnetworks

–  RealoperaEngsystems–  FreedomtoconfigureanycomponentofthesoNwarestack

– Meaningfulreal‐worldresults•  Control:

–  Closedsystem•  Controlledexternaldependenciesandsideeffects

–  Controlinterface–  Repeatable,directedexperimentaEon

3

Page 4: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

Goal:morecontroloverexecuEon

•  Statefulswap‐out–  Demandforphysicalresourcesexceedscapacity–  PreempEveexperimentscheduling•  Long‐running•  Large‐scaleexperiments

–  Nolossofexperimentstate

•  Time‐travel–  Replayexperiments•  DeterminisEcallyornon‐determinisEcally

–  Debuggingandanalysisaid

4

Page 5: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

Challenge

•  BothcontrolsshouldpreservefidelityofexperimentaEon

•  Bothrelyontransparencyofdistributedcheckpoint

5

Page 6: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

Transparentcheckpoint•  TradiEonally,semanEctransparency:

–  CheckpointedexecuEonisoneofthepossiblecorrectexecuEons

•  Whatifwewanttopreserveperformancecorrectness?–  CheckpointedexecuEonisoneofthecorrectexecuEonsclosesttoanon‐checkpointedrun

•  Preservemeasurableparametersofthesystem–  CPUallocaEon–  ElapsedEme–  Diskthroughput–  Networkdelayandbandwidth

6

Page 7: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

TradiEonalview

•  Localcase–  Transparency=smallestpossibledownEme–  Severalmilliseconds[Remus]–  Backgroundwork–  Harmsrealism

•  Distributedcase–  Lamportcheckpoint•  Providesconsistency

–  Packetdelays,Emeouts,trafficbursts,replaybufferoverflows

7

Page 8: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

Maininsight

•  Concealcheckpointfromthesystemundertest–  ButsEllstayontherealhardwareasmuchaspossible

•  “Instantly”freezethesystem–  TimeandexecuEon–  Ensureatomicityofcheckpoint•  Singlenon‐divisibleacEon

•  ConcealcheckpointbyEmevirtualizaEon

8

Page 9: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

ContribuEons

•  Transparencyofdistributedcheckpoint•  Localatomicity

–  Temporalfirewall

•  ExecuEoncontrolmechanismsforEmulab–  Statefulswap‐out–  Time‐travel

•  Branchingstorage

9

Page 10: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

ChallengesandimplementaEon

10

Page 11: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

CheckpointessenEals•  StateencapsulaEon

–  SuspendexecuEon–  Saverunningstateofthe

system•  VirtualizaEonlayer

11

Page 12: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

CheckpointessenEals•  StateencapsulaEon

–  SuspendexecuEon–  Saverunningstateofthe

system•  VirtualizaEonlayer

–  Suspendsthesystem–  Savesitsstate–  Savesin‐flightstate–  Disconnects/reconnectsto

thehardware

12

Page 13: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

Firstchallenge:atomicity•  PermanentencapsulaEonis

harmful–  Tooslow–  Somestateisshared

•  Encapsulateduponcheckpoint

•  ExternallytoVM–  FullmemoryvirtualizaEon–  NeedsdeclaraEvedescripEon

ofsharedstate

•  InternallytoVM–  Breaksatomicity

13

?

Page 14: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

Atomicityinthelocalcase

•  Temporalfirewall–  SelecEvelysuspendsexecuEonandEme

–  Providesatomicityinsidethefirewall

•  ExecuEoncontrolintheLinuxkernel–  Kernelthreads–  Interrupts,excepEons,IRQs

•  Concealscheckpoint–  TimevirtualizaEon

14

Page 15: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

Secondchallenge:synchronizaEon

???$%#!Timeout

•  Lamportcheckpoint–  NosynchronizaEon–  SystemisparEally

suspended

•  Preservesconsistency–  Logsin‐flightpackets

•  Onceloggedit’simpossibletoremove

•  Unsuspendednodes–  Time‐outs

15

Page 16: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

Synchronizedcheckpoint

•  Synchronizeclocksacrossthesystem

•  Schedulecheckpoint

•  Checkpointallnodesatonce

•  Almostnoin‐flightpackets

16

Page 17: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

Bandwidth‐delayproduct•  Largenumberofin‐

flightpackets

•  Slowlinksdominatethelog

•  FasterlinkswaitfortheenErelogtocomplete

•  Per‐pathreplay?–  UnavailableatLayer2–  Accuratereplayengineoneverynode

17

Page 18: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

Checkpointthenetworkcore•  LeverageEmulabdelay

nodes–  Emulablinksareno‐delay–  LinkemulaEondoneby

delaynodes

•  Avoidreplayofin‐flightpackets

•  Captureallin‐flightpacketsincore–  Checkpointdelaynodes

18

Page 19: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

Efficientbranchingstorage

•  TobepracEcalstatefulswap‐outhastobefast

•  Mostlyread‐onlyFS–  Sharedacrossnodesandexperiments

•  Deltasaccumulateacrossswap‐outs

•  BasedonLVM– ManyopEmizaEons

19

Page 20: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

EvaluaEon

Page 21: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

EvaluaEonplan

•  Transparencyofthecheckpoint•  Measurablemetrics

–  TimevirtualizaEon–  CPUallocaEon–  Networkparameters

21

Page 22: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

TimevirtualizaEon

22

Checkpointevery5sec(24checkpoints)

Timeraccuracyis28μsec

Checkpointadds±80μsecerror

do{usleep(10ms)germeofday()}while()

sleep+overhead=20ms

Page 23: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

CPUallocaEon

23

Checkpointevery5sec(29checkpoints)

do{stress_cpu()germeofday()}while()

stress+overhead=236.6ms

Normallywithin9msofaverage

Checkpointadds27mserror

ls/root–7msoverheadxmlist–130ms

Page 24: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

Networktransparency:iperf

24

NoTCPwindowchangeNopacketdrops

ThroughputdropisduetobackgroundacEvity

Averageinter‐packetEme:18μsecCheckpointadds:330‐‐5801μsec

Checkpointevery5sec(4checkpoints)

‐1Gbps,0delaynetwork,‐ iperfbetweentwoVMs‐ tcpdumpinsideoneofVMs‐ averagingover0.5ms

Page 25: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

Networktransparency:BitTorrent

25

100Mbps,lowdelay1BTserver+3clients3GBfile

Checkpointpreservesaveragethroughput

Checkpointevery5sec(20checkpoints)

Page 26: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

Conclusions

•  Transparentdistributedcheckpoint–  Preciseresearchtool–  Fidelityofdistributedsystemanalysis

•  Temporalfirewall–  GeneralmechanismtochangepercepEonofEmeforthesystem

–  Concealvariousexternalevents

•  FutureworkisEme‐travel

26

Page 27: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

Thankyou

[email protected]

Page 28: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

Backup

28

Page 29: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

Branchingstorage

29

•  Copy‐on‐writeasaredolog•  Linearaddressing•  FreeblockeliminaEon•  ReadbeforewriteeliminaEon

Page 30: Transparent Checkpoints of Closed Distributed Systems in Emulababurtsev/doc/Eurosys09... · 2016-08-16 · • Transparent distributed checkpoint – Precise research tool – Fidelity

Branchingstorage

30