1 Fault Tolerant Systems Checkpointing and Rollback Recovery Protocols.
Cost-Effective Architectural Support for Rollback Recovery ...
Transcript of Cost-Effective Architectural Support for Rollback Recovery ...
ReVive:Cost-Effective Architectural Support for
Rollback Recoveryyin Shared-Memory Multiprocessors
Milos Prvulovic, Zheng Zhang*, Josep Torrellas
University of Illinois at Urbana-Champaign*Hewlett-Packard Laboratoriesw c d b s
Motivation
Availability & Reliability increasingly important
Frequency ↑, Feature Size ↓ ⇒ Errors ↑
Complexity ↑ Verification Cost ↑ ⇒ Errors ↑Complexity ↑, Verification Cost ↑ ⇒ Errors ↑
Multiprocessors ⇒ Errors ↑
Global software-only recovery too slow
Can hardware help?Can hardware help?
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 2
Motivation
Cost vs. Performance vs. Availability
Low Cost– Simple changes to a few key componentsSimple changes to a few key components
Low Performance Overhead– Handle frequent operations in hardware
High Availability– Fast recovery from a wide class of errors
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 3
Contribution: New Scheme
Low Cost– HW changes only to directory controllers
– Memory overhead only 12.5% (with 7+1 parity)
Low Performance Overhead– Only 6% performance overhead on average– Only 6% performance overhead on average
High Availability– Recovery from: system-wide transients, loss of one node
– Availability better than 99.999% (assuming 1 error/ day)
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 4
Overview of ReVive
Entire main memory protected by distributed parity– Like RAID-5, but in memory
Periodically establish a checkpointod c y s b s c c po– Main memory is the checkpoint state
Writ b k dirt d t fr h pr r t t– Write-back dirty data from caches, save processor context
Save overwritten data to enable restoring checkpoint– When program execution modifies memory for 1st time
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 5
Distributed N+1 ParityNode 0 Node 1 Node N
Parity Data DataParity Group
. . .Distributed tominimize contention
Allocation Granularity: page
dUpdate Granularity: cache line
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 6
Distributed Parity Update in HW
WB
ParPar Ack
Di
WB Line X
DiDi OR
XORDir
M
Wr
Dir
M
Dir
M
Rd Rd Wr
XOR
Mem MemMem
Home of Line X Home ofparity for Line X
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 7
parity for Line X
ReVive: Checkpoint Creation Timeline
Checkpoint (<1ms for 2MB L2) Execute (100 ms)
<5μs ~400μs/MB<1μs ~20μsP0
P1
TimeP2
P3
Write-BackSave CPU
Timer Interrupt Sync
P3
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 8
Context
Logging in HW
RdExcl Li X
Data
Dir
Line XNote:Wr Log also updates the parity
Mem
Wr LogRd Line X
Home of Line X
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 9
Log Filtering
Add L bit to directory entry of each liney y– Clear all L bits on each checkpoint
Set when logged– Set when logged
– Do not log if already set
Not needed for correctness– Can be only in directory cache
– Can be completely omitted
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 10
Classes of Recoverable Errors
CPU
Caches
Dir CtrlRevive Hardware
Mem Ctrl
DRAM
Net Ctrl InterconnectionNetwork
DRAM
(Trans + perm) errors in 1 nodeCan recover fromModule
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 11
Trans errors in N nodes( p )
Permanent Node Loss: Recovery
Unavailable (~840ms) Degraded
Detection Repair Log Repair DataExecuteP0
~100ms ~490ms ~20s80ms100msP1
Time
Rollback
P2
P3
Time
B !
HWRollback
Checkpoint
P3
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 12
Bzzzt!
Evaluation Setup
Splash-2 benchmarksp
16 superscalar processors (6-issue at 1GHz)
16kB L1 cache, 512kB L2 cache
2-D torus network, virtual cut-through routing, g g
100MHz DDR SDRAM
Using 7+1 distributed parity
Checkpoint interval: 10ms and infinite
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 13
Performance Overhead
20%
25% L2 Misses/1000 Instructions
5
15%
20%
me
Incr
ease Cp10ms 7+1
CpInf 7+1
Cp10ms 1+1
56
9Ckp+Log
5%
10%
Exec
. Tim Cp 0 s
CpInf 1+1
Par
0%
Barnes
holes
ky FFTFMM LUOce
anad
iosity
Radix
aytra
ceVolr
end
ater-N
2ate
r-Sp
Averag
e
Par
B Cho O
Rad Ray VoW
at Wat Ave
Tolerable 6% performance overhead
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 14
Worst-Case Recovery Time
60
70
30
40
50
seco
nds Reconstruct (Log)
Reconstruct (Mem)
Undo
10
20
30
Mill
is Undo
0
Barnes
Cholesky FFT
FMM LUOce
anRad
iosity
Radix
Raytra
ceVolre
ndWate
r-N2
Water-S
pAve
rage
Radix: 590ms + 180ms + 50ms = 820ms99 999% il bili
C R R W W
Redo Work HW Repair
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 15
⇒ 99.999% availability
Network Traffic
Bytes/Instr
1.2
PARCkp WBExe WB
0.6
Exe WBRD/RDX
0
Barn
es
oles
ky
FFT
FMM LU
Oce
an
dios
ity
Rad
ix
ytra
ce
Volre
nd
ter-
N2
ter-
Sp
Mea
n
B
Cho O
Rad Ray V
Wat
Wat H
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 16
Memory Traffic
Bytes/Instr
2.0 PARLOGCkp WB
1.0
Ckp WBExe WBRD/RDX
0
Barn
es
oles
ky
FFT
FMM LU
Oce
an
dios
ity
Rad
ix
aytra
ce
Volre
nd
ater
-N2
ater
-Sp
H M
ean
B
Ch
Ra Ra V
Wa
Wa H
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 17
Related Work
Device- or problem-specific schemesp p– DIVA, Redundant Multithreading, Slipstream, ECC, etc.
ReVive can handle errors that escape these schemes– ReVive can handle errors that escape these schemes,improving overall availability at low additional cost
Other system recovery schemesOther system-recovery schemes– Plank et al. - N+1 parity in software
– Masubuchi et al. - logging with bus-snooper
– SafetyNet
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 18
Related Work: SafetyNet
Types of recoverable errors– ReVive: Permanent (loss of a node)+Transient
– SafetyNet: Transient; perm only w/ redundant devices
HW modifications– ReVive: Directory controller onlyReVive: Directory controller only
– SafetyNet: Memory, caches, coherence protocol
Performance Overhead– 6% with ReVive, negligible with SafetyNet
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 19
Conclusions
Recovery from: system-wide transients loss of 1 nodeRecovery from: system wide transients, loss of 1 node
Availability better than 99.999%
Low performance overhead (6% on average)
HW changes only to directory controllersg y y
Memory overhead 12.5% with 7+1 parity
O h d b d d b i i i– Overhead can be reduced by increasing parity groups
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 20
ReVive:Cost-Effective Architectural Support for
Rollback Recoveryyin Shared-Memory Multiprocessors
Milos Prvulovic, Zheng Zhang, Josep Torrellas
http://[email protected] v v @cs c d
Rollback Recovery in Multiprocessors
Checkpoint Consistencyp y– Global, Local Coordinated or Local Uncoordinated
Ch k i S iCheckpoint Separation– Full or Partial
– Partial can be with Logging, Renaming or Buffering
Checkpoint Storagep g– Safe External, Safe Internal or for a Specialized Error Class
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 22
Checkpoint Consistency
GlobalSynchronization is fast enough on h d hiGlobal
– All synchronize to make a single consistent checkpoint
shared-memory machines
Local Coordinated– Synchronize as needed for a set of consistent checkpointsy p
Local UncoordinatedD t h i– Do not synchronize
– Set of consistent checkpoints computed when recovering
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 23
Checkpoint Storage
Safe E ternal (e g RAID) Not fast eno ghSafe External (e.g. RAID)– Recovery data on redundancy protected-disk
Not fast enough
Safe Internal (e.g. DRAM)– Recovery data in redundancy-protected memoryy y p y
Unsafe InternalR d t t t t d b d d
Not general enough
– Recovery data not protected by redundancy
– Assumes memory content survives errors
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 24
Checkpoint Separation
Full Too much storage needed
– Checkpoint and working data sets do not intersect
Partial with Buffering Commit atomicity, overheadg– Buffer non-checkpoint data, flush to commit
Partial with Renaming
y, d
Complex HW or coarse grainPartial with Renaming– Rename to avoid overwriting checkpoint data
P i l i h L i
Complex HW or coarse grain
Partial with Logging– Save overwritten checkpoint data in a log
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 25
Log & Parity Update Races
Error while log update in progress– Must fully perform log update before starting overwrite
Error while parity update in progresso w p y pd p og ss– Assume a single node fails
C r r ith r ld r t t– Can recover either old or new content
– Both result in consistent recovery (see paper)
Long error detection latency– Keep sufficient logs to recover far enough into the past
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 26
Availability vs Overhead
If checkpoint interval too shortp– Lost work and hardware self-check dominate recovery
Fault free execution performance suffers– Fault-free execution performance suffers
If checkpoint interval too long– Low availability
Find a good balanceg– Checkpoint intervals of 100ms to 1s
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 27
Analysis
Cache size vs. checkpoint intervalp– 512kB caches with checkpoints every 10ms
5MB caches with checkpoints every 100ms– 5MB caches with checkpoints every 100ms
Log size vs. checkpoint interval– Log will grow in sub-linear proportion to interval size
– 10ms: <3MB per node, only two apps >128kB per node
Parity overhead: 12.5% of system memory is parity
Prvulovic et al. ReVive: Cost Effective Rollback Recovery 28