Cost-Effective Architectural Support for Rollback Recovery ...

ReVive:Cost-Effective Architectural Support for

Rollback Recoveryyin Shared-Memory Multiprocessors

Milos Prvulovic, Zheng Zhang*, Josep Torrellas

University of Illinois at Urbana-Champaign*Hewlett-Packard Laboratoriesw c d b s

Motivation

Availability & Reliability increasingly important

Frequency ↑, Feature Size ↓ ⇒ Errors ↑

Complexity ↑ Verification Cost ↑ ⇒ Errors ↑Complexity ↑, Verification Cost ↑ ⇒ Errors ↑

Multiprocessors ⇒ Errors ↑

Global software-only recovery too slow

Can hardware help?Can hardware help?

Prvulovic et al. ReVive: Cost Effective Rollback Recovery 2

Motivation

Cost vs. Performance vs. Availability

Low Cost– Simple changes to a few key componentsSimple changes to a few key components

Low Performance Overhead– Handle frequent operations in hardware

High Availability– Fast recovery from a wide class of errors


Contribution: New Scheme

Low Cost– HW changes only to directory controllers

– Memory overhead only 12.5% (with 7+1 parity)

Low Performance Overhead– Only 6% performance overhead on average– Only 6% performance overhead on average

High Availability– Recovery from: system-wide transients, loss of one node

– Availability better than 99.999% (assuming 1 error/ day)


Overview of ReVive

Entire main memory protected by distributed parity– Like RAID-5, but in memory

Periodically establish a checkpointod c y s b s c c po– Main memory is the checkpoint state

Writ b k dirt d t fr h pr r t t– Write-back dirty data from caches, save processor context

Save overwritten data to enable restoring checkpoint– When program execution modifies memory for 1st time


Distributed N+1 ParityNode 0 Node 1 Node N

Parity Data DataParity Group

. . .Distributed tominimize contention

Allocation Granularity: page

dUpdate Granularity: cache line


Distributed Parity Update in HW

WB

ParPar Ack

Di

WB Line X

DiDi OR

XORDir

M

Wr

Dir

M

Dir

M

Rd Rd Wr

XOR

Mem MemMem

Home of Line X Home ofparity for Line X


parity for Line X

ReVive: Checkpoint Creation Timeline

Checkpoint (<1ms for 2MB L2) Execute (100 ms)

<5μs ~400μs/MB<1μs ~20μsP0

P1

TimeP2

P3

Write-BackSave CPU

Timer Interrupt Sync

P3


Context

Logging in HW

RdExcl Li X

Data

Dir

Line XNote:Wr Log also updates the parity

Mem

Wr LogRd Line X

Home of Line X


Log Filtering

Add L bit to directory entry of each liney y– Clear all L bits on each checkpoint

Set when logged– Set when logged

– Do not log if already set

Not needed for correctness– Can be only in directory cache

– Can be completely omitted


Classes of Recoverable Errors

CPU

Caches

Dir CtrlRevive Hardware

Mem Ctrl

DRAM

Net Ctrl InterconnectionNetwork

DRAM

(Trans + perm) errors in 1 nodeCan recover fromModule


Trans errors in N nodes( p )

Permanent Node Loss: Recovery

Unavailable (~840ms) Degraded

Detection Repair Log Repair DataExecuteP0

~100ms ~490ms ~20s80ms100msP1

Time

Rollback

P2

P3

Time

B !

HWRollback

Checkpoint

P3


Bzzzt!

Evaluation Setup

Splash-2 benchmarksp

16 superscalar processors (6-issue at 1GHz)

16kB L1 cache, 512kB L2 cache

2-D torus network, virtual cut-through routing, g g

100MHz DDR SDRAM

Using 7+1 distributed parity

Checkpoint interval: 10ms and infinite


Performance Overhead

20%

25% L2 Misses/1000 Instructions

5

15%

20%

me

Incr

ease Cp10ms 7+1

CpInf 7+1

Cp10ms 1+1

56

9Ckp+Log

5%

10%

Exec

. Tim Cp 0 s

CpInf 1+1

Par

0%

Barnes

holes

ky FFTFMM LUOce

anad

iosity

Radix

aytra

ceVolr

end

ater-N

2ate

r-Sp

Averag

e

Par

B Cho O

Rad Ray VoW

at Wat Ave

Tolerable 6% performance overhead


Worst-Case Recovery Time

60

70

30

40

50

seco

nds Reconstruct (Log)

Reconstruct (Mem)

Undo

10

20

30

Mill

is Undo

0

Barnes

Cholesky FFT

FMM LUOce

anRad

iosity

Radix

Raytra

ceVolre

ndWate

r-N2

Water-S

pAve

rage

Radix: 590ms + 180ms + 50ms = 820ms99 999% il bili

C R R W W

Redo Work HW Repair


⇒ 99.999% availability

Network Traffic

Bytes/Instr

1.2

PARCkp WBExe WB

0.6

Exe WBRD/RDX

0

Barn

es

oles

ky

FFT

FMM LU

Oce

an

dios

ity

Rad

ix

ytra

ce

Volre

nd

ter-

N2

ter-

Sp

Mea

n

B

Cho O

Rad Ray V

Wat

Wat H


Memory Traffic

Bytes/Instr

2.0 PARLOGCkp WB

1.0

Ckp WBExe WBRD/RDX

0

Barn

es

oles

ky

FFT

FMM LU

Oce

an

dios

ity

Rad

ix

aytra

ce

Volre

nd

ater

-N2

ater

-Sp

H M

ean

B

Ch

Ra Ra V

Wa

Wa H


Related Work

Device- or problem-specific schemesp p– DIVA, Redundant Multithreading, Slipstream, ECC, etc.

ReVive can handle errors that escape these schemes– ReVive can handle errors that escape these schemes,improving overall availability at low additional cost

Other system recovery schemesOther system-recovery schemes– Plank et al. - N+1 parity in software

– Masubuchi et al. - logging with bus-snooper

– SafetyNet


Related Work: SafetyNet

Types of recoverable errors– ReVive: Permanent (loss of a node)+Transient

– SafetyNet: Transient; perm only w/ redundant devices

HW modifications– ReVive: Directory controller onlyReVive: Directory controller only

– SafetyNet: Memory, caches, coherence protocol

Performance Overhead– 6% with ReVive, negligible with SafetyNet


Conclusions

Recovery from: system-wide transients loss of 1 nodeRecovery from: system wide transients, loss of 1 node

Availability better than 99.999%

Low performance overhead (6% on average)

HW changes only to directory controllersg y y

Memory overhead 12.5% with 7+1 parity

O h d b d d b i i i– Overhead can be reduced by increasing parity groups


ReVive:Cost-Effective Architectural Support for

Rollback Recoveryyin Shared-Memory Multiprocessors

Milos Prvulovic, Zheng Zhang, Josep Torrellas

http://[email protected] v v @cs c d

Rollback Recovery in Multiprocessors

Checkpoint Consistencyp y– Global, Local Coordinated or Local Uncoordinated

Ch k i S iCheckpoint Separation– Full or Partial

– Partial can be with Logging, Renaming or Buffering

Checkpoint Storagep g– Safe External, Safe Internal or for a Specialized Error Class


Checkpoint Consistency

GlobalSynchronization is fast enough on h d hiGlobal

– All synchronize to make a single consistent checkpoint

shared-memory machines

Local Coordinated– Synchronize as needed for a set of consistent checkpointsy p

Local UncoordinatedD t h i– Do not synchronize

– Set of consistent checkpoints computed when recovering


Checkpoint Storage

Safe E ternal (e g RAID) Not fast eno ghSafe External (e.g. RAID)– Recovery data on redundancy protected-disk

Not fast enough

Safe Internal (e.g. DRAM)– Recovery data in redundancy-protected memoryy y p y

Unsafe InternalR d t t t t d b d d

Not general enough

– Recovery data not protected by redundancy

– Assumes memory content survives errors


Checkpoint Separation

Full Too much storage needed

– Checkpoint and working data sets do not intersect

Partial with Buffering Commit atomicity, overheadg– Buffer non-checkpoint data, flush to commit

Partial with Renaming

y, d

Complex HW or coarse grainPartial with Renaming– Rename to avoid overwriting checkpoint data

P i l i h L i

Complex HW or coarse grain

Partial with Logging– Save overwritten checkpoint data in a log


Log & Parity Update Races

Error while log update in progress– Must fully perform log update before starting overwrite

Error while parity update in progresso w p y pd p og ss– Assume a single node fails

C r r ith r ld r t t– Can recover either old or new content

– Both result in consistent recovery (see paper)

Long error detection latency– Keep sufficient logs to recover far enough into the past


Availability vs Overhead

If checkpoint interval too shortp– Lost work and hardware self-check dominate recovery

Fault free execution performance suffers– Fault-free execution performance suffers

If checkpoint interval too long– Low availability

Find a good balanceg– Checkpoint intervals of 100ms to 1s


Analysis

Cache size vs. checkpoint intervalp– 512kB caches with checkpoints every 10ms

5MB caches with checkpoints every 100ms– 5MB caches with checkpoints every 100ms

Log size vs. checkpoint interval– Log will grow in sub-linear proportion to interval size

– 10ms: <3MB per node, only two apps >128kB per node

Parity overhead: 12.5% of system memory is parity


Cost-Effective Architectural Support for Rollback Recovery ...

Documents

Transcript of Cost-Effective Architectural Support for Rollback Recovery ...